Hi,
I've noticed that tasks from searches like Gamma-ray Pulsar and Multi-Directional Gravitational Wave were starting over from the beginning every time I restarted my computer. But it works fine with Binary Radio Pulsar (CPU and GPU).
After forcing checkpoint_debug I can see that only Binary Radio Pulsar work units are saved:
27/02/2023 20:59:05 | Einstein@Home | [checkpoint] result guppi_57763_GBT820-01233_0003_0001_dms_3752_2056_0 checkpointed
27/02/2023 20:59:05 | Einstein@Home | [checkpoint] result Ter5_3_cfbf00037_segment_16_dms_400_13200_340_1050000_0 checkpointed
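In case anyone wants to reproduce this, I believe the flag can be set with something like the following in cc_config.xml in the BOINC data directory (then "Read config files" in the Manager); the exact layout is from memory:
<cc_config>
  <log_flags>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>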
Have I missed something?
Thanks
I don't have any current examples, but I believe that when I ran the stock FGRPB1G application for Gamma-Ray tasks, they checkpointed. I remember a series of "C" characters printed in the stderr.txt task output, one for every checkpoint.
I could be wrong in my recollection too. Somebody else needs to confirm.
As for a task not checkpointing, whether an application checkpoints at all is entirely up to the developer who wrote it. Checkpointing is not controlled by BOINC or the project; only the application involved controls that.
Hi djejoine, Welcome to Einstein!
The apps for both of those searches do create checkpoints, as far as I'm aware. I don't run CPU searches (I did years ago) so I don't have recent experience, but I would be very surprised if anything has changed. With GRP, the frequency of checkpoint creation can be quite variable (it depends on the number of 'sky-points' being analysed), but I'm sure checkpoints are created.
Your computers are hidden so I can't check your returned tasks directly to show you how to see this for yourself. If you tell me your host ID (or change your preferences to allow others to view your computers) I will check to see what is happening.
Cheers,
Gary.
Thanks to you both.
I've made my computer visible to all in my profile. I only started Einstein early this morning, but I'm very familiar with WCG.
It's the first time that, after 3 hours of computing, I've seen a WU restart completely after a regular computer shutdown. These are very big WUs (about 30 hours for GRP and 24 hours for GW). If there's no checkpoint in a 3-hour session I won't be able to finish them in time. BOINC is set to checkpoint every 60 s and I forced it down to 10 s, with no effect for these searches.
So for now I have aborted all of them and kept only the BRP ones. I will enable them again if needed for testing.
Since you aborted all your GRP and O3 tasks we can't tell if they are checkpointing or not.
And you need to ignore the estimated times to completion for any application that hasn't yet returned 10 valid tasks.
ONLY then is the estimated time to completion accurate. When you start a brand-new application, the client has no clue how long a task will take and will almost certainly produce a nonsense value for the time to completion.
The APR (average processing rate) for the new tasks can't be calculated until you have validated 10 tasks for each new application.
You should get some tasks for GRP first, just let them run to completion and let them report. Turn in 10 validated tasks, and then try some of the O3 tasks, which are harder to run and need more system resources.
Thanks for the advice. I'll try that when I'm sure that my computer will be on long enough.
The problem is more that the checkpoint interval must be too long for the way I use my computer. Generally it is on for a period of 2-4 hours, and at the beginning I had 12 threads used by those WUs.
After a bit more than 3 hours of crunching numbers (the most advanced WU was at about 10%), I shut my computer down normally because I was done. When I started it back up, the CPU time was back at 0:00:00:00 for each of them. So there was no checkpoint between 0 and 3 hours.
I've spent (3+1) h x 12 for nothing. The project didn't get anything from those 48 hours of calculation, so I aborted all of them because it's a waste of time for everybody.
I'll post back here when I have time to test when the checkpoints happen for those units.
There's a thread in "WISH
)
There's a thread in "WISH LIST" forum which also discusses very long times between checkpoints (depends on CPU speed and CPU throttle configuration), how to check it and reasons for this. It occurs for all O3MD1 (Multi-Directional Graviational Wave) CPU tasks and specific FGRP5 (Gamma Ray Pulsar search) CPU tasks containing few skypoints. It's not a "problem" but a tradeoff which the developers made between checkpointing more frequently and the required programming effort to nest checkpointing code deep into algorithms.... which also adds computation overhead.
https://einsteinathome.org/de/content/cpu-time-checkpoint-4h
As I mention in the thread linked, the other option, instead of physically powering off your computer mid-calculation for long-running tasks that may or may not have checkpointed, is to set the client configuration to leave non-GPU tasks in memory while suspended. You control that in the Manager >> Options >> Computing Preferences >> Disk and Memory settings page.
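If you'd rather set it in a file than through the Manager, I believe the same switch is the <leave_apps_in_memory> element in global_prefs_override.xml in the BOINC data directory, something like this (the surrounding element name is from memory, so double-check):
<global_preferences>
  <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
followed by Options >> Read local prefs file so the client picks it up.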
Then set up your computer for hibernate mode. In hibernate mode, when you stop your work on the computer, the state of the machine is saved to a hibernate disk file. The PC then goes into ultra-low power-saving mode.
When you "wake up" the PC, the host replays the state of the host when it went into hibernate mode. That way your tasks don't restart from zero since BOINC was never shut down.
Currently all FGRP5 tasks contain only 6 (SIX) skypoints, checkpointing six times between 0 and 90% progress: 15%, 30%, 45%, ... ~90% (plus an additional 10 checkpoints for the final candidate toplist calculation from ~90-100%). You can easily predict when it will next checkpoint from the current progress value (see also the task details in the BOINC Manager: time since last checkpoint).
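A rough back-of-the-envelope sketch of that prediction (my own approximation, not taken from the app's source; it assumes six evenly spaced checkpoints up to ~90% progress, a constant crunching rate, and a task that is already under way):
# Estimate the next FGRP5 checkpoint from the progress shown in the Manager.
def next_fgrp5_checkpoint(progress, elapsed_s, n_checkpoints=6, main_stage_end=0.90):
    step = main_stage_end / n_checkpoints        # ~15% of progress per checkpoint
    next_cp = (int(progress / step) + 1) * step  # next multiple of ~15% (main stage only)
    rate = progress / elapsed_s                  # average progress per second so far
    return next_cp, (next_cp - progress) / rate  # (next checkpoint level, seconds until then)

# Example: a task at 38% after 4 hours -> next checkpoint at 45%, roughly 44 minutes away
print(next_fgrp5_checkpoint(0.38, 4 * 3600))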
For the O3MD1 tasks it's harder to predict. They contain between 32 and 64 skypoints, and a checkpoint is written after each one. These tasks log a number of dots "....." for each skypoint (e.g. 21 dots for a task containing 32 skypoints) and finally a "c" (when the next checkpoint was written) into the logfile "stderr.txt" in the task's slot directory. This can be monitored via 'tail -f' or 'cat' within a bash shell, or via 'type' in a Windows/DOS command prompt. I think it's not a good idea to open the logfile in a text editor, which may block write access for the science app.
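For example, on Linux (the data directory and slot number here are only examples; yours will differ):
tail -f /var/lib/boinc-client/slots/4/stderr.txt
or on Windows, in a command prompt:
type C:\ProgramData\BOINC\slots\4\stderr.txt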
I think there's also what I'd call an annoying "bug" in the current version of the O3MD1 CPU app. At the beginning of a task, while processing the first skypoint, it reports a wrong and rapidly increasing progress (I think via "boinc_task_state.xml" in the slot directory). Progress rises up to circa 12-15% (I haven't looked into it in detail yet). As soon as the first skypoint is finished and the first checkpoint is written, the task's progress jumps back to the true value of circa ~3% (example for a task containing 32 skypoints). But this "bug" only affects the progress display; tasks finish up to 100% without problems... it requires more than 30 h of CPU time on my old laptop.
O3MD1 CPU tasks checkpoint rarely. If the CPU runs throttled (via BOINC's CPU throttle configuration, or via external tools like "TThrottle" that limit CPU temperature or fan noise), it can easily take hours between checkpoints. In that case one can also limit the number of concurrently processed tasks (BOINC configuration: proportion of CPU cores to be used), to process fewer tasks concurrently but faster.
[EDIT:] Oh.. Keith already explained how to hibernate tasks.
Thanks, that was very instructive.
Thanks, I'll try this solution. The option was already checked, but I had never hibernated my computer because I never needed to suspend to RAM. There's a first time for everything.
I'll take a screenshot tonight before hibernating the computer and check tomorrow. But I'm sure it will work.
Thanks for sharing these details. I use BOINC directly to check the log file (I think it uses an equivalent of tail).
I'll check for that, but in my case, for now, it was the CPU time that went back to 00:00:00 (and the progress too, of course).
Regarding temps I'm OK here, I still have some headroom; each core is under 70 °C (5700X), so no problem there.
Finally, BRP (pure CPU) takes between 8 and 13 hours from what I've seen today.
Thanks all
@djejoine,
I started the following reply (included below) shortly after the message from you that the included quotes refer to. Unfortunately, before I could complete it, I had to deal urgently with another matter and it's only now that I can catch up with all the subsequent discussion. The reply is basically as it was when I got called away. I would normally try to polish it up a bit more.
I considered just deleting it but after reading all the other replies, it might be useful to you (or any others reading) to see an example of how to dissect the stderr.txt output that gets returned with a task, even those with errors or those aborted. It does show how to calculate the approximate checkpoint interval. It shows why this task wasn't able to create a single checkpoint. It certainly over-explains things but my intention was to ensure that even people with no experience, who might also be reading, have a greater chance of following the discussion.
===============
You can't know the true crunch time until you have completed a task. Most tasks of the same type will tend to take the same amount of time and it could be quite different from an initial estimate. After considering your hardware, my guess is that O3MD1 tasks may take around 10-15 hrs each. A lot depends on how many threads you run and what other compute intensive stuff may be running along with BOINC.
You cannot force a task to checkpoint any faster than the data analysis allows. You should interpret the default BOINC setting of 60s as a limit, below which the app will not be allowed to create one, even if it wants to. Since the O3MD1 app will try to checkpoint after probably quite a number of minutes, setting 10s will have no effect whatsoever. I'll show you later how to estimate the checkpoint interval. You can't make it more often than that.
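(For completeness: if you prefer setting that limit in a file rather than through the Manager, I believe it corresponds to the <disk_interval> element in global_prefs_override.xml in the BOINC data directory, e.g.
<global_preferences>
  <disk_interval>60</disk_interval>
</global_preferences>
but, as explained above, lowering it won't make the app checkpoint any sooner than its analysis allows.)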
If you want to understand why no progress was being made, the best way (even for aborted tasks) is to browse through the output that was returned to the project. I have looked at the O3MD1 tasks that you had and selected this one, since it had a decent run time of 1104 secs before being aborted. You could look at others if you wished.
After clicking the link, find the heading Stderr output and a few lines below that look for the line:-
putenv 'LAL_DEBUG_LEVEL=3'
For O3MD1 tasks, I believe you will always see exactly this whenever a task first starts OR is restarted after a shutdown. Notice the timestamp on the next line - 2023-02-27 20:36:45.4277. After about 40 more lines of initialisation output, there is one which says:-
followed by:-
The first line would indicate a checkpoint number if one was found, otherwise 'scratch'. The second line tells you that this is 0 of 29 checkpoints for this task. The timestamp shows that nearly a minute was used in the initialisation process that followed the original 'putenv' line.
There is then a 2-line list of various parameters followed by:-
Note that there are two things mixed here, a line of 5 dots and a 'putenv' startup string. The 5 dots indicate that there were 5 sub-loops of calculations towards the very first checkpoint when BOINC was stopped and some time later, restarted. There is a timestamp on the next line (2023-02-27 20:46:56.4279). This is less than 10 mins after crunching first started and it represents the sum of any run time plus the time it took to stop and restart. The important takeaway from this is that a checkpoint would have been created if the task had been allowed to run for a bit longer. It was stopped at less than 10 mins. It didn't crash.
If you go past the next set of initialisation lines (again ~40 lines), you will find a set of 4 dots as the task was crunching and then a series of dashes whose purpose was to separate the log from an error message that follows. My guess is that the error message (and all that followed) was a cry from the app when it realised it was being aborted :-). Once again, the timestamps will tell you how long the task had been running to produce those 4 dots.
If you allow a task to run to completion and if that actually takes just 10 hours (say) and it has 29 checkpoints, an estimate for the checkpoint interval would be 600/29 = ~20 mins. If it took 15 hrs, the estimate would be ~30 mins. So this is why no checkpoint was created in the above example - not enough run time was ever allowed. A checkpoint would show as a much longer line of dots followed by a 'C' to indicate a checkpoint saved. A new line of dots would then be started. At the very end of the run (if it was allowed to complete without being stopped) there would be 29 lines of dots, each terminated with a 'C'.
A final comment about the expected run time. It will be slowed down if your machine becomes overloaded so you need to experiment on how many simultaneous threads you allow BOINC to start. You haven't indicated if you run other compute intensive apps apart from BOINC stuff. You also have a high performance GPU. Do you use that for gaming? I noticed some MeerKAT (BRP) tasks you must have run on that. The run times should be fairly stable but I saw a low of 483s and a high of 7,478s so you're obviously running something else on the GPU to cause such a huge variation. That GPU should be able to give run times consistently towards the low end of the range.
==================
A further thought after re-reading the above (I don't do CPU tasks so can't easily check):
The 29 checkpoints figure that was stated may well actually be 19 + 10 since (I think) the main calculations create a 'toplist' of the ten most likely candidate signals and 10 of the total checkpoints might be assigned to the retesting of each candidate in the followup stage before results are returned. If so, the checkpoint interval would be longer than what was estimated above. You should be able to tell by looking at the stderr output for a completed task to see if there are 10 lines of dots which are of a much shorter length than the first 19. If this is so, the checkpoint interval for a 10 hr task would be around 600/19 = ~32 mins since the followup stage is usually relatively short and can probably be ignored.
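If it helps, here is that same arithmetic written out (the 10 hr run time and the 19 + 10 split are the assumed figures from above, not measured values):
run_time_hours = 10                  # assumed total run time
total_checkpoints = 29               # from the "0 of 29" line in stderr.txt
followup_checkpoints = 10            # assumption: one per toplist candidate in the followup stage
main_checkpoints = total_checkpoints - followup_checkpoints   # 19
print(run_time_hours * 60 / main_checkpoints)                 # ~31.6 minutes between checkpoints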
I hope someone gets some use from the above :-).
Cheers,
Gary.