FGRP5 (FGRPSSE 1.08) tasks reset progress

carmar
Joined: 27 May 21
Posts: 32
Credit: 535865
RAC: 506
Topic 231534

Hello.

About 4 hours ago, 2 of these tasks began their run. Just a few minutes ago, I had to reboot, so I suspended my project in BOINC manager and then rebooted. I always suspend first because otherwise my system misbehaves and refuses to start BOINC manager after the reboot. This has always worked fine for all BOINC projects, including Einstein.

After reboot, when I started BOINC, I noticed I had lost all progress on these tasks. First time this has happened. Thoughts?

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18697425928
RAC: 6216011

Hmmm, I looked at my running FGRP5 tasks and I don't see any evidence of any checkpointing.

Are you sure they previously checkpointed?

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7218624931
RAC: 984209

You'll normally lose progress back to the most recent checkpoint. Checkpoint spacing depends on the project and the app, and perhaps somewhat on the WU.

I normally monitor with the add-on application BoincTasks, which shows the most recent checkpoint. I don't know where to find that information otherwise, but someone else may come by and inform us both.

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18697425928
RAC: 6216011

You have to look at the work unit's properties in the Manager, which show you the time since the last checkpoint.

But to be absolutely certain, you should examine the slot directory that the task is running in and see if any checkpoint file is present.

Whether a task checkpoints at all depends on whether the application has that feature.
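
As a rough way to do that slot check, a short Python sketch can list which files in a slot directory changed recently. `recent_files` is a hypothetical helper, the slot path is something you have to supply yourself (the BOINC data directory location differs by OS and install), and checkpoint file names vary by application, so this only shows recently written files and does not identify a checkpoint for certain.

```python
import time
from pathlib import Path


def recent_files(slot_dir, max_age_s=3600, now=None):
    """Return (file name, age in seconds) for regular files in slot_dir
    modified within the last max_age_s seconds, newest first."""
    now = time.time() if now is None else now
    found = []
    for path in Path(slot_dir).iterdir():
        if path.is_file():
            age = now - path.stat().st_mtime
            if age <= max_age_s:
                found.append((path.name, age))
    # Smallest age first, i.e. most recently modified first.
    return sorted(found, key=lambda item: item[1])


def report(slot_dir):
    """Print one line per recently modified file in the given slot directory."""
    for name, age in recent_files(slot_dir):
        print(f"{name}: modified {age / 60:.0f} min ago")
```

Call `report()` with the path of the slot the task is running in. Log and output files are also written there, so a recent timestamp alone does not prove a checkpoint exists, but a slot with nothing modified since the task started suggests none has been written yet.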

 

carmar
Joined: 27 May 21
Posts: 32
Credit: 535865
RAC: 506

Thanks, all. I selected each one and it appears that neither of them has checkpointed: CPU time = time since last checkpoint = total run time.

Just yesterday afternoon I rebooted while some earlier tasks were in the final stage at 89.9% (where it shows no progress until it suddenly completes), and it even saved the progress there, because those tasks completed in the typical time after I rebooted. So it must have checkpointed yesterday, but it's not doing so today.

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18697425928
RAC: 6216011

I'd say that your unusual need to suspend the project before rebooting is the issue and is disrupting the normal checkpointing mechanism.

I'm curious whether this behavior has previously been reported as an issue in the BOINC development GitHub repository. Might want to check.

https://github.com/BOINC/boinc/issues

 

This closed issue comes closest to describing your problem, but it was closed with no action because the changes discussed would be viewed unfavorably by the majority of BOINC users.

https://github.com/BOINC/boinc/issues/4748

 

carmar
Joined: 27 May 21
Posts: 32
Credit: 535865
RAC: 506

Thanks. As you all have noted, the checkpoint request from BOINC is just a request. I set the value really high to see if it changes anything and I’ll try really low as well. Either way, if the request is ignored, that tells me nothing. But given that it checkpointed before, you seem to have nailed it that something screwed up on my machine.

PS - the enthusiasm on this thread was mildly amusing but I did appreciate the critique of the logic: https://github.com/BOINC/boinc/issues/5106

Update 1 - I suspended (did not reboot) and resumed. It saved progress although under task properties it still shows CPU time since last checkpoint is the same as total run time. Will monitor.

Update 2 - Yep, shutting down is the problem. Removed project. Will reinstall tomorrow and see if that fixes it.

carmar
Joined: 27 May 21
Posts: 32
Credit: 535865
RAC: 506

Reinstalled Einstein. Still the same issue. Tried with a different project (Milkyway), no issue.

alanb1951
Joined: 28 Nov 16
Posts: 23
Credit: 728939973
RAC: 371638

The Milkyway N-Body applications have many more opportunities where checkpointing can occur :-)

It appears that this Einstein dataset has 20 "skypoints" and, as far as I know, it can only checkpoint when it has completed processing one of those. As your system seems to take about 11 hours to run one of these tasks (including the second pass over some of the data that happens around the 90% mark), that means there won't be an initial checkpoint for at least half an hour.

So whether you get a checkpoint or not might depend on how long you let a task run before halting it! I noticed that one of your tasks did checkpoint and resume after skypoint 7, so it seems to work when it has a chance.
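
That half-hour figure can be sketched as a quick calculation. This is only an estimate under my own assumptions: every skypoint takes the same time, and the first pass over the skypoints uses about 90% of the total run time. The function name and the `main_pass_fraction` parameter are made up for illustration.

```python
def minutes_to_first_checkpoint(total_hours, skypoints, main_pass_fraction=0.9):
    """Estimate the minutes until the first skypoint (and so the first
    possible checkpoint) completes, assuming equal-length skypoints."""
    main_pass_minutes = total_hours * 60 * main_pass_fraction
    return main_pass_minutes / skypoints


# 11-hour task, 20 skypoints: roughly half an hour per skypoint.
print(round(minutes_to_first_checkpoint(11, 20), 1))
```

On a slower machine, or with the second pass taking a smaller share of the run, the gap between checkpoints grows accordingly.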

Cheers - Al.

P.S.  I'm sure one of the experts will be along to correct me about this :-)

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18697425928
RAC: 6216011

Al, your summation is correct.  The tasks do checkpoint when given the chance.  Best to not abort any task prematurely. 

 

carmar
Joined: 27 May 21
Posts: 32
Credit: 535865
RAC: 506

Thank you both.
