After error, provide option to try once from checkpoint.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0
Topic 197397

I'm not sure if this is an E@H or a BOINC wish...

Some tasks checkpoint many times then error out close to the end.

Restarting a task once from a checkpoint might be an option which could reduce errors.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 112

BOINC is set up for redundancy. If you run into too many errors for the task to finish in a normal fashion, you'll report it as an error and it'll be sent out to another computer. In the case of a bad batch of tasks, all computers that it gets sent to will give errors, and the administrators will be warned about this in the back-end.

There's really no need for your client to always finish all work correctly.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117519903578
RAC: 35381505

Quote:
I'm not sure if this is an E@H or a BOINC wish...


Definitely a BOINC thing.

Quote:

Some tasks checkpoint many times then error out close to the end.

Restarting a task once from a checkpoint might be an option which could reduce errors.


It would need to be a certain type of error, in particular one that does not cause the machine to lock up or crash. Two classic cases I can think of are file(s) that are suddenly missing or file(s) that suddenly fail a checksum. Both of these trash the currently running tasks and also cause the entire remaining cache of work that depends on these file(s) to be trashed as well. It would seem to be a much better option for BOINC simply to stop all crunching temporarily, try to replace the missing or corrupt file(s) by downloading fresh copies, and then try again from the last checkpoint.

I have had this situation quite a few times over the years. I've seen a number of cases where supposedly corrupt files are not actually corrupt at all. My impression is that quite a few of these are caused by heat and/or faulty power, again probably related to heat. At the onset of such a problem, it would be helpful if BOINC just stopped crunching rather than trashing the entire cache of work. Surely BOINC could try to replace a file and then stop if there were further problems.
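To make the idea concrete, here is a minimal sketch in Python of the "replace the file, retry once from the last checkpoint, then stop" policy. It is not how the BOINC client works and it uses no BOINC API. The names `RecoverableFileError`, `run_with_one_retry`, `run_from_checkpoint` and `refetch_file` are all hypothetical and exist only for illustration.

```python
class RecoverableFileError(Exception):
    """Hypothetical error: an input file is missing or failed its checksum."""

    def __init__(self, filename):
        super().__init__(f"missing or corrupt input file: {filename}")
        self.filename = filename


def run_with_one_retry(run_from_checkpoint, refetch_file, max_retries=1):
    """Run a task; on a recoverable file error, replace the file and resume.

    run_from_checkpoint: callable that resumes the task from its last
        checkpoint and returns a result (assumed to exist).
    refetch_file: callable taking a filename, returning True if a fresh
        copy was downloaded successfully (assumed to exist).
    Returns ("ok", result) or ("error", reason).
    """
    retries = 0
    while True:
        try:
            return ("ok", run_from_checkpoint())
        except RecoverableFileError as err:
            # Give up once the retry budget is used, so a genuinely bad
            # task is still reported as an error instead of looping.
            if retries >= max_retries:
                return ("error", str(err))
            # Stop and report if the file cannot be replaced.
            if not refetch_file(err.filename):
                return ("error", f"could not replace {err.filename}")
            retries += 1
```

Any other exception (for example a crash in the science application itself) is deliberately not caught, so only the missing-file and bad-checksum cases get a second attempt.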

Cheers,
Gary.
