I'm not sure if this is an E@H or BOINC wish...
Some tasks checkpoint many times then error out close to the end.
Restarting a task once from a checkpoint might be an option which could reduce errors.
After error, provide option to try once from checkpoint.
BOINC is set up for redundancy. If you run into too many errors for the task to finish in a normal fashion, you'll report it as an error and it'll be sent out to another computer. In the case of a bad batch of tasks, all computers that it gets sent to will give errors, and the administrators will be warned about this in the back-end.
There's really no need for your client to always finish all work correctly.
Definitely a BOINC thing.
It would need to be a certain type of error - in particular errors that do not cause the machine to lock up or crash. Two classic cases I can think of are file(s) that are suddenly missing or file(s) that suddenly fail a checksum. Both of these trash the currently running tasks and also cause the entire remaining cache of work that depends on these file(s) to be trashed as well. It would seem to be a much better option for BOINC simply to stop all crunching temporarily and try to replace the missing file(s) or the corrupt file(s) by downloading fresh copies and then trying again from the last checkpoint.
I have had this situation quite a few times over the years. I've seen a number of cases where supposedly corrupt files are not actually corrupt at all. My impression is that quite a few of these are caused by heat and/or faulty power, again probably related to heat. At the onset of such a problem, it would be helpful if BOINC just stopped crunching rather than trashing the entire cache of work. Surely BOINC could try to replace a file and then stop if there were further problems.
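To make the idea concrete, here is a rough Python sketch of the behaviour I'm describing. It is not BOINC code and does not use any BOINC API: the function names, the `refetch` callback and the use of MD5 as the checksum are all assumptions for illustration only. The logic is: check each input file, make one fresh download attempt for anything missing or corrupt, then either resume from the last checkpoint or suspend without touching the rest of the cache.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def file_ok(path: Path, expected_md5: str) -> bool:
    """True if the file exists and its MD5 digest matches the expected value."""
    if not path.is_file():
        return False
    return hashlib.md5(path.read_bytes()).hexdigest() == expected_md5


def recover_inputs(expected: Dict[Path, str],
                   refetch: Callable[[Path], object]) -> List[Path]:
    """Re-fetch each missing or corrupt file once; return the files still bad."""
    still_bad = []
    for path, digest in expected.items():
        if file_ok(path, digest):
            continue
        refetch(path)  # hypothetical callback: one fresh download attempt
        if not file_ok(path, digest):
            still_bad.append(path)
    return still_bad


def handle_file_error(expected: Dict[Path, str],
                      refetch: Callable[[Path], object]) -> str:
    """Decide what to do after a file-related task error (hypothetical policy)."""
    if recover_inputs(expected, refetch):
        # Replacement did not help: stop crunching, leave the cache alone.
        return "suspend"
    # All inputs verified again: retry once from the last checkpoint.
    return "resume_from_checkpoint"
```

The point of returning "suspend" rather than erroring out is that a machine with a heat or power problem would sit idle until someone looks at it, instead of working its way through the whole cache and trashing every task.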
Cheers,
Gary.