Had a bunch of tasks fail due to MD5 sum errors

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3567470928
RAC: 410818
Topic 196669

These tasks were downloaded over a period of several days and collectively represent most of the cache on one of my boxes.

Examples:
http://einsteinathome.org/task/325234132
http://einsteinathome.org/task/325154234
http://einsteinathome.org/task/323347142

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118638810592
RAC: 18441708


Quote:
These tasks were downloaded over a period of several days and collectively represent most of the cache on one of my boxes...


You have experienced a problem I see quite a lot at times, particularly now that it's full-on summer here :-).

In my experience (dozens of times), the file that is stated to be corrupt is not always so. In many cases, the supposedly corrupt file and a freshly downloaded copy turn out to be identical when compared. It happens to me about once a week during summer and quite rarely during winter. I've noticed several likely culprits, and there may well be others: heat, flaky RAM, slow-running CPU fans, and swollen motherboard capacitors.
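For anyone who wants to make the same comparison, here is a minimal sketch in Python. It is only an illustration, not the check BOINC itself performs; the function names `md5_of_file` and `files_match` are made up for this example. It hashes both copies in chunks and reports whether they agree.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(suspect_path, fresh_path):
    """Return True if both files have the same size and the same MD5 digest."""
    suspect, fresh = Path(suspect_path), Path(fresh_path)
    # A size mismatch settles the question without hashing anything.
    if suspect.stat().st_size != fresh.stat().st_size:
        return False
    return md5_of_file(suspect) == md5_of_file(fresh)
```

Calling `files_match` with the path of the file reported as corrupt and the path of a fresh download returns `True` when the two are byte-for-byte the same as far as MD5 can tell. If it does, the file on disk is fine and the error more likely came from the machine (heat, RAM) at the moment the check was run.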

This problem has happened to me so many times that I have developed a technique for recovering the entire cache of trashed tasks with a little bit of state file hackery. When the problem happens and the bulk of the cache is trashed, the client goes into a 24-hour backoff with all the errored tasks just sitting there. Sometimes there are a few tasks that don't actually depend on the 'corrupt' data file, and these will continue crunching. So, as long as I notice the problem within the 24-hour window, I have every chance of a full recovery.

The problem sometimes seems to be associated with the completion of one task and the start of a new one. I assume this is when an MD5 check is done on the data files to be used for the new task. Curiously, the task that has just completed has often relied on the same data file, and this is what caused me to question whether the file was really corrupt or not. The just-completed task may also be marked as a 'compute error', and on several occasions (being unconvinced of any problem with the data file) I've actually edited the contents of the state file to convince BOINC that the result was good and to return it as such. Each time I've done this, the result has been accepted and validated, confirming my suspicions.

In the southern winter just past, I don't remember any examples of this problem. We've had some hot days lately (39°C) and I've had this problem three times in the last couple of weeks. I've already shut down a number of hosts for the summer and a bunch more will be off shortly.

Cheers,
Gary.
