Last night one of my computers suffered an (apparent) bitrot event that took out the S6LV application. As a result, all of my S6LV tasks failed. However, until I triggered a manual update, BOINC didn't attempt to redownload the damaged executable.
10/10/2012 3:44:16 AM | Einstein@Home | [error] Signature verification failed for einstein_S6LV1_1.13_windows_intelx86__SSE2.exe
10/10/2012 3:44:17 AM | Einstein@Home | Computation for task h1_0381.65_S6GC1__779_S6LV1B_0 finished
10/10/2012 3:44:17 AM | Einstein@Home | Output file h1_0381.65_S6GC1__779_S6LV1B_0_0 for task h1_0381.65_S6GC1__779_S6LV1B_0 absent
10/10/2012 3:44:17 AM | Einstein@Home | Computation for task h1_0382.05_S6GC1__1488_S6LV1B_0 finished
10/10/2012 3:44:17 AM | Einstein@Home | Output file h1_0382.05_S6GC1__1488_S6LV1B_0_0 for task h1_0382.05_S6GC1__1488_S6LV1B_0 absent
Copyright © 2024 Einstein@Home. All rights reserved.
Why didn't boinc recover automatically from a signature verifica
Which Client version are you using on this computer?
BM
RE: Which Client version
It's this host and it's using 7.0.28.
I've seen similar behaviour from time to time - whole caches of tasks being trashed due to supposed checksum failures of critical files like executables or key data files and BOINC doesn't try to get a fresh copy of the 'corrupt' file. In all cases I've noticed, the work cache gets trashed and BOINC goes into a 24hr backoff.
I've had a case of this about a week ago with a brand new build. It had been running for a couple of days with no problems and suddenly BOINC decided there was a checksum failure in the middle of computation and trashed the work cache and went into a 24 hr backoff. Because I've seen these before (many times in total) I simply stop BOINC and edit the state file to recover the work cache and remove the spurious 'checksum failure' damage and then restart BOINC. I usually replace the 'complained about' file with a fresh copy but I was confident enough to believe there was nothing wrong with the file, so I didn't bother. Crunching restarted successfully with the fully recovered work cache so there was obviously no real checksum failure.
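For anyone who would rather check the 'complained about' file than take it on trust, here is a minimal sketch in Python. It assumes you already have a known-good MD5 for the file from somewhere you trust (for example, computed on another host that is crunching fine). The function names are just illustrative, and this is not how BOINC itself does its signature verification.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches(path, expected_md5):
    """Compare the file's digest with a known-good value (assumed to be supplied by you)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

You would call `file_matches()` with the path to the executable in the project directory and the reference digest. If it returns True, the file on disk is probably fine and the fault is more likely somewhere else.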
In the past (and also in this case) the true problem has nothing to do with BOINC. It's a flaky RAM problem and, a few hours later, the machine trashed the cache again. So I fixed the work cache yet again and this time I pulled out my Memtest CD and sure enough, a few errors were found during a full pass of all tests. However, when I did a second pass, the errors had disappeared. So I went into the BIOS and backed off the memory frequency from 1333 to 1066 but left the timings (from SPD) on auto. I ran a couple of passes of Memtest and no errors were reported so I put the machine back crunching.
This time it ran for well over a day before the next work cache trash - just when I was beginning to think the problem was 'solved' :-). So after yet another work cache recovery, I gave up 'experimenting' and replaced both sticks of RAM and put the speed back to 1333. The machine has been running without further problems with the replaced RAM. At no point did I need to replace any supposedly corrupt file.
I've actually put the flaky RAM in a further new build with a different brand motherboard and this time there don't seem to be any problems (so far). I had a feeling it might be a case of incompatibility rather than outright failure since Memtest didn't always find errors in its tests.
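Since intermittent faults like this don't show up on every pass, one rough cross-check is to hash the same file several times and see whether the result ever changes. The sketch below is only an illustration of that idea (the function names and the default of 5 passes are my own assumptions). It is no substitute for Memtest, and a stable result does not prove the RAM is good.

```python
import hashlib


def repeated_digests(path, passes=5, chunk_size=1 << 20):
    """Hash the same file several times and return every digest obtained."""
    digests = []
    for _ in range(passes):
        h = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                h.update(chunk)
        digests.append(h.hexdigest())
    return digests


def is_stable(digests):
    """True when every pass produced the same digest."""
    return len(set(digests)) == 1
```

If `is_stable()` ever returns False for a file that nothing is writing to, the hardware is suspect rather than the file.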
Cheers,
Gary.