hey guys,
i wasn't sure if i should start a new thread or bump this one...so i guess i'll start here...
10 of my last 19 GW tasks resulted in download errors, even though they're shown as computation errors on the server...here's one of them:
Name h1_0077.25_S6GC1__58_S6BucketA_0
Workunit 97651654
Created 19 May 2011 3:23:34 UTC
Sent 19 May 2011 5:51:14 UTC
Received 23 May 2011 11:14:32 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -185 (0xffffffffffffff47)
Computer ID 3897627
Report deadline 2 Jun 2011 5:51:14 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Claimed credit 0.00
Granted credit 0.00
application version Gravitational Wave S6 GC search v1.01 (SSE2)
Stderr output
6.12.26
WU download error: couldn't get input files:
l1_0077.40_S6GC1
-119
MD5 check failed
keep in mind that i've had plenty of tasks complete successfully since the new GW run began. should i also try to switch my download mirror and see if that does anything?
*EDIT* - please note that this and the 9 other errored tasks did not show up as errors until my host tried to initialize crunching on them last night. if they were download errors, why didn't they automatically show up as such just after they were downloaded to my host back on the 19th?
Cheers,
Gary.
Tasks being trashed due to MD5 checksum errors for LIGO data files
I strongly believe that your particular problem is quite different from that of the OP. However, I'll respond here and hope that the OP forgives us both.
I don't think changing your download mirror will help. As an aside, if anybody wants to try a different mirror for a file, don't bother changing your timezone to achieve it. Just browse your state file for the <file_info> block for the file in question and manually download the file using one of the other URLs that you will find listed there. Simply replace the existing file in your EAH project directory with the new copy you have manually downloaded.
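From memory, a <file_info> block looks something like this - the size, status value and mirror hosts below are made up for illustration, and some tags are omitted:

    <file_info>
        <name>l1_0077.40_S6GC1</name>
        <nbytes>1048576.000000</nbytes>
        <status>1</status>
        <url>http://mirror1.example.org/download/l1_0077.40_S6GC1</url>
        <url>http://mirror2.example.org/download/l1_0077.40_S6GC1</url>
    </file_info>

Any one of the <url> entries will do for the manual fetch.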
On a particular host of mine over a period of several weeks, I had groups of tasks that would error out with exactly the same -119 error code and the MD5 checksum error message. If these were allowed to report they would show a -185 exit status and a compute error - precisely the same as yours. The stderr text was always just like yours, with a particular LIGO data file being specified as the culprit. The particular LIGO file would always be different each time the error cropped up - a bunch of tasks depending on the particular LIGO file would error out about once every couple of days. Because that host was running NNT, I was invariably able to notice the bunch of errors before they were reported.
The LIGO file was never the culprit and it was nothing to do with downloading. Independent checks of the MD5 checksums always revealed that they were in fact quite correct. Whilst I did initially replace supposedly flawed files with known good ones, it never solved the problem.
It's a very long story but in chasing this one down, I actually found I could stop BOINC (after tasks errored out but before any could be reported) and edit the state file and fully retrieve all of the supposedly errored results and have these tasks reprocessed without subsequent error and without changing the data files. This proved to my satisfaction that the problem was nothing to do with faulty data but more likely to do with faulty hardware. I solved the problem completely by turning off 'auto SPD' RAM timings in the BIOS and setting the individual values manually. I set the CAS latency to +1 compared to the SPD value. The tasks take very slightly longer to crunch now but I haven't seen a single error on that machine for probably a month or more.
Exactly, and logic says they are not download errors. When one task finishes and a new task initialises, the slot directory for the new task is populated with links to the LIGO data files that are needed for the task. Some routine would be invoked to check (yet again) the validity of the LIGO data. I suspect that intermittently, something goes wrong at this point and previously valid data is identified as suddenly being invalid and all hell breaks loose for the current task and any others that also depend on the same data file. It's interesting that the problem can be solved by relaxing a memory timing. A different brand of RAM might also do the same. It also might have something to do with running sticks in dual channel mode. Although they are the same brand and size, perhaps they aren't quite compatible.
I don't know if something like this will cure your problem but I'd be looking carefully at your RAM if I were you. In my case the problem developed over time after the host had been in service for quite a while without problems. I guess something drifted out of spec to tip it over the edge.
Cheers,
Gary.
Just before we leave the MD5
Just before we leave the MD5 issue (and apologies to Odysseus for the thread hijack)....
The usual BOINC MD5 check is: generate MD5 on the server (and store in the database), generate MD5 again on the client after download, compare the two.
The <md5_cksum> stored in the <file_info> block in client_state.xml is the server-generated MD5, as retrieved from the database: if it matches an independent locally-generated MD5, then your explanation is probably correct. But I must say I find it surprising that memory timing issues would zap the MD5 calculation function within BOINC, without affecting any other MD5 generator or BOINC's (Einstein's) general scientific calculations.
The other MD5 failure mode is when the server version gets damaged before or during its traversal through the database. That can show up in the message log:
The 'expected' value is the server copy as recorded in the <file_info> block; 'got' is the local recalculation. In that (CPDN) case, the server-held information was corrupt.
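To put that concretely: the stored value sits in the state file roughly as below (checksum invented here, other tags omitted), and it can be checked against what an independent tool such as md5sum reports for the file on disk.

    <file_info>
        <name>l1_0077.40_S6GC1</name>
        <md5_cksum>0123456789abcdef0123456789abcdef</md5_cksum>
        <!-- server-supplied checksum; compare with the output of, e.g.,
             'md5sum l1_0077.40_S6GC1' run in the Einstein project directory -->
    </file_info>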
thanks for the in depth
thanks for the in-depth response Gary. one question though - it's easy enough to set NNT for E@H, but what exactly am i editing in the client state file? in other words, what exactly do i have to do to the client_state.xml to make my host recognize the errored (but not yet reported) results as tasks that have yet to be processed?
in the meantime, i'll double check my memory timings/settings...although i have to agree with Richard in that it's very odd that memory could cause a checksum error, yet not cause an error in the processed data themselves.
TIA,
Eric
to the OP, sorry for the thread hijack - i thought we had the same problem at first. perhaps a forum moderator can cut-n-paste post #3 and everything subsequent to it into a new thread with an appropriate title?
I should have moved the first
I should have moved the first message about this before responding to it. My apologies to all concerned. Once I've fixed this by shifting all the discussion into this new thread, I'll answer all the subsequent responses that have arisen.
Cheers,
Gary.
thanks so much for moving the
thanks so much for moving the thread Gary. i have E@H set to NNT, and if/when the next errors occur, we can talk about the state file.
thanks again for the insight.
RE: ... if it matches an
It did - at least on the couple of occasions that I took the trouble to check it out. I quickly came to the conclusion that NEITHER the data itself NOR the server supplied checksums stored in the state file were corrupt. My assumption was that it had to be something to do with the way the checksum was being calculated/tested within the BOINC client.
My situation is quite special. I have ALL the LIGO data in question stored locally on a central host. I have the <file_info> blocks for this data (extracted from state files) also stored locally. I have dozens of hosts sharing this data. I had only one host trashing groups of tasks that depended on this data. Initially, all the hosts sharing this data were "seeded" with a selection of <file_info> blocks from the local repository to get them started on the frequency range I wanted them to use. Apart from the issues with this one host, I had nothing else to suggest there was any problem with the data or the server supplied checksums.
I would have retrieved groups of trashed tasks on this particular host at least 20 - 30 times over the (quite extended) period that I experimented with this particular problem. When I first noticed this problem, I had a large multi-day cache and NNT was set. There were about a dozen successfully completed tasks and perhaps 50+ tasks that had been trashed. There were more tasks that didn't depend on the supposedly corrupt LIGO file and the host was still working on these.
Naturally, I was quite dismayed so I stopped BOINC and started browsing the state file to try to get a handle on what was going on. It's a long story but I spent a lot of time studying the internals of the state file and I decided to attempt a recovery, knowing that E@H had the "resend lost tasks" feature enabled. By comparing a copy of a "good" state file with the current one, I found it quite easy to see what needed to be done to "undo" the damage. It was easy to see the particular LIGO file that caused the problem and "fix" the <file_info> block for that file. I had a freeware MD5 tester and I used it to create a sum to compare against the value in the state file. It checked out OK but I did replace the file with a fresh copy from the local repository - which checked out OK as well.
There were a couple of other minor things that needed fixing, e.g. status values of -119 and -161 instead of zero (or 1) spring to mind. These were quite obvious when comparing good with bad. The biggest deal was what to do about all the <result> ... </result> blocks for the 50+ errored tasks. I decided to delete them all completely and see if the server would resend them once I restarted BOINC. I did spend a lot of time checking things before finally saving the changes.
So, with fingers crossed, I restarted BOINC and was quite surprised to see no complaints at all. The fully completed tasks were still there ready to report. The 50+ errored tasks were simply not listed any more. The 4 tasks in progress (quad core) all restarted from where they had left off. The remaining unstarted tasks were still there waiting to go. I left NNT in place and updated the project. The completed tasks were reported and validated. The server resent all the "lost tasks", a dozen for each time I updated. I kept a close watch on this machine as further tasks completed and there were no issues with validation. I enabled new tasks and refilled the cache without any issues - until a day or so later when it all hit the fan once again. The only difference this time was that a different LIGO file was being complained about.
As I mentioned before, I've performed the above procedure at least 20 times on this same host and never had any issue recovering the cache fully each time. There were never any issues with validation of any tasks at any time, that I noticed. I made refinements to the technique. For instance, while most of the trashed tasks had no accumulated CPU time, there were usually a couple that were part crunched at the time that BOINC decided that a LIGO file was suddenly corrupt. When this occurred, I noticed that in the <active_task_set> block at the very bottom of the state file, there were still <active_task> entries for each of the partly crunched failed tasks. As these had unique numbers associated with them, this gave me the idea that rather than delete the <result> block for each failed task, maybe I could simply edit them to "fix" what BOINC had done to them and thereby get them to reload the checkpoints that were still in the slot directories. And, yes, the editing was quite easy to figure out and it worked perfectly. The checkpoints for each partly crunched task were reloaded and crunched to completion and I didn't find a single task that failed subsequently to validate. I was able to extend this editing to all failed tasks and so I stopped creating lost tasks to be resent, entirely.
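For reference, the entries I mean at the bottom of the state file look roughly like this (values invented, tag list abridged and from memory):

    <active_task_set>
        <active_task>
            <project_master_url>http://einstein.phys.uwm.edu/</project_master_url>
            <result_name>h1_0077.25_S6GC1__58_S6BucketA_0</result_name>
            <slot>2</slot>
            <checkpoint_cpu_time>10234.560000</checkpoint_cpu_time>
            <fraction_done>0.654321</fraction_done>
        </active_task>
    </active_task_set>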
I formed the view that data and checksums must all be OK, otherwise a task would never have got to the partly crunched stage in the first place. The only other thing I could think of was that the disk surface might be on the way out and perhaps there were a quite small number of occasions where a disk read was failing. I tested that by renaming each "complained about" LIGO file with a _BAD extension and then making a new copy of the file. My thinking was that, over time, I should be able to "blank out" the bad spots on the disk in this way. After 20 - 30 iterations I came to the conclusion that this theory wasn't likely to be true. That's when I decided to play with the RAM timings - with immediate success.
I don't claim I fully understand this problem but I know how much I persisted with the experiments and I know what ultimately worked for me. If you have a better explanation I'll be very happy to hear it.
Cheers,
Gary.
well i can't say that i've
well i can't say that i've had any errors that have run for more than zero seconds yet. but you occasionally got one, so i suppose i should keep an eye out for it. i thought my one saving grace was that i wasn't wasting any CPU cycles on a task before it errored out, but i guess i can't always expect it to be that way.
interesting that you initially deleted the <result> blocks of the failed tasks to get them to resend, yet it was the partially completed (but failed) tasks that turned you on to editing, not deleting, the <result> blocks in order to prevent the resend of failed tasks altogether and pick up at their most recent checkpoint.
thanks for the basic outline to the general procedure. again, if/when i get more of these kinds of errors, i know to look to the state file. if i have any further questions on how to go about editing things, i'll be sure to post them here.
RE: ... in other words,
OK, first the usual warning. Mistakes in editing your state file can trash your entire cache. Be very careful and make sure you use a "plain text" editor. I use notepad2 for Windows, kwrite for Linux and Text Wrangler for Mac OS X. The following notes assume you have already browsed your state file and have a basic understanding of its structure and that you can identify what I'm talking about without further hand holding. If any person reading this is not in this category, I strongly suggest that you make the decision not to fiddle with your state file. Please don't ask for step by step details. I don't have time to do that, sorry.
There are 6 areas of the state file to be aware of in particular:-
1. The <file_info> blocks for the LIGO data files
2. The <file_info> blocks that contain task name and output file information. These contain <name> and <status> tags, amongst other things.
3. The <app_version> block which separates the previous two from the next one
4. The <workunit> blocks
5. The <result> blocks
6. The <active_task_set> block which contains all the <active_task> entries
I'm describing the state file from the point of view of a single project (E@H) and a single application (the GW app). Obviously things are more complicated if you have multiple projects and multiple apps per project.
The only things you need to worry about are items 1, 2, 5, and 6 in the above list.
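Very roughly, and from memory, the overall layout runs in this order (other blocks such as the <app> and <app_version> ones are omitted here):

    <client_state>
        <project> ... </project>
        <file_info> ... </file_info>          <!-- one per file: LIGO data files, app files, task output files -->
        <workunit> ... </workunit>            <!-- one per workunit -->
        <result> ... </result>                <!-- one per task; this is what eventually gets reported -->
        <active_task_set>
            <active_task> ... </active_task>  <!-- one per task currently occupying a slot -->
        </active_task_set>
    </client_state>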
1. Look for a negative <status> (-119) for the 'complained about' data file. There will also be an <error_msg> block added in. Change the status to '1' and completely remove the error message. If you are satisfied the file is not corrupt, leave it alone. If you are not sure, either check it manually or perhaps use one of the mirror URLs to grab a new copy and replace it.
2. At the time of the failure some of these may have acquired a negative <status> (-161 springs to mind). These should all have a status of zero. I started by actually deleting all the blocks that had -161 status, on the basis that they were output file entries and so would be recreated. That worked fine but I also found that just changing the negative status to zero also worked.
5. This is where most of the fixup work resides. These are task/result templates where things change and/or get added at various stages during the crunching of a task. This is what finally gets sent to the scheduler via a sched_request when a completed task (successfully or after erroring out) is reported. Take a look at an 'unstarted' result. Notice things like the <exit_status> being zero and the <state> being 2, and the order in which the various lines appear. Then take a look at the <result> block for an 'in progress' task and play 'spot the difference'. From memory, the main difference is a single extra entry added part-way through the block.
If you get tasks that error out but are not reported, there are a few extra things to notice. The <exit_status> will have a negative value (-185 springs to mind in my case), the <state> will be a value larger than 2, and there will be extra lines added into the block, including error message lines. When I was into fixing trashed caches, all I did was search for the -185 and zero it, change the <state> in the next line back to 2 and then remove all the added lines (sketched below). The only exception was for partly crunched tasks where there was an entry for the task in <active_task_set>. In that case I would look for, and leave in place, the extra line recording the task's progress, since I was expecting the partly crunched task to be retrieved from its checkpoint (which it always was). From memory, that line always sat immediately above the lines I was removing.
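To sketch that fixup (the task name is borrowed from your example; everything else is invented or abridged), an errored-but-unreported result goes from something like the first fragment back to something like the second:

    <result>
        <name>h1_0077.25_S6GC1__58_S6BucketA_0</name>
        <exit_status>-185</exit_status>
        <state>3</state>
        <!-- plus the extra error/stderr lines the client added - these all get removed -->
    </result>

    <result>
        <name>h1_0077.25_S6GC1__58_S6BucketA_0</name>
        <exit_status>0</exit_status>
        <state>2</state>
    </result>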
The above is all from memory but I did it so many times without incident that I'm pretty confident it's close to the mark. Browse your current state file and see if you can see all the things I'm talking about.
Cheers,
Gary.
RE: well i can't say that
I've seen examples of both situations. My host is a quad so it would be quite normal to have multiple tasks depending on the same data on the go at the critical time. I formed the view that the problem was triggered when one task completed and a new one was initialising. If the MD5 check for a LIGO file 'failed' at that point, any 'in progress' tasks that happened to depend on the same file would also be trashed, but would leave a valid (and recoverable) checkpoint in a slot directory. If the other 'in progress' tasks didn't depend on the same LIGO file, they would continue crunching without incident. In both cases, every other 'ready to start' task in the cache which depended on the same LIGO file would immediately be trashed as well.
I also saw quite a few examples of the 'just finished' task being labeled as a failed task even though perusing the stderr output in the <stderr_out> block showed that crunching had finished successfully and all seemed OK. I found that I could edit these as well and have them validate successfully. From memory I think it was as simple as making sure the <exit_status> was zero and the <state> was 5 and that any extraneous error messages about checksum failures were removed. There was always some sort of junk error message at the end of the <stderr_out> block as I recall it. Very easy to spot by doing a comparison with a 'good' result.
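In other words (again abridged and from memory), once tidied up, the relevant lines of such a 'finished' result read something like:

    <result>
        <name>h1_0077.25_S6GC1__58_S6BucketA_0</name>
        <exit_status>0</exit_status>
        <state>5</state>
        <stderr_out>
        <!-- the normal stderr text stays; the trailing checksum-failure junk gets deleted -->
        </stderr_out>
    </result>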
It was really the sight of 100% complete tasks being junked (not to mention the many 90%+ completed tasks also) that spurred me to investigate more closely. That waste of crunching really irked :-).
You're welcome.
Cheers,
Gary.
RE: Please don't ask for
understood. that shouldn't be a problem...i'm not too computer illiterate ;-). besides, if you don't consider your response "detailed," i can't imagine what a detailed explanation would look like lol. i think the info you provided should be more than enough to walk me through the process...
will do...it'll have to wait until i get home though, b/c the only DC project i run on my work PC is S@H, so that client_state.xml doesn't contain any of the info we've been talking about.