I've got one stuck WU on my machine: it refuses to upload, or the server rejects it for whatever reason.
I'm getting enough work to crunch, and other WUs have uploaded since this one got stuck, but this one won't leave.
http://einsteinathome.org/task/261241002
Here are the messages from BOINC:
Do 08 Dez 2011 19:37:31 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 19:37:31 CET | Einstein@Home | Backing off 3 hr 35 min 7 sec on upload of p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2
Do 08 Dez 2011 19:37:34 CET | Einstein@Home | [fxd] starting upload, upload_offset 0
Do 08 Dez 2011 19:37:34 CET | Einstein@Home | Backing off 10 hr 16 min 17 sec on upload of p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_1
Anyone got any idea what's wrong with it? Or is there any other flag in the cc_config I could try, to narrow the problem down?
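For reference, the transfer-related log flags go into cc_config.xml in the BOINC data directory like this (a minimal sketch; these two flag names are standard BOINC log_flags, but double-check against your client version):

```xml
<cc_config>
  <log_flags>
    <!-- detailed upload/download progress messages -->
    <file_xfer_debug>1</file_xfer_debug>
    <!-- full HTTP request/reply traffic, very verbose -->
    <http_debug>1</http_debug>
  </log_flags>
</cc_config>
```

The client re-reads cc_config.xml on "Read config file" from the manager, so no restart is needed to enable them.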
One WU not uploading, let alone report
I just found some new flags for the cc_config and tried them; I don't know whether they are useful:
WU unable to upload:
and just after that another one finished and got this reaction:
Greetings from Sänger
OK, it's the weekend, but still: anyone at home with some kind of answer?
Any way to get the results back outside BOINC?
Greetings from Sänger
OK, I'll give it a try. I had one of these once, which turned out to be a misleading error message. I'm on Windows, so I can't give you the 'how', but I can suggest 'what' to look for.
You are trying to upload a file called 'p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2'. First check: does that file exist on your system? It would be in your einstein@home project directory, if it does - but I suspect you'll find that it doesn't.
Second check: can you find the complete record of that file's attempts to upload in your old message log records - stdoutdae.txt and stdoutdae.old - in your BOINC data directory? Look for the very first upload attempt, and follow it from there. Again, I suspect that the file actually uploaded successfully on one of the early attempts, but for some reason BOINC didn't register that fact properly, and keeps retrying.
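To help pull the complete history for one file out of those logs, a little script like this can do the digging (a sketch only - the sample lines below are made up, and the real stdoutdae.txt wording may differ slightly):

```python
def upload_history(log_text, filename):
    """Collect log lines about uploads of `filename`, oldest first."""
    return [line for line in log_text.splitlines()
            if filename in line and "upload" in line.lower()]

# made-up sample in the usual BOINC message style
sample = """\
08-Dec-2011 19:37:31 | Einstein@Home | Started upload of p2030_3344_0_2
08-Dec-2011 19:37:31 | Einstein@Home | Backing off 3 hr 35 min on upload of p2030_3344_0_2
08-Dec-2011 19:40:02 | Einstein@Home | Finished upload of p2030_3344_0_1
"""

for line in upload_history(sample, "p2030_3344_0_2"):
    print(line)
```

In practice you would read stdoutdae.old first and then stdoutdae.txt, so the oldest attempts come first.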
If those two checks convince you that the file has uploaded, then the third step is to manually modify your client_state.xml file to reflect that fact. You've been around for long enough to know the rules for that, but I'll restate them first for other readers who may be shoulder-surfing.
* Take a backup of the file
* Use a plain-text editor only
* Be very, very careful
You're lucky that it's a BRP4 task, because they upload eight files for each task, and from the sound of it seven of them uploaded OK. What you need to do is to make the 'stuck' upload look like one of the successful ones.
Here's the general shape of a completed upload:
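(The example block originally posted here hasn't survived the forum formatting. Going by the entries for the successfully uploaded files later in this thread, a completed upload's file_info entry looks roughly like this - the tag names are my reconstruction from BOINC's client_state.xml format, not verbatim from the original post:)

```xml
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_3</name>
    <nbytes>4034.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>c1b14380950ed36779fbc2f0b350bd1a</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <!-- note: no <persistent_file_xfer> block here - the transfer is finished -->
</file_info>
```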
You'll need to check, in particular, the <status> tag, and remove any guff relating to a "persistent file transfer" (the <persistent_file_xfer> block).
Once you've double-checked your edits, save the file and restart BOINC. If my suspicions are correct, the task should be ready to report normally from there on.
Then we can spend the rest of the weekend wondering why a "file not found" problem gets reported in the message log as "URL not found".
Thanks Richard,
I've got 3 stuck files, *_0_0, *_0_1 and *_0_2.
None of them is still in the project folder - just the original 9 WU files, including the *.zap.
The stdout.* logs are too young; too many messages pile up there, thanks to too many flags in the cc_config ;)
I'm quite convinced that the files uploaded fine; my stupid BOINC here just didn't register it.
Stopping BOINC is off the menu until my current yoyo and RNA tasks are finished (no checkpoints).
I'll try tomorrow or Monday, depending on the accuracy of the estimated runtimes of those WUs.
The client_state.xml says this about those buggers:
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_1</name>
    <nbytes>4086.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>ec99f53814b50bbdf9e7a2dfd1230849</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>11</num_retries>
        <first_request_time>1323343541.805885</first_request_time>
        <next_request_time>1323551394.800235</next_request_time>
        <time_so_far>0.000000</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
    </persistent_file_xfer>
    <signed_xml>
        <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_1</name>
        <max_nbytes>6000000</max_nbytes>
        <url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</url>
    </signed_xml>
    <xml_signature>
signature
.
    </xml_signature>
</file_info>
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2</name>
    <nbytes>4025.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>f3d15b58bb69e33997ccd666eaad6448</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>12</num_retries>
        <first_request_time>1323343541.805885</first_request_time>
        <next_request_time>1323558421.274246</next_request_time>
        <time_so_far>0.000000</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
    </persistent_file_xfer>
    <signed_xml>
        <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_2</name>
        <max_nbytes>6000000</max_nbytes>
        <url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</url>
    </signed_xml>
    <xml_signature>
signature
.
    </xml_signature>
</file_info>
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_3</name>
    <nbytes>4034.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>c1b14380950ed36779fbc2f0b350bd1a</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <signed_xml>
        <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_3</name>
        <max_nbytes>6000000</max_nbytes>
        <url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</url>
    </signed_xml>
    <xml_signature>
signature
.
    </xml_signature>
</file_info>
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_4</name>
    <nbytes>4016.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>d2695520e63434d403967acf43428dbd</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <signed_xml>
        <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_4</name>
        <max_nbytes>6000000</max_nbytes>
        <url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</url>
    </signed_xml>
    <xml_signature>
signature
.
    </xml_signature>
</file_info>
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_5</name>
    <nbytes>4078.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>d5071abc6b07b80cec46815c45687144</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <signed_xml>
        <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_5</name>
        <max_nbytes>6000000</max_nbytes>
        <url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</url>
    </signed_xml>
    <xml_signature>
signature
.
    </xml_signature>
</file_info>
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_6</name>
    <nbytes>4061.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>51568a5e552d6b42d4000af4f71445f7</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <signed_xml>
        <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_6</name>
        <max_nbytes>6000000</max_nbytes>
        <url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</url>
    </signed_xml>
    <xml_signature>
signature
.
    </xml_signature>
</file_info>
<file_info>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_7</name>
    <nbytes>4003.000000</nbytes>
    <max_nbytes>6000000.000000</max_nbytes>
    <md5_cksum>ce49951c1ee709907cb35c1512d34f57</md5_cksum>
    <status>0</status>
    <upload_url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</upload_url>
    <signed_xml>
        <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0_7</name>
        <max_nbytes>6000000</max_nbytes>
        <url>http://einstein-dl.aei.uni-hannover.de/cgi-bin/file_upload_handler</url>
    </signed_xml>
    <xml_signature>
signature
.
    </xml_signature>
</file_info>
I think I'll simply copy one of the other 8 entries and just change the *_0_X part ;)
Wrong idea, I see - the checksums are different - but I'll manage somehow, methinks.
Greetings from Sänger
It might be a better idea to just manually insert some tags for the entries that are stuck, but Richard will know the best way to proceed.
HBE
I've copied the interesting parts into a calc sheet, and this is my plan:
* insert in the red field
* delete the pink lines
* keep the grey ones as they are
Any objections?
Greetings from Sänger
Yes, that looks fine - insert one line, delete 7 lines, no other changes (per file).
I'm not sure whether BOINC will detect that the task is now "Ready to report" automatically on restart, or whether you need to make one more edit. Try it and see: if it stays stuck at "Uploading", here's the recipe:
That is: change <state> from '4' to '5', and further down - a long way further down; I snipped about 400 lines of stderr output - add two new lines.
<ready_to_report/> is self-explanatory.
I don't think the actual value in <completed_time> matters too much (and it certainly doesn't need to be accurate to the microsecond). Just put in something that will pass a rudimentary sanity check (after the task was issued to you, and before the time you try to report it).
That one finished around 8pm this evening, which meets both tests, so you may as well copy this line:
<completed_time>1323547759.296875</completed_time>
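(Putting that recipe together, the edited <result> block would look roughly like this. This is a reconstruction, not Richard's verbatim example: the result name and the surrounding tags are assumptions based on BOINC's client_state.xml format.)

```xml
<result>
    <name>p2030.20100614.G46.20-00.29.C.b0s0g0.00000_3344_0</name>  <!-- assumed result name -->
    <state>5</state>  <!-- was 4 (files uploading); 5 = files uploaded -->
    <!-- ...several hundred lines of other fields and stderr output snipped... -->
    <!-- the two new lines go here, near the end of the block: -->
    <completed_time>1323547759.296875</completed_time>
    <ready_to_report/>
</result>
```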
It worked, just has to be validated.
Thank you very much for your help :)
Now you can think about that other question, but probably next weekend ;)
Edit: it just validated while I wrote this post.
Greetings from Sänger
Yes, thanks very much to Richard for his very clear instructions. I haven't seen this particular issue previously, but I had a problem about a year ago which could also be recovered from by careful editing of the state file. In that case, quite suddenly (and with 4 tasks in the middle of being crunched), a large data file for a GW task was declared as having an incorrect MD5 sum. The four tasks in progress were immediately errored out, and all unstarted tasks that depended on the same large data file also errored out without being started. There were no immediate communications with the project, because these were being prevented by a 24-hour backoff. The first time it happened, I was fortunate to notice the situation before the 24-hour backoff had expired, so I had time to shut down BOINC and analyse the situation while everything was 'as it was' immediately after the problem occurred.
I couldn't figure out why a data file would suddenly become bad, so I manually ran an MD5 checksum utility and generated a checksum for the file, which I then compared with what was stored in the state file. The generated MD5 sum agreed perfectly with what was stored. The only thing I could come up with was that perhaps there was a flaky disk sector under that file and, occasionally, a read of the file was returning bad data. To test that out, I renamed the data file with a _BAD extension (so the bad sector would remain covered) and then put a fresh copy into the project folder, hopefully onto a 'good' location on the disk.
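For anyone wanting to repeat that check, computing an MD5 sum to compare against the value in the state file can be done like this (a sketch; the demo hashes a throwaway file rather than a real data file):

```python
import hashlib
import tempfile

def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large data files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# demo on a throwaway file with known contents
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

print(md5_of(path))  # compare this against the <md5_cksum> value in client_state.xml
```

The same thing is available from the command line as md5sum on Linux.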
I then browsed the entire state file looking for what had been inserted or changed as a result of the problem. The first thing was the block for the 'corrupt' data file. I think it had a -161 status and an inserted error message - something like that - and it was quite easy to restore. There were other blocks, for the tasks themselves, that had error tags inserted inside them. There were a lot of these, as I had about 80 tasks in the cache of work. Since they were errored out anyway, I figured the best thing to do was to delete all those that had signs of damage. The remaining blocks all looked OK, so I left those alone.
I was now down into the <result> blocks section of the state file. There were a couple of completed-and-uploaded but not-yet-reported tasks, and I was very keen to preserve those. There were 4 'in progress' tasks, which were also recorded in the <active_task_set> section right near the bottom of the state file. By looking carefully at those, I realised that recorded there was information on when the last checkpoints were written, immediately prior to the error occurring. I also noticed that the slot directories seemed to be intact, so I reasoned that it might be possible to fix things and restart these 4 tasks from the saved checkpoint data that was still physically in the respective slot dirs.
So I formulated the plan to simply delete all the errored results and save the completed-and-uploaded ones. I then had to edit the 4 in-progress ones to allow them to restart from saved checkpoints. I figured I could use the 'resend lost results' feature of the server to give me back all the failed tasks that I had deleted from the state file, and in the process all the supporting entries would also be restored.
To get this right, I simply stopped BOINC on a good machine and then browsed the 'good' state file to see exactly what was recorded for the in-progress tasks on that machine. By looking at differences between the 'good' and 'bad' state files, it was very easy to see what to do. As I recall, I had to change the values of a couple of tags and then remove a series of contiguous lines of error messages that had been added to each result at the time of the failure.
To cut a long story short - it all worked. When I restarted BOINC, there were no error messages. The tasks list showed exactly those tasks I was attempting to save - the fully completed ones and the 4 'in-progress' ones, which had all successfully restarted from their saved checkpoints. I was quite elated about this. The final part was to 'update' the project (which was still counting down the remainder of the 24 hour backoff) and see if the server would resend the lost results. And, yes, it did exactly that in batches of 12 lost results at a time.
That machine ran for a couple of days with all good results, until exactly the same thing happened again, but with a different large data file failing the MD5 check this time. So I figured it must be another bad sector, and repeated the above process. After more than 10 iterations of this, I finally decided I had to abandon the 'bad sector' theory. Although it was a different data file each time, I figured it was unlikely that there were so many flaky sectors, none of which I could detect by other means. So, if it wasn't bad data on the disk, it had to be bad data in memory. I bought 2 new RAM sticks to replace the existing ones, and for more than a year now the problem has not recurred on that machine.
The good thing about it all was that I got plenty of chances to practise my state-file editing skills. I also tried recovering the errored blocks rather than deleting them and forcing the server to resend them. That worked fine too, but if you have 80 failed tasks it's rather tedious to go through them all and make the corrections to each one. I did it a couple of times 'just for practice' :-). It's much quicker to delete them and have the server resend them.
Cheers,
Gary.
I presume you are referring to these two lines, which don't actually say that a URL wasn't found. I interpret it to mean that the file_upload_handler (whose URL was given) reported back a 'not found' message for the file that it was told to upload - the file whose name was given in the first line.
If this wasn't the log entry you were referring to then please excuse this interruption.
In your most recent message, I would be fairly confident that the change of <state> from 4 to 5 and the addition of the extra two lines, particularly <ready_to_report/>, would have been needed to make it all succeed. Thanks very much for taking the trouble to diagnose and explain the problem so clearly.
Cheers,
Gary.