I have searched for an answer to this, but I have not been able to find one.
Suddenly and without apparent reason, most of the tasks I have received over the past week or so have not survived download. I am in danger of running out of work. As far as I can tell, the error message is the same in each case:
6.10.21
WU download error: couldn't get input files:
h1_0944.50_S5R4
-119
MD5 check failed
Only this machine is involved:
http://einsteinathome.org/host/2412094/tasks
Only S5GC1 tasks are affected.
Is it something I have done, is there something I can do?
Only he who gives up on himself is lost. - Hans-Ulrich Rudel
WU download error: couldn't get input files.
Bump. It's been over a fortnight since the original post, but there hasn't been a response.
It is a reasonable question; the error seems to be fairly common. Would anyone like to help? (And yes, after an interval the same computer is getting the same download error again).
Only he who gives up on himself is lost. - Hans-Ulrich Rudel
Hi, As only one of your
Hi,
As only one of your PCs is affected, and we haven't seen similar complaints from other users, the error must be on your end, I'm afraid.
It's hard to do a remote diagnosis of your PC, but the obvious components to check are the network adapter, the disk, and the memory. There are several freeware tools for checking and stress-testing those components.
Do your other computers share the same internet access as the PC that has problems? That would at least rule out some network-related causes.
CU
HB
I am currently working on (4)
I am currently working on (4) tasks and when I try to download additional work I receive the message that I have reached my daily quota of 128 tasks. Where are the additional 124???
It might help if you click
It might help if you click the "Show all tasks" button in the Tasks tab. ;-)
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
This WU
This WU http://einsteinathome.org/task/204063432 is one of 29 WUs that errored out. Going back through the messages, I found that the missing input file had been erased earlier by the server -
04/11/2010 12:26:21 Einstein@Home Got server request to delete file h1_1148.00_S5R4
04/11/2010 12:26:21 Einstein@Home Got server request to delete file l1_1148.00_S5R4
04/11/2010 12:26:21 Einstein@Home Got server request to delete file l1_1148.00_S5R7
04/11/2010 12:26:21 Einstein@Home Got server request to delete file h1_1148.00_S5R7
RE: This WU
I made an active link to the WU.
That task is named h1_1147.90_S5R4__127_S5GC1a_0 though and so shouldn't need the 1148.00 files anyway.
And the request to delete is usually only executed when all tasks needing the specified files are finished.
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
All 29 WUs were h1_1147 type
All 29 WUs were of the h1_1147 type, but they errored out because h1_1148.00_S5R4 had been deleted earlier by the server.
RE: All 29 WUs were h1_1147
Hi Geoff,
Welcome to the Einstein project!
Gundolf was incorrect when he suggested that a task labelled '...1147.90...' wouldn't need the '...1148.00...' data files. In fact, a 1147.90 task needs ALL the data files (four for each frequency bin) from 1147.90 up to about 1148.30, 1148.35, or perhaps even 1148.40. That is quite a large number of data files, and if just one of them is missing or corrupt, the E@H science app will be unable to proceed.
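As a rough back-of-the-envelope illustration only (the 0.05 Hz file spacing and the 1148.40 upper edge below are assumptions inferred from the file names quoted in this thread, not official project numbers), the count works out something like this:

    # Rough illustration only: the spacing and the upper edge of the band are
    # assumptions inferred from names like h1_1147.90_S5R4 and h1_1148.00_S5R4
    # seen in this thread, not official project parameters.
    spacing = 0.05                 # assumed gap between data files (Hz)
    lo, hi = 1147.90, 1148.40      # assumed range a 1147.90 task must cover
    files_per_bin = 4              # h1/l1 for S5R4 plus h1/l1 for S5R7, as in the log above

    bins = int(round((hi - lo) / spacing)) + 1
    print(bins, "frequency bins,", bins * files_per_bin, "data files in total")
    # -> 11 frequency bins, 44 data files in total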
However, you are not correct to blame the server request to delete the 1148.00 files as the cause of the 29 tasks erroring out. The system is a bit smarter than that :-). As Gundolf also suggested, a server request for deletion is never actioned by the client until ALL tasks depending on that data file have actually been completed, uploaded, reported, and the report acknowledged by the server. So there is no way that the client will allow the deletion to occur until it is satisfied that the files are truly no longer required.
When a delete request is received from the server, the only thing that happens immediately is that a flag is inserted in the state file (client_state.xml) against that particular data file. The BOINC client running on your host is continually aware of this flag and will only action it when the above-mentioned conditions are satisfied. You can actually override the deletion completely if you stop BOINC, remove the flag from your state file, and then restart BOINC. I do this all the time. I hasten to add that you should NOT do this unless you have very good reasons and you fully understand what you are doing. I have specific (and rather complicated) needs for my fleet of crunching hosts, so I need to keep my data files for rather longer than the server allows. It can be done quite quickly manually, or extremely quickly with a script.
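For anyone who wants the script route, a minimal sketch in Python is below. The FLAG_TAG value is only a placeholder - inspect your own client_state.xml to see what the delete flag actually looks like - and it must only be run while BOINC is stopped, with a backup kept:

    # Sketch only: remove a per-file delete-request flag from client_state.xml.
    # FLAG_TAG is a placeholder - look inside your own state file for the real
    # tag name. Run this only while BOINC is stopped, and keep the backup.
    import shutil

    STATE_FILE = "client_state.xml"            # in your BOINC data directory
    FLAG_TAG = "<delete_request_flag/>"        # placeholder, NOT the real tag name

    shutil.copy(STATE_FILE, STATE_FILE + ".bak")   # safety copy first

    with open(STATE_FILE) as f:
        lines = f.readlines()

    with open(STATE_FILE, "w") as f:
        for line in lines:
            if FLAG_TAG not in line:           # drop only the flagged lines
                f.write(line)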
So we have to look for other reasons why the file in question might have been missing or corrupt at the time the science app needed to use it. I have personally run into quite a few of these hiccups where a whole raft of tasks have suddenly been invalidated because of (supposedly) missing or corrupt data. I'll give three specific examples that I've been able to diagnose. There are probably more.
Disk Bad Sector
Data files can hang around for quite a while and be reused for quite a large number of different tasks. That is the whole point of Locality scheduling that E@H uses. If a bad sector happens to develop in an area occupied by a file, then .... you can guess what will happen. I've had this happen a couple of times over the years and I've confirmed the diagnosis by renaming the file to xxxx_BAD to effectively block that area of the disk to further use. Installing a fresh, known good copy of the file has then resolved the issue. Many people would immediately replace the disk but all my crunching machines are running on quite old, mainly 20GB disks, some of which are surviving quite well with some bad sectors. I've been quite surprised at how long they keep going. Many of these disks are more than 8 years old and in continuous 24/7 use.
MD5 Checksum Failure
I see quite a few of these. When you are sent a file, you also receive an MD5 checksum for that file which is also stored in your state file. Each time a file is going to be used for a new task, its checksum is checked. For reasons I don't understand, a perfectly good file can fail this checksum test. I know the file is good because I run a separate standalone program to compute the checksum and it indeed does agree with the checksum stored in the state file. When this problem bites, a large portion of my cache of work gets trashed. If the trashed work hasn't been reported to the server, it can always be fully recovered. In my case this is very likely because for a large part of each day, I run with NNT (no new tasks) set. So, completed tasks (both good and trashed) will be sitting there until NNT is rescinded. It's quite easy for me to notice and recover trashed tasks - but it needs quite a bit of patience and some state file editing skills so is not really for the faint hearted :-).
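If you want to do that standalone check yourself, a minimal sketch is below; the file name and the expected value are examples only, and the real reference checksum is the one recorded against that file in your client_state.xml:

    # Minimal sketch: recompute a data file's MD5 and compare it with the value
    # recorded for that file in client_state.xml. The file name and expected
    # checksum below are examples only.
    import hashlib

    DATA_FILE = "h1_1148.00_S5R4"                    # example name from this thread
    EXPECTED = "0123456789abcdef0123456789abcdef"    # use the value from your state file

    md5 = hashlib.md5()
    with open(DATA_FILE, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # read in 1 MiB chunks
            md5.update(chunk)

    print("computed:", md5.hexdigest())
    print("match" if md5.hexdigest() == EXPECTED else "MISMATCH - file may be corrupt")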
The key is to understand the 'resend lost tasks' BOINC feature that E@H kindly uses. In a nutshell, and with the proviso that the trashed tasks haven't been reported, all you need to do is stop BOINC and then remove the .... blocks for all trashed tasks from your state file. You also need to remove all result templates (.....) for those same trashed tasks. Do this patiently and with care so that you don't completely trash your state file. When you restart your BOINC client, it will still see any good tasks that were in your cache, but all signs of the trashed tasks that you have edited out will be gone from your BOINC Manager tasks list. If you then allow your client to contact the server, your good tasks will be reported, the server will notice the missing tasks you are supposed to have, and the 'resend lost tasks' feature will do the rest. If you have confirmed to your own satisfaction that the suspect data file is not really corrupt after all (or if you have replaced it if it was), the resent tasks will crunch without further issue. I have actually done this many times (probably more than 20) in the last year or two, and I have yet to find a trashed cache that couldn't be recovered - provided, of course, that it hasn't been reported.
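To give a flavour of the state-file surgery, here is a minimal sketch that strips the result blocks for a list of trashed task names (the task name shown is just an example from this thread). It does not handle the matching result templates, so treat it purely as a starting point, run it only while BOINC is stopped, and keep the backup it makes:

    # Sketch only: drop the <result> ... </result> blocks for a list of trashed
    # task names so that the 'resend lost tasks' feature can send them again.
    # Run this only while BOINC is stopped, and keep the backup it makes.
    import shutil

    STATE_FILE = "client_state.xml"                  # in your BOINC data directory
    TRASHED = ["h1_1147.90_S5R4__127_S5GC1a_0"]      # names of the trashed tasks

    shutil.copy(STATE_FILE, STATE_FILE + ".bak")     # safety copy first

    kept, block, in_result = [], [], False
    with open(STATE_FILE) as f:
        for line in f:
            if "<result>" in line:
                in_result, block = True, [line]
            elif in_result:
                block.append(line)
                if "</result>" in line:
                    # keep the block unless it belongs to one of the trashed tasks
                    if not any(name in "".join(block) for name in TRASHED):
                        kept.extend(block)
                    in_result = False
            else:
                kept.append(line)

    with open(STATE_FILE, "w") as f:
        f.writelines(kept)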
Heat/Overclocking Related Problems
I run virtually all my crunching hosts overclocked. I try not to push too hard :-). I do take quite a bit of care to test for stability and don't usually have too many problems. A lot of my hosts don't run in air-conditioned rooms so they are subject to problems on a particularly hot summer's day. Sometimes hosts will just lock up or crash and then continue on without issue when rebooted. Occasionally, a task or two will error out as a result. Sometimes, but quite rarely, the whole cache will get trashed. This latter behaviour seems to be particularly heat related and the OS still seems to be running OK after all the cache is trashed. On some occasions when this happens, the error messages seem to suggest the point of failure is something to do with a zipping/unzipping routine that fails. I'm not sure of the precise details but heat really does seem to be the culprit.
Once again, these trashed tasks are fully recoverable (provided they haven't been reported) and the host is able to resume normal crunching if I take steps to lower the room temperature a bit :-).
So, it's not clear why your cache got trashed, but it's certainly not because the server issued a delete request. It's very difficult for anyone other than you to pin down the cause. It could happen again, and is likely to if you are overclocking a bit too aggressively - particularly if you are using an inadequate cooling solution. I had a quick look at one of your failed tasks and sure enough - the infamous MD5 checksum error, as described above. Good luck with solving the problem.
Cheers,
Gary.
Hi Gary, Many thanks for
Hi Gary,
Many thanks for your long explanation - I am learning more about how Einstein crunches. I have looked at the BOINC slots folders and can see that many files are needed to crunch a GC WU.
According to S.M.A.R.T., my HD has no problems. This PC is overclocked, but I have additional cooling and it has been crunching SETI for the last 4 years with very few errors.
I can only assume that the file in question was corrupted somehow.
Geoff
RE: ...this PC is
Einstein tasks tend to produce more heat than SETI tasks (from my experience).
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)