large number of "Error while Computing"

Anonymous
Topic 197377

I noticed this morning that a Win 7 box of mine had returned a large number of "Error while computing" failures. Nothing on this box has changed: same GPU, same drivers, etc.

When looking at the "Stderr output" for these jobs I see the following:

7.2.33

The system cannot find the file specified.
(0x2) - exit code 2 (0x2)

Activated exception handling...
[08:38:37][4936][INFO ] Starting data processing...
[08:38:37][4936][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 267 MB (1782 MB free / 2049 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[08:38:37][4936][INFO ] Using CUDA device #0 "GeForce GTX 650 Ti" (0 CUDA cores / 0.00 GFLOPS)
[08:38:37][4936][INFO ] Version of installed CUDA driver: 6000
[08:38:37][4936][INFO ] Version of CUDA driver API used: 3020
[08:38:38][4936][INFO ] Thank you but this work unit has already been processed completely...
[08:38:38][4936][ERROR] Input file on command line ../../projects/einstein.phys.uwm.edu/PA0064_03671_142.bin4 doesn't agree with input file ../../projects/einstein.phys.uwm.edu/PA0063_018A1_269.bin4 from checkpoint header.
[08:38:38][4936][ERROR] Demodulation failed (error: 2)!
08:38:38 (4936): called boinc_finish


Boinc Manager now states that Communication is deferred for 10+ hours.

[EDIT]
I noticed a "daily quota exceeded" message on the site for this machine, and also that no work is available for the two types of jobs I process on it. Does that actually mean no WUs are available, or is it a consequence of the quota being exceeded?

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 580491504
RAC: 131010

large number of "Error while Computing"

"0 CUDA cores / 0.00 GFLOPS" is surely wrong, but I don't know if this causes any problems.

"Thank you but this work unit has already been processed completely..." is apparently very wrong, since it happens with all your recent WUs.

"Input file on command line ../../projects/einstein.phys.uwm.edu/PA0064_03671_142.bin4 doesn't agree with input file ../../projects/einstein.phys.uwm.edu/PA0063_018A1_269.bin4 from checkpoint header."

The files have different names, so I would not expect them to be similar.
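To make the mismatch concrete, here is a minimal Python sketch that pulls both file names out of the ERROR line quoted above and compares them. The regular expression and the function name `checkpoint_mismatch` are my own assumptions, inferred from this one log line; they are not taken from the Einstein@Home application.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the single ERROR line in this thread (an assumption,
# not a documented log format).
MISMATCH_RE = re.compile(
    r"Input file on command line (?P<cmdline>\S+) "
    r"doesn't agree with input file (?P<checkpoint>\S+) from checkpoint header"
)


def checkpoint_mismatch(line):
    """Return (command-line file name, checkpoint file name), or None if no match."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    return (PurePosixPath(m["cmdline"]).name, PurePosixPath(m["checkpoint"]).name)


line = (
    "[08:38:38][4936][ERROR] Input file on command line "
    "../../projects/einstein.phys.uwm.edu/PA0064_03671_142.bin4 "
    "doesn't agree with input file "
    "../../projects/einstein.phys.uwm.edu/PA0063_018A1_269.bin4 "
    "from checkpoint header."
)
names = checkpoint_mismatch(line)
print(names, "-> same file" if names[0] == names[1] else "-> different files")
```

If the two names differ, the checkpoint sitting in the slot most likely belongs to an older task, which would fit the "already been processed completely" message.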

Did you remove Einstein data files manually? If so you might have deleted a bit too much. There was some recent discussion about this, but I forgot who it was.

Otherwise: power down the machine, take the power cord off for 10+ minutes and reboot. If it still happens reset the project.

And it's probably normal that you can't get more work for now, since you got many errors in a row.

MrS

Scanning for our furry friends since Jan 2002

Anonymous

RE: "0 CUDA cores / 0.00

Quote:

"0 CUDA cores / 0.00 GFLOPS" is surely wrong, but I don't know if this causes any problems.

"Thank you but this work unit has already been processed completely..." is apparently very wrong, since it happens with all your recent WUs.

"Input file on command line ../../projects/einstein.phys.uwm.edu/PA0064_03671_142.bin4 doesn't agree with input file ../../projects/einstein.phys.uwm.edu/PA0063_018A1_269.bin4 from checkpoint header."

The files have different names, so I would not expect them to be similar.

Did you remove Einstein data files manually? If so you might have deleted a bit too much. There was some recent discussion about this, but I forgot who it was.

I did not remove any files. But I did notice that the profile this machine uses was set to use both NVIDIA and ATI GPUs. This machine only has an NVIDIA GTX 650 Ti, so I unchecked the ATI GPU box and did a project update through BOINC Manager. I do not recall seeing anything unusual after the update, so I don't believe that caused this issue. Besides, some of the jobs errored before I changed the profile.

Quote:


Otherwise: power down the machine, take the power cord off for 10+ minutes and reboot. If it still happens reset the project.

And it's probably normal that you can't get more work for now, since you got many errors in a row.

MrS

At present I am waiting for the "quota timeout" to expire. Quite strange.

Anonymous

I am believing that E@H is

I am believing that E@H is having a problem. Here is why. After the communication-deferred time elapsed I received some more BRP5 jobs. These also errored out. I then removed the E@H project from this computer. I watched the entry for the Einstein project under ../BOINC/projects being removed. However, slot 0 and slot 1 still contained references to Einstein jobs; these directories were not cleaned. I then added the project back. Downloads followed and the event log reported the following:

2/9/2014 11:20:56 PM | | Host location: none
This is not true. This PC is assigned to the "work" location/profile

and these entries:

2/9/2014 11:25:03 PM | Einstein@Home | Started download of einstein_icon.png
2/9/2014 11:25:03 PM | Einstein@Home | Starting task PA0065_00771_228_1 using einsteinbinary_BRP5 version 139 (BRP5-cuda32-nv301) in slot 0
2/9/2014 11:25:08 PM | Einstein@Home | Finished download of einstein_icon.png
2/9/2014 11:25:08 PM | Einstein@Home | Started download of eah_slide_11.png
2/9/2014 11:25:09 PM | Einstein@Home | Computation for task PA0065_00771_228_1 finished
2/9/2014 11:25:09 PM | Einstein@Home | Output file PA0065_00771_228_1_0 for task PA0065_00771_228_1 absent
2/9/2014 11:25:09 PM | Einstein@Home | Output file PA0065_00771_228_1_1 for task PA0065_00771_228_1 absent

2/9/2014 11:25:12 PM | Einstein@Home | Finished download of eah_slide_11.png

If one removes a project and reinstalls it how can there be "absent" files?

The project went immediately into "communication deferred" for 20+ hours.

Should I remove the project, uninstall BOINC, and manually remove the slot directories to ensure a clean slate before reinstalling BOINC and then the project?

I really don't see that I have done anything to cause this condition. For now I have suspended E@H on this PC. Looking for suggestions.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

RE: I am believing that E@H

Quote:
I am believing that E@H is having a problem. Here is why.
...
If one removes a project and reinstalls it how can there be "absent" files?


It's not Einstein having the problem, it's Boinc.
The message about absent files is because the science-app crashes and cannot produce a result file that Boinc expects, so Boinc complains about the result file/output file missing.
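As a rough illustration, the "absent" lines can be collected per task with a few lines of Python. This is only a sketch: the pattern is guessed from the event log lines quoted in this thread, and `absent_outputs` is a made-up helper name, not part of Boinc.

```python
import re
from collections import defaultdict

# Pattern guessed from the event log lines quoted in this thread (assumption).
ABSENT_RE = re.compile(r"Output file (?P<file>\S+) for task (?P<task>\S+) absent")


def absent_outputs(log_lines):
    """Group the output files reported as absent by the task they belong to."""
    missing = defaultdict(list)
    for line in log_lines:
        m = ABSENT_RE.search(line)
        if m:
            missing[m["task"]].append(m["file"])
    return dict(missing)


log = [
    "2/9/2014 11:25:09 PM | Einstein@Home | Computation for task PA0065_00771_228_1 finished",
    "2/9/2014 11:25:09 PM | Einstein@Home | Output file PA0065_00771_228_1_0 for task PA0065_00771_228_1 absent",
    "2/9/2014 11:25:09 PM | Einstein@Home | Output file PA0065_00771_228_1_1 for task PA0065_00771_228_1 absent",
]
for task, files in absent_outputs(log).items():
    print(task, "is missing", len(files), "output file(s):", ", ".join(files))
```

A task that "finished" within seconds of starting and is missing every one of its output files is the typical signature of the app crashing right at startup.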

Quote:
The project went immediately into "communication deferred" for 20+ hours.


Boinc defers communication with the project when a task errors out, don't ask me why but that's how it works... The timer can easily be reset by doing a project update.

Quote:

Should I remove the project, uninstall BOINC, and manually remove the slot directories to ensure a clean slate before reinstalling BOINC and then the project?

I really don't see that I have done anything to cause this condition. For now I have suspended E@H on this PC. Looking for suggestions.


If it were my computer I'd remove Einstein again and, if there are no other projects attached, uninstall Boinc and manually delete the data directory. Then reinstall. Something must have happened to prevent Boinc from cleaning out the slots after a task either completed or failed. The files remaining in the slot directory then interfere with new tasks, and since the new tasks don't reference the offending files, Boinc can't handle the cleanup as it should and you're stuck with tasks crashing. A clean install of Boinc should fix that.
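Before wiping anything, it may help to see what is actually left behind. Below is a minimal, read-only Python sketch that lists the files still sitting in each slot directory. It assumes the slot directories live in a `slots` subfolder of the data directory; the data directory path is something you have to supply yourself, and `leftover_slot_files` is a made-up helper name. The demo at the end builds a throwaway directory tree so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path


def leftover_slot_files(data_dir):
    """Map each non-empty slot directory name to the file names still in it.

    Read-only: nothing is deleted. Assumes <data_dir>/slots/<n>/ as layout.
    """
    slots = Path(data_dir) / "slots"
    leftovers = {}
    if not slots.is_dir():
        return leftovers
    for slot in sorted(slots.iterdir()):
        if slot.is_dir():
            files = sorted(p.name for p in slot.iterdir())
            if files:
                leftovers[slot.name] = files
    return leftovers


# Demo on a temporary tree; the file name below is purely illustrative.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "slots" / "0").mkdir(parents=True)
    (Path(tmp) / "slots" / "1").mkdir(parents=True)
    (Path(tmp) / "slots" / "0" / "stale_checkpoint.cpt").write_text("old state")
    demo_result = leftover_slot_files(tmp)
    print(demo_result)
```

With no tasks running and the client stopped, every slot should come back empty; anything still listed is a candidate for the kind of stale file described above.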

Anonymous

RE: RE: I am believing

Quote:
Quote:
I am believing that E@H is having a problem. Here is why.
...
If one removes a project and reinstalls it how can there be "absent" files?

It's not Einstein having the problem, it's Boinc.
The message about absent files is because the science-app crashes and cannot produce a result file that Boinc expects, so Boinc complains about the result file/output file missing.

Quote:
The project went immediately into "communication deferred" for 20+ hours.

Boinc defers communication with the project when a task errors out, don't ask me why but that's how it works... The timer can easily be reset by doing a project update.

Quote:

Should I remove the project, uninstall BOINC, and manually remove the slot directories to ensure a clean slate before reinstalling BOINC and then the project?

I really don't see that I have done anything to cause this condition. For now I have suspended E@H on this PC. Looking for suggestions.


If it were my computer I'd remove Einstein again and, if there are no other projects attached, uninstall Boinc and manually delete the data directory. Then reinstall. Something must have happened to prevent Boinc from cleaning out the slots after a task either completed or failed. The files remaining in the slot directory then interfere with new tasks, and since the new tasks don't reference the offending files, Boinc can't handle the cleanup as it should and you're stuck with tasks crashing. A clean install of Boinc should fix that.


Sounds like a plan. Will remove E@H and BOINC and clean up the BOINC data directory. Reboot. Then reinstall BOINC and E@H.

Anonymous

RE: If it were my

Quote:
If it were my computer I'd remove Einstein again and, if there are no other projects attached, uninstall Boinc and manually delete the data directory. Then reinstall. Something must have happened to prevent Boinc from cleaning out the slots after a task either completed or failed. The files remaining in the slot directory then interfere with new tasks, and since the new tasks don't reference the offending files, Boinc can't handle the cleanup as it should and you're stuck with tasks crashing. A clean install of Boinc should fix that.


Quote:

Sounds like a plan. Will remove E@H and BOINC and clean up the BOINC data directory. Reboot. Then reinstall BOINC and E@H.

Like they say in the "World of Windows": TAH TAAH

Removing E@H project and BOINC with a manual cleanup of the data directory seems to have resolved the problem. I would certainly like to understand what put this node into a tail spin.

Thanks all.

EDIT: how/when do all of the jobs in the "error" queue get cleaned up? i.e., how are the queues managed?

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

RE: Like they say in the

Quote:

Like they say in the "World of Windows": TAH TAAH

Removing E@H project and BOINC with a manual cleanup of the data directory seems to have resolved the problem. I would certainly like to understand what put this node into a tail spin.

Thanks all.

EDIT: how/when do all of the jobs in the "error" queue get cleaned up? i.e., how are the queues managed?

Good to hear that it worked. As to why, I don't know, but I suspect that something interfered with the cleanup of the slot directory and then off it went...

The tasks in the error queue will be removed when at least 2 other hosts have completed them and validated, or the maximum number of errors is reached. Then, after all outstanding tasks have either reported in or reached their deadline, a project-selectable amount of time is added to let them linger in the database before they are removed. If I remember correctly that time used to be 7 days here, but Bernd adjusted it to something shorter a while back to reduce the size of the database. After checking my results I'd guess it's set to 5 days now, but don't quote me on that. =)
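Put as a small Python sketch, the timing rule as I understand it looks like this. It is only my reading of the description above, not the project's actual database purge code: `purge_time` is a made-up name, and the 5-day default is just my guess from above.

```python
from datetime import datetime, timedelta


def purge_time(task_end_times, linger_days=5):
    """Estimate when the tasks of one work unit disappear from the database.

    task_end_times: for every task of the work unit, the moment it either
    reported in or hit its deadline. linger_days is the project-selectable
    delay (5 is a guess, not a confirmed setting).
    """
    return max(task_end_times) + timedelta(days=linger_days)


# Hypothetical work unit: one task errored early, two others reported later.
ends = [
    datetime(2014, 2, 9, 8, 38),
    datetime(2014, 2, 11, 14, 0),
    datetime(2014, 2, 12, 9, 30),
]
print(purge_time(ends))
```

So an errored task can stay visible for quite a while: it has to wait for the slowest other task of its work unit first, and only then does the linger period start.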

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 580491504
RAC: 131010

Good to hear it works

Good to hear it works now!

Some possible reasons: there could have been a temporary hardware failure (CPU, memory, disk) that corrupted a write operation in a slot, leaving it in a weird state. CPUs are hardened against high-energy cosmic particle impacts, memory is not. And HDDs have a specified unrecoverable bit error rate (desktop class: less than 1 error per 10^14 bits read, enterprise: less than 1 per 10^15). So it's kind of normal that every now and then something strange happens to computer systems.
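To put the disk numbers in perspective, here is a back-of-the-envelope Python sketch. It assumes the commonly quoted worst-case specs of one unrecoverable error per 10^14 bits read (desktop) and per 10^15 (enterprise), and treats them as a simple average rate, which real drives do not strictly follow. The function name is made up for illustration.

```python
def expected_bit_errors(bytes_read, bits_per_error):
    """Expected number of unrecoverable bit errors when reading bytes_read bytes,
    assuming one error per bits_per_error bits on average (simplified model)."""
    return bytes_read * 8 / bits_per_error


ONE_TB = 10**12  # bytes, decimal terabyte

desktop = expected_bit_errors(ONE_TB, 10**14)
enterprise = expected_bit_errors(ONE_TB, 10**15)
print(f"desktop:    {desktop:.3f} expected errors per TB read")
print(f"enterprise: {enterprise:.3f} expected errors per TB read")
```

Under these assumptions that is on the order of one bad bit per roughly a dozen terabytes read on a desktop drive: rare, but not never.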

But by far the most errors are caused by software, of course :D

MrS

Scanning for our furry friends since Jan 2002
