I have a box (ID: 6396537) which has successfully crunched over 9 million credits' worth of Einstein GPU WUs in the past 7 months, but lately (at least the past month or so) MOST (~59%) of the units have been failing with "Computation error", and another 15% of the ones that seem to complete OK end up with "Validate error".
I can't think of any hardware changes or significant software changes (besides normal Ubuntu package updates) that would affect BOINC stability, but I can't figure out what's causing them all to fail either (no helpful error messages in the logs that I can see), so I thought I'd report it here.
It might just be that some bit of hardware on my end is getting old and worn out. (I noticed previously, when running 4-8 CPU WUs + 2 GPU WUs full time, that the CPU was reaching temperatures the sensors utility flags as too high, so I have had it limited to using only 1 CPU WU at a time for a long time. The GPU blows the CPU out of the water anyway, so it's not much of a loss.) GPU temps as reported by the nVidia software have always been between 69-75C while BOINC is running, which seems to be well within the normal acceptable range.
But while I was examining some of the failed WUs on the web site, I noticed that quite a few of them showed errors from other people's computers too, so maybe something changed in the software a month or two ago that's causing more to fail. (That might be normal though; I never really looked at result pages on the web before, when things were working fine.)
If nobody else is experiencing a similar problem then I guess I'll have to stop running BOINC and start replacing hardware, because it's not worth continuing with a 75% failure rate. But if someone else has experienced something similar that's software related, please let me know what you did to fix it!
System Specs:
OS: Ubuntu 12.10
Kernel: Linux 3.5.0-27-generic
CPU: Intel Core i7 CPU 920 @ 2.67GHz [Family 6 Model 26 Stepping 5] (8 processors)
GPU: [2] NVIDIA GeForce GTX 295 (895MB)
BOINC: Had been using the Ubuntu packaged version (7.0.27) from the official repo the whole time, but it hadn't been updated in a while, even though BOINC kept mentioning that a new version was available. When I started noticing all the failures lately, I thought maybe it was time to upgrade.
So after letting all WUs finish processing, yesterday I manually downloaded and installed the latest generic Linux version available (7.0.65) to a new location, and copied over the .xml files from the old install (to keep the projects I was attached to and the machine IDs & settings). I then did "Reset Project" on all of them, and then hit "Allow New Tasks" on ONLY Einstein (no CPU WUs of any kind being run).
The application files were downloaded automatically along with some WUs. Most of those WUs also failed.
Example of failed CUDA WU:
http://einsteinathome.org/task/396378942
All of the ones that failed seemed to end with the same vague and meaningless (to me!) log entries:
[stdoutdae.txt]
13-Aug-2013 14:02:27 [Einstein@Home] Computation for task PA0016_01221_336_0 finished
13-Aug-2013 14:02:27 [Einstein@Home] Output file PA0016_01221_336_0_0 for task PA0016_01221_336_0 absent
13-Aug-2013 14:02:27 [Einstein@Home] Output file PA0016_01221_336_0_1 for task PA0016_01221_336_0 absent
("Output file ___ for task ___ absent" seems to be generic error for "something went wrong")
[viewing Task output on website]
[13:58:37][3908][INFO ] Checkpoint committed!
[14:02:24][3908][ERROR] Couldn't bind CUDA HSP texture (error: 700)!
[14:02:24][3908][ERROR] Demodulation failed (error: 1007)!
14:02:24 (3908): called boinc_finish
The ONLY reference on the entire web that I can find to this error ("Couldn't bind CUDA HSP texture") is this page, which seems to be source code for an Einstein app, but reading it doesn't help me understand what the problem might be or how to fix it:
http://code.metager.de/source/xref/boinc/Einstein@Home/BinaryRadioPulsar/src/cuda/app/demod_binary_hs_cuda.cu
Example of a WU that seemed to complete OK but got a Validate error:
http://einsteinathome.org/task/395870305
Task 395870305
Name PA0015_00851_332_0
Workunit 172019365
Created 9 Aug 2013 18:12:50 UTC
Sent 9 Aug 2013 19:45:43 UTC
Received 10 Aug 2013 20:38:03 UTC
Server state Over
Outcome Validate error (8:00001000)
Client state Done
Exit status 0 (0x0)
Computer ID 6396537
Report deadline 23 Aug 2013 19:45:43 UTC
Run time 9,471.22
CPU time 2,813.40
Validate state Invalid
Claimed credit 30.18
Granted credit 0.00
application version Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270)
Stderr output
7.0.27
[14:00:05][9223][INFO ] Application startup - thank you for supporting Einstein@Home!
[14:00:05][9223][INFO ] Starting data processing...
[14:00:05][9223][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 56 MB (840 MB free / 896 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[14:00:05][9223][INFO ] Using CUDA device #0 "GeForce GTX 295" (240 CUDA cores / 933.12 GFLOPS)
[14:00:05][9223][INFO ] Version of installed CUDA driver: 5000
[14:00:05][9223][INFO ] Version of CUDA driver API used: 3020
[14:00:05][9223][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[14:00:05][9223][INFO ] Header contents:
------> Original WAPP file: ./PA0015_00851_DM1344.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 53346.653988212835
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 63034.6208
------> DEC (J2000): 81651.1969986
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4328496
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 1344 cm^-3 pc
------> Scale factor: 1.81818
[14:00:06][9223][INFO ] Seed for random number generator is 1081501286.
[14:00:06][9223][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-08
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[14:00:06][9223][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 176 MB (720 MB free / 896 MB total) -> Used by this application (assuming a single GPU task): 120 MB
[14:10:05][9223][INFO ] Checkpoint committed!
[14:20:06][9223][INFO ] Checkpoint committed!
[14:30:07][9223][INFO ] Checkpoint committed!
[14:40:08][9223][INFO ] Checkpoint committed!
[14:50:08][9223][INFO ] Checkpoint committed!
[15:00:09][9223][INFO ] Checkpoint committed!
[15:10:10][9223][INFO ] Checkpoint committed!
[15:19:00][9223][INFO ] Statistics: count dirty SumSpec pages 3005 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1100505
[15:19:00][9223][INFO ] Data processing finished successfully!
[15:19:00][9223][INFO ] Starting data processing...
[15:19:00][9223][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 56 MB (840 MB free / 896 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[15:19:00][9223][INFO ] Using CUDA device #0 "GeForce GTX 295" (240 CUDA cores / 933.12 GFLOPS)
[15:19:00][9223][INFO ] Version of installed CUDA driver: 5000
[15:19:00][9223][INFO ] Version of CUDA driver API used: 3020
[15:19:00][9223][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[15:19:00][9223][INFO ] Header contents:
------> Original WAPP file: ./PA0015_00851_DM1354.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 53346.653988003993
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 63034.6208
------> DEC (J2000): 81651.1969986
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4328496
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 1354 cm^-3 pc
------> Scale factor: 1.81818
[15:19:01][9223][INFO ] Seed for random number generator is 1066192077.
[15:19:01][9223][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-08
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[15:19:01][9223][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 176 MB (720 MB free / 896 MB total) -> Used by this application (assuming a single GPU task): 120 MB
[15:20:11][9223][INFO ] Checkpoint committed!
[15:30:11][9223][INFO ] Checkpoint committed!
[15:40:12][9223][INFO ] Checkpoint committed!
[15:50:13][9223][INFO ] Checkpoint committed!
[16:00:14][9223][INFO ] Checkpoint committed!
[16:10:15][9223][INFO ] Checkpoint committed!
[16:20:15][9223][INFO ] Checkpoint committed!
[16:30:16][9223][INFO ] Checkpoint committed!
[16:37:54][9223][INFO ] Statistics: count dirty SumSpec pages 2476 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1100505
[16:37:54][9223][INFO ] Data processing finished successfully!
16:37:54 (9223): called boinc_finish
Recent Massive Errors on GPU WUs on Ubuntu+NVIDIA
I had a quick look at some of the tasks with errors/invalids and compared them against the valids.
GPU0 seems to be generating most - maybe all - of the problems.
Whilst that points to a hardware issue, I find that on my two-GPU system #0 is always the one to error.
I can force the error by simply logging in and out, or hot-keying to a standard terminal (Ctrl-Alt-F1 etc.); eventually tasks error out.
HTH.
Hmmm, that is weird,
Hmmm, that is weird, particularly because it's a dual-GPU card, so they're both actually on the same video board...
So you're saying that switching virtual terminals or logging out of X (or even rebooting?) causes any WUs that are currently being processed to fail?
I know that I definitely haven't logged off or rebooted anywhere near enough times to cause that many errors, but perhaps the "auto-lock" of the screen due to inactivity has the same effect?
RE: Hmmm, that is weird,
GPU0 could still be faulty with GPU1 working. Perhaps you could change the monitor to use GPU1 or the on-board video; another option might be to stop X for a while. I'm not sure which nVidia drivers you are running.
Yes, I can force an error, but not exactly the same error.
See here http://einsteinathome.org/account/tasks&offset=0&show_names=1&state=5&appid=0
for a recent batch. The errors vary, as I recall; that was on nVidia drivers 310.32 and earlier. I haven't tried it on 325.15, which I'm now running.
I don't recall (or expect) a reboot causing errors, as tasks would normally stop cleanly, then restart from a checkpoint.
Maybe not; I suggest looking over the tasks in a bit of detail and noting the frequency and times that errors occur. TBH I did not find a single working GPU0 task after looking at 10 or so tasks.
RE: RE: Hmmm, that is
Using the ones from the "nvidia-current" package, which says version "304.88-0ubuntu0.1". I'll see if I can set BOINC to use only GPU1 for now.
The link you gave must only be visible to you while logged in, because it gave me an "access denied" error. But I think you were on the right track, and I looked deeper into the ones of mine that failed.
I downloaded the result pages of all the failed ones that are still available on the web, and all but 2 of them had used GPU0 at some point in their processing. The 2 that used GPU1 but didn't succeed both had a status of "Completed, validation inconclusive", and all the other computers that tried to crunch them had errors too, so hopefully they're just bad WUs.
All of the recent ones that succeeded used only GPU1, so I'll try to force BOINC to use only GPU1 and see how those units turn out. Thanks for your help!
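For anyone who wants to do the same tally, here is a rough sketch of the kind of check involved. It is illustrative only, not the exact thing I ran; the only assumptions are the public task URLs (as in the examples above) and the "Using CUDA device #N" line that appears in the stderr output shown earlier in this thread.

# Rough sketch: tally which CUDA device each task's stderr reports.
# Assumes task pages at http://einsteinathome.org/task/<id> include the
# stderr text, which contains a line like:
#   Using CUDA device #0 "GeForce GTX 295" ...
import re
import urllib.request

task_ids = [396378942, 395870305]  # paste your own task IDs here

for tid in task_ids:
    url = "http://einsteinathome.org/task/%d" % tid
    page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    gpus = sorted(set(re.findall(r"Using CUDA device #(\d+)", page)))
    print(tid, "->", "GPU " + ", ".join(gpus) if gpus else "no CUDA device line found")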
RE: Using the ones from
The Ubuntu repo versions are fairly old; I would suggest you upgrade.
http://einsteinathome.org/node/197123 has a few links about the latest drivers. You will need to do a little research on how to install them, and then on how to avoid Ubuntu upgrades overwriting them later.
ok try these!
http://einsteinathome.org/host/4918234/tasks&offset=0&show_names=1&state=5&appid=0
np
In case someone comes across
In case someone comes across this thread later with a similar problem and needs to disable a single specific GPU device on their system, these are the lines I added to the (original default) cc_config.xml file:
Add them within the tags but NOT within .
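As a minimal sketch of what such an exclusion can look like, using BOINC's documented <exclude_gpu> option inside <options> (the project URL and device number below are just my guesses at the values relevant to this thread, i.e. Einstein@Home and the suspect device 0):

<!-- Sketch of a cc_config.xml GPU exclusion using BOINC's <exclude_gpu> option.
     The URL and device number are assumptions based on this thread:
     device 0 is the suspect GPU and Einstein@Home is the GPU project. -->
<cc_config>
  <options>
    <exclude_gpu>
      <url>http://einstein.phys.uwm.edu/</url>
      <device_num>0</device_num>
    </exclude_gpu>
  </options>
</cc_config>

After editing the file, re-read the config (or restart the client) so the exclusion takes effect.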
Next week I'll try re-enabling GPU0 without X running for a while and see if those tasks are successful again. That would at least narrow it down to a software issue rather than a hardware one.