Hey Guys!
I've been happily crunching my way through data without an issue - up until 30 minutes ago. I dropped by to check how things were progressing, and an entire bank (20) of Binary Radio Pulsar Search v1.00 (BRP3cuda32) tasks had failed within the first 10 seconds of processing.
Being a newbie to Einstein@Home, I have to ask: is this a bad batch of files, or a processing issue? Upon seeing them I shut the laptop down for a few minutes and restarted it - everything seems to be happily crunching along again, so I'm completely baffled.
This is what came out of the first file:
Name: p2030.20090408.G66.71-01.98.S.b0s0g0.00000.dm_2544_0
Workunit: 102181445
Created: 5 Aug 2011 2:44:22 UTC
Sent: 5 Aug 2011 2:44:28 UTC
Received: 5 Aug 2011 3:25:36 UTC
Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: 1019 (0x3fb)
Computer ID: 4183865
Report deadline: 19 Aug 2011 2:44:28 UTC
Run time: 8.24
CPU time: 5.43
Validate state: Invalid
Claimed credit: 0.03
Granted credit: 0.00
Application version: Binary Radio Pulsar Search v1.00 (BRP3cuda32)
Stderr output
6.12.33
System could not allocate the required space in a registry log. (0x3fb) - exit code 1019 (0x3fb)
Activated exception handling...
[12:58:25][5168][INFO ] Starting data processing...
[12:58:25][5168][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 220 MB (1252 MB free / 1472 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[12:58:25][5168][INFO ] Using CUDA device #0 "GeForce GTX 460M" (192 CUDA cores / 437.76 GFLOPS)
[12:58:25][5168][INFO ] Version of installed CUDA driver: 4000
[12:58:25][5168][INFO ] Version of CUDA driver API used: 3020
[12:58:25][5168][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[12:58:25][5168][INFO ] Header contents:
------> Original WAPP file: /BOINC/projects/EinsteinAtHome/temp_working/BRP4/p2030.20090408.G66.71-01.98.S.b0s0g0.00000/p2030.20090408.G66.71-01.98.S.b0s0g0.00000_DM336.00
------> Sample time in microseconds: 65.4762
------> Observation time in seconds: 274.62705
------> Time stamp (MJD): 54929.485525165641
------> Number of samples/record: 0
------> Center freq in MHz: 1214.289551
------> Channel band in MHz: 0.33605957
------> Number of channels/record: 960
------> Nifs: 1
------> RA (J2000): 200655.683399
------> DEC (J2000): 283307.6752
------> Galactic l: 0
------> Galactic b: 0
------> Name: G66.71-01.98.S
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 4194304
------> Trial dispersion measure: 336 cm^-3 pc
------> Scale factor: 0.127527
[12:58:29][5168][INFO ] Seed for random number generator is -1019465212.
[12:58:31][5168][INFO ] Derived global search parameters:
------> f_A probability = 0.08
------> single bin prob(P_noise > P_thr) = 1.32531e-008
------> thr1 = 18.139
------> thr2 = 21.241
------> thr4 = 26.2686
------> thr8 = 34.6478
------> thr16 = 48.9581
[12:58:31][5168][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 423 MB (1049 MB free / 1472 MB total) -> Used by this application (assuming a single GPU task): 203 MB
[12:58:31][5168][ERROR] Error during CUDA TSR kernel parameter setup (error: 999)
[12:58:31][5168][ERROR] Demodulation failed (error: 1019)!
12:58:31 (5168): called boinc_finish
And the last one produced exactly the same output as the first - same exit status 1019 (0x3fb), same "Error during CUDA TSR kernel parameter setup (error: 999)" in the stderr.
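For what it's worth, the failure signature is easy to pick out of a stderr dump programmatically. Here's a small Python sketch - the `extract_error_codes` helper is my own invention, and the sample text is just a few lines copied from the log above:

```python
import re

def extract_error_codes(text: str) -> list[int]:
    """Pull the numeric codes off the bracketed [ERROR] lines."""
    return [int(m.group(1))
            for m in re.finditer(r"\[ERROR\].*?\(error:\s*(\d+)\)", text)]

# A few lines copied verbatim from the stderr dump above; in practice you
# would paste in (or read from a file) the whole thing.
stderr_text = """\
[12:58:31][5168][ERROR] Error during CUDA TSR kernel parameter setup (error: 999)
[12:58:31][5168][ERROR] Demodulation failed (error: 1019)!
12:58:31 (5168): called boinc_finish
"""

print(extract_error_codes(stderr_text))  # -> [999, 1019]
```

Seeing 999 (an unknown CUDA error) immediately followed by 1019 across a whole bank of tasks is the pattern described in this thread.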
Unfortunately a lot of this flew over my head... but I'm hoping there is someone out there who knows the cause. If it was just a bad batch, then no issue.
If I'm pushing the laptop too hard, though, I'll back off the pressure once I know that's what's causing it.
Domain name: Orion
Local Standard Time: UTC +10 hours
Name: Orion
Created: 3 Aug 2011 3:52:22 UTC
Total credit: 11,755
Average credit: 1,064.93
CPU type: GenuineIntel Intel(R) Core(TM) i7 CPU Q 840 @ 1.87GHz [Family 6 Model 30 Stepping 5]
Number of processors: 8
Coprocessors: NVIDIA GeForce GTX 460M (1471MB) driver: 27533
Operating System: Microsoft Windows 7 Home Premium x64 Edition, Service Pack 1, (06.01.7601.00)
BOINC client version: 6.12.33
Memory: 8180.5 MB
Cache: 256 KB
Swap space: 16359.2 MB
Total disk space: 581.48 GB
Free disk space: 528.45 GB
Measured floating point speed: 1814.62 million ops/sec
Measured integer speed: 8613.34 million ops/sec
Average upload rate: 8.28 KB/sec
Average download rate: 212.34 KB/sec
Average turnaround time: 0.48 days
Maximum daily WU quota per CPU: 12/day
Tasks: 119
Number of times client has contacted server: 143
Last time contacted server: 5 Aug 2011 3:25:36 UTC
% of time BOINC client is running: 99.3213 %
While BOINC running, % of time host has an Internet connection: 99.9919 %
While BOINC running, % of time work is allowed: 99.9951 %
Task duration correction factor: 1.353221
Error while Computing - Am I pushing the system too hard?
I think it's neither a bad batch of tasks nor (necessarily) an issue with your laptop.
Sometimes while processing a CUDA task, something gets stuck in the GPU's memory, which causes the task to error out. Unfortunately, that "bit" stays stuck and causes all subsequent CUDA tasks to error out too, until the computer is rebooted. There is no other way to free stuck GPU memory.
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke.)
Thank you for the quick reply Gundolf!
I might give the laptop a 30-minute break by shutting it down once a day to clear any memory issues - plus it'll give it a chance to cool off.
Either way, it'll be an interesting exercise to see whether the GPU issue is completely random, or whether it can be prevented by a daily shutdown.
I shall come back to the forum with my discoveries... hopefully it might help someone else out there with the same problem :)
Cheers
Jase
Well, the pre-emptive shutdown didn't work... less than 6 hours later the GPU memory locked up again (sadness). So I started doing a little more research into what can cause a GPU memory issue in the middle of a good BOINC session.
While there have been a few posts written on the subject, they seem to boil down to one of three things:
1) System is running hot enough to cook marshmallows
2) Driver instability issues
3) BOINC version issues with incompatible nVidia drivers
Hmmm...
It was easy to rule out issue 1... and the 6.12.33 (x64) version of BOINC has been relatively trouble-free, so I narrowed my search down to point 2.
Having bounced around Google, I found a number of posts regarding the 270-ish versions of nVidia drivers that are suspected of causing either slowdowns in clock speed or a sudden lock-up of GPU memory - more than capable of de-BOINCing a perfectly good work unit and setting off a string of Computation Errors until the machine is physically shut down and rebooted. As a result, a lot of users appear to have rolled back to 266.58 - the last major stability point.
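For anyone comparing their own host details: BOINC reports the driver as a five-digit number (the host above shows "driver: 27533", i.e. 275.33). A rough triage sketch in Python - the `classify_driver` helper and its version ranges are only my reading of the reports above, not anything official from nVidia:

```python
def classify_driver(boinc_version: int) -> str:
    """Rough triage of a BOINC-reported nVidia driver number (e.g. 27533
    means 275.33). The ranges below come only from the anecdotes in this
    thread - they are NOT an official compatibility list."""
    if boinc_version <= 26658:            # 266.58 and earlier: the "last major stability point"
        return "stable"
    if 27000 <= boinc_version <= 27533:   # the 270.xx-275.33 range users reported trouble with
        return "suspect"
    return "untested"                     # e.g. the 280.19 beta (28019)

print(classify_driver(27533))  # the driver this thread blames -> suspect
print(classify_driver(26658))  # the rollback target -> stable
```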
Being too stubborn to wind backwards, I went wandering through the nVidia website instead and found the 280.19 beta drivers, released only last week. Thinking "it couldn't be any worse...", I downloaded and installed them on the main machine to see whether they would provide a better outcome.
Aside from cutting WU times by approximately 6 minutes on a Binary Radio Pulsar Search v1.00 (BRP3cuda32), the system has been motoring along without a single 'glitch' since. The last official nVidia driver (275.33 WHQL) seems to have been the root of my problems. I'm hoping this helps other users out there who are experiencing similar issues and having their batches rendered useless by the GPU.
Cheers
Jase