Error while Computing - Am I pushing the system too hard?

Darkstone751
Joined: 3 Aug 11
Posts: 3
Credit: 1058377
RAC: 0
Topic 195894

Hey Guys!

I've been happily crunching my way through data without an issue - up until 30 minutes ago. I dropped by to check how things were progressing, and an entire bank (20) of Binary Radio Pulsar Search v1.00 (BRP3cuda32) tasks had failed within the first 10 seconds of processing.

As a newbie to Einstein@Home, I'm wondering: is this a bad batch of files, or a processing issue? When I saw them, I shut down the laptop for a few minutes and restarted - everything seems to be happily crunching along again, so I'm completely baffled.

This is what came out of the 1st file:

Name: p2030.20090408.G66.71-01.98.S.b0s0g0.00000.dm_2544_0
Workunit: 102181445
Created: 5 Aug 2011 2:44:22 UTC
Sent: 5 Aug 2011 2:44:28 UTC
Received: 5 Aug 2011 3:25:36 UTC
Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: 1019 (0x3fb)
Computer ID: 4183865
Report deadline: 19 Aug 2011 2:44:28 UTC
Run time: 8.24
CPU time: 5.43
Validate state: Invalid
Claimed credit: 0.03
Granted credit: 0.00
Application version: Binary Radio Pulsar Search v1.00 (BRP3cuda32)


Stderr output
6.12.33

System could not allocate the required space in a registry log. (0x3fb) - exit code 1019 (0x3fb)

Activated exception handling...
[12:58:25][5168][INFO ] Starting data processing...
[12:58:25][5168][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 220 MB (1252 MB free / 1472 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[12:58:25][5168][INFO ] Using CUDA device #0 "GeForce GTX 460M" (192 CUDA cores / 437.76 GFLOPS)
[12:58:25][5168][INFO ] Version of installed CUDA driver: 4000
[12:58:25][5168][INFO ] Version of CUDA driver API used: 3020
[12:58:25][5168][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[12:58:25][5168][INFO ] Header contents:
------> Original WAPP file: /BOINC/projects/EinsteinAtHome/temp_working/BRP4/p2030.20090408.G66.71-01.98.S.b0s0g0.00000/p2030.20090408.G66.71-01.98.S.b0s0g0.00000_DM336.00
------> Sample time in microseconds: 65.4762
------> Observation time in seconds: 274.62705
------> Time stamp (MJD): 54929.485525165641
------> Number of samples/record: 0
------> Center freq in MHz: 1214.289551
------> Channel band in MHz: 0.33605957
------> Number of channels/record: 960
------> Nifs: 1
------> RA (J2000): 200655.683399
------> DEC (J2000): 283307.6752
------> Galactic l: 0
------> Galactic b: 0
------> Name: G66.71-01.98.S
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 4194304
------> Trial dispersion measure: 336 cm^-3 pc
------> Scale factor: 0.127527
[12:58:29][5168][INFO ] Seed for random number generator is -1019465212.
[12:58:31][5168][INFO ] Derived global search parameters:
------> f_A probability = 0.08
------> single bin prob(P_noise > P_thr) = 1.32531e-008
------> thr1 = 18.139
------> thr2 = 21.241
------> thr4 = 26.2686
------> thr8 = 34.6478
------> thr16 = 48.9581
[12:58:31][5168][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 423 MB (1049 MB free / 1472 MB total) -> Used by this application (assuming a single GPU task): 203 MB
[12:58:31][5168][ERROR] Error during CUDA TSR kernel parameter setup (error: 999)
[12:58:31][5168][ERROR] Demodulation failed (error: 1019)!
12:58:31 (5168): called boinc_finish


And the last one showed the same details and Stderr output as the first.

Unfortunately a lot of this flew over my head... but I'm hoping there is someone out there who knows the cause. If it was just a bad batch, then no issue.

If I'm pushing the laptop too hard though, I'll back off the pressure once I know that's what's causing it.
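For anyone else who finds the Stderr output hard to read, here is a minimal Python sketch that pulls out just the ERROR lines and the GPU memory snapshots. The function name `summarize_stderr` and the regular expressions are my own assumptions based on the line layout pasted above, not part of BOINC or the Einstein@Home application.

```python
import re

# Assumed layout of the timestamped lines: [HH:MM:SS][pid][LEVEL] message
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")
# Assumed layout of the memory status continuation lines.
MEM_RE = re.compile(r"Used in total: (\d+) MB \((\d+) MB free / (\d+) MB total\)")


def summarize_stderr(text):
    """Return (errors, memory_snapshots) found in a stderr dump."""
    errors = []
    memory = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group(3) == "ERROR":
            # Keep the timestamp and the message of each ERROR line.
            errors.append((m.group(1), m.group(4)))
        mem = MEM_RE.search(line)
        if mem:
            used, free, total = (int(g) for g in mem.groups())
            memory.append({"used": used, "free": free, "total": total})
    return errors, memory


# Lines copied from the Stderr output of the failed task.
sample = """\
[12:58:25][5168][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 220 MB (1252 MB free / 1472 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[12:58:31][5168][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 423 MB (1049 MB free / 1472 MB total) -> Used by this application (assuming a single GPU task): 203 MB
[12:58:31][5168][ERROR] Error during CUDA TSR kernel parameter setup (error: 999)
[12:58:31][5168][ERROR] Demodulation failed (error: 1019)!
"""

errors, memory = summarize_stderr(sample)
for stamp, message in errors:
    print(stamp, message)
print("MB allocated during setup:", memory[-1]["used"] - memory[0]["used"])
print("MB still free at failure:", memory[-1]["free"])
```

On this sample the two snapshots differ by 203 MB, matching the application's own figure, and 1049 MB was still free when the kernel setup failed. That suggests, though it does not prove, that the card had not simply run out of memory.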

Domain name: Orion
Local Standard Time: UTC +10 hours
Name: Orion
Created: 3 Aug 2011 3:52:22 UTC
Total credit: 11,755
Average credit: 1,064.93
CPU type: GenuineIntel Intel(R) Core(TM) i7 CPU Q 840 @ 1.87GHz [Family 6 Model 30 Stepping 5]
Number of processors: 8
Coprocessors: NVIDIA GeForce GTX 460M (1471MB) driver: 27533
Operating System: Microsoft Windows 7 Home Premium x64 Edition, Service Pack 1, (06.01.7601.00)
BOINC client version: 6.12.33
Memory: 8180.5 MB
Cache: 256 KB
Swap space: 16359.2 MB
Total disk space: 581.48 GB
Free Disk Space: 528.45 GB
Measured floating point speed: 1814.62 million ops/sec
Measured integer speed: 8613.34 million ops/sec
Average upload rate: 8.28 KB/sec
Average download rate: 212.34 KB/sec
Average turnaround time: 0.48 days
Maximum daily WU quota per CPU: 12/day
Tasks: 119
Number of times client has contacted server: 143
Last time contacted server: 5 Aug 2011 3:25:36 UTC
% of time BOINC client is running: 99.3213 %
While BOINC running, % of time host has an Internet connection: 99.9919 %
While BOINC running, % of time work is allowed: 99.9951 %
Task duration correction factor: 1.353221

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

I think it's neither a bad batch of tasks nor (necessarily) an issue with your laptop.

Sometimes while processing a CUDA task, something gets stuck in the GPU memory, which causes the task to error out. Unfortunately, that "bit" remains stuck and causes all subsequent CUDA tasks to error out too, until the computer is rebooted. There is no other way to "unstick" GPU memory.
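The pattern described above (one failure followed by every later CUDA task failing within seconds) can be spotted mechanically. Below is a minimal Python sketch, assuming you have copied each task's outcome and run time from the task pages into a list. The `TaskResult` type, the function name, and both thresholds are hypothetical choices for illustration, not anything BOINC provides.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    outcome: str     # e.g. "Success" or "Client error", as shown on the task page
    run_time: float  # seconds


def looks_like_stuck_gpu(results, min_streak=3, max_run_time=30.0):
    """Return True if the newest results end in a streak of quick client errors.

    min_streak and max_run_time are assumed thresholds; tune them to taste.
    """
    streak = 0
    for result in reversed(results):
        if result.outcome == "Client error" and result.run_time <= max_run_time:
            streak += 1
        else:
            break
    return streak >= min_streak


# Placeholder history: one good task, then 20 failures of about 8 seconds each,
# mirroring the 20 failed tasks and the 8.24 run time reported in the first post.
history = [TaskResult("ok_1", "Success", 3600.0)]
history += [TaskResult(f"failed_{i}", "Client error", 8.24) for i in range(20)]

if looks_like_stuck_gpu(history):
    print("Streak of quick failures: a reboot is probably needed.")
```

A streak check like this only tells you that a reboot is due; it says nothing about what put the GPU into that state in the first place.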

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

Darkstone751
Joined: 3 Aug 11
Posts: 3
Credit: 1058377
RAC: 0

Thank you for the quick reply Gundolf!

I might give the laptop a 30-minute break by shutting it down once a day to clear any memory issues - plus it'll give it a chance to cool off.

Either way, it'll be an interesting exercise to see whether the GPU issue is completely random, or whether it can be prevented by a daily shutdown.

I shall come back to the forum with my discoveries... hopefully it might help someone else out there with the same problem :)

Cheers
Jase

Darkstone751
Joined: 3 Aug 11
Posts: 3
Credit: 1058377
RAC: 0

Well, the pre-emptive shutdown didn't work... less than 6 hours later the GPU memory locked up again (sadness). So I started doing a little more research into what can cause a GPU memory issue when you're having a good BOINC session.

While there have been a few posts written on the subject, they seem to boil down to 1 of 3 things:

1) System is running hot enough to cook marshmallows
2) Driver instability issues
3) BOINC version issues with incompatible nVidia drivers

Hmmm...

It was easy to rule issue 1 out... and the 6.12.33 (x64) version of BOINC has been relatively trouble free. So I narrowed my search to point 2.

Having bounced around Google, I found a number of posts regarding the 270(ish) versions of nVidia drivers that are suspected of causing either slowdowns in clock speed or a 'sudden' execution of GPU memory - more than capable of de-BOINCing a perfectly good work unit and setting up a string of Computation Errors until the machine is physically shut down and rebooted. As a result, a lot of users appear to have rolled back to 266.58 - the last major stability point.

Being too stubborn to wind backwards, I went wandering through the nVidia website and found the 280.19 beta drivers that were only released last week. Thinking "it couldn't be any worse...", I downloaded and installed them on the main machine to see whether they would provide a better outcome.

Aside from decreasing the WU times by approximately 6 minutes on a Binary Radio Pulsar Search v1.00 (BRP3cuda32), the system itself has been motoring along without a single 'glitch' since. The last official nVidia driver (275.33 WHQL) seemed to be the root of my problems. I'm hoping this will help some of the other users out there who are experiencing similar issues and having their batches rendered useless by the GPU.
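If you want to check your own host against the versions mentioned in this thread, here is a minimal Python sketch. It assumes the host page's "driver: 27533" encodes version 275.33, and it treats everything after 266.58 and before 280.19 as suspect. That range is only my reading of the anecdotes above, not an official list, and the function names are made up for illustration.

```python
def parse_boinc_driver(raw):
    """Convert a BOINC-style driver string such as '27533' to (275, 33).

    Assumes the last two digits are the minor version.
    """
    return divmod(int(raw), 100)


# Versions taken from this thread; the classification is anecdotal.
LAST_REPORTED_STABLE = (266, 58)
FIRST_REPORTED_FIXED = (280, 19)


def is_suspect(version):
    """True if the version falls strictly between the two reference drivers."""
    return LAST_REPORTED_STABLE < version < FIRST_REPORTED_FIXED


version = parse_boinc_driver("27533")
print(version, "suspect" if is_suspect(version) else "not flagged")
```

Python compares tuples element by element, so the major version is checked first and the minor version only breaks ties.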

Cheers
Jase
