2 errors: cause of or by blue screen?

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0
Topic 197427

I got home a couple of days ago and discovered that one of my computers had recovered from a blue screen. Of course, BOINC hadn't found the GPU and I had to exit and start it again. Then I discovered that 2 Einstein GPU tasks (the ones running at the time of the blue screen, presumably) had returned errors. So I'm wondering whether the errors were the cause or the effect of the blue screen. The output below is from one of them, task 423321920. The other, task 423320063, has the same error message in it.

Stderr output
6.10.60

An I/O operation initiated by the registry failed unrecoverably. The registry could not read in, or write out, or flush, one of the files that contain the system's image of the registry. (0x3f8) - exit code 1016 (0x3f8)

Activated exception handling...
[00:41:30][6596][INFO ] Starting data processing...
[00:41:30][6596][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 220 MB (1317 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[00:41:30][6596][INFO ] Using CUDA device #0 "GeForce GT 440" (144 CUDA cores / 342.43 GFLOPS)
[00:41:30][6596][INFO ] Version of installed CUDA driver: 5050
[00:41:30][6596][INFO ] Version of CUDA driver API used: 3020
[00:41:32][6596][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[00:41:32][6596][INFO ] Header contents:
------> Original WAPP file: ./PA0084_00281_DM308.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 54399.857698562795
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 82423.3250008
------> DEC (J2000): -263613.155
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4516531
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 308 cm^-3 pc
------> Scale factor: 1.62162
[00:41:32][6596][INFO ] Seed for random number generator is 1087967505.
[00:41:33][6596][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-008
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[00:41:33][6596][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 341 MB (1196 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 121 MB
[00:42:36][6596][INFO ] Checkpoint committed!
[00:43:41][6596][INFO ] Checkpoint committed!
{a whole lot of checkpoints deleted for brevity}
[05:22:11][6596][INFO ] Checkpoint committed!
[05:23:09][6596][INFO ] Statistics: count dirty SumSpec pages 3387 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1100505
[05:23:09][6596][INFO ] Data processing finished successfully!
[05:23:09][6596][INFO ] Starting data processing...
[05:23:09][6596][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 220 MB (1317 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[05:23:09][6596][INFO ] Using CUDA device #0 "GeForce GT 440" (144 CUDA cores / 342.43 GFLOPS)
[05:23:09][6596][INFO ] Version of installed CUDA driver: 5050
[05:23:09][6596][INFO ] Version of CUDA driver API used: 3020
[05:23:10][6596][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[05:23:10][6596][INFO ] Header contents:
------> Original WAPP file: ./PA0084_00281_DM310.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 54399.857698521024
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 82423.3250008
------> DEC (J2000): -263613.155
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4516531
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 310 cm^-3 pc
------> Scale factor: 1.62162
[05:23:11][6596][INFO ] Seed for random number generator is 1091183138.
[05:23:12][6596][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-008
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[05:23:12][6596][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 341 MB (1196 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 121 MB
[05:23:17][6596][INFO ] Checkpoint committed!
[05:24:22][6596][INFO ] Checkpoint committed!
{more checkpoints deleted}
[06:14:36][6596][INFO ] Checkpoint committed!
[06:15:42][6596][INFO ] Checkpoint committed!
[06:16:01][6596][ERROR] Error during CUDA host->device HS thresholds data transfer (error: 999)
[06:16:01][6596][ERROR] Demodulation failed (error: 1007)!
06:16:01 (6596): called boinc_finish
Activated exception handling...
[08:56:00][5828][INFO ] Starting data processing...
[08:56:00][5828][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 80 MB (1457 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[08:56:00][5828][INFO ] Using CUDA device #0 "GeForce GT 440" (144 CUDA cores / 342.43 GFLOPS)
[08:56:00][5828][INFO ] Version of installed CUDA driver: 5050
[08:56:00][5828][INFO ] Version of CUDA driver API used: 3020
[08:56:00][5828][ERROR] Couldn't load main CUDA device module (error: 301)!
[08:56:00][5828][ERROR] Demodulation failed (error: 1016)!
08:56:00 (5828): called boinc_finish


David

Miserable old git
Patiently waiting for the asteroid with my name on it.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

You need to look in the Windows event log and compare the timestamps, or at least dig out what the blue screen error was, before you can make any kind of guess about whether the task errors were the cause or the effect of the blue screen.
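
One way to line the task errors up against the event log is to pull the ERROR lines and their timestamps out of the stderr output first. Here is a minimal sketch in Python, assuming the log lines keep the `[HH:MM:SS][pid][LEVEL] message` shape seen in the output above. The function name `error_entries` is made up for illustration, and the timestamps carry no date, so the day still has to come from the task page.

```python
import re

# Matches lines such as "[06:16:01][6596][ERROR] Demodulation failed (error: 1007)!"
# The level field may be padded with a space inside the brackets ("[INFO ]").
LOG_LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")


def error_entries(stderr_text):
    """Return (time, pid, message) for every ERROR line in the stderr text."""
    entries = []
    for line in stderr_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group(3) == "ERROR":
            entries.append((match.group(1), int(match.group(2)), match.group(4)))
    return entries


# Lines copied from the stderr output in the first post.
sample = """\
[06:15:42][6596][INFO ] Checkpoint committed!
[06:16:01][6596][ERROR] Error during CUDA host->device HS thresholds data transfer (error: 999)
[06:16:01][6596][ERROR] Demodulation failed (error: 1007)!
06:16:01 (6596): called boinc_finish
[08:56:00][5828][INFO ] Starting data processing...
[08:56:00][5828][ERROR] Couldn't load main CUDA device module (error: 301)!
[08:56:00][5828][ERROR] Demodulation failed (error: 1016)!
"""

errors = error_entries(sample)
for time, pid, message in errors:
    print(time, pid, message)
```

The first error time (06:16:01 here) is the one to compare against the crash entry in the event log; the later pair at 08:56:00 comes from a different process ID.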

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 508056914
RAC: 57118

Well, this happens here too from time to time, about once every 2 weeks.
The last time it happened I saw something like 'interrupt equal or less' on the screen before the system rebooted, and one CUDA WU was destroyed. It has never destroyed an ATI WU, sometimes a CUDA WU, but always the WU-prop WUs.
I thought it had to do with my DVB-S tuner card because it happens only on this system, so I ignored it.
It has happened with different CUDA drivers, on both the earlier and the current system.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 16

Use Blue Screen View to see what the blue screen said, and where it came from.

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 508056914
RAC: 57118

Quote:
Use Blue Screen View to see what the blue screen said, and where it came from.

THX for the link, I used it and found:
https://dl.dropboxusercontent.com/u/50246791/Crashdump1.PNG

Google results point to a memory fault, where 'memory' means more than RAM; it can also be GPU RAM, the hard disk, and so on.

As far as my system is concerned, it might be something like an 'intellectual overload' for Windows: 3 different types of GPUs, a tuner card using the RAM for timeshift memory, and usually 5-8 programs open.

There was a discussion about the amount of RAM needed for best performance. My system has 8 GB, so this might not be enough.

Alexander

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0

Quote:
Use Blue Screen View to see what the blue screen said, and where it came from.


Okay, I did. It says this BSOD (and the last one, in November) was caused by dxgkrnl.sys. I Binged that and got lots of discussion of it. I'm still reading, but nothing seems entirely pertinent so far. The gist of the answers is that I have a bad video driver (it doesn't seem to matter whether it's ATI or NVIDIA) and I should either update it or roll it back. I'll do some more reading and probably try updating the driver when I get home today. (Or maybe I'll wait until BOINC finishes all the Einstein work on hand, which is dangerously close to deadline.)

I can tell you that no one was actively using the computer at the time of the crash. It may have been as long as a month since I laid hands on it; I spend an average of about 30-40 minutes a day (almost daily) on it via Teamviewer, but other than that it sits there crunching.

David


Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 16

Well, if it all turns out to be nothing, you can always try updating DirectX, since dxgkrnl.sys is presumably the DirectX Graphics Kernel.

How to update DirectX? See here. It may be that there's just a glitch that a reinstall will fix. Video card drivers usually don't update the DirectX environment, although new game installations will.

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0

Off my own topic...

I have at least two tasks that are not going to make deadline. There are five that are due less than 12 hours 40 minutes from now. Two are crunching, showing 4:53 and 5:39 to completion. One is waiting, showing 17 minutes left. This one and this one have not started. They've all been taking roughly 9.5 hours, so there's no way it can finish what it started and get through what it hasn't started in the time remaining.
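
The arithmetic above can be checked with a small simulation. This is only a sketch in Python under a few assumptions that are not stated outright: two tasks run at a time (two are shown crunching), the waiting task resumes as soon as a slot frees up, and each unstarted task takes the full 9.5 hours. The task labels are placeholders.

```python
import heapq

DEADLINE = 12 * 60 + 40          # minutes until the five tasks are due

# Minutes left on the two tasks that are crunching right now.
running = [4 * 60 + 53, 5 * 60 + 39]

# Tasks still to be run, in the order assumed here: (label, minutes needed).
queue = [
    ("waiting", 17),
    ("unstarted A", 570),        # roughly 9.5 hours each (assumed)
    ("unstarted B", 570),
]

# Each heap entry is the time (in minutes from now) at which a slot frees up.
slots = list(running)
heapq.heapify(slots)

finish = {}
for label, minutes in queue:
    free_at = heapq.heappop(slots)          # earliest slot to become free
    finish[label] = free_at + minutes
    heapq.heappush(slots, finish[label])

late = [label for label, _ in queue if finish[label] > DEADLINE]

for label, _ in queue:
    done = finish[label]
    status = "late" if done > DEADLINE else "on time"
    print(f"{label}: done at {done // 60}:{done % 60:02d} ({status})")
```

Under those assumptions the waiting task is done at 5:10, while the two unstarted tasks finish at 14:40 and 15:09, at least two hours past the 12:40 mark.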

And there are 12 more due at various times in the ensuing 48 hours.

I'm going to get off Teamviewer and let it run. I'll leave the driver check until these are over with one way or the other.

David


Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

If I were in that situation I would consider aborting the tasks that haven't started and focusing on the ones that might actually make it before a resend is sent out, or at least before the resend comes back in again. I would also check my cache setting so I don't end up in the same situation again. =)
I think the server here is configured to send a message to abort unstarted and unneeded tasks, so if the resend hasn't begun processing on your wingman's machine it should be aborted at the next scheduler contact.

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0

Quote:

Off my own topic...

I have at least two tasks that are not going to make deadline. There are five that are due less than 12 hours 40 minutes from now. Two are crunching, showing 4:53 and 5:39 to completion. One is waiting, showing 17 minutes left. This one and this one have not started. They've all been taking roughly 9.5 hours, so there's no way it can finish what it started and get through what it hasn't started in the time remaining.

And there are 12 more due at various times in the ensuing 48 hours.

I'm going to get off Teamviewer and let it run. I'll leave the driver check until these are over with one way or the other.


Holmis, I considered that, but I let them go.

Of the two noted above that hadn't started, one is now out to a third host uselessly (sorry) and the other has its third task marked as "didn't need."

I just aborted four more that were due in the next two hours and hadn't started yet. There are at least two more that I probably should abort, but I'll hold off and see what happens with the ones currently running. Actually, those two can't possibly make it either...

David


David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0

I aborted two more that hadn't started and are due in under four hours from now.

I see that one of the ones I've aborted is within an hour of timing out from the other user as well.

Also, one of the ones still running (even though it has already timed out) previously had a timeout and an error.

Anyway, they should all be done by the time I get home from work tomorrow, and then I can get back to my original topic and try updating my video driver. I probably need to blow the dust out of the computer, too.

David

