Well, my brand new GTX 670 has started to produce nothing but compute errors.
They seem to happen anywhere from 35 to 120 seconds into the task.
The ones I've looked at all say something like:
[pre]
[20:13:34][4633][INFO ] CUDA global memory status (GPU setup complete): ------> Used in total: 1717 MB (331 MB free / 2048 MB total) -> Used by this application (assuming a single GPU task): 620 MB
[20:14:32][4633][INFO ] Checkpoint committed!
[20:15:31][4633][ERROR] Error during CUDA device->host HS power spectrum transfer (error: 702)
[20:15:31][4633][ERROR] Demodulation failed (error: 1008)!
20:15:31 (4633): called boinc_finish
[/pre]
Here are some examples:
http://einsteinathome.org/task/312162146
http://einsteinathome.org/task/312162145
http://einsteinathome.org/task/312161996
For the record, I'm on this system: http://einsteinathome.org/host/5771544 running Scientific Linux (a Red Hat derivative) with the NVIDIA drivers from the repository, v304.43.
That sounds like a hardware or driver problem to me. So I've slapped together a little Matlab script that:
- Generates a 4096x4096 array of random doubles
- Copies it to GPU memory, reads it back, and compares the two
- Runs an FFT on the array (1D, row-wise) both in main memory and on the GPU, then an inverse FFT, takes the amplitude, and computes an RMS error between the original and the result. CPU results are compared with CPU results and GPU results with GPU results.
All of that works just fine.
The memory transfers compare without errors, and the average errors are very similar for CPU (1.51e-13) and GPU (1.39e-13). For single precision values the errors are 8.14e-5 (CPU) and 7.65e-5 (GPU).
Timing for the double precision (DP) FFTs:
[pre]
FFT iFFT AMP RMS
CPU 8.30 14.40 9.25 12.73
GPU 0.97 0.95 1.06 1.96
[/pre]
Timing for single precision:
[pre]
FFT iFFT AMP RMS
CPU 6.16 11.31 4.80 8.23
GPU 0.27 0.27 1.06 0.78
[/pre]
The FFT and iFFT are each done 100 times row-wise on a 4096x4096 matrix.
AMP converts the complex iFFT result back to real with sqrt(xi.*conj(xi)).
RMS is the mean over rows of the per-row root mean square error of (original - result).
Any ideas what else I can test?
Joe
[SOLVED] CUDA Error - Exit Code 240
Well, I'm getting more information although I'm still confused.
My home setup is dual monitors and multiple computers on separate switch boxes.
I disconnected one monitor from that system and didn't switch the other and lo and behold the 3 running CUDA jobs finished successfully.
Now I'm trying to leave one disconnected but switch the other to a different system to see what happens.
I did notice that the latest drivers detect monitor connect and disconnects on the fly.
Anybody know how to turn that off?
Joe
OK so with only one monitor connected I can switch back and forth to that system with CUDA tasks running and they complete without error. We'll see if they also validate but I expect they will.
Just goes to prove "one man's signal is another man's noise", or is it "one man's feature is another man's bug"?
Anyway, this fancy new NVIDIA driver 304.43 should not be used with dual monitors if you're switching video.
At least I got some CUDA diagnostic and timing tests started. Anybody with Matlab and the Parallel Toolbox who is interested in a short script is welcome to it.
Joe
They are validating.
Is there a way to add the [SOLVED] tag to the thread title?
Joe
Yes.
Post a new message to this thread. When editing that message within the hour, you can change the thread title too.
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
Thanks Gundolf!
A new message to try to indicate the issue was resolved.
As Gundolf said, post a new message rather than replying to or quoting an existing one.