One of my two GTX660 hosts has been generating quite a few errors on Perseus Arm Survey tasks lately. I assume this is a problem with the host, and not somehow a problem with Einstein code interacting with Einstein data.
At the moment the task list for that host shows 25 error tasks. I spot-checked six, and in all of them the stderr_txt has, just before the end, two lines very like this example:
[01:43:07][768][ERROR] Error during CUDA device->host HS data transfer (error: 999)
[01:43:07][768][ERROR] Demodulation failed (error: 1008)!
save that on one of the six the data transfer error line showed error 702 rather than 999.
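For anyone who wants to check more than a handful of tasks by hand, here is a minimal Python sketch that tallies the error codes on the data transfer line. It assumes the stderr text has already been saved or pasted into a string; the function name and the sample text are illustrative only, and the pattern is modelled purely on the two lines quoted above.

```python
import re
from collections import Counter

# Pattern modelled on the quoted stderr line, e.g.
# [01:43:07][768][ERROR] Error during CUDA device->host HS data transfer (error: 999)
TRANSFER_ERROR = re.compile(
    r"\[ERROR\] Error during CUDA device->host HS data transfer \(error: (\d+)\)"
)

def count_transfer_errors(stderr_text):
    """Return a Counter mapping each reported transfer error code to its count."""
    return Counter(int(code) for code in TRANSFER_ERROR.findall(stderr_text))

# Sample input copied from the example lines in this post.
sample = (
    "[01:43:07][768][ERROR] Error during CUDA device->host HS data transfer (error: 999)\n"
    "[01:43:07][768][ERROR] Demodulation failed (error: 1008)!\n"
)

if __name__ == "__main__":
    print(count_transfer_errors(sample))
```

The "Demodulation failed" line is deliberately not counted, since it appears to be a consequence of the transfer error rather than a separate fault.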
Just since the latest error I installed the latest NVidia driver 327.23, in place of 320.49.
A day or two ago I downgraded from running 3 simultaneous GPU jobs to two.
I'd be glad to hear any suggestions, health checks, etc. It seems likely to me that somewhere in the system some piece of hardware is flaky--possibly the GPU card itself, or something on the motherboard, or the power supply, or ...
The other host has different CPU and motherboard, but has considerable similarity to this one (64 bit Windows 7 installed by me), and no such difficulty.
I'd like to try a modest downclock of the GPU, but have had trouble exercising clock rate control of this card in the past. I currently have OC_Guru II V1.20 installed. I either don't understand the interface, or it mostly ignores my clock rate advice, though it does seem to apply at least some of my fan speed requests.
GPU-Z currently reports a GPU Core Clock of 1136.6 MHz, Memory Clock of 1502.3 MHz, VDDC of 1.175 V, and temperature of 60 C.
I run TThrottle, though as the outside temperature has dropped it is not throttling back nearly so much as through the summer.
Copyright © 2024 Einstein@Home. All rights reserved.
I had a quick look at the CUDA docs and the BRP source code; the "999" error code is CUDA_ERROR_UNKNOWN, which "indicates that an unknown internal error has occurred". So it does seem likely it's a hardware issue, with the card not behaving as expected.
It's around line 621 of 'src/cuda/app/demod_binary_hs_cuda.cu' in the current source (linked from the home page).
Neil, thanks for the interpretation.
I now have over three days running time with 327.23 drivers, only two simultaneous GPU tasks, and possibly a modest clock rate reduction. In that time zero Einstein GPU tasks have errored out, but on the first day there seemed to be a case of modest downclocking (not the severe downclocking associated with the GPU task error out events).
I say possible slight downclocking, as the indications are mixed. GPU-Z asserts that the GPU clock has been 1136.6 MHz both min and max for days now, while it has shown 1149 in the past. OC_Guru II currently displays a commanded base/boost GPU clock of 1027/1092, with -5 showing in the adjustment window below, but in the monitoring window reports observing 1136 GPU and 6008 memory.
On the other hand, the second host, which has not had these problems, and on which I have not attempted to get GPU clock rate reduction by OC_Guru II, shows the same GPU-Z detected speed of 1136.6, and the same OC_Guru monitored speed of 1136, while the command window shows 1032/1097 over zero adjust.
I'm very happy that it has been stable for many times the mean time to failure on the previous six failures. I plan to let it alone for more than a week, and then perhaps try to put it back up to three simultaneous tasks.
These kinds of issues can be incredibly fickle, I've had situations when even tiny changes make faults appear (or disappear). Just like you're doing, making a change and waiting long enough to see what effect it has makes sense!
By the way, I missed the '702' error on first reading. The CUDA docs list that as CUDA_ERROR_LAUNCH_TIMEOUT, basically saying the card was set up and started, but took too long to come back with a result. So it seems to fit in with the idea that the card is (or was) having internal problems, for whatever reason.
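To keep the two codes seen in this thread straight, here is a small Python lookup sketch. The names are the CUDA driver API constants for 999 and 702; the table covers only those two codes, the short descriptions are paraphrases of the CUDA docs, and the helper name is made up for illustration.

```python
# Only the two codes discussed in this thread; not a complete list of CUDA errors.
CUDA_DRIVER_ERRORS = {
    702: ("CUDA_ERROR_LAUNCH_TIMEOUT", "the device kernel took too long to execute"),
    999: ("CUDA_ERROR_UNKNOWN", "an unknown internal error has occurred"),
}

def describe(code):
    """Return a one-line description for a code, or a note if it is not in the table."""
    if code in CUDA_DRIVER_ERRORS:
        name, meaning = CUDA_DRIVER_ERRORS[code]
        return f"{name} ({code}): {meaning}"
    return f"code {code}: not in this table, check the CUDA documentation"

if __name__ == "__main__":
    for code in (999, 702):
        print(describe(code))
```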
Mostly in case someone finds this thread while looking for clues on their own problems, I'll give an update.
Since my October 4 update to the 327.23 drivers I have had zero recurrences of the full-fledged problem of three WUs aborting simultaneously, all showing host HS data transfer error, with attendant severe downclocking of the GPU until rebooted.
It is possible that the drivers helped. It is also possible that my attempts to get slightly lower GPU clock rate from OC Guru II settings enjoyed more success in this period, as it usually continued to be the case that on inspection GPU-Z reported 1136.6 GPU clock rate, not 1149.
However, at least two or three times after the driver update I did experience milder forms of sudden major downclocking. In these cases the current GPU work did not abort, and I think the units in question eventually validated. However, interaction with the PC became a bit sluggish in some respects, and the rate of progress on GPU jobs slowed markedly. I never remembered to examine GPU-Z in these cases, so I cannot provide confirmation of downclocking, nor a number, from that source. From behavior and WU progress, I hazard a guess that the clock dropped by a factor of between 2 and 4.
I nevertheless decided to explore an alternative utility for GPU fan speed and clock rate control. I'm currently trying MSI Afterburner on the troubled host mentioned in my first post on this thread.
So far, on a couple of days running, and a couple of reboots, MSI Afterburner appears to allow me actual GPU clock speed control, with a granularity of about 13 MHz, and fan speed control, and control appears so far to resume properly on reboot (I'm launching the application slightly after boot using a scheduled task, not relying on the program mechanism).
I'm currently planning to inch up the GPU clock rate about one 13 MHz notch per week so long as all seems well, but back down at any trouble. If Afterburner continues to seem superior to OC Guru II for my purpose and environment, I'll also install it on my other host with the same GPU, and explore clock rate there as well.
Bottom line: driver 327.23 seems to have been better for me than 320.49 for massive downclock and aborted GPU work, and MSI Afterburner seems so far to be better for me than OC Guru II.