Hi all,
Old-timer and newbie at the same time on E@H, came back after a loooong time to crunch on my NVIDIA GT 440, after WCG stopped serving GPU WUs.
Just started yesterday, but I am seeing a trend I don't like at all: my tasks run and complete fine, but they are then marked invalid. Here is the list of invalid tasks.
I know it's too soon at this point to draw any conclusions, with so few WUs completed, but as I said, I don't like the trend. I am suspecting the card, but it was working like a charm for WCG!
Perhaps it is the temps. I have a fairly good setup, with fans blowing both in and out of the case, a decent CPU cooler and the card's (GIGABYTE) stock active cooler. Here are my temps:
vagelis@vgserver:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +63.0°C (high = +83.0°C, crit = +99.0°C)
Core 1: +62.0°C (high = +83.0°C, crit = +99.0°C)
Core 2: +61.0°C (high = +83.0°C, crit = +99.0°C)
Core 3: +63.0°C (high = +83.0°C, crit = +99.0°C)
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage: +1.22 V (min = +0.80 V, max = +1.60 V)
+3.3V Voltage: +3.39 V (min = +2.97 V, max = +3.63 V)
+5V Voltage: +5.14 V (min = +4.50 V, max = +5.50 V)
+12V Voltage: +12.26 V (min = +10.20 V, max = +13.80 V)
CPU Fan Speed: 2280 RPM (min = 600 RPM)
Chassis1 Fan Speed: 1504 RPM (min = 600 RPM)
Chassis2 Fan Speed: 0 RPM (min = 600 RPM)
Power Fan Speed: 610 RPM (min = 0 RPM)
CPU Temperature: +63.5°C (high = +45.0°C, crit = +45.5°C)
MB Temperature: +35.0°C (high = +45.0°C, crit = +46.0°C)
vagelis@vgserver:~$ nvidia-smi -q -d TEMPERATURE
==============NVSMI LOG==============
Timestamp : Fri May 10 00:17:44 2013
Driver Version : 304.88
Attached GPUs : 1
GPU 0000:01:00.0
Temperature
Gpu : 64 C
It's right after midnight now that I'm writing these lines, so ambient temp has dropped a few degrees. When it's warmer, they tend to be about 70 for the CPU, slightly lower for the GPU (maybe 67-69) and 37-39 for the MB.
I am running on Ubuntu 12.04 with BOINC and NVIDIA drivers from the repos (7.0.27 and 304.88 respectively). Here are the machine's details.
Do you think my card is bad? Or perhaps I'm just piling up all my bad results right from the start and things will get smooth later?
Thanks and regards,
Vagelis
Copyright © 2024 Einstein@Home. All rights reserved.
Only you can see that list, we can't. But we can see it here, using hostid instead of userid.
No, I think you're sitting on a bad GPU, though it could be something easy, such as dust build-up or bad seating. Take it out, dust it off, put it back in and seat it correctly. While it's out, check that none of the capacitors are bulging, leaking or burnt. Check the power cords too.
If you have the possibility, try it in another slot.
Or if you have a spare GPU, try that one.
Hi Jord,
Thanks for your response! I checked my tasks this morning and the trend continues: all my completed tasks are marked invalid.
I am wondering if there is some indication of whatever might be wrong in the tasks' stdout / stderr outputs. Something that I could compare for a given WU between my task that is found invalid and another that is found valid.
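A minimal sketch of that kind of comparison, assuming the stderr output of both tasks has been copied from the task pages into plain text (the function names and the file labels in the diff header are made up for illustration; nothing here is part of BOINC or E@H):

```python
import difflib


def diff_task_outputs(mine: str, theirs: str) -> list[str]:
    """Return a unified diff between two saved stderr texts (no context lines)."""
    return list(difflib.unified_diff(
        mine.splitlines(),
        theirs.splitlines(),
        fromfile="my_task_stderr",        # hypothetical label, only used in the header
        tofile="wingman_task_stderr",     # hypothetical label, only used in the header
        lineterm="",
        n=0,
    ))


def changed_lines(mine: str, theirs: str) -> list[str]:
    """Return only the removed ('-') and added ('+') lines of the diff."""
    body = diff_task_outputs(mine, theirs)[2:]  # drop the two '---' / '+++' header lines
    return [line for line in body if line.startswith(("+", "-"))]
```

Lines starting with `-` appear only in the first text and lines starting with `+` only in the second. Timestamps and host-specific paths will always differ, so the interesting part is whatever remains once those are ignored.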
Can't see anything obvious in your scheduler logs, and the runtime for the jobs is in the expected range. I do recall I had a few problems around boinc-7.0.27, so one possibility would be to upgrade to the current 7.0.65 release.
Excellent idea, in fact that's one reason why such logs exist. I'll have a look when I get home. :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Thank you Neil and Mike for your responses! Mike, it would be sooo nice if you managed to take a look at those logs! Thank you in advance!
In my attempts to resolve this, I tried detaching / reattaching the project, resetting the project and even rebooting my PC. The detach / reattach caused a new computer ID to be generated, 7225470. I went ahead and merged the two computer entries.
All this, I am afraid, to no avail! A new WU was completed and also found invalid!
Now, I know my poor little GT 440 wouldn't make the slightest difference to the progress of the project if it actually did produce valid results, but it is SOOO depressing to see your WUs flagged invalid! :(
I can't see whether you are still running those versions, but I find that running the NVIDIA drivers from the repos gives me problems (Ubuntu 10.04). If I'm not paying attention at update time, my current 310.14, which has served me well, gets overwritten with 304.xx, which gives either invalid results or errors.
Nvidia would be the place to go to get later versions.
I notice 319.17 is the latest, so I may give that a try and report back.
HTH
Maybe I have found something. I downloaded and installed the latest driver from the Nvidia site (319.17). Nouveau did force one extra reboot of my server, which takes a while with the RAID and other stuff I have on it, but I managed to load the latest driver and launch X successfully.
I then fired up BOINC and the manager to see whether the E@H WU would start and it did. Looking through the BOINC logs, I noticed this:
NVIDIA GPU 0: GeForce GT 440 (driver version unknown, CUDA version 5.50, compute capability 2.1, 134214656MB, 134214625MB available, 319 GFLOPS peak)
With the older driver, this was like so:
NVIDIA GPU 0: GeForce GT 440 (driver version unknown, CUDA version 5.0, compute capability 2.1, 134214656MB, 134214626MB available, 319 GFLOPS peak)
Notice the difference in CUDA version: it was 5.0 and is now 5.50.
I then looked at the WU logs:
So I am wondering, could it be that E@H requires CUDA 5.50 and didn't work correctly with 5.0??
BRP4 requirements suggest 5.0 is ok.
I would let it run and see what happens. Any in-progress WUs may still error, so I would abort any that were running during the upgrade.
134214656MB - that is large for the GPU! I think something is not reporting the memory size of the GPU correctly, and I seem to recall it is a known bug.
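For a sense of scale, a quick back-of-the-envelope check. The 64 GB ceiling in `looks_plausible` is an arbitrary assumption of mine for illustration, not anything BOINC itself uses.

```python
REPORTED_MB = 134214656  # memory figure from the BOINC log line quoted in this thread


def mb_to_tb(megabytes: int) -> float:
    """Convert megabytes to terabytes using binary (1024-based) units."""
    return megabytes / (1024 * 1024)


def looks_plausible(megabytes: int, limit_mb: int = 64 * 1024) -> bool:
    """Crude sanity check: positive and under an assumed 64 GB ceiling."""
    return 0 < megabytes <= limit_mb


print(f"{REPORTED_MB} MB is about {mb_to_tb(REPORTED_MB):.1f} TB")
print("plausible for a graphics card:", looks_plausible(REPORTED_MB))
```

That works out to roughly 128 TB, which no graphics card has, so the number in the log says nothing about the card itself.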
And my GTX 460s on 319.17 have crunched a couple of WUs OK. No obvious performance gains, although it's hard to tell at first glance.
It is a known bug and is fixed in 7.0.65, but last I checked, 7.0.65 wasn't yet available in the Ubuntu repos.
I will give the latest BOINC version a try.