I'm getting the errors again, at least in Linux. I've been running Windows for a bit and it seems to be doing OK with these CUDA tasks.
I found one program that reports a temperature for the GPU: it seems to run at about 65C when processing and about 35C when idle. For the life of me I can't see any reason why the NVidia software doesn't report temperature.
Are you referring to the Nvidia software in Linux?
The nvidia-settings GUI is a bit fickle. It may not work for you. Try copying and pasting the following command into a Linux terminal; don't delete the quotes. If the command fails then please copy and paste the output here. If it runs you'll see the GPU temp and fan speed displayed and updated every 2 seconds.
watch -t "nvidia-settings -t -q [gpu:0]/GPUCoreTemp && nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeed"
The temp will be on the top; the fan speed will be below.
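If watch itself misbehaves, the same polling can be sketched in Python. This is only a rough equivalent, assuming that the terse (-t) output of nvidia-settings is a bare integer when the query succeeds; the function names parse_reading, query and poll are made up for this example.

```python
import re
import subprocess
import time


def parse_reading(text):
    """Return the integer in terse nvidia-settings output, or None.

    Assumption: a successful terse query prints just a number, so empty
    output or an error message yields None.
    """
    match = re.fullmatch(r"-?\d+", text.strip())
    return int(match.group()) if match else None


def query(attribute):
    """Run `nvidia-settings -t -q <attribute>` and parse what it prints."""
    try:
        result = subprocess.run(
            ["nvidia-settings", "-t", "-q", attribute],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # nvidia-settings is not installed or not on the PATH.
        return None
    return parse_reading(result.stdout)


def poll(interval=2.0):
    """Print temperature and fan speed every `interval` seconds (Ctrl+C stops it)."""
    while True:
        temp = query("[gpu:0]/GPUCoreTemp")
        fan = query("[fan:0]/GPUCurrentFanSpeed")
        print(f"temp={temp} fan={fan}")
        time.sleep(interval)
```

Calling poll() starts the loop. A value of None means the query printed nothing usable, which is worth reporting back here along with the driver version.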
Quote:
If it's a hardware failure it sure is particular. I can do accelerated 3D rendering, video decoding and FFTs without a problem.
Maybe those activities do not work the GPU in the same way Einstein tasks do. Or maybe they don't raise the temperature the way Einstein tasks do. My GPU temp went through the roof on its first CUDA task because the fan didn't speed up on Linux. Maybe yours is doing the same and that's why it's failing on Linux but not Windows.
Name PM0147_034B1.dm_120_1
Workunit 100178936
Created 3 Jul 2011 6:31:47 UTC
Sent 3 Jul 2011 21:18:17 UTC
Received 3 Jul 2011 21:27:34 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -226 (0xffffffffffffff1e)
It's the right time and the names match but the "zero status" doesn't look right.
If those names match is it the right task?
Yes, if the names match it's the same task. The "zero status" and "Exit status -226" appear to disagree but that's because they come from two different sources. The "zero status" is the exit status reported by the Einstein application to the BOINC client. The client relays that status to the manager which then includes it in the Messages.
The "Exit status -226" is the exit status the client gives to the task. Notice I am implying the task and the Einstein app are distinct. You can look up task exit codes in the BOINC FAQ (see the link in my sig) to see what they mean. For -226 you would come to this page where we see:
Quote:
ERR_TOO_MANY_EXITS -226
An application has exited prematurely (unexpectedly) more than 99 times without generating a checkpoint, so giving up on that task.
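The long hexadecimal number shown next to -226 in the task details is the same value written as an unsigned 64-bit integer. A small Python sketch of the conversion follows; the lookup table holds only the one code quoted in this thread, and the helper names are made up for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

# Only the code quoted above; the BOINC FAQ lists the rest.
KNOWN_CODES = {-226: "ERR_TOO_MANY_EXITS"}


def to_unsigned64(value):
    """Two's-complement view of a signed value as an unsigned 64-bit integer."""
    return value & MASK64


def to_signed64(value):
    """Interpret a 64-bit pattern as a signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


def describe(status):
    """Format an exit status the way the task page does, plus a name if known."""
    signed = to_signed64(status)
    name = KNOWN_CODES.get(signed, "unknown")
    return f"{signed} ({to_unsigned64(signed):#x}) {name}"


print(describe(-226))
```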
So now the question is "why did the Einstein app exit prematurely?" One very common reason is that the client or OS gets busy with some other job for more than 30 seconds and doesn't communicate with the app during that time. Then the app thinks BOINC has crashed so it exits without writing its finished file. When BOINC gets "unbusy" it sees that the app has exited without writing its finished file (the signal the app is finished crunching the task) so it starts the app again. This sometimes goes on for more than 99 times at which point BOINC kills the app and gives the task a -226 exit status. If it's due to the client not communicating with the app for 30 secs then you'll also see the message "no heartbeat from client" in the stderr which is the lower portion of the page you linked to here.
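The restart loop can be pictured with a toy model. This is not BOINC's actual code, just a sketch of the behaviour described above: every run that ends without the finished file counts as a premature exit, and the 100th one ("more than 99") ends the task with -226. The name run_task and the boolean-per-run input are assumptions made for the illustration.

```python
from itertools import chain, repeat

ERR_TOO_MANY_EXITS = -226
MAX_PREMATURE_EXITS = 100  # "more than 99 times"


def run_task(runs, max_exits=MAX_PREMATURE_EXITS):
    """Simulate the client restarting an app.

    `runs` yields True when a run wrote its finished file and False when
    the app exited without it. Returns 0 on success, -226 once the limit
    is reached, or None if the runs end before either happens.
    """
    premature = 0
    for wrote_finished_file in runs:
        if wrote_finished_file:
            return 0
        premature += 1
        if premature >= max_exits:
            return ERR_TOO_MANY_EXITS
    return None


# An app that never finishes is eventually given up on.
print(run_task(repeat(False)))
# An app that fails five times and then finishes is fine.
print(run_task(chain(repeat(False, 5), [True])))
```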
I don't see any "no heartbeat from client" messages on that page so I assume the app is exiting for some other reason. The reason could be this excerpt which does repeat many times in the stderr:
[ERROR] Couldn't allocate 25163828 bytes of CUDA HSP texture memory (error: 2)!
[ERROR] Demodulation failed (error: 1006)!
[WARN ] CUDA memory allocation problem encountered!
------> Returning control to BOINC, delaying restart for at least five minutes...
Maybe your video card doesn't have 25163828 bytes of CUDA HSP texture memory? Maybe a bug in the Linux driver prevents the OS/BOINC/app from seeing that much HSP texture memory? Have you tried other drivers? The more I think about it, that error probably isn't due to high temperature because it seems to be occurring before the crunching starts (the app has to allocate memory before it can crunch numbers).
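To put that number in perspective, 25163828 bytes is just under 24 MiB. The Python sketch below pulls the allocation failures out of a stderr dump and converts the sizes; the regular expression is written against the exact wording of the excerpt above and would need adjusting for other messages. If the "error: 2" is a CUDA error code (an assumption, the app may use its own numbering), 2 is the out-of-memory code.

```python
import re

# Matches the allocation failure line exactly as it appears in the excerpt.
ALLOC_FAILURE = re.compile(
    r"\[ERROR\] Couldn't allocate (\d+) bytes of CUDA HSP texture memory "
    r"\(error: (\d+)\)!"
)


def allocation_failures(stderr_text):
    """Return a list of (bytes_requested, error_code) found in the text."""
    return [(int(size), int(code)) for size, code in ALLOC_FAILURE.findall(stderr_text)]


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


sample = """\
[ERROR] Couldn't allocate 25163828 bytes of CUDA HSP texture memory (error: 2)!
[ERROR] Demodulation failed (error: 1006)!
[WARN ] CUDA memory allocation problem encountered!
"""

for size, code in allocation_failures(sample):
    print(f"{size} bytes = {to_mib(size):.2f} MiB, error code {code}")
```

Counting how many times the line repeats in the full stderr, and comparing the size against the free memory the card reports, would show whether the request is simply too large for what is available.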
Now that you mention it I did update to the latest driver around the time this started to happen. I needed it to run Ubuntu 11.04. I'll see if I can find an older driver that works.
The issue with getting the temperature out of the darn thing is frustrating.
If I use "nvidia-settings -t -q [gpu:0]/GPUCoreTemp" it returns without outputting anything.
The nvidia-settings utility has been a tad quirky in my experience, same with nvidia-smi. They seem to work slightly differently depending on the driver. Let us know how things go with a different driver.
As far as I can tell everything else works fine, but BOINC is the only CUDA app I run besides some toying I've been doing to learn how to address it and some video processing in an unrelated project.
I believe I need at least version 270. These are the relatively current drivers available:
Name                                                         Version   Release Date
Linux x64 (AMD64/EM64T) Display Driver (NVIDIA Recommended)  270.4106  April 20, 2011
Linux x64 (AMD64/EM64T) Display Driver                       260.1944  March 7, 2011
Linux x64 (AMD64/EM64T) Display Driver (BETA)                270.26    February 21, 2011
Linux x64 (AMD64/EM64T) Display Driver                       260.1936  January 21, 2011
Linux x64 (AMD64/EM64T) Display Driver (BETA)                270.18    January 21, 2011
Linux x64 (AMD64/EM64T) Display Driver                       260.1929  December 13, 2010
Any recommendation on which might work better than 270.4106?
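For what it's worth, narrowing that list down to the "at least version 270" requirement is easy to script. The Python sketch below uses the version strings exactly as listed above and only looks at the number before the first dot; the names DRIVERS, major and candidates are made up for the example.

```python
# (version, release date, note) copied from the list above.
DRIVERS = [
    ("270.4106", "April 20, 2011", "NVIDIA Recommended"),
    ("260.1944", "March 7, 2011", ""),
    ("270.26", "February 21, 2011", "BETA"),
    ("260.1936", "January 21, 2011", ""),
    ("270.18", "January 21, 2011", "BETA"),
    ("260.1929", "December 13, 2010", ""),
]


def major(version):
    """Return the series number, i.e. the part before the first dot."""
    return int(version.split(".")[0])


def candidates(drivers, minimum_major=270):
    """Keep only drivers whose series is at least `minimum_major`."""
    return [d for d in drivers if major(d[0]) >= minimum_major]


for version, released, note in candidates(DRIVERS):
    print(version, released, note)
```

That leaves 270.4106 and the two 270 betas, so the only alternatives to try are older beta builds.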
I recently downgraded to 260.1944 from 270.4106. With 270.4106 nvidia-smi output N/A for most of the attributes including fan speed and temp. It works fine with 260.1944. Those are the only ones I've tried. 260.1944 is adequate for Einstein and GPUgrid, not sure about the other stuff you're working on.
Well, now I remember why I felt I HAD to upgrade the NVidia drivers:
with the 260 drivers, 11.04 doesn't boot for me. I get:
This server has a video driver ABI version of 10.0 that is not
supported by this NVIDIA driver. Please check
http://www.nvidia.com/ for driver updates or downgrade to an X
server with a supported driver ABI.
So I'm not sure what to do.
This laptop doesn't do that much crunching per day through the GPU.
Can you point me to instructions to disable GPU tasks for this machine?
Hmmm. There may be a problem. I've searched through some of the archive at the NVIDIA site and so far I haven't found a Linux driver recommended for your card.
From an image of GPUz running from one of your earlier posts I gather your card is model NVS3100M with GT218 chip. From there I went to NVIDIA's UNIX Driver Portal Page where I clicked the green "Archive" link at the bottom of the "Linux x86/IA32" section. On that page each driver version is listed under a bold black "Linux Display Driver - x86" link. Clicking any of those links takes you to a page with a tab named "Supported Products". I haven't tried every driver but so far the ones I've looked at (270.4106 and 260.1944) do not list NVS3100M or GT218 as a supported product. That's probably why nvidia-settings and nvidia-smi gave such weird output with 270.4106.
Hopefully one of the other drivers is recommended. If not then it looks like you'll be able to use that GPU only on Windows.
Well now I remember why I felt I HAD to upgrade the NVidia drivers
with the 260 drivers 11.04 doesn't boot for me.
$%#*& I was going to point you to a procedure to follow that would allow you to back out the driver if X server refused to boot. Sorry, I forgot.
Quote:
I get:
This server has a video driver ABI version of 10.0 that is not
supported by this NVIDIA driver. Please check
http://www.nvidia.com/ for driver updates or downgrade to an X
server with a supported driver ABI.
So I'm not sure what to do.
I'm not sure what "video driver ABI" means and I'm not sure how to downgrade the X server.
If you're left at a command prompt or terminal login then you can uninstall the driver by running nvidia-uninstall as root. If you can't get to a command prompt then you might try booting with your Ubuntu install disk and see if you can do a rescue or boot to command line or something. Hopefully it'll mount the drive for you so you can run nvidia-uninstall. If that doesn't work the only thing I can recommend is reinstalling Ubuntu.
Quote:
This laptop doesn't do that much crunching per day through the GPU.
Can you point me to instructions to disable GPU tasks for this machine?
Not sure what you mean. You can go to Einstein preferences on your account page and deselect CUDA tasks. Or do you mean boot the OS in such a way that it won't use the GPU? No, I don't think you can do that. Or do you mean something else?
Are you referring to the Nvidia software in Linux?
The nvidia-settings GUI is a bit fickle. It may not work for you. Try copying and pasting the following command into a Linux terminal, don't delete the quotes. If the command fails then please copy and paste the output here. If it runs you'll see the GPU temp and fan speed displayed and updated every 2 seconds.
watch -t "nvidia-settings -t -q [gpu:0]/GPUCoreTemp && nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeed"
The temp will be on the top;the fan speed will be below.
BOINC FAQ Service
Official BOINC wiki
Installing BOINC on Linux
Thank you for the detailed explanation Dagorath.
If I use "nvidia-settings -t -q [gpu:0]/GPUCoreTemp" it returns without outputting anything.
"nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeed" returns
ERROR: Invalid Fan 0 specified in query '[fan:0]/GPUCurrentFanSpeed' (there are only 0 Fans on this Display).
nvidia-smi says:
As far as memory:
Perhaps some memory is not getting released. I'll reboot after this is posted and see if the numbers are the same.
Joe
On boot up I see:
running boinc even though activity is set to Use GPU Never
Use GPU Always:
I'll let it go until this task completes and check again.
Joe
I'll try it and report back.
Joe