"Exited with zero status but no 'finished' file"

Dagorath
Joined: 22 Apr 06
Posts: 146
Credit: 226423
RAC: 0

Quote:

This is confusing the heck out of me.

I'm getting the errors again, at least in Linux. I've been running Windows for a bit and it seems to be doing OK with these CUDA tasks.

I found one program that reports a temperature for the GPU; it seems to run at about 65C when processing and about 35C when idle. For the life of me I can't see any reason why the NVidia software doesn't report temperature.

Are you referring to the Nvidia software in Linux?

The nvidia-settings GUI is a bit fickle. It may not work for you. Try copying and pasting the following command into a Linux terminal, and don't delete the quotes. If the command fails, please copy and paste the output here. If it runs, you'll see the GPU temp and fan speed displayed and updated every 2 seconds.

watch -t "nvidia-settings -t -q [gpu:0]/GPUCoreTemp && nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeed"

The temp will be on top; the fan speed will be below it.
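
If the watch one-liner misbehaves, the same two queries can be polled from a small Python script. This is only a rough sketch: the helper names (build_query, parse_reading, read_attribute, poll) are made up for illustration, it reuses only the -t and -q options from the command above, and it treats empty output as "no reading" rather than assuming a number always comes back.

```python
import subprocess
import time


def build_query(target, attribute):
    """Build the nvidia-settings argument list for one terse (-t) query (-q)."""
    return ["nvidia-settings", "-t", "-q", f"[{target}]/{attribute}"]


def parse_reading(output):
    """Return the first all-digit token in the output, or None if there is none."""
    for token in output.split():
        if token.isdigit():
            return int(token)
    return None


def read_attribute(target, attribute):
    """Run one query; return None if the tool is missing, fails, or prints nothing."""
    try:
        result = subprocess.run(build_query(target, attribute),
                                capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return parse_reading(result.stdout)


def poll(interval_seconds=2):
    """Print GPU temperature and fan speed every interval_seconds until Ctrl-C."""
    while True:
        temp = read_attribute("gpu:0", "GPUCoreTemp")
        fan = read_attribute("fan:0", "GPUCurrentFanSpeed")
        print(f"temp: {temp}  fan: {fan}")
        time.sleep(interval_seconds)
```

Calling poll() starts the loop; a value of None means the query produced nothing usable.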

Quote:
If it's a hardware failure it sure is particular. I can do accelerated 3D rendering, video decoding and FFTs without a problem.

Maybe those activities do not work the GPU in the same way Einstein tasks do. Or maybe they don't raise the temperature the way Einstein tasks do. My GPU temp went through the roof on its first CUDA task because the fan didn't speed up on Linux. Maybe yours is doing the same and that's why it's failing on Linux but not Windows.

Quote:

As far as error reports on the tasks, I still don't get how to match a task in the BOINC messages with a report from http://einstein.phys.uwm.edu/result.php?

For example the message says:

Sun 03 Jul 2011 02:26:18 PM PDT Einstein@Home Task PM0147_034B1.dm_120_1 exited with zero status but no 'finished' file

but the task says:

Name	PM0147_034B1.dm_120_1
Workunit	100178936
Created	3 Jul 2011 6:31:47 UTC
Sent	3 Jul 2011 21:18:17 UTC
Received	3 Jul 2011 21:27:34 UTC
Server state	Over
Outcome	Client error
Client state	Compute error
Exit status	-226 (0xffffffffffffff1e)

It's the right time and the names match but the "zero status" doesn't look right.

If those names match is it the right task?

Yes, if the names match it's the same task. The "zero status" and "Exit status -226" appear to disagree, but that's because they come from two different sources. The "zero status" is the exit status reported by the Einstein application to the BOINC client. The client relays that status to the manager, which then includes it in the Messages.

The "Exit status -226" is the exit status the client gives to the task. Notice I am implying the task and the Einstein app are distinct. You can look up task exit codes in the BOINC FAQ (see the link in my sig) to see what they mean. For -226 you would come to this page where we see:

Quote:

ERR_TOO_MANY_EXITS -226

An application has exited prematurely (unexpectedly) more than 99 times without generating a checkpoint, so giving up on that task.
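
As an aside, the hex value shown next to the exit status on the task page, 0xffffffffffffff1e, is the same -226 written as a 64-bit two's-complement number, not a second error code. A quick Python check (the function names are just for illustration):

```python
def as_unsigned_64(value):
    """Reinterpret a signed integer as a 64-bit two's-complement bit pattern."""
    return value & 0xFFFFFFFFFFFFFFFF


def as_signed_64(pattern):
    """Reverse of as_unsigned_64: read a 64-bit pattern as a signed integer."""
    return pattern - (1 << 64) if pattern >= (1 << 63) else pattern


print(hex(as_unsigned_64(-226)))                 # 0xffffffffffffff1e
print(as_signed_64(as_unsigned_64(-226)))        # -226
```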

So now the question is "why did the Einstein app exit prematurely?" One very common reason is that the client or OS gets busy with some other job for more than 30 seconds and doesn't communicate with the app during that time. The app then thinks BOINC has crashed, so it exits without writing its finished file. When BOINC gets "unbusy" it sees that the app has exited without writing its finished file (the signal that the app is finished crunching the task), so it starts the app again. This can repeat more than 99 times, at which point BOINC kills the app and gives the task a -226 exit status. If it's due to the client not communicating with the app for 30 seconds, you'll also see the message "no heartbeat from client" in the stderr, which is the lower portion of the page you linked to here.
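
To make that restart loop concrete, here is a toy model in Python. It is not BOINC's actual code: the function name, the constant names and the way a run is represented are all invented for illustration, and it ignores checkpoints entirely. Only the limit of 99 premature exits and the -226 code come from the FAQ entry quoted above.

```python
ERR_TOO_MANY_EXITS = -226   # code from the BOINC FAQ entry quoted above
MAX_PREMATURE_EXITS = 99    # "more than 99 times" per the same entry


def run_task(runs):
    """Replay a sequence of app runs and return the task's final exit status.

    Each element of `runs` is True if that run wrote its 'finished' file and
    False if the app exited with zero status but left no 'finished' file.
    """
    premature_exits = 0
    for wrote_finished_file in runs:
        if wrote_finished_file:
            return 0                      # task completed normally
        premature_exits += 1              # the client would restart the app here
        if premature_exits > MAX_PREMATURE_EXITS:
            return ERR_TOO_MANY_EXITS     # the client gives up on the task
    return None                           # still in progress, no verdict yet


print(run_task([False] * 5 + [True]))   # 0
print(run_task([False] * 100))          # -226
print(run_task([False] * 99))           # None
```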

I don't see any "no heartbeat from client" messages on that page so I assume the app is exiting for some other reason. The reason could be this excerpt which does repeat many times in the stderr:

[ERROR] Couldn't allocate 25163828 bytes of CUDA HSP texture memory (error: 2)!
[ERROR] Demodulation failed (error: 1006)!
[WARN ] CUDA memory allocation problem encountered!
------> Returning control to BOINC, delaying restart for at least five minutes...

Maybe your video card doesn't have 25163828 bytes of CUDA HSP texture memory? Maybe a bug in the Linux driver prevents the OS/BOINC/app from seeing that much HSP texture memory? Have you tried other drivers? The more I think about it, the less likely it seems that this error is due to high temperature, because it seems to be occurring before the crunching starts (the app has to allocate memory before it can crunch numbers).
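
A quick way to tell the two cases apart is to count the tell-tale lines in the stderr text. The sketch below is plain Python string matching on text pasted from the task page. The function name and the returned keys are made up for this example, and the sample is just the excerpt above repeated three times.

```python
import re

SAMPLE_STDERR = """\
[ERROR] Couldn't allocate 25163828 bytes of CUDA HSP texture memory (error: 2)!
[ERROR] Demodulation failed (error: 1006)!
[WARN ] CUDA memory allocation problem encountered!
------> Returning control to BOINC, delaying restart for at least five minutes...
"""

ALLOC_PATTERN = re.compile(r"Couldn't allocate (\d+) bytes of CUDA")


def summarize_stderr(text):
    """Count heartbeat losses and CUDA allocation failures in a stderr dump."""
    alloc_sizes = [int(size) for size in ALLOC_PATTERN.findall(text)]
    return {
        "heartbeat_losses": text.lower().count("no heartbeat from client"),
        "alloc_failures": len(alloc_sizes),
        "largest_request_mib": max(alloc_sizes, default=0) / 2**20,
    }


summary = summarize_stderr(SAMPLE_STDERR * 3)
print(summary["heartbeat_losses"])                 # 0
print(summary["alloc_failures"])                   # 3
print(round(summary["largest_request_mib"], 1))    # 24.0
```

So the request that fails is only about 24 MiB, which is worth keeping in mind when comparing it with whatever free memory the driver reports.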

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Thank you for the detailed explanation Dagorath.

Now that you mention it I did update to the latest driver around the time this started to happen. I needed it to run Ubuntu 11.04. I'll see if I can find an older driver that works.

The trouble with getting the temperature out of the darn thing is frustrating.

If I use "nvidia-settings -t -q [gpu:0]/GPUCoreTemp" it returns without outputting anything.

"nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeed" returns

ERROR: Invalid Fan 0 specified in query '[fan:0]/GPUCurrentFanSpeed' (there are only 0 Fans on this Display).

nvidia-smi says:

    Fan Speed                   : N/A
    Temperature
        Gpu                     : N/A

As far as memory:

    Memory Usage
        Total                   : 511 Mb
        Used                    : 255 Mb
        Free                    : 256 Mb


Perhaps some memory is not getting released. I'll reboot after this is posted and see if the numbers are the same.

Joe

joe areeda

On boot up I see:

    Memory Usage
        Total                   : 511 Mb
        Used                    : 81 Mb
        Free                    : 429 Mb

Running BOINC, even though activity is set to Use GPU Never:

    Memory Usage
        Total                   : 511 Mb
        Used                    : 127 Mb
        Free                    : 384 Mb

Use GPU Always:

    Memory Usage
        Total                   : 511 Mb
        Used                    : 366 Mb
        Free                    : 145 Mb
 

I'll let it go until this task completes and check again.
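
For anyone following along, here is a small Python sketch that pulls the numbers out of a "Memory Usage" block and compares the free memory with the allocation the app asked for. The function names are invented for this example, it assumes the nvidia-smi layout shown above, and it assumes "Mb" means 2^20 bytes, which is not confirmed.

```python
import re

FAILED_REQUEST_BYTES = 25163828  # size from the "Couldn't allocate" stderr line

USE_GPU_ALWAYS = """\
    Memory Usage
        Total                   : 511 Mb
        Used                    : 366 Mb
        Free                    : 145 Mb
"""


def parse_memory_usage(text):
    """Return {'Total': n, 'Used': n, 'Free': n} in Mb from an nvidia-smi block."""
    fields = re.findall(r"^\s*(Total|Used|Free)\s*:\s*(\d+)\s*Mb", text, re.MULTILINE)
    return {name: int(value) for name, value in fields}


def request_fits(usage, request_bytes, bytes_per_mb=2**20):
    """True if the reported free memory is at least as large as the request."""
    return usage["Free"] * bytes_per_mb >= request_bytes


usage = parse_memory_usage(USE_GPU_ALWAYS)
print(usage)                                        # {'Total': 511, 'Used': 366, 'Free': 145}
print(request_fits(usage, FAILED_REQUEST_BYTES))    # True
```

A True here only says the total free figure exceeds the request; it doesn't show that the allocation would actually succeed.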

Joe

Dagorath

The nvidia-settings utility has been a tad quirky in my experience, and the same goes for nvidia-smi. They seem to work slightly differently depending on the driver. Let us know how things go with a different driver.

joe areeda

As far as I can tell everything else works fine, but BOINC is the only CUDA app I run, besides some toying I've been doing to learn how to address it and some video processing in an unrelated project.

I believe I need at least version 270. These are the relatively current drivers available:

Name                                                         Version   Release Date
Linux x64 (AMD64/EM64T) Display Driver (NVIDIA Recommended)  270.4106  April 20, 2011
Linux x64 (AMD64/EM64T) Display Driver                       260.1944  March 7, 2011
Linux x64 (AMD64/EM64T) Display Driver (BETA)                270.26    February 21, 2011
Linux x64 (AMD64/EM64T) Display Driver                       260.1936  January 21, 2011
Linux x64 (AMD64/EM64T) Display Driver (BETA)                270.18    January 21, 2011
Linux x64 (AMD64/EM64T) Display Driver                       260.1929  December 13, 2010


Any recommendation on which might work better than 270.4106?

Joe

Dagorath

I recently downgraded to 260.1944 from 270.4106. With 270.4106 nvidia-smi output N/A for most of the attributes including fan speed and temp. It works fine with 260.1944. Those are the only ones I've tried. 260.1944 is adequate for Einstein and GPUgrid, not sure about the other stuff you're working on.

joe areeda

I'll try it and report back.

Joe

joe areeda

Well, now I remember why I felt I HAD to upgrade the NVidia drivers: with the 260 drivers, 11.04 doesn't boot for me. I get:

This server has a video driver ABI version of 10.0 that is not
supported by this NVIDIA driver.  Please check
http://www.nvidia.com/ for driver updates or downgrade to an X
server with a supported driver ABI.

So I'm not sure what to do.

This laptop doesn't do that much crunching per day through the GPU.

Can you point me to instructions to disable GPU tasks for this machine?

Joe

Dagorath

Hmmm. There may be a problem. I've searched through some of the archive at the NVIDIA site and so far I haven't found a Linux driver recommended for your card.

From an image of GPUz running in one of your earlier posts I gather your card is model NVS3100M with the GT218 chip. From there I went to NVIDIA's UNIX Driver Portal Page, where I clicked the green "Archive" link at the bottom of the "Linux x86/IA32" section. On that page each driver version is listed under a bold black "Linux Display Driver - x86" link. Clicking any of those links takes you to a page with a tab named "Supported Products". I haven't tried every driver, but so far the ones I've looked at (270.4106 and 260.1944) do not list NVS3100M or GT218 as a supported product. That's probably why nvidia-settings and nvidia-smi gave such weird output with 270.4106.

Hopefully one of the other drivers is recommended. If not then it looks like you'll be able to use that GPU only on Windows.

Dagorath

Quote:

Well now I remember why I felt I HAD to upgrade the NVidia drivers

with the 260 drivers 11.04 doesn't boot for me.

$%#*& I was going to point you to a procedure to follow that would allow you to back out the driver if X server refused to boot. Sorry, I forgot.

Quote:

I get:

This server has a video driver ABI version of 10.0 that is not
supported by this NVIDIA driver.  Please check
http://www.nvidia.com/ for driver updates or downgrade to an X
server with a supported driver ABI.

So I'm not sure what to do.

I'm not sure what "video driver ABI" means and I'm not sure how to downgrade the X server.

If you're left at a command prompt or terminal login then you can uninstall the driver by running nvidia-uninstall as root. If you can't get to a command prompt then you might try booting with your Ubuntu install disk and see if you can do a rescue or boot to the command line or something. Hopefully it'll mount the drive for you so you can run nvidia-uninstall. If that doesn't work, the only thing I can recommend is reinstalling Ubuntu.

Quote:

This laptop doesn't do that much crunching per day through the GPU.

Can you point me to instructions to disable GPU tasks for this machine?

Not sure what you mean. You can go to Einstein preferences on your account page and deselect CUDA tasks. Or do you mean boot the OS in such a way that it won't use the GPU? No, I don't think you can do that. Or do you mean something else?
