My GTX 980ti keeps crashing on All-Sky Gravitational Wave search on O3 1.04 (GW-opencl-nvidia)

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 91054866
RAC: 187659
Topic 230062

Win 11 Pro 22H2 (version 10.0.22621.2215), nVidia driver 537.13, BOINC 7.24.1. Host: https://einsteinathome.org/host/12880658

The screens go blank and I have to hard-reset the system.  Other CUDA BOINC projects work fine.

A few of the tasks completed, but I aborted the rest. https://einsteinathome.org/workunit/750962034

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118327932465
RAC: 25398127

Have you followed the announcement and subsequent discussion in Tech News?

A number of users have reported GPU VRAM requirements of around 4GB.  You would also need more for the OS.  Your GPU shows as having 6GB so single tasks should work.  Are you attempting to run multiples?
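One way to see how much VRAM is actually free before a task starts is to ask `nvidia-smi`, which ships with the NVIDIA driver. The sketch below is only an illustration: the 4096 MiB figure is an assumption based on the "around 4GB" reported in this thread, and `parse_gpu_memory`, `query_gpu_memory` and `has_headroom` are made-up helper names, not part of BOINC or the Einstein app.

```python
import subprocess

# Assumption: "around 4GB" per GW task, taken here as 4096 MiB.
TASK_VRAM_MIB = 4096


def parse_gpu_memory(csv_text):
    """Parse 'total, used' lines (MiB) from nvidia-smi CSV output without header or units."""
    gpus = []
    for line in csv_text.strip().splitlines():
        total, used = (int(field.strip()) for field in line.split(","))
        gpus.append({"total_mib": total, "used_mib": used, "free_mib": total - used})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return total/used/free VRAM per GPU (needs nvidia-smi on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


def has_headroom(gpu, needed_mib=TASK_VRAM_MIB):
    """True if the GPU currently has at least needed_mib of free VRAM."""
    return gpu["free_mib"] >= needed_mib
```

Calling `query_gpu_memory()` with the desktop and browser open, and again with them closed, would show how much of the 6GB other programs are already holding.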

I looked at the stderr output for a failed task. This is the error message:-

OpenCL Create Context failed with OpenCL error: CL_OUT_OF_HOST_MEMORY

To check this for yourself, just click the Task ID link for an error task in your list of tasks on the website and scroll down to find the actual error message.  Maybe with the more recent tasks there are even higher memory requirements than the reported 4GB.

Cheers,
Gary.

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 91054866
RAC: 187659

I have 64GB of RAM.

There are no error tasks.  My GPU crashes, I reboot, and the task usually succeeds.  I've aborted all other jobs.

The task code should do a better job of resource detection and enforcement before the task starts and, if resources are insufficient, abort the job or throw some kind of exception.

It should not be a system stability crap shoot to run tasks.

GWGeorge007
Joined: 8 Jan 18
Posts: 3117
Credit: 5008566749
RAC: 1562335

lohphat wrote:

I have 64GB of RAM.

There are no error tasks.  My GPU crashes, I reboot, and the task usually succeeds.  I've aborted all other jobs.

The task code should do a better job of resource detection and enforcement before the task starts and, if resources are insufficient, abort the job or throw some kind of exception.

It should not be a system stability crap shoot to run tasks.

Let me try to help...

First, you may have 64GB of RAM in your system, which is well and good, but Gary is referring to the amount of VRAM on your GPU.  That may well be an issue with some BOINC tasks using GPUs.

Second, I don't know if you realize it or not, but your computer(s) is/are hidden.  If you were to un-hide your computer(s) it would be much easier for me and for others, such as Gary, to help you.  We could then see exactly what your system is and also look at other things, like which GPU & CPU tasks you are running and how many.  Right now Gary is only guessing at what you are running, as am I.

And as of now, it is "a crap shoot" (as you call it) to help you figure out what is wrong.

George

Proud member of the Old Farts Association

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118327932465
RAC: 25398127

GWGeorge007 wrote:

... I don't know if you realize it or not, but your computer(s) is/are hidden.

I'm sure he does realise it since he provided a link to the host in question.  It's quite OK for a user to choose whether or not to show their computers. It's their right to privacy.  If they ask questions and provide the link to the host, there's no problem since that link contains a further link to the tasks being run.  That's where I found 3 compute errors.  I got the error message from one of those.

His GTX 980 Ti does have 6GB so in theory he probably could run the GW search without error.  Maybe the crashes are more to do with the age of the GPU or perhaps even with trying to run multiples - that wasn't revealed.  He said he didn't have errors but his task list told a different story.   Yes, he did seem to confuse host RAM with VRAM, but hey, we all make mistakes :-).

Cheers,
Gary.

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 91054866
RAC: 187659

"A number of users have reported GPU VRAM requirements of around 4GB.  You would also need more for the OS."

That's why I listed the system RAM.

Since it's only a single GPU only one GPU task was running.

No CPU tasks were running.

Other BOINC CUDA tasks don't crash the GPU.  My system is stable and I'm able to run 3DMark benchmarks without issue.

The GPU/driver crashes seemed to happen if the OS was running an app which has GPU acceleration (e.g. Firefox).

Are there any other logs which may be in the client not sent back to the server you might need?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48033537244
RAC: 35317493

You again misunderstand. “You need more for the OS” also refers to VRAM, as the OS uses a small amount of VRAM to run the desktop environment. It has nothing to do with system RAM or CPU tasks at all.

Were (or are) you trying to run two or more O3AS gravitational wave tasks at once? The ~4GB VRAM requirement is per task and therefore additive, not absolute. If you run two, for example, you’d need about 8GB of VRAM, which your GPU doesn’t have.
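To put rough numbers on the "additive" point, here is a small back-of-the-envelope sketch. The 4GB per task comes from the reports in this thread; the 0.5GB desktop reserve is purely an assumed figure for illustration, and `max_concurrent_tasks` is a made-up helper name.

```python
def max_concurrent_tasks(vram_gb, per_task_gb=4.0, desktop_gb=0.5):
    """Estimate how many tasks fit in VRAM at once.

    per_task_gb: reported requirement per GW task (about 4GB in this thread).
    desktop_gb:  assumed VRAM held by the OS/desktop; the real figure varies.
    """
    usable_gb = vram_gb - desktop_gb
    return max(0, int(usable_gb // per_task_gb))


# A 6GB card leaves room for one task; two would need roughly 8GB plus the desktop share.
print(max_concurrent_tasks(6.0))
```

By the same arithmetic a 4GB card would not fit even one task once the desktop has taken its share.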
 

“Other BOINC CUDA” tasks are irrelevant as they don’t have the same VRAM requirements. Most projects use fairly little VRAM. Very few use 4+ GB like these GW tasks do.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118327932465
RAC: 25398127

Ian&Steve C. wrote:
were (or are) you trying to run two or more O3AS gravitational tasks at once?

The reply from lohphat mentioned, "Since it's only a single GPU only one GPU task was running."  I guess neither the GPU utilization factor nor an app_config.xml file was being used, so there should have been enough VRAM.  That seems to point more towards a hardware issue, perhaps not even on the GPU.
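For anyone who does want to pin the count explicitly, an app_config.xml with a `gpu_usage` of 1.0 tells the BOINC client to run one task per GPU (0.5 would allow two). The sketch below just builds that XML as a string. The app name passed in is a placeholder, not the real short name of the GW app, which would need to be looked up for the project; `build_app_config` is a made-up helper.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_usage=1.0, cpu_usage=1.0):
    """Return app_config.xml text that reserves gpu_usage of a GPU per task.

    app_name is a placeholder here; use the project's actual short app name.
    gpu_usage=1.0 means one task per GPU, 0.5 would allow two at once.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# "EXAMPLE_APP_NAME" is hypothetical and must be replaced before use.
print(build_app_config("EXAMPLE_APP_NAME"))
```

The resulting file goes in the project's folder inside the BOINC data directory, and the client picks it up after a restart or after re-reading its config files.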

There were 3 compute errors and about 10 completed and validated when I looked.  Most of the rest were aborted.  His comment about an intermittent crash and then the task finishing OK after a restart might indicate the issue is to do with the motherboard.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118327932465
RAC: 25398127

lohphat wrote:
...  That's why I listed the system RAM.

As has already been pointed out, VRAM refers to the memory on the discrete GPU.  Your desktop and any other video related apps you might have been running will also need to share the VRAM.  Having indicated that only single GPU tasks were running, the 980Ti's 6GB should have been sufficient.  However, the error message I quoted did point at insufficient VRAM (quite independent of motherboard RAM) so the next question has to be, "were you also running any other app that might need lots of VRAM?"

From time to time on my computers, I also see system crashes where the GPU task, after a system restart, can successfully complete.  A lot of the time (since I have a lot of legacy systems that have had a more modern GPU added) the cause of the crash turned out to be faulty electrolytic capacitors on the motherboard.  You appear to have a much more modern system so it's unlikely to be that.  I'm just trying to indicate that a suspect component somewhere can trigger random crashes when the job being run puts the system under sufficient stress.  In your case, the GPU itself is probably the oldest component and so you should concentrate on that.

You mention Firefox and video acceleration.  I don't know enough about that.  Virtually all my hosts only run Einstein searches.  They hardly ever have monitors attached and they certainly don't run anything else of significance.

Also, just because other projects don't have these crashes, you can't assume it's an Einstein problem.  The stress put on a machine can vary significantly between different projects.  Other projects may be less stressful on whatever is triggering the crash.  What you describe does sound like some sort of hardware related issue.

As to other logs, the stderr output on the website for any returned task (successfully completed or not) is the best place to try to understand what was happening when things went wrong.  There won't necessarily be an error message, certainly not if the task itself was a victim rather than the instigator of the crash.  If you have known tasks that successfully completed after a system crash, check the stderr output since it will show where the task crashed and what happened after the restart.  If there is no actual error message there, the crash was most likely triggered elsewhere.

You do have 3 error results.  I only checked one.  You should check them all to see if there is a consistent message.
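If you copy the stderr text of each error task from the website into a string, a few lines of Python will tally which OpenCL codes show up and in how many tasks. This is only a rough sketch: it matches any token that looks like `CL_SOMETHING`, so it could also pick up OpenCL constants that aren't errors, and `count_cl_errors` is a made-up helper name.

```python
import re
from collections import Counter

# Matches tokens such as CL_OUT_OF_HOST_MEMORY (any CL_ constant, error or not).
CL_CODE = re.compile(r"\bCL_[A-Z][A-Z0-9_]+\b")


def count_cl_errors(stderr_texts):
    """Count, per OpenCL code, how many task outputs mention it at least once."""
    counts = Counter()
    for text in stderr_texts:
        counts.update(set(CL_CODE.findall(text)))
    return counts


# Example input: the message quoted earlier in this thread, pasted for two tasks.
sample = [
    "OpenCL Create Context failed with OpenCL error: CL_OUT_OF_HOST_MEMORY",
    "OpenCL Create Context failed with OpenCL error: CL_OUT_OF_HOST_MEMORY",
]
print(count_cl_errors(sample))
```

If all three error results report the same code, that points at one repeatable cause rather than random instability.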

When I get problems like this, I usually test any suspect components (e.g. the GPU) by swapping between reasonably similar machines.  Whether the problem transfers or stays will usually indicate which bit is faulty.  Unfortunately, you may not be able to do that.

Cheers,
Gary.

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 91054866
RAC: 187659

I know the difference between VRAM and MB RAM.  It's a stock GTX 980Ti.  It's only running one CUDA task at a time.

The GPU/driver crashed only when running this BOINC app, not with any other BOINC GPU app, stress test, or benchmark.

I can run a full load of CPU tasks concurrently -- no crashes.  Other CUDA GPU tasks from other projects -- no crashes.

The crashes only occur with this BOINC app.  It MIGHT be the driver: the last time I had similar crashes, the cause was the nVidia driver version, but that was years ago.

The OS itself isn't crashing, just the GPU/driver.  Sometimes the driver could restart and bring the screens back, but usually the rest of the OS would continue running with black screens (no signal).

Harri Liljeroos
Joined: 10 Dec 05
Posts: 4458
Credit: 3260997497
RAC: 1852998

Have you tried to run other GPU applications from Einstein? 
