Errors with GravWave Search - GPU

Werinbert
Joined: 31 Dec 12
Posts: 20
Credit: 100156387
RAC: 0
Topic 221469

As of today I have been getting the following error:

<message>
exceeded elapsed time limit 7912.18 (2880000.00G/364.00G)</message>

 
The computer with the issue is https://einsteinathome.org/host/12815268

The error doesn't occur all the time and there has been a task longer than the 7912 second limit that did not error out. I am also unsure why such a short time limit is in place to begin with.

Edit: I am not sure if it makes a difference, but I am running a Ryzen 3700X/GTX 750Ti on Linux Mint. Also, after reading the thread from earlier in the month about GW - GPU errors (mainly asking about memory), I noticed that my rig takes a lot longer to process the tasks than others with the same card.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118392062050
RAC: 25659814

Werinbert wrote:
The error doesn't occur all the time and there has been a task longer than the 7912 second limit that did not error out.

There are several different known pulsars being targeted as potential sources of continuous GW.  Each one gives different crunching behaviour - such as crunch time estimates and time limits - so there wouldn't be a single fixed time limit for all tasks.  Initial task estimates are a lot shorter than the true crunch time so that means that time limits are also probably far shorter than they should be as well.

There have been volunteer comments in the past about the underestimates for crunch time so I'm sure the Devs are aware of this.  I don't know why this hasn't been addressed.  No explanation has been offered that I've noticed.

Werinbert wrote:
I am also unsure why such a short time limit is in place to begin with.

The time limit is probably some fixed multiple of the estimated 'work content' of a task - hence the problem for you if your GPU is running slower than it should be.
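For readers who want to check the numbers: the figures in the error message at the top of the thread are consistent with the limit being the task's operation bound divided by the device's estimated speed. The sketch below assumes the two values in parentheses are that bound and that speed, both in units of 10^9. The function name is made up for illustration and is not part of BOINC.

```python
def elapsed_time_limit(fpops_bound_g: float, flops_g: float) -> float:
    """Time limit in seconds, assuming limit = operation bound / device speed.

    Both arguments are in units of 10^9 (the 'G' suffix in the message).
    This relationship is an assumption inferred from the error text.
    """
    if flops_g <= 0:
        raise ValueError("flops_g must be positive")
    return fpops_bound_g / flops_g


# Values copied from the message: (2880000.00G/364.00G)
limit = elapsed_time_limit(2_880_000.0, 364.0)
print(f"{limit:.2f} s")  # prints "7912.09 s"
```

The result is within a tenth of a second of the reported 7912.18, and the small gap is presumably because the 364.00G shown in the message is rounded. If the assumption holds, the limit scales linearly with the bound, so doubling the estimate on the server side would double the allowed run time.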

Werinbert wrote:
... I noticed that my rig takes a lot longer to process the tasks than others with the same card.

If you are using your CPU cores to run CPU tasks for other projects, it could be that your GPU doesn't have enough CPU support.  If so, you could try reducing the number of cores BOINC is allowed to use by one to see if that allows your GPU tasks to run faster.

Cheers,
Gary.

Werinbert
Joined: 31 Dec 12
Posts: 20
Credit: 100156387
RAC: 0

@Gary as you mentioned that there are multiple GW sources being targeted, I looked again at the WUs and the problem WUs seem to only be the G34731 tasks. So this supports your theory as to the underlying problem.

I did check on the issue with CPU load.  Default is 0.9 cores per task and giving it one dedicated core showed no improvement. However, giving it two dedicated cores did show improvement. Nonetheless, I do feel that the run time limit is too low, and I will not run the app on my computer, as I can get better use out of my CPU cores than babysitting a mis-tuned app.
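For anyone wanting to repeat this experiment, one way to reserve CPU cores per GPU task is a per-project app_config.xml file, which BOINC reads from the project's directory. The sketch below only generates such a file's contents. The app name is a placeholder (the real short name has to be looked up, for example in client_state.xml), and the helper function is hypothetical, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return app_config.xml text reserving cpu_usage cores per GPU task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    # gpu_usage 1.0 means one task per GPU; 0.5 would allow two at a time.
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    # cpu_usage is the number of cores BOINC budgets for each GPU task.
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or later
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# "hypothetical_gw_gpu_app" is a placeholder, not the real app name.
print(build_app_config("hypothetical_gw_gpu_app", gpu_usage=1.0, cpu_usage=2.0))
```

Note that cpu_usage only changes how many cores BOINC sets aside when scheduling. It does not limit or increase what the science app itself actually uses.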

 

Arnaldy Medina
Joined: 22 Mar 20
Posts: 1
Credit: 120296
RAC: 0

So I have the error too with the G34731 tasks. I dug into the debugger log and noticed some deprecation warnings that I didn't find in the other tasks that I have done. The error also points to an unhandled exception in KERNELBASE.dll that was the main cause of the error. Maybe that's what Gary was mentioning about the Devs being aware of this.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 324810778
RAC: 188340

Here is a possibly related error on my machine, and many others in the quorum had difficulty too. The Vela Junior pulsar is being analysed. The error report also mentions an unhandled exception, in this case an 'access violation', and there is a deprecation warning as well.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118392062050
RAC: 25659814

In response to the two previous messages, the "deprecation warning" messages are quite normal.  I've been seeing those for months and they don't lead to any problems.  They just seem to be harmless and have been reported previously.  There's been no comment from the Devs about them that I've noticed.

The "TIME LIMIT EXCEEDED" errors have been seen before, usually for less capable GPUs that aren't really up to the job.  They also are more likely to occur if the tasks are taking a lot longer than was initially thought.  There was an example of this quite some time ago when VelaJr tasks were taking about double the time that was anticipated.  As a result, the estimates were doubled for those as was the credit award - if I remember correctly, as it was a while ago.

We have been doing more VelaJr tasks recently - one of my hosts is still doing them.  I run 3 at a time on an RX 570 which is a mid-range discrete GPU at best.  Three tasks get finished in about 36 mins - ie. ~12 mins per task.  There are a number of new G34731 tasks coming through so I've promoted a couple of those to crunch 'out of order' to see what the crunch time is like.  At the moment, the first of these is 50% complete after 30 mins so around a full hour to complete.

So it looks like this new batch may take close to twice as long as the previous VelaJr tasks.  They were actually estimated at half the time so it looks like the Devs may have to make some more adjustments to the estimates and correspondingly, to the time limit before the task is terminated.  I'll send a PM to Bernd and ask him to have a look at this.

Cheers,
Gary.

Werinbert
Joined: 31 Dec 12
Posts: 20
Credit: 100156387
RAC: 0

I do hope the Devs extend the time limit; if so, I may go back to running these tasks.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 251675683
RAC: 35136

Thanks for the note. I'm still waiting for feedback from the scientists on that new setup. For the time being I doubled the "flops estimation" (and credit), which should also double the runtime limit (for newly generated workunits, sorry).

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118392062050
RAC: 25659814

Thanks Bernd.

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I wonder if some of the tasks require more memory than an Nvidia card with 2 GB is able to offer in practice. My 2 GB GTX 960s are running only one task at a time per card. In the last couple of days all three of these hosts have started to hit clearly more computation errors. Tasks crash in about 100 seconds. Here's an example: https://einsteinathome.org/task/935886632

In the stderr output, the problem always starts with this message:

XLAL Error - XLALComputeECLFFT_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:1248): Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE

I've seen exactly this same happening for others running some other 2 GB Nvidia cards... for example GTX 1050, 950, 760, 660 models. On the other hand, I don't think I've yet seen that happening with Nvidia cards with 4 GB or more.

I know some of these GW GPU tasks fill the GPU memory so that almost all of the 2 GB is in use with nothing other than BOINC open and one task running. But I'm starting to think that some tasks require more memory...  and the problem might be that the project server isn't able to exclude hosts without enough memory from getting those large tasks. So basically this is the same situation that already existed earlier with 1 GB cards.

Another thing... the number of validate errors has started accumulating in the last few days. But I see that this might involve many users and many cards... upper 1000- and 900-series Nvidia cards running many different driver versions on both Windows and Linux. Naturally my AMD cards have got many validate errors already. But they never seem to keep themselves out of trouble.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 324810778
RAC: 188340

I think you're right. Some research on CL_MEM_OBJECT_ALLOCATION_FAILURE indicates that it signals a generic failure to find enough available memory for some given request at a certain time. That can mean the card memory is too small and/or that previously allocated buffers, no longer needed, haven't been released/deallocated. But FFTs are memory hungry beasts so the former is likely. Interestingly, the error may not be emitted when the memory is requested but when the memory is first used (so-called lazy allocation by the OpenCL implementation).

{ This computer has a 2GB Nvidia card and shows this failure mode in three recent Vela Junior tasks. }

Cheers, Mike.

