As of today I have been getting the following error:
<message> exceeded elapsed time limit 7912.18 (2880000.00G/364.00G)</message>
The computer with the issue is https://einsteinathome.org/host/12815268
The error doesn't occur all the time, and there has been a task that ran longer than the 7912 second limit without erroring out. I am also unsure why such a short time limit is in place to begin with.
Edit: I am not sure if it makes a difference... but I am running a Ryzen 3700X with a GTX 750 Ti on Linux Mint. Also, after reading the thread from earlier in the month about GW GPU errors (mainly asking about memory), I noticed that my rig takes a lot longer to process these tasks than others with the same card.
There are several different known pulsars being targeted as potential sources of continuous GW. Each one gives different crunching behaviour - such as crunch time estimates and time limits - so there wouldn't be a single fixed time limit for all tasks. Initial task estimates are a lot shorter than the true crunch time, which means the time limits are probably also far shorter than they should be.
There have been volunteer comments in the past about the underestimates for crunch time so I'm sure the Devs are aware of this. I don't know why this hasn't been addressed. No explanation has been offered that I've noticed.
The time limit is probably some fixed multiple of the estimated 'work content' of a task - hence the problem for you if your GPU is running slower than it should be.
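For what it's worth, the two numbers in the error message quoted above look consistent with that idea. Here is a minimal back-of-the-envelope sketch in Python (this is my guess at how the client derives the limit, not anything official, and the variable names are just for illustration):

# Guess at how the elapsed time limit is derived: the workunit's floating-point
# bound divided by the estimated speed of the app version on this host.
rsc_fpops_bound = 2880000.00e9   # first number in the message, in FLOP (G = 1e9)
flops_estimate = 364.00e9        # second number, estimated speed in FLOP/s

elapsed_time_limit = rsc_fpops_bound / flops_estimate
print(f"elapsed time limit ~ {elapsed_time_limit:.0f} s")   # ~7912 s, matching the message

If the card's real throughput is well below that estimated speed, the task simply can't finish inside the limit, which is why a slower-than-expected GPU trips it.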
If you are using your CPU cores to run CPU tasks for other projects, it could be that your GPU doesn't have enough CPU support. If so, you could try reducing the number of cores BOINC is allowed to use by one to see if that allows your GPU tasks to run faster.
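If you want to try that, the simplest knob is the "Use at most X% of the CPUs" computing preference. A tiny helper (nothing BOINC-specific, just the arithmetic) to work out the percentage that leaves some threads free:

# Work out the "Use at most X% of the CPUs" value that leaves a given
# number of threads free for GPU support (illustrative helper only).
def cpu_percent(total_threads: int, threads_to_free: int) -> float:
    return 100.0 * (total_threads - threads_to_free) / total_threads

print(cpu_percent(16, 1))   # a 16-thread Ryzen 3700X with one thread kept free -> 93.75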
Cheers,
Gary.
@Gary as you mentioned that there are multiple GW sources being targeted, I looked again at the WUs and the problem WUs seem to only be the G34731 tasks. So this supports your theory as to the underlying problem.
I did check on the issue with CPU load. The default is 0.9 cores per task; giving it one dedicated core showed no improvement, but giving it two dedicated cores did show improvement. Nonetheless, I feel the run time limit is too low, and I would rather not run this app on my computer, as I can get better use out of my CPU cores than babysitting a mis-tuned app.
I have the error too with the G34731 tasks. I dug into the debugger log and noticed some deprecation warnings that I didn't find in the other tasks I have done. The error also points to an unhandled exception in KERNELBASE.dll, which was the main cause of the failure. Maybe that's what Gary was mentioning about the devs being aware of this.
Here is a possibly related error on my machine, although many others in the quorum had difficulty too. The Vela Junior pulsar is being analysed. The error report also mentions an unhandled exception, in this case an 'access violation', plus there is a deprecation warning as well.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
In response to the two previous messages: the "deprecation warning" messages are quite normal. I've been seeing them for months and they don't lead to any problems. They seem to be harmless and have been reported previously. There's been no comment from the Devs about them that I've noticed.
The "TIME LIMIT EXCEEDED" errors have been seen before, usually for less capable GPUs that aren't really up to the job. They also are more likely to occur if the tasks are taking a lot longer than was initially thought. There was an example of this quite some time ago when VelaJr tasks were taking about double the time that was anticipated. As a result, the estimates were doubled for those as was the credit award - if I remember correctly, as it was a while ago.
We have been doing more VelaJr tasks recently - one of my hosts is still doing them. I run 3 at a time on an RX 570 which is a mid-range discrete GPU at best. Three tasks get finished in about 36 mins - ie. ~12 mins per task. There are a number of new G34731 tasks coming through so I've promoted a couple of those to crunch 'out of order' to see what the crunch time is like. At the moment, the first of these is 50% complete after 30 mins so around a full hour to complete.
So it looks like this new batch may take close to twice as long as the previous VelaJr tasks. They were actually estimated at half the time so it looks like the Devs may have to make some more adjustments to the estimates and correspondingly, to the time limit before the task is terminated. I'll send a PM to Bernd and ask him to have a look at this.
Cheers,
Gary.
I do hope the Devs extend the time limit; if they do, I may go back to running these tasks.
Thanks for the note. I'm still waiting for feedback from the scientists on that new setup. For the time being I doubled the "flops estimation" (and credit), which should also double the runtime limit (for newly generated workunits, sorry).
BM
Thanks Bernd.
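For anyone wondering what that means in practice, here are the numbers from the first post in this thread (just a back-of-the-envelope check, assuming the limit really is the bound divided by the estimated speed):

# Doubling the flops estimation doubles the fpops bound, and with it the runtime limit.
old_bound = 2880000.00e9           # FLOP, from the original error message
host_speed = 364.00e9              # FLOP/s, estimated speed from the same message
print(old_bound / host_speed)      # ~7912 s, the old limit
print(2 * old_bound / host_speed)  # ~15824 s, the limit after doubling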
Cheers,
Gary.
I wonder if some of the tasks require more memory than an Nvidia card with 2 GB is able to offer in practice. My 2 GB GTX 960s run only one task at a time per card. In the last couple of days, all three of these hosts have started to see noticeably more computation errors. Tasks crash in about 100 seconds. Here's an example: https://einsteinathome.org/task/935886632
In the stderr there is always, right at the start, this line showing how the problem began:
XLAL Error - XLALComputeECLFFT_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:1248): Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE
I've seen exactly the same thing happening for others running other 2 GB Nvidia cards... for example GTX 1050, 950, 760 and 660 models. On the other hand, I don't think I've yet seen it happening with Nvidia cards that have 4 GB or more.
I know some of these GW GPU tasks fill the GPU memory so that almost all of the 2 GB is in use even when nothing other than BOINC is open and only one task is running. But I'm starting to think that some tasks require even more memory... and the problem might be that the project server isn't able to exclude hosts without enough memory from getting those large tasks. So basically this is the same situation that already existed earlier with 1 GB cards.
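If anyone wants to see how much memory OpenCL actually reports for their card, here is a quick sketch (it assumes the pyopencl Python package is installed and only reads device properties; it says nothing about what a given task needs):

# Print the global memory and maximum single allocation each OpenCL device reports.
# Requires the pyopencl package; purely informational.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(f"{device.name}: "
              f"{device.global_mem_size / 1024**3:.2f} GiB global memory, "
              f"max single allocation {device.max_mem_alloc_size / 1024**3:.2f} GiB")

Note that the maximum single allocation is often only a fraction of the total, which matters for the big FFT buffers these tasks seem to use.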
Another thing... the number of validate errors has started accumulating in the last few days. But I see that might involve many users and many cards... upper 1000- and 900-series Nvidia cards running many different driver versions on both Windows and Linux. Naturally, my AMD cards have had many validate errors already, but they never seem to keep themselves out of trouble.
I think you're right. Some research on CL_MEM_OBJECT_ALLOCATION_FAILURE indicates that it signals a generic failure to find enough available memory for a given request at a certain time. That could mean the card's memory is too small and/or that previously allocated buffers, no longer needed, haven't been released/deallocated. But FFTs are memory-hungry beasts, so the former is likely. Interestingly, the error may not be emitted when the memory is requested but when the memory is first used (so-called lazy allocation by the OpenCL implementation).
{ This computer has a 2GB Nvidia card and shows this failure mode in three recent Vela Junior tasks. }
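Here is a minimal sketch of that lazy-allocation behaviour (assuming the pyopencl Python package; the buffer size is just an illustration and has nothing to do with what the Einstein app actually requests):

# Depending on the OpenCL implementation, a large buffer request may fail
# immediately at creation, or only when a command first touches the buffer -
# which is when CL_MEM_OBJECT_ALLOCATION_FAILURE tends to appear.
# Purely illustrative; sizes are made up.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
device = ctx.devices[0]

size = int(device.max_mem_alloc_size)   # ask for the largest single allocation advertised

try:
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size)   # may "succeed" even if memory is tight
    # First real use of the buffer - with lazy allocation this is where a failure surfaces.
    cl.enqueue_copy(queue, buf, np.zeros(1024, dtype=np.float32))
    queue.finish()
    print("allocation and first use succeeded")
except cl.Error as err:
    print("OpenCL error:", err)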
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal