Today the computation errors came back even running 1 at a time. The only saving grace is they continue to bomb out in ~1 min. So I had an 11-day run without an error; now I'm getting a bunch.
Richie wrote: All errored out with CL_MEM_OBJECT_ALLOCATION_FAILURE.
I'm sorry for your pain, but thanks for reporting.
I notice that all the latest tasks being received are labeled VelaJr1 (i.e. looking at the Vela Pulsar) but in my case, I have a big enough cache that it will be a little while yet before those reach the top of the queue. I still have quite a few G34731 tasks to go (running 3x). I've suspended all those that are waiting and have allowed a single VelaJr1 task to start when one of the current three finishes.
That's just happened and so far no problems. Two G34731 plus one VelaJr1 are running without a problem. When the two old tasks finish, I'll allow a 2nd VelaJr1 task to start. If that is OK, I'll allow a third.
20 mins later:
Well, the 2nd VelaJr1 task started OK so I allowed both to run for a while and then tried a third. That failed after 7 secs whilst the other two kept going. The GPU is a 4GB RX 570, so it looks like you might need (roughly) close to 2GB per concurrent task. I remember someone (Zalster, I think, or perhaps Keith) mentioning that nvidia crippled the memory use on consumer-grade cards being used for compute so as to force the use of their much more expensive professional range for that purpose. If true, that might be why both 3GB and 2GB nvidia cards are having problems with these tasks. So betreger's "poison pills" might more appropriately be described as nvidia's "dirty tricks" :-(.
The two running VelaJr1 tasks have now finished successfully. They took 55 mins and 53 mins respectively, so with two running concurrently that's one task completed every ~27 mins on average, based on just those two. In other words, the effective run time per task has roughly doubled.
The previous G34731 tasks had been taking ~36 mins at 3x, so ~12 mins per task. So it looks like I'll need to change back to 2x when the G34731 tasks finally run out :-(. That machine will be going from a G34731 task every ~12 mins to a VelaJr1 task every ~27 mins. I'll need to watch closely in case 2x turns out not to be viable after all.
Thanks for the "heads up" guys!
Cheers,
Gary.
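For anyone wanting to try the 2x/3x multiplicity Gary mentions, that is normally set with an app_config.xml in the Einstein@Home project directory. Here is a minimal sketch; the app name einstein_O2MD1 is an assumption, so check the actual GW GPU app name in client_state.xml or on the project's applications page before using it.

<!-- app_config.xml, placed in the Einstein@Home project directory.
     gpu_usage 0.5 tells BOINC to run 2 of these tasks per GPU (use 0.33 for 3x).
     The app name below is an assumption; verify it in client_state.xml. -->
<app_config>
  <app>
    <name>einstein_O2MD1</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

After saving the file, use BOINC Manager's "Read config files" option (or restart the client) for the change to take effect.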
Both Keith and I have commented on the RAM availability on NVIDIA cards. It's hard-locked at 27% of the total RAM of the card. He and I both reviewed a white paper on the subject several years back.
I'm not getting the "CL_MEM_OBJECT_ALLOCATION_FAILURE" errors, but with my 3GB GTX 1060s, 21-26 secs after starting a VelaJr1 task the task jumps to 100% and I get "exited with zero status but no 'finished' file" as well as "If this happens repeatedly you may need to reset the project", and then the task just restarts again.
Ha, I've reset the project twice and I'm still getting the same results, but it only seems to happen with these VelaJr1 tasks, so I'm guessing that my problem could be related to this.
Cheers.
Wiggo, try restarting your computer and see if it goes away.
Excuse my jumping in here, but I think I have the same problem. Just got home from work and found 59 errors on my main system which has two GPUs. All of the errors are VelaJr1 tasks, and all failed on the GTX 1060 3GB.
I am going to try to figure out how to exclude that GPU from Einstein work and see if I get any errors on the other card, a 1660 with 6GB. That will have to wait until tomorrow; right now I'm in the Einsteinian Dog House :-(
If anyone wants to take a look, this is the host:
https://einsteinathome.org/host/12820614
You need to add an exclusion to your BOINC cc_config.xml (the <exclude_gpu> option) so the 3GB card isn't used for the Einstein project. The instructions can be followed at the reference document page.
https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
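For illustration, here is a minimal sketch of that kind of exclusion using the <exclude_gpu> option in cc_config.xml (in the BOINC data directory). The device number is an assumption; the startup messages in the BOINC event log show which number is assigned to the 3GB card, and the project URL should match what your client uses.

<cc_config>
  <options>
    <!-- Stop the listed device from getting Einstein@Home work.
         device_num 0 is an assumption; check the event log at startup
         to see how BOINC numbers the GPUs in this host. -->
    <exclude_gpu>
      <url>https://einsteinathome.org/</url>
      <device_num>0</device_num>
    </exclude_gpu>
  </options>
</cc_config>

Restart the BOINC client after saving it; the excluded card will no longer be given Einstein tasks while the 6GB 1660 keeps crunching as normal.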
Zalster wrote: Wiggo, try restarting your computer and see if it goes away.
Already did that 3 times before the original post, Zalster. :-(
I also set NNT hours ago, but I keep getting sent these "lost tasks" that are also VelaJr1 work, and they all end in the same result even with all CPU tasks suspended. :-(
Cheers.
I've also been getting a lot of errors for the last few days. I have an older GTX 750 Ti with 2 GB of VRAM. When the WU starts, it quickly fills up all the VRAM on the card and errors out. Before these last few days, I was crunching E@H GW tasks successfully. I've detached from the project until this can be resolved.
Wiggo wrote: I'm not getting the "CL_MEM_OBJECT_ALLOCATION_FAILURE" errors, ....
For the tasks on the website that show as computation errors, you certainly were.
You don't see the error message in the event log on your machine. You need to go to the list of tasks on the website and choose any one showing as a computation error. If you click the "Task ID" link for that task, you get to see the whole of the stderr.txt that was returned to the project after the task failed. That is where you get a better idea of what caused the failure. In other words, there was a failure to allocate sufficient memory for the job to run.
Unfortunately, there seems to be enough evidence to suggest that these latest high frequency VelaJr1 branded tasks require more than 3GB of memory to run as singles on nvidia GPUs. If you're in that boat, the safest way to avoid the angst of failed tasks is to switch over to the Gamma-ray Pulsar GPU tasks and deselect the GW stuff.
Get some new GRP tasks before aborting all the old GW tasks. That way, you'll have something to crunch and return if you have to abort so many that you use up your daily quota and the project gives you a 24 hour backoff. As soon as you successfully crunch a GRP task, force an update to return it (even if backed off). That will start to restore your daily limit. Rinse and repeat as necessary. After that, you should hopefully have no further problems.
If anyone wants more details on what people are experiencing, take a look at the messages a little earlier in this thread. My guess is that the project wants to get this work done 'as it is' even if that means excluding GPUs with lower available memory. I doubt they can change the task configuration to avoid the problem, certainly not at short notice. The Devs will see the failures, so hopefully we will get some recommendations/suggestions on how to deal with this once they've had time to analyse the issue.
Cheers,
Gary.