I have three hosts with GPUs which have been running Parkes work with the non-Beta application happily since yesterday.
On noticing that two of them got BRP6-Beta-cuda32-nv301 work this morning, I somewhat extended the requested queue size for all three, updated, waited for things to settle, then suspended the running non-Beta work to get an early result on the Beta (which, unlike some other beta work, was sent with an ordinary deadline and so did not automatically run at high priority).
On two of the three hosts, all of the beta tasks (approximately ten) started and promptly errored out.
Here is an example stderr from one offending host, and here is a representative one from the other.
Both contain text like this:
7.4.36
Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)
Activated exception handling...
[06:42:27][8388][INFO ] Starting data processing...
[06:42:27][8388][ERROR] No suitable CUDA device available!
[06:42:27][8388][ERROR] Demodulation failed (error: 1001)!
06:42:27 (8388): called boinc_finish(1001)
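For what it's worth, the sequence in that log looks like the app bailing out as soon as it fails to find a usable CUDA device at startup. A rough sketch of that kind of check (purely illustrative, against the standard CUDA runtime and BOINC APIs, not the actual BRP6 source) would be:

#include <cstdio>
#include <cuda_runtime.h>
#include "boinc_api.h"

// Hypothetical startup check -- not the real BRP6 code. If no CUDA device
// is usable, the work unit is aborted via boinc_finish(), which would
// produce an exit-code-1001 result after only a few seconds, as above.
static bool cuda_device_available() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    return (err == cudaSuccess) && (count > 0);
}

int main() {
    boinc_init();
    fprintf(stderr, "[INFO ] Starting data processing...\n");
    if (!cuda_device_available()) {
        fprintf(stderr, "[ERROR] No suitable CUDA device available!\n");
        fprintf(stderr, "[ERROR] Demodulation failed (error: 1001)!\n");
        boinc_finish(1001);   // task ends almost immediately with exit code 1001
    }
    // ... normal demodulation work would follow here ...
    boinc_finish(0);
    return 0;
}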
The so-far-successful host is pretty similar to the other two: all are Windows 7 machines running Nvidia GPUs with the 344.60 driver. Possibly interesting is that the happy host is running BOINC version 7.3.11 while the two unhappy ones are running 7.4.27 and 7.4.36.
trouble with BRP6-Beta-cuda32-nv301
I got the exact same error message on 14 work units, after 2 seconds each, on two GTX 750 Tis, also on BOINC 7.4.36 (x64), running Win7 64-bit (347.52 drivers).
Bumped the two betas on host
Bumped the two betas on host 1001562 to run immediately.
WinXP/32, BOINC v6.12.34, GTX 750 Ti factory overclock (no additional tuning). Running 2-up - the two tasks have survived their first 10 and 5 minutes respectively. Previous tasks run in the same configuration show some loose change under 10 hours - that'll do as a speed check.
RE: ... suspended the
I'd like a little more exploration of that scenario, please.
One of the updates that Bernd referred to in the release thread was an API bugfix I drew to his attention. Some older applications (most commonly OpenCL, but one never knows) missed the 'request suspend' command if it was issued while the GPU support code was in a critical section. Could you perhaps re-try the two machines which failed, and
1) confirm that the 'suspended' tasks have truly exited and freed up the GPU memory, as they are supposed to.
2) try allowing the previous tasks to finish naturally (which is what I did, with short-running SETI tasks) and seeing if the replacement BRP6-Beta tasks start then.
Just trying to narrow down the trigger points for this error.
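To illustrate the mechanism I mean, here is a minimal sketch against the standard BOINC API (not the actual BRP6 source; do_gpu_work_slice is a made-up placeholder): suspend and quit requests only take effect while the app is outside a critical section, so code that never re-checks after leaving one can miss a request that arrived while it was inside.

#include "boinc_api.h"

void do_gpu_work_slice();   // hypothetical: one slice of the work unit

void worker_loop(int num_slices) {
    for (int i = 0; i < num_slices; i++) {
        boinc_begin_critical_section();   // GPU buffers in flux: suspend/quit deferred
        do_gpu_work_slice();
        boinc_end_critical_section();     // deferred requests can act from here on

        // Apps that handle their own process control poll the status flags
        // between slices and should honour any pending request promptly.
        BOINC_STATUS status;
        boinc_get_status(&status);
        if (status.quit_request || status.abort_request) {
            boinc_finish(0);   // exit cleanly so the GPU memory is freed
        }
    }
}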
RE: I'd like a little more
I'll try. Sadly, my "success-oriented" first try did not involve suspending all save one of my beta WUs, so the failure blew through my entire supply. Subsequently I disabled Beta on those hosts. I'll attempt to get some fresh Beta work. If I succeed, I'll first try the cleanest possible case: suspend everything, do a full cold reboot of the machine, then enable a single Beta WU. If that runs I can crawl up from there.
OK, re-enabling beta request,
OK, re-enabling beta request, and extending my requested queue size got me some fresh Beta.
So on a first trial host I suspended ALL WUs, then did a full shutdown (power off), followed by reboot.
I then unsuspended a single beta task.
It promptly (3 seconds reported run time) errored out. The event log lines in boincmgr read:
The lines of likely interest in stderr read:
Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)
Activated exception handling...
[09:41:51][4664][INFO ] Starting data processing...
[09:41:51][4664][ERROR] No suitable CUDA device available!
[09:41:51][4664][ERROR] Demodulation failed (error: 1001)!
09:41:51 (4664): called boinc_finish(1001)
So this appears to replicate the previous error. I confess I did not follow your specific recipe, Richard, but my intention was to jump straight to the scenario most likely to succeed. I'm game to try other things, but for the moment I've returned the host to processing the queue in order, which means non-beta Parkes work for several days.
I'll entertain suggestions on possibly useful trials in light of this additional result.
Got two v1.47 Beta running on
Got two v1.47 Beta running on host 831490 - clean pickup after other tasks finished.
That's Windows 7/32, BOINC v7.4.36, GTX 470 - which eliminates a couple of worries people had (Win7, BOINC v7.4) - though only at 32 bit. Unfortunately my Win 7/64, GTX 670 host was poorly this morning: running now, but one GPU down, so I'll explore further in the morning before trying this app. Don't want to confuse the mix with potentially bad hardware.
I got 2 beta tasks so
I got 2 beta tasks so far:
One on my Server 2008 machine with a GTX 580 and BOINC 7.4.36. Completed OK, not validated yet.
One on my Win7 machine with a 7970 and BOINC 7.4.36. Running OK, 15% at the moment.
Both the above tasks are now
Both the above tasks are now completed and pending validation. However, there is one issue: performance.
Normally I run 3 simultaneous tasks on both the 7970 and the GTX 580. The 7970 is considerably faster, taking about 12k seconds per BRP6 task; the GTX 580 takes about 18k seconds.
I ran these new version 1.47 beta tasks as single tasks, and the GTX 580 was faster than the 7970: 4k seconds against 6k seconds. If I multiply these numbers by 3 and compare them with the non-beta application times, I get:
GTX 580 : Old:18k Beta:12k
HD 7970 : Old:12k Beta:18k
So the new app seems much better for CUDA and much worse for OpenCL.
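In case it helps anyone sanity-check that multiplication, a trivial back-of-envelope snippet using just the rough figures quoted above (nothing here is measured):

#include <cstdio>

int main() {
    // Old app, 3 tasks running concurrently: each task takes roughly this long,
    // so a batch of three completes in about the same wall time.
    const double old_7970 = 12000.0, old_580 = 18000.0;

    // Beta v1.47 run 1-up: single-task times, so three tasks back to back
    // cost three times as much.
    const double beta_7970 = 3 * 6000.0, beta_580 = 3 * 4000.0;

    printf("HD 7970 : old %.0fk  beta %.0fk  (seconds per 3 tasks)\n",
           old_7970 / 1000, beta_7970 / 1000);
    printf("GTX 580 : old %.0fk  beta %.0fk  (seconds per 3 tasks)\n",
           old_580 / 1000, beta_580 / 1000);
    return 0;
}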
We were warned by the powers
We were warned by the powers that be to expect much more variability driven by data in the WU for this beta application than for the stock one.
As in my configurations I have seen negligible variability on stock for both Perseus and Parkes, I did not pay this warning much mind, thinking that even a 10-fold increase in variability would still not be much.
But in my initial trials, all of which have run 2X, I'm starting to see some really remarkable unit-to-unit differences in elapsed time and in CPU time consumed by the support application. I'll post more details later, but for the moment I'll just warn that basing conclusions on single units, or even small samples, may be badly misleading if taken as the average behavior of the full ensemble.
We've been spoiled on Einstein by application/WU data combinations with excellent repeatability, allowing performance tuning and conclusions from tiny samples. That era may be over for BRP6.
RE: The lines of likely
I'm seeing the same error when I try these beta tasks on my GTX660Ti in Win7 x64.
I tried upgrading the graphics driver to the latest (347.52), but to no avail.
I've tried to bump the beta tasks to the top of the queue by suspending all Nvidia GPU tasks and then resuming a single beta task; it then fails after about 3 seconds with the above error. I even tried having one BRP4G task running with all other Nvidia GPU tasks suspended and then resuming one beta task, but that made no difference.
I've also noted that the stderr in the online database is empty, although the info is available in both the client_state.xml and sched_request_einstein.phys.uwm.edu.xml files. Copies of both files are saved if anyone is interested.