trouble with BRP6 -Beta-cuda32-nv301

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2958172931
RAC: 713142

Quote:
We were warned by the powers that be to expect much more variability ...


And so it came to pass. First two results are back from 1001562: although they both ran at the same time (maybe ~5 mins stagger in start times) on the same GTX 750 Ti, there's a 20% difference in run time, and a 400% difference in CPU time.

Mind you, even the slower of the two tasks took practically half the runtime of the v1.39 application that was running before, and did even better on CPU time.

Stef
Joined: 8 Mar 05
Posts: 206
Credit: 110568193
RAC: 0

I can't get the Linux version 1.47 working. All tasks errored out.

[23:51:28][2115][INFO ] Application startup - thank you for supporting Einstein@Home!
[23:51:28][2115][INFO ] Starting data processing...
[23:51:28][2115][ERROR] Couldn't initialize CUDA driver API (error: 100)!
[23:51:28][2115][ERROR] Demodulation failed (error: 1020)!
23:51:28 (2115): called boinc_finish(1020)

This is with the recent 346.47 driver:
http://einsteinathome.org/task/487117512
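For anyone decoding the Linux log above: assuming the number in "Couldn't initialize CUDA driver API (error: 100)" is the raw CUresult returned by cuInit() (an assumption, since the app's source isn't quoted here), 100 is CUDA_ERROR_NO_DEVICE in the CUDA driver API, i.e. the driver reported no CUDA-capable device at all. A minimal Python sketch; the helper decode_cuda_init_error and its partial lookup table are illustrative only, not part of any Einstein@Home or BOINC tool:

```python
import re
from typing import Optional

# Partial table of CUresult values from the CUDA driver API (cuda.h);
# only a few entries are listed here for illustration.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    101: "CUDA_ERROR_INVALID_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}

# Matches the app's stderr line as it appears in the log above.
INIT_FAILURE = re.compile(r"Couldn't initialize CUDA driver API \(error: (\d+)\)")


def decode_cuda_init_error(stderr_line: str) -> Optional[str]:
    """Hypothetical helper: return the CUresult name for an init-failure line, else None."""
    match = INIT_FAILURE.search(stderr_line)
    if match is None:
        return None
    code = int(match.group(1))
    return CU_RESULT_NAMES.get(code, f"unlisted CUresult {code}")


line = "[23:51:28][2115][ERROR] Couldn't initialize CUDA driver API (error: 100)!"
print(decode_cuda_init_error(line))  # prints CUDA_ERROR_NO_DEVICE
```

If that reading is right, the Linux failure is the same "no usable device" symptom the Windows hosts report, just surfaced one step earlier.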

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 1

Same issue here with my GTX660Ti host 10698787

7.4.36

Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)


Activated exception handling...
[06:42:27][8388][INFO ] Starting data processing...
[06:42:27][8388][ERROR] No suitable CUDA device available!
[06:42:27][8388][ERROR] Demodulation failed (error: 1001)!
06:42:27 (8388): called boinc_finish(1001)

All beta tasks have failed within the first 2-3 seconds of starting. Using driver 340.52 here. I have set two of my AMD GPU machines to allow the beta app and thus far (only a few minutes) they are running correctly.
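The "Recursion too deep; the stack overflowed." line is most likely not a real stack overflow. The app itself logged "No suitable CUDA device available!" and called boinc_finish(1001); 1001 is 0x3e9, and on Windows system error 1001 happens to be ERROR_STACK_OVERFLOW, whose standard message text is exactly that line. The assumption here is that the BOINC client simply looked up the task's exit status as a Windows error code. If so, the stack-overflow message and the "No suitable CUDA device" error are one failure, not two. A minimal sketch of the two readings; explain_exit_status is a hypothetical helper, and the app-side meanings are inferred only from the logs quoted in this thread:

```python
# Windows system error 1001 (0x3E9) is ERROR_STACK_OVERFLOW in winerror.h;
# this is its standard message text.
WINDOWS_SYSTEM_MESSAGES = {
    1001: "Recursion too deep; the stack overflowed.",
}

# App-level meanings, inferred only from the stderr lines quoted in this thread.
APP_MESSAGES = {
    1001: "No suitable CUDA device available!",
    1020: "Couldn't initialize CUDA driver API",
}


def explain_exit_status(code: int) -> str:
    """Hypothetical helper: show both readings of a task's exit status."""
    windows_text = WINDOWS_SYSTEM_MESSAGES.get(code, "no Windows text listed")
    app_text = APP_MESSAGES.get(code, "no app meaning listed")
    return (
        f"exit code {code} (0x{code:x}): "
        f"app log says {app_text!r}; "
        f"Windows lookup says {windows_text!r}"
    )


print(explain_exit_status(1001))
```

The hex formatting reproduces the "(0x3e9)" shown in the BOINC stderr, which is the clue that the message is a lookup of the number rather than a diagnosis.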

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 507128262
RAC: 80408

So far, all WUs have ended with exit status 1001 and an empty stderr.

Edit: sorry, I forgot: both PCs are Windows x64 machines with NVIDIA cards and the latest available non-beta drivers. One runs Win7 and the other Win10.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2958172931
RAC: 713142

Add my GTX 670 host 5744895 to the list of hosts that can't run the v1.47 Beta but can run the previous v1.39.

Win7/64, BOINC v7.4.36 - it's the first 64-bit machine I've tried the Beta on, but I don't think that's the whole story.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223904931
RAC: 1003522

Bernd posted in Technical News that he had made a change which should help with certain problems, but which would require a project reset on affected hosts to propagate. Since he mentioned the "No suitable CUDA device" error, which is part of our syndrome (though his reference to "New clients" suggested he might mean a different problem), I wondered whether this "fix" was for the problem discussed here, and did a project reset on one of my two affected clients. All this got me was the complete loss of my work in progress, followed by a quick succession of further failures on newly downloaded beta work, of the excess recursion/stack overflow/no suitable device type we discuss here. I quickly disabled acceptance of beta applications to get the machine back to constructive work on non-beta Parkes.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2958172931
RAC: 713142

Let me do some digging around to figure out exactly what he changed, and whether it applies duct tape to either of our problems ("Recursion too deep; the stack overflowed." or "No suitable CUDA device available!"). The latter was the only one he mentioned, and I've added my comments to the Tech thread.

That won't be until later this afternoon, when the current tasks finish.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 728115804
RAC: 1197322

Hi all

Many thanks for testing this, and sorry for the inconveniences.

The problem reported here with the BRP6 CUDA Beta apps is widespread enough that I have deprecated these app versions for the moment; we will look into this in more detail on Monday.

I suspect that our update of the BOINC API library version, compared to the official app, plays some role in this. The part of the BOINC API code that selects the "right" GPU for an app instance to run on has always been a bit tricky and has undergone many changes, so I would not be surprised to find that something broke (again) in the interplay between this mechanism and the part of our app code that deals with it.
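To make the suspected failure mode concrete: as far as I understand BOINC, the client tells a GPU app which device to use either through the gpu_device_num field of init_data.xml (newer clients) or a --device N command-line argument (older clients), and the app must then map that number onto the devices the driver actually exposes. The sketch below is a hypothetical model of that hand-off, not the BOINC API or the Einstein@Home code; select_gpu and its parameters are invented for illustration. It shows how any disagreement between the assigned index and what the app can see (including zero visible devices) ends in the same "No suitable CUDA device available!" error:

```python
from typing import Optional, Sequence


def select_gpu(init_data_device: Optional[int],
               argv: Sequence[str],
               device_count: int) -> int:
    """Hypothetical app-side GPU selection; returns a device index or raises.

    init_data_device: device number the client wrote to init_data.xml
                      (None if absent, e.g. with an older client).
    argv:             the app's command line, which may carry "--device N".
    device_count:     number of CUDA devices the app itself can see.
    """
    device = init_data_device
    if device is None:
        # Fallback for older clients: look for a "--device N" pair.
        for flag, value in zip(argv, argv[1:]):
            if flag == "--device":
                device = int(value)
                break
    if device is None:
        device = 0  # assumption: default to the first GPU
    if not 0 <= device < device_count:
        # The assigned index does not exist from the app's point of view.
        raise RuntimeError("No suitable CUDA device available!")
    return device


print(select_gpu(None, ["app", "--device", "1"], device_count=2))  # 1
```

In this model the init_data.xml value wins over the command line, so a library update that changes which source is read, or how it is numbered, is enough to turn a working host into one that fails within seconds.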

Again, thanks for testing; detecting problems like these is the whole point of a Beta test. I have left the OpenCL versions of the Beta apps active for the moment.

Cheers
HB

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2958172931
RAC: 713142

Deprecating the app makes it a touch difficult to carry on testing and isolating the problem ;) :P

Never mind, I've got the Beta files now, and enough experience with 'rebranding' tasks from one app_version to another, to leave me some scope for experimenting later. I'll do my best to only trash one task at a time...

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 728115804
RAC: 1197322

I think we have already nailed down the problem pretty well, and I really don't want to flood volunteers over the weekend with failing tasks until their daily quota runs out. Thanks for all the useful input!

Cheers
HB
