trouble with BRP6 -Beta-cuda32-nv301

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958172931

RAC: 713142

27 Feb 2015 22:43:41 UTC

Message 130424 in response to message 130422

Quote:

We were warned by the powers that be to expect much more variability ...

And so it came to pass. First two results are back from 1001562: although they both ran at the same time (maybe ~5 mins stagger in start times) on the same GTX 750 Ti, there's a 20% difference in run time, and a 400% difference in CPU time.

Mind you, even the slower of the two tasks is practically half the runtime of the v1.39 application that was running before - and even better on CPU time.

Stef

Joined: 8 Mar 05

Posts: 206

Credit: 110568193

RAC: 0

27 Feb 2015 23:22:37 UTC

Message 130425

I can't get the Linux version 1.47 working. All jobs errored out.

[23:51:28][2115][INFO ] Application startup - thank you for supporting Einstein@Home!
[23:51:28][2115][INFO ] Starting data processing...
[23:51:28][2115][ERROR] Couldn't initialize CUDA driver API (error: 100)!
[23:51:28][2115][ERROR] Demodulation failed (error: 1020)!
23:51:28 (2115): called boinc_finish(1020)

With the recent driver 346.47
http://einsteinathome.org/task/487117512

Gavin

Joined: 21 Sep 10

Posts: 191

Credit: 40644337738

RAC: 1

28 Feb 2015 9:47:26 UTC

Message 130426

Same issue here with my GTX660Ti host 10698787

7.4.36

Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)

Activated exception handling...
[06:42:27][8388][INFO ] Starting data processing...
[06:42:27][8388][ERROR] No suitable CUDA device available!
[06:42:27][8388][ERROR] Demodulation failed (error: 1001)!
06:42:27 (8388): called boinc_finish(1001)

All beta tasks have failed within the first 2-3 seconds of starting. Using driver 340.52 here. I have set two of my AMD GPU machines to allow the beta app and thus far (only a few minutes) they are running correctly.
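The stderr excerpts quoted so far in this thread share one line format: `[time][pid][LEVEL] message`, followed by a `called boinc_finish(N)` line. Below is a minimal sketch, assuming that format holds for other hosts too, of pulling the error messages and exit code out of such an excerpt so reports can be compared quickly. The function name `summarize` and the sample constants are hypothetical and not part of BOINC or the Einstein@Home app. If the `100` in the Linux log is a raw CUDA driver API result code, it corresponds to `CUDA_ERROR_NO_DEVICE`, which would make it the same complaint as the Windows "No suitable CUDA device available!" message.

```python
import re

# Matches app log lines such as
# "[23:51:28][2115][ERROR] Demodulation failed (error: 1020)!"
LOG_LINE = re.compile(r"^\[(\d\d:\d\d:\d\d)\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")
ERROR_CODE = re.compile(r"\(error:\s*(\d+)\)")
FINISH = re.compile(r"called boinc_finish\((\d+)\)")


def summarize(stderr_text):
    """Return (error_messages, error_codes, exit_code) for one stderr excerpt."""
    messages, codes, exit_code = [], [], None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        m = LOG_LINE.match(line)
        if m and m.group(3) == "ERROR":
            messages.append(m.group(4))
            codes.extend(int(c) for c in ERROR_CODE.findall(m.group(4)))
        f = FINISH.search(line)
        if f:
            exit_code = int(f.group(1))
    return messages, codes, exit_code


# Excerpts copied from the posts above (Linux host, then Windows host).
LINUX_LOG = """\
[23:51:28][2115][INFO ] Application startup - thank you for supporting Einstein@Home!
[23:51:28][2115][INFO ] Starting data processing...
[23:51:28][2115][ERROR] Couldn't initialize CUDA driver API (error: 100)!
[23:51:28][2115][ERROR] Demodulation failed (error: 1020)!
23:51:28 (2115): called boinc_finish(1020)
"""

WINDOWS_LOG = """\
[06:42:27][8388][INFO ] Starting data processing...
[06:42:27][8388][ERROR] No suitable CUDA device available!
[06:42:27][8388][ERROR] Demodulation failed (error: 1001)!
06:42:27 (8388): called boinc_finish(1001)
"""

if __name__ == "__main__":
    for name, log in (("linux", LINUX_LOG), ("windows", WINDOWS_LOG)):
        messages, codes, exit_code = summarize(log)
        print(name, "exit code:", exit_code, "error codes:", codes)
        for msg in messages:
            print("   ", msg)
    # The hexadecimal 0x3e9 in the Windows report is the same exit code, 1001.
    print(hex(1001))
```

Run on the two excerpts, this shows that the first ERROR line is the informative one in both cases; the "Demodulation failed" line only repeats the code that ends up in `boinc_finish`.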

Alex

Joined: 1 Mar 05

Posts: 451

Credit: 507128262

RAC: 80408

28 Feb 2015 11:04:51 UTC

Message 130427

So far all WUs have ended with exit status 1001 and an empty stderr.

Edit: sorry, I forgot: both PCs are Windows x64 machines with NVIDIA cards and the latest available non-beta drivers. One is Win7 and one is Win10.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958172931

RAC: 713142

28 Feb 2015 13:09:52 UTC

Message 130428

Add my GTX 670 host 5744895 to the list which can't run v1.47 Beta, but can run previous v1.39.

Win7/64, BOINC v7.4.36 - it's the first 64-bit machine I've tried the Beta on, but I don't think that's the whole story.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7223904931

RAC: 1003522

28 Feb 2015 14:13:46 UTC

Message 130429

Bernd posted in Technical News that he had made a change which should help certain problems, but would require project reset on the affected hosts to propagate the change. As he mentioned the "No suitable CUDA device" error, which is part of our syndrome (though he also referred to "New clients" suggesting he might be referring to another), I wondered if this "fix" was for the problem discussed here, and did a project reset on one of my two affected clients. All this got me was complete loss of my work in progress and a quick succession of additional failures on newly downloaded beta work of the excess recursion/stack overflow/no suitable device type we discuss here. I quickly disabled acceptance of beta applications, to get the machine back in constructive work on non-beta Parkes.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958172931

RAC: 713142

28 Feb 2015 14:46:54 UTC

Message 130430 in response to message 130429

Quote:

Bernd posted in Technical News that he had made a change which should help certain problems, but would require project reset on the affected hosts to propagate the change. As he mentioned the "No suitable CUDA device" error, which is part of our syndrome (though he also referred to "New clients" suggesting he might be referring to another), I wondered if this "fix" was for the problem discussed here, and did a project reset on one of my two affected clients. All this got me was complete loss of my work in progress and a quick succession of additional failures on newly downloaded beta work of the excess recursion/stack overflow/no suitable device type we discuss here. I quickly disabled acceptance of beta applications, to get the machine back in constructive work on non-beta Parkes.

Let me do some digging around to figure out exactly what he changed, and whether it applies duct tape to either of our problems ("Recursion too deep; the stack overflowed." or "No suitable CUDA device available!"). The latter was the only one he mentioned, and I've added my comments to the Tech thread.

Won't be until later this afternoon, when current tasks finish.

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 728115804

RAC: 1197322

28 Feb 2015 15:10:01 UTC

Message 130431

Hi all

Many thanks for testing this, and sorry for the inconveniences.

The problem reported here with the BRP6 CUDA Beta apps is sufficiently widespread that I have deprecated these app versions for the moment; we will look into this in more detail on Monday.

I have a suspicion that our update of the BOINC API lib version, compared to the official app, plays some role in this. The part of the BOINC API code that selects the "right" GPU for an app instance to run on has always been a bit tricky and has undergone many changes; I would not be surprised to find that something broke (again) between this mechanism and our app code that deals with it.

Again, thanks for testing; detecting problems like these is the whole point of a Beta test. I have left the OpenCL versions of the Beta apps active for the moment.

Cheers
HB

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958172931

RAC: 713142

28 Feb 2015 15:24:02 UTC

Message 130432 in response to message 130431

Deprecating the app makes it a touch difficult to carry on testing and isolating the problem ;) :P

Never mind, I've got the Beta files now, and enough experience with 'rebranding' tasks from one app_version to another, to leave me some scope for experimenting later. I'll do my best to only trash one task at a time...

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 728115804

RAC: 1197322

28 Feb 2015 16:42:15 UTC

Message 130433 in response to message 130432

Quote:

Deprecating the app makes it a touch difficult to carry on testing and isolating the problem ;) :P

Never mind, I've got the Beta files now, and enough experience with 'rebranding' tasks from one app_version to another, to leave me some scope for experimenting later. I'll do my best to only trash one task at a time...

I think we have already nailed down the problem pretty well, and I really don't want to flood volunteers with failing tasks over the weekend until their daily quota runs out. Thanks for all the useful input!

Cheers
HB
