Problems with new ABP2cuda23 calculations ?

Richard de Lhorbe

Joined: 15 Dec 05

Posts: 46

Credit: 9584156580

RAC: 1228437

2 Aug 2010 23:01:09 UTC

Topic 195220

Recently the ABP2cuda23 workunits were stopped, and then they started again but with a shorter calculation time, perhaps about one-half of what it was before. They seemed to be calculating fine for a while, but earlier today they started to be reported with "Error while Computing", and of course only after pretty much reaching the end of the calculations (on my machine, about 1 hour 12 minutes on average). I first thought there was just one bad work unit, but all of them are now coming up with this error. See work units 78855915, 78848870 or 78845539 for example (there are others as well). I cannot read the task output files easily myself, but maybe some experts out there can see something quickly.

I also noticed that instead of saying the work units were using 1 CPU and 1 GPU (as they have for months now), they now say 0.29 CPU and 1 GPU. While this is certainly better for computer utilization, has something changed in the code recently that might be causing this error?

Cheers, Richard

Jord

Joined: 26 Jan 05

Posts: 2952

Credit: 5893653

RAC: 0

2 Aug 2010 23:10:48 UTC

Message 98780

See this thread on the changes on ABP2.

As for your error, I see that they end with:

Maximum elapsed time exceeded

And then the ABP2 application seems to crash. Not a very nice ending.
It's something the Einstein developers have to look into.

Crashed executable name: einsteinbinary_ABP2_5.11_i686-apple-darwin__ABP2cuda23
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.6.4 build 10F569
Mon Aug  2 07:00:45 2010

0 0x0018b9f0 SIGPIPE: write on a pipe with no reader
SIGPIPE: write on a pipe with no reader
1 0x001798b0 SIGPIPE: write on a pipe with no reader
2 0x0017b00f SIGPIPE: write on a pipe with no reader
3 0x0017b23b SIGPIPE: write on a pipe with no reader
4 0x9976781d SIGPIPE: write on a pipe with no reader
5 0x997676a2
Thread 1 crashed with X86 Thread State (32-bit):
eax: 0xffffffe1 ebx: 0x00000003 ecx: 0xb0003afc edx: 0x9973a0fa
edi: 0x00000000 esi: 0x00000000 ebp: 0xb0003b38 esp: 0xb0003afc
ss: 0x0000001f efl: 0x00000206 eip: 0x9973a0fa cs: 0x00000007
ds: 0x0000001f es: 0x0000001f fs: 0x0000001f gs: 0x00000037

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4343

Credit: 252788952

RAC: 41097

3 Aug 2010 8:45:33 UTC

Message 98781

Could be related to the new scheduler. We're looking into that.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4343

Credit: 252788952

RAC: 41097

3 Aug 2010 12:53:41 UTC

Message 98782 in response to message 98780

Quote:

And then the ABP2 application seems to crash.

I think that this is an (unwanted) side effect of the Client killing the application because it ran longer than the Client expected.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4343

Credit: 252788952

RAC: 41097

3 Aug 2010 12:59:39 UTC

Message 98783

Can you reconstruct from your Client messages when you received the task in question?

When we changed the scheduler yesterday there were a few minutes when it ran with wrong plan class settings. You probably got it during that time. This is our fault, sorry for that. Things should be in order again.

Richard de Lhorbe

Joined: 15 Dec 05

Posts: 46

Credit: 9584156580

RAC: 1228437

4 Aug 2010 2:52:05 UTC

Message 98784 in response to message 98783

I checked the task list, as I had rebooted the computer as part of my diagnostics and lost all the current messages. I got a whole pile of work units (121 tasks in fact) as follows : the first one that gave an error was sent 2 Aug 2010 5:29:18 UTC and the last one I have was sent 2 Aug 2010 12:45:53 UTC, but the vast majority (about 110 of them) were in the 12:24 to 12:26 area.

I've run 26 of them so far, all with the same error. Right now I have the rest of them suspended ... should I abort them and let the client reload ?

Regards
Richard

Gundolf Jahn

Joined: 1 Mar 05

Posts: 1079

Credit: 341280

RAC: 0

4 Aug 2010 8:35:49 UTC

Message 98785 in response to message 98784

Quote:

I checked the task list, as I had rebooted the computer as part of my diagnostics and lost all the current messages.

Those messages are stored in the file stdoutdae.txt in your BOINC data directory. That is the file name on Windows, but it should be similar on other platforms.
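If scrolling through that file by hand is tedious, a few lines of Python can pull out the relevant entries. This is only a minimal sketch: the function name `find_log_lines` is made up for illustration, the location of the BOINC data directory differs per platform and installation, and the search string is simply whatever fragment of the task or application name you want to look for.

```python
def find_log_lines(log_path, needle):
    """Return every line of a text log that contains `needle` (case-insensitive)."""
    needle = needle.lower()
    matches = []
    # errors="replace" keeps the search going if the log contains odd bytes.
    with open(log_path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if needle in line.lower():
                matches.append(line.rstrip("\n"))
    return matches


# Example call (the path is an assumption; adjust it to your installation):
# for hit in find_log_lines("/path/to/BOINC/stdoutdae.txt", "ABP2"):
#     print(hit)
```

The matching lines come back in file order, so the first hit for a task name should be close to the time the task was received.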

Quote:

Right now I have the rest of them suspended ... should I abort them and let the client reload ?

If you are adventurous, you could edit your client_state.xml file, but be careful and make backup copies before you try, because errors can trash your whole cache. I found the following in a thread on the SETI forum (Message 1019942):

Quote:

The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all with 3. That boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors.
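To see where the "factor of 4 at least" and "at least 34" come from, the arithmetic can be checked directly. The sketch below only illustrates the effect of putting the digit 3 in front of a positive number written as text. The quote as preserved here does not show which field is being edited, so the code works on a bare number string, and the sample values and function names are hypothetical.

```python
from decimal import Decimal


def prepend_three(value: str, times: int = 1) -> str:
    """Put `times` copies of the digit 3 in front of a positive decimal string."""
    return "3" * times + value


def boost(value: str, times: int = 1) -> Decimal:
    """Factor by which the number grows after prepending the 3s."""
    return Decimal(prepend_three(value, times)) / Decimal(value)


# Hypothetical values, chosen only to illustrate the arithmetic.
for v in ["9999", "1000", "250000.000000"]:
    print(v, "once:", boost(v, 1), "twice:", boost(v, 2))
```

A number with d digits before the decimal point is smaller than 10^d, and prepending a 3 adds 3 x 10^d to it, so the result is always more than 4 times the original. Prepending two 3s adds 33 x 10^d, which gives more than 34 times. The smallest gain occurs when the leading digits are all 9s.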

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)
