Problems with new ABP2cuda23 calculations ?

Richard de Lhorbe
Joined: 15 Dec 05
Posts: 46
Credit: 9584156580
RAC: 1228437
Topic 195220

Recently the ABP2cuda23 workunits were stopped, and then they started again with a shorter calculation time, perhaps about half of what it was before. They seemed to be calculating fine for a while, but earlier today they started to be reported with "Error while Computing", and of course only after pretty much reaching the end of the calculations (on my machine, about 1 hour 12 minutes on average). I first thought there was just one bad work unit, but all of them are now coming up with this error. See work units 78855915, 78848870 or 78845539 for example (there are others as well). I cannot read the task output files easily myself, but maybe some experts out there can see something quickly.

I also noticed that instead of saying the work units were using 1 CPU and 1 GPU (as they have for months now), they now say they are using 0.29 CPU and 1 GPU. While this is certainly better for computer utilization, has something changed in the code recently that might be causing this error?

Cheers, Richard

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 0

See this thread on the changes on ABP2.

As for your error, I see that they end with:

Maximum elapsed time exceeded

And then the ABP2 application seems to crash. Not a very nice ending.
It's something the Einstein developers have to look into.

Crashed executable name: einsteinbinary_ABP2_5.11_i686-apple-darwin__ABP2cuda23
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.6.4 build 10F569
Mon Aug  2 07:00:45 2010

0 0x0018b9f0 SIGPIPE: write on a pipe with no reader
SIGPIPE: write on a pipe with no reader
1 0x001798b0 SIGPIPE: write on a pipe with no reader
2 0x0017b00f SIGPIPE: write on a pipe with no reader
3 0x0017b23b SIGPIPE: write on a pipe with no reader
4 0x9976781d SIGPIPE: write on a pipe with no reader
5 0x997676a2
Thread 1 crashed with X86 Thread State (32-bit):
eax: 0xffffffe1 ebx: 0x00000003 ecx: 0xb0003afc edx: 0x9973a0fa
edi: 0x00000000 esi: 0x00000000 ebp: 0xb0003b38 esp: 0xb0003afc
ss: 0x0000001f efl: 0x00000206 eip: 0x9973a0fa cs: 0x00000007
ds: 0x0000001f es: 0x0000001f fs: 0x0000001f gs: 0x00000037

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4343
Credit: 252788952
RAC: 41097

Could be related to the new scheduler. We're looking into that.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4343
Credit: 252788952
RAC: 41097

Message 98782 in response to message 98780

Quote:
And then the ABP2 application seems to crash.

I think that this is an (unwanted) side effect of the Client killing the application because it ran longer than the Client expected.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4343
Credit: 252788952
RAC: 41097

Can you reconstruct from your Client messages when you received the task in question?

When we changed the scheduler yesterday there were a few minutes when it ran with wrong plan class settings. You probably got it during that time. This is our fault, sorry for that. Things should be in order again.

BM

Richard de Lhorbe
Joined: 15 Dec 05
Posts: 46
Credit: 9584156580
RAC: 1228437

Message 98784 in response to message 98783

I checked the task list, as I had rebooted the computer as part of my diagnostics and lost all the current messages. I got a whole pile of work units (121 tasks in fact): the first one that gave an error was sent 2 Aug 2010 5:29:18 UTC and the last one I have was sent 2 Aug 2010 12:45:53 UTC, but the vast majority (about 110 of them) were sent between about 12:24 and 12:26.

I've run 26 of them so far, all with the same error. Right now I have the rest of them suspended ... should I abort them and let the client reload?

Regards
Richard

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Message 98785 in response to message 98784

Quote:
I checked the task list, as I had rebooted the computer as part of my diagnostics and lost all the current messages.


Those messages are stored in the file stdoutdae.txt in your BOINC data directory. That is the file name on Windows, but it should be similar on other platforms.
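
If the log is long, a small script can pull out just the lines of interest. The following is only a minimal sketch: the function name `find_log_lines` is made up for illustration, and it does not assume anything about the log's line format beyond it being plain text. It simply keeps the lines that contain a given piece of text, such as a project name or part of a task name.

```python
def find_log_lines(log_path, needle):
    """Return the lines of a text log that contain `needle` (case-insensitive).

    `log_path` is whatever path your own client log lives at; the file name
    and location differ between platforms, so nothing is hard-coded here.
    """
    needle = needle.lower()
    matches = []
    # errors="replace" keeps the scan going even if the log holds odd bytes.
    with open(log_path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if needle in line.lower():
                matches.append(line.rstrip("\n"))
    return matches
```

For example, `find_log_lines(path_to_your_log, "ABP2")` would list every logged line that mentions ABP2, which should include the messages written when the tasks were received.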

Quote:
Right now I have the rest of them suspended ... should I abort them and let the client reload?


If you are adventurous, you could edit your client_state.xml file, but be careful and make backup copies before you try, because errors can trash your whole cache. I found the following in a thread on the SETI forum (Message 1019942):

Quote:
The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all with 3. That boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors.
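
The "factor of 4" and "34" figures in that quote can be checked with a little arithmetic, on one assumption: that the replace puts a digit 3 in front of each stored number. This is my reading of the quote, which does not show the field being edited. The sketch below works on plain number strings only, and the sample values are hypothetical, not taken from a real client_state.xml.

```python
def prefix_digit(number_text, digit="3"):
    """Put one digit in front of a number written as text, e.g. '9.99e15' -> '39.99e15'."""
    return digit + number_text


def boost_factor(number_text, times=1, digit="3"):
    """Ratio between the prefixed number and the original after `times` prefixes."""
    boosted = number_text
    for _ in range(times):
        boosted = prefix_digit(boosted, digit)
    return float(boosted) / float(number_text)


# Hypothetical sample values: the gain is smallest when the number starts with 9.
worst_case = "9.99e15"
best_case = "1000000000000.000000"

print(boost_factor(worst_case))           # a little over 4
print(boost_factor(worst_case, times=2))  # a little over 34
print(boost_factor(best_case))            # 31.0
```

A number whose leading part lies between 1 and 10 becomes 30 plus that part, so one prefix multiplies it by more than 4 and two prefixes by more than 34, which matches the figures quoted.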


Regards,
Gundolf

Computers aren't everything in life. (Just kidding)
