Einstein won't allow other applications to run

TXR13

Joined: 16 Aug 07

Posts: 8

Credit: 5273566

RAC: 0

18 Nov 2007 1:27:33 UTC

Topic 193328

I'm experiencing an issue where Einstein@Home takes full control of BOINC during the night on one of my machines. Even though it should be switching off every hour with the other attached project on that machine, E@H will continue to run until the BOINC daemon is stopped and restarted.

Once the BOINC daemon is manually restarted, the other project immediately starts running and keeps going until the time debt is balanced out again. Then Einstein will seem to swap off as normal for a few cycles, then gets stuck in control again.

If the E@H workunit completes, it will upload and then get stuck in memory, blocking the other project from restarting even though Einstein won't start the next unit because of the time debt. So if Einstein completes a workunit while it's stuck in this condition, the machine will go completely idle until the BOINC daemon is restarted.

I'm running Ubuntu 7.10 32-bit. I have tried updating the BOINC daemon from 5.8.16 to 5.10.21, without success. I have also tried resetting and detaching/reattaching Einstein from my machine, again without success.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119317466268

RAC: 25618304

Einstein won't allow other applications to run

18 Nov 2007 3:11:43 UTC

Message 75460

Quote:

... Einstein@Home takes full control of BOINC ...

It's the other way around. At all times BOINC makes the decision about which project should be allowed to run.

Quote:

... it should be switching off every hour with the other attached project on that machine ...

Not necessarily. If BOINC decides that the currently running project needs more time (for whatever reason) then BOINC will not switch tasks every hour. The most likely reason could be that BOINC thinks the EAH task may be in deadline trouble.

Quote:

Once the BOINC daemon is manually restarted, the other project immediately starts running and keeps going until the time debt is balanced out again. Then Einstein will seem to swap off as normal for a few cycles, then gets stuck in control again.

I don't really understand what you are saying here. Are you saying that by restarting BOINC, the alternate project will start and normal cycling between projects will resume for a while until EAH swaps in for an hour, then swaps out, and then almost immediately swaps in again and continues to run until completely finished? I imagine this is pretty well what would happen if BOINC suddenly thought that the EAH task was in deadline trouble.

Quote:

If the E@H workunit completes, it will upload and then get stuck in memory, blocking the other project from restarting even though Einstein won't start the next unit because of the time debt. So if Einstein completes a workunit while it's stuck in this condition, the machine will go completely idle until the BOINC daemon is restarted.

I really don't understand this. You say the task completes and starts uploading. But then you say the task is stuck in memory?? Are you saying that there is still an active Einstein crunching task showing up as still crunching on the task list (eg top)? Is it accumulating CPU time? Does the upload successfully complete? Perhaps you are talking about the upload process getting stuck rather than the EAH app getting stuck. Uploading is handled by BOINC and not the science app. It would be very helpful if you could describe things in more detail, thanks. A stuck upload doesn't usually have any effect on the crunching of other tasks in your cache of work. I'm sorry, but I'm quite at a loss on this last bit. I can't really recall seeing a machine go completely idle while there was still work in the cache.

Cheers,
Gary.

TXR13

Joined: 16 Aug 07

Posts: 8

Credit: 5273566

RAC: 0

RE: It's the other way

18 Nov 2007 3:24:36 UTC

Message 75461 in response to message 75460

Quote:

It's the other way around. At all times BOINC makes the decision about which project should be allowed to run.

I realize this, though I admit I used somewhat inaccurate language in describing the issue. My apologies. :)

Quote:

Are you saying that by restarting BOINC, the alternate project will start and normal cycling between projects will resume for a while until EAH swaps in for an hour, then swaps out, and then almost immediately swaps in again and continues to run until completely finished?

At the point when I restart BOINC, the long-term debt (and often the short-term debt as well) indicated in the client_state.xml file is around 10000-12000. Because of this outstanding debt, when BOINC is restarted, the alternate project will proceed to run uninterrupted for several hours, which I gather to be normal behavior. When the debt is near zero again, BOINC will swap tasks and run Einstein for an hour, then the alternate project for an hour, and repeat swapping the tasks for at least three iterations of this cycle (Einstein one hour, alternate project one hour). At some point, Einstein simply refuses to leave memory until BOINC is restarted. Even though BOINC attempts to swap projects, Einstein will not exit until the daemon and all children are forced closed.
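For anyone wanting to read the same figures, here is a minimal Python sketch, not a BOINC tool. It assumes the per-project debts sit in `<short_term_debt>` and `<long_term_debt>` elements inside each `<project>` block of client_state.xml, that the file parses as well-formed XML, and that the values are in seconds. The sample data and the helper names `read_debts` and `debt_hours` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; element names are assumed, values are invented.
SAMPLE_STATE = """<client_state>
  <project>
    <project_name>Einstein@Home</project_name>
    <short_term_debt>-12000.0</short_term_debt>
    <long_term_debt>-10000.0</long_term_debt>
  </project>
  <project>
    <project_name>Alternate project</project_name>
    <short_term_debt>12000.0</short_term_debt>
    <long_term_debt>10000.0</long_term_debt>
  </project>
</client_state>"""


def read_debts(state_xml):
    """Return {project name: (short_term_debt, long_term_debt)}."""
    root = ET.fromstring(state_xml)
    debts = {}
    for project in root.iter("project"):
        name = project.findtext("project_name", default="(unnamed)")
        short_term = float(project.findtext("short_term_debt", default="0"))
        long_term = float(project.findtext("long_term_debt", default="0"))
        debts[name] = (short_term, long_term)
    return debts


def debt_hours(debt_seconds):
    """Convert a debt value to hours, assuming it is given in seconds."""
    return debt_seconds / 3600.0


if __name__ == "__main__":
    for name, (short_term, long_term) in read_debts(SAMPLE_STATE).items():
        print(f"{name}: short-term {debt_hours(short_term):.1f} h, "
              f"long-term {debt_hours(long_term):.1f} h")
```

On a real machine the contents of client_state.xml from the BOINC data directory would be passed in instead of the sample string. Under the seconds assumption, a debt of 12000 is a little over three hours.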

Quote:

You say the task completes and starts uploading. But then you say the task is stuck in memory?? Are you saying that there is still an active Einstein crunching task showing up as still crunching on the task list (eg top)? Is it accumulating CPU time? Does the upload successfully complete?

The upload task successfully completes. The last message present in the Messages portion of BOINC indicates that Einstein has finished uploading successfully and the alternate task has resumed processing. However, the alternate task is still listed as "Ready to run", while Einstein is listed as "Ready to report". An examination of the processes running on the machine indicates that BOINC is running, the alternate task has loaded into memory, but Einstein is also still in memory and consuming a lowish number of CPU cycles (about 30% instead of 99.7%). I do not have "Leave applications in memory while preempted" selected in my BOINC preferences.

Once BOINC is restarted, the Einstein and alternate projects do swap out correctly for a while, so only one science app is loaded into memory at any one time.
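The process check described above could be scripted roughly as follows. This is only a sketch: it assumes the listing comes from `ps -eo pid,pcpu,comm`, and the process names in the sample (`einstein_app`, `alternate_app`) and the function name `resident_science_apps` are hypothetical stand-ins, since the real science app binaries have longer, versioned names.

```python
# Sample text in the shape produced by `ps -eo pid,pcpu,comm`.
# The process names below are invented for illustration.
SAMPLE_PS = """
  PID %CPU COMMAND
 4021  0.1 boinc
 4090 30.2 einstein_app
 4112 99.5 alternate_app
"""


def resident_science_apps(ps_output, name_hints=("einstein", "alternate")):
    """Return (pid, cpu_percent, command) for processes matching a hint."""
    matches = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, pcpu, command = parts
        if any(hint in command.lower() for hint in name_hints):
            matches.append((int(pid), float(pcpu), command.strip()))
    return matches


if __name__ == "__main__":
    apps = resident_science_apps(SAMPLE_PS)
    for pid, pcpu, command in apps:
        print(pid, pcpu, command)
    if len(apps) > 1:
        print("More than one science app is resident at the same time.")
```

Note that `ps` reports CPU usage averaged over the life of the process, so `top` is the better tool for an instantaneous figure like the 30% mentioned above.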

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119317466268

RAC: 25618304

RE: I realize this, though

18 Nov 2007 7:50:12 UTC

Message 75462 in response to message 75461

Quote:

I realize this, though I admit I used somewhat inaccurate language in describing the issue. My apologies. :)

No problem at all. These days I sometimes hesitate to answer questions like this because I'm never quite sure if I'll be accused of denigrating someone just by trying to be precise about what is actually happening. Thanks for understanding that I wasn't trying to put you down in any way. Obviously, from your more detailed description this time, there is something quite unusual going on here. Unfortunately I don't have any rational explanation to offer.

Quote:

At the point when I restart BOINC, the long-term debt (and often the short-term debt as well) indicated in the client_state.xml file is around 10000-12000. Because of this outstanding debt, when BOINC is restarted, the alternate project will proceed to run uninterrupted for several hours, which I gather to be normal behavior.

Yes, this is normal behaviour although I don't understand why BOINC would allow a 3 hour debt to build up in the first place by failing to switch on the hour. I'm assuming when you talk about "alternate project" rather than "projects" that you only have two projects with equal resource shares on this particular machine. If so, then it is very hard to imagine any type of deadline problem which might cause BOINC to favour one project.

Quote:

When the debt is near zero again, BOINC will swap tasks and run Einstein for an hour, then the alternate project for an hour, and repeat swapping the tasks for at least three iterations of this cycle (Einstein one hour, alternate project one hour).

Absolutely as expected.

Quote:

At some point, Einstein simply refuses to leave memory until BOINC is restarted.

This is the bit I really don't get. I imagine BOINC sends some sort of signal to the science app telling it to suspend itself and then awaits the confirmation that the signal has been acted upon. I'm not a programmer - I'm only guessing. Since the process works on other machines, I'm guessing there is something in your particular hardware/OS setup that is somehow interfering with the process of signalling processes to suspend and resume. This interference must be somewhat random since the swapping works for a while before failing. Someone like Bernd might be interested in looking at this.

Quote:

Even though BOINC attempts to swap projects, Einstein will not exit until the daemon and all children are forced closed.

I'll draw this thread to Bernd's attention so he can have a look at this. Hopefully he will have some ideas.

Cheers,
Gary.

TXR13

Joined: 16 Aug 07

Posts: 8

Credit: 5273566

RAC: 0

RE: I'm assuming when you

18 Nov 2007 15:33:33 UTC

Message 75463 in response to message 75462

Quote:

I'm assuming when you talk about "alternate project" rather than "projects" that you only have two projects with equal resource shares on this particular machine.

Correct. To be specific, this particular machine runs Einstein and World Community Grid, both with a resource share of 100.

Quote:

If so, then it is very hard to imagine any type of deadline problem which might cause BOINC to favour one project.

I certainly haven't been able to think of one. It may be worth noting that BOINC has spuriously reported a deadline issue twice in the last week that I've been tracking this issue. The first time was about six or seven days ago, when it said Einstein was in danger of overrunning the deadline. However, the time to completion was nowhere near the deadline. Restarting the daemon cleared this spurious message.

The second time was just yesterday: when BOINC reported the latest Einstein workunit, the reply said it had missed the deadline, which was also incorrect. After updating a second time against the Einstein scheduler, this message cleared itself.

Quote:

Quote:
At some point, Einstein simply refuses to leave memory until BOINC is restarted.
This is the bit I really don't get.

I can understand that, as I haven't seen this kind of issue on any of my other machines running Einstein, all of which use the same OS and BOINC client version.

Quote:

This interference must be somewhat random since the swapping works for a while before failing.

I'm not sure if it really is random or not. The exact time does seem to be random, but it may also be worth noting that the general time when this happens is always between 10PM and 2AM, Pacific Standard Time. Why, I haven't the faintest idea.

Quote:

I'll draw this thread to Bernd's attention so he can have a look at this. Hopefully he will have some ideas.

Greatly appreciated! I'm certainly willing to do any troubleshooting necessary to help figure it out, and if it turns out to be spurious hardware issues, no worries all around. :) I just wondered if anybody had ever seen something like this before or had any ideas.

Keck_Komputers

Joined: 18 Jan 05

Posts: 376

Credit: 5744955

RAC: 0

@TXR13 It sounds like you may

18 Nov 2007 20:35:51 UTC

Message 75464

@TXR13
It sounds like you may have a problem with your computer's clock or CMOS battery. The time frame in which you are observing the problem is also the time frame in which you would expect the computer to reset its clock against an internet time source.

Another possibility is queue length. The "connect every" setting influences how long BOINC thinks it has to work before the deadline, as well as how large a work supply to keep on hand. WCG has 9-day deadlines, so a setting over 4 days will cause problems. As you get closer to that limit you will see more temporary issues.

BOINC WIKI

BOINCing since 2002/12/8

TXR13

Joined: 16 Aug 07

Posts: 8

Credit: 5273566

RAC: 0

RE: It sounds like you may

18 Nov 2007 20:58:05 UTC

Message 75465 in response to message 75464

Quote:

It sounds like you may have a problem with your computer's clock or CMOS battery. The time frame in which you are observing the problem is also the time frame in which you would expect the computer to reset its clock against an internet time source.

Now that's interesting. I do have ntpd running on all my machines to keep their clocks sync'd appropriately. I also have a few machines that need new CMOS batteries, but this machine wasn't one of those.

Quote:

Another possibility is queue length.

I don't have a large queue setting at all. The "connect every" option is set to 0.1 days for both WCG and Einstein.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119317466268

RAC: 25618304

RE: RE: It sounds like

18 Nov 2007 23:55:24 UTC

Message 75466 in response to message 75465

Quote:

Quote:
It sounds like you may have a problem with your computer's clock or CMOS battery. The time frame in which you are observing the problem is also the time frame in which you would expect the computer to reset its clock against an internet time source.

Now that's interesting....

Yes, it's very interesting! Why don't you temporarily disable time syncing and see what happens to your clock's time keeping ability and, more importantly, if app swapping then becomes normal? I guess you'll already be doing this :).

BOINC can get mightily confused if largish adjustments are being made to the clock while it is running :).
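One rough way to check whether such adjustments are happening is to compare wall-clock time against a monotonic clock over the same interval. The sketch below is an illustration only, not something BOINC provides; the function names and the 5-second threshold are arbitrary assumptions.

```python
import time


def take_sample():
    """Return (wall_clock_seconds, monotonic_seconds) read back to back."""
    return time.time(), time.monotonic()


def clock_step(before, after):
    """Seconds by which the wall clock moved beyond real elapsed time.

    A value near zero means no adjustment; a large positive or negative
    value means the wall clock was stepped forward or backward in between.
    """
    wall_elapsed = after[0] - before[0]
    mono_elapsed = after[1] - before[1]
    return wall_elapsed - mono_elapsed


def is_large_step(before, after, threshold_seconds=5.0):
    """True if the wall clock jumped by more than the (assumed) threshold."""
    return abs(clock_step(before, after)) > threshold_seconds


if __name__ == "__main__":
    first = take_sample()
    time.sleep(2)
    second = take_sample()
    print(f"wall clock drifted by {clock_step(first, second):+.3f} s")
```

Running something like this in a loop overnight, with a longer sleep and a log line whenever `is_large_step` returns true, would show whether a jump lines up with the time the swapping stops.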

Cheers,
Gary.

TXR13

Joined: 16 Aug 07

Posts: 8

Credit: 5273566

RAC: 0

RE: Why don't you

19 Nov 2007 14:41:31 UTC

Message 75467 in response to message 75466

Quote:

Why don't you temporarily disable time syncing and see what happens to your clock's time keeping ability and, more importantly, if app swapping then becomes normal?

I halted ntpd so it was not running. The last application swap indicated in BOINC's messages pane occurred at approximately 12:03AM PST (from WCG to Einstein). From then until I restarted the BOINC client at 6:35AM PST, Einstein was the only application running. At the time of the client restart, both WCG and Einstein applications were loaded into memory.

After restarting the client, WCG immediately began processing, having accumulated a short term debt of over 14000, and a long term debt of over 11000.

Keck_Komputers

Joined: 18 Jan 05

Posts: 376

Credit: 5744955

RAC: 0

RE: RE: Why don't you

20 Nov 2007 0:24:30 UTC

Message 75468 in response to message 75467

Quote:

Quote:
Why don't you temporarily disable time syncing and see what happens to your clock's time keeping ability and, more importantly, if app swapping then becomes normal?

I halted ntpd so it was not running. The last application swap indicated in BOINC's messages pane occurred at approximately 12:03AM PST (from WCG to Einstein). From then until I restarted the BOINC client at 6:35AM PST, Einstein was the only application running. At the time of the client restart, both WCG and Einstein applications were loaded into memory.

After restarting the client, WCG immediately began processing, having accumulated a short term debt of over 14000, and a long term debt of over 11000.

Unfortunately, halting the ntpd service may not resolve this type of issue immediately, even if that is the cause. Because BOINC timestamps things as they occur, bad timestamps may keep causing problems long after the clock is fixed. Deadlines can be reset to 1901, usage stats can be set to unreasonable values, and other nasties.

Speaking of usage stats, have you checked them? They are in a section of the client_state.xml file called time_stats. The "on" and "active" fracs should be between 0 and 1; the connected_frac is irrelevant.
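A quick way to run that check is sketched below in Python. It assumes the section looks like `<time_stats>` with child elements `<on_frac>`, `<active_frac>` and `<connected_frac>`, and that client_state.xml parses as well-formed XML. The sample values and the function name `check_time_stats` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; element names are assumed, values are invented.
SAMPLE_TIME_STATS = """<client_state>
  <time_stats>
    <on_frac>0.95</on_frac>
    <connected_frac>-1.0</connected_frac>
    <active_frac>1.7</active_frac>
  </time_stats>
</client_state>"""


def check_time_stats(state_xml):
    """Return {tag: value} for on/active fractions outside the 0..1 range.

    A missing tag is reported with the value None. connected_frac is
    deliberately not checked.
    """
    root = ET.fromstring(state_xml)
    stats = root.find(".//time_stats")
    if stats is None:
        raise ValueError("no <time_stats> section found")
    problems = {}
    for tag in ("on_frac", "active_frac"):
        text = stats.findtext(tag)
        if text is None:
            problems[tag] = None
            continue
        value = float(text)
        if not 0.0 <= value <= 1.0:
            problems[tag] = value
    return problems


if __name__ == "__main__":
    print(check_time_stats(SAMPLE_TIME_STATS))
```

An empty result means both fractions are in range; anything else points at a value worth a closer look.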

BOINC WIKI

BOINCing since 2002/12/8

TXR13

Joined: 16 Aug 07

Posts: 8

Credit: 5273566

RAC: 0

RE: Speaking of usage

20 Nov 2007 1:10:25 UTC

Message 75469 in response to message 75468

Quote:

Speaking of usage stats, have you checked them? They are in a section of the client_state.xml file called time_stats. The "on" and "active" fracs should be between 0 and 1; the connected_frac is irrelevant.

<on_frac>0.674916</on_frac>
<connected_frac>-1.000000</connected_frac>
<active_frac>0.999898</active_frac>
...

I realize you said connected_frac is irrelevant, but is a negative value within the bounds of normality?
