I've been getting these EAH WUs that take about 75 hours, which is fine, but as soon as one is finished my BOINC immediately downloads another even though EAH has a negative debt over 900K (I give EAH 14% resource share). Malaria has 50% resource share and a positive debt of almost 900K, but the last few of their WUs my system downloaded only because I suspended EAH long enough to force it to do so.
My question is: Do these EAH units have some kind of high priority that overrides my resource allocation, or is my system malfunctioning? If the former, I'll be happy to cooperate, but if not, what can I do to change this?
Copyright © 2024 Einstein@Home. All rights reserved.
Long WUs high priority?
No, definitely not. All projects run at low priority, and it is BOINC, not the project, that determines when a project is to be run and for how long.
Only you can really determine that :). It's not normally possible to keep getting more work for a project that is already way in excess of its fair share as shown by the long term debt values. If the debts really are as you say it would normally be impossible for EAH to get more work unless every other project was unable to supply work at the time that BOINC was asking for it. You only mention two projects and you give resource shares that don't add to 100%. Your results list shows 4 EAH results, two of which were received and aborted back on June 22. The 3rd one you also received on June 22 (about 4 hours later) and after about 11 days it was completed and returned. The 4th result was received about 10mins after your third was uploaded. This seems to indicate that perhaps BOINC did try to get work elsewhere before taking more from EAH. How much other work did you have on board at the time the 3rd result was finished?
Unless you can give a lot more information about how things work on your machine, all anybody trying to help can do is engage in a bit of speculation about possible scenarios. You need to specify all projects, their resource shares and debts. You need to specify how many hours per day your machine is running, what sort of cache of work you try to keep and whether or not you have done any "micromanaging" of your BOINC client. You also need to give details of how often other projects have been getting and returning work and any actions you may have taken to encourage/discourage their work fetch arrangements.
BOINC does work in mysterious ways sometimes and casual observation can often make you think that it is behaving irrationally. However, most often, BOINC knows exactly what it is doing and the best advice is to have patience, sit back, leave it alone and let it do its thing :).
EDIT: I meant to also state that BOINC would normally be rather unhappy with a 14% share for EAH if it knew in advance that each result would take 75 hours to complete. It would know that under those conditions and sticking faithfully to your resource share, a result would take more than 22 days elapsed time to complete, with no safety margin and 24/7 operation needed. So how did you fool BOINC into getting EAH work in the first place? :).
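For anyone wanting to check that 22-day figure, here is a minimal sketch of the arithmetic. The function name and the `uptime_fraction` parameter are only illustrative; this is not how BOINC itself computes anything, just the back-of-envelope version.

```python
def days_to_finish(cpu_hours: float, resource_share: float,
                   uptime_fraction: float = 1.0) -> float:
    """Elapsed days for one result if the project only ever gets its share.

    cpu_hours       -- CPU time one result needs (75 h in this thread)
    resource_share  -- project's share as a fraction (0.14 for EAH here)
    uptime_fraction -- fraction of the day BOINC is running (1.0 = 24/7)
    """
    cpu_hours_per_day = 24.0 * uptime_fraction * resource_share
    return cpu_hours / cpu_hours_per_day


# 24/7 operation, 14% share, 75-hour result
print(round(days_to_finish(75, 0.14), 1))  # 22.3
```

Any uptime below 24/7 only stretches that number further.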
Cheers,
Gary.
Thanks for your prompt reply. Here's a whole bunch of facts and figures (you asked for it :)) about the current state of my BOINC client (as of about 2:15 MDT, Wednesday, July 11) as shown on BOINC Manager 5.8.16 and Boinc Debt Viewer 0.3.2:
PROJECT                 WORK DONE   RESOURCE SHARE   LTD
Rectilinear Crossing      1410.25   210 (14%)           21478.826967
Einstein@Home             8816.90   210 (14%)         -978092.369060
lchathome                  163.02   210 (14%)               5.542967 *
chess960@home              713.42   120 (8%)            23344.474918
malariacontrol.net        1931.52   750 (50%)          933263.524209
*zeroed out a few days ago when detached and reattached for new URL, before that was around 40-50K since I haven't had work for them since March.
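As a quick cross-check of the percentages in that table, the raw resource share numbers can be turned into percentages like this. It is only a sketch: the short project labels are shorthand, and the dictionary is typed in by hand from the figures above rather than read from BOINC.

```python
# Raw resource shares as listed in the table above (hand-copied).
SHARES = {
    "RCN": 210,      # Rectilinear Crossing
    "EAH": 210,      # Einstein@Home
    "LHC": 210,
    "chess": 120,    # chess960@home
    "malaria": 750,  # malariacontrol.net
}

total = sum(SHARES.values())

# Each project's percentage is its raw share over the sum of all shares.
percent = {name: round(100 * share / total, 1) for name, share in SHARES.items()}

for name, pct in percent.items():
    print(f"{name:8s} {pct:5.1f}%")
```

The five raw shares sum to 1500, so the listed percentages do add up to 100%.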
The "Tasks" tab shows the current EAH unit 66.411% finished (CPU Time 39:57, To Completion 24:27). I also have 3 Rectilinear (RCN) WUs sitting waiting to start.
Of these 5 projects, chess has work about half the time and LHC only occasionally, so I sometimes interfere by suspending the other projects (which always have work) to get work from these 2 when it's there. Other than that, I really haven't micro-managed BOINC - until lately.
//EDIT: I meant to also state that BOINC would normally be rather unhappy with a 14% share for EAH if it knew in advance that each result would take 75 hours to complete. It would know that under those conditions and sticking faithfully to your resource share, a result would take more than 22 days elapsed time to complete, with no safety margin and 24/7 operation needed. So how did you fool BOINC into getting EAH work in the first place? :).//
The answer to this question is actually quite simple. BOINC shows that my computer is on 43-45% of the time and BOINC runs 99.7% of that. While it's true that 14% of THAT is only about 90 mins a day, on a low-power computer like mine BOINC spends a good deal of the time in "earliest-deadline first because computer is over-committed" scheduling mode. (This also comes up at chess where the WUs have 24 hour turnaround.) So when it's got too big a mouthful to finish at 14%, it does EAH (or chess) full-time until it makes the deadline(s) and has run up a significant negative debt, then doesn't accept work from that project until the others catch up. This has always worked the way it's supposed to - until these looooong EAH WUs came along. Now it finishes one and immediately downloads another, no matter how big the negative debt has gotten.
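The "about 90 mins a day" figure can be reproduced with a few lines. This is just a sketch of the arithmetic using the percentages quoted above; the function name is made up for illustration.

```python
def share_minutes_per_day(on_fraction: float, boinc_fraction: float,
                          resource_share: float) -> float:
    """CPU minutes per day a project gets at its strict resource share."""
    return 24 * 60 * on_fraction * boinc_fraction * resource_share


# Computer on 43-45% of the time, BOINC running 99.7% of that, EAH at 14%.
low = share_minutes_per_day(0.43, 0.997, 0.14)
high = share_minutes_per_day(0.45, 0.997, 0.14)
print(round(low, 1), round(high, 1))  # 86.4 90.4

# Elapsed days a 75-hour WU would need at that rate.
wu_minutes = 75 * 60
print(round(wu_minutes / high), "to", round(wu_minutes / low), "days")
```

At the strict 14% share a 75-hour WU would take roughly 50 days, which is why BOINC ends up running it full-time in earliest-deadline-first mode.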
//Your results list shows 4 EAH results, two of which were received and aborted back on June 22. The 3rd one you also received on June 22 (about 4 hours later) and after about 11 days it was completed and returned. The 4th result was received about 10mins after your third was uploaded. This seems to indicate that perhaps BOINC did try to get work elsewhere before taking more from EAH. How much other work did you have on board at the time the 3rd result was finished?//
I had done at least 2 (maybe 3) of these long units before these 4 still showing on my results page, so on June 22 when it started downloading yet another I aborted the download in exasperation. (I stupidly forgot to set "no new tasks" BEFORE aborting and it IMMEDIATELY began downloading another, which I also aborted.) Then, after it did some other work for a few hours, I allowed it to complete a download. When I returned that WU on July 3, I waited for other projects to download before I let EAH give me another. I also occasionally suspended EAH to get other work, but it's done mostly EAH for at least a month.
//BOINC does work in mysterious ways sometimes and casual observation can often make you think that it is behaving irrationally. However, most often, BOINC knows exactly what it is doing and the best advice is to have patience, sit back, leave it alone and let it do its thing :).//
I've tried, but I'm running out of patience and wish I could get a better explanation (or plan of action).
Thanks,
Lessa
Thanks for all the detail you have given. It makes it much easier for trying to picture what is happening. One thing you don't seem to have mentioned is your cache setting as controlled by the "connect to network every X days" preference setting. It would have a significant influence if it was too large. Also check that your "switch between projects every ..." is the default value of 60 mins.
Some general comments first. With the values you have listed, EAH should not be getting more work for a long time to come. BOINC should be blocking it for about the next three months or more :). Having said that, I know that suspending and resuming various projects can cause unexpected consequences - particularly if the resource share of the suspended project is large - like that of malaria. That project is supposed to be getting a big share but it doesn't seem to have done all that much work. Do you get regular work from that project and do you ever suspend it? If there's no difficulty getting malaria work it's very hard to understand why BOINC is allowing it to accumulate such a large LTD. Your cache should be awash with malaria WUs :). Another thing to look at is the value for DCF (duration correction factor) that is stored in your state file (client_state.xml) for each project. It would be good to report those five values just to see if any look to be "unusual" :).
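If digging through client_state.xml by hand is a pain, something like the following sketch could pull the values out. It assumes each project sits in a `<project>` element containing `<project_name>`, `<duration_correction_factor>` and `<long_term_debt>` tags; those tag names are an assumption and should be checked against your own state file first. The embedded sample is hand-written from figures quoted in this thread, not a real state file.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for client_state.xml (assumed tag names).
SAMPLE = """<client_state>
  <project>
    <project_name>Einstein@Home</project_name>
    <long_term_debt>-978092.369060</long_term_debt>
    <duration_correction_factor>0.885589</duration_correction_factor>
  </project>
  <project>
    <project_name>malariacontrol.net</project_name>
    <long_term_debt>933263.524209</long_term_debt>
    <duration_correction_factor>1.126823</duration_correction_factor>
  </project>
</client_state>"""


def read_projects(xml_text: str) -> dict:
    """Return {project name: (DCF, long term debt)} from state-file text."""
    root = ET.fromstring(xml_text)
    result = {}
    for proj in root.iter("project"):
        name = proj.findtext("project_name", default="(unknown)")
        # Fall back to neutral values if a tag is missing.
        dcf = float(proj.findtext("duration_correction_factor", default="1.0"))
        ltd = float(proj.findtext("long_term_debt", default="0.0"))
        result[name] = (dcf, ltd)
    return result


for name, (dcf, ltd) in read_projects(SAMPLE).items():
    print(f"{name:20s} DCF={dcf:.6f}  LTD={ltd:.1f}")
```

To try it on a real file, read a copy of client_state.xml into a string and pass that in instead of `SAMPLE`. Only ever read the file this way; don't edit it while BOINC is running.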
If I had your problem, here's what I would do. First off I would check and then reduce if necessary the "connect to network" setting to a very low value like 0.01 days so that no project should attempt to get a whole lot of work. BOINC is far more likely to get itself sorted out quickly if it's not trying to manage a lot of work. There's no indication that your value is too large but you never know. If your value is more than say 1.0 days it is probably too large.
Second, I would resolve not to touch the "suspend" option if at all possible just to force other projects to get extra work. If you suspend the majority of your projects, the remaining ones can think they have full access to all of your cache and can download as if they had 100% share. It's then much harder for BOINC to sort out the resulting mess. Also, I have seen some strange things happen when projects get randomly suspended and resumed.
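To picture why suspending is risky, here is a rough sketch of how the share seen by the remaining projects changes. It is a simplified model of the effect described above, not BOINC's actual work-fetch code, and the share numbers and short project labels are the ones reported earlier in this thread.

```python
# Raw resource shares as reported earlier in the thread.
SHARES = {"RCN": 210, "EAH": 210, "LHC": 210, "chess": 120, "malaria": 750}


def effective_shares(shares: dict, suspended=()) -> dict:
    """Fraction of the machine each non-suspended project appears to own."""
    active = {n: s for n, s in shares.items() if n not in suspended}
    total = sum(active.values())
    return {n: s / total for n, s in active.items()}


# Nothing suspended: EAH sees its intended 14%.
print(round(effective_shares(SHARES)["EAH"], 2))  # 0.14

# Malaria (50% share) suspended: EAH's apparent share doubles.
print(round(effective_shares(SHARES, suspended={"malaria"})["EAH"], 2))  # 0.28

# Everything but EAH suspended: EAH looks like it owns the whole machine.
others = {"RCN", "LHC", "chess", "malaria"}
print(effective_shares(SHARES, suspended=others)["EAH"])  # 1.0
```

Combine an inflated share like that with a large "connect to network" value and a single project can fill the cache in one go.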
For those projects like LHC which rarely have work, consider giving them a much higher resource share. Most of the time they won't be able to use it so the other projects will take it back anyway. Then when LHC suddenly has work it can grab a bigger slice because it will have both a higher resource share and a bigger LTD to assist it.
Thirdly, as things are quite out of whack anyway, I would reset all STDs and LTDs to zero just to give BOINC a fresh start. If you decide to change any resource shares, do it first on the website of each project and then update each project in your Boinc Manager to make sure the new resource share has "taken". Then stop BOINC, reset all debts, and restart BOINC.
When BOINC restarts, the low "connect" setting should prevent it from getting too much work from any one project. All projects that have work to give should supply exactly one work unit only. If you have a single EAH work unit, BOINC will probably calculate that it is overcommitted and will finish it first. BOINC should then refuse to get any more from EAH for a while.
It may take quite some time for things to settle down but generally speaking you should see BOINC switching regularly between projects that have work. After a week or two when things seem to be going properly, you could consider slowly and carefully increasing your cache in a couple of stages from 0.01 days to something like 0.5 - 0.7 days or so. Allow a couple of days between each increase to get a feel for what is happening and to allow BOINC to cope with the adjustment. Back out the change if there appear to be adverse consequences. Just remember that sudden large changes can cause havoc :).
Good luck and let us know how things turn out.
Cheers,
Gary.
Thanks again for your help. Right now I have set everything to "no new tasks" and will wait for everything to empty out before making changes. In the meantime, I thought I'd give you my DCF for each project and see if you see anything amiss there, because I have no idea what the numbers mean (I'm only modestly computer-literate).
RCN ------ 1.518205
EAH ------ 0.885589
LHC ------ 1.010644
chess ---- 0.965422
malaria -- 1.126823
///One thing you don't seem to have mentioned is your cache setting as controlled by the "connect to network every X days" preference setting... Also check that your "switch between projects every ..." is the default value of 60 mins.///
I figured in that big clump of facts and figures I'd forget something :). I've had the "connect" setting at 4 days (again because of LHC - to get more units when they're available), so I will try reducing that, as well as your suggestion about resource share. And my "switch" time is at the default.
I hope zeroing the debts is as simple as that button that says "clear debts"?:)?
So please let me know if you see something in those DCF numbers. And I'll let you know how it all turns out.
Thanks again,
Lessa
Ahhh.... That's probably going to turn out to be the source of much weirdness, particularly when you start suspending projects. Other projects (e.g. EAH) can take that as an open invitation to grab a lot of work at that moment. Now that I know that, I would guess that you will probably be able to get much better behaviour just by reducing that 4 day value to 0.5 day. Do that straight away, even before your cache runs down due to "no new tasks" as this will give BOINC time to adjust to the new value. Over the coming days, as project queues get emptied, try allowing new tasks one project at a time to see if that project will then grab a sane amount of new work. Malaria with its 50% share and positive LTD should always be trying to keep its cache full. That's OK as it's got some catching up to do :).
I'm aware of the existence of Boinc Debt Viewer but have no experience with it so I'd be checking any READMEs that come with the package or any FAQs on the website where you got it, just to make sure about that point. You would hope that a button so labelled would be able to do the job without trashing your state file but you are wise to be a little concerned :). If you are going to use that button make sure BOINC is fully shut down first.
EDIT: Actually, you could put that action on hold for the moment to see if the "connect to network" reduction fixes things by itself. As your preference was to give malaria the biggest slice of your time, that's exactly what should happen with the debts the way they are. It would be interesting to see if BOINC is now able to get itself back on track without interfering with the debts :).
The DCFs all look reasonable. The default is 1.0 and the values move up or down slightly over time to correct the project estimate for how long a result should take on your particular computer.
Cheers,
Gary.
BOINC Debt Viewer is just as easy as it looks.
Just make sure you have completely closed down BOINC before you hit that button.
A pop up will remind you though.
But I agree with Gary. I'd hold off on doing that. See if reducing the connect interval gives you better behavior.
Kathryn :o)
Einstein@Home Moderator
IT WORKED!!! I changed the connect time to 0.5 days and when my computer finished its latest EAH this afternoon it DIDN'T get another. So it's started the work of catching up all those debts on the other projects (EAH got to negative 1.04M!). I've now increased the connect time to 1.0 day, and I guess that's as far as I'd better go, at least on this computer.
Thank you both very much for your help, and hopefully I'll give EAH more help in the future (but not too soon!).
Lessa
You're welcome! :).
Cheers,
Gary.