I received the following query by PM.
For some reason, my machine is receiving tasks that have a very bloated estimate of the time to complete, resulting in BOINC immediately placing it into the 'High Priority' mode. It's a dual CPU machine, and actual times are maybe 10% of the estimate.
Do I just wait this out, or do I need to do something about it?
My policy is not to respond to standard queries by PM - for two main reasons.
1) There is a huge 'education' opportunity being missed when responding by PM. I always take the view, supported by the large number of people who lurk on these boards, that it's the lurkers rather than the OP who stand to benefit most from a detailed response. I often get accused of being too verbose but that doesn't bother me. I'm targeting my response at people who perhaps know less than the OP or who may have closely related queries that they don't feel competent to pose.
2) There are a lot of good people who answer queries on these boards. Some are undoubtedly more competent to answer certain questions than I am. It's good to share the load of answering questions with all the other volunteers who choose to do so.
The quick answer to the question is to tell the OP that the problem is most likely a DCF (duration correction factor) that is out of whack. He has two choices.
A) He can let BOINC deal with it, as BOINC will gradually move the DCF to a better value (at the rate of 10% of the difference for each fresh result that completes). It could take a week or two, depending on the speed of the machine and the number of cores.
B) He can stop BOINC and edit the state file (client_state.xml) to make the full adjustment by hand and then restart BOINC. If you are comfortable with editing .xml files, this is the way to go. If you make a mistake during editing you can easily lose your entire cache of work. This option is not recommended for people who are just donating computing resources and who wish to 'set and forget.'
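To get a feel for how long option A might take, here is a rough sketch of the '10% of the difference per completed result' rule described above. The function name, the target DCF of 1.0 and the 10% tolerance are my own assumptions for illustration; the real client logic has more cases than this.

```python
def tasks_until_close(dcf, target=1.0, rate=0.10, tolerance=0.10):
    """Count completed tasks until dcf is within `tolerance` (relative) of target.

    Assumption: each completed task moves dcf by `rate` times the remaining
    difference, as described in the post. Not the actual BOINC client code.
    """
    completed = 0
    while abs(dcf - target) > tolerance * target:
        dcf -= rate * (dcf - target)
        completed += 1
    return completed, dcf


# A DCF of 10 corresponds to estimates that are ten times too long.
n, final_dcf = tasks_until_close(10.0)
print(n, round(final_dcf, 3))
```

Under these assumptions a DCF of 10 needs a little over 40 completed tasks to come within 10% of 1.0, which is why the hands-off route is measured in days or weeks rather than hours.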
The long answer (the one I would always try to give) is to go into some scenarios as to why the DCF might have got out of whack in the first place. If the Devs have got things in an 'optimal state', the DCF will be close to 1.0 and tasks will be taking pretty close to the estimate. With E@H there is a variability that is not easily predicted so the 'optimal state' is never going to be fully achieved. Also different platforms have different behaviours so whilst one host might truly have a DCF close to 1.0, others might need values of say 0.7 or lower or even up to 1.5 or higher, just to keep the estimated time close to the actual time. These variations would be quite normal.
The OP mentioned estimates that were wrong by a factor of 10 (actual time is only 10% of estimated time). This situation is never 'normal'. The DCF must be of the order of 10 rather than the hoped-for value of around 1.0. A quick inspection of the state file, searching for the <duration_correction_factor> tag (in the Einstein project section of the state file - be careful if you have multiple projects), would soon confirm the diagnosis. If everything seems to stack up as expected, a quick fix would be to shift the decimal point one place to the left - that is, reduce the value by a factor of 10.
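For anyone who would rather look before touching anything, the inspection step can be sketched as a read-only script. It assumes the state file parses as well-formed XML, that each project sits in a `project` element, and that the DCF is stored in a `duration_correction_factor` element next to a `master_url`. The sample text and its URLs are made up for illustration; check your own client_state.xml before relying on any of this.

```python
import xml.etree.ElementTree as ET


def project_dcfs(xml_text):
    """Return {master_url: dcf} for every <project> block in the given text."""
    root = ET.fromstring(xml_text)
    result = {}
    for project in root.iter("project"):
        url = project.findtext("master_url", default="(unknown)")
        dcf = project.findtext("duration_correction_factor")
        if dcf is not None:
            result[url.strip()] = float(dcf)
    return result


# Cut-down, hypothetical sample; a real state file holds far more than this.
sample = """<client_state>
  <project>
    <master_url>http://einstein.example/</master_url>
    <duration_correction_factor>10.250000</duration_correction_factor>
  </project>
  <project>
    <master_url>http://other-project.example/</master_url>
    <duration_correction_factor>0.950000</duration_correction_factor>
  </project>
</client_state>"""

for url, dcf in project_dcfs(sample).items():
    flag = "  <-- worth a closer look" if dcf > 5 else ""
    print(f"{url}: {dcf}{flag}")
```

The script only reads. Any actual change should still be made with BOINC stopped, as described in option B.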
Why would the DCF get so out of whack? I can immediately think of two possibilities and others may know of more.
1) A bad CPU benchmark run could severely miscalculate the CPU capabilities which would cause the DCF to be adjusted correspondingly upwards or downwards from the proper value.
2) A significant CMOS time adjustment being made while BOINC is running. In the middle of a run, imagine that the CMOS time was adjusted by a large amount because someone noticed a date error. When the task completed BOINC could be fooled into thinking that the runtime was vastly different to what it should have been and could make a corresponding large adjustment to DCF.
So if something like this may have happened, the appropriate action would be to set the DCF value back to what it should be, relying on one of the two approaches listed earlier.
Cheers,
Gary.
The estimate of Crunch Time is too high
Gary -
Since I am the originator of this question, I want to thank you for placing it here. Perhaps others will find my comments helpful ...
As of right now, I am letting BOINC deal with the issue. When I originally wrote to you, the WUs that I was receiving carried an Estimated Completion Time (ECT) of 375 hours. Since the WUs have a return window of two weeks, this immediately forced BOINC into the 'High Priority' mode, in an effort to make the two week date. (168 hours/week, two weeks = 336 hours). I do not know what the Duration Correction Factor (DCF) was at that time.
The most recent WU received from Einstein@home carries an ECT of 264 hours. Since that is less than two weeks, the system no longer sets 'High Priority'. The DCF for the Einstein project is currently 60.xxxxxx. I suspect that BOINC is doing its thing, and slowly resetting the DCF each time my system returns a completed WU.
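The deadline arithmetic in the last two paragraphs can be written out as a tiny check. This is only the comparison described here (estimate versus a two week window); the real BOINC scheduler weighs more than this, and the function name is made up for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def misses_deadline(estimate_hours, deadline_weeks=2):
    """True if the estimated run time alone already exceeds the return window."""
    return estimate_hours > deadline_weeks * HOURS_PER_WEEK


print(misses_deadline(375))  # the original 375 hour estimate against 336 hours
print(misses_deadline(264))  # the most recent 264 hour estimate
```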
None of the other projects that I donate time to suffer from this problem, so I suspect something happened a couple of months ago to create this problem. Now, as to what caused it, I have no idea.
My CMOS is slaved to SYMMTIME, which locks my clock/time to the atomic clock in Denver, so there should not be/have been any massive hiccups there. As for a benchmark error, while possible, I would think that such an error as that would affect all my projects, not just one. In any event, I will keep watch on this issue, and make additional comments if/when the need arises. Thanks ....
If I've lived this long - I gotta be that old!
RE: The most recent WU
A couple of comments. As the error in the estimate is over 250 hours, BOINC should be reducing the estimate for new tasks by around 25 hours each time a task completes. Can you see this behaviour happening?
Such a large error must be playing havoc with your cache of work. You would only be able to get extra tasks for each CPU when the current task was largely completed and the estimate of remaining time had reduced to a more sensible level. This situation would tend to drive me bonkers and I'd have long ago got out the surgeon's knife ... :-).
I'm aware of a third situation which may be able to cause this corruption of DCF. Is your machine ever put into hibernation for an extended period? If not, forget about it. If so, spell out the circumstances and I'll fill in the details.
Cheers,
Gary.
Gary ... To answer the
Gary ...
To answer the first question, no. I have run/returned four WUs today (2/3) and the DCF is still at 60.3xxxxxx. The latest WU that I have received (unstarted) shows 261:19:13 estimated time to complete. Actual times for the last WUs were in the range of 7 - 8 hours each.
No, this computer is not put "into hibernation", although there are periods of time when no other work (other than BOINC) is being performed. BOINC is being run as a service, so no screensavers are involved.
If I've lived this long - I gotta be that old!
RE: To answer the first
It looks like you are supporting several projects. The multiple E@H tasks returned on the one day that you referred to may have been completed somewhat before being returned, so the effect on DCF may have happened earlier than the time you were observing. You only have one E@H task in your cache right now and you have recorded its estimated time to completion (ETC). You also need to note the ETC for the very next E@H task received (it should also be around 261 hours) and then note how much that changes immediately after the current task completes. That is the point when you expect to see the next change, not when you actually upload and/or report the result. If you don't get a 25 hour reduction at that time then something is very weird.
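For reference, the '25 hour reduction' comes straight from the 10% rule. A small sketch, assuming the estimate for a waiting task drops by 10% of the gap between estimate and actual run time at each completion, and taking 7.5 hours as a stand-in for the reported 7 - 8 hour run times:

```python
def next_estimates(estimate_hours, actual_hours, completions, rate=0.10):
    """List the estimate for a waiting task after each successive completion.

    Assumption: each completion closes `rate` of the gap between the current
    estimate and the actual run time. A simplification, not the client code.
    """
    estimates = []
    for _ in range(completions):
        estimate_hours -= rate * (estimate_hours - actual_hours)
        estimates.append(estimate_hours)
    return estimates


# 261.3 hours is roughly the 261:19:13 reported above.
for i, est in enumerate(next_estimates(261.3, 7.5, 4), start=1):
    print(f"after completion {i}: about {est:.0f} hours")
```

The first step is a drop of roughly 25 hours, and each later step is a little smaller than the one before because the gap itself is shrinking.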
If you don't wish to stuff around any further, stop BOINC and change the value in your state file from 60.3xxxxx to 1.3xxxxx. Depending on your cache size, you might get a flood of E@H tasks at that point when you restart BOINC.
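As a rough cross-check on what value to type in, one option is to scale the current DCF by the ratio of actual to estimated run time, which is the same idea as the 'shift the decimal point' fix mentioned earlier. This assumes the estimate scales linearly with DCF; the function name is made up for illustration.

```python
def rescaled_dcf(old_dcf, estimate_hours, actual_hours):
    """Scale the DCF by the ratio of actual to estimated run time.

    Assumption: the displayed estimate is proportional to the DCF.
    """
    return old_dcf * actual_hours / estimate_hours


# Figures quoted in this thread: DCF 60.3, estimate about 261.3 hours,
# actual run times of 7 to 8 hours.
for actual in (7.0, 8.0):
    print(actual, round(rescaled_dcf(60.3, 261.3, actual), 2))
```

For the figures quoted here that lands between roughly 1.6 and 1.85, the same order as the 1.3 suggested above. Either way BOINC is left with only a small residual to correct instead of a factor of 60.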
OK, then I don't know how or why your DCF got so far out of whack, sorry.
Cheers,
Gary.