Estimated run-time too high

Bluesilvergreen
Joined: 20 May 06
Posts: 23
Credit: 1206151
RAC: 0

Example: I had v6.6.3 installed, and the DCF went down by 10% after each finished WU. However, it rose sharply whenever a period of hibernation (overnight) fell in between, because the CPU time was normal but the wall time was several times greater than the CPU time.
When I looked at the estimated time left for the 4 WUs, it had already jumped to a high value right after resuming from hibernation.
So when one of these WUs that was "interrupted" by hibernation finishes, it causes the DCF to increase; I guess the wall time is being used to calculate the DCF instead of the real CPU time. And because I have a quad-core, there are 4 of these WUs that cause the DCF to rise. After these 4 WUs have finished, the next WU, whose wall time and CPU time are nearly the same, decreases the DCF by 10%.
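The pattern described above can be reproduced with a small simulation. This is only a rough Python sketch, not BOINC's actual code: the update rule (jump straight up to the observed ratio, otherwise move 10% of the way down towards it) is an assumption chosen to resemble the slow 10% steps down and fast jumps up reported here, and `update_dcf` and all the task times are made up for illustration.

```python
def update_dcf(dcf, measured_time, estimated_time):
    """Hypothetical DCF update: rise at once, fall slowly.

    measured_time  -- run time the client attributes to the finished task
    estimated_time -- uncorrected estimate for that task (same units)
    """
    ratio = measured_time / estimated_time
    if ratio > dcf:
        # Task took longer than the corrected estimate: jump straight up.
        return ratio
    # Otherwise move 10% of the way towards the observed ratio.
    return 0.9 * dcf + 0.1 * ratio


# Assumed numbers: each task needs 6 h of CPU time and the raw estimate is 6 h.
ESTIMATE_H = 6.0
CPU_H = 6.0
HIBERNATION_H = 10.0  # one night asleep in the middle of a task

dcf = 1.0
# Wall-time accounting: the hibernated task appears to take 16 h.
dcf_wall = update_dcf(dcf, CPU_H + HIBERNATION_H, ESTIMATE_H)
# CPU-time accounting: the same task still counts as 6 h.
dcf_cpu = update_dcf(dcf, CPU_H, ESTIMATE_H)

# A following uninterrupted task pulls the inflated value down only slowly.
dcf_after = update_dcf(dcf_wall, CPU_H, ESTIMATE_H)

print(f"wall-time DCF: {dcf_wall:.2f}, CPU-time DCF: {dcf_cpu:.2f}, "
      f"after one normal task: {dcf_after:.2f}")
```

Under these assumptions a single hibernated task is enough to inflate the DCF, and it takes many uninterrupted tasks to bring it back down.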

Then I uninstalled v6.6.3, leaving the remaining folder and files untouched, and installed v6.2.19.
After a few days that included hibernation periods, the DCF decreased steadily from about 5.9 and no longer rises after resuming from hibernation.
Now the estimated times are coming down to the normal value, that is, the real average CPU time, and the DCF is now at 2.34.
The effect of hibernation is reproducible.

I think it is too complicated to work out a specific example that shows the effect exactly, but I'm quite sure that up to v6.2.19 hibernation isn't an issue. I don't know whether it has something to do with BOINC's CUDA capability.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2963859029
RAC: 712000

Bingo. That's exactly what happened.

Quote:

The advent of GPU apps (and soon multi-thread apps)
has required a number of fundamental changes in the client;
the assumption that job == 1 CPU was deeply embedded.
I've been making these changes piecemeal, scrambling to meet various deadlines,
so I haven't written any design docs; sorry about that.
Here's a summary:

0) app versions now include avg_ncpus, coprocessor usage, and
a FLOPS estimate (this defaults to the CPU benchmark,
but for GPU and multithread apps it will be different).
This info is sent from the server.

1) Estimating the duration of unstarted jobs:
jobs are now associated with specific app versions.
The estimated duration of an unstarted job is the WU's
FLOP estimate divided by the app version's FLOPS,
scaled by the duration correction factor.

2) Duration correction factor: this is now based on elapsed time
(i.e. wall time during which the job has been running) rather than CPU time.

Yes, this is affected by non-BOINC CPU load; that's as it should be.

3) "CPU efficiency" is no longer maintained; it's subsumed in DCF.

4) Estimating the duration of running jobs:
this is a weighted average of static and dynamic estimates.
The dynamic estimate is now based on elapsed time rather than CPU time.
So if a GPU job has been running for 5 min, is 25% done,
and has used 1 min of CPU, its dynamic estimate is 20 min (not 4 min).

5) round-robin simulation: this was modified to reflect multi-thread
and coproc apps (e.g., if the host has 1 GPU, only one coproc app
can run at a time).
If CPUs are idle because coprocs are in use,
don't count it towards CPU shortfall.

6) scheduler_cpus() and enforce_schedule() take coprocs and
avg_ncpus into account. They try to keep GPUs busy if possible.


(David Anderson to BOINC_dev mailing list, 03 December 2008 19:40 UTC)
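Item 1 of the summary boils down to a one-line formula. Here is a minimal Python sketch of it, not the client's real code: the function name and the FLOP figures are invented for illustration, and only the two DCF values (5.9 and 2.34) come from this thread.

```python
def estimate_unstarted_seconds(wu_flop_estimate, app_version_flops, dcf):
    """Estimated duration of an unstarted job, following item 1 above:
    the WU's FLOP estimate divided by the app version's FLOPS,
    scaled by the duration correction factor."""
    if app_version_flops <= 0:
        raise ValueError("app_version_flops must be positive")
    return wu_flop_estimate / app_version_flops * dcf


# Made-up workload: 4e13 FLOP on an app version rated at 2e9 FLOPS.
WU_FLOPS_EST = 4e13
APP_FLOPS = 2e9

uncorrected = estimate_unstarted_seconds(WU_FLOPS_EST, APP_FLOPS, dcf=1.0)
with_high_dcf = estimate_unstarted_seconds(WU_FLOPS_EST, APP_FLOPS, dcf=5.9)
with_lower_dcf = estimate_unstarted_seconds(WU_FLOPS_EST, APP_FLOPS, dcf=2.34)

for label, secs in [("DCF 1.0", uncorrected),
                    ("DCF 5.9", with_high_dcf),
                    ("DCF 2.34", with_lower_dcf)]:
    print(f"{label}: {secs / 3600:.1f} h")
```

Because the DCF multiplies every unstarted job's estimate, an inflated DCF shows up immediately in the estimated run time of all queued work.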

It was done because BOINC doesn't record GPU time, and they realised that CPU time didn't work for CUDA. Guess they forgot about hibernation.
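The GPU example in item 4 shows why CPU time was dropped. A small Python sketch of the dynamic estimate, using the numbers from the quote (the function name is made up, and the weighting against the static estimate is left out because the quote gives no weights):

```python
def dynamic_estimate(time_so_far, fraction_done):
    """Projected total run time, extrapolated from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return time_so_far / fraction_done


# Numbers from item 4: 5 min elapsed, 1 min of CPU time, 25% done.
from_elapsed = dynamic_estimate(5.0, 0.25)
from_cpu = dynamic_estimate(1.0, 0.25)

print(f"from elapsed time: {from_elapsed:.0f} min")
print(f"from CPU time:     {from_cpu:.0f} min")
```

The same extrapolation also explains the hibernation problem: if the hours spent asleep are counted as elapsed time, `time_so_far` is far too large and the projection is inflated in the same proportion.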

Edit: Changeset 16609

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

I submitted a note to BOINC Dev, combining Dr. Anderson's original change note with your detective work, Sir Bluesilvergreen ... {edit} and Richard, sorry ... :){/edit}

Quote:


The changes to the scheduler policies may have missed something... notes 2 and 4 of Dr. Anderson's change note pertain... the issue seems to be that the wall time continues to accumulate even during hibernation, which introduces a potentially massive bias into the DCFs. If nothing else, it likely introduces oscillation into the DCF calculations, as tasks run to completion without hibernation and tasks interrupted by hibernation war over the correct DCF...

From the Einstein@Home NC forums:

Quote:

Example: I had v6.6.3 installed, and the DCF went down by 10% after each finished WU. However, it rose sharply whenever a period of hibernation (overnight) fell in between, because the CPU time was normal but the wall time was several times greater than the CPU time.
When I looked at the estimated time left for the 4 WUs, it had already jumped to a high value right after resuming from hibernation.
So when one of these WUs that was "interrupted" by hibernation finishes, it causes the DCF to increase; I guess the wall time is being used to calculate the DCF instead of the real CPU time. And because I have a quad-core, there are 4 of these WUs that cause the DCF to rise. After these 4 WUs have finished, the next WU, whose wall time and CPU time are nearly the same, decreases the DCF by 10%.
...

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2963859029
RAC: 712000

I have a note from David Anderson:

Quote:
I fixed this problem (will appear in next release)
-- David


and some new code in changeset 17154:

    // Normally this is called every second.
    // If delta_t is > 10, we'll assume that a period of hibernation
    // or suspension happened, and treat it as zero


Sounds like another band-aid, which might just work for hibernation - but don't adjust the system clock backwards, because he has (as always) assumed monotonic time increases, and not taken abs(delta_t).
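To make the difference concrete, here is a rough Python sketch of both variants of the check. It is not the code from the changeset: only the 10-second threshold comes from the quoted comment, while `add_elapsed`, `use_abs` and the sample timestamps are invented for illustration.

```python
MAX_PLAUSIBLE_GAP = 10.0  # seconds; threshold taken from the quoted comment


def add_elapsed(elapsed, prev_time, now, use_abs=False):
    """Return the updated elapsed time after one polling step.

    With use_abs=False only large forward jumps are discarded (the check
    described in the quoted comment). With use_abs=True large backward
    jumps are discarded as well.
    """
    delta_t = now - prev_time
    gap = abs(delta_t) if use_abs else delta_t
    if gap > MAX_PLAUSIBLE_GAP:
        delta_t = 0.0  # assume hibernation, suspension or a clock change
    return elapsed + delta_t


normal = add_elapsed(100.0, 1000.0, 1001.0)                  # one-second tick
hibernated = add_elapsed(100.0, 1000.0, 1000.0 + 8 * 3600)   # overnight gap
clock_back = add_elapsed(100.0, 1000.0, 400.0)               # clock set back 10 min
clock_back_abs = add_elapsed(100.0, 1000.0, 400.0, use_abs=True)

print(normal, hibernated, clock_back, clock_back_abs)
```

In this sketch the plain `delta_t > 10` test handles the overnight gap, but a clock moved back by ten minutes still drives the elapsed time negative unless the absolute value is tested.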

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2963859029
RAC: 712000

Probably the final word on this subject:

Quote:

Good point. I checked in this change.
-- David

Richard Haselgrove wrote:
> I see changeset 17154 - thank you for the prompt attention.
>
> But shouldn't the test be abs(delta_t) >10, otherwise the elapsed time
> will be messed about by other system events, such as a clock re-sync
> backwards: similar to ticket #588 (the less-serious problem in the
> linked thread, rather than the primary problem in the ticket)?


Well done to all, especially Bluesilvergreen for recognising that the problem existed and persevering with the research, and Gary and Paul for prodding him with pertinent questions until it all fell into place.

Now all we need is a new BOINC release so we can all use the fix!

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

Ow! Ow! O! ... my arm hurts ...

:)
