You have selected to receive work from other applications if no work is available for the applications you selected

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118664470466
RAC: 19152436

RE: ... show what would

Message 84251 in response to message 84250

Quote:
... show what would happen to the DCF when an R3 is crunched again after an R4. I believe that 10% comes into the adjustment, but I'm not sure how.

When a task takes significantly longer than the estimate, BOINC becomes very concerned about deadlines. It will make the full change in one hit to prevent further downloads of tasks which might not make the deadline. When a task is significantly faster than the estimate, there are no adverse deadline issues. BOINC can afford to take its time in correcting the estimate, so it does so in 10% steps each time.

In the previous example, if an R3 task is estimated to take 30 hrs but is actually completed in 5 hrs, BOINC will deduct 10% of the 25 hour discrepancy (ie 2.5 hrs) from all future estimates for ALL further R3 tasks. So immediately after the 5 hr R3 task finished, the other R3 tasks would show a 27.5 hour estimate. BOINC actually does this by changing the value of the DCF which in turn results in the change in estimated crunch time.
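
A minimal sketch of the adjustment rule described above, written in Python. It assumes the simplified behaviour given in this post (a full jump when a task runs long, a 10% step when it runs short); the function name `update_dcf` is hypothetical and this is not the actual BOINC client code, which may apply further thresholds.

```python
def update_dcf(dcf: float, estimated_hours: float, actual_hours: float) -> float:
    """Return the new DCF after one task finishes.

    estimated_hours is the estimate shown before the task ran,
    i.e. it already includes the current dcf.
    """
    # The DCF that would have made the estimate exactly right.
    target = dcf * actual_hours / estimated_hours
    if actual_hours > estimated_hours:
        # Task ran long: make the full change in one hit.
        return target
    # Task ran short: close only 10% of the gap.
    return dcf + 0.1 * (target - dcf)


old_dcf = 1.0
new_dcf = update_dcf(old_dcf, estimated_hours=30.0, actual_hours=5.0)
# Remaining R3 tasks are re-estimated with the new DCF.
new_estimate = 30.0 * new_dcf / old_dcf
print(round(new_estimate, 2))
```

With the numbers from the example (30 hr estimate, 5 hr actual), the remaining R3 tasks come out at 27.5 hrs, matching the 2.5 hr deduction.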

Since the estimated crunch times for both R3 & R4 tasks are tied to this same DCF parameter, the 10% correction will also (quite wrongly) be made to the estimated crunch times for R4 tasks. The estimate will be progressively reduced again each time an R3 result is completed. I called this effect "yoyoing" in the other thread. You could prevent this behaviour if there were separate DCFs for the different science runs. Alternatively, the plan could be to adjust the estimate built into the task every time a new app with a different performance is rolled out. That may be rather tedious for the Devs to attempt to do.
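
To see the effect of a single shared DCF, here is a rough Python simulation. The base estimates and actual run times are illustrative assumptions chosen to match the example in this thread (R3: 30 hrs estimated, 5 hrs actual; R4: 12 hrs actual), and the names `BASE_ESTIMATE`, `ACTUAL`, `step` and `simulate` are made up for this sketch.

```python
# Hours estimated at DCF = 1.0 and hours actually taken (assumed values).
BASE_ESTIMATE = {"R3": 30.0, "R4": 12.0}
ACTUAL = {"R3": 5.0, "R4": 12.0}


def step(dcf: float, run: str) -> float:
    """Apply the simplified DCF rule after one task of the given run."""
    estimate = BASE_ESTIMATE[run] * dcf
    target = ACTUAL[run] / BASE_ESTIMATE[run]
    if ACTUAL[run] > estimate:
        return target                      # ran long: full correction
    return dcf + 0.1 * (target - dcf)      # ran short: 10% step


def simulate(sequence, dcf=1.0):
    """Return (run, dcf, R4 estimate in hours) after each completed task."""
    history = []
    for run in sequence:
        dcf = step(dcf, run)
        history.append((run, round(dcf, 4), round(BASE_ESTIMATE["R4"] * dcf, 2)))
    return history


for row in simulate(["R3", "R3", "R4", "R3"]):
    print(row)
```

Each completed R3 drags the R4 estimate down (12 to 11 to 10.1 hrs in this run), and the next completed R4 snaps it straight back to 12 hrs. Keeping one DCF per science run would avoid the coupling.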

If you maintain a reasonably small cache of work, these gyrations, whilst being a nuisance, probably don't amount to much. It becomes a real pain if you are caught with a large cache setting and then you get a string of low estimated R4 tasks which can't be completed within the deadline (they take 12 hrs not 2 hrs in our example). Then you would need to abort some excess tasks, after lowering the cache setting to a more friendly value.
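
The cache problem is easy to put numbers on. The sketch below is only back-of-envelope arithmetic under loud assumptions: one core crunching around the clock, every task sharing the same deadline counted from the moment of download, and a 14 day deadline picked purely for illustration. The function name `overfetch` is hypothetical.

```python
def overfetch(cache_days, estimated_hours, actual_hours, deadline_days):
    """Return (tasks fetched, days really needed, tasks that miss the deadline)."""
    # The client fills the cache using the (too low) estimate.
    tasks_fetched = int(cache_days * 24 // estimated_hours)
    # Time those tasks really take on a single core.
    days_needed = tasks_fetched * actual_hours / 24
    # How many tasks can really finish before the deadline.
    tasks_in_time = int(deadline_days * 24 // actual_hours)
    excess = max(0, tasks_fetched - tasks_in_time)
    return tasks_fetched, days_needed, excess


print(overfetch(cache_days=10, estimated_hours=2, actual_hours=12, deadline_days=14))
print(overfetch(cache_days=1, estimated_hours=2, actual_hours=12, deadline_days=14))
```

Under these assumptions a 10 day cache pulls in 120 tasks that need 60 days of crunching, so 92 of them cannot meet the deadline, while a 1 day cache pulls in 12 tasks that all finish in time.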

Cheers,
Gary.

gaz
Joined: 11 Oct 05
Posts: 650
Credit: 1902306
RAC: 0

hi gary thanks the light is

hi gary
thanks the light is back on
garry

Odd-Rod
Joined: 15 Mar 05
Posts: 38
Credit: 20492793
RAC: 45063

RE: In the previous

Message 84253 in response to message 84251

Quote:

In the previous example, if an R3 task is estimated to take 30 hrs but is actually completed in 5 hrs, BOINC will deduct 10% of the 25 hour discrepancy (ie 2.5 hrs) from all future estimates for ALL further R3 tasks. So immediately after the 5 hr R3 task finished, the other R3 tasks would show a 27.5 hour estimate.


Thanks! Now I've got the full picture.

Quote:

If you maintain a reasonably small cache of work, these gyrations, whilst being a nuisance, probably don't amount to much.


Quite so. If the DCF is too high, you don't get more work than you can finish before the deadlines, and you will still get R3 work. Since the DCF drops slowly, a couple of R3s in a row shouldn't cause problems.

Quote:

It becomes a real pain if you are caught with a large cache setting and then you get a string of low estimated R4 tasks which can't be completed within the deadline (they take 12 hrs not 2 hrs in our example). Then you would need to abort some excess tasks, after lowering the cache setting to a more friendly value.


Fortunately, I'm always connected, so my hosts have only between 0.01 and 0.05 days 'additional work buffer' (about 14 and 72 minutes). I'm still experimenting with these values. I have found that if you have no cache you can have idle time while requesting more work - especially if you have several projects with no work available. Only my son's PC has a 1 day buffer, and that's because he sometimes takes it to friends who don't always have an internet connection.

Thanks for the great answer, Gary.

Regards
Rod

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 1

There's a fix for this

There's a fix for this message in SVN now. See [trac]changeset:15847[/trac]

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118664470466
RAC: 19152436

RE: There's a fix for this

Message 84255 in response to message 84254

Quote:
There's a fix for this message in SVN now. See [trac]changeset:15847[/trac]

If I didn't ear-bash you with effusive thanks, I'm sure you'd never know how really, really useful that link has been. Nothing to do with the actual bug causing the annoying messages, of course :-).

The link reminded me that the type of scheduling we have here at EAH is referred to as "Locality Scheduling". I had forgotten the technical term so I had forgotten how to easily review how our scheduling was supposed to work. Throw the two words into Google and up comes the wiki page that tells you all about it and, more importantly, gives some of the implementation details. I found just enough information to suggest a workaround for a particular problem that annoys me with locality scheduling here.

I can feel a new treatise coming on ... :-).

Cheers,
Gary.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325626404
RAC: 74742

RE: I can feel a new

Message 84256 in response to message 84255

Quote:
I can feel a new treatise coming on ... :-).

Go For Gold! :-)

Quote:
Quick Igor, flush the buffers and polish the ports. And make sure the epiphany regulator has a new throttling gasket and the workaround welds have been triple checked. Remember what happened last time when the scheduling breakers tripped - non local effects galore! Cripes, he's been Googling and wandered into the Wiki thicket .... :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118664470466
RAC: 19152436

RE: Go For Gold!

Message 84257 in response to message 84256

Quote:
Go For Gold! :-)

Thanks for the great laugh about Igor's antics :-).

The treatise will take a while but here is a quick summary of the problem and the workaround.

The problem is that the most efficient use of the onboard large data files is not being achieved. When the scheduler runs out of tasks to send for a particular set of data files, it requests that the client mark these files for deletion. Once so marked, you can never get more tasks for that data, even if they subsequently become available by whatever means (usually because someone else misses a deadline).

I was looking at a host that has been receiving R3 resends (ie data is onboard). It had a 3 day cache and 3 days ago, when it made a request for work, it was knocked back and the data files were marked for deletion. It has continued to be knocked back on every subsequent request for tasks (it's enabled for R3 only). I noticed it when it had about 3 hours to go on its very last R3 task. Having read about the tag, I decided to stop BOINC and remove all these tags in client_state.xml. After restarting and setting the cache size appropriately, I was able to get greater than 10 days of additional tasks for the very data that had been marked for deletion days ago. Obviously, during the last three days, many additional resends have become available and are not being sent out very rapidly since the scheduler is waiting for suitably equipped hosts (which are now becoming scarce).

We need a way of temporarily turning off the marked_for_delete mechanism during the cleanup phase of a run so that many more hosts stay suitably equipped for longer. Now that suitably equipped hosts are probably quite scarce, the scheduler will soon have the inefficiency of sending out complete sets of data for single tasks. At least now quite a few of my hosts are taking them in sizeable batches with only the occasional extra frequency increment data being needed.

Cheers,
Gary.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325626404
RAC: 74742

RE: Thanks for the the

Message 84258 in response to message 84257

Quote:
Thanks for the great laugh about Igor's antics :-).


Every home laboratory/dungeon/attic should have one. Call Igors-R-Us ( in the Yellowing Pages ) or www.igors.gothic2U or via RavenNet .... :-)

Quote:
The treatise will ..... increment data being needed.


Yup, there's value in more 'gradual grace' before the delete calls. This would depend on a higher level evaluation of when a given 'data band' has been completely fulfilled on a project wide scale. This is a subtler question which goes beyond a single, or even a small set of, host/server exchange(s).

Cheers, Mike.


MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139002861
RAC: 0

RE: Hmmm... DUHHH....

Message 84259 in response to message 84238

Quote:

Hmmm...

DUHHH.... Alinator!! They are two different projects! Never mind...

Yes, this could be the "ignore the message because it's bogus" scenario. Still, his last comment doesn't fit the pattern though. :-?

Get more coffee! :-D

Alinator

Hi Alinator,

I left the other project messages in there to show that Einstein said it was giving 1 new task, but there are no messages about downloading the task.

Anyway, I think it's easier to just ignore the messages for the time being. Apart from the advantages of not downloading lots of extra stuff, it also gets around the DCF issue as I only have S5R4's now.

If there are a lot of resends I could set up an app_info on maybe one host. I still have the power-user app (see earlier messages in this thread). Are there lots of S5R3 resends?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118664470466
RAC: 19152436

RE: I left the other

Message 84260 in response to message 84259

Quote:

I left the other project messages in there to show that Einstein said it was giving 1 new task, but there are no messages about downloading the task.

Nor should there be since this project is quite different in the way tasks are distributed. You already have a package of large data files. There are no more files to download. There are just a few flag values for the science app to use when it operates on the large data files you already have. Those flag values are part of the exchange with the scheduler.

Did you see the bit that said

Quote:
Scheduler request succeeded: got 1 new tasks

This bit categorically, undeniably, and without a shadow of a doubt, and in plain English what's more, informs you that you did actually get a task :-).

There's no use denying the charge - it's an open and shut case and the jury has already reached its verdict :-).

Quote:
Anyway, I think it's easier to just ignore the messages for the time being. Apart from the advantages of not downloading lots of extra stuff, it also gets around the DCF issue as I only have S5R4's now.

The red messages have nothing to do with your ability or otherwise to get either R3 or R4. You get the same messages in both cases. It's simply an annoying bug that otherwise is harmless. Since you have completely transitioned to R4 (no R3 left) there is not much you can do right now to get more R3 tasks. I suspect that they might be a bit more available in the near future and then taper off again in a couple of weeks.

Quote:
If there are a lot of resends I could set up an app_info on maybe one host. I still have the power-user app (see earlier messages in this thread). Are there lots of S5R3 resends?

There are lots of resends at the moment. They are common but not readily available. In the last two days I would have snagged about 300 or so. Each machine I had grabbing the resends stopped when its cache was full (10+ days worth) and not because the resends ran out. One machine just grabbed a further 30. Unfortunately there is quite a trick to encouraging the scheduler into sending them. It only wants to send the work to hosts that already have the requisite large data files onboard. If a previous exchange with the scheduler has marked the data files for deletion, you won't get resends unless you "unmark" the large data files. If they've actually been physically deleted rather than just "marked", it's too late anyway. All this requires surgery on client_state.xml and I'm not about to encourage that.

Some of the resends I've received have been on the server for a while (almost 24 hours in one case I saw). If there are insufficient hosts with the right data files, the scheduler will get sick of waiting and will send the full package of resends and data to any host. I saw a machine in that category today. It was crunching R4 data (no R3 onboard) and it made a work request. The scheduler sent an R3 resend plus all the R3 data. Once it had the data, it was then able to get a heap more R3 tasks for that data.
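
For anyone who wants the behaviour described above in compact form, here is a toy Python model of the decision the scheduler appears to make for a resend. It is a sketch of what I've observed, not the real scheduler code: the class names, the `decide` function and especially the `MAX_WAIT_HOURS` value are assumptions for illustration (the only data point is that one resend sat on the server for almost 24 hours).

```python
from dataclasses import dataclass, field

MAX_WAIT_HOURS = 24.0  # assumed cut-off; the real value is not known here


@dataclass
class Host:
    name: str
    files: set = field(default_factory=set)              # large data files on disk
    marked_for_delete: set = field(default_factory=set)  # files flagged for deletion


@dataclass
class Resend:
    data_file: str        # large data file the task needs
    hours_waiting: float  # how long the resend has sat on the server


def decide(host: Host, resend: Resend) -> str:
    """Return what the toy scheduler does with this resend for this host."""
    usable = host.files - host.marked_for_delete
    if resend.data_file in usable:
        return "send task only"
    if resend.hours_waiting >= MAX_WAIT_HOURS:
        return "send task plus data files"
    return "hold for a better-equipped host"


equipped = Host("has-data", files={"r3_band_a"})
marked = Host("marked", files={"r3_band_a"}, marked_for_delete={"r3_band_a"})
print(decide(equipped, Resend("r3_band_a", 1.0)))
print(decide(marked, Resend("r3_band_a", 1.0)))
print(decide(marked, Resend("r3_band_a", 30.0)))
```

The second case is the one that hurts: the data is still on disk, but because it is marked for deletion the host is treated as if it had nothing.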

Cheers,
Gary.
