Boinc overestimating WU time

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519820057
RAC: 19570
Topic 215129

Recently (the last few days) I've noticed BOINC insisting my AMD GPU will take about 40-50 minutes to complete an Einstein WU. It actually takes 8-9 minutes. Despite completing one every 8-9 minutes, it refuses to lower the estimated time for future similar WUs in the cache.

Two things have changed recently that may be the cause: 1) A new version of BOINC was installed. 2) I persuaded the built-in Intel graphics to also run Einstein (these are estimated correctly at 15 minutes). I wasn't doing this before as my motherboard refused to run discrete and built-in graphics simultaneously, but a new driver fixed it.

I also have an Nvidia card running SETI, but that's been there for a year without causing this problem.

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2982510497
RAC: 752899

Replied to your question at BOINC_dev. Your option (2) is the cause - see https://einsteinathome.org/content/not-getting-gpu-wus-anymore#comment-165295 for the explanation.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7279281713
RAC: 1983915

Peter Hucker wrote:

2) I persuaded the built-in Intel graphics to also run Einstein

I also have an Nvidia card running SETI, but that's been there for a year without causing this problem.

In many cases, BOINC is not smart enough to generate accurate estimates for diverse workloads.

If you want to test that explanation, just stop running ALL other kinds of BOINC-supervised work except the one of interest. Assuming that work has consistent execution times (not all do), you'll soon see the estimate heading in the right direction.

So long as you set a small enough work queue size request, this won't matter in most cases, unless you are trying to bridge across expected work availability outages.

 

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519820057
RAC: 19570

I've got everything set to 1+1 days of cache, so it shouldn't cause a problem.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118443625062
RAC: 25897025

Peter Hucker wrote:
I've got everything set to 1+1 days of cache, so it shouldn't cause a problem.

Without knowing how many different projects run on this machine and what resource share each one has, the 1+1 days setting could easily be a significant part of the problem.  1+1 means BOINC won't start requesting more work until it gets below what it thinks is 1 day of work.  If you really want a steady 2 days of work, you should set 2+0.  With 1+1, you have set a hi-water mark of 2 days and a lo-water mark of 1 day.  A work request will attempt to fill up to 2 days' worth and there won't be a new request until what remains drops below the lo-water mark.  Is that really what you want - an oscillating work cache?
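To make the oscillation concrete, here's a toy sketch of the hysteresis (illustrative only - the function and numbers below are mine, not BOINC's actual work-fetch code):

    # Toy model of an X+Y cache setting: X is the minimum buffer
    # (lo-water mark) and X+Y is the fill target (hi-water mark).
    def work_to_request(queued_days, min_days, extra_days):
        if queued_days >= min_days:
            return 0.0                                # above lo-water: no request
        return (min_days + extra_days) - queued_days  # fill up to hi-water

    print(work_to_request(1.1, 1, 1))   # 0.0  -> 1+1 sits quiet between 1 and 2 days
    print(work_to_request(0.9, 1, 1))   # 1.1  -> then gulps over a day's work at once
    print(work_to_request(1.9, 2, 0))   # ~0.1 -> 2+0 tops up in small, frequent sips

With 1+1 the requests are rare but large; with 2+0 they are frequent but tiny.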

If there are several projects sharing your computer and a 1 day increase in work cache is suddenly triggered by the lo-water mark being reached, and if the preferred project can't supply at that particular time, you could easily end up with a reliable project like Einstein supplying a large proportion of that 1 day increase.  Is that really what you want to risk?  If you have a fixed X+0 setting, BOINC will always be requesting in small sips when the work level drops below X rather than large 1 day gulps.

In your opening message, you said:-

Quote:
Recently (the last few days) I've noticed BOINC insisting my AMD GPU will take about 40-50 minutes to complete an Einstein WU. It actually takes 8-9 minutes. Despite completing one every 8-9 minutes, it refuses to lower the estimated time for future similar WUs in the cache.

Also, you stated over on the BOINC boards that you ended up with so much work that BOINC was estimating 10 days to complete it when the true time to completion was just 2 days.  So with a two day cache, and a 45 min estimate that's not being reduced, how could BOINC ask for enough work to get to 10 days @ 45 min per task?  One possible way would be that BOINC actually did what it was supposed to do and (by reducing the DCF) reduced the estimate from ~45 mins to much closer to 9 mins (when you weren't looking and so didn't notice), then hit the lo-water mark and filled to the full 2 days in a big hurry.  Then an internal GPU task finished and reset the DCF upwards so that the 2 day cache suddenly became a 10 day cache.  It's probably no coincidence that there is a factor of 5x between both 2->10 days and 9->45 mins.  Your 1+1 setting could have contributed significantly if this is actually what happened.
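To illustrate that suspected sequence with the numbers from this thread (a sketch of the idea only - BOINC's real DCF adjustment is more gradual than this):

    # One DCF per project scales ALL of that project's estimates at once.
    server_estimate = 45.0         # minutes: the raw estimate for an AMD GPU task
    dcf = 9.0 / server_estimate    # tasks finishing in ~9 min drag the DCF to ~0.2
    print(server_estimate * dcf)   # 9.0  -> cache then fills to a 'real' 2 days

    dcf = 1.0                      # a slower task type finishes; DCF snaps back up
    print(server_estimate * dcf)   # 45.0 -> every queued task now 'takes' 5x longer
    print(2 * 45.0 / 9.0)          # 10.0 -> and the 2-day cache reads as 10 days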

Imagine what would happen if you had a 1+0 cache setting.  BOINC would be making regular (but small) work requests as the estimate for AMD GPU tasks progressively became lower.  There would never be the opportunity to get a full day's worth of tasks in one big hit.  Chances are the small requests would be shared around whatever projects are supported on that host.  BOINC should not be able to go into high-priority mode.

It's always going to be difficult for BOINC to react sensibly if you have a number of projects and your work cache settings are such that it's possible for one project to supply a big bunch of tasks all at once.  It's also prudent to lower your cache setting when you support multiple projects.  Because Einstein is quite reliable (in the main) it will become the 'go to' project if others can't supply at a particular point in time.  With whatever mix you have you should start small and only increase if you are very sure that BOINC is fully able to cope.

 

Cheers,
Gary.

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519820057
RAC: 19570

Gary Roberts wrote:
Without knowing how many different projects run on this machine and what resource share each one has, the 1+1 days setting could easily be a significant part of the problem.  1+1 means BOINC won't start requesting more work until it gets below what it thinks is 1 day of work.  If you really want a steady 2 days of work, you should set 2+0.  With 1+1, you have set a hi-water mark of 2 days and a lo-water mark of 1 day.  A work request will attempt to fill up to 2 days' worth and there won't be a new request until what remains drops below the lo-water mark.  Is that really what you want - an oscillating work cache?

Yes, the reason I do that is so it's not constantly pestering servers for downloads.  Surely it's better to download a day or so at once than 1 WU at a time?  And I'm not that concerned if Milkyway occasionally has no work because its server is down - it just takes Einstein instead.

Gary Roberts wrote:
you ended up with so much work that BOINC was estimating 10 days to complete it when the true time to completion was just 2 days.  So with a two day cache, and a 45 min estimate that's not being reduced, how could BOINC ask for enough work to get to 10 days @ 45 min per task?  One possible way would be that BOINC actually did what it was supposed to do and (by reducing the DCF) reduced the estimate from ~45 mins to much closer to 9 mins (when you weren't looking and so didn't notice), then hit the lo-water mark and filled to the full 2 days in a big hurry.  Then an internal GPU task finished and reset the DCF upwards so that the 2 day cache suddenly became a 10 day cache.  It's probably no coincidence that there is a factor of 5x between both 2->10 days and 9->45 mins.  Your 1+1 setting could have contributed significantly if this is actually what happened.

How does BOINC request a certain amount of work?  Is the amount calculated at my end or the server end?  Since the above has only just started happening, I think the request was made before the problem, so it got 2 days' worth, then later called it 10.  Now that it miscalculates, will it ask for 2 days correctly calculated by the server, or 2 days as calculated by BOINC, which would really be 0.4 days?  Either is OK by me.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118443625062
RAC: 25897025

Peter Hucker wrote:
... Surely it's better to download a day or so at once than 1 WU at a time?

I used to think like that ... until I looked into the reason why a scheduler request for a bunch of work seemed to be taking such a long time to be processed.  This was quite a long time ago, my recollection might be a bit hazy and, behind the scenes, the Devs may well have tweaked things differently.  However, I don't know of any essential difference these days so I suspect this still applies.

I also have very recent experience in seeing how long it took to process a work request for a host that was many hours into a 24 hour backoff and had accumulated of the order of 100 completed tasks (and whose work cache was therefore 100 tasks short at the time).  After I hit 'update' to force things, it took several minutes - I went away, made a cup of coffee, and came back to find nothing had happened.  Eventually, a reply was received, all OK, and new tasks were flooding in.

When I first investigated why it took so long (probably a couple of years ago) I used the 'last contact' website link for the host to see what the scheduler logs said.  I don't claim to be an expert in reading these logs, but my impression at the time was that the scheduler was assembling a list of new tasks to send on an individual, task-by-task basis and performing a series of tests and checks to make sure each individual task met all the requirements to be an appropriate task to send.  There were also log entries about checking for the availability of any 'resend' tasks before considering new tasks.  I got the distinct impression that there was nearly as much work in sending a number of tasks in a single request as there was in handling the same number of individual requests for a single task each time.  No doubt there would be a saving in having a single request but not as much as you might imagine.

There is also another factor to consider.  If you ask for a full day's work in a single request, that's not the only request you make for the day.  If you are completing and uploading tasks every 9 minutes, your BOINC client will be wanting to 'report' these at reasonably regular intervals.  I think the client does this about every hour even if it doesn't need more work.  If you combine a 'top-up' work request with the reporting of what has been completed, you probably have the best compromise as far as impact on the servers is concerned.  In a perfect world where the task estimate (DCF) isn't fluctuating wildly, you could simulate hourly completed-task reporting and cache top-up with something like a 1+0.04 cache setting.  With that, the difference between hi-water and lo-water is one hour :-).  Unfortunately, the world is far from perfect.

Peter Hucker wrote:
How does BOINC request a certain amount of work?  Is the amount calculated at my end or the server end?

Your client is in control.  It uses current estimates to work out when the cache drops below the lo-water mark and when that happens it makes a big enough request to at least reach the hi-water mark.  With GPU tasks in particular, the scheduler doesn't seem to send everything in one transaction.  The client will then make a series of diminishing requests every minute until the cache is full.  My experience is that when you have an X+0 cache, the client gets what it needs in a single request since it's only asking for a tiny amount.  If I change a cache setting by a reasonable amount (e.g. going from 1 day to 1.25 days, not using additional days) the client will end up making about 6 to 8 requests in total until the cache is full.  That's another reason for understanding that having a large gap between hi and lo isn't really saving the number of scheduler contacts you think it is.
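A rough sketch of that division of labour (my simplification, with made-up names - the real client code is more involved):

    SECS_PER_DAY = 86400

    # Client side: work out the shortfall in seconds and ask for that much.
    def request_secs(queued_secs, min_days, extra_days):
        if queued_secs >= min_days * SECS_PER_DAY:
            return 0
        return (min_days + extra_days) * SECS_PER_DAY - queued_secs

    # Server side: it converts those seconds into a task count using the
    # estimate the client reported, so an inflated estimate means fewer tasks.
    req = request_secs(77760, 1, 1)   # 1+1 cache with 0.9 days queued
    print(req / SECS_PER_DAY)         # 1.1 -> over a day's worth in one go
    print(req // (45 * 60))           # 35 tasks sent at a 45 min estimate
    print(req // (9 * 60))            # 176 tasks at the true 9 min figure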

Peter Hucker wrote:
Since the above has only just started happening, I think the request was made before the problem, so it got 2 days' worth, then later called it 10.

Yes, that's exactly what I was trying to point out :-).  The request for 2 day's worth was made at a time when your client 'knew' that each task would take around 9 mins.  A little later, your client then 'knew' that each task was going to take 45 mins.  That's not a miscalculation.  It's the client trying to adapt to the possibility that the work content of a task might have changed.  Whilst we know it hasn't, there is no way for the client to know.  It has been crippled with a single DCF to manage all different searches for a given project.

Peter Hucker wrote:
Now that it miscalculates, will it ask for 2 days correctly calculated by the server, or 2 days as calculated by BOINC, which would really be 0.4 days?  Either is OK by me.

It's the client that works out the estimated crunch time based on DCF.  The client tells the server what the cache shortfall is in secs and what the estimated crunch time currently is.  The server works out how many of which type to send to meet the request.  Thinking about why the server can need several consecutive responses to send all the required tasks, perhaps the server isn't compensating for task concurrency.  That would make sense.  If you told the server the estimate was X secs and the GPU could actually do two tasks in that time, maybe the server just sends half the number really required each time because it doesn't know about the concurrency.  That would perfectly explain why the GPU task requests I see can get filled in a number of consecutive and diminishing requests as I described above.  All the GPUs I monitor are running tasks 2x or 3x.
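To sketch that conjecture (just my guess formalised - not documented scheduler behaviour, and the halving factor is an assumption):

    # If the server doesn't know two tasks share the GPU, each reply only
    # covers about half of the remaining shortfall, so the cache converges
    # over a series of diminishing requests.
    shortfall_days = 1.0
    requests = 0
    while shortfall_days > 0.02:               # stop within ~half an hour
        shortfall_days -= shortfall_days / 2   # grant half of what's needed
        requests += 1
    print(requests)   # 6 -> in the ballpark of the 6-8 requests I see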

At the end of the day, it's perfectly OK for you to run your machine as you see fit.  I'm certainly not giving you a lecture or trying to get you to change your ways.  I was just trying to explain what I think is happening and how you could make changes if you wished to mitigate the effects of DCF swings.  You really don't need BOINC mindlessly going into panic mode whenever it feels like it, so I was just trying to explain how that could have happened and what you could do to prevent it in future.

There is one other idiosyncrasy of the work fetch system you might be interested in thinking about.  Since you said your AMD GPU tasks only take 9 mins, I believe you only run one at a time. (I'm guessing that two tasks in 9 mins is beyond the capabilities of your GPU :-).)  If you ran two tasks simultaneously, you would probably be able to get two tasks in about 15 mins (rather than two consecutive tasks taking 18 mins).  You get a performance improvement but the real point of the exercise is that BOINC has a higher estimate (~15 mins) and the DCF hasn't been dragged right down.  The estimate may still swing up to 45 mins but that's only a factor of 3 rather than 5 so when that next happened BOINC would see 6 days rather than 10 days of work - less likely to go into panic mode.
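The arithmetic behind that, using the numbers above:

    print(45 / 9)    # 5.0  -> DCF swing factor running tasks one at a time
    print(45 / 15)   # 3.0  -> swing factor when two run together
    print(2 * 5.0)   # 10.0 -> days of apparent work after a snap-back at 1x
    print(2 * 3.0)   # 6.0  -> the same event at 2x; less alarming to BOINC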

There are quite a few ways to mitigate the 'single DCF' problem.  I'm not trying to tell you what's best.  I'm just explaining what I've observed over time.  No guarantees it's fully correct or the total picture, but it has worked for me.  I hope you have success with getting your machine to run the way you wish.

 

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2982510497
RAC: 752899

Many years ago, Rom Walton wrote a blog post called "BOINC Client: The evils of 'Returning Results Immediately'". These links take you to the original blog and the follow-up.

That's a long time ago (October 2006), and much has changed since then, but the point was that much of the server (scheduler) overhead wasn't in the C code checking each test on a potential task, but in the database load of opening and closing connections and setting up the initial query.  I think that may have been part of the reasoning behind setting up the 'additional work' hysteresis fetch pattern a few years later.

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519820057
RAC: 19570

" If you ran two tasks

" If you ran two tasks simultaneously, you would probably be able to get two tasks in about 15 mins (rather than two consecutive tasks taking 18 mins)." 

Just had a check (in the interests of getting more work done) on my computer with a Radeon R9 290, and Milkyway uses almost 100% GPU time according to MSI Afterburner.  Einstein, however, looks more like half (it jumps between 0% and 100%), so I stuck Einstein on 2 at a time, and now it's on 100%.  It's doing two tasks in 13 minutes instead of one in 8.25.  So a 27% increase in productivity.
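Checking that arithmetic (just the figures above, nothing more):

    single = 1 / 8.25        # Einstein tasks per minute, run one at a time
    double = 2 / 13.0        # tasks per minute, run two at a time
    print(f"{double / single - 1:.0%}")   # 27% -> more tasks per hour at 2x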

I thought I'd try Milkyway as well, and that got a 22% increase in productivity, which is odd: apart from gaps of a second or two between tasks, the GPU claimed to be fully loaded with only 1 task according to Afterburner.  Also, when I ran 2 Einstein tasks, the fan immediately got a lot louder!  But not so with Milkyway.

And why doesn't BOINC automatically run concurrent GPU apps?

I just forced it to run a Milkyway and an Einstein at once on the same GPU by suspending all tasks except one of each.  They also fully utilised it nicely.  I was wondering if the single-precision and the double-precision cores inside the GPU would run at once (Einstein is SP and Milkyway is DP) and give me even more power, which I think they do.  I get an estimate of 38% faster, compared to only 27% faster for two Einsteins or 22% faster for two Milkyways.  I've now set both projects to the same priority, detached and reattached them both to reset BOINC's idea of which one should currently be doing more work, and it's automatically running one of each at the same time :-)

OK, change that: two of each.  Even faster.  In the app config file I set max concurrent = 2 for each project, but 0.25 GPU usage, so it's forced to always run two of each, unless one has no work available, in which case it'll just do two of the other one.  Although sometimes it only runs 2+1 WUs at a time.  Strange.  I've changed the config file to say 0.2 GPU usage each, and it runs 4 again.  It seemed to be trying to leave a bit of GPU free?
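For reference, this is the shape of the app_config.xml stanza I mean (the app name below is a placeholder - the real names are in client_state.xml or on the project's applications page):

    <app_config>
      <app>
        <name>hsgamma_FGRPB1G</name>        <!-- placeholder: use the project's real app name -->
        <max_concurrent>2</max_concurrent>  <!-- at most 2 of these at once -->
        <gpu_versions>
          <gpu_usage>0.2</gpu_usage>        <!-- 0.2 of a GPU per task, as settled on above -->
          <cpu_usage>0.5</cpu_usage>        <!-- supporting CPU per task; measure your own -->
        </gpu_versions>
      </app>
    </app_config>

The file goes in the project's folder under the BOINC data directory, one per project, and is picked up via the Manager's 'Read config files' option or a client restart.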

Done the same thing on another computer, with a newer but less powerful Radeon RX 560, and giving it more than one task of any project slows it down significantly!


Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519820057
RAC: 19570

After all that yesterday, I noticed later in the day it wasn't going as fast as I thought, so I took readings again, and again this morning.  Now instead of 22-38% faster, concurrent tasks only give me 7-15% faster.  How odd - was the GPU getting tired?  It wasn't overheating.  I think I'll go back to single tasks per GPU.  I have, though, kept in the config files the CPU usage I measured myself for every type of GPU task, as the values supplied by the server are completely off.  Then I've lowered the number of CPU cores BOINC is allowed to use until the CPU isn't maxed out, so I know it'll never throttle a GPU.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118443625062
RAC: 25897025

Richard Haselgrove wrote:
... much of the server (scheduler) overhead wasn't in the C code checking each test on a potential task, but in the database load of opening and closing connections and setting up the initial query.

Hi Richard,
Thanks for the links.  I browsed both, and also the reference to miw's feedback from the second link.  I'm quite illiterate when it comes to some of the points being debated, but I do understand at a pretty basic level how much database connections and transactions affect the load on (and responsiveness of) a project's servers.  Also, I got the impression that a lot of what was fueling the heated commentary at the time was the "dreaded V5.5.0" BOINC version.  I don't really recall ever knowing anything about that (but I could simply have forgotten) :-).  Yes, I do live under a rock a lot of the time :-).

There were no GPU apps at that time and task run times tended to be more like hours rather than minutes.  A one-day hysteresis may well have been OK for that era, but not for how fast the current GPU tasks at Einstein get crunched.  As I alluded to in what I posted, BOINC will report results after about an hour or so, so you may as well take advantage of that and have a top-up request at the same interval.  Why not a 1+0.1 cache setting rather than 1+1?  I'm not advocating 'return_results_immediately' type behaviour - just let BOINC do it when it wants to, but at an interval more suited to today's conditions.

Another point to consider with 1+1 at Einstein, which hasn't been specifically mentioned in this discussion, is the fact that there are daily limits on task downloads that have become quite restrictive for fast GPUs.  I very recently increased my cache setting from 0.9+0 to 1.1+0, which is really quite a modest increase.  A couple of the fastest machines (not really very fast by today's standards) hit the daily limit and had communications suspended for the remainder of the day.  I had expected this might happen, as I had timed the cache increase to be close to the end of the project 'day' (8AM locally) when the daily limit was already largely used up, so any backoff would be relatively short and not a problem.  A 1 day hysteresis increase could be a much more dramatic disturbance (and random as to timing) for BOINC to deal with.

I always try not to waste server resources in the way I configure my hosts.  Without the freely donated volunteer computing contributions, projects couldn't process the phenomenal amount of data that they do.  If projects need the work done, the least they can do is to attempt to provide the appropriate level of infrastructure to accommodate the load.  If they don't have the funding or manpower for that, perhaps they need to scale down their expectations.  I don't think it should be up to volunteers to be overly paranoid about their impact on the project servers.

 

Cheers,
Gary.
