Deadline Problem

Todd Wright
Joined: 6 Mar 05
Posts: 13
Credit: 39039
RAC: 0

Message 10203 in response to message 10196

> Your computer is slow compared to today's machines. As you stated, there
> is nothing wrong with it. However, your attempt to use this machine for
> this project is misapplied. However strong your desire to support
> Einstein@Home, this machine, with its constraints on operating time, is
> just not the right tool for this job.

And I'm sure that an Intel box (of whatever processor model) is inappropriate hardware on which to run a massive search engine that indexes over 8 billion pages and is searched by over 82 million unique users per month. This kind of processing would surely require a supercomputer. Yet when you combine thousands of these boxes...

Yes, I am (as an example) referring to Google.

Paul, you (and others it seems) forget that this is a parallel process. The computing power is not in the speed of each individual CPU; it is in the sheer number of processors. This is the benefit of distributed computing. E@H would attract more processors (hosts) if the owners of those processors could be assured of being awarded credit for the free (to the project) processing power that costs them actual money to provide.

The problem is also not the size of the work unit. It is (as many have already stated) the short E@H deadline, combined with BOINC not properly reflecting the actual CPU time each work unit requires, which causes it to queue more work than certain platforms can complete by the deadline.

Currently, on a PIII 800 MHz machine, I have 3 E@H work units which have not yet started and will not meet their deadline (today): despite BOINC reporting an estimated CPU time of 15 hours 51 minutes, they take around 21 hours to process. The roughly five-hour shortfall on each work unit has meant that previous work units consumed the time these units were originally allocated.

I also have 1 work unit on an AMD 2500+ (surely this is not an inappropriate processor) with the same problem. Estimated time per WU is 6h30m, but the actual time is closer to 8h.

My other cruncher (dual AMD 1800+) estimates 10h46m and takes around 15 hours per work unit. It too has WUs which will miss the deadline today.

These delays add up over time and, without intervention, eventually cause every newly downloaded work unit to miss its deadline as the queue falls further behind.
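
To make the arithmetic concrete, here is a minimal sketch in Python of how an optimistic per-WU estimate overfills a queue. The figures are the ones quoted above for the PIII 800; the queue sizing is a deliberate simplification, not the real BOINC client logic:

    # Minimal sketch: an optimistic per-WU estimate overfills the queue.
    # Figures are those quoted above for the PIII 800 MHz box.

    est_hours = 15.85      # BOINC's estimate: 15 h 51 m per work unit
    actual_hours = 21.0    # observed crunch time per work unit
    deadline_days = 7
    budget = deadline_days * 24             # 168 h of crunching before the
                                            # deadline, assuming 24/7 operation

    queued = int(budget // est_hours)       # the client thinks 10 WUs fit...
    finished = int(budget // actual_hours)  # ...but only 8 actually finish

    print(f"queued {queued}, finished {finished}, late {queued - finished}")
    # -> queued 10, finished 8, late 2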

Mike
Joined: 20 Feb 05
Posts: 151
Credit: 5536135
RAC: 0

Hi

I mostly agree with it.
Another issue is the ability to connect to the Internet.
I think a longer deadline would be nice, because results are often finished in time but cannot be uploaded.
That has nothing to do with the CPU.
I'm not a big credit freak, but losing too much of it isn't much fun.
Results that came back after the deadline did the same science but get nothing.

greetz Mike

JoeB
Joined: 24 Feb 05
Posts: 124
Credit: 89446568
RAC: 28838

Message 10205 in response to message 10203

Todd,
Without addressing the larger turnaround-time issue, your specific problem seems to be too many work units to complete in time:

> It too has WUs which will miss the deadline today.

Have you tried reducing your "Connect to the internet" time in general preferences? On my machine I have tried everything from 0.6 days (a maximum of 3 "ready to run" WUs at any one time) to 4 days (way too many WUs). I'm now running it at 1 day.
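
For anyone wanting a feel for what that setting does, here is a back-of-the-envelope sketch in Python. It is a rough model of cache size only, not the BOINC client's actual request logic, and the per-WU estimate is an assumed figure:

    # Rough model: how many WUs a "connect to the internet every X days"
    # setting keeps queued. Not the real BOINC scheduler.

    def queued_wus(connect_days, est_hours_per_wu, crunch_hours_per_day=24.0):
        """WUs requested to cover `connect_days` of crunching."""
        work_to_buffer = connect_days * crunch_hours_per_day  # hours of work
        return max(1, round(work_to_buffer / est_hours_per_wu))

    # Illustrative values only, assuming ~6.5 estimated hours per WU:
    print(queued_wus(0.6, 6.5))  # -> 2: a small, safe cache
    print(queued_wus(1.0, 6.5))  # -> 4: a middle ground
    print(queued_wus(4.0, 6.5))  # -> 15: easy to overrun a 7-day deadline
                                 #    once the estimates run short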

Joe B

Todd Wright
Joined: 6 Mar 05
Posts: 13
Credit: 39039
RAC: 0

Message 10206 in response to message 10205

> Have you tried reducing your "Connect to the internet" time in general
> preferences? On my machine I have tried everything from 0.6 days (a maximum
> of 3 "ready to run" WUs at any one time) to 4 days (way too many WUs). I'm
> now running it at 1 day.

JoeB,
I have answered this. Please read my other posts in this thread (perhaps you should familiarise yourself with the issue by reading the whole thread).

Briefly, it is ridiculous to suggest that people change a global (all-projects) setting to a value that defeats that setting's purpose in order to cater for the hard line this one project seems to take with regard to deadlines, and for BOINC's inability to request the appropriate amount of work.

Gareth Lock
Joined: 18 Jan 05
Posts: 84
Credit: 1819489
RAC: 0

Message 10207 in response to message 10202

>
> > Whilst the 64 can keep up, the 1900+ is pushing its deadlines most of
> > the time. Usually it just about makes it, but if there is a power cut,
> > for example, it fails to make the deadline. There have been two of these
> > said power cuts in the last week and I have already lost credit for
> > three late WUs as a result.
>
> I'm wondering if you might need to do some tinkering with your resource share.
> My Athlon 850 is running Einstein, ProteinPredictor, SETI, Pirates
> (occasionally), and a pre-alpha project. It has no problems returning results
> for all projects on time.
>

My resource shares are fine... four projects running, including E@H, all getting an equal share of CPU time... all set at 100% on their respective websites, giving them 25% each. I don't play favourites on the same machine. As I have two hosts, some projects are duplicated on the second host in the same fashion.

Gareth Lock
Joined: 18 Jan 05
Posts: 84
Credit: 1819489
RAC: 0

Message 10208 in response to message 10206

> JoeB,
> I have answered this. Please read my other posts in this thread (perhaps you
> should familiarise yourself with the issue by reading the whole thread).
>
> Briefly, it is ridiculous to suggest that people change a global
> (all-projects) setting to a value that defeats that setting's purpose in
> order to cater for the hard line this one project seems to take with regard
> to deadlines, and for BOINC's inability to request the appropriate amount of
> work.
>
>
I agree wholeheartedly, Todd.

Gareth Lock
Joined: 18 Jan 05
Posts: 84
Credit: 1819489
RAC: 0

To add to the actual-time question, here is the information on the 9th April WU I mentioned, which might still just make it. The estimated time for the other E@H WUs on that 1900+ is 7:51:37. The 9th April WU is already at 8:46:34, at 95%, as I post this.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117503727006
RAC: 35423095

Message 10210 in response to message 10205

> Todd,
> Without addressing the larger turnaround-time issue, your specific problem
> seems to be too many work units to complete in time:

Joe,
This is precisely the problem that Todd seems unable to comprehend. Not only does he apparently not understand it, but he has the temerity to label as "ridiculous" those who would dare to suggest that the connect-to-network interval should be lowered to a more reasonable value. Unfortunately you, along with Paul Buck, myself and many others, are all subject to this "treatment". Here are a couple of particularly insulting quotes:

> ...many supposedly informed people suggest setting this to 0.1 days for Einstein, which is ridiculous...

> ...The projects should remember that if it wasn't for we volunteers, that they would be waiting for hundreds of years for their results....

> ...Paul, you (and others it seems) forget that this is a parallel process....

That last one takes the cake as far as I'm concerned. Paul Buck would be one of the most sincere, honest, hard-working, courageous, thoughtful and cooperative contributors to the various BOINC-related message lists. The guy is an out-and-out legend and would have forgotten more useful BOINC-related knowledge and information than the rest of us have actually managed to acquire. Yet here is this guy, presuming to give him a lecture on the meaning of parallel processing as if he were a rank newbie. The arrogance of that attitude is quite unbelievable...

> > It too has WUs which will miss the deadline today.
>
> Have you tried reducing your "Connect to the internet" time in general
> preferences? On my machine I have tried everything from 0.6 days (a maximum
> of 3 "ready to run" WUs at any one time) to 4 days (way too many WUs). I'm
> now running it at 1 day.

On past performance you may as well talk to a brick wall. Your advice is well suited to solving this problem, but I'm afraid it's likely to be ignored and criticised. Various people in this thread are laying the entire blame for their own ineptitude on the 7-day deadline issue. They are forgetting that the developers have a perfect right to set the project conditions for the overall benefit of the science involved. They should simply make their suggestion about increasing the deadline without stooping to calling the developers a bunch of uncaring, unthinking idiots. When their suggestions are politely declined (with reasons given - do a search on past responses) they should move on and work out a strategy for solving their "problem".

Here is the best strategy, taken from your own comments. Essentially what you are saying is: "start small, and if you don't get enough, increase it gradually until you do. If you go too far and get too much, back it off until you find the sweet spot".

This is precisely why the default is 0.1 and why I and others who understand the nature of the beast heartily recommend starting at 0.1. You get work without getting too much excess. Once things settle, then start increasing gradually until you find the sweet spot, which will be a bit different for everybody, simply because of all the variables that change from case to case.
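
That strategy can almost be written down as a loop. Here is a runnable toy version in Python, where simulate() stands in for actually watching a host for a while at a given setting; all the numbers are assumed for illustration:

    # Toy version of the "start at 0.1 and creep up" strategy above.

    ACTUAL_HOURS_PER_WU = 8.0   # what a WU really takes on this toy host
    EST_HOURS_PER_WU = 6.5      # what the client *thinks* a WU takes
    DEADLINE_HOURS = 7 * 24

    def simulate(interval_days):
        """Return (ran_dry, missed_deadline) at a given connect interval."""
        queued = max(1, round(interval_days * 24 / EST_HOURS_PER_WU))
        ran_dry = queued < 2    # nothing left buffered between connects
        missed = queued * ACTUAL_HOURS_PER_WU > DEADLINE_HOURS
        return ran_dry, missed

    interval = 0.1              # start at the default
    while True:
        ran_dry, missed = simulate(interval)
        if missed:
            interval -= 0.1     # overshot: back off a step
            break
        if not ran_dry:
            break               # neither starved nor late: the sweet spot
        interval += 0.1         # cache ran dry: ask for a little more
    print(f"sweet spot for this host: about {interval:.1f} days")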

As a final comment, I would suggest that there is a level of intellectual dishonesty in the arguments of those who claim (and I'm paraphrasing here): "the seven-day deadline is totally unacceptable because it means I'm always going to lose credit for work on my slower boxes which can't be completed in time. Even my faster boxes have problems, so this proves that the deadline is way too short."

The dishonesty comes from the fact that they are rarely up front about what their real intentions are. In a lot of cases their main aim is to give Seti the lion's share of the resources and use the other projects as a simple backup for those (many) occasions when Seti seems to have problems. So they are not really willing to disclose how low a share they are allowing the backup projects. Also, because of Seti's past history, they feel they need an excessive cache so that the backup project doesn't get any increase in its share if Seti is down for a long time. Then they have the temerity to postulate what are effectively ridiculous "proofs" that the backup project is entirely at fault in causing them to lose credit.

It's great to support multiple projects and it's great to have a "favourite", but it's stupid to effectively exclude the backup from getting a meaningful share of the overall resources. If you support multiple projects you don't really need excessive caches; the multiple projects become the cache. If a project can't stay up long enough to keep supplying you with new work, shouldn't a project which can, be given the ball to run with?

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117503727006
RAC: 35423095

In case anyone is wondering, my previous message took quite a while to compose, and during that time (and unseen by me) 4 further messages were added between JoeB's and mine. (JoeB's was the only one I could see while I was composing.)

I stand by the comments I made, and I think Todd's additional comment, which I have only just now seen, supports some of the points I was making. There is really no excuse for simply trying to belittle someone who is making a counter-argument against your own point of view. Comments like...

> I have answered this. Please read my other posts in this thread (perhaps you
> should familiarise yourself with the issue by reading the whole thread).
>
> Briefly, it is ridiculous.....

are simply not acceptable.

@Todd,
Please win your argument on its merit and not by casting ridicule.

If your 4 projects have an equal share of resources then you can stop losing any work to deadlines simply by reducing your connect-to-network interval until you find the sweet spot. You really don't need an excessive cache and the headaches that it creates.

Cheers,
Gary.

gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

There are three distinct issues here: the underestimate of work, the choice of project, and the 'round numbers' syndrome.

The first is that E@H underestimates the time a WU will take.

This, I think, is a real bug. No project can accurately determine the length of a WU, but the errors should be in the other direction. For example, Predictor WUs claim to take 9 hours on my machine but actually take between 3 and 6. This is good: with that range of uncertainty it is better to overestimate, which sometimes leaves machines without work but never wastes work that has actually been done.

The quick fix I'd like the E@H project people to make is simply to double the current estimated number of ops in each WU, immediately. Surely that is just a matter of adding one to an index, or multiplying a parameter by 2, or some such?

The resulting empty caches on machines that do not connect frequently would be a lot better than the totally wasted work on those machines.

A slightly better fix would be to pick a better fudge factor: would a factor of x1.4 be sufficient? Is the given structure capable of resolution finer than a factor of 2? (I ask because on a given machine E@H WUs always estimate at exactly twice the duration of Predictor WUs, which seems an odd coincidence...)
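
For what it's worth, a fudge factor can be sanity-checked against the actual-versus-estimated figures Todd posted earlier in the thread. A sketch in Python (the method is hypothetical; E@H's estimates really come from the server's operation count and the host's benchmarks):

    # Sketch: sizing a correction ("fudge") factor from observed results.
    # Ratios are actual / estimated hours, from Todd's figures above.

    ratios = [
        21.0 / 15.85,   # PIII 800 (estimate 15 h 51 m)
        8.0 / 6.5,      # AMD 2500+
        15.0 / 10.77,   # dual AMD 1800+ (estimate 10 h 46 m)
    ]

    fudge = max(ratios)  # cover the worst case so errors fall on the long side
    print([round(r, 2) for r in ratios])        # -> [1.32, 1.23, 1.39]
    print(f"scale estimates by about {fudge:.2f}")

On those three hosts the ratios cluster between about 1.2 and 1.4, so x1.4 does look like the right ballpark, and x2 would be comfortably conservative.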

The second issue is the "you must extend the deadline because my machines sometimes miss it" argument. Sorry, I totally disagree. The deadlines are set for good reasons. All distributed computing projects set deadlines. The slack that is built in is only partly for the use of the client; it is also there to allow for erroneous estimates, network downtime, server failure, etc.

In my view, if an average WU takes more than one third of the deadline to crunch (and I mean total elapsed time under your usual pattern of working, as opposed to time queuing in your cache), then you should not use that machine for that project. That's a rough rule of thumb: equal shares for queuing, crunching, and contingency. That is no disrespect to your machine: it is a matter of matching your machine to what it can *comfortably* do.
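
That rule of thumb is easy to check for any host. A trivial sketch in Python, with elapsed times based on figures from earlier in the thread:

    # The one-third rule of thumb above, as a quick suitability check.

    DEADLINE_DAYS = 7.0

    def comfortable(elapsed_days_per_wu):
        """True if one WU takes at most a third of the deadline: equal
        shares for queuing, crunching, and contingency."""
        return elapsed_days_per_wu <= DEADLINE_DAYS / 3

    # A 21 h WU on a box crunching 24/7 passes; the same WU on a box that
    # only crunches 8 h a day does not, since elapsed time is what counts:
    print(comfortable(21 / 24))  # -> True  (0.875 days elapsed)
    print(comfortable(21 / 8))   # -> False (about 2.6 days elapsed)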

There might be something the project could do: my preferred solution would be for E@H to issue shorter and longer WUs, so you could pick a WU of the ideal duration. I do not know if that is scientifically feasible. It may not be; and even if it is, the extra effort may not be worthwhile in terms of the additional volunteers attracted by the smaller WUs.

The third issue is different again.

Even if 7 days is roughly the right deadline for a WU, I'd suggest it is a mistake for the project (and for others) to set a lifetime that exactly fits a normal work pattern. Because there will be 7-day cycles in the availability of resources, it would be better to make the deadline *slightly* longer, like 7.5 or 8.5 days. That would stop the effect where a computer that connects twice a week misses a deadline because of slight flutter in the exact time of the connection.

That is why, for example, in the days when clocks needed regular winding they'd be designed with a 26-hour, 8-day or 15-day capacity, never 24-hour, 7-day or 14-day. Go for a 'natural' interval and add a little.
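
The effect is easy to demonstrate with a toy simulation in Python: a host connects twice a week with a few hours of flutter in the exact time, and a WU fetched at one connect is uploaded two connects (about 7 days) later. All numbers are assumed for illustration:

    import random

    # Toy demonstration of the 'round numbers' effect described above.

    random.seed(1)

    def late_results(deadline_days, weeks=500):
        # Connect times: every ~3.5 days, each off by up to ~5 hours.
        connects = [3.5 * (i + 1) + random.uniform(-0.2, 0.2)
                    for i in range(weeks * 2)]
        late = 0
        for i in range(len(connects) - 2):
            turnaround = connects[i + 2] - connects[i]  # fetch -> upload
            if turnaround > deadline_days:
                late += 1
        return late

    print(late_results(7.0))  # exact week: many results land just past it
    print(late_results(7.5))  # half a day of slack: the flutter misses vanish

With the deadline at exactly one week, roughly half the results arrive just after it; with the extra half day, none do.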

~~gravywavy
