Short deadlines

dd-b
Joined: 17 Aug 05
Posts: 3
Credit: 165802
RAC: 0
Topic 189724

My current Einstein work unit, which I got this evening when I finally got boinc working on the second server, shows 108:59:35 minutes to complete, and a deadline of 8/26 8:54pm. Einstein is one of two secondary projects, configured to get 25% of the resources. Just arithmetically, this does *NOT* look good; in fact I'm surprised the scheduler allowed it.

It's possible this is a case of the early part of the work unit moving very slowly, causing a gross overestimate of time to completion, of course. Otherwise, it seems to me that the deadlines are unreasonably tight.

This is certainly a slow machine -- dual processor 200MHz Pentium Pro. An old server, still being used as a server for my home web and misc. hosting.

Regardless of deadline issues, it seems like workunits over 100 hours are just too damned big. Sometimes the problem doesn't chunk smaller well, I know, but it's still a problem. The global climate prediction people are *much* worse; I'm not even *trying* to run them on anything except the new, fast, desktop (which is an illustration of one of the problems; fewer people will run big workunits at all, let alone to completion).

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

Short deadlines

Quote:
My current Einstein work unit, which I got this evening when I finally got boinc working on the second server, shows 108:59:35 minutes to complete, and a deadline of 8/26 8:54pm. Einstein is one of two secondary projects, configured to get 25% of the resources. Just arithmetically, this does *NOT* look good; in fact I'm surprised the scheduler allowed it.

I assume you mean 109 HOURS?

Quote:
It's possible this is a case of the early part of the work unit moving very slowly, causing a gross overestimate of time to completion, of course. Otherwise, it seems to me that the deadlines are unreasonably tight. This is certainly a slow machine -- dual processor 200MHz Pentium Pro. An old server, still being used as a server for my home web and misc. hosting.

Then the estimate sounds about right...

Quote:
Regardless of deadline issues, it seems like workunits over 100 hours are just too damned big. Sometimes the problem doesn't chunk smaller well, I know, but it's still a problem. The global climate prediction people are *much* worse; I'm not even *trying* to run them on anything except the new, fast, desktop (which is an illustration of one of the problems; fewer people will run big workunits at all, let alone to completion).

We're starting to issue new workunits (your first server got some of them, in fact) which have a 2-week deadline. This should help.

Director, Einstein@Home

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118515191264
RAC: 26242421

RE: My current Einstein

Quote:
My current Einstein work unit, which I got this evening when I finally got boinc working on the second server, shows 108:59:35 minutes to complete, and a deadline of 8/26 8:54pm. Einstein is one of two secondary projects, configured to get 25% of the resources. Just arithmetically, this does *NOT* look good; in fact I'm surprised the scheduler allowed it.

You were just unlucky in that whilst your older cpuID got a set of the new 2 week work, your newer cpuID got a set of the older, one week work.

Bruce has announced that the deadline will be 14 days for new work, but there is obviously still some older work to finish off and you were just unlucky enough to be hit with it. Depending on what version of Boinc you are running, there still may be no real problem, as the scheduler should go into EDF (earliest deadline first) mode and process your EAH work first and then refuse to get more EAH work until the other project has its fair share again. As long as you allow Boinc to do it, it should take care of this for you. I notice you have three work units. Two of those should be OK but the third will probably be too stale before it gets a chance to run. You may have to abort it. Make sure your connect to network interval is low (e.g. the default of 0.1 days) to prevent new work from being downloaded too early.

Quote:
It's possible this is a case of the early part of the work unit moving very slowly, causing a gross overestimate of time to completion, of course. Otherwise, it seems to me that the deadlines are unreasonably tight.

The estimate before crunching starts is usually much too short. Once crunching starts and things settle down, a fairly accurate number is usually reported. There are 168 hours in a week so work taking 110 hours is possible by Boinc going into EDF mode and temporarily ignoring your resource share. The other projects will get their share later on and it will even out in the long run. For machines this slow, maybe you should consider just one backup project instead of two. That would make things easier for Boinc.
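To make that arithmetic concrete, here is a rough sketch in Python. It is only an illustration under simplifying assumptions: the machine runs around the clock, the work unit is modelled as getting a fixed fraction of a single CPU, and the function names are made up for this example rather than taken from Boinc.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours, the one-week deadline window


def wall_hours_needed(cpu_hours: float, share: float) -> float:
    """Wall-clock hours to finish a task that gets `share` of one CPU.

    `share` is a hypothetical fraction in (0, 1]; 1.0 models the task
    having the CPU to itself, as it roughly would in EDF mode.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return cpu_hours / share


def meets_deadline(cpu_hours: float, share: float, deadline_hours: float) -> bool:
    """True if the task can finish inside the deadline window."""
    return wall_hours_needed(cpu_hours, share) <= deadline_hours


# A 110 CPU-hour work unit at a 25% resource share, then with the full CPU.
print(wall_hours_needed(110, 0.25), meets_deadline(110, 0.25, HOURS_PER_WEEK))
print(wall_hours_needed(110, 1.0), meets_deadline(110, 1.0, HOURS_PER_WEEK))
```

At a strict 25% share the work unit would need 440 wall-clock hours and miss a 168-hour deadline; with the whole CPU it needs 110 hours and fits. That gap is why temporarily ignoring the resource share matters here.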

Quote:
Regardless of deadline issues, it seems like workunits over 100 hours are just too damned big. Sometimes the problem doesn't chunk smaller well, I know, but it's still a problem. The global climate prediction people are *much* worse; I'm not even *trying* to run them on anything except the new, fast, desktop (which is an illustration of one of the problems; fewer people will run big workunits at all, let alone to completion).

The projects have a scientific problem they want our help with. They are constrained by the science requirements. Do you think that if the calcs could easily be broken down into much smaller bites then they wouldn't do so? The projects are going to do what is most efficient for the science. We just have to be a bit sensible in choosing an appropriate project according to the capabilities of our cpu. I think you would find that P-200s are probably rather thin on the ground as serious workhorses these days. The projects will cater for the type of cpu that most people are likely to have and that would be a little faster than a P-200. But hey, I'm not criticising. My slowest successfully contributing box is a P-100 :). It takes 4 days to do a Seti WU but it works!! I wouldn't dream of trying to run EAH on it.

What three projects are you trying to support and which one is your "main" one?

Cheers,
Gary.

Darrell
Joined: 3 Apr 05
Posts: 12
Credit: 519360047
RAC: 751940

I received two E@H WUs today,

I received two E@H WUs today, and after consuming 12 and 8 hours of CPU, the estimate to complete is still a total of 340 hours. Since the deadline for each is 9/11 (about 300 hours from now), I don't think the deadlines are realistic for the size of the WUs being sent out. My reconnect time is only 0.2, trying not to receive too many WUs and miss the deadlines.

My system is a HT 3.0GHz P4 w/1GB RAM which runs 24/7 (except when the power goes out), so it is not an old, slow processor.

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2500681
RAC: 0

The original idea of those

The original idea of those public DC projects was not to make people buy new machines and have them run 24/7; the original idea was to make people use the *spare* CPU time while they have their boxes running anyway.

Maybe the DC freaks spoiled this idea quite a bit *blush*, so now the project people expect all helpers to use their computers for DC 24/7 and, if there are some CPU cycles left, might allow other programs.

ABT Chuck P
Joined: 9 Feb 05
Posts: 20
Credit: 363204
RAC: 0

RE: I received two E@H WUs

Message 15691 in response to message 15689

Quote:

I received two E@H WUs today, and after consuming 12 and 8 hours of CPU, the estimate to complete is still a total of 340 hours. Since the deadline for each is 9/11 (about 300 hours from now), I don't think the deadlines are realistic for the size of the WUs being sent out. My reconnect time is only 0.2, trying not to receive too many WUs and miss the deadlines.

My system is a HT 3.0GHz P4 w/1GB RAM which runs 24/7 (except when the power goes out), so it is not an old, slow processor.


===================
Check task manager and see if something else is using cpu power.


Darrell
Joined: 3 Apr 05
Posts: 12
Credit: 519360047
RAC: 751940

RE: RE: I received two

Message 15692 in response to message 15691

Quote:
Quote:

I received two E@H WUs today, and after consuming 12 and 8 hours of CPU, the estimate to complete is still a total of 340 hours. Since the deadline for each is 9/11 (about 300 hours from now), I don't think the deadlines are realistic for the size of the WUs being sent out. My reconnect time is only 0.2, trying not to receive too many WUs and miss the deadlines.

My system is a HT 3.0GHz P4 w/1GB RAM which runs 24/7 (except when the power goes out), so it is not an old, slow processor.


===================
Check task manager and see if something else is using cpu power.

Yes, I have other tasks running - that is why I have a computer. The idea of BOINC and projects is to use the otherwise UNUSED cycles, and the 12+8 hours are the UNUSED cycles after my normal work. Luckily, the estimates to complete (ETC) are now down to 9+27=36 more hours. I guess the ETC is just wildly incorrect for the first 10-15 CPU hours of crunch time.
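One way to see why an early estimate can swing so much is a naive linear extrapolation from the fraction of work done. The Python sketch below is only an illustration: it is not how BOINC actually computes its estimate, the function name is hypothetical, and the fraction-done values are made-up inputs rather than figures from these work units.

```python
def remaining_hours(elapsed_cpu_hours: float, fraction_done: float) -> float:
    """Naive estimate: assume the rest of the work runs at the average
    rate seen so far (an assumption, not BOINC's real formula)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_cpu_hours * (1 - fraction_done) / fraction_done


# Same 12 CPU hours consumed, with different (hypothetical) progress figures.
for fraction in (0.05, 0.10, 0.25):
    print(fraction, round(remaining_hours(12, fraction), 1))
```

While the reported fraction done is still small, a modest change in it moves the projected remaining time by hundreds of hours, so early figures deserve little trust.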

pe
Joined: 15 Feb 05
Posts: 7
Credit: 25199705
RAC: 0

actually the time boinc

Actually, the time BOINC counts is only the time the process actually gets CPU cycles.
Besides that, your 3GHz CPU should finish 2 WUs in about 12h with hyperthreading. Maybe your CPU is getting too hot and is therefore being throttled? AFAIK P4s are automatically thermal-throttled at 80°C core temp.
Can't think of anything else atm.

ABT Chuck P
Joined: 9 Feb 05
Posts: 20
Credit: 363204
RAC: 0

RE: RE: RE: Check task

Message 15694 in response to message 15692

Quote:
Quote:
Quote:
Check task manager and see if something else is using cpu power.

Yes, I have other tasks running - that is why I have a computer. The idea of BOINC and projects is to use the otherwise UNUSED cycles, and the 12+8 hours are the UNUSED cycles after my normal work. Luckily, the estimates to complete (ETC) are now down to 9+27=36 more hours. I guess the ETC is just wildly incorrect for the first 10-15 CPU hours of crunch time.


==========
Sorry for not being clearer. I was attempting to see if some other process was grabbing more than normal.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118515191264
RAC: 26242421

RE: Yes, I have other

Message 15695 in response to message 15692

Quote:

Yes, I have other tasks running - that is why I have a computer. The idea of BOINC and projects is to use the otherwise UNUSED cycles, and the 12+8 hours are the UNUSED cycles after my normal work. Luckily, the estimates to complete (ETC) are now down to 9+27=36 more hours. I guess the ETC is just wildly incorrect for the first 10-15 CPU hours of crunch time.

The ETC is not usually wrong by very much so I think there is something unusual about how your machine is running. I have a P4 2.6G HT machine that does 2 EAH work units every 13.5 hours approximately. Before crunching starts, the estimate is 10 hours but this quickly extends to 13.5 hours and then stays there once crunching starts. In your case, the poor ETC performance is probably associated with the "No heartbeat" errors listed below.

Here is the stderr output from one of my results:

Quote:

Fstats.Ha: bytecount 1875484 checksum 89912677
Fstats.Hb: bytecount 1492423 checksum 71543681

Here is part of the stderr output from one of your results (ResultID=8035023):

Quote:

Resuming computation at 482/19528/19528
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 510/21688/21868
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 524/21868/22048
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 548/22138/22228
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 38/2340/2340
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 62/3600/3690
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 87/4320/4320
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 116/5130/5130
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 75/4320/4320
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 100/5130/5130
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 124/5130/5130
No heartbeat from core client for 31.000000 sec - exiting
....
Resuming computation at 224/7379/7559
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 249/8819/8819
No heartbeat from core client for 31.000000 sec - exiting
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 74/4230/4230
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 103/5130/5130
Resuming computation at 770/32936/33116
Resuming computation at 1219/58224/58224
Resuming computation at 1881/105831/107181
Resuming computation at 2995/159555/160005
Resuming computation at 3627/177194/177194
Resuming computation at 5141/242621/242621
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 7518/370869/370869
No heartbeat from core client for 31.000000 sec - exiting
.....
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 13817/404363/404363
No heartbeat from core client for 31.000000 sec - exiting
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 13930/409673/409853
Fstats.Ha: bytecount 1542973 checksum 73854417
Fstats.Hb: bytecount 947203 checksum 45337763

The "No heartbeat" error message is documented in the Wiki without any positive conclusion being reached as to what exactly causes one of the components to crash or how serious it really is.
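For anyone wanting to tally these messages in their own results, here is a minimal Python sketch. It assumes only the two line formats visible in the output quoted above; the function name is hypothetical, and the meaning of the three slash-separated numbers is not interpreted.

```python
import re

# Line formats as they appear in the stderr output quoted above.
RESUME = re.compile(r"^Resuming computation at \d+/\d+/\d+$")
HEARTBEAT_PREFIX = "No heartbeat from core client"


def summarize(stderr_text: str) -> dict:
    """Count restarts and heartbeat exits in a stderr dump."""
    resumes = 0
    heartbeat_exits = 0
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if RESUME.match(line):
            resumes += 1
        elif line.startswith(HEARTBEAT_PREFIX):
            heartbeat_exits += 1
    return {"resumes": resumes, "heartbeat_exits": heartbeat_exits}


sample = """Resuming computation at 482/19528/19528
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 510/21688/21868
No heartbeat from core client for 31.000000 sec - exiting
"""
print(summarize(sample))  # {'resumes': 2, 'heartbeat_exits': 2}
```

A high count of either message relative to the CPU time used suggests the science application is being stopped and restarted far more often than normal.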

The other thing of note is how often the computation keeps getting resumed. Have you enabled the preference for only doing work when the computer is idle? I've never used it on my boxes, so I can only imagine that it would create a string of these messages if it were enabled.

If so, why don't you try letting BOINC and EAH run all the time, as they really won't interfere with your normal computing activities. BOINC/EAH is very quick in getting out of the way when you have other serious tasks to perform. You won't really be able to detect that it is running unless you happen to catch it when it is performing one of the auto benchmarks or downloading a new data file. Both of these are quite infrequent occurrences.

Cheers,
Gary.
