Short deadlines

dd-b

Joined: 17 Aug 05

Posts: 3

Credit: 165802

RAC: 0

20 Aug 2005 2:18:02 UTC

Topic 189724

My current Einstein work unit, which I got this evening when I finally got boinc working on the second server, shows 108:59:35 minutes to complete, and a deadline of 8/26 8:54pm. Einstein is one of two secondary projects, configured to get 25% of the resources. Just arithmetically, this does *NOT* look good; in fact I'm surprised the scheduler allowed it.

It's possible this is a case of the early part of the work unit moving very slowly, causing a gross overestimate of the time to completion, of course. Otherwise, it seems to me that the deadlines are unreasonably tight.

This is certainly a slow machine -- dual processor 200MHz Pentium Pro. An old server, still being used as a server for my home web and misc. hosting.

Regardless of deadline issues, it seems like workunits over 100 hours are just too damned big. Sometimes the problem doesn't chunk smaller well, I know, but it's still a problem. The global climate prediction people are *much* worse; I'm not even *trying* to run them on anything except the new, fast, desktop (which is an illustration of one of the problems; fewer people will run big workunits at all, let alone to completion).

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

Short deadlines

20 Aug 2005 3:42:54 UTC

Message 15687

Quote:

My current Einstein work unit, which I got this evening when I finally got boinc working on the second server, shows 108:59:35 minutes to complete, and a deadline of 8/26 8:54pm. Einstein is one of two secondary projects, configured to get 25% of the resources. Just arithmetically, this does *NOT* look good; in fact I'm surprised the scheduler allowed it.

I assume you mean 109 HOURS?

Quote:

It's possible this is a case of the early part of the work unit moving very slowly, causing a gross overestimate of the time to completion, of course. Otherwise, it seems to me that the deadlines are unreasonably tight. This is certainly a slow machine -- dual processor 200MHz Pentium Pro. An old server, still being used as a server for my home web and misc. hosting.

Then the estimate sounds about right...

Quote:

Regardless of deadline issues, it seems like workunits over 100 hours are just too damned big. Sometimes the problem doesn't chunk smaller well, I know, but it's still a problem. The global climate prediction people are *much* worse; I'm not even *trying* to run them on anything except the new, fast, desktop (which is an illustration of one of the problems; fewer people will run big workunits at all, let alone to completion).

We're starting to issue new workunits (your first server got some of them, in fact) which have a 2-week deadline. This should help.

Director, Einstein@Home

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118515191264

RAC: 26242421

RE: My current Einstein

20 Aug 2005 7:16:29 UTC

Message 15688

Quote:

My current Einstein work unit, which I got this evening when I finally got boinc working on the second server, shows 108:59:35 minutes to complete, and a deadline of 8/26 8:54pm. Einstein is one of two secondary projects, configured to get 25% of the resources. Just arithmetically, this does *NOT* look good; in fact I'm surprised the scheduler allowed it.

You were just unlucky in that whilst your older cpuID got a set of the new 2 week work, your newer cpuID got a set of the older, one week work.

Bruce has announced that the deadline will be 14 days for new work but there is still obviously some older work to finish off and you were just unlucky enough to be hit with it. Depending on what version of Boinc you are running, there still may be no real problem as the scheduler should go into EDF (earliest deadline first) mode and process your EAH work first and then refuse to get more EAH work until the other project has its fair share again. As long as you allow Boinc to do it, it should take care of this for you. I notice you have three work units. Two of those should be OK but the third will probably be too stale before it gets a chance to run. You may have to abort it. Make sure your connect to network interval is low (e.g. the default of 0.1 days) to prevent new work being downloaded too early.

Quote:

It's possible this is a case of the early part of the work unit moving very slowly, causing a gross overestimate of the time to completion, of course. Otherwise, it seems to me that the deadlines are unreasonably tight.

The estimate before crunching starts is usually much too short. Once crunching starts and things settle down, a fairly accurate number is usually reported. There are 168 hours in a week so work taking 110 hours is possible by Boinc going into EDF mode and temporarily ignoring your resource share. The other projects will get their share later on and it will even out in the long run. For machines this slow, maybe you should consider just one backup project instead of two. That would make things easier for Boinc.
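The arithmetic here can be sketched in a few lines of Python. This is only an illustration, assuming the machine runs 24/7 and that a project simply gets its resource share of the wall-clock time on one CPU. The helper names `hours_available` and `meets_deadline` are made up for this example and are not part of Boinc, whose real scheduler is more involved.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours, the figure quoted above


def hours_available(deadline_hours: float, share: float) -> float:
    """CPU hours a project can expect before the deadline.

    Assumes (hypothetically) that the project receives exactly its
    resource share of the wall-clock time on a single CPU.
    """
    return deadline_hours * share


def meets_deadline(work_hours: float, deadline_hours: float, share: float) -> bool:
    """True if the work fits into the hours available before the deadline."""
    return work_hours <= hours_available(deadline_hours, share)


work = 110.0  # estimated hours for the work unit

# Normal scheduling at a 25% resource share: 42 hours available.
print(meets_deadline(work, HOURS_PER_WEEK, 0.25))

# EDF mode, resource share temporarily ignored: all 168 hours available.
print(meets_deadline(work, HOURS_PER_WEEK, 1.0))
```

At a 25% share only 42 of the 168 hours go to the project, well short of 110. With the whole CPU under EDF there is room to spare.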

Quote:

Regardless of deadline issues, it seems like workunits over 100 hours are just too damned big. Sometimes the problem doesn't chunk smaller well, I know, but it's still a problem. The global climate prediction people are *much* worse; I'm not even *trying* to run them on anything except the new, fast, desktop (which is an illustration of one of the problems; fewer people will run big workunits at all, let alone to completion).

The projects have a scientific problem they want our help with. They are constrained by the science requirements. Do you think that if the calcs could easily be broken down into much smaller bites then they wouldn't do so? The projects are going to do what is most efficient for the science. We just have to be a bit sensible in choosing an appropriate project according to the capabilities of our cpu. I think you would find that P-200s are probably rather thin on the ground as serious workhorses these days. The projects will cater for the type of cpu that most people are likely to have and that would be a little faster than a P-200. But hey, I'm not criticising. My slowest successfully contributing box is a P-100 :). It takes 4 days to do a Seti WU but it works!! I wouldn't dream of trying to run EAH on it.

What three projects are you trying to support and which one is your "main" one?

Cheers,
Gary.

Darrell

Joined: 3 Apr 05

Posts: 12

Credit: 519360047

RAC: 751940

I received two E@H WUs today,

29 Aug 2005 13:49:34 UTC

Message 15689

I received two E@H WUs today, and after consuming 12 and 8 hours of CPU, the estimate to complete is still a total of 340 hours. Since the deadline for each is 9/11 (about 300 hours from now), I don't think the deadlines are realistic for the size of the WUs being sent out. My reconnect time is only 0.2, trying not to receive too many WUs and miss the deadlines.

My system is a HT 3.0GHz P4 w/1GB RAM which runs 24/7 (except when the power goes out), so it is not an old, slow processor.

Ananas

Joined: 22 Jan 05

Posts: 272

Credit: 2500681

RAC: 0

The original idea of those

29 Aug 2005 14:59:57 UTC

Message 15690

The original idea of those public DC projects was not to make people buy new machines and have them run 24/7, the original idea was to make people use the *spare* CPU time while they have their boxes running anyway.

Maybe the DC freaks spoiled this idea quite a bit *blush* so now the project people expect all helpers to use their computers for DC 24/7 and, if there are some CPU cycles left, might allow different programs.

ABT Chuck P

Joined: 9 Feb 05

Posts: 20

Credit: 363204

RAC: 0

RE: I received two E@H WUs

30 Aug 2005 0:06:25 UTC

Message 15691 in response to message 15689

Quote:

I received two E@H WUs today, and after consuming 12 and 8 hours of CPU, the estimate to complete is still a total of 340 hours. Since the deadline for each is 9/11 (about 300 hours from now), I don't think the deadlines are realistic for the size of the WUs being sent out. My reconnect time is only 0.2, trying not to receive too many WUs and miss the deadlines.

My system is a HT 3.0GHz P4 w/1GB RAM which runs 24/7 (except when the power goes out), so it is not an old, slow processor.

===================
Check task manager and see if something else is using cpu power.

Darrell

Joined: 3 Apr 05

Posts: 12

Credit: 519360047

RAC: 751940

RE: RE: I received two

30 Aug 2005 2:45:35 UTC

Message 15692 in response to message 15691

Quote:

Quote:
I received two E@H WUs today, and after consuming 12 and 8 hours of CPU, the estimate to complete is still a total of 340 hours. Since the deadline for each is 9/11 (about 300 hours from now), I don't think the deadlines are realistic for the size of the WUs being sent out. My reconnect time is only 0.2, trying not to receive too many WUs and miss the deadlines.

My system is a HT 3.0GHz P4 w/1GB RAM which runs 24/7 (except when the power goes out), so it is not an old, slow processor.

===================
Check task manager and see if something else is using cpu power.

Yes, I have other tasks running - that is why I have a computer. The idea of BOINC and projects is to use the otherwise UNUSED cycles, and the 12+8 hours are the UNUSED cycles after my normal work. Luckily, the estimates to complete (ETC) are now down to 9+27=36 more hours. I guess the ETC is just wildly incorrect for the first 10-15 CPU hours of crunch time.
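To illustrate how an estimate can swing that much, here is a minimal sketch of a plain linear extrapolation from the progress so far. This is an assumption for illustration only: nothing in this thread says how BOINC actually computes the ETC, and the function name and the numbers below are hypothetical.

```python
def remaining_hours(cpu_hours_used: float, fraction_done: float) -> float:
    """Linear extrapolation: assume the rest runs at the average pace so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    projected_total = cpu_hours_used / fraction_done
    return projected_total - cpu_hours_used


# Hypothetical early reading: 12 CPU hours used, only 1/16 reported done.
print(remaining_hours(12.0, 0.0625))  # 180.0

# Hypothetical later reading: 20 CPU hours used, half reported done.
print(remaining_hours(20.0, 0.5))  # 20.0
```

With this kind of estimate, a small reported fraction early on inflates the projection, and it falls quickly once the reported progress catches up.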

Joined: 15 Feb 05

Posts: 7

Credit: 25199705

RAC: 0

actually the time boinc

30 Aug 2005 3:12:41 UTC

Message 15693

actually the time boinc counts is only the time the process actually gets CPU cycles..
besides that your 3ghz CPU should finish 2 wu in about 12h when hyperthreading. maybe your CPU is getting too hot and therefore it's being throttled? afaik p4s get auto-thermal throttled at 80°C core temp.
can't think of anything else atm..

ABT Chuck P

Joined: 9 Feb 05

Posts: 20

Credit: 363204

RAC: 0

RE: RE: RE: Check task

30 Aug 2005 19:42:40 UTC

Message 15694 in response to message 15692

Quote:

Quote:
Quote:
Check task manager and see if something else is using cpu power.

Yes, I have other tasks running - that is why I have a computer. The idea of BOINC and projects is to use the otherwise UNUSED cycles, and the 12+8 hours are the UNUSED cycles after my normal work. Luckily, the estimates to complete (ETC) are now down to 9+27=36 more hours. I guess the ETC is just wildly incorrect for the first 10-15 CPU hours of crunch time.

==========
Sorry for not being clearer. I was attempting to see if some other process was grabbing more than normal.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118515191264

RAC: 26242421

RE: Yes, I have other

6 Sep 2005 5:18:18 UTC

Message 15695 in response to message 15692

Quote:

Yes, I have other tasks running - that is why I have a computer. The idea of BOINC and projects is to use the otherwise UNUSED cycles, and the 12+8 hours are the UNUSED cycles after my normal work. Luckily, the estimates to complete (ETC) are now down to 9+27=36 more hours. I guess the ETC is just wildly incorrect for the first 10-15 CPU hours of crunch time.

The ETC is not usually wrong by very much so I think there is something unusual about how your machine is running. I have a P4 2.6G HT machine that does 2 EAH work units every 13.5 hours approximately. Before crunching starts, the estimate is 10 hours but this quickly extends to 13.5 hours and then stays there once crunching starts. In your case, the poor ETC performance is probably associated with the "No heartbeat" errors listed below.

Here is the stderr output from one of my results:

Quote:

Fstats.Ha: bytecount 1875484 checksum 89912677
Fstats.Hb: bytecount 1492423 checksum 71543681

Here is part of the stderr output from one of your results (ResultID=8035023):

Quote:

Resuming computation at 482/19528/19528
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 510/21688/21868
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 524/21868/22048
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 548/22138/22228
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 38/2340/2340
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 62/3600/3690
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 87/4320/4320
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 116/5130/5130
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 75/4320/4320
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 100/5130/5130
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 124/5130/5130
No heartbeat from core client for 31.000000 sec - exiting
....
Resuming computation at 224/7379/7559
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 249/8819/8819
No heartbeat from core client for 31.000000 sec - exiting
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 74/4230/4230
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 103/5130/5130
Resuming computation at 770/32936/33116
Resuming computation at 1219/58224/58224
Resuming computation at 1881/105831/107181
Resuming computation at 2995/159555/160005
Resuming computation at 3627/177194/177194
Resuming computation at 5141/242621/242621
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 7518/370869/370869
No heartbeat from core client for 31.000000 sec - exiting
.....
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 13817/404363/404363
No heartbeat from core client for 31.000000 sec - exiting
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 13930/409673/409853
Fstats.Ha: bytecount 1542973 checksum 73854417
Fstats.Hb: bytecount 947203 checksum 45337763

The "No heartbeat" error message is documented in the Wiki without any positive conclusion being reached as to what exactly causes one of the components to crash or how serious it really is.

The other thing of note is how often the computation keeps getting resumed. Have you enabled the preference for only doing work when the computer is idle? I've never used it on my boxes so I can only imagine that it would create a string of these messages if it were enabled.

If so, why don't you try letting BOINC and EAH run all the time as they really won't interfere with your normal computing activities. BOINC/EAH is very quick in getting out of the way when you have other serious tasks to perform. You won't really be able to detect that it is running unless you happen to catch it when it is performing one of the auto benchmarks or downloading a new data file. Both of these are quite infrequent occurrences.
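To put a number on how often the computation restarts, the stderr text can be tallied with a short script. This is just a sketch: the sample below is the first few lines of the excerpt quoted above, and `tally` is a made-up helper name, not a BOINC tool.

```python
from collections import Counter

# First lines of the stderr excerpt quoted above (ResultID=8035023).
STDERR_EXCERPT = """\
Resuming computation at 482/19528/19528
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 510/21688/21868
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 524/21868/22048
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 548/22138/22228
No heartbeat from core client for 31.000000 sec - exiting
Resuming computation at 38/2340/2340
"""


def tally(stderr_text: str) -> Counter:
    """Count resume and no-heartbeat messages in a stderr dump."""
    counts = Counter()
    for line in stderr_text.splitlines():
        if line.startswith("Resuming computation at"):
            counts["resumed"] += 1
        elif line.startswith("No heartbeat from core client"):
            counts["no_heartbeat"] += 1
    return counts


counts = tally(STDERR_EXCERPT)
print(counts["resumed"], counts["no_heartbeat"])
```

Running the same tally over the full stderr of a result gives a quick way to compare a machine that restarts constantly with one that runs straight through.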

Cheers,
Gary.
