Work Buffer Not Filled

Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0
Topic 194393

Since a disk crash about 10 days ago, the work buffer on the repaired machine will only fill partially. The total estimated time for work in the buffer is well below (about 30% of) the time that should be available.

To repair from the crash I installed a new hard drive, reinstalled Windows, and reinstalled BOINC 6.4.7 from distribution media. I used the same HOST NAME, and BOINC/E@H recognized the machine as the same one from before the crash. (That seemed strange to me ... usually I have to MERGE machines to incorporate the history from prior to a crash. I don't think I initiated a MERGE, but since I did the rebuild in the middle of the night, I might not remember correctly.)

The machine (hostid=1945350) task list shows "host detached" for the tasks that were in the buffer at the time of the crash. It is as though the "host detached" tasks were being counted as using up part of the buffer capacity.

Those "host detached" tasks do not appear in my buffer now (unless BOINC has recovered the info and hidden them somewhere ... they do not show in the BOINC Manager task list.)

Is this "normal?" Has BOINC or E@H put the machine "on probation" until it shows that it can stay up for more than a couple weeks? Will the situation "self-correct" or should I be doing something such as "Reset Project"?

Stan

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 799988979
RAC: 1212589

Work Buffer Not Filled

I guess BOINC also needs some time to recalibrate its estimate of your PC's throughput in terms of results returned per day. I'd wait a bit and see how this evolves.

CU
Bikeman

Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0

RE: I guess BOINC also

Message 93377 in response to message 93376

Quote:

I guess BOINC also needs some time to recalibrate its estimate of your PC's throughput in terms of results returned per day. I'd wait a bit and see how this evolves.

CU
Bikeman


Thank you. Absent strong contraindications, that will be my path.

The aspect that puzzles me is that the time estimates shown in BOINC Manager for each WU are comparable to those on an almost identical Q9550 that is running with a full buffer, albeit an earlier release of BOINC Manager (5.10.13). Also, I compared the various "factors" stored in the machines' profiles and they seem comparable. So, BOINC must have a variable hidden away somewhere that tells it to drag its feet refilling this machine's buffer for a while ... cuz it might still be an unreliable machine. :)

Stan

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: ...So, BOINC must have

Message 93378 in response to message 93377

Quote:
...So, BOINC must have a variable hidden away somewhere that tells it to drag its feet to refill this machine's buffer for a while ... cuz it might still be an unreliable machine. :)


What about the "% of time" values (near the bottom of Computer summary)? Are they near 1 (resp. 100%)?

Regards,
Gundolf

Computers aren't everything in life. (Just a little joke.)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 0

RE: cuz it might still be

Message 93379 in response to message 93377

Quote:
cuz it might still be an unreliable machine. :)


No, your machine is designated as reliable. You're just getting those Arecibo tasks which take quite a bit longer than the Hierarchical ones.

As per the last scheduler log:

Quote:
2009-06-13 20:50:35.3908 [PID=17566] Request: [HOST#1945350] client 6.4.7
2009-06-13 20:50:35.5585 [PID=17566] [send] CPU: req 0.00 sec, 0.00 instances; est delay 125481.79
2009-06-13 20:50:35.5585 [PID=17566] [send] work_req_seconds: 145.27 secs
2009-06-13 20:50:35.5585 [PID=17566] [send] Not using matchmaker scheduling; Not using EDF sim
2009-06-13 20:50:35.5585 [PID=17566] [send] available disk 2.30 GB, work_buf_min 0
2009-06-13 20:50:35.5585 [PID=17566] [send] active_frac 0.999921 on_frac 0.998990 DCF 1.135882
2009-06-13 20:50:35.5623 [PID=17566] [send] [HOST#1945350] is reliable; OS: Microsoft Windows Vista, error_rate: 0.000010, avg_turn_hrs: 41.061 max res/day 16
2009-06-13 20:50:35.5624 [PID=17566] [send] set_trust: random choice for error rate 0.000010: yes
2009-06-13 20:50:35.5624 [PID=17566] [mixed] sending non-locality work first
2009-06-13 20:50:35.5626 [PID=17566] [send] est. duration for WU 54220777: unscaled 18362.44 scaled 20880.30
2009-06-13 20:50:35.5626 [PID=17566] [send] [WU#54220777] meets deadline: 125481.79 + 20880.30 < 1209600
2009-06-13 20:50:35.5640 [PID=17566] [debug] Sorted list of URLs follows [host timezone: UTC-18000]
2009-06-13 20:50:35.5640 [PID=17566] [debug] zone=-28800 url=http://einstein.ligo.caltech.edu
2009-06-13 20:50:35.5640 [PID=17566] [debug] zone=+00000 url=http://einstein.astro.gla.ac.uk
2009-06-13 20:50:35.5640 [PID=17566] [debug] zone=+03600 url=http://einstein.aei.mpg.de
2009-06-13 20:50:35.5642 [PID=17566] [send] [HOST#1945350] Sending app_version einsteinbinary_ABP1 2 305 ; 3.00 GFLOPS
2009-06-13 20:50:35.5813 [PID=17566] [send] est. duration for WU 54220777: unscaled 18362.44 scaled 20880.30
2009-06-13 20:50:35.5813 [PID=17566] [HOST#1945350] Sending [RESULT#130234453 p2030_53839_35673_0025_G41.59-00.35.C_4.dm_241_1] (est. dur. 20880.30 seconds)
2009-06-13 20:50:35.5982 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.5982 [PID=17566] [mixed] sending locality work second
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file skygrid_0840Hz_S5R5.dat
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.40_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.40_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.45_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.45_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.50_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.50_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.55_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.55_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.60_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.60_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.65_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.65_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.70_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [locality] [HOST#1945350] has file l1_0833.70_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [locality] [HOST#1945350] has file h1_0833.75_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [locality] [HOST#1945350] has file l1_0833.75_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.5983 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.5983 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.6147 [PID=17566] Sending reply to [HOST#1945350]: 1 results, delay req 60.00
2009-06-13 20:50:35.6149 [PID=17566] Scheduler ran 0.500 seconds
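
As an aside, the scheduler's scaled estimate in the log above can be reproduced from the logged factors. This is a sketch assuming the usual BOINC scaling, scaled = unscaled × DCF / (on_frac × active_frac); the variable names are mine, and the numbers are taken straight from the log lines:

```python
# Reproduce the scheduler's scaled duration estimate from the logged factors.
# Assumed formula: scaled = unscaled * DCF / (on_frac * active_frac)

unscaled = 18362.44      # raw estimate for WU 54220777 (seconds)
dcf = 1.135882           # Duration Correction Factor from the log
on_frac = 0.998990       # fraction of time the BOINC client is running
active_frac = 0.999921   # fraction of that time work is allowed

scaled = unscaled * dcf / (on_frac * active_frac)
print(round(scaled, 2))  # close to the logged 20880.30

# The deadline check from the log: the current queue delay plus the new
# task's scaled duration must fit inside the two-week delay bound.
est_delay = 125481.79
deadline_bound = 14 * 24 * 3600  # 1209600 seconds
print(est_delay + scaled < deadline_bound)
```

That the numbers line up suggests nothing is wrong with the per-task estimates themselves.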
Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0

RE: RE: cuz it might

Message 93380 in response to message 93379

Quote:
Quote:
cuz it might still be an unreliable machine. :)

No, your machine is designated as reliable. You're just getting those Arecibo tasks which take quite a bit longer than the Hierarchical ones.
...


I'm not following the "Arecibo task" logic ...

Here are buffers from two nearly identical machines:

Problem machine ID 1945350: time estimates for the 11 Arecibo WU's in its buffer range from 5:45 to 6:06, and for the 18 usual jobs from 3:44 to 4:54. (29 total WU's, approx. 37 wall-clock hours.)

On an almost identical Q9550, ID=1719080, time estimates for its buffer of 29 Arecibo WU's range from 5:55 to 6:10, and for 79 usual WU's from 4:10 to 5:38. (108 total WU's, approx. 145 wall-clock hours.)

37 hours is approx. 25% of the other machine's buffer.

Stan

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: I'm not following the

Message 93381 in response to message 93380

Quote:
I'm not following the "Arecibo task" logic ...


And those two machines are in the same venue and have comparable Duration Correction Factors?

Computers aren't everything in life. (Just a little joke.)

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 3002365178
RAC: 708261

RE: RE: I'm not following

Message 93382 in response to message 93381

Quote:
Quote:
I'm not following the "Arecibo task" logic ...

And those two machines are in the same venue and have comparable Duration Correction Factors?


The DCFs are good enough, judging by the time estimates quoted.

The question is whether your values are as bad as this machine's (someone switched it off last week...):

    0.443432
    0.999902

Stan, you (and only you) can see those figures as

Quote:
% of time BOINC client is running 44.3432 %
While BOINC running, % of time work is allowed 99.9902 %


on the 'Computer summary' pages on this website. If one of the figures is low on one machine but not the other, it would have the effect you're describing.

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: The DCFs are good

Message 93383 in response to message 93382

Quote:
The DCFs are good enough, judging by the time estimates quoted...


From Jord's post, we know the values of host 1945350:

Quote:
2009-06-13 20:50:35.5585 [PID=17566] [send] active_frac 0.999921 on_frac 0.998990 DCF 1.135882
2009-06-13 20:50:35.5623 [PID=17566] [send] [HOST#1945350] is reliable; OS: Microsoft Windows Vista, error_rate: 0.000010, avg_turn_hrs: 41.061 max res/day 16


So, I wanted to know if the other host has comparable values and if both are in the same venue.

Regards,
Gundolf

Computers aren't everything in life. (Just a little joke.)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 0

RE: So, I wanted to know if

Message 93384 in response to message 93383

Quote:
So, I wanted to know if the other host has comparable values and if both are in the same venue.


You can look those values up in the scheduler log.
Searching for 1719080 in 2009-06-14_13:14.txt: active_frac 0.999915 on_frac 0.999372 DCF 1.131630

You just can't see the venue. Checking the new values for 1945350: active_frac 0.999926 on_frac 0.999056 DCF 1.073404 (and it made contact 3 minutes earlier).

Perhaps the client versions also matter: 1945350 is using 6.4.7, 1719080 is using 5.10.13.

But still, running both the new and longer Arecibo (Ar) search and the old and shorter Hierarchical (Hi) search against the same DCF will make that DCF bounce up and down. The estimated time to completion numbers aren't reliable, as finishing an old Hi task will change the estimated time to completion of the longer Ar tasks.
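
The bouncing can be sketched with a toy model. The classic 6.x client raises the DCF immediately when a task overruns its estimate, and lowers it only gradually when a task finishes early; the exact update rule below (instant jump up, 10% decay down, with illustrative runtime ratios of 1.3 for Ar and 0.9 for Hi) is an assumption chosen just to show the oscillation:

```python
# Toy model of DCF oscillation when long Arecibo (Ar) and short
# Hierarchical (Hi) tasks share one Duration Correction Factor.
# Assumed update rule: jump straight up on overrun, decay 10% of the
# way toward the observed ratio on underrun.

def update_dcf(dcf, ratio):
    """ratio = actual runtime / raw (unscaled) estimate."""
    if ratio > dcf:
        return ratio                   # instant increase
    return dcf + 0.1 * (ratio - dcf)   # slow decrease

dcf = 1.0
history = []
for ratio in [1.3, 0.9, 1.3, 0.9, 1.3, 0.9]:  # alternating Ar / Hi results
    dcf = update_dcf(dcf, ratio)
    history.append(round(dcf, 3))

print(history)  # DCF jumps after every Ar task and sags after each Hi task
```

Every Ar completion inflates the estimates of all queued tasks, so the client thinks the buffer is fuller than it really is and fetches less.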

Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0

RE: RE: The DCFs are good

Message 93385 in response to message 93383

Quote:
Quote:
The DCFs are good enough, judging by the time estimates quoted...

From Jord's post, we know the values of host 1945350:
Quote:
2009-06-13 20:50:35.5585 [PID=17566] [send] active_frac 0.999921 on_frac 0.998990 DCF 1.135882
2009-06-13 20:50:35.5623 [PID=17566] [send] [HOST#1945350] is reliable; OS: Microsoft Windows Vista, error_rate: 0.000010, avg_turn_hrs: 41.061 max res/day 16

So, I wanted to know if the other host has comparable values and if both are in the same venue.

Gruß,
Gundolf

From the bad machine's state file:

0.999057
1.000000
0.999926

That is, the values are approximately 100%.

The good machine's values are comparable:

0.999375
-1.000000
0.999915
0.993433

Both machines run 24x7 and have had no outages during the past week. The bad machine's last outage was the disk crash/rebuild about a week or so ago.

Both machines are on my home LAN, just a couple of Ethernet switches away from each other (all of the 1000baseT machines are grouped on switches separate from the 100baseT machines).

Stan
