Work Buffer Not Filled

Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0
Topic 194393

Since a disk crash about 10 days ago, the work buffer on the repaired machine will only fill partially. The total estimated time for work in the buffer is well below (about 30% of) the time that should be available.

To repair from the crash I installed a new hard drive, reinstalled Windows, and reinstalled BOINC 6.4.7 from distribution media. I used the same HOST NAME, and BOINC/E@H recognized the machine as the same one from before the crash. (That seemed strange to me ... usually I have to MERGE machines to incorporate the history from prior to a crash. I don't think I initiated a MERGE, but since I did the rebuild in the middle of the night, I might not remember correctly.)

The machine (hostid=1945350) task list shows "host detached" for the tasks that were in the buffer at the time of the crash. It is as though the "host detached" tasks were being counted as using up part of the buffer capacity.

Those "host detached" tasks do not appear in my buffer now (unless BOINC has recovered the info and hidden them somewhere ... they do not show in the BOINC Manager task list.)

Is this "normal?" Has BOINC or E@H put the machine "on probation" until it shows that it can stay up for more than a couple weeks? Will the situation "self-correct" or should I be doing something such as "Reset Project"?

Stan

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 799988979
RAC: 1212589

Work Buffer Not Filled

I guess BOINC also needs some time to recalibrate its estimate of your PC's throughput in terms of results returned per day. I'd wait a bit and see how this evolves.

CU
Bikeman

Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0

RE: I guess BOINC also

Message 93377 in response to message 93376

Quote:

I guess BOINC also needs some time to recalibrate its estimate of your PC's throughput in terms of results returned per day. I'd wait a bit and see how this evolves.

CU
Bikeman


Thank you. Absent strong contraindications, that will be my path.

The aspect that puzzles me is that the time estimates shown in BOINC Manager for each WU are comparable to those on an almost identical Q9550 that is running with a full buffer, albeit an earlier release of BOINC Manager (5.10.13). Also, I compared the various "factors" stored in the machines' profiles and they seem comparable. So, BOINC must have a variable hidden away somewhere that tells it to drag its feet refilling this machine's buffer for a while ... cuz it might still be an unreliable machine. :)

Stan

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: ...So, BOINC must have

Message 93378 in response to message 93377

Quote:
...So, BOINC must have a variable hidden away somewhere that tells it to drag its feet to refill this machine's buffer for a while ... cuz it might still be an unreliable machine. :)


What about the "% of time" values (near the bottom of Computer summary)? Are they near 1 (resp. 100%)?

Regards,
Gundolf

Computers aren't everything in life. (Just a little joke.)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 0

RE: cuz it might still be

Message 93379 in response to message 93377

Quote:
cuz it might still be an unreliable machine. :)


No, your machine is designated as reliable. You're just getting those Arecibo tasks which take quite a bit longer than the Hierarchical ones.

As per the last scheduler log:

Quote:
2009-06-13 20:50:35.3908 [PID=17566] Request: [HOST#1945350] client 6.4.7
2009-06-13 20:50:35.5585 [PID=17566] [send] CPU: req 0.00 sec, 0.00 instances; est delay 125481.79
2009-06-13 20:50:35.5585 [PID=17566] [send] work_req_seconds: 145.27 secs
2009-06-13 20:50:35.5585 [PID=17566] [send] Not using matchmaker scheduling; Not using EDF sim
2009-06-13 20:50:35.5585 [PID=17566] [send] available disk 2.30 GB, work_buf_min 0
2009-06-13 20:50:35.5585 [PID=17566] [send] active_frac 0.999921 on_frac 0.998990 DCF 1.135882
2009-06-13 20:50:35.5623 [PID=17566] [send] [HOST#1945350] is reliable; OS: Microsoft Windows Vista, error_rate: 0.000010, avg_turn_hrs: 41.061 max res/day 16
2009-06-13 20:50:35.5624 [PID=17566] [send] set_trust: random choice for error rate 0.000010: yes
2009-06-13 20:50:35.5624 [PID=17566] [mixed] sending non-locality work first
2009-06-13 20:50:35.5626 [PID=17566] [send] est. duration for WU 54220777: unscaled 18362.44 scaled 20880.30
2009-06-13 20:50:35.5626 [PID=17566] [send] [WU#54220777] meets deadline: 125481.79 + 20880.30 < 1209600
2009-06-13 20:50:35.5640 [PID=17566] [debug] Sorted list of URLs follows [host timezone: UTC-18000]
2009-06-13 20:50:35.5640 [PID=17566] [debug] zone=-28800 url=http://einstein.ligo.caltech.edu
2009-06-13 20:50:35.5640 [PID=17566] [debug] zone=+00000 url=http://einstein.astro.gla.ac.uk
2009-06-13 20:50:35.5640 [PID=17566] [debug] zone=+03600 url=http://einstein.aei.mpg.de
2009-06-13 20:50:35.5642 [PID=17566] [send] [HOST#1945350] Sending app_version einsteinbinary_ABP1 2 305 ; 3.00 GFLOPS
2009-06-13 20:50:35.5813 [PID=17566] [send] est. duration for WU 54220777: unscaled 18362.44 scaled 20880.30
2009-06-13 20:50:35.5813 [PID=17566] [HOST#1945350] Sending [RESULT#130234453 p2030_53839_35673_0025_G41.59-00.35.C_4.dm_241_1] (est. dur. 20880.30 seconds)
2009-06-13 20:50:35.5982 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.5982 [PID=17566] [mixed] sending locality work second
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file skygrid_0840Hz_S5R5.dat
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.40_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.40_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.45_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.45_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.50_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.50_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.55_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.55_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.60_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.60_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.65_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file l1_0833.65_S5R4
2009-06-13 20:50:35.5982 [PID=17566] [locality] [HOST#1945350] has file h1_0833.70_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [locality] [HOST#1945350] has file l1_0833.70_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [locality] [HOST#1945350] has file h1_0833.75_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [locality] [HOST#1945350] has file l1_0833.75_S5R4
2009-06-13 20:50:35.5983 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.5983 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.5983 [PID=17566] [send] don't need more work
2009-06-13 20:50:35.6147 [PID=17566] Sending reply to [HOST#1945350]: 1 results, delay req 60.00
2009-06-13 20:50:35.6149 [PID=17566] Scheduler ran 0.500 seconds
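
As an aside, the scheduler's scaled estimate in the log above can be reproduced from the logged factors. This is a sketch assuming the usual BOINC scaling, scaled = unscaled × DCF / (on_frac × active_frac); the variable names are mine, and the numbers are taken straight from the log lines:

```python
# Reproduce the scheduler's scaled duration estimate from the logged factors.
# Assumed formula: scaled = unscaled * DCF / (on_frac * active_frac)

unscaled = 18362.44      # raw estimate for WU 54220777 (seconds)
dcf = 1.135882           # Duration Correction Factor from the log
on_frac = 0.998990       # fraction of time the BOINC client is running
active_frac = 0.999921   # fraction of that time work is allowed

scaled = unscaled * dcf / (on_frac * active_frac)
print(round(scaled, 2))  # close to the logged 20880.30

# The deadline check from the log: the current queue delay plus the new
# task's scaled duration must fit inside the two-week delay bound.
est_delay = 125481.79
deadline_bound = 14 * 24 * 3600  # 1209600 seconds
print(est_delay + scaled < deadline_bound)
```

That the numbers line up suggests nothing is wrong with the per-task estimates themselves.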
Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0

RE: RE: cuz it might

Message 93380 in response to message 93379

Quote:
Quote:
cuz it might still be an unreliable machine. :)

No, your machine is designated as reliable. You're just getting those Arecibo tasks which take quite a bit longer than the Hierarchical ones.
...


I'm not following the "Arecibo task" logic ...

Here are buffers from two nearly identical machines:

Problem machine ID 1945350: time estimates for the 11 Arecibo WU's in its buffer range from 5:45 to 6:06, and for the 18 usual jobs from 3:44 to 4:54. (29 total WU's, approx. 37 wall-clock hours.)

On an almost identical Q9550, ID=1719080, time estimates for its buffer of 29 Arecibo WU's range from 5:55 to 6:10, and for 79 usual WU's from 4:10 to 5:38. (108 total WU's, approx. 145 wall-clock hours.)

37 hours is approx. 25% of the other machine's buffer.

Stan

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: I'm not following the

Message 93381 in response to message 93380

Quote:
I'm not following the "Arecibo task" logic ...


And those two machines are in the same venue and have comparable Duration Correction Factors?

Computers aren't everything in life. (Just a little joke.)

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 3002365178
RAC: 708261

RE: RE: I'm not following

Message 93382 in response to message 93381

Quote:
Quote:
I'm not following the "Arecibo task" logic ...

And those two machines are in the same venue and have comparable Duration Correction Factors?


The DCFs are good enough, judging by the time estimates quoted.

The question is whether your values are as bad as this machine's (someone switched it off last week...):

    0.443432
    0.999902

Stan, you (and only you) can see those figures as

Quote:
% of time BOINC client is running 44.3432 %
While BOINC running, % of time work is allowed 99.9902 %


on the 'Computer summary' pages on this website. If one of the figures is low on one machine but not the other, it would have the effect you're describing.

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: The DCFs are good

Message 93383 in response to message 93382

Quote:
The DCFs are good enough, judging by the time estimates quoted...


From Jord's post, we know the values of host 1945350:

Quote:
2009-06-13 20:50:35.5585 [PID=17566] [send] active_frac 0.999921 on_frac 0.998990 DCF 1.135882
2009-06-13 20:50:35.5623 [PID=17566] [send] [HOST#1945350] is reliable; OS: Microsoft Windows Vista, error_rate: 0.000010, avg_turn_hrs: 41.061 max res/day 16


So, I wanted to know if the other host has comparable values and if both are in the same venue.

Regards,
Gundolf

Computers aren't everything in life. (Just a little joke.)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 0

RE: So, I wanted to know if

Message 93384 in response to message 93383

Quote:
So, I wanted to know if the other host has comparable values and if both are in the same venue.


You can look those values up in the scheduler log.
Searching for 1719080 in 2009-06-14_13:14.txt: active_frac 0.999915 on_frac 0.999372 DCF 1.131630

You just can't see the venue. Checking the new values for 1945350: active_frac 0.999926 on_frac 0.999056 DCF 1.073404 (and it made contact 3 minutes earlier).

Perhaps the client versions also matter: 1945350 is using 6.4.7, 1719080 is using 5.10.13.

But still, running both the new and longer Arecibo (Ar) search and the old and shorter Hierarchical (Hi) search against the same DCF will make that DCF bounce up and down. The estimated time to completion numbers aren't reliable, as finishing an old Hi task will change the estimated time to completion of the longer Ar tasks.
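
The bouncing can be sketched with a toy model. The classic 6.x client raises the DCF immediately when a task overruns its estimate, and lowers it only gradually when a task finishes early; the exact update rule below (instant jump up, 10% decay down, with illustrative runtime ratios of 1.3 for Ar and 0.9 for Hi) is an assumption chosen just to show the oscillation:

```python
# Toy model of DCF oscillation when long Arecibo (Ar) and short
# Hierarchical (Hi) tasks share one Duration Correction Factor.
# Assumed update rule: jump straight up on overrun, decay 10% of the
# way toward the observed ratio on underrun.

def update_dcf(dcf, ratio):
    """ratio = actual runtime / raw (unscaled) estimate."""
    if ratio > dcf:
        return ratio                   # instant increase
    return dcf + 0.1 * (ratio - dcf)   # slow decrease

dcf = 1.0
history = []
for ratio in [1.3, 0.9, 1.3, 0.9, 1.3, 0.9]:  # alternating Ar / Hi results
    dcf = update_dcf(dcf, ratio)
    history.append(round(dcf, 3))

print(history)  # DCF jumps after every Ar task and sags after each Hi task
```

Every Ar completion inflates the estimates of all queued tasks, so the client thinks the buffer is fuller than it really is and fetches less.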

Stan Pope
Joined: 22 Dec 05
Posts: 80
Credit: 426811575
RAC: 0

RE: RE: The DCFs are good

Message 93385 in response to message 93383

Quote:
Quote:
The DCFs are good enough, judging by the time estimates quoted...

From Jord's post, we know the values of host 1945350:
Quote:
2009-06-13 20:50:35.5585 [PID=17566] [send] active_frac 0.999921 on_frac 0.998990 DCF 1.135882
2009-06-13 20:50:35.5623 [PID=17566] [send] [HOST#1945350] is reliable; OS: Microsoft Windows Vista, error_rate: 0.000010, avg_turn_hrs: 41.061 max res/day 16

So, I wanted to know if the other host has comparable values and if both are in the same venue.

Gruß,
Gundolf

From the bad machine's state file:

0.999057
1.000000
0.999926

That is, the values are approximately 100%.

The good machine's values are comparable:

0.999375
-1.000000
0.999915
0.993433

Both machines run 24x7 and have had no outages during the past week. The bad machine's last outage was the disk crash/rebuild about a week or so ago.

Both machines are on my home LAN, just a couple of Ethernet switches away from each other (all of the 1000baseT machines are grouped on switches separate from the 100baseT machines).

Stan
