To all those guys having a real lot of machines -- poor you!

Wurgl (speak^Wc...

Joined: 11 Feb 05

Posts: 321

Credit: 140550008

RAC: 0

13 Apr 2006 11:40:59 UTC

Topic 191079

Hi guys, this may affect all of you having a lot of machines. Thinking of Bruce Allen with 300 brand new Opterons, thinking of the master of Merlin.

The reason is this message:

2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds

4 failures caused by the power loss mean a penalty of 7 days (604800 seconds)? Fine! Great job.

And now all guys with a lot of fast boxes have to check every single machine if it is still willing to connect.

Thanks Boinc for this grand logic.

So admins of the world, start your mouse and check every single box! Have fun!

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6594

Credit: 335598128

RAC: 402643

To all those guys having a real lot of machines -- poor you!

13 Apr 2006 11:55:49 UTC

Message 28054

Quote:

Hi guys, this may affect all of you having a lot of machines. Thinking of Bruce Allen with 300 brand new Opterons, thinking of the master of Merlin.

The reason is this message:
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
4 failures caused by the power loss mean a penalty of 7 days (604800 seconds)? Fine! Great job.

And now all guys with a lot of fast boxes have to check every single machine if it is still willing to connect.

Thanks Boinc for this grand logic.

So admins of the world, start your mouse and check every single box! Have fun!

Well, not exactly. :-)
I know with Windoze there is BoincView - a great farm implement - so you can retry any/all Boinc functions across one's connected flock to invoke said behaviour, with I think at most two mouse clicks for the entire crew! It works for me anyhows.
I am unsure whether any other applications on this page would be similarly suitable for other platforms. Can anybody comment here on such 'add-ons'?
Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Wurgl (speak^Wc...

Joined: 11 Feb 05

Posts: 321

Credit: 140550008

RAC: 0

RE: Well, not exactly.

13 Apr 2006 12:05:23 UTC

Message 28055 in response to message 28054

Quote:

Well, not exactly. :-)
I know with Windoze there is BoincView - a great farm implement - so you can retry any/all Boinc functions across one's connected flock to invoke said behaviour, with I think at most two mouse clicks for the entire crew! It works for me anyhows.

Yes, I know, there are applications. But in my special situation I can directly access 5 of my machines. For the others I have no direct access. So even if there is such a tool, it would not help.

However, I do not like an application which needs a nurse to check on it over and over again. Maybe you have read that famous UNIX haters handbook? I had access to a printed one sometime in '94 or '95. One thing they did not like about Unix was those core files. After a while, you could find such a file in almost every directory (okay, this does not happen any more, but in those days it was true). And that behaviour with core files is a similar example, where a computer needs a nurse to clean up the crap over and over again.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6594

Credit: 335598128

RAC: 402643

RE: Yes I know, there are

13 Apr 2006 12:40:58 UTC

Message 28056 in response to message 28055

Quote:

Yes, I know, there are applications. But in my special situation I can directly access 5 of my machines. For the others I have no direct access. So even if there is such a tool, it would not help.

However, I do not like an application which needs a nurse to check on it over and over again. Maybe you have read that famous UNIX haters handbook? I had access to a printed one sometime in '94 or '95. One thing they did not like about Unix was those core files. After a while, you could find such a file in almost every directory (okay, this does not happen any more, but in those days it was true). And that behaviour with core files is a similar example, where a computer needs a nurse to clean up the crap over and over again.

I hear you! I take your point that it is odd that after a mere 4 failed tries you wind up at such a large delay of 604800 seconds. I wonder if the delay was already high before those occurred ..... does anybody know the algorithm on that? Pure geometric or what?
I have read the good old "Mythical Man Month" and I'll certainly have a peek at that UNIX book! I wonder if I can get a 'free Unix Barf Bag' though ...... :-)
Cheers, Mike.
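
On the "pure geometric" question, a quick back-of-the-envelope check may help. The sketch below only does arithmetic: it converts the reported 604800 seconds to days and counts how many doublings a delay would need to reach that value. The 60 second starting delay is a made-up assumption for illustration, not a value taken from the BOINC client.

```python
import math

WEEK_SECONDS = 604800  # deferral reported in the log message


def doublings_to_reach(target_seconds, base_seconds):
    """Number of doublings needed for a delay starting at base_seconds
    to reach at least target_seconds."""
    return max(0, math.ceil(math.log2(target_seconds / base_seconds)))


# Length of the reported deferral in days.
print(WEEK_SECONDS / 86400)

# Doublings needed from a hypothetical 60 s starting delay (assumption).
print(doublings_to_reach(WEEK_SECONDS, 60))
```

Under that assumed starting value, far more than four doublings would be needed, so four failures alone would not get a purely doubling delay anywhere near a week.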

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

archae86

Joined: 6 Dec 05

Posts: 3164

Credit: 7364961687

RAC: 2283850

RE: I hear you! I take your

13 Apr 2006 13:16:10 UTC

Message 28057 in response to message 28056

Quote:

I hear you! I take your point that it is odd that after a mere 4 failed tries you wind up at such a large delay of 604800 seconds. I wonder if the delay was already high before those occurred ..... does anybody know the algorithm on that? Pure geometric or what?
I have read the good old "Mythical Man Month" and I'll certainly have a peek at that UNIX book! I wonder if I can get a 'free Unix Barf Bag' though ...... :-)
Cheers, Mike.

Two of my four machines finished the night displaying a six day wait for retry.

Reviewing the message log for one of them, it had tried between 9:15 and 11:45 p.m. MDT to download new work and to upload completed results. The upload requests show backoffs gradually increasing to a bit over an hour for the longest displayed backoff, before the (separate) new work/reporting 1 result request at 11:43, which earns the dreaded:

couldn't connect to server
...
no schedulers responded
...
fetching scheduler list
network error: couldn't connect to server
scheduler list fetch failed: http error
4 consecutive failure fetching scheduler list: deferring 604800 seconds

Yes, issuing an update this morning got my two laggards work again. And I do recall from the SETI extended downtime last year that excess requests can be a problem on restart, but this seems extreme.
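
For anyone who would rather not eyeball every message log by hand, something along these lines might do as a starting point. It is only a sketch: the regular expression is built from the two message variants quoted in this thread, the function name `find_deferrals` is made up, and other client versions may word the message differently.

```python
import re

# Pattern derived from the two message variants quoted in this thread.
# This is an assumption; other client versions may use different wording.
DEFERRAL = re.compile(
    r"(?:\[(?P<project>[^\]]+)\]\s+)?"
    r"(?P<failures>\d+) consecutive failures? fetching scheduler list"
    r"\s*[-:]\s*deferring (?P<seconds>\d+) seconds"
)


def find_deferrals(lines):
    """Return (project, failure count, deferral in days) for each matching line.

    project is None when the line carries no [Project] prefix.
    """
    hits = []
    for line in lines:
        m = DEFERRAL.search(line)
        if m:
            days = int(m.group("seconds")) / 86400
            hits.append((m.group("project"), int(m.group("failures")), days))
    return hits


sample = [
    "2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds",
    "network error: couldn't connect to server",
    "4 consecutive failure fetching scheduler list: deferring 604800 seconds",
]

for project, failures, days in find_deferrals(sample):
    print(project, failures, days)
```

Feeding it the saved message log of each box would at least show which machines are sitting in the long deferral and need a manual update.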

Wurgl (speak^Wc...

Joined: 11 Feb 05

Posts: 321

Credit: 140550008

RAC: 0

RE: Yes issuing a update

13 Apr 2006 13:23:19 UTC

Message 28058 in response to message 28057

Quote:

Yes, issuing an update this morning got my two laggards work again. And I do recall from the SETI extended downtime last year that excess requests can be a problem on restart, but this seems extreme.

Thanks! At least another victim :-)

I will not manually update today, instead I will watch what happens when E@H runs out of work. I have a cache of ~0.7 days, so in 5 hours one of the CPUs and in 6 hours the second CPU will be idle.

Let me see, maybe boinc is clever enough to ignore the delay, maybe not. Whatever happens, I will do a manual update tomorrow morning.

M. Schmitt

Joined: 27 Jun 05

Posts: 478

Credit: 15872262

RAC: 0

RE: Well, not exactly.

13 Apr 2006 13:24:27 UTC

Message 28059 in response to message 28054

Quote:

Well, not exactly. :-)
I know with Windoze there is BoincView - a great farm implement - so you can retry any/all Boinc functions across one's connected flock to invoke said behaviour, with I think at most two mouse clicks for the entire crew! It works for me anyhows.

With BoincView you cannot upload all your results where the upload failed with two clicks. :(
You need two clicks for every result, that's true, because you can't mark more than one.

cu,
Michael

Trog Dog

Joined: 25 Nov 05

Posts: 191

Credit: 541562

RAC: 0

RE: Hi guys, this may

13 Apr 2006 14:03:06 UTC

Message 28060

Quote:

Hi guys, this may affect all of you having a lot of machines. Thinking of Bruce Allen with 300 brand new Opterons, thinking of the master of Merlin.

The reason is this message:
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
4 failures caused by the power loss mean a penalty of 7 days (604800 seconds)? Fine! Great job.

And now all guys with a lot of fast boxes have to check every single machine if it is still willing to connect.

Thanks Boinc for this grand logic.

So admins of the world, start your mouse and check every single box! Have fun!

Or let boinc do its thing on its own.

AndyK

Joined: 5 Jan 06

Posts: 21

Credit: 44767

RAC: 0

RE: With BoincView you

13 Apr 2006 14:29:44 UTC

Message 28061 in response to message 28059

Quote:

With BoincView you cannot upload all your results where the upload failed with two clicks. :(
You need two clicks for every result, that's true, because you can't mark more than one.

cu,
Michael

Of course you can!
On the right side of the retry file transfer button, there is a down-arrow. Try this one and you'll see a menu entry called: retry all file transfers

AndyK

Want to know your pending credit?

[img]http://tinyurl.com/438v3[/img]
The biggest bug is sitting 10 inches in front of the screen.

Steve Cressman

Joined: 9 Feb 05

Posts: 104

Credit: 139654

RAC: 0

Not sure but I think the

13 Apr 2006 16:19:00 UTC

Message 28062

Not sure, but I think that after 10 failed attempts to get work the client tries to fetch the master file (scheduler list) to make sure that it is trying the right address. Then it makes ten more attempts before it tries to get the master file again. This is repeated until the 4th failure to get the master file, at which point it backs off for a week (604800 sec). There are also incremental backoffs between attempts to get work.

So it looks like the OP of this thread must have been hitting the update button in order to get to that point, because I don't think the outage was long enough to drive it there.

However, I do agree that a week-long backoff is excessive. After 4 failures to get the master file it should back off no more than 24 hrs, IMO.
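
To make the counting above concrete, here is a rough simulation of the logic as described in this post. It is a sketch of my understanding, not the actual BOINC client code: the constants, the function name and the simplification that a master file fetch fails whenever the scheduler request just failed are all assumptions, and the incremental backoffs between attempts are left out.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800, the deferral seen in the logs
RPC_FAILURES_PER_FETCH = 10      # per the description above (unconfirmed)
MAX_MASTER_FAILURES = 4          # per the log message


def failures_until_week_deferral(server_up):
    """Simulate scheduler contact attempts.

    server_up(n) -> bool says whether attempt number n (1-based) succeeds.
    Returns (attempts made, deferral in seconds). The deferral is 0 if the
    server came back before the week-long backoff was triggered.
    """
    rpc_failures = 0
    master_failures = 0
    attempt = 0
    while True:
        attempt += 1
        if server_up(attempt):
            return attempt, 0
        rpc_failures += 1
        if rpc_failures == RPC_FAILURES_PER_FETCH:
            rpc_failures = 0
            # Simplification: the master file fetch also fails while the
            # server is unreachable.
            master_failures += 1
            if master_failures == MAX_MASTER_FAILURES:
                return attempt, WEEK_SECONDS


# Server never comes back: how many failed attempts before the week-long deferral?
print(failures_until_week_deferral(lambda n: False))

# Server comes back at attempt 25: no week-long deferral.
print(failures_until_week_deferral(lambda n: n >= 25))
```

If the description is right, it takes 40 failed requests to reach the week-long deferral, which is why a short outage on its own seems unlikely to get there without manual updates.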

98SE XP2500+ @ 2.1 GHz Boinc v5.8.8

M. Schmitt

Joined: 27 Jun 05

Posts: 478

Credit: 15872262

RAC: 0

RE: RE: With BoincView

13 Apr 2006 16:44:42 UTC

Message 28063 in response to message 28061

Quote:

Quote:
With BoincView you cannot upload all your results where the upload failed with two clicks. :(
You need two clicks for every result, that's true, because you can't mark more than one.

cu,
Michael

Of course you can!
On the right side of the retry file transfer button, there is a down-arrow. Try this one and you'll see a menu entry called: retry all file transfers

AndyK

Thx! I never tried this. ;)

cu,
Michael
