To all those guys having a real lot of machines -- poor you!

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0
Topic 191079

Hi guys, this may affect all of you having a lot of machines. Thinking of Bruce Allen with 300 brand new Opterons, thinking of the master of Merlin.

The reason is this message:

2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds

4 failures caused by the power loss means a penalty of 7 days? Fine! Great job.

And now all guys with a lot of fast boxes have to check every single machine to see whether it is still willing to connect.

Thanks Boinc for this grand logic.

So admins of the world, start your mouse and check every single box! Have fun!

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325253344
RAC: 210665

To all those guys having a real lot of machines -- poor you!

Quote:

Hi guys, this may affect all of you having a lot of machines. Thinking of Bruce Allen with 300 brand new Opterons, thinking of the master of Merlin.

The reason is this message:

2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds

4 failures caused by the power loss means a penalty of 7 days? Fine! Great job.

And now all guys with a lot of fast boxes have to check every single machine to see whether it is still willing to connect.

Thanks Boinc for this grand logic.

So admins of the world, start your mouse and check every single box! Have fun!

Well, not exactly. :-)
I know with Windoze there is BoincView - a great farm implement - so you can retry any/all Boinc functions across one's connected flock to invoke said behaviour, with I think at most two mouse clicks for the entire crew! It works for me anyhows.
I am unsure whether any of the other applications on this page would be similarly suitable for other platforms. Can anybody comment here on such 'add-ons'?
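For boxes that can be reached over the network, on any platform, the same retry could probably be scripted around BOINC's command-line tool instead of a GUI. A minimal Python sketch, assuming `boinccmd` is installed and each client accepts remote GUI RPC connections; the host names, password and project URL below are made-up placeholders, and the script only prints the commands unless `dry_run` is switched off:

```python
import subprocess

# Placeholders for illustration only: replace with your own hosts,
# GUI RPC password and the project URL exactly as your client lists it.
HOSTS = ["node01", "node02", "node03"]
PASSWORD = "changeme"
PROJECT_URL = "https://example.org/project/"


def build_update_command(host, password, project_url):
    """Return the boinccmd argument list asking one host to contact the project."""
    return [
        "boinccmd",
        "--host", host,
        "--passwd", password,
        "--project", project_url, "update",
    ]


def update_all(hosts, password, project_url, dry_run=True):
    """Build (and optionally run) one update command per host."""
    commands = [build_update_command(h, password, project_url) for h in hosts]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=False: one unreachable box should not stop the loop.
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    update_all(HOSTS, PASSWORD, PROJECT_URL)
```

This only helps for machines you can actually reach; boxes behind someone else's firewall still need somebody on site.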
Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

RE: Well, not exactly.

Message 28055 in response to message 28054

Quote:

Well, not exactly. :-)
I know with Windoze there is BoincView - a great farm implement - so you can retry any/all Boinc functions across one's connected flock to invoke said behaviour, with I think at most two mouse clicks for the entire crew! It works for me anyhows.

Yes, I know there are applications. But in my special situation I can directly access 5 of my machines. For the others I have no direct access. So even if there is such a tool, it would not help.

However, I do not like an application which needs a nurse to check on it over and over again. Maybe you have read that famous UNIX-HATERS Handbook? I had access to a printed copy sometime in '94 or '95. One thing they did not like about Unix was those core files. After a while, you could find such a file in almost every directory (okay, this does not happen any more, but in those days it was true). And that behaviour with core files is a similar example where a computer needs a nurse to clean up the crap over and over again.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325253344
RAC: 210665

RE: Yes I know, there are

Message 28056 in response to message 28055

Quote:

Yes, I know there are applications. But in my special situation I can directly access 5 of my machines. For the others I have no direct access. So even if there is such a tool, it would not help.

However, I do not like an application which needs a nurse to check on it over and over again. Maybe you have read that famous UNIX-HATERS Handbook? I had access to a printed copy sometime in '94 or '95. One thing they did not like about Unix was those core files. After a while, you could find such a file in almost every directory (okay, this does not happen any more, but in those days it was true). And that behaviour with core files is a similar example where a computer needs a nurse to clean up the crap over and over again.


I hear you! I take your point that it is odd that after a mere 4 failed tries you wind up at such a large delay of 604800 seconds. I wonder if the delay was already high before those occurred ..... does anybody know the algorithm on that? Pure geometric or what?
I have read the good old "Mythical Man Month" and I'll certainly have a peek at that UNIX book! I wonder if I can get a 'free Unix Barf Bag' though ...... :-)
Cheers, Mike.


archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7277328383
RAC: 1943154

RE: I hear you! I take your

Message 28057 in response to message 28056

Quote:
I hear you! I take your point that it is odd that after a mere 4 failed tries you wind up at such a large delay of 604800 seconds. I wonder if the delay was already high before those occurred ..... does anybody know the algorithm on that? Pure geometric or what?
I have read the good old "Mythical Man Month" and I'll certainly have a peek at that UNIX book! I wonder if I can get a 'free Unix Barf Bag' though ...... :-)
Cheers, Mike.

Two of my four machines finished the night displaying a six day wait for retry.

The message log for one of them shows that it tried between 9:15 and 11:45 p.m. MDT to download new work and to upload completed results. The upload requests show backoffs gradually increasing to a bit over an hour at the longest, before the (separate) request for new work and reporting of 1 result at 11:43, which earns the dreaded:

couldn't connect to server
...
no schedulers responded
...
fetching scheduler list
network error: couldn't connect to server
scheduler list fetch failed: http error
4 consecutive failure fetching scheduler list: deferring 604800 seconds

Yes, issuing an update this morning got my two laggards work again. And I do recall from the SETI extended downtime last year that excess requests can be a problem on restart, but this seems extreme.

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

RE: Yes issuing a update

Message 28058 in response to message 28057

Quote:

Yes, issuing an update this morning got my two laggards work again. And I do recall from the SETI extended downtime last year that excess requests can be a problem on restart, but this seems extreme.

Thanks! At least another victim :-)

I will not manually update today, instead I will watch what happens when E@H runs out of work. I have a cache of ~0.7 days, so in 5 hours one of the CPUs and in 6 hours the second CPU will be idle.

Let me see, maybe boinc is clever enough to ignore the delay, maybe not. Whatever happens, I will do a manual update tomorrow morning.

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: Well, not exactly.

Message 28059 in response to message 28054

Quote:
Well, not exactly. :-)
I know with Windoze there is BoincView - a great farm implement - so you can retry any/all Boinc functions across one's connected flock to invoke said behaviour, with I think at most two mouse clicks for the entire crew! It works for me anyhows.

With BoincView you cannot retry all your failed result uploads with two clicks. :(
You need two clicks for every result, that's true, because you can't mark more than one.

cu,
Michael

Trog Dog
Joined: 25 Nov 05
Posts: 191
Credit: 541562
RAC: 0

RE: Hi guys, this may

Quote:

Hi guys, this may affect all of you having a lot of machines. Thinking of Bruce Allen with 300 brand new Opterons, thinking of the master of Merlin.

The reason is this message:

2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds

4 failures caused by the power loss means a penalty of 7 days? Fine! Great job.

And now all guys with a lot of fast boxes have to check every single machine to see whether it is still willing to connect.

Thanks Boinc for this grand logic.

So admins of the world, start your mouse and check every single box! Have fun!

Or let boinc do its thing on its own.

AndyK
Joined: 5 Jan 06
Posts: 21
Credit: 44767
RAC: 0

RE: With BoincView you

Message 28061 in response to message 28059

Quote:

With BoincView you cannot retry all your failed result uploads with two clicks. :(
You need two clicks for every result, that's true, because you can't mark more than one.

cu,
Michael

Of course you can!
To the right of the retry file transfer button there is a down-arrow. Try that one and you'll see a menu entry called: retry all file transfers

AndyK

The biggest bug is sitting 10 inch in front of the screen.

Steve Cressman
Joined: 9 Feb 05
Posts: 104
Credit: 139654
RAC: 0

Not sure but I think the

Not sure, but I think that after 10 failed attempts to get work the scheduler tries to get the master file (scheduler list) to make sure that it is trying the right address. Then it makes ten more attempts before it tries to get the master file again. This is repeated until the 4th failure to get the master file, at which point it backs off for a week (604800 sec). There are also incremental backoffs between attempts to get work.

So it looks like the OP of this thread must have been hitting the update button to get to that point, because I don't think the outage was long enough to drive it there.

However, I do agree that a week-long backoff is excessive. After 4 failures to get the master file it should back off no more than 24 hrs, IMO.
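To make the sequence described above concrete, here is a rough Python sketch of it. It is only a model of what this post guesses, not code from the BOINC client: the two thresholds come from the post and the log message, every request is assumed to fail, and the incremental backoffs between work requests are left out because their sizes are not given anywhere in this thread.

```python
# Model of the retry sequence as guessed in the post above.
# This is NOT taken from the BOINC source; names are hypothetical.
WORK_FAILURES_PER_MASTER_FETCH = 10   # from the post: 10 failed work requests
MASTER_FAILURES_BEFORE_DEFERRAL = 4   # from the log: "4 consecutive failures"
LONG_DEFERRAL_SECONDS = 604800        # from the log message
SECONDS_PER_DAY = 24 * 60 * 60


def failures_until_long_deferral():
    """Count failed requests before the long deferral, assuming all fail."""
    work_failures = 0
    master_failures = 0
    while master_failures < MASTER_FAILURES_BEFORE_DEFERRAL:
        # Ten failed scheduler requests in a row ...
        work_failures += WORK_FAILURES_PER_MASTER_FETCH
        # ... then the master file (scheduler list) fetch fails as well.
        master_failures += 1
    return work_failures, master_failures


work, master = failures_until_long_deferral()
print(f"{work} failed work requests, {master} failed master file fetches")
print(f"deferral: {LONG_DEFERRAL_SECONDS / SECONDS_PER_DAY} days")
```

Under these assumptions the long deferral is reached after 40 failed work requests, and 604800 seconds works out to exactly 7 days.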

98SE XP2500+ @ 2.1 GHz Boinc v5.8.8

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: RE: With BoincView

Message 28063 in response to message 28061

Quote:
Quote:

With BoincView you cannot retry all your failed result uploads with two clicks. :(
You need two clicks for every result, that's true, because you can't mark more than one.

cu,
Michael

Of course you can!
To the right of the retry file transfer button there is a down-arrow. Try that one and you'll see a menu entry called: retry all file transfers

AndyK

Thx! I never tried this. ;)

cu,
Michael
