Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250427373
RAC: 34918
1 Sep 2010 10:58:31 UTC
Topic 195302: Scheduler partially offline
Due to high request load, the scheduler is (automatically) disabled occasionally for 5 minutes to let the DB and transitioner catch up. We apologize for the inconvenience.
Actually, since yesterday (Sunday - October 3), the scheduler has been *totally* offline -- which means completed work can't be reported (including work coming up on deadline).
Is this related to the validator (storage location) problem mentioned elsewhere?
I have seen nothing from the project regarding the scheduler going offline for an extended period (as it has been).
Quote:
Due to high request load, the scheduler is (automatically) disabled occasionally for 5 minutes to let the DB and transitioner catch up. We apologize for the inconvenience.
-- which means completed work can't be reported (including work coming up on deadline).
Sorry to contradict you once again, but reporting is possible, as of 21:00 UTC:[pre]04/10/2010 23:01:11|Einstein@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 1 completed tasks
04/10/2010 23:01:16|Einstein@Home|Scheduler request succeeded: got 0 new tasks
04/10/2010 23:01:16||[sched_op_debug] handle_scheduler_reply(): got ack for result h1_0953.55_S5R4__42_S5GC1a_2[/pre]
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke.)
Yeah, I haven't been copping any of the 5 minute scheduler pauses over the weekend, either ... I think Barry has just been unlucky.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Well, I'd rather be badly unlucky than right on this one:
[pre]10/4/2010 3:38:53 PM Einstein@Home Sending scheduler request: Requested by user.
10/4/2010 3:38:53 PM Einstein@Home Reporting 4 completed tasks, requesting new tasks for GPU
10/4/2010 3:38:57 PM Einstein@Home Scheduler request completed: got 0 new tasks
10/4/2010 3:38:57 PM Einstein@Home Message from server: Project is temporarily shut down for maintenance[/pre]
I did a successful retry a few minutes later.
Ah, good. So the tasks were uploaded OK, but you had to retry for more. Probably the language ought to be 'the scheduler is responding, but we have short periods where you won't necessarily get new work upon request.'
PS - what's your turnaround time on GPU work?
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Mike -- I run a batch of workstations, many of which have Einstein running, and some of these have 9800GT GPUs. I will take a look at the run times.
Regarding reporting work -- frankly, I hadn't encountered this scheduler issue with Einstein, as far as I could see, until the past day. Today I've seen it on every one of my workstations running Einstein -- and instead of, say, 5 minutes every hour, I'd see rather extended 'maintenance' failures. Not so much for new work: I typically keep a decent cache, and since nearly all my workstations have 6 or more projects on them, I never run out of work.
Over the past few months, I've increased the amount of Einstein work I am processing -- from around 5K credits a day to double that -- mostly by shifting cycles from SETI (which is simply too problematic these days), from GPUGrid (for those 9800GT workstations), and to a degree from 'low yield' CPU-only projects like Rosetta and Malaria.
The lion's share of my processing though is ATI GPU processing for MilkyWay, Collatz and DNetc.
Regarding the scheduler, my sense is that there may be some other things going on there which for the past day have made it less accessible more often than even say last week.
Quote:
Ah, good. So the tasks were uploaded OK, but you had to retry for more. Probably the language ought to be 'the scheduler is responding, but we have short periods where you won't necessarily get new work upon request.'
Regarding the scheduler, my sense is that there may be some other things going on ....
Well, that's correct. With the validator disparity between ( worldwide ) servers being corrected now, one challenge has been to reconcile the DB without loss of coherence or user credits - a sin of replication and redundancy. That requires more accesses on the fly to achieve, and the SQL constructs aren't simple. The validator problem has turned out to affect about one WU per currently active contributor ( average ). It's like a bubble in a fuel line: it will purge, but only after some burping and backfires ..... :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
OK -- I thought it might be something like that -- just as I am seeing quite a rise in pending credit, I figured that might also be a side effect of the database locational oopsie being dealt with. A couple of weeks ago, my pendings ran between 4K and 6K -- currently they are over 9K.
Quote:
Well, that's correct. With the validator disparity between ( worldwide ) servers being corrected now, one challenge has been to reconcile the DB without loss of coherence or user credits - a sin of replication and redundancy. That requires more accesses on the fly to achieve, and the SQL constructs aren't simple. The validator problem has turned out to affect about one WU per currently active contributor ( average ). It's like a bubble in a fuel line: it will purge, but only after some burping and backfires ..... :-)
OK -- I thought it might be something like that -- just as I am seeing quite a rise in pending credit, I figured that might also be a side effect of the database locational oopsie being dealt with. A couple of weeks ago, my pendings ran between 4K and 6K -- currently they are over 9K.
Actually, now that I think about it, we probably haven't quite been explicit enough. My understanding is that the ~70K work units that didn't correctly validate were likely to have nearly all been processed adequately by the user computers, ie. with the usual ( low ) error rates. It would probably be silly to send them all out again if so, even though that would 'solve' the problem with fairly minimal admin. You could guess how contributors might not be happy with that, though. I, like yourself, have fast machines/bandwidth, so that's likely no great insult perhaps, but we have the vast hordes ( ** we luv you!! ** ) not on the bleeding edge to politely cater for as well. Thus the fiddle is to identify those that boned on the relevant naughty validator during the period in question, rinse them again, and not double/re-validate those adequately checked elsewhere etc ....
I've peeked at the SQL syntax being discussed. I mean, it's not like the DB entries have a 'we stuffed a validator' flag/field to inspect and pluck with. :-)
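A minimal sketch of the kind of selection being discussed -- illustrative only, not the project's actual tooling. It assumes the stock BOINC MySQL schema (result.validate_state, result.received_time, result.workunitid, workunit.id/name), and because the DB carries no per-validator marker it falls back on a made-up time window as a stand-in for 'the period in question':
[pre]# Illustrative sketch only.  Column names follow the stock BOINC server DB;
# the window bounds are hypothetical placeholders, not the real outage dates.

WINDOW_START = "2010-09-28 00:00:00"   # hypothetical
WINDOW_END = "2010-10-03 12:00:00"     # hypothetical

# Candidate workunits: anything marked valid while the suspect validator
# (the one pointed at the wrong storage location) was in service.  These
# would then be queued for a re-check, e.g. via workunit.need_validate.
CANDIDATES_SQL = f"""
SELECT wu.id, wu.name, COUNT(r.id) AS validated_results
FROM   workunit wu
JOIN   result   r ON r.workunitid = wu.id
WHERE  r.validate_state = 1          -- VALIDATE_STATE_VALID
  AND  r.received_time BETWEEN UNIX_TIMESTAMP('{WINDOW_START}')
                           AND UNIX_TIMESTAMP('{WINDOW_END}')
GROUP  BY wu.id, wu.name;
"""

if __name__ == "__main__":
    # Dry run: print the statement under discussion rather than run it.
    print(CANDIDATES_SQL)[/pre]
The real job is messier, of course: a plain time window over-selects workunits that were validated perfectly well elsewhere, which is exactly the double/re-validation Mike says they are trying to avoid.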
[ Mike awaits dev wrath via email .... ]
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Mike, thanks for your explanation here. For what it is worth, since Thursday, I've seen my pendings go from under 5800 to approaching double that (10.5K), and during that time the daily credits have dropped from 11K to under 6K.
I figure eventually this will flush through the system and things will return to the level I had increased to before this drive mapping issue.
For me, I tend to tolerate project problems when they fit two characteristics: that they are 'rifle shots' rather than 'machine gun' fire (compare Einstein, where problems are quite rare, to SETI, where problems are varied and very frequent); and that there is a sense of someone minding the store (what you are doing here) instead of either denial or, even more exasperating, silence regarding reports.
So I really appreciate you taking the time to provide explanations here.