Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250427373
RAC: 34918
1 Sep 2010 10:58:31 UTC
Topic 195302: Scheduler partially offline
Due to high request load, the scheduler is (automatically) disabled occasionally for 5 minutes to let the DB and transitioner catch up. We apologize for the inconvenience.
Actually, since yesterday (Sunday - October 3), the scheduler has been *totally* offline -- which means completed work can't be reported (including work coming up on deadline).
Is this related to the validator (storage location) problem mentioned elsewhere?
I have seen nothing from the project regarding the scheduler going offline for an extended period (as it has been).
Quote:
Due to high request load, the scheduler is (automatically) disabled occasionally for 5 minutes to let the DB and transitioner catch up. We apologize for the inconvenience.
-- which means completed work can't be reported (including work coming up on deadline).
Sorry to contradict you once again, but reporting is possible, as of 21:00 UTC:[pre]04/10/2010 23:01:11|Einstein@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 1 completed tasks
04/10/2010 23:01:16|Einstein@Home|Scheduler request succeeded: got 0 new tasks
04/10/2010 23:01:16||[sched_op_debug] handle_scheduler_reply(): got ack for result h1_0953.55_S5R4__42_S5GC1a_2[/pre]
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke.)
Yeah, I haven't been copping any of the 5 minute scheduler pauses over the weekend, either ... I think Barry has just been unlucky.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Well, I'd rather be badly unlucky than right on this one:
[pre]10/4/2010 3:38:53 PM Einstein@Home Sending scheduler request: Requested by user.
10/4/2010 3:38:53 PM Einstein@Home Reporting 4 completed tasks, requesting new tasks for GPU
10/4/2010 3:38:57 PM Einstein@Home Scheduler request completed: got 0 new tasks
10/4/2010 3:38:57 PM Einstein@Home Message from server: Project is temporarily shut down for maintenance[/pre]
I did a successful retry a few minutes later.
Ah, good. So the tasks were uploaded OK, but you had to retry for more. Probably the language ought to be 'the scheduler is responding, but we have short periods where you won't necessarily get new work upon request.'
PS - what's your turnaround time on GPU work?
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Mike -- I run a batch of workstations, many of which have Einstein running, and some of these have 9800GT GPUs. I will take a look at the run times.
Regarding reporting work -- frankly, I hadn't encountered this scheduler issue with Einstein, as far as I could see, until the past day. Today I've seen it on every one of my workstations running Einstein -- and instead of, say, 5 minutes every hour, I'd see rather extended 'maintenance' failures. Not so much for new work: I typically keep a decent cache, and since nearly all my workstations have 6 or more projects on them, I never run out of work.
Over the past few months, I've increased the amount of Einstein work I am processing -- from around 5K credits a day to double that -- mostly by shifting cycles from SETI (which is simply too problematic these days), from GPUGrid (for those 9800GT workstations), and to a degree from 'low yield' CPU-only projects like Rosetta and Malaria.
The lion's share of my processing though is ATI GPU processing for MilkyWay, Collatz and DNetc.
Regarding the scheduler, my sense is that there may be some other things going on there which for the past day have made it less accessible more often than even say last week.
Quote:
Ah, good. So the tasks were uploaded OK, but you had to retry for more. Probably the language ought to be 'the scheduler is responding, but we have short periods where you won't necessarily get new work upon request.'
Regarding the scheduler, my sense is that there may be some other things going on ....
Well, that's correct. With the validator disparity between ( worldwide ) servers being corrected now, one challenge has been to reconcile the DB without loss of coherence or user credits - a sin of replication and redundancy. That requires more accesses on the fly to achieve, and the SQL constructs aren't simple. The validator problem has turned out to affect about one WU per currently active contributor ( average ). It's like a bubble in a fuel line: it will purge, but only after some burping and backfires ..... :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
OK -- I thought it might be something like that -- just as I am seeing quite a rise in pending credit, I figured that might also be a side effect of the database locational oopsie being dealt with. A couple of weeks ago, my pendings ran between 4K and 6K -- currently they are over 9K.
Quote:
Well, that's correct. With the validator disparity between ( worldwide ) servers being corrected now, one challenge has been to reconcile the DB without loss of coherence or user credits - a sin of replication and redundancy. That requires more accesses on the fly to achieve, and the SQL constructs aren't simple. The validator problem has turned out to affect about one WU per currently active contributor ( average ). It's like a bubble in a fuel line: it will purge, but only after some burping and backfires ..... :-)
OK -- I thought it might be something like that -- just as I am seeing quite a rise in pending credit, I figured that might also be a side effect of the database locational oopsie being dealt with. A couple of weeks ago, my pendings ran between 4K and 6K -- currently they are over 9K.
Actually, now that I think about it, we probably haven't quite been explicit enough. My understanding is that the ~70K work units that didn't correctly validate were likely to have nearly all been processed adequately by the user computers, ie. with the usual ( low ) error rates. It would probably be silly to send them all out again if so, even though that would 'solve' the problem with fairly minimal admin. You could guess how contributors might not be happy with that, though. I, like yourself, have fast machines/bandwidth, so that's likely no great insult perhaps, but we have the vast hordes ( ** we luv you!! ** ) not on the bleeding edge to politely cater for as well. Thus the fiddle is to identify those that boned on the relevant naughty validator during the period in question, rinse them again, and not double/re-validate those adequately checked elsewhere etc ....
I've peeked at the SQL syntax being discussed. I mean, it's not like the DB entries have a 'we stuffed a validator' flag/field to inspect and pluck with. :-)
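A minimal sketch of the kind of selection being discussed -- illustrative only, not the project's actual tooling. It assumes the stock BOINC MySQL schema (result.validate_state, result.received_time, result.workunitid, workunit.id/name), and because the DB carries no per-validator marker it falls back on a made-up time window as a stand-in for 'the period in question':
[pre]# Illustrative sketch only.  Column names follow the stock BOINC server DB;
# the window bounds are hypothetical placeholders, not the real outage dates.

WINDOW_START = "2010-09-28 00:00:00"   # hypothetical
WINDOW_END = "2010-10-03 12:00:00"     # hypothetical

# Candidate workunits: anything marked valid while the suspect validator
# (the one pointed at the wrong storage location) was in service.  These
# would then be queued for a re-check, e.g. via workunit.need_validate.
CANDIDATES_SQL = f"""
SELECT wu.id, wu.name, COUNT(r.id) AS validated_results
FROM   workunit wu
JOIN   result   r ON r.workunitid = wu.id
WHERE  r.validate_state = 1          -- VALIDATE_STATE_VALID
  AND  r.received_time BETWEEN UNIX_TIMESTAMP('{WINDOW_START}')
                           AND UNIX_TIMESTAMP('{WINDOW_END}')
GROUP  BY wu.id, wu.name;
"""

if __name__ == "__main__":
    # Dry run: print the statement under discussion rather than run it.
    print(CANDIDATES_SQL)[/pre]
The real job is messier, of course: a plain time window over-selects workunits that were validated perfectly well elsewhere, which is exactly the double/re-validation Mike says they are trying to avoid.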
[ Mike awaits dev wrath via email .... ]
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Mike, thanks for your explanation here. For what it is worth, since Thursday, I've seen my pendings go from under 5800 to approaching double that (10.5K), and during that time the daily credits have dropped from 11K to under 6K.
I figure eventually this will flush through the system and things will return to the level I had increased to before this drive mapping issue.
For me, I tend to tolerate project problems when they fit two characteristics: that they are 'rifle shots' rather than 'machine gun' fire (compare Einstein, where problems are quite rare, to SETI, where problems are varied and very frequent); and that there is a sense of someone minding the store (what you are doing here) instead of either denial or, even more exasperating, silence regarding reports.
So I really appreciate you taking the time to provide explanations here.