I just picked up three at 22:34 UTC for my Ubuntu machine.
But maybe that was the last of them?
I haven't received any today. Crunchers are cold from the lack of work from Seti, Einstein and MilkyWay. They got a little bit of work from GPUGrid but that didn't last long.
I knew there would be trouble when MilkyWay announced that they would be starting regular Tuesday maintenance schedules along with Seti. I've always depended on Einstein to carry the load when my other projects dropped out. Now Einstein is not dependable either. So what's the use of having backup projects when your backup projects don't have work either? Bah humbug!
Jim1348 wrote: ... maybe that was the last of them?
They were probably resends - extra copies of tasks that failed or missed their deadline. You can easily distinguish resends from the initial 'primary' tasks: they will have an _2 (or higher) extension in their name, as opposed to _0 or _1 for the original copies.
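If you're curious about what's sitting in your own cache, a rough sketch like the one below (untested, and it assumes boinccmd is on the PATH of the machine running the client) will count the two kinds by looking at that trailing number.

```python
# Rough sketch (untested): count primary tasks vs. resends in the local BOINC cache.
# Assumes 'boinccmd' is on the PATH and is run on the machine hosting the client.
import re
import subprocess

out = subprocess.run(["boinccmd", "--get_tasks"],
                     capture_output=True, text=True, check=True).stdout

primaries = resends = 0
for line in out.splitlines():
    line = line.strip()
    if line.startswith("name:"):
        name = line.split(":", 1)[1].strip()
        m = re.search(r"_(\d+)$", name)   # trailing _0, _1, _2, ...
        if m:
            if int(m.group(1)) >= 2:
                resends += 1              # _2 or higher => resend
            else:
                primaries += 1            # _0 or _1 => original copy

print(f"primary copies: {primaries}, resends: {resends}")
```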
There are currently almost 0.25M tasks in progress and at a rough guess, maybe 10-20% of those will fail in some way or not be returned. So even if the workunit generator isn't kicked into life at today's start-of-business in Hannover, there will be quite a few more resends to be picked up if you happen to be asking at just the 'right' time :-).
Only problem is the 'right' time could be anywhere between now and a few weeks into the future :-).
We just have to hope that last night wasn't 'Wild Party Night' in Hannover and that people get to work on time and sufficiently awake to notice :-).
I'm always a bit bemused when one of these 'once in a blue moon' events happens. Other projects have problems/outages more regularly, but if it happens here, this project has suddenly become unreliable?? :-).
Cheers,
Gary.
There has been a problem in the pre-processing pipeline that we weren't able to fix before the buffers ran dry. We're frantically working on it.
Details: the central file system of our compute cluster Atlas is currently unstable, and apparently neither we nor the vendor support is able to determine the root cause to ultimately fix it. Currently it's working, and if it doesn't tip over in the next 12h we should be able to push at least one or two FGRPB1G datasets through. Typing this with fingers crossed...
BM
Keith Myers wrote: Now Einstein is not dependable either. So what's the use of having backup projects when your backup projects don't have work either? Bah humbug!
A bit exaggerated in my opinion; luckily Einstein is a solid project! I just dislike the website graphics, but that is just an aesthetic personal preference!
We seem to have some work again!! :-).
Cheers,
Gary.
Yep, I didn't run dry and life is good in Einsteinland.
Confirming, received some WUs
Betreger wrote: Yep, I didn't run dry and life is good in Einsteinland.
All of mine ran dry.
I keep a fairly small work cache on each machine and I have a script that gets all machines to top up to an extra 0.5 days worth for the over-night period. As Murphy's Law would always predict, the well ran dry just in time to make sure that nobody was going to get a drink last evening :-). I hung around for an extra hour or two after Bernd said he was on to it with fingers crossed, but as we all know, crossed fingers can't defeat Murphy :-).
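For anyone wondering, the script itself is nothing exotic. Roughly speaking (this is only an illustrative sketch, not the actual script; the path and the 0.5-day figure are just examples), it rewrites global_prefs_override.xml with a bigger additional buffer and asks the client to re-read it.

```python
# Illustrative sketch only (not the actual script): bump the extra work buffer for the
# overnight period by rewriting global_prefs_override.xml and asking the client to
# re-read it. The path and the 0.5-day figure are examples; needs write permission
# on the BOINC data directory.
import subprocess

OVERRIDE = "/var/lib/boinc-client/global_prefs_override.xml"   # common Linux location

prefs = """<global_preferences>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0.5</work_buf_additional_days>
</global_preferences>
"""

with open(OVERRIDE, "w") as f:
    f.write(prefs)

# Make the running client pick up the new preferences immediately.
subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)
```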
All of mine are pretty much back to normal now. There are two particular problems for me if hosts run dry temporarily. The first is that after trying to get work and being refused, hosts will go into increasingly longer backoffs which means that there can be quite a delay in asking for a drink when the well fills up again. I've solved that one by having an option in one of my scripts that requests the local boinccmd to force an 'update' on each host. Whatever stage of the backoff a host happens to be in, this gets canceled by the 'update' and the host can take an immediate drink.
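The guts of that option amount to a forced project update on each host. A rough sketch (hostnames, password and URL are placeholders, and each host has to allow GUI RPC access from the controlling machine) looks like this:

```python
# Rough sketch (placeholder hostnames, password and URL): force an immediate project
# update on each host, which cancels whatever scheduler backoff it is sitting in.
# Each host must allow GUI RPC access from this machine (remote_hosts.cfg).
import subprocess

HOSTS = ["cruncher01", "cruncher02", "cruncher03"]   # placeholder hostnames
RPC_PASSWD = "secret"                                # each host's gui_rpc_auth.cfg password
PROJECT_URL = "https://einsteinathome.org/"          # use the URL your client reports

for host in HOSTS:
    subprocess.run(["boinccmd", "--host", host, "--passwd", RPC_PASSWD,
                    "--project", PROJECT_URL, "update"], check=False)
```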
The other problem is that when tasks finally arrive (with all my GPUs running at least 2 concurrently) concurrent tasks all start together and tend to stay that way. My experience suggests that you don't get the lowest crunch time if tasks start and finish at the same time, so I try to make sure that start times are staggered so this inefficiency is avoided as much as possible. I've spent the last few hours sorting that out manually. I'm now designing a new module that hopefully will be able to do the job automatically.
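One possible way to induce the offset automatically (purely a sketch of the idea, not necessarily how the new module will end up working; the task name and delay are placeholders) is to briefly suspend one of a freshly started pair and resume it a few minutes later.

```python
# Sketch of the idea only: stagger two tasks that started together on one GPU by
# suspending one of them briefly, then resuming it. Task name and delay are placeholders.
import subprocess
import time

PROJECT_URL = "https://einsteinathome.org/"   # use the URL your client reports
TASK_TO_DELAY = "PLACEHOLDER_TASK_NAME_0"     # pick one of the pair from 'boinccmd --get_tasks'
OFFSET_SECONDS = 300                          # roughly the stagger you want between the pair

subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_TO_DELAY, "suspend"], check=True)
time.sleep(OFFSET_SECONDS)
subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_TO_DELAY, "resume"], check=True)
```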
Cheers,
Gary.
Gary Roberts wrote: The other problem is that when tasks finally arrive (with all my GPUs running at least 2 concurrently) concurrent tasks all start together and tend to stay that way.
Gary, while I also try to adjust matters from time to time to get two jobs sharing a GPU to be offset in time, I suspect this is less useful with the current flavor of WU data file than it has been at other times. At least on my Nvidia/Windows 10 platforms, the time spent in the "past the end" portion has shrunk to the vanishing point. I think the main virtue of offset times was allowing one of the two jobs to be running in the main portion while the other was "past the end". As that condition can't occupy any appreciable time at the moment, I suspect taking the trouble to induce an offset is unusually unfruitful right now.
None of which says anything about what a next batch of WU files might be like.