Thanks, Gary.
Indeed "large" is a good idea. I'm a bit reluctant to use "x8" or sth, because we may want to adjust the "bundle size" later.
I updated the 'fpops_estimation' of the old workunits - indeed that makes perfect sense for the additional replications necessary because of the tasks erroring out.
Thanks a lot!
BM
Hi Bernd,
I noticed that all of the latest Arecibo, large work units went into High Priority mode.
Has the deadline for the longer work units been adjusted?
I'm not worried about missing credit, but the 2-3 day deadline on long-running work units does mess up the BOINC scheduler for crunching other projects when the estimated run time is over 2 days per unit.
LumenDan wrote: Has the deadline for the longer work units been adjusted?
The deadline for Arecibo, large is 7 days - probably because they want the work back quickly. It's been that right from the start so I don't think that will change.
I took a peek at one of your hosts - the one with the RTX 3060. You don't have a large number of outstanding tasks, but you seem to be running every single EAH search possible. Einstein uses a single DCF (duration correction factor) for all searches, so it is going to play havoc with the estimates for the different searches and greatly increase the risk of BOINC deciding to use panic mode when it's not really needed.
You should consider reducing the number of different EAH searches you contribute to, perhaps choosing ones where the estimates are as close as possible to reality in order to reduce fluctuations in estimates as the single DCF gets yanked around.
If you really want to keep all searches (some are going to drop out and there's a new BRP7 search coming shortly) the best way to stop BOINC going into panic mode is to make your work cache setting extremely small - 0.1 days should be good enough to do that - until things settle. Just realise that DCF fluctuations will drive up some estimates from time to time but the low number of tasks on hand should not cause BOINC to panic.
I realise that such a low work cache setting might not be appropriate for other projects you might be supporting so the only alternative is to get rid of EAH searches that are causing the DCF swings. By comparing the estimates you see with the actual crunch times when completed, it should be obvious which searches are causing the problem.
Cheers,
Gary.
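Gary's point about a single DCF shared by all searches can be illustrated with a small sketch. The update rule below is a simplified assumption loosely modelled on how older BOINC clients are understood to behave (jump up at once, drift down slowly); it is not copied from the client source, and the function names and all numbers are made up for illustration.

```python
def update_dcf(dcf, raw_estimate_h, actual_h):
    """Return the new project-wide DCF after one task finishes.

    Assumed rule (a simplification, not the real client code):
    - if the task ran longer than the corrected estimate, jump up at once;
    - otherwise move down slowly towards the observed ratio.
    """
    ratio = actual_h / raw_estimate_h
    if ratio > dcf:
        return ratio
    if ratio < 0.1 * dcf:
        return 0.99 * dcf + 0.01 * ratio
    return 0.9 * dcf + 0.1 * ratio


def displayed_estimate(raw_estimate_h, dcf):
    """Estimate shown for a task: raw server estimate scaled by the DCF."""
    return raw_estimate_h * dcf


if __name__ == "__main__":
    dcf = 1.0
    # Hypothetical search B: the server says 1 h, the host really needs 6 h.
    dcf = update_dcf(dcf, raw_estimate_h=1.0, actual_h=6.0)
    # Hypothetical search A: the server says 8 h and that is accurate,
    # but the shared DCF now inflates what the client displays.
    print("DCF after one task of B:", dcf)
    print("Estimate now shown for A (h):", displayed_estimate(8.0, dcf))
    # One accurate task of A only pulls the DCF down a little.
    dcf = update_dcf(dcf, raw_estimate_h=8.0, actual_h=8.0)
    print("DCF after one task of A:", dcf)
```

With these invented numbers, one badly estimated task from one search is enough to make an accurately estimated 8 hour task from another search show up as 48 hours, and it takes many well-estimated tasks to bring the DCF back down.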
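Gary Roberts wrote: you seem to be running every single EAH search possible so the single DCF (duration correction factor) that Einstein uses for all searches is going to play havoc with the estimates for different searches and give a much increased risk for BOINC to decide to use panic mode when it's not really needed.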
Thanks for the response Gary,
It looks like a few events coincided to cause my BOINC scheduling issue. Now that I have had a chance to work on some of the supposed 2 day runtime units, I can see that it was just a gross overestimate.
At the time of writing the last post I had received three separate batches of Arecibo, large units with deadlines of the 9th, 10th, and 11th of May. The work units due on the 9th were predicted to have a runtime of 8 hours, while the newer work units had an estimated runtime of over 48 hours. In reality all of the Arecibo, large work units have taken under 8 hours and the DCF is starting to reduce for the remaining units.
Most of the time BOINC doesn't request CPU work units from Einstein@home on this PC because the project gets enough credit share from the GPU work units. A work shortage from several CPU-based projects at once had left the work cache low, so BOINC requested CPU work from Einstein@home as well, then panicked at the overestimated run time.
I temporarily suspended a subset of the Arecibo work units to free up a couple of work slots for other projects while the backlog cleared and now everything is running normally again.
I also reduced the additional work cache from 0.5 to 0.3 so that I am less likely to get so much work at once from any project.
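Why an inflated estimate pushes tasks into High Priority can be sketched with a crude deadline check. The real BOINC client runs a more detailed simulation that also accounts for resource shares; the earliest-deadline-first model below is only a stand-in, and the function name, the single slot and the task counts are assumptions for illustration.

```python
def tasks_at_risk(tasks, slots=1):
    """Return the tasks projected to finish after their deadline.

    tasks: list of (estimated_hours, deadline_hours_from_now).
    Tasks are started in deadline order on whichever slot frees up first.
    """
    free_at = [0.0] * slots          # hour at which each slot becomes free
    at_risk = []
    for est, deadline in sorted(tasks, key=lambda t: t[1]):
        i = min(range(slots), key=lambda k: free_at[k])
        free_at[i] += est            # projected finish time on that slot
        if free_at[i] > deadline:
            at_risk.append((est, deadline))
    return at_risk


if __name__ == "__main__":
    deadline_h = 7 * 24              # 7 day deadline, as quoted in the thread
    inflated = [(48.0, deadline_h)] * 4   # four tasks, each estimated at 48 h
    realistic = [(8.0, deadline_h)] * 4   # the same four tasks at about 8 h
    print("At risk with 48 h estimates:", len(tasks_at_risk(inflated)))
    print("At risk with 8 h estimates:", len(tasks_at_risk(realistic)))
```

In this toy case the fourth task is projected to finish at 192 hours against a 168 hour deadline when the estimates are inflated, while nothing is at risk once the estimates drop to the real run time. A smaller work cache helps for the same reason: fewer queued tasks means less projected time stacked up in front of each deadline.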
I'm glad you have everything back to normal.
Thanks for letting us know!
Cheers,
Gary.
Bernd Machenschalk wrote: ... So the FGRP searches/applications will eventually run out of work.
Looks like that may have just happened - at least it has for FGRPB1G.
For the last few days, it's been a struggle to get work and, of late, to get the scheduler to accept completed work when reported. Most scheduler requests of any sort have timed out with various sorts of error messages. Hosts have run dry, then got a trickle and quickly run dry again. Multi-hour work request backoffs have been the norm. The forums have been pretty well non-functional - many pages failing to load.
As Tom has discovered recently, all of a sudden the forums are back and so are lightning fast scheduler responses. Any completed tasks are immediately gobbled up and no replacements are issued. Responses that timed out after a couple of minutes are now suddenly completed in seconds. Are these FGRPB1G tasks now finished?? Or is this just a temporary issue and more work will follow?? What's the status of BRP7??
The situation had been deteriorating for several days and had become just about impossible over the last 24 hours or so. It would have been really nice if someone on the staff could have dropped a few lines to warn of these difficulties so that people could have been able to make alternative arrangements.
Surely that's not too much to ask??
Cheers,
Gary.
as far as FGRPB1G, as Bernd said, these are running out of work soon (in about 2 months according to the server status page, assuming current production keeps up). it's been trending this way for about a month or two, as the time to completion has been steadily dropping from its steady ~120 days, to ~90 days, and now around 60 days.
not being able to get work the past week or so is because of the BOINC Pentathlon competition, where some folks were spoofing hundreds of fake BOINC instances (logical hosts) to hoard as many tasks as possible. this has been going on for a couple of weeks in anticipation of the event. I've had a lot of trouble keeping my big systems fed with enough work. the problem seems to be that the einstein project servers can only distribute work at a set maximum rate, and the work send buffer gets exhausted when there are so many hosts constantly asking for work (that's why sched requests were only giving 0-5 tasks per request instead of a slog). each host can only ask for work every 60 seconds, and if your host is fast enough to complete work faster than the project can supply it (like mine are), the cache stays permanently depleted, strangling its production to whatever rate the server can send work. at first it was really only affecting me, since I'm the only person with hosts fast enough to be affected (>10M daily), but as the competition ramped up for Einstein officially, it got bad enough to affect everyone.
people are starting to get work again because the competition is coming to a close in about a day and everyone is ramping down. they've reported the bulk of their work and aren't constantly pinging for more anymore.
_________________________________________________________________________
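The throttling described above (one work request per 60 seconds, only a few tasks granted each time) can be sketched as a toy simulation. The function name and all rates are hypothetical, and the model ignores everything a real client does about cache targets and backoffs; it only shows how a fast host ends up limited to the server's supply rate.

```python
def simulate_cache(start_tasks, done_per_min, granted_per_request, minutes):
    """Toy model: the host finishes up to done_per_min tasks each minute and
    makes one scheduler request per minute that returns granted_per_request
    tasks. Returns (tasks_completed, tasks_left_in_cache)."""
    cache = start_tasks
    completed = 0
    for _ in range(minutes):
        run = min(cache, done_per_min)   # cannot crunch more than is on hand
        cache -= run
        completed += run
        cache += granted_per_request     # one request per 60 seconds
    return completed, cache


if __name__ == "__main__":
    # Hypothetical fast host: 3 tasks/min, server only hands out 2 per request.
    print(simulate_cache(start_tasks=100, done_per_min=3,
                         granted_per_request=2, minutes=120))
    # Same host once the server can grant 5 per request again.
    print(simulate_cache(start_tasks=100, done_per_min=3,
                         granted_per_request=5, minutes=120))
```

In the first case the starting buffer of 100 tasks drains in a little over an hour and a half, after which the host can only complete what each request hands out. In the second case the buffer never drains and the host runs at its full rate.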
I checked in two hours ago and found that my two most productive machines had run their already depleted buffers down by half overnight, so they were quite likely to run out of work within a few hours. But within half an hour there was a dramatic change: the routine timeouts and failures of most communications ended, and new work arrived at well above my consumption rate.
I only run FGRPB1G.
If it is not all down to changes in the $!@$#@! Pentathlon depredations, then perhaps some system configuration got modified, as the transition seemed rather abrupt.
and now the FGRPB1G "tasks to send" queue is down to 0.
_________________________________________________________________________
Ian&Steve C. wrote: and now FGRPB1G tasks to send queue is down to 0.
Perhaps getting steadily replenished--I got about thirty new tasks in three gulps after you posted that message.