'Oldest Unsent Result' is now showing at 9 d 14 h 53 m, which is definitely above the long-term norm - it usually caps off at 7 days. I can't say whether the project is generating too much work, or the clientele are collecting too little of it - either way, it's out of balance somehow.
I'll speculate that something has altered the dynamics of WU issue in the current run, making greatly delayed issue of the first partner task much more common than in the past. Bernd has, I think, mentioned a deliberate change to a "more random" pattern of issue. More speculatively, perhaps ATLAS's sudden appearance, followed by an almost equally sudden reduction, has put some transients into the system.
This is the actual (quite brief) statement that Bernd made. Here is the key bit:
Quote:
Random distribution is how it's designed to be.
So actually it's not a "deliberate change" to a more random distribution but rather the removal of the constraints which caused the scheduler to distribute work in a less random fashion (i.e. from the bottom frequency up in relatively small steps) during the previous run. In this new run there are no such constraints on the frequencies the scheduler is able to distribute, and so we see virtually the complete range being issued. Personally, I happen to think that the previous way of distributing work was more efficient and that some form of those previous constraints should be reapplied :-). This is only a personal opinion and certainly not an admission that the scheduler has suddenly gone "off the rails" and that something needs to be "fixed" :-).
You are quite correct to say that something has altered the dynamics. That "something" is essentially two-fold IMHO. Firstly, there are probably an order of magnitude more datasets in play at the moment compared with the average number in play during S5R3. I base that on the fact that on my farm I can see data from sub-100Hz frequencies (which I see occasionally as resends) to above 1100Hz. I've been steadily converting many Windows hosts to Linux, so each host receives a new ID and new data. A typical initial download will often have several resends somewhere in the low frequency range (say 100 - 300Hz) followed by a longer-lived dataset somewhere between 300Hz and 1100Hz+, chosen seemingly at random. This is based on the observation of around 30-50 conversions in the last several weeks.
Secondly, there are significantly fewer "active" hosts at the moment. The server status page says around 62K at the moment. My memory says there were close to 80K in the latter stages of R3 but maybe my memory is faulty. With the significant credit reduction for R4, a lot of hosts have drifted back to other projects, I would guess.
The combination of many more frequency steps currently in play and fewer hosts available to spread between those frequency steps simply means that it's impossible for the scheduler to keep the initial task distribution balanced. A single fast multicore machine is going to race away with the _0 tasks and even several slower single-core hosts aren't going to be able to keep up on the _1 task stream. If the scheduler had fewer frequencies to manage and more available hosts it could do a much better job of keeping the two queues in sync.
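Just to put some rough numbers on that (this is only a toy back-of-envelope model - the rates below are invented for illustration, not how the scheduler actually allocates work): if the _0 copies of a band's WUs are created and taken by one fast host much more quickly than the matching _1 copies can be handed to the few slower hosts holding the same data, the backlog of unsent _1 tasks ages at nearly the full elapsed time.

# Toy model - NOT the real scheduler; all rates are made up for illustration.
# One frequency band: a fast quad-core takes the _0 copies as fast as they are
# created, while a handful of slow single-core hosts are the only takers for
# the matching _1 copies.

FAST_RATE = 8.0   # assumed: _0 copies issued (and new WUs created) per day
SLOW_RATE = 1.5   # assumed: matching _1 copies actually sent out per day

def oldest_unsent_age(elapsed_days):
    """Approximate age (days) of the oldest still-unsent _1 task."""
    created = FAST_RATE * elapsed_days   # WUs created so far, paced by the fast host
    sent_1 = SLOW_RATE * elapsed_days    # _1 copies that have found a host
    backlog = created - sent_1           # _1 copies still waiting
    # the oldest waiting _1 belongs to a WU created sent_1 / FAST_RATE days in
    return elapsed_days - sent_1 / FAST_RATE if backlog > 0 else 0.0

for d in (3, 7, 12):
    print(f"after {d:2d} days: oldest unsent _1 copy is ~{oldest_unsent_age(d):.1f} days old")

With those made-up rates the oldest unsent copy is already past 9 days old within a fortnight, which is in the same ballpark as the figure currently showing on the server status page.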
I believe there is a third factor that is having an effect as well, and this agrees with your comments about ATLAS. It is my distinct impression that there has been a higher proportion of resends being distributed in the recent past, far higher than at a similar stage of previous runs. No doubt, if a large number of hosts are ramped up for a while and then ramped down without exhausting the cache, this could add to the resends. So could the action of a significant number of people deciding to switch projects as a result of the credit shock. Sudden changes in available host numbers like this could adversely impact the ability of the scheduler to keep the task streams even remotely in sync.
So, to all those out there who think there is an easily fixable problem, please try to consider that the scheduler is simply acting as designed. It's operating differently now, because the operating environment is now substantially different from what it was.
That makes a lot of sense. And it probably explains why this WU from August 19th not only waited out the allotted time before it timed out, but then sat for another 14 wasted days before it was resent again.
Losing 20,000 hosts would make a significant change. I wonder if the extended time between reporting and being granted credit is also a factor in losing hosts, besides the credit reduction and longer work times?
Well, I think there is some question as to actual host loss.
The problem is that hosts are counted only if they have reported a result in the past seven days. Not bad by itself, but if the average time to complete a result goes up substantially, and if a large number of hosts contribute at a rate for which that change moves them to reporting less than once a week, then you get a host count change without any real change at all.
So, first question: did CPU time to complete a result go up? Yes.
Next question: does the low tail of project participants contain lots of hosts contributing in the range of roughly one result every 1.1 to 4 weeks?
That would be an Einstein RAC on the current project of something like 27 down to 8.
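For anyone who wants to check that range, steady-state RAC is essentially just credit granted per day. The per-result credit below (~215) is my own assumption, chosen only to make the arithmetic concrete, but anything in that ballpark lands you on the quoted figures:

CREDIT_PER_RESULT = 215.0   # assumed granted credit per S5R4 result (illustrative only)

def steady_state_rac(weeks_per_result):
    # In steady state, RAC converges towards average credit granted per day.
    return CREDIT_PER_RESULT / (weeks_per_result * 7.0)

print(round(steady_state_rac(1.1), 1))   # ~27.9 -> "something like 27"
print(round(steady_state_rac(4.0), 1))   # ~7.7  -> "down to 8"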
As of now, relying on the BOINCstats representation of the project's XML, this would range from host 55542 down through host 75542. A glance at the monthly column in BOINCstats suggests that the great majority of these hosts have actually reported something in the last month, but a great many of them have not in the last week.
I don't doubt the "discouraged user" effect exists, but suspect that attributing more than a small fraction of the host count change to that is wrong. The long tail of low contribution hosts adds a lot to the count, but little to the output, and is hugely sensitive to aliasing from the 7-day accounting criterion.
I suspect the discouraged users tend to be much higher up the RAC ladder, far fewer in number than 20,000, but possibly significant in contribution loss.
Now, my numbers are sloppy, and the perfect match to 20,000 is entirely spurious, but I'm quite sure that average CPU time for WU completion interacts strongly with reported active host count.
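To see how easily the 7-day criterion can alias hosts out of the count (the runtime growth factors below are pure assumptions for illustration, not measured S5R3-to-S5R4 numbers): any host that was previously finishing a result roughly every 4-5 days or slower only needs its per-result time stretched by 50-100% to slip outside the window and vanish from the "active" count, with its owner having changed nothing.

WINDOW = 7.0  # days a host can go without reporting and still be counted "active"

def still_counted(old_days_per_result, runtime_factor):
    # Does a host that used to finish one result every old_days_per_result days
    # still report inside the window once its per-result time grows by runtime_factor?
    return old_days_per_result * runtime_factor <= WINDOW

for old in (3, 4, 5, 6, 7):
    print(f"{old} days/result before:",
          "still counted" if still_counted(old, 1.5) else "dropped", "at 1.5x,",
          "still counted" if still_counted(old, 2.0) else "dropped", "at 2.0x")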
The sign-on/sign-off behavior of nodes of the different grids and clusters seems to have a distorting effect on the statistics (and may even be responsible for many resends and deadline misses, as Gary has mentioned above). Bernd told me they are investigating this.
And as noted in the S5R3 countdown thread in the cafe, there seem to be a significant number of (mostly anonymous) S5R3 Power Users who haven't upgraded their app_info.xml files to accept S5R4 work. Maybe this was deliberate (thinking that S5R3 pays more credit) - in which case they're now well into "shooting themselves in the foot" territory, because there won't be any more S5R3 work issued; or they've forgotten / walked away from the Power User installations - in which case, we've lost them until they wake up and smell the (S5R4) coffee.
Quote:
The sign-on/sign-off behavior of nodes of the different grids and clusters seems to have a distorting effect on the statistics (and may even be responsible for many resends and deadline misses, as Gary has mentioned above). Bernd told me they are investigating this.
CU
Bikeman
Sounds like a re-run of some of the issues we had with RiversideCityCampus. They seem to have dropped out of the picture now, but IIRC Bruce Allen had some personal involvement in setting up that scheme and the Python scripts they used.
Quote:
... there seem to be a significant number of (mostly anonymous) S5R3 Power Users who haven't upgraded their app_info.xml files ...
I don't think it's "Power Users" at all. For starters, whilst it is doable with a lot of effort, how do you convince the scheduler to keep sending you vast numbers of R3 resends for vast numbers of different frequency bands? Those bands have to be all available on your host and specified in complete detail in your state file. Answer is - you can't - unless you have access to all the frequency bands and can insert that fact into your state file.
Whilst it is possible to go and find and download the huge quantities of data needed and then edit the fact that you have this data into client_state.xml (I have done this in a mini fashion just to test that it works - and it does, beautifully), I can't see any Power User going to this much trouble. To me, it has to have been a project-initiated exercise.
I currently have 5,000 pending credits. This is at the 66.xx count, so realistically, I'm looking at about 18,000 credits. I have WUs pending from August 31st.
Don't we just love the randomness of the way these WUs are being resent. Here's a gander at my oldest one - 4 issues.
Nothing like sending it out to someone, they time out, then I get it 3 days later and it's done in 2, then it's sent out to someone for validation 14 days later who has a client error on it the same day it was issued, and then it is sent to someone else 4 days later. I have 3 in a row like this, and waiting 6 to 8 weeks is "BULL". Oh, and the next group just after that hasn't been resent yet!
I don't mind having credits in the bank, but this is beginning to look like the collapse of the financial markets as far as I can tell. I wonder if someone will come in with a $700B bailout plan for this as well?
Currently at 6,000 WU credits pending. :)