Some Light on the Possible Causes of the Server Problems

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118639273923
RAC: 18453488
Topic 192349

Normally I don't pay a great deal of attention to the data usage that I get billed for as a result of participating in the EAH project. It's one of the consequences of participating and I fully accept that. However, I've just noticed something that is rather disturbing and I'm drawing it to the attention of any project staff who keep an eye on these boards. I'm doing it publicly in case there are other participants in a similar situation who hadn't noticed it yet.

I do have a large number of machines sharing a single ISP account. As of last October, before any server issues became apparent, my daily data usage was around 200MB on average - 6427MB for the month. I'm quite happy to regard this as normal usage and it is in line with what had been experienced in previous months. Usage had been growing a bit but so had the number of machines.

I believe that the first significant server issues occurred in November and it was also probably around this time that long results started to be replaced significantly by shorts. My data usage for November was 9730MB. There were some "heavy usage" days late in November - between 500 and 800MB per day. I didn't notice it at the time but I do recall seeing an apparent increase in the number of requests for new work which led to the downloading of new large data files and being a little surprised at how frequently I would, by chance, see this happening.

The server problems really took off in December and so did my data usage. I've just seen the bill for this month and my usage was 26,700MB. In late December, I actually started turning off machines (for other reasons) and by the end of the month, the number had reduced from close to 200 machines to around 65. There were some wild swings in data usage, eg 2108MB on Dec 22 down to 200MB on Dec 29. Undoubtedly these are examples of when the project was up and then down respectively.

I've now just gone through my daily usage for January, for which I'll receive a bill in February. Here are a couple of examples:-

January 05 - 1036MB
January 06 - 82MB
January 07 - 33MB
January 08 - 798MB
January 09 - 453MB
...
January 13 - 1240MB
January 14 - 1618MB
January 15 - 2458MB
January 16 - 3267MB
January 17 - 18MB
January 18 - 1203MB
January 19 - 369MB
January 20 - 84MB
January 21 - 54MB
January 22 - 63MB

When you look at those numbers it's pretty easy to see the days when the server was down, like Jan 6/7 for example. If I remember correctly the server was working "normally again" on Jan 8/9 and the data usage looks to be not too high as well.

As the project struggled to get "across the line" to finish S5R1, look what happened on Jan 14/15/16. On Jan 17, we couldn't report and couldn't get new results so the usage died completely. Jan 18/19 was when S5R1 finished and S5RI commenced. Jan 20/21/22 represented days when S5RI was crunching normally.
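
As a rough illustration of the reading above, the daily figures can be bucketed by size. This is a minimal sketch in Python; the 100MB and 1000MB cut-offs are arbitrary assumptions chosen only to separate the obvious cases, not values used by the project.

```python
# Daily usage in MB, copied from the January figures listed above.
daily_mb = {
    "Jan 05": 1036, "Jan 06": 82, "Jan 07": 33, "Jan 08": 798, "Jan 09": 453,
    "Jan 13": 1240, "Jan 14": 1618, "Jan 15": 2458, "Jan 16": 3267,
    "Jan 17": 18, "Jan 18": 1203, "Jan 19": 369, "Jan 20": 84,
    "Jan 21": 54, "Jan 22": 63,
}

# Assumed cut-offs (hypothetical, for illustration only).
QUIET_MB = 100    # below this: little or no downloading
HEAVY_MB = 1000   # at or above this: heavy downloading

quiet_days = [day for day, mb in daily_mb.items() if mb < QUIET_MB]
heavy_days = [day for day, mb in daily_mb.items() if mb >= HEAVY_MB]

print("quiet:", quiet_days)
print("heavy:", heavy_days)
```

Note that the quiet bucket mixes two different situations: the outage days (Jan 6/7 and Jan 17) and the normal S5RI crunching days (Jan 20/21/22). Usage alone can't tell them apart.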

No wonder the server wasn't able to cope, having to dish out and account for all that data.

The number of my client machines hasn't changed at all in Jan (about one third of what I was using in December), yet the data usage dropped from 3267MB to 60MB per day at the change of data run. No wonder the same server is now coping fine.

It seems to me that there should be a strong priority to avoid the potential for this sort of overload in the future. I may be wrong but it seems much more than just the difference between a mix of long and short results (ie what was normal) and the "all shorts" (abnormal) situation that developed in the last few weeks. I know there will always need to be a "cleanup" of leftovers near the end, but this seems to be way too big a "cleanup", pointing to the need for a better mechanism to handle the dishing out of the work in the first place so as to avoid such a massive cleanup.

I don't really have any answers on that "better mechanism" but I'd sure like some sort of option where I could declare that a certain machine has "high availability", so give it a data file it can crunch on for quite a while without having to change too often. I saw many examples where a machine requesting a cacheful of new work after an outage would be given up to 5 new large data files (ie 5 x 15.5MB) in order to get 10 - 20 new results. I'd have been much happier to see the server give out just one large data file and have the client get all its work from just that one file.
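
To put rough numbers on that example, here is a small Python sketch of the download cost per result. It counts only the 15.5MB data files quoted above and ignores grid files and any other traffic; the function name is made up for illustration, and it assumes a single data file could actually supply all 10 - 20 results.

```python
FILE_MB = 15.5  # size of one large data file, as quoted above


def mb_per_result(files, results):
    """Download cost in MB per result when `files` data files yield `results` results."""
    return files * FILE_MB / results


# Observed case: 5 new data files for 10 - 20 results.
five_files = (mb_per_result(5, 10), mb_per_result(5, 20))
# Preferred case: the same work served from a single data file.
one_file = (mb_per_result(1, 10), mb_per_result(1, 20))

print("5 files: %.2f to %.2f MB per result" % (five_files[1], five_files[0]))
print("1 file:  %.2f to %.2f MB per result" % (one_file[1], one_file[0]))
```

On these assumptions the single-file case costs one fifth as much download per result.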

Please don't interpret any of this as a rant - it's certainly not intended that way. I'm just trying to start a discussion to see if others think that the data issuing policy is sub-optimal and to see if anybody has any bright ideas on how to improve things for the next time we get towards the end of a data run.

As a final point indicating how much worse the end of S5R1 was compared to the end of the run that came before it, here is some monthly data usage that spans the previous transition. That transition occurred in June 2006 I believe:-

April 2006 - 3436MB
May 2006 - 8533MB
June 2006 - 7654MB
July 2006 - 3490MB

A very much smaller spike that time.
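
The two end-of-run spikes can be compared as the ratio of the peak month to the quiet month before it. This is a minimal Python sketch using only the monthly totals quoted in this post; treating April 2006 and October as the "baseline" months is an assumption, and the figures are not corrected for the changing number of machines.

```python
# Monthly totals in MB, as quoted in this post.
previous_run = {"baseline": 3436, "peak": 8533}   # April 2006 vs May 2006
s5r1_run = {"baseline": 6427, "peak": 26700}      # October vs December


def spike_ratio(run):
    """Peak month divided by the baseline month."""
    return run["peak"] / run["baseline"]


previous_ratio = spike_ratio(previous_run)
s5r1_ratio = spike_ratio(s5r1_run)

print("previous transition: %.2fx baseline" % previous_ratio)
print("end of S5R1:         %.2fx baseline" % s5r1_ratio)
```

Since machines were being switched off during December, the S5R1 ratio probably understates the per-machine increase.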

Cheers,
Gary.

history
Joined: 22 Jan 05
Posts: 127
Credit: 7573923
RAC: 0


Gary; I still claim the record at 11 each 15.5mb files and their associated grids downloaded to one machine in a single session (see previous post "170mb"). Your analysis is most welcome and on point. If I were to add anything it would be that the mathematics of server overload seems to have escaped the project coordinators. This is probably a result of the pesky string theorists. Anyway, due to the miscalculations (or lack of any calculations), the cat nearly died. My thanks for your thoughtful and accurate analysis. It sure was hell week on my little farm.

Regards-tweakster

Gary Roberts

Message 59853 in response to message 59852

Quote:
Gary; I still claim the record at 11 each 15.5mb files and their associated grids downloaded to one machine in a single session (see previous post "170mb").

That's a record that I have absolutely no intention of attempting to better :).
You are most welcome to keep it for as long as you like :).

The 3267MB that my boxes downloaded on Jan 16 represents over 50MB for every single machine. Most of my machines are slow by today's standards (PIII vintage) so each box doesn't do that many results in a single day (12 - 16 shorts per day). I really didn't monitor closely what was going on at the time - it's only now that I've become aware of the enormous data usage. The example I quoted was just one I happened to see and, at the time, I considered that it was probably atypical - nothing to get concerned about. There were probably much worse examples if I'd cared to look more closely.
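
A quick check of that per-machine figure in Python. The machine count is the "around 65" mentioned in the opening post, so it is an approximation, and expressing the traffic as a number of 15.5MB data files ignores grid files and uploads.

```python
JAN16_MB = 3267   # total download on Jan 16
MACHINES = 65     # "around 65" machines; approximate
FILE_MB = 15.5    # size of one large data file

mb_per_machine = JAN16_MB / MACHINES
files_per_machine = mb_per_machine / FILE_MB

# Each box does roughly 12 - 16 shorts per day.
mb_per_result_best = mb_per_machine / 16
mb_per_result_worst = mb_per_machine / 12

print("%.1f MB per machine" % mb_per_machine)
print("about %.1f data files per machine" % files_per_machine)
print("%.1f to %.1f MB per result" % (mb_per_result_best, mb_per_result_worst))
```

In other words, on that day each box was pulling down the equivalent of roughly three large data files for about a dozen short results.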

Quote:
Your analysis is most welcome and on point. If I were to add anything it would be that the mathematics of server overload seems to have escaped the project coordinators....

Thanks for your comments. I'm sure that Bruce & Co are well aware of the "mathematical details" of the overload. I'm fully prepared to acknowledge that it may be quite difficult to devise a full solution quickly, particularly with everything else that must be on their plates at the moment. I'm just hoping that someone may have some bright ideas.

For example, I have X machines capable of doing Y results per day. Give me (and a suitable quorum partner) a data file for each machine that will last me for two (or more) weeks. I'll upload regularly (say every 24 hours) and ask for more when that lot is finished.

Obviously this would be wasteful for "transient" participants.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2990746340
RAC: 701987


Message 59854 in response to message 59853

Quote:
.... For example, I have X machines capable of doing Y results per day. Give me (and a suitable quorum partner) a data file for each machine that will last me for two (or more) weeks. I'll upload regularly (say every 24 hours) and ask for more when that lot is finished.


The trouble is, this is effectively what we've been doing for the last six months, and the servers have proved that they are fully stable under this sort of load.

On the other hand, there will always be occasional WU dropouts - transient users, sure, but also hardware failures, overclocking experiments, software upgrades etc. etc. The whole point of BOINC is that it should be able to cope with these minor glitches.

Most of the errors will affect a single WU or a short run of WUs. What seems to have been the problem this time is that a lot of these little holes in the dataset have been left to one side, to be filled in later. 'Later' arrived, and we know - with hindsight! - the consequences.

For the future, I suggest that the 'work send' policy:

a) sends a higher proportion of 'short' WUs first (assuming that S5R2 still has the split between long and short), and
b) tries to identify gaps in the dataset, and fill them incrementally during the early stages of the run.

Gary Roberts

Richard,

I pretty much agree with all you have said but would like to add a few more points.

Yes, for the first 90% of the just completed run, the servers were fully stable but I still saw many instances of machines crunching results from several data files at once during that period. A box requesting more work would quite often be given a new data file rather than more results from a pre-existing one. A subsequent request would return to the original data file - go figure. I would like to see this not happen as much.

Because of this apparent policy to have several data files on the go at any one time, the number of individual machines all crunching on that same data file would seem to be larger than it really needs to be and the potential to be left with small pockets (dregs) of results to clean-up later would seem to be higher.

As we get closer to the end of a run, the servers seem hellbent on giving the dregs to the first box that comes along. I saw examples of a new large data file being downloaded for a single result. If there were some mechanism to force the scheduler to hold back (up to a week if necessary) on those cases where there was just a couple of results to go until a box that already had that data file asked for work, some of this huge spike in data downloading might be avoided. The project has got to wait (and possibly reissue non-returned work) for maybe up to a month now before every last gap is plugged. Seems to me that there is plenty of time to be more frugal with dishing out the dregs.

Quote:


For the future, I suggest that the 'work send' policy:

a) sends a higher proportion of 'short' WUs first (assuming that S5R2 still has the split between long and short), and
b) tries to identify gaps in the dataset, and fill them incrementally during the early stages of the run.

Certainly agreed, although I suspect future runs may well be structured quite differently to the previous "long" & "short" model. Also I would add a third and fourth point:-

(c) attempts to minimise the number of machines being given a particular slice of the dataset so that each machine has a longer opportunity to work without requiring further data downloads, and
(d) holds back on the issuing of a datafile to a "new" machine (instead of an "existing" one) where there are fewer than X results needed from that datafile.

X could have a value something like 5 for example.
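
Point (d), together with the "hold back up to a week" idea earlier in this post, could be sketched roughly as below. This is plain Python for illustration only, not BOINC scheduler code; the class, function, parameter and data file names are all made up here. X = 5 is the example value suggested above and the 7-day limit comes from "up to a week if necessary".

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str                     # hypothetical data file identifier
    results_remaining: int        # results still to be issued from this file
    days_held_back: float = 0.0   # how long the leftovers have been waiting


def may_issue(datafile, host_files, min_results=5, max_hold_days=7.0):
    """Decide whether work from `datafile` may go to a host that already
    holds the data files named in `host_files`."""
    if datafile.results_remaining == 0:
        return False   # nothing left to send
    if datafile.name in host_files:
        return True    # "existing" machine: no new download needed
    if datafile.results_remaining >= min_results:
        return True    # enough work to justify a fresh download
    # Dregs for a "new" machine: only release once the hold-back has expired.
    return datafile.days_held_back >= max_hold_days


dregs = DataFile("datafile_A", results_remaining=3)
print(may_issue(dregs, host_files={"datafile_A"}))   # host already has the file
print(may_issue(dregs, host_files={"datafile_B"}))   # host would need a download
```

The hold-back limit is there so that the last few results are not stranded forever if no suitable machine asks for work.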

Cheers,
Gary.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325479808
RAC: 70553


Here's some of my musings. I think, Gary, you've got a good hint right here:

Quote:
I have X machines capable of doing Y results per day. Give me (and a suitable quorum partner) a data file for each machine that will last me for two (or more) weeks. I'll upload regularly (say every 24 hours) and ask for more when that lot is finished.
Obviously this would be wasteful for "transient" participants.

- the total population of machines enjoined in the project is a large number.

- the 'engagement' amongst that population is wide.

- the capacity to calculate ( MFlops etc ) is quite varied between them.

- the connectivity characteristics ( network availability etc ) are similarly mutable.

From the server side, when it 'looks' at a machine it is seeing behaviour which is an amalgam of characteristics - not just this Prescott core, with that clock speed, X amount of RAM etc - but a creature resulting from a summation of factors, hardware features, the particular selection of options per the BOINC profiles etc.

However there is an over-arching historical aspect here in the sense that the project could further refine the identity/behaviour of a given machine as time passes, WU's come in etc. This is/will be derivable as some calculable metric(s) from the logs that are generated in the usual way - say 'average time to WU return' or somesuch. Alas periods of server difficulties could invalidate conclusions here.

Thus you could potentially have a 'buddy matching' system to link sufficiently similar beasts together as computational partners/groups which are hence issued WU's deliberately as part of a common quorum. Like goes with like. This is a layer of analysis over and above the current distribution model. This now stratifies the whole population of machines in a new-ish way. As we are including a timeline/history in our machine characterisations, then there is fluidity in who goes where.

So why would one bother doing this analysis/method at all? I'm not replete with hard data to back this up, but my gut feeling is that by removing 'mismatches' from quorums ( quora? quoramina? ), it would help reduce the clean-up/fiddly factor and perhaps have other effects too...

You wouldn't have to tackle the whole shebang at once, say only pick out certain machine types at first - so that Gary's farm machinery would be matched with others out there in a similar 'niche'. From a given machine's introduction to E@H, time would have to pass in order for that to firm up. I'm deliberately avoiding any emotive language here like 'reliable' etc, so as not to appear to give offense or lay out a value judgement. I'm simply thinking of project efficiency. But I think if one links ( within quora ) machines that are 'regular attenders' with those that are 'occasional' then we are wasting capacity - as witnessed by the end-of-run cleanup.
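
One very simple way to picture "like goes with like" is to rank hosts by a single historical metric and pair up neighbours. This is a minimal Python sketch, assuming 'average time to WU return' in days as the metric and a quorum of two; the host names and numbers are invented, and nothing here reflects how the project actually forms quorums.

```python
def pair_by_turnaround(avg_days):
    """Pair hosts whose average turnaround times are adjacent in rank.

    `avg_days` maps a host name to its average days to return a WU.
    Returns (pairs, leftover), where leftover is the unpaired host or None.
    """
    ranked = sorted(avg_days, key=avg_days.get)   # fastest returners first
    pairs = [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]
    leftover = ranked[-1] if len(ranked) % 2 else None
    return pairs, leftover


# Invented example data: host name -> average days to return a WU.
hosts = {
    "farm-01": 1.0,
    "laptop-a": 9.5,
    "farm-02": 1.2,
    "desktop-b": 8.0,
    "server-c": 3.0,
}

pairs, leftover = pair_by_turnaround(hosts)
print(pairs)
print(leftover)
```

Pairing by rank alone doesn't guarantee a close match (the second pair here differs by five days), so a real scheme would presumably need a tolerance as well, and the metric would only firm up once a machine had some history.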

There is much more data, and modes of analysis, to come down the pipe. E@H has a bright future as the LIGO's improve, and other IFO's join. The more 'herd' efficiency the better!

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Gary Roberts

Message 59857 in response to message 59856

Hi Mike,

Thanks for (as always) an interesting, informative and well set out range of ideas.

Quote:
... in the sense that the project could further refine the identity/behaviour of a given machine as time passes, WU's come in etc. This is/will be derivable as some calculable metric(s) ....
.... This is a layer of analysis over and above the current distribution model.

I'm always a bit nervous about adding to the complexity of what the servers have to do. Firstly there would be a real manpower issue to code up and maintain that extra intelligence. Secondly there would potentially be a significant increase in server load to implement the additional intelligence.

Quote:
... my gut feeling is that by removing 'mismatches' from quorums ( quora? quoramina? ), it would help reduce the clean-up/fiddly factor and perhaps have other effects too...

My gut feeling as well. My latest thoughts are that it might be easier to generate a simple "opt-in" questionnaire type approach rather than try to calculate how best to exclude unsuitable boxes. Experience shows that the bulk of participants don't delve deeply enough to bother with a lot of preference settings/questionnaires - particularly ones labelled as "optional" or for "advanced users only". These settings could be in the form of questions, the answers to which would indicate if the user had suitable "high availability" machines that could participate in some form of quorum matching or structuring.

If the questions were chosen carefully and scored, there would be a simple parameter that the scheduler could use to modify the work send policy. People who didn't opt in would still get their data as they do now. I do have some ideas on what those questions might be. If anybody is interested, I could list a couple here with reasons why I think they might be suitable for the task.

Cheers,
Gary.

Mike Hewson

Message 59858 in response to message 59857

Quote:
I'm always a bit nervous about adding to the complexity of what the servers have to do. Firstly there would be a real manpower issue to code up and maintain that extra intelligence. Secondly there would potentially be a significant increase in server load to implement the additional intelligence.


Absolutely. There are no doubt other show-stoppers and natal-throttlers too.. :-)

Quote:

My gut feeling as well. My latest thoughts are that it might be easier to generate a simple "opt-in" questionnaire type approach rather than try to calculate how best to exclude unsuitable boxes. Experience shows that the bulk of participants don't delve deeply enough to bother with a lot of preference settings/questionnaires - particularly ones labelled as "optional" or for "advanced users only". These settings could be in the form of questions, the answers to which would indicate if the user had suitable "high availability" machines that could participate in some form of quorum matching or structuring.

If the questions were chosen carefully and scored, there would be a simple parameter that the scheduler could use to modify the work send policy. People who didn't opt in would still get their data as they do now. I do have some ideas on what those questions might be. If anybody is interested, I could list a couple here with reasons why I think they might be suitable for the task.


Nice one!! You'd wind up with a sort of 'semi-contract' with some users above the current modes. The choice of derived parameter would NOT, I think, be RAC or any obvious derivative of it - but it's got that general flavour of 'utility'.

Interestingly the best match for any given machine ( seen from the viewpoint of the server end at the other end of the Internet ) is, of course, that machine itself! :-)

This rather negates the purpose of quorums being a ( semi- ) orthogonal check on a particular WU/algorithm/dataset. One would hope then that any non-random quorum partner matching system would not then introduce some subtle and as yet unknown adverse correlation or skew in the result production. This is a devil's advocate view. There may be some deep characteristics of machines ( when viewed in their context ) that would produce that. I recall long ago a discussion of 'random number' generation algorithms ( ? Scientific American ~ 25 years ago ? ): the author chose 10 distinct and some quite apparently bizarre methods of generation, and a system of concatenation of those methods to produce so-called random sequences. Within 5 minutes of it being run it had converged to a small set of indefinitely repeated but alternating numbers!! No matter what seed he used it wobbled for a while then came home to this 'Momma' sequence! The running of it had revealed deep mathematical connections in the algorithm set that would never have been obvious otherwise.

One initial foray that might be useful is in applying some variety of 'data-mining' techniques to the existing logs to see just what sort of usage/interaction patterns or behaviours can be elicited. I expect this would definitely be a non-trivial exercise as I would liken it to tracking specific fishies in a shoal, as they all swim around the ocean!! The degrees of freedom/axes might be a tad daunting. Perhaps it needs to be set as a homework problem!! :-)

Cheers, Mike.

