Gary: <snarky>What?! I'm not the most important single factor in your world! Huff!</snarky> :) Good call!
To eliminate ambiguity (sigh!), I seek to equally balance time.
Mike: "A fiat currency no less." Here, here! I seek to be entirely ambivalent to work credits. It's about helping science. Only that. I do it with otherwise-wasted resources.
Mike: Your details of Einstein scheduling are logical, very interesting, and untroubling. My only remaining wish is that Einstein would send me either fewer or smaller tasks from within the block of data I already have on my hard disk. Even that isn't necessary, as explained below.
Gary: Thanks again, but more deeply now, for suggesting I monitor CPU temperatures. I was heartened that for the first couple of days of monitoring, the temp seldom went above 80 C, never for long, and was often below 70 C. This morning, it woke up cranky. Core #0 often exceeded 90 C, with other cores only a bit lower. I've dialed SETI back to "Use at most 75% of CPUs" and "Use at most 45% of CPU time", which gives infrequent peaks of 80 C on Core #0 (seems acceptable) and lower temps on the other cores. BOINC runs 6 tasks (consistent with 75% of 8). The settings also seem to cause highly variable CPU utilization (peaks and valleys, 10% on one one-second reading to 100% on the next) spread equally among all 8 threads (judging by the Windows Resource Monitor; even the shape of the utilization graphs is identical among the eight threads).
If I've had these events before, I've been unaware. Thanks!
If this keeps up, I'll go inside the case to see whether there's a ball of dust somewhere it doesn't belong. This is different behavior from the first few days.
I tried TThrottle, advertised to throttle BOINC in response to core temp. It did a calibration run before starting, during which temperatures were steadily above 95 C and sometimes at 100 C for far longer than I was comfortable with. I stopped it and uninstalled it. The above BOINC controls (though admittedly cruder) are the best I know how to do.
Gary: I stopped using concurrency limits because of your supposition that they were contributing to the unexpected task arrivals I was experiencing. I have evidence consistent with your advice. You did me a service by asking me to consider it.
Since I started this experiment, I recall only three receipts of Einstein tasks. My notes indicate the first two were Feb 25 and 28 (far too close together). Both were after I started using concurrency limits. Each consisted of 6 tasks, each needing around 28 hours. I remember them as consistent with receipts prior to the experiment, but I can't put as much value on those recollections as I can on results while I'm taking notes. During this writing, the third receipt came: 2 (not 6) Einstein tasks (though still of around 28 hours each). That's good news. I'll be more reluctant to consider concurrency limits again. I'll leave them unused for the foreseeable future.
Maybe it resulted from not using concurrency limits.
This occurred after I changed computing preferences due to temperatures; maybe that contributed.
A request for both: Please help me decide whether I've recognized a flaw in how the scheduler handled my situation before my experiment.
Let's consider any period when a project sends blocks of six tasks, each with 28-hour lengths and three-week deadlines.
Let's assume:
I have eight threads available at 100% capacity.
For the average amounts of time my computer is on, completing those tasks on time requires over half the time I contribute to BOINC during those three weeks.
That's okay with me if my computer goes a period without Einstein tasks after that.
My computer processes that batch of six tasks at the same daily rate throughout the three weeks.
I'll say the 10-day half-life default on scheduler data is too low for the situation. (I'm glad I saw in the configuration files that I can decide on a different value for the half-life (<rec_half_life_days>), but that's another issue and I know both of you respect defaults a great deal. Just saying.)
A characteristic of a half-life is that half of "it" is gone in the half-life. (Examples: radioactive decay and drug persistence after medication.) In this case, we're talking about the effect of past scheduling decisions. (I don't have detailed knowledge and cohesive understanding of the scheduler data, but thinking about this "effect" seems reasonable.)
I suspect the point of the scheduling data is to "remember" decisions like these two: (1) doing too much work for a project, and (2) not doing enough work for a project. In both cases, the scheduler does well if it compensates with future decisions. It cannot do so if it "forgets".
I suspect the design purpose for using the half-life was to "forget" decisions like those at an appropriate rate. Sadly that rate isn't constant for all environments (a reasonable motivation for making it a configuration item).
I suspect that people studied some environment in the past and concluded a 10-day half-life worked well.
With any half-life, a very recent scheduling decision has a powerful effect (call it 100%); we don't want the scheduler to accept/request similar batches of tasks one immediately after the other. One half-life later, half the effect on current scheduling decisions has decayed (call it 50%). And one more half-life later, half of that is gone (25%). It's clear that 100% power is enough to deter accepting/requesting tasks immediately after getting tasks. I don't consider it clear that 25% power is enough.
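To make the decay concrete, here's a tiny sketch (my own illustration, not anything taken from the BOINC source; the function name and the 10-day default are my assumptions):

# Remaining influence of a past scheduling event under half-life decay.
# Illustration only; not BOINC's actual code.
def decay_weight(days_ago, half_life_days=10.0):
    return 0.5 ** (days_ago / half_life_days)

for days in (0, 10, 20):
    print(days, "days ago:", round(100 * decay_weight(days)), "% power")
# prints 100, 50 and 25 -- the figures used above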
To the contrary, I can report from my experience that the scheduler frequently accepted/requested more tasks very soon after completing the last task in a batch.
We can't expect the scheduler to make good decisions if the data decays too quickly. Evidence is that it is decaying too quickly for my good.
What's a better half-life for this environment? One candidate answer:
When a big batch of tasks arrives, the sending project needs (for simplification) half of the computer time (1.5 weeks) for the first three weeks. At the same time, the other two get 0.75 weeks. (Total: 3 weeks.)
If no other tasks for that project arrive for another 1.5 weeks, the other two projects each get 0.75 weeks during that time. At that point, all projects have 1.5 weeks of time.
Arguably, the best data on which to make a decision is data collected for the same amount of run time on each project. 4.5 weeks is the earliest point that happens.
Maybe, that's also the best half-life.
After two half-lives, the first results would be decayed to 25% power and the second to 50%. Presuming their results are consistent with each other, the effect has 75% power (the reason you folks so sagely counsel waiting for the scheduler data to catch up). Successive half-lives continue to improve results. If there's a change in the system, it'll work its way through the data eventually. If there's volatility in results, the data will muddle through, encouraging decisions based on middle positions within experiences.
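A second small sketch of that accumulation, under the same assumption of simple exponential decay: with the 4.5-week (31.5-day) half-life, one batch two half-lives old and another one half-life old together still carry 75% of a single fresh batch's weight:

# Combined influence of two past batches under half-life decay (illustration only).
def decay_weight(days_ago, half_life_days):
    return 0.5 ** (days_ago / half_life_days)

half_life = 31.5                         # days; the 4.5-week candidate discussed above
batch_ages = [2 * half_life, half_life]  # two earlier batches, in days
combined = sum(decay_weight(age, half_life) for age in batch_ages)
print(round(100 * combined), "% power")  # 25% + 50% = 75%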
I like that recipe. Maybe that's worth testing.
Maybe, I experienced exactly the behavior described in this thought piece.
I reacted by tweaking resource values. While I tend to think the evidence suggests those tweaks produced better results, I don't think results are solid for lack of data. Perhaps the temperature limits will keep me from collecting more in that environment.
At the same time, I consider it more important to gain data on longer half-lives than to collect more data in the 10-day half-life environment.
Protect me from me, if you will. Where's my thinking improvable?
Garry,
If you are serious about knowing if BOINC is honouring your resource shares or not and, more particularly, knowing the extent of any non-compliance with your resource share settings, here is a suggestion for you.
Take a look in your BOINC data directory for 3 "job log" files. The Einstein one will be called 'job_log_einstein.phys.uwm.edu.txt'. The other two will have their respective project URLs in the file name. Here are the last five lines in one of my files, representing the last 5 tasks that were completed by this particular host of mine. I have reduced the font size a little so the lines don't overflow. Each line contains space separated data about each completed task that the host returns.
1583076315 ue 42181.057911 ct 16674.990000 fe 105000000000000 nm LATeah1005F_584.0_74760_0.0_0 et 19380.404697 es 0
1583087341 ue 42181.057911 ct 16659.310000 fe 105000000000000 nm LATeah1005F_664.0_112140_0.0_0 et 19270.846723 es 0
1583095611 ue 42181.057911 ct 16620.390000 fe 105000000000000 nm LATeah1005F_664.0_292440_0.0_0 et 19295.118612 es 0
1583107102 ue 42181.057911 ct 16683.390000 fe 105000000000000 nm LATeah1005F_680.0_118920_0.0_0 et 19759.751162 es 0
1583115714 ue 42181.057911 ct 16764.320000 fe 105000000000000 nm LATeah1005F_680.0_302580_0.0_0 et 20102.229174 es 0
In each line, the first value is the time when the task was returned, expressed in seconds since the 'epoch'. The epoch is midnight on Jan 01, 1970. After that value there are pairs of values - a code followed by a value. Here are the meanings of codes, with the data values as shown in the first line.
ue - the estimated crunch time (eg 42181.057911 secs) - not important for resource share calcs.
ct - the actual CPU time to crunch the task (eg 16674.99 secs) - very important as this is what is used.
fe - the 'flops estimate' (105000000000000) - a measure of the supposed work content.
nm - the name of the actual task that was returned - no influence on anything.
et - the elapsed time taken by the task (eg 19380.404697) - longer than ct if host does other things.
Here's what I suggest you do. At some particular nominated time, write down the details of the very last line showing in each of your 3 files. Wait a decent amount of time (I would suggest a full week) without making any changes to settings. Just let BOINC do its thing. At the end, add up all the 'ct' values for the new lines that have appeared in each of the 3 job log files. You will then know exactly how much CPU time BOINC has allocated to each separate project and you will be able to see how well BOINC is following your resource shares.
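If adding up the 'ct' values by hand gets tedious, a few lines of Python could do the totals for you. This is only a rough sketch I haven't run myself; the data directory (the usual default on Windows is C:\ProgramData\BOINC) and the cutoff time are assumptions you would adjust:

# Sum CPU time ('ct') per project from the BOINC job_log files (untested sketch).
import os

data_dir = r"C:\ProgramData\BOINC"   # assumed default Windows data directory - adjust if needed
cutoff = 0                           # count only tasks returned after this epoch time

for name in sorted(os.listdir(data_dir)):
    if not name.startswith("job_log_"):
        continue
    total_ct = 0.0
    with open(os.path.join(data_dir, name)) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            if int(fields[0]) < cutoff:   # first field is the return time (epoch seconds)
                continue
            pairs = dict(zip(fields[1::2], fields[2::2]))  # code/value pairs: ue, ct, fe, nm, et, es
            total_ct += float(pairs.get("ct", 0.0))
    print(name, round(total_ct / 3600.0, 1), "CPU hours")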
Irrespective of how good or bad you think the agreement is, don't touch anything but simply wait another complete period to see what happens. I would be very surprised if BOINC doesn't have things pretty much under control by the end of two full weeks. I have never done this myself but I have confidence that BOINC, left alone, will be able to get pretty close - unless something drastic happens - e.g. Seti out of work for a large slice of the total time.
You need to forget about how many downloaded tasks you see at any single point in time or how many may be running at a particular instant. It's only the completed results that will tell the true story.
Garry wrote:
... This morning, it woke up cranky. Core #0 often exceeded 90 C, with other cores only a bit lower. I've dialed SETI back to "Use at most 75% of CPUs" and "Use at most 45% of CPU time", which gives infrequent peaks of 80 C on Core #0 (seems acceptable) and lower temps on the other cores. BOINC runs 6 tasks (consistent with 75% of 8). The settings also seem to cause highly variable CPU utilization (peaks and valleys, 10% on one one-second reading to 100% on the next) spread equally among all 8 threads (judging by the Windows Resource Monitor; even the shape of the utilization graphs is identical among the eight threads).
Garry,
Just a couple of quick points in relation to your recent experience with elevated temperatures. I'm not surprised by your findings of higher temperatures but I am concerned by one of the things you have changed to combat the problem. By all means reduce the threads BOINC is allowed to use. I would suggest 50% until you are sure of how hot things get.
The big no-no in my opinion is to use the setting for % of CPU time. Please realise that using 45% means that the CPUs will run at 100% load for 45% of the time and 0% load for 55% of the time. The change from 0% load to 100% load will occur quite frequently, roughly every second. This will induce thermal cycling in a CPU core, which may cause expansion and contraction stresses on delicate structures within the core. You are much better off with steadier temperatures from a reduced number of active cores that aren't thermal cycling like this.
If you can turn off HT so that you are using just 4 'real' cores rather than 8 shared 'threads' you can still use 100% of the cores and you may find that you get a steady, acceptable temperature with each of the 4 running tasks actually running faster than previously so that your work output is not as heavily reduced. If things are still too hot, you could then start dialing down the % of cores to 75% or even 50% if necessary.
I think you may find that using 4 full 'cores' might work out OK. It's at least worth a try.
Garry wrote:
I suspect that people studied some environment in the past and concluded a 10-day half-life worked well.
Like Mike, I only run Einstein these days. I ran Seti from 1999 in the classic days until the advent of BOINC in late 2004. Einstein opened its doors in Feb 2005 and that's when I joined. During that year, I ran Seti, Einstein and LHC and thought deeply about the scientific goals of each project and took a lot of notice of how each project was being managed, both in terms of the people involved and the resources they had access to. I decided that Einstein was most likely to offer long term reliability and the opportunity to participate in extremely important discoveries.
I mention all this simply to explain why I have no knowledge or experience of REC or of the half-life mechanism which came a lot later and were probably designed around Seti's requirements at that time. I had long since stopped running Seti at that point.
However I will offer this further comment, which is little more than speculation. One reason why BOINC was developed was really to support Seti in an expanding environment of increasing numbers of volunteers at a time when the classic model was falling apart at the seams. Other projects could start and use the same framework and some of the excess load that Seti couldn't handle could be deployed elsewhere and give Seti some breathing space. The BOINC developers were associated with Seti and would have had Seti's longer term goals in mind while spreading the load to new projects.
As a result, the design of BOINC and the changes over time would have had "Seti friendliness" in mind. That's not any sort of criticism - it's very much what you would expect and I have no issue with that. So the bottom line is that since Seti's long term reliability problems have been around for a very long time, it's natural to expect that defaults for any of the mechanisms that have been put in place are likely to be Seti friendly - unless the developers are really incompetent - which I'm sure they're not :-).
So, my strong feeling is not to fiddle with defaults like half-life unless you are really sure that any counter-advice comes from somebody really reliable who absolutely knows better. I would put someone like Richard Haselgrove in that category and if you went to the BOINC boards and asked about REC and half-life there, you would get top advice from him.
My very vague understanding of half-life is that it can change the "rate of change" in how BOINC attempts to rectify a discrepancy. Too long a half-life means that a correction might be happening too slowly. Too short a half-life might cause the corrections to 'overshoot' and perhaps 'oscillate' rather than reach stability. You would have to think that the designer of the system would have put a lot of thought (and perhaps trial and error) into selecting a value that would be good for the way Seti behaves. This is just speculation - I have no evidence.
Gary: Beggin' pardon. In the stream below, there is a preceding message of mine, then three of yours. (Thanks tons for caring so much! Appreciated.) I kept getting interrupted in writing mine, so I wasn't responding to your second and third notes even though you sent them before I sent mine.
You'll note you closed your note with, "I started this reply yesterday. Other things intervened and I wasn't able to finish it then." And I opened with my "<snarky></snarky>" comment (intended tongue-in-cheek).
I'm responding here for the first time to your second and third of those messages.
me: https://boinc.berkeley.edu/trac/wiki/ClientSched "says the scheduler tries to balance work credit."
Sigh. How right you are. I've read more carefully, both that and https://boinc.berkeley.edu/wiki/How_BOINC_works. That information isn't there either.
That said, my theory doesn't support my "assign resources as inverse of credit granted" scheme. You say, "You need to ditch all of that thinking." I agree.
I still have trouble with the fact that when I had equal resource shares, I would typically witness Einstein using six of the eight threads, with the other projects getting one each. And Einstein tasks coming in batches of six 28-hour tasks (168 hours of processing, the same number of hours in a week). And those batches coming in rapid succession.
I describe that as my experience. I don't understand it. I know it conflicts with the common understanding (yours and mine, too) of how BOINC works. And, it's my reality. Do you believe me?
Guaranteed: With the imbalanced resource limits, the time balance was qualitatively better. I didn't operate long enough to have the numbers. They'd bear out, but I don't know how much better. Only, now, I don't know why. Maybe I'll get back to it. The half-life looks more important.
You didn't believe me on my first theory (and you were right; I was off). On the theory about half-life, though? The argument seems compelling that the scheduler doesn't have enough information to know it's time for the next Einstein batch until half-life is far higher than the 10-day one-size-fits-all default so many of us are using.
You don't always drive with your bright lights on, right? Or your windshield wipers? You don't always drive a given route at the same speed or in the same part of your lane, right? Lots of decisions in life are situational. And half-life cannot be?
We're talking, I hope, to problem-solve together.
I'm following your advice. I've said that a couple times. I recognize one acknowledgment of it. I also see concern that I haven't taken your advice.
Consistent with your suggestion, I've been operating with the 30:10:3 resource shares since Sat 2/22, 12 days. And without concurrency limits since Sat 2/29, 5 days.
As to schedulers: I think we're largely in agreement or discussing issues that don't really matter. We agree that software on the client sends information to the project server and the project server sends tasks. Maybe the result of the collaboration is improvable and we can figure out why.
Here's another thing I find interesting/troubling. You're right, https://boinc.berkeley.edu/wiki/Preferences says:
Usage limits
* Use at most N % of the CPUs: Keeps some CPUs free for other applications. Example: 75% means use 6 cores on an 8-core CPU.
* Use at most N % CPU time: Suspend/resume computing every few seconds to reduce CPU temperature and energy usage. Example: 75% means compute for 3 seconds, wait for 1 second, and repeat.
Reading that, I would expect that when I select "use at most 87.5% of CPUs" (7/8), my Resource Monitor would show one thread carrying no BOINC load. (Do you use Windows? See the eight graphs, one per thread, showing utilization?)
The text looks crystal clear.
That's not my experience.
This morning, I set for 12.5% of CPUs. All threads were equally busy. They weren't synchronized (same graph shape, as I've seen before). Total utilization: 25%. One BOINC task active.
Then, 25% (2 threads). All threads were equally busy. Total utilization: 50%. Two BOINC tasks active.
37.5% (3 threads). Equally busy. Total utilization: 75%. Three BOINC tasks active. [There are some patterns here!]
50% (4 threads). Equally busy. Total utilization: 95-100%. Four BOINC tasks active.
62.5% (5 threads). Equally busy. A full 100%. Five tasks.
The words of the documentation don't seem to describe that non-linear behavior!
Why did I get to the "45% of CPU time" setting? Because I was seeing reports of temperatures very close to 100 C and I wanted to get processing down immediately. 87.5% of CPUs had no effect (consistent with above). Same for 75%. Same for 62.5%. I tried the other control and temps came down. That lowered the urgency.
I've investigated more. It appears the 50% of CPUs and 100% of CPU time is a great choice, like you said. Four tasks are running; all threads are essentially fully occupied. Maybe that means those four tasks each have roughly twice the processing power they'd have if running on a single thread. Maybe that means the projects get results in about half the time (once a task starts).
Interestingly (maybe not surprisingly), the queue size is much lower. During my intense data collection, I saw queue sizes of 12 and 14 tasks. Since then, I'm seeing more like 6 or 8 tasks. (7 tasks as I write. Later 6, then 5. I've never seen anything like this before. None are SETI tasks, because of the weekly maintenance. And some shorter Einstein tasks are here, around 16 and 18 hours. Two tasks here are Einstein and three are Rosetta. They seem to swap over time who gets the most threads, exactly as we'd hope BOINC would react. When SETI tasks arrive---the most recent was a batch of six---they fly through the system and don't get replaced promptly. You're right. That's not a balancing problem; it's a flow of work problem.)
This is interesting. Maybe you can operate your computer at half its current queue size, applying almost the same processing power and completing tasks in about half the time.
Or, maybe the 8 threads of "almost fully busy" I see depicted are (consistent with the "50% of CPUs" setting) only doing half the work. Dunno. Yet. We have now talked through the tools to know, though!
Thanks for decoding the job log files for me. I searched for that info and never found it. Your explanation will do nicely. I suspected Unix epoch data was in there; I could decode it, but not know the meaning. Have both, now.
All the best. Next tasks: Decode everything in the job log files. Excel should handle it nicely.
First of all, carrying on a very detailed and far reaching discussion via a message board and with quite different time zones is difficult for all parties concerned. I have a big 'fatal flaw'. I never use one word when 100 will do :-). I tend to over-explain and thereby perhaps obfuscate the points I'm trying to make. I'm very persistent. As long as the conversation continues, I'll do my best to respond. If a new thought springs to mind, I tend to fire off an extra message or three. If responses take time, so be it. I don't stress over delayed replies. There are always pressures from real life outside message boards.
Garry wrote:
I still have trouble with the fact that when I had equal resource shares, I would typically witness Einstein using six of the eight threads, with the other projects getting one each. And Einstein tasks coming in batches of six 28-hour tasks (168 hours of processing, the same number of hours in a week). And those batches coming in rapid succession.
I describe that as my experience. I don't understand it. I know it conflicts with the common understanding (yours and mine, too) of how BOINC works. And, it's my reality. Do you believe me?
I certainly believe it's the reality you think you observe. It may also be the transient reality that BOINC was being forced into by the combination of a number of factors such as:-
BOINC will always fill up the work cache with a non-preferred project if the preferred project can't supply.
BOINC will keep doing that until a reliable supply is re-established for the preferred project.
BOINC will deliberately disregard resource shares if tasks are at deadline risk.
Deadline risk depends on estimated crunch time which could be a lot different to the true time.
BOINC isn't smart enough to keep retrying several times, just in case any lack of supply was only very temporary.
BOINC is compelled to fill the work cache if it's deficient, irrespective of resource share imbalance.
There may be bugs associated with setting maximum concurrent values that contributed to this behaviour.
Unless you spent a lot of time doing regular observations of exactly what was being processed and what projects were being asked for work (and exactly how much work was being asked for) it would be quite difficult to form a really accurate picture of the nature of any departure from expected behaviour.
There is another point here as well. The number of tasks 'waiting to run' and their time estimates can be a very deceptive indicator of the true state of affairs after the tasks are crunched and the real crunch times are recorded. You might think there were always too many tasks for one project waiting to run but the true guide would be the total time of those that were actually completed and returned.
Now that you have info on the content of job logs, you could go back in time to say 22 Feb (a date you mentioned) and count up the true crunch times for all 3 projects for tasks completed after that point. How those three totals compare with each other would be very interesting and might be rather different from what you thought was happening. BOINC does not necessarily have things in balance in the short term so I would certainly think it quite likely that Seti could be behind the others in its share but perhaps not as much as you imagine.
If you're interested, the seconds from the epoch for my time zone (UTC+10) for the start of 22 Feb is 1582293600 secs. You might have to adjust that for the difference between your time zone and mine. I imagine the times recorded in the job log files are local times.
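If you'd rather not adjust my number by hand, a couple of lines of Python will give the epoch value for the start of 22 Feb in your own time zone (again, just a convenience sketch, not something I've needed to do myself):

# Epoch seconds for local midnight at the start of 22 Feb 2020 (convenience sketch).
import datetime

local_start = datetime.datetime(2020, 2, 22)   # naive datetime = interpreted as local time
print(int(local_start.timestamp()))            # seconds since the epoch for your time zone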
Garry wrote:
We're talking, I hope, to problem-solve together.
Now that concurrency limits are gone, I don't believe there is any problem to solve. If you want to keep exploring REC half-life, I'm truly not the person to do that with. I have zero understanding or experience with it and I have no real way to contribute intelligently to how it may help or hinder the BOINC behaviour in relation to resource shares. I suspect (but don't know) that it was designed to help Seti get its designated share of resources. Now that Seti is closing its doors, I suspect it will be a lot easier for the more reliable projects to work properly without having to worry about adjusting half-life.
Garry wrote:
Here's another thing I find interesting/troubling:
Usage limits
* Use at most N % of the CPUs: Keeps some CPUs free for other applications. Example: 75% means use 6 cores on an 8-core CPU.
* Use at most N % CPU time: Suspend/resume computing every few seconds to reduce CPU temperature and energy usage. Example: 75% means compute for 3 seconds, wait for 1 second, and repeat.
Reading that, I would expect that when I select "use at most 87.5% of CPUs" (7/8), my Resource Manager would show one thread carrying no BOINC load. (Do you use Windows? See the eight graphs, one per thread, showing utilization?)
No, I don't use Windows. I've used Linux for the last 13 years.
If you think there should be one thread showing no BOINC load, you are not understanding how modern operating systems schedule the workload. The first thing is that 87.5% means BOINC is allowed to run 7 tasks on an 8 thread machine. That absolutely does not mean that there will be a thread carrying no BOINC load. BOINC simply says to the OS, "run these 7 jobs". It's entirely up to the OS how that is achieved. You also need to remember (because of HT) you only have 4 real cores. Two threads can occupy the same core and load will show for both threads with just one task running on that core.
At any one time there are a very large number of processes that the OS has to schedule. Many are very short term but potentially of higher priority than the compute tasks. Those tasks will be swapped around between threads at the convenience of the OS. In its lifetime, a long running crunch task will probably migrate between threads many times. On average you would expect to see all 8 threads having substantial utilisation and perhaps averaging out rather higher than 87.5%. Those utilisation numbers are never fully accurate so it wouldn't surprise me to see all threads showing 100% even if there were fewer than 8 tasks running.
You would expect some sharing across all threads to happen in much the same way, even with just 1 BOINC task running. The figure you quote of 25% for all threads isn't all that surprising. Nor is the 50% value when there were just 2 BOINC tasks. Both seem a little high but not surprising. Bear in mind that I have no knowledge or experience with how Windows arrives at these numbers.
Garry wrote:
I've investigated more. It appears the 50% of CPUs and 100% of CPU time is a great choice, like you said. Four tasks are running; all threads are essentially fully occupied. Maybe that means those four tasks each have roughly twice the processing power they'd have if running on a single thread. Maybe that means the projects get results in about half the time (once a task starts).
The crunch time will certainly come down quite a bit but not as low as one half. I suggested that particular setting because I felt that the OS should be smart enough not to put 2 BOINC tasks together on the one core but rather have a task per core with the other short lived higher priority OS type jobs able to access the 2nd 'half' of those 4 cores. You will know this is happening if all BOINC tasks seem to be equally reduced in their crunch times compared to when you were running all 8. Someone with Windows experience would be better able to give advice about that. It seems to work that way in Linux, although I don't get much experience since for several years now I've mainly run GPU tasks only.
Gary Roberts wrote:
My very vague understanding of half-life is that it can change the "rate of change" in how BOINC attempts to rectify a discrepancy. Too long a half-life means that a correction might be happening too slowly. Too short a half-life might cause the corrections to 'overshoot' and perhaps 'oscillate' rather than reach stability. You would have to think that the designer of the system would have put a lot of thought (and perhaps trial and error) into selecting a value that would be good for the way Seti behaves. This is just speculation - I have no evidence.
Your understanding isn't vague. It's spot on.
A question before us: Is the default setting causing "corrections to 'overshoot' and perhaps 'oscillate' rather than reach stability"?
I could nominate the repeated, inappropriate receipt of huge batches of tasks as the kind of improvable decision characteristic of a half-life that is too short.
On the other hand, with this new configuration of "use at most 4 CPUs", I'm seeing very different queue sizes, even within the first two days. I accept the likelihood that prior data is not a solid basis for assessing this configuration. I set this configuration yesterday. The first half-life might demonstrate early data of interest.
Perhaps for this configuration, the decisions and the default half-life will prove fine.
Once again, I wrote without knowing you'd sent a message. I thought I checked. If it happens this time, I don't know any technique to check!
Gary Roberts wrote:
I have a big 'fatal flaw'.
Well, thanks for that. You make it sound like self-criticism. Likely many observers of our conversation would send "a plague on both their houses". For a concept that needs 10 words, both 5 and 20 words are improvable. I aim to use 10. Others know more techniques for communication, no doubt.
Gary Roberts wrote:
I certainly believe it's the reality you think you observe. It may also be the transient reality ...
True as to transient. I am frequently using "maybe" with these assertions for the reasons you give (hopefully "always", not "frequently"; I'm human).
* The scheduler hasn't served me well (no maybe).
* I thought I could compensate with changes to resource values. (Abandoned. My approach wasn't good; I don't know whether this is true).
* Recently, my computer has received too few SETI tasks to balance time among the projects (no maybe). This hasn't always been true.
* The scheduler reacted to my abandoning concurrency limits (no maybe). (Probably, I should word this as a suspicion, recognizing the time between then and using "use at most 50% of CPUs" was short. I view the change as pronounced.)
* The scheduler reacted to my change to "use at most 50% of CPUs" (no maybe).
* Maybe, my computer is almost as busy now (at "use at most 50% of CPUs") as before. Suspicion; want numbers.
* Maybe, the 10-day half-life is too low for my environment. (Honestly, I'm confident of this based on past experience and understanding, and on the previous environment (using "at most 100% of CPUs"). And I want to prove it to myself with data.) This may not be true of "use at most 50% of CPUs".
* Maybe, undetected temperature events are in my past. Conjecture. Wishful thinking.
* It's less likely I'll have relevant temperature events in the future, due to using the "overheat protection" feature of the temperature monitoring software I installed. I have you to thank (no maybe).
* Casually watching the mix of tasks, say, four times a day for two months and observing that SETI tasks typically have one or two threads when they run while other projects often have six threads for long periods is adequate observation to know the scheduler isn't performing well (no maybe). Starting more careful note taking to collect real data is necessary to convince others (perfectly reasonable).
Gary Roberts wrote:
The number of tasks 'waiting to run' and their time estimates can be a very deceptive indicator of the true state of affairs
True. We haven't discussed this. Maybe, all projects tend to run longer (say, 120% of the early estimates; perception and not carefully measured). And, batches of six 8-hour SETI tasks affect my computer less than batches of six Einstein 28-hour tasks (no maybe, even if only on observation).
Gary Roberts wrote:
I suspect (but don't know) that it [half-life] was designed to help Seti get its designated share of resources.
Its theory is well established and not project-specific in this usage. If it was intended to favor SETI unfairly, it backfired in this case (I don't favor the "SETI favorite" theory; it's a valid technique for sharing among projects; as to the backfire: no maybe; observation is sufficient). The short task times for tasks SETI sends match well with a 10-day half-life (my experience with half-life). The long task times I was getting from others can overwhelm projects with short task durations (my experience with half-life).
Gary Roberts wrote:
If you're interested, the seconds from the epoch for my time zone (UTC+10) for the start of 22 Feb is 1582293600 secs.
Whoa! Australia? New Zealand? Indonesia? I love the Internet! Personal question. No need to answer. None of my business.
Come to think of it, I'd seen European-influenced spellings in your messages (e.g., "behaviour"). I used to get very distracted. No more. There's more than one "normal" (including other languages, of which I have none).
I'm in Nebraska, near the center of the US "lower 48" (the states other than Alaska and Hawai'i).
Gary Roberts wrote:
The first thing is that 87.5% means BOINC is allowed to run 7 tasks on an 8 thread machine. That absolutely does not mean that there will be a thread carrying no BOINC load.
(he he) There's a great deal of depth to modern operating systems. I know operating systems better than your suspicion gives me credit for. I also know what "Use at most N % of the CPUs: Keeps some CPUs free for other applications. Example: 75% means use 6 cores on an 8-core CPU." means (emphasis mine). It's just wrong. It doesn't describe current behavior. I thought the page was simply old, but it was last modified in 2019, so it doesn't describe the behavior then either. But yes, I should have recognized it right off. But it's a side point ...
What? A Linux guy calling Windows a "modern operating system"? Doesn't that violate the conventional wisdom of the Linux Enthusiast Club? Are you risking expulsion? :)
But honestly, both OSs have come a long way since the rivalry was stronger. Still, an overwhelming percentage of the developers I've worked with (not that I claim to be one) have wanted Linux for their development computer. Now, many are saying MacOS is the best platform for all three. But I digress ...
Gary Roberts wrote:
I've mainly run GPU tasks only.
I've run GPUs on my prior computers. They do much more work of this "embarrassingly parallel" type BOINC caters to. I won't contribute nearly so much as I did with more powerful GPUs. Sigh. Every little bit helps.
All the best, sir! (I refreshed the browser just before sending. It'll be a surprise if you've already sent.)
Thanks for mentioning. I hadn't heard. I crunched their data on their Classic program. Eventually migrated to BOINC and stuck with them. Eventually found other interesting projects to support. Here we are.
20 years is a long time, measured by age of tech companies, for example. They've done well to excite the public and share their participants with others.
They're doing it again with a new project, "Science United". I need to read more about it before I migrate at all from BOINC.
Gary Roberts wrote:
Take a look in your BOINC data directory for 3 "job log" files. The Einstein one will be called 'job_log_einstein.phys.uwm.edu.txt'. The other two will have their respective project URLs in the file name. Here are the last five lines in one of my files, representing the last 5 tasks that were completed by this particular host of mine. I have reduced the font size a little so the lines don't overflow. Each line contains space separated data about each completed task that the host returns.
1583076315 ue 42181.057911 ct 16674.990000 fe 105000000000000 nm LATeah1005F_584.0_74760_0.0_0 et 19380.404697 es 0
1583087341 ue 42181.057911 ct 16659.310000 fe 105000000000000 nm LATeah1005F_664.0_112140_0.0_0 et 19270.846723 es 0
1583095611 ue 42181.057911 ct 16620.390000 fe 105000000000000 nm LATeah1005F_664.0_292440_0.0_0 et 19295.118612 es 0
1583107102 ue 42181.057911 ct 16683.390000 fe 105000000000000 nm LATeah1005F_680.0_118920_0.0_0 et 19759.751162 es 0
1583115714 ue 42181.057911 ct 16764.320000 fe 105000000000000 nm LATeah1005F_680.0_302580_0.0_0 et 20102.229174 es 0
In each line, the first value is the time when the task was returned, expressed in seconds since the 'epoch'. The epoch is midnight on Jan 01, 1970. After that value there are pairs of values - a code followed by a value. Here are the meanings of codes, with the data values as shown in the first line.
ue - the estimated crunch time (eg 42181.057911 secs) - not important for resource share calcs.
ct - the actual CPU time to crunch the task (eg 16674.99 secs) - very important as this is what is used.
fe - the 'flops estimate' (105000000000000) - a measure of the supposed work content.
nm - the name of the actual task that was returned - no influence on anything.
et - the elapsed time taken by the task (eg 19380.404697) - longer than ct if host does other things
This is great detail. Any chance you point to a URL where I could read related information on the web? I haven't found it yet.
Thanks, both! Gary:
)
Thanks, both!
Gary: <snarky>What?! I'm not the most important single factor in your world! Huff!</snarky> :) Good call!
To eliminate ambiguity (sigh!), I seek to equally balance time.
Mike: "A fiat currency no less." Here, here! I seek to be entirely ambivalent to work credits. It's about helping science. Only that. I do it with otherwise-wasted resources.
Mike: Your details of Einstein scheduling are logical, very interesting, and untroubling. My only resulting wish is that I would receive from Einstein either fewer or smaller tasks from within the block of data I have on my hard disk. Even that isn't necessary, as explained below.
Gary: Thanks again, but more deeply now, for suggesting I monitor CPU temperatures. I was heartened that the first couple days of monitoring, the temp seldom went above 80 C, never for long, and was often below 70 C. This morning, it woke up cranky. Core #0 was tending to often exceed 90 C, with other cores only a bit lower. I've dialed SETI back to "Use at most 75% of CPUs" and "Use at most 45% of CPU time", which gives peak and infrequent temps on Core #0 of 80 C (seems acceptable) and lower temps on other cores. BOINC runs 6 tasks (consistent with 75% of 8). The settings also seem to cause highly variable CPU utilization (peaks and valleys, 10% on one one-second reading to 100% on the next) spread equally among all 8 threads (judging by the Windows Resource Monitor; even the shape of the utilization graphs is identical among the eight threads).
If I've had these events before, I've been unaware. Thanks!
If this keeps up, I'll go inside the case to see whether there's a ball of dust somewhere it doesn't belong. This is different behavior than the first days.
I tried TThrottle, advertised to throttle BOINC in response to core temp. It did a calibration run before starting, during which temperatures were steadily above 95 C and sometimes at 100 C for far longer than I was comfortable. I stopped it and uninstalled. The above BOINC controls (though admittedly more crude) are the best I know to do.
Gary: I stopped using concurrency limits because of your supposition that they were contributing to the unexpected task arrivals I was experiencing. I have evidence consistent with your advice. You did me service to ask me to consider it.
Since I started this experiment, I recall only three receipts of Einstein tasks. My notes indicate the first two were Feb 25 and 28 (far too close together). Both were after I started using concurrency limits. Each consisted of 6 tasks each needing around 28 hours. I remember them as consistent with receipts previous to the experiment, but I can't put as much value on those recollections as I can on results while I'm taking notes. During this writing, the third receipt came: 2 (not 6) Einstein tasks (though still of around 28 hours). That's good news. I'll be more reluctant to further consider using concurrency limits. I'll leave them unused for the foreseeable future.
Maybe it resulted from not using concurrency limits.
This occurred after I changed computing preferences due to temperatures; maybe that contributed.
A request for both: Please help me decide whether I've recognized a flaw in the scheduler's handling of my situation before my experiment.
Let's consider any period when a project sends blocks of six tasks, each with 28-hour lengths and three-week deadlines.
Let's assume:
I have eight threads available at 100% capacity.
For the average amounts of time my computer is on, completing those tasks on time requires over half the time I contribute to BOINC during those three weeks.
That's okay with me if my computer goes a period without Einstein tasks after that.
My computer processes that batch of six tasks at the same daily rate throughout the three weeks.
I'll say the 10-day half-life default on scheduler data is too low for the situation. (I'm glad I saw in the configuration files that I can decide on a different value for the half-life (<rec_half_life_days>), but that's another issue and I know both of you respect defaults a great deal. Just saying.)
A characteristic of a half-life is that half of "it" is gone in the half-life. (Examples: radioactive decay and drug persistence after medication.) In this case, we're talking about the effect of past scheduling decisions. (I don't have detailed knowledge and cohesive understanding of the scheduler data, but thinking about this "effect" seems reasonable.)
I suspect the point of the scheduling data is to "remember" decisions like these two: (1) doing too much work for a project, and (2) not doing enough work for a project. In both cases, the scheduler does well if it compensates with future decisions. It cannot do so if it "forgets".
I suspect the design purpose for using the half-life was to "forget" decisions like those at an appropriate rate. Sadly that rate isn't constant for all environments (a reasonable motivation for making it a configuration item).
I suspect that people studied some environment in the past and concluded a 10-day half-life worked well.
With any half-life, a very recent scheduling decision has a powerful effect (call it 100%); we don't want the scheduler to accept/request similar sets of data one immediately after the other. One half-life later, half the effect on current scheduling decisions has decayed (call it 50%). And one more half-life later, half that is gone (25%). It's clear that 100% power is enough to deter accepting/requesting tasks immediately after getting tasks. I don't consider it clear that 25% power is enough.
To the contrary, I can report from my experience that the scheduler frequently accepted/requested more tasks very soon after completing the last task in a batch.
We can't expect the scheduler to make good decisions if the data decays too quickly. Evidence is that it is decaying too quickly for my good.
What's a better half-life for this environment? One candidate answer:
When a big batch of tasks arrives, the sending project needs (for simplification) half of the computer time (1.5 weeks) for the first three weeks. At the same time, the other two get 0.75 weeks. (Total: 3 weeks.)
If no other tasks for that project arrive for another 1.5 weeks, the other two projects each get 0.75 weeks during that time. At that point, all projects have 1.5 weeks of time.
Arguably, the best data on which to make a decision is data collected for the same amount of run time on each project. 4.5 weeks is the earliest point that happens.
Maybe, that's also the best half-life.
After two half-lives, the first results would be decayed to 25% power and the second to 50%. Presuming their results are consistent with each other, the effect has 75% power (the reason you folks so sagely counsel waiting for the scheduler data to catch up). Successive half-lives continue to improve results. If there's a change in the system, it'll work its way through the data eventually. If there's volatility in results, the data will muddle through, encouraging decisions based on middle positions within experiences.
I like that recipe. Maybe that's worth testing.
Maybe, I experienced exactly the behavior described in this thought piece.
I reacted by tweaking resource values. While I tend to think the evidence suggests those tweaks produced better results, I don't think results are solid for lack of data. Perhaps the temperature limits will keep me from collecting more in that environment.
At the same time, I consider it more important to gain data on longer half-lives than to collect more data in the 10-day half-life environment.
Protect me from me, if you will. Where's my thinking improvable?
As always, please forgive for remaining typos.
Garry,If you are serious
)
Garry,
If you are serious about knowing if BOINC is honouring your resource shares or not and, more particularly, knowing the extent of any non-compliance with your resource share settings, here is a suggestion for you.
Take a look in your BOINC data directory for 3 "job log" files. The Einstein one will be called 'job_log_einstein.phys.uwm.edu.txt. The other two will have their respective project URLs in the file name. Here are the last five lines in one of my files, representing the last 5 tasks that were completed by this particular host of mine. I have reduced the font size a little so the lines don't overflow. Each line contains space separated data about each completed task that the host returns.
1583076315 ue 42181.057911 ct 16674.990000 fe 105000000000000 nm LATeah1005F_584.0_74760_0.0_0 et 19380.404697 es 0 1583087341 ue 42181.057911 ct 16659.310000 fe 105000000000000 nm LATeah1005F_664.0_112140_0.0_0 et 19270.846723 es 0 1583095611 ue 42181.057911 ct 16620.390000 fe 105000000000000 nm LATeah1005F_664.0_292440_0.0_0 et 19295.118612 es 0 1583107102 ue 42181.057911 ct 16683.390000 fe 105000000000000 nm LATeah1005F_680.0_118920_0.0_0 et 19759.751162 es 0 1583115714 ue 42181.057911 ct 16764.320000 fe 105000000000000 nm LATeah1005F_680.0_302580_0.0_0 et 20102.229174 es 0
In each line, the first value is the time when the task was returned, expressed in seconds since the 'epoch'. The epoch is midnight on Jan 01, 1970. After that value there are pairs of values - a code followed by a value. Here are the meanings of codes, with the data values as shown in the first line.
ue - the estimated crunch time (eg 42181.057911 secs) - not important for resource share calcs.
ct - the actual CPU time to crunch the task (eg 16674.99 secs) - very important as this is what is used.
fe - the 'flops estimate' (105000000000000) - a measure of the supposed work content.
nm - the name of the actual task that was returned - no influence on anything.
et - the elapsed time taken by the task (eg 19380.404697) - longer than ct if host does other things.
Here's what I suggest you do. At some particular nominated time, write down the details of the very last line showing in each of your 3 files. Wait a decent amount of time (I would suggest a full week) without making any changes to settings. Just let BOINC do its thing. At the end, add up all the 'ct' values for the new lines that have appeared in each of the 3 job log files. You will then know exactly how much CPU time BOINC has allocated to each separate project and you will be able to see how well BOINC is following your resource shares.
Irrespective of how good or bad you think the agreement is, don't touch anything but simply wait another complete period to see what happens. I would be very surprised if BOINC doesn't have things pretty much under control by the end of two full weeks. I have never done this myself but I have confidence that BOINC, left alone, will be able to get pretty close - unless something drastic happens - e.g. Seti out of work for a large slice of the total time.
You need to forget about how many downloaded tasks you see at any single point in time or how many may be running at a particular instant. It's only the completed results that will tell the true story.
Cheers,
Gary.
Garry wrote:... This morning,
)
Garry,
Just a couple of quick points in relation to your recent experience with elevated temperatures. I'm not surprised by your findings of higher temperatures but I am concerned by one of the things you have changed to combat the problem. By all means reduce the threads BOINC is allowed to use. I would suggest 50% until you are sure of how hot things get.
The big no-no in my opinion is to use the setting for % of CPU time. Please realise that using 45% means that the CPUs will run at 100% load for 45% of the time and 0% load for 55% of the time. The change from 0% load to 100% load will occur quite frequently, roughly every second. This will induce thermal cycling in a CPU core which may induce expansion and contraction stresses on delicate structures within a core. You are much better off with more steady temperatures from a reduced number of active cores that aren't thermal cycling like this.
If you can turn off HT so that you are using just 4 'real' cores rather than 8 shared 'threads' you can still use 100% of the cores and you may find that you get a steady, acceptable temperature with each of the 4 running tasks actually running faster than previously so that your work output is not as heavily reduced. If things are still too hot, you could then start dialing down the % of cores to 75% or even 50% if necessary.
I think you may find that using 4 full 'cores' might work out OK. It's at least worth a try.
Cheers,
Gary.
Garry wrote:I suspect that
)
Like Mike, I only run Einstein these days. I ran Seti from 1999 in the classic days until the advent of BOINC in late 2004. Einstein opened its doors in Feb 2005 and that's when I joined. During that year, I ran Seti, Einstein and LHC and thought deeply about the scientific goals of each project and took a lot of notice of how each project was being managed, both in terms of the people involved and the resources they had access to. I decided that Einstein was most likely to offer long term reliability and the opportunity to participate in extremely important discoveries.
I mention all this simply to explain why I have no knowledge or experience of REC or of the half-life mechanism which came a lot later and were probably designed around Seti's requirements at that time. I had long since stopped running Seti at that point.
However I will offer this further comment, which is little more than speculation. One reason why BOINC was developed was really to support Seti in an expanding environment of increasing numbers of volunteers at a time when the classic model was falling apart at the seams. Other projects could start and use the same framework and some of the excess load that Seti couldn't handle could be deployed elsewhere and give Seti some breathing space. The BOINC developers were associated with Seti and would have had Seti's longer term goals in mind while spreading the load to new projects.
As a result, the design of BOINC and the changes over time would have had "Seti friendliness" in mind. That's not any sort of criticism - it's very much what you would expect and I have no issue with that. So the bottom line is that since Seti's long term reliability problems have been around for a very long time, it's natural to expect that defaults for any of the mechanisms that have been put in place are likely to be Seti friendly - unless the developers are really incompetent - which I'm sure they're not :-).
So, my strong feeling is not to fiddle with defaults like half-life unless you are really sure that any counter-advice comes from somebody really reliable who absolutely knows better. I would put someone like Richard Haselgrove in that category and if you went to the BOINC boards and asked about REC and half-life there, you would get top advice from him.
My very vague understanding of half-life is that it can change the "rate of change" in how BOINC attempts to rectify a discrepancy. Too long a half-life means that a correction might be happening too slowly. Too short a half-life might cause the corrections to 'overshoot' and perhaps 'oscillate' rather than reach stability. You would have to think that the designer of the system would have put a lot of thought (and perhaps trial and error) into selecting a value that would be good for the way Seti behaves. This is just speculation - I have no evidence.
Cheers,
Gary.
Gary: Beggin' pardon. In the
)
Gary: Beggin' pardon. In the stream below, there is a preceding message of mine, then three of yours. (Thanks tons for caring so much! Appreciated.) I kept getting interrupted in writing mine, so I wasn't responding to your second and third notes though you sent them before I sent.
You'll note you closed your note with, "I started this reply yesterday. Other things intervened and I wasn't able to finish it then." And I opened with my "<snarky></snarky>" comment (intended tongue-in-cheek).
I'm responding here for the first time to your second and third of those messages.
me: https://boinc.berkeley.edu/trac/wiki/ClientSched "says the scheduler tries to balance work credit."
Sigh. How right you are. I've read more carefully, both that and https://boinc.berkeley.edu/wiki/How_BOINC_works. That information isn't there either.
That said, my theory doesn't support my "assign resources as inverse of credit granted" scheme. You say, "You need to ditch all of that thinking." I agree.
I still have trouble with the fact that when I had equal resource shares, I would typically witness Einstein using six of the eight threads, with the other projects getting one each. And Einstein tasks coming in batches of six 28-hour tasks (168 hours of processing, the same number of hours in a week). And those batches coming in rapid succession.
I describe that as my experience. I don't understand it. I know it conflicts with the common understanding (yours and mine, too) of how BOINC works. And, it's my reality. Do you believe me?
Guaranteed: With the imbalanced resource limits, the time balance was qualitatively better. I didn't operate long enough to have the numbers. They'd bear out, but I don't know how much better. Only, now, I don't know why. Maybe I'll get back to it. The half-life looks more important.
You didn't believe me on my first theory (and you were right; I was off). On the theory about half-life, though? The argument seems compelling that the scheduler doesn't have enough information to know it's time for the next Einstein batch until half-life is far higher than the 10-day one-size-fits-all default so many of us are using.
You don't always drive with your bright lights on, right? Or your windshield wipers? You don't always drive a given route at the same speed or in the same part of your lane, right? Lots of decisions in life are situational. And half-life cannot be?
We're talking, I hope, to problem-solve together.
I'm following your advice. I've said that a couple times. I recognize one acknowledgment of it. I also see concern that I haven't taken your advice.
Consistent with your suggestion, I've been operating with the 30:10:3 resource shares since Sat 2/22, 12 days. And without concurrency limits since Sat 2/29, 5 days.
As to schedulers: I think we're largely in agreement, or discussing issues that don't really matter. We agree that software on the client sends information to the project server and the project server sends tasks. Maybe the result of the collaboration is improvable, and we can figure out why.
Here's another thing I find interesting/troubling:
You're right, https://boinc.berkeley.edu/wiki/Preferences says, "Use at most N % of the CPUs: Keeps some CPUs free for other applications. Example: 75% means use 6 cores on an 8-core CPU."
Reading that, I would expect that when I select "use at most 87.5% of CPUs" (7/8), my Resource Monitor would show one thread carrying no BOINC load. (Do you use Windows? See the eight graphs, one per thread, showing utilization?)
The text looks crystal clear.
That's not my experience.
This morning, I set for 12.5% of CPUs. All threads were equally busy. They weren't synchronized (not the identical graph shapes I'd seen before). Total utilization: 25%. One BOINC task active.
Then, 25% (2 threads). All threads were equally busy. Total utilization: 50%. Two BOINC tasks active.
37.5% (3 threads). Equally busy. Total utilization: 75%. Three BOINC tasks active. [There are some patterns here!]
50% (4 threads). Equally busy. Total utilization: 95-100%. Four BOINC tasks active.
62.5% (5 threads). Equally busy. A full 100%. Five tasks.
The words of the documentation don't seem to describe that non-linear behavior!
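To check my own arithmetic, here's a crude model (pure guesswork on my part, assuming a 4-core/8-thread CPU where each running task shows up as roughly two threads' worth of load until the four physical cores are saturated). It reproduces the pattern above:

# Crude model (my guesswork, not BOINC code) of the pattern I observed.
LOGICAL_THREADS = 8
PHYSICAL_CORES = 4

def predict(cpu_pct):
    tasks = int(LOGICAL_THREADS * cpu_pct / 100)    # tasks BOINC runs at this setting
    utilization = min(tasks / PHYSICAL_CORES, 1.0)  # fraction of total CPU capacity
    return tasks, round(utilization * 100)

for pct in (12.5, 25, 37.5, 50, 62.5):
    tasks, util = predict(pct)
    print(f"{pct:5.1f}% of CPUs -> {tasks} tasks, ~{util}% total utilization")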
Why did I get to the "45% of CPU time" setting? Because I was seeing reports of temperatures very close to 100 C and I wanted to get processing down immediately. 87.5% of CPUs had no effect (consistent with above). Same for 75%. Same for 62.5%. I tried the other control and temps came down. That lowered the urgency.
I've investigated more. It appears that 50% of CPUs with 100% of CPU time is a great choice, as you said. Four tasks are running; all threads are essentially fully occupied. Maybe that means those four tasks each get roughly twice the processing power they'd have if each ran on a single thread (as when eight run at once). Maybe that means the projects get results in about half the time (once a task starts).
Interestingly (maybe not surprisingly), the queue size is much lower. During my intense data collection, I saw queue sizes of 12 and 14 tasks. Since then, I'm seeing more like 6 or 8 tasks. (7 tasks as I write. Later 6, then 5. I've never seen anything like this before. None are SETI tasks, because of the weekly maintenance. And some shorter Einstein tasks are here, around 16 and 18 hours. Two tasks here are Einstein and three are Rosetta. They seem to swap over time who gets the most threads, exactly as we'd hope BOINC would react. When SETI tasks arrive---the most recent was a batch of six---they fly through the system and don't get replaced promptly. You're right. That's not a balancing problem; it's a flow of work problem.)
This is interesting. Maybe you can operate your computer at half its current queue size while applying almost the same processing power and completing each task in half the time.
Or, maybe the 8 threads of "almost fully busy" I see depicted are (consistent with the "50% of CPUs" setting) only doing half the work. Dunno. Yet. We have now talked through the tools to know, though!
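Here's the back-of-the-envelope version, assuming hyper-threading adds something like 25% total throughput (a rule of thumb I've read, not a measurement of this machine):

# Guesswork arithmetic, not a measurement.
HT_GAIN = 1.25                  # assumed total throughput gain from hyper-threading

per_task_8 = 4 * HT_GAIN / 8    # core-equivalents per task with 8 tasks running
per_task_4 = 4 / 4              # core-equivalents per task with 4 tasks running

print(f"per-task speed with 8 tasks: {per_task_8:.2f} core-equivalents")
print(f"per-task speed with 4 tasks: {per_task_4:.2f} core-equivalents")
print(f"per-task speed-up going 8 -> 4: {per_task_4 / per_task_8:.2f}x")
print(f"total throughput at 4 tasks vs 8: {(4 * per_task_4) / (8 * per_task_8):.2f}")

If that guess is in the ballpark, each task finishes about 1.6x faster, and total throughput drops only to about 80%, not 50%.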
Thanks for decoding the job log files for me. I searched for that info and never found it. Your explanation will do fine. I suspected Unix epoch data was in there; I could decode the timestamps but didn't know what the other fields meant. Now I have both.
All the best. Next tasks: Decode everything in the job log files. Excel should handle it nicely.
Thanks kindly. News when I have it.
First of all, carrying on a very detailed and far reaching discussion via a message board and with quite different time zones is difficult for all parties concerned. I have a big 'fatal flaw': I never use one word when 100 will do :-). I tend to over-explain and thereby perhaps obfuscate the points I'm trying to make. I'm very persistent. As long as the conversation continues, I'll do my best to respond. If a new thought springs to mind, I tend to fire off an extra message or three. If responses take time, so be it. I don't stress over delayed replies. There are always pressures from real life outside message boards.
I certainly believe it's the reality you think you observe. It may also be the transient reality that BOINC was being forced into by the combination of a number of factors such as:-
Unless you spent a lot of time doing regular observations of exactly what was being processed and what projects were being asked for work (and exactly how much work was being asked for) it would be quite difficult to form a really accurate picture of the nature of any departure from expected behaviour.
There is another point here as well. The number of tasks 'waiting to run' and their time estimates can be a very deceptive indicator of the true state of affairs after the tasks are crunched and the real crunch times are recorded. You might think there were always too many tasks for one project waiting to run but the true guide would be the total time of those that were actually completed and returned.
Now that you have info on the content of job logs, you could go back in time to say 22 Feb (a date you mentioned) and count up the true crunch times for all 3 projects for tasks completed after that point. How those three totals compare with each other would be very interesting and might be rather different from what you thought was happening. BOINC does not necessarily have things in balance in the short term so I would certainly think it quite likely that Seti could be behind the others in its share but perhaps not as much as you imagine.
If you're interested, the seconds from the epoch for my time zone (UTC+10) for the start of 22 Feb is 1582293600 secs. You might have to adjust that for the difference between your time zone and mine. I imagine the times recorded in the job log files are local times.
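If it helps, here's the sort of quick-and-dirty script I'd use for that tally. It assumes the job_log layout I described earlier (a leading epoch-seconds timestamp, with "ct" followed by the CPU seconds); adjust the cutoff for your time zone and the field handling if your files differ.

# Rough sketch: total CPU-hours per project since 22 Feb, from BOINC job_log files.
# Assumes each line starts with an epoch-seconds timestamp and has "ct <cpu_seconds>".
import glob

CUTOFF = 1582293600   # start of 22 Feb in UTC+10; shift this for your own time zone

for path in glob.glob("job_log_*.txt"):
    cpu_hours = 0.0
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue
            if float(fields[0]) < CUTOFF:
                continue
            if "ct" in fields:
                cpu_hours += float(fields[fields.index("ct") + 1]) / 3600
    print(f"{path}: {cpu_hours:.1f} CPU-hours since 22 Feb")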
Now that concurrency limits are gone, I don't believe there is any problem to solve. If you want to keep exploring REC half-life, I'm truly not the person to do that with. I have zero understanding or experience with it and I have no real way to contribute intelligently to how it may help or hinder the BOINC behaviour in relation to resource shares. I suspect (but don't know) that it was designed to help Seti get its designated share of resources. Now that Seti is closing its doors, I suspect it will be a lot easier for the more reliable projects to work properly without having to worry about adjusting half-life.
No, I don't use Windows. I've used Linux for the last 13 years.
If you think there should be one thread showing no BOINC load, you are not understanding how modern operating systems schedule the workload. The first thing is that 87.5% means BOINC is allowed to run 7 tasks on an 8 thread machine. That absolutely does not mean that there will be a thread carrying no BOINC load. BOINC simply says to the OS, "run these 7 jobs". It's entirely up to the OS how that is achieved. You also need to remember (because of HT) you only have 4 real cores. Two threads can occupy the same core and load will show for both threads with just one task running on that core.
At any one time there are a very large number of processes that the OS has to schedule. Many are very short-lived but potentially of higher priority than the compute tasks. Those tasks will be swapped around between threads at the convenience of the OS. In its lifetime, a long-running crunch task will probably migrate between threads many times. On average you would expect to see all 8 threads having substantial utilisation, perhaps averaging out rather higher than 87.5%. Those utilisation numbers are never fully accurate, so it wouldn't surprise me to see all threads showing 100% even if there were fewer than 8 tasks running.
You would expect some sharing across all threads to happen in much the same way, even with just 1 BOINC task running. The figure you quote of 25% for all threads isn't all that surprising. Nor is the 50% value when there were just 2 BOINC tasks. Both seem a little high but not surprising. Bear in mind that I have no knowledge or experience with how Windows arrives at these numbers.
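If you ever want to compare notes between your Resource Monitor and what I'd see here, a tiny cross-platform way to sample per-thread utilisation is the third-party psutil package for Python (just a sketch, nothing BOINC-specific, and assuming psutil is installed):

# Print per-logical-CPU utilisation once per second for five seconds.
import psutil

for _ in range(5):
    per_cpu = psutil.cpu_percent(interval=1, percpu=True)
    print("  ".join(f"cpu{i}: {pct:5.1f}%" for i, pct in enumerate(per_cpu)))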
The crunch time will certainly come down quite a bit but not as low as one half. I suggested that particular setting because I felt that the OS should be smart enough not to put 2 BOINC tasks together on the one core but rather have a task per core with the other short lived higher priority OS type jobs able to access the 2nd 'half' of those 4 cores. You will know this is happening if all BOINC tasks seem to be equally reduced in their crunch times compared to when you were running all 8. Someone with Windows experience would be better able to give advice about that. It seems to work that way in Linux, although I don't get much experience since for several years now I've mainly run GPU tasks only.
Cheers,
Gary.
Gary Roberts wrote: My very
Your understanding isn't vague. It's spot on.
A question before us: Is the default setting causing "corrections to 'overshoot' and perhaps 'oscillate' rather than reach stability"?
I'd nominate the repeated, ill-timed receipt of huge batches of tasks as exactly the kind of decision you'd expect when the half-life is too short.
On the other hand, with this new configuration of "use at most 4 CPUs", I'm seeing very different queue sizes, even within the first two days. I accept the likelihood that prior data is not a solid basis for assessing this configuration. I set this configuration yesterday. The first half-life might demonstrate early data of interest.
Perhaps for this configuration, the decisions and the default half-life will prove fine.
Once again, I wrote without knowing you'd sent a message. I thought I checked. If it happens this time, I don't know any technique to check!
Well, thanks for that. You make it sound like self-criticism. Likely many observers of our conversation would send "a plague on both their houses". For a concept that needs 10 words, both 5 and 20 words are improvable. I aim to use 10. Others know more techniques for communication, no doubt.
True as to transient. I am frequently using "maybe" with these assertions for your reasons (hopefully "always", not "frequently"; I'm human).
* The scheduler hasn't served me well (no maybe).
* I thought I could compensate with changes to resource values. (Abandoned. My approach wasn't good; I don't know whether this is true).
* Recently, my computer has received too few SETI tasks to balance time among the projects (no maybe). This hasn't always been true.
* The scheduler reacted to my abandoning concurrency limits (no maybe). (Probably, I should word this as a suspicion, recognizing the time between then and using "use at most 50% of CPUs" was short. I view the change as pronounced.)
* The scheduler reacted to my change to "use at most 50% of CPUs" (no maybe).
* Maybe, my computer is almost as busy now (at "use at most 50% of CPUs") as before. Suspicion; want numbers.
* Maybe, the 10-day half-life is too low for my environment. (Honestly, I'm confident of this based on past experience and understanding, and on the previous environment, using "at most 100% of CPUs". And I want to prove it to myself with data.) This may not be true of "use at most 50% of CPUs".
* Maybe, undetected temperature events are in my past. Conjecture. Wishful thinking.
* It's less likely I'll have relevant temperature events in the future, due to using the "overheat protection" feature of the temperature monitoring software I installed. I have you to thank (no maybe).
* Casually watching the mix of tasks, say, four times a day for two months, and observing that SETI tasks typically have one or two threads when they run while other projects often hold six threads for long periods, is adequate observation to know the scheduler isn't performing well (no maybe). Starting more careful note-taking to collect real data is necessary to convince others (perfectly reasonable).
True. We haven't discussed this. Maybe all projects' tasks tend to run longer than estimated (say, 120% of the early estimates; perception, not carefully measured). And batches of six 8-hour SETI tasks affect my computer less than batches of six 28-hour Einstein tasks (no maybe, even if only by observation).
Half-life's theory is well established and not project-specific in this usage. If it was intended to favor SETI unfairly, it backfired in this case (I don't favor the "SETI favorite" theory; it's a valid technique for sharing among projects; as to the backfire: no maybe; observation is sufficient). The short tasks SETI sends match well with a 10-day half-life (my experience). The long tasks I was getting from the other projects can overwhelm a project with short task durations (again, my experience).
Nice resource: https://www.epochconverter.com/
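For anyone following along in Python instead, the same conversion is a couple of lines with only the standard library (using Gary's figure as the example):

# Show Gary's epoch figure as UTC and as this machine's local time.
from datetime import datetime, timezone

t = 1582293600   # Gary's figure for the start of 22 Feb in UTC+10
print(datetime.fromtimestamp(t, tz=timezone.utc))   # 2020-02-21 14:00:00+00:00
print(datetime.fromtimestamp(t))                     # same instant, local clock time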
Whoa! Australia? New Zealand? Indonesia? I love the Internet! Personal question. No need to answer. None of my business.
Come to think of it, I'd seen European-influenced spellings in your messages (e.g., "behaviour"). I used to be very distracted by that. No more. There's more than one "normal" (including other languages, of which I have none).
I'm in Nebraska, near the center of the US "lower 48" (the states other than Alaska and Hawai'i).
(he he) There's a great deal of depth to modern operating systems. I know enough about operating systems that your explanation rings true. I also know what "Use at most N % of the CPUs: Keeps some CPUs free for other applications. Example: 75% means use 6 cores on an 8-core CPU." means (emphasis mine). It's just wrong. It doesn't describe things now. I thought the page was older, but it was last modified in 2019; it didn't describe things then, either. But yes, I should have recognized it right off. But it's a side point ...
What? A Linux guy calling Windows a "modern operating system"? Doesn't that violate the conventional wisdom of the Linux Enthusiast Club? Are you risking expulsion? :)
But honestly, both OSs have come a long way since the rivalry was stronger. Still, an overwhelming percentage of the developers I've worked with (not that I claim to be one) have wanted Linux for their development computer. Now, many are saying MacOS is the best platform for all three. But I digress ...
I've run GPUs on my prior computers. They do much more work of this "embarrassingly parallel" type BOINC caters to. I won't contribute nearly so much as I did with more powerful GPUs. Sigh. Every little bit helps.
All the best, sir! (I refreshed the browser just before sending. It'll be a surprise if you've already sent.)
Gary Roberts wrote: Now that
Thanks for mentioning. I hadn't heard. I crunched their data on their Classic program. Eventually migrated to BOINC and stuck with them. Eventually found other interesting projects to support. Here we are.
20 years is a long time, measured by age of tech companies, for example. They've done well to excite the public and share their participants with others.
They're doing it again with a new project, "Science United". I need to read more about it before I migrate at all from BOINC.
Gary Roberts wrote: Take
This is great detail. Any chance you could point me to a URL where I can read related information on the web? I haven't found it yet.
Thanks tons!