Quote:
Thanks for the additional details. OK, you're not proposing to run until finished, but now it sounds like what you want is what we already have. Maybe you're seeing tasks with 2 hours of CPU time being run while tasks with 4 hours are waiting, because the task with 4 hours was running on 1 core when the other core(s) switched?
I don't have a problem with the scenario you portray here; that's just the way the ball bounces, depending on when BOINC switches tasks. So then, picking up on your scenario, why does BOINC restart the 2-hour WUs and not the 4-hour WU when it resumes E@h computations? Seems odd, doesn't it?
My problem is what I've seen many times, which is slightly different. Suppose E@h is running on one core with five hours of CPU time completed. No other E@h WUs have started. When BOINC starts running E@h again, sure enough, it starts another E@h WU. So now I have one with five hours completed and "waiting to run" while another is started, and it runs for perhaps three hours before switching tasks to another project. Do you see my point now? BOINC should have continued the five-hour WU, not started another WU. With four cores, sometimes it runs the second work unit with three hours completed rather than the first five-hour WU when E@h restarts. Or it might even start a third WU and run it for two hours. So now I have three WUs started and none completed. It all works out statistically in the end, of course, but prioritizing the WU with the most CPU time completed would seem to keep the cache a bit cleaner.
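To make concrete what I'm asking for, here is a rough sketch of the "resume the most-progressed waiting task first" rule, written in Python with made-up task records. It is not BOINC's actual scheduler code, just an illustration of the policy:

# Illustrative sketch only: NOT BOINC's real scheduler, just the
# "resume the most-progressed preempted task before starting a new one"
# rule, with invented task records.

def pick_tasks_to_run(tasks, free_cores):
    """Choose up to free_cores tasks of one project at a switch point."""
    # Preempted tasks ("waiting to run") first, most CPU time completed first.
    preempted = sorted((t for t in tasks if t["state"] == "waiting to run"),
                       key=lambda t: t["cpu_hours"], reverse=True)
    # Never-started tasks ("ready to start") only if cores are still free.
    unstarted = [t for t in tasks if t["state"] == "ready to start"]
    return (preempted + unstarted)[:free_cores]

if __name__ == "__main__":
    einstein = [
        {"name": "E@h 1", "state": "waiting to run", "cpu_hours": 5.0},
        {"name": "E@h 2", "state": "ready to start", "cpu_hours": 0.0},
        {"name": "E@h 3", "state": "ready to start", "cpu_hours": 0.0},
    ]
    # One free core: the five-hour task is resumed instead of a fresh WU.
    print([t["name"] for t in pick_tasks_to_run(einstein, 1)])  # ['E@h 1']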
Indeed, I was just thinking that there is a scenario that could very well play out, get BOINC into trouble, and miss a deadline. Suppose an Orbit task that has run for, say, 560 hours is near completion. A second Orbit task also has, say, 87 hours of CPU time. So now I have two Orbit tasks that are incomplete. The task with 560 hours of CPU time completed now has two days left, turns into a "high priority" task, and continues to run, unaware that it will not actually complete in the 48 hours the client thinks it needs. Instead, the 560-hour task will complete in 59 hours and miss the deadline, because Orbit tasks are notorious for getting the time to completion wrong. So here is a very realistic scenario I had not considered before where, because BOINC did NOT prioritize the WU with the most CPU time completed, I get screwed for well over, yes, 600 hours of CPU time (assuming, of course, that the WU then errors out on too many future hosts and never validates, whereas my on-time WU would have gotten credit).
So, unless I "micromanage" my resource share for Orbit so that I am relatively certain that my two tasks will complete on time, I just got screwed. In fact, that scenario is playing out right now with my C2D lappy! One Orbit WU will complete on time at the current 10% resource share, but the second WU will not, and by current calculations (which are only estimates, by the way) it will complete about one day late. That, my friends, is why I think BOINC should be changed.
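And for anyone who wants to check their own box, the arithmetic behind my "about one day late" estimate is nothing fancier than this; a sketch with invented numbers, not anything BOINC works out for you:

# Back-of-the-envelope deadline check (illustrative only, not a BOINC feature).
# Given the CPU hours a task still needs, the fraction of one core it
# effectively gets (resource share, time the box is on, etc.), and the hours
# left until the deadline, will it finish in time?

def finishes_on_time(cpu_hours_remaining, core_fraction, hours_to_deadline):
    """Return (makes_it, wall_clock_hours_needed)."""
    needed = cpu_hours_remaining / core_fraction
    return needed <= hours_to_deadline, needed

if __name__ == "__main__":
    # Invented numbers roughly like my laptop's second Orbit WU: 12 CPU hours
    # left, 10% of one core, deadline 96 hours away.
    ok, needed = finishes_on_time(12.0, 0.10, 96.0)
    print(ok, needed)  # False 120.0 -> about a day past the deadline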
Quote:
My problem is what I've seen many times, which is slightly different. Suppose E@h is running on one core with five hours of CPU time completed. No other E@h WUs have started. When BOINC starts running E@h again, sure enough, it starts another E@h WU. So now I have one with five hours completed and "waiting to run" while another is started, and it runs for perhaps three hours before switching tasks to another project. Do you see my point now?
Gerry, now I understand what you're seeing. I've never seen it here, perhaps because I have only single-core machines and keep very small (0.1 day) caches. What versions of BOINC have you seen this on? Anybody else seeing it?
I'll bump my caches up and simulate multiple cores with the ncpus tag in cc_config.xml. You really have me curious now!!
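For anyone following along, this is the cc_config.xml I mean, just the relevant bit. If I remember the layout right it goes in the BOINC data directory, and the client needs a restart (or a "Read config file" from the Advanced menu) to pick it up:

<!-- Layout from memory; double-check before relying on it. This fakes a
     second CPU so the client schedules as if there were 2 cores. -->
<cc_config>
  <options>
    <ncpus>2</ncpus>
  </options>
</cc_config>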
That non-linear behavior of Orbit tasks is one of the few objections that makes sense. That is why I suggested that the setting be per project, since most users would obviously not apply it to CPDN, Orbit, etc. Some CPDN enthusiasts might disagree, but they too have had issues with the CPU scheduling leaving some tasks "hanging".
The cases cited were much like my case with Sudoku, where they wanted the tasks in flight completed as soon as possible. But there is very little that the participant can do to help the project complete these tasks. The one alternative proposed would be to allow the project to override the participant's preferences and, in effect, hijack their system. Fine for the project, but for some of us that is not necessarily what we signed up for. Besides, what would prevent projects from abusing this control?
Quote:
My problem is what I've seen many times, which is slightly different. Suppose E@h is running on one core with five hours of CPU time completed. No other E@h WUs have started. When BOINC starts running E@h again, sure enough, it starts another E@h WU. So now I have one with five hours completed and "waiting to run" while another is started, and it runs for perhaps three hours before switching tasks to another project. Do you see my point now?
Gerry, now I understand what you're seeing. I've never seen it here, perhaps because I have only single-core machines and keep very small (0.1 day) caches. What versions of BOINC have you seen this on? Anybody else seeing it?
I'll bump my caches up and simulate multiple cores with the ncpus tag in cc_config.xml. You really have me curious now!!
This is a follow-up post.
Gerry, I detached all projects except 3: Rosetta, Einstein and ABC. Then I set them all to the same resource share (33.3% each) and set LTD and STD to zero for all 3. I am using ncpus in cc_config.xml to fake an additional core, for a total of 2. I increased the cache to 5 days (0 for "connect every" and 5 for "additional days") and now have no less than 6 tasks for each of the 3 projects. I set the "switch between applications" interval to 10 minutes to speed this experiment along.
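For reference, the local side of that setup would look roughly like the global_prefs_override.xml below. I am quoting the tag names from memory, so treat it as a sketch and double-check before copying it:

<!-- Sketch only; tag names quoted from memory, verify against the docs. -->
<global_preferences>
  <work_buf_min_days>0</work_buf_min_days>                          <!-- "connect every" -->
  <work_buf_additional_days>5</work_buf_additional_days>            <!-- additional cache days -->
  <cpu_scheduling_period_minutes>10</cpu_scheduling_period_minutes> <!-- switch interval -->
</global_preferences>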
At this point the host has been through at least a dozen 10-minute switch intervals. I am definitely NOT seeing the behavior you described, where at each switch a new task is started while a task from the same project with "waiting to run" status is ignored. When it switches, it is definitely restarting tasks that are "waiting to run". Running 64-bit BOINC 6.4.3 on 64-bit Linux (Fedora 5).
Gerry, if you have a fly in your soup then I want one too. I'll let this experiment run for a few days to see what happens. I've just changed the "switch apps about every" setting to the default 4 hours because the 10 minutes probably isn't close to anything most crunchers will use.
Other than going out and buying a real multi-core, can you think of anything I could change that is likely to cause the behavior you are seeing?
Just a thought, but might scheduling work better if the BOINC system, rather than the project, set the deadlines based on est_flops and the averages reported back to the servers: the average of the DCFs, the average time BOINC is on, etc.?
Although there is one slight problem with this, as DCF is per project, not per application.
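Just to spell out the kind of calculation I mean; purely a sketch, nothing the servers actually do today, and every name and number in it is invented:

# Hypothetical server-side deadline estimate: NOT an existing BOINC feature.
# Idea: estimate wall-clock time from the task's estimated FLOPs, the host's
# measured speed, its average "on" fraction and the project's average duration
# correction factor (DCF), then pad the result to get a per-host deadline.

def suggested_deadline_days(est_flops, host_flops_per_sec,
                            on_fraction, avg_dcf, padding=3.0):
    run_seconds = est_flops / host_flops_per_sec   # raw runtime estimate
    run_seconds *= avg_dcf                         # correct by observed DCF
    run_seconds /= on_fraction                     # host isn't on 24/7
    return padding * run_seconds / 86400.0         # padded, in days

if __name__ == "__main__":
    # Invented numbers: a 1e15-FLOP task, a 2 GFLOPS host that is on 80% of
    # the time, average DCF of 1.5.
    print(round(suggested_deadline_days(1e15, 2e9, 0.8, 1.5), 1))  # ~32.6 days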
Quote:
...I've just changed the "switch apps about every" setting to the default 4 hours because the 10 minutes probably isn't close to anything most crunchers will use...
Has that default changed recently? The Computing preferences of Einstein, Seti, Lattice and CPDN (my current set of projects) still state:
Switch between applications every
(recommended: 60 minutes)
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke.)
Quote:
Thanks for the additional details. OK, you're not proposing to run until finished, but now it sounds like what you want is what we already have. Maybe you're seeing tasks with 2 hours of CPU time being run while tasks with 4 hours are waiting, because the task with 4 hours was running on 1 core when the other core(s) switched?
I don't have a problem with the scenario you portray here; that's just the way the ball bounces, depending on when BOINC switches tasks. So then, picking up on your scenario, why does BOINC restart the 2-hour WUs and not the 4-hour WU when it resumes E@h computations? Seems odd, doesn't it?
My problem is what I've seen many times, which is slightly different. Suppose E@h is running on one core with five hours of CPU time completed. No other E@h WUs have started. When BOINC starts running E@h again, sure enough, it starts another E@h WU. So now I have one with five hours completed and "waiting to run" while another is started, and it runs for perhaps three hours before switching tasks to another project. Do you see my point now? BOINC should have continued the five-hour WU, not started another WU. With four cores, sometimes it runs the second work unit with three hours completed rather than the first five-hour WU when E@h restarts. Or it might even start a third WU and run it for two hours. So now I have three WUs started and none completed. It all works out statistically in the end, of course, but prioritizing the WU with the most CPU time completed would seem to keep the cache a bit cleaner.
Indeed, I was just thinking that there is a scenario that could very well play out, get BOINC into trouble, and miss a deadline. Suppose an Orbit task that has run for, say, 560 hours is near completion. A second Orbit task also has, say, 87 hours of CPU time. So now I have two Orbit tasks that are incomplete. The task with 560 hours of CPU time completed now has two days left, turns into a "high priority" task, and continues to run, unaware that it will not actually complete in the 48 hours the client thinks it needs. Instead, the 560-hour task will complete in 59 hours and miss the deadline, because Orbit tasks are notorious for getting the time to completion wrong. So here is a very realistic scenario I had not considered before where, because BOINC did NOT prioritize the WU with the most CPU time completed, I get screwed for well over, yes, 600 hours of CPU time (assuming, of course, that the WU then errors out on too many future hosts and never validates, whereas my on-time WU would have gotten credit).
So, unless I "micromanage" my resource share for Orbit so that I am relatively certain that my two tasks will complete on time, I just got screwed. In fact, that scenario is playing out right now with my C2D lappy! One Orbit WU will complete on time at the current 10% resource share, but the second WU will not, and by current calculations (which are only estimates, by the way) it will complete about one day late. That, my friends, is why I think BOINC should be changed.
2 things, Gerry: maybe you need to run only one project on the machines that run Orbit. Orbit seems to be a project that is giving you time trouble, and putting in 560 hours of crunching time and then just losing it is NOT right!!! At least until BOINC can be fixed or figured out.
And will you look at the date and time the new Einstein unit is due when it starts that new unit instead of finishing the one already in progress? I am wondering if it sees the longer unit as being due to be returned sooner than the shorter unit, so it is getting busy on it?
Quote:
Gerry, I detached all projects except 3: Rosetta, Einstein and ABC. Then I set them all to the same resource share (33.3% each) and set LTD and STD to zero for all 3. I am using ncpus in cc_config.xml to fake an additional core, for a total of 2. I increased the cache to 5 days (0 for "connect every" and 5 for "additional days") and now have no less than 6 tasks for each of the 3 projects. I set the "switch between applications" interval to 10 minutes to speed this experiment along.
At this point the host has been through at least a dozen 10-minute switch intervals. I am definitely NOT seeing the behavior you described, where at each switch a new task is started while a task from the same project with "waiting to run" status is ignored. When it switches, it is definitely restarting tasks that are "waiting to run". Running 64-bit BOINC 6.4.3 on 64-bit Linux (Fedora 5).
Gerry, if you have a fly in your soup then I want one too. I'll let this experiment run for a few days to see what happens. I've just changed the "switch apps about every" setting to the default 4 hours because the 10 minutes probably isn't close to anything most crunchers will use.
Other than going out and buying a real multi-core, can you think of anything I could change that is likely to cause the behavior you are seeing?
Geez, I wasn't planning to start a full-fledged insurrection on Einstein!! :o)
I am using Vista SP2, so I don't know what I can do for you. If I were you, I would bump up your cache and leave everything else the same: that is what I have on my systems. I currently have a BAM! setting of 2 days cache. Other than that, I need help from others who are following this discussion to help Dago and others replicate the experiment. I don't remember if this behavior was there during the days when I had single core boxes. I think it was there as well, but that is just speculation.
Can anyone else help with this experiment or have experience with this?
Quote:
2 things, Gerry: maybe you need to run only one project on the machines that run Orbit. Orbit seems to be a project that is giving you time trouble, and putting in 560 hours of crunching time and then just losing it is NOT right!!! At least until BOINC can be fixed or figured out.
And will you look at the date and time the new Einstein unit is due when it starts that new unit instead of finishing the one already in progress? I am wondering if it sees the longer unit as being due to be returned sooner than the shorter unit, so it is getting busy on it?
Fortunately, I haven't lost anything yet, probably because I monitor my boxes. I will check for the next couple of days to see, but I think I've already thought of this and discounted it.
Right now all of my E@h is uploaded with nothing waiting. I have 2 Rosettas waiting at 23% and 63%, and, yipper, 3 WCGs: one at 59m and 27%, waiting; one at 2h 5m and 47%, running high priority; and one at 4h 17m and 90%, waiting; all 3 due 2/22 at different times. On WCG, I did notice a day or so ago that I am overcommitted with WUs at that resource share, 5%. So it does make sense that WCG is trying to catch up a touch. This does answer your question, I think.
Quote:
Geez, I wasn't planning to start a full-fledged insurrection on Einstein!! :o)
I'm not above a little insurrection when the circumstances warrant it, but I don't think we're looking at insurrection. Let's just call it getting involved, checking our work, and making sure each of us knows what the other is talking about.
Quote:
I am using Vista SP2, so I don't know what I can do for you. If I were you, I would bump up your cache and leave everything else the same: that is what I have on my systems. I currently have a BAM! setting of 2 days cache.
I would think the scheduler and work fetch code is the same for all platforms, so I probably shouldn't have even mentioned I'm on Linux. My cache is already bumped to 5 days, to make sure I have enough tasks to see whether the behavior you seem to be seeing holds true on my host as well.
Again, maybe I just don't understand what you're saying? Or seeing? You seem to be saying that if you have, for example, 4 cores, 3 projects and 10 tasks cached for each project (30 tasks total), then BOINC will eventually start and preempt each and every one of those 30 tasks rather than restart a task that has status "waiting to run". I am definitely not seeing that. Each project has, at most, X tasks with status "running" or "waiting to run", where X = the number of cores. I've tried it with BOINC versions 5.10.45, 6.2.15 and 6.4.3.
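The check I am doing by eye, written out as a snippet so there is no ambiguity about what I am counting; illustrative only, since in practice I just read the Tasks tab in the manager:

# Count, per project, the tasks that have been started (running or preempted).
# If the invariant holds, no project's count ever exceeds the number of cores.
from collections import Counter

def started_per_project(tasks):
    return Counter(t["project"] for t in tasks
                   if t["state"] in ("running", "waiting to run"))

if __name__ == "__main__":
    tasks = [  # made-up snapshot
        {"project": "Einstein", "state": "running"},
        {"project": "Einstein", "state": "waiting to run"},
        {"project": "Rosetta",  "state": "ready to start"},
        {"project": "ABC",      "state": "running"},
    ]
    print(started_per_project(tasks))  # Counter({'Einstein': 2, 'ABC': 1})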
Since I reset all the debts to 0 at the start of this experiment, I will let it run for a few more days, maybe even weeks, just to see if mounting debts somehow affect things and cause me to have more than X started tasks per project, where X = the number of cores.