Last Saturday, 12/7/13, the motherboard failed on my W8 machine. It got replaced yesterday and the machine picked up right where it left off, as expected. Since I am running a small cache, 3 days, nothing was in danger of timing out. Well, the machine aborted 20 BRP4G-cuda32-nv301 tasks, which I am running 3 at a time on the GTX660. My guess is that the scheduler did not take the 3-at-a-time setup into account, thought the card was 3 times slower than it is, and wasted project resources. Not a big deal, the only thing that suffered was my RAC, but I am curious.
Aborted by user
It's more likely that the 7-day outage caused BOINC to reduce its estimate of future crunch time availability to the point where there wouldn't be enough 'on_frac' to guarantee the completion of all tasks in the cache. BOINC isn't capable of knowing that you might well have 100% availability once you get the machine back online. It has to assume the worst, based on recent history.
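To put rough numbers on that, here is a minimal sketch of the idea, not BOINC's actual scheduling code. It assumes a simplified model in which the wall-clock time needed to clear the cache is just the crunch time divided by on_frac. The function name wall_clock_days is made up for illustration.

```python
def wall_clock_days(crunch_days: float, on_frac: float) -> float:
    """Estimate wall-clock days needed to finish `crunch_days` of work.

    Simplified assumption: the host is only expected to be running for
    the fraction `on_frac` of the time, so the estimate is stretched by
    1 / on_frac. The real client considers more factors than this.
    """
    if not 0.0 < on_frac <= 1.0:
        raise ValueError("on_frac must be in the range (0, 1]")
    return crunch_days / on_frac


if __name__ == "__main__":
    # A 3 day cache seen through a reduced on_frac versus a healthy one.
    for frac in (0.432852, 0.999999):
        print(f"on_frac={frac}: {wall_clock_days(3.0, frac):.2f} days")
```

Under this model a 3 day cache looks like roughly 7 days of wall-clock time once on_frac has dropped to the 0.43 range, which is the kind of shift that can make tasks near their deadline look at risk.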
In order to prevent the problem, you can edit the state file (client_state.xml) before you allow BOINC to communicate with the project. You have to do a few things in a particular order and there are several strategies to achieve the same outcome but it's quite feasible to convince BOINC that all is well and no tasks are at risk.
Probably the easiest strategy is to unplug the network cable before firing up BOINC. That way you can get into the manager and change settings there before BOINC can report its concerns to the project. When this happens to me, the first thing I do after starting BOINC (with the network cable still unplugged) is suspend most of the tasks in the cache, except the running ones. That gets BOINC out of panic mode, and then I assess whether all the tasks can be completed on time or not. If they can, and if I don't want to be bothered handling panic mode, I stop BOINC and edit the state file to change the on_frac value from something like 0.432852 to 0.999999. Then I restart BOINC and resume all the suspended tasks (slowly - in stages) to see if BOINC is still happy.
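The state file edit described above can be done by hand in any text editor. If you would rather script it, here is a minimal sketch. It assumes the value lives in an `<on_frac>` element in client_state.xml, and the helper names set_on_frac and patch_state_file are made up for illustration. Only run something like this while BOINC is fully stopped, and keep the backup copy it makes.

```python
import re
import shutil

# Assumption: the value is stored as <on_frac>0.432852</on_frac>.
ON_FRAC_RE = re.compile(r"(<on_frac>)[^<]*(</on_frac>)")


def set_on_frac(state_text: str, new_value: float = 0.999999) -> str:
    """Return state_text with the first <on_frac> value replaced."""
    new_text, count = ON_FRAC_RE.subn(
        lambda m: f"{m.group(1)}{new_value:.6f}{m.group(2)}",
        state_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no <on_frac> element found in the state file")
    return new_text


def patch_state_file(path: str, new_value: float = 0.999999) -> None:
    """Back up the state file, then rewrite it with the new on_frac."""
    shutil.copy2(path, path + ".bak")  # keep a copy in case of mistakes
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(set_on_frac(text, new_value))
```

Call patch_state_file with the full path to client_state.xml in your BOINC data directory. The text is patched with a regular expression rather than an XML parser so that everything else in the file is left byte for byte as BOINC wrote it.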
If it seems there is a real risk of deadline misses, there are still options to work around the problem. It is possible to ask the scheduler, very nicely :-) to give you a deadline extension on some (or even all) of the 'at risk' tasks. It's amazing how truly understanding the scheduler can be if you explain your predicament properly :-). You should be able to work that out for yourself. The clue is to think about the 'resend lost tasks' feature that is switched on at Einstein. It works beautifully :-).
Over the years, I've had the odd complete hard disk failure where I've had to reinstall everything from scratch. You would think that the tasks that were in the cache at the time of failure would be totally lost. In fact, it is quite easy to do a new BOINC installation and then get it to once again talk nicely to the server and explain that it should send fresh copies of all the outstanding tasks. Essentially, you are making your new installation adopt the complete identity of the old one, and once again the scheduler will, very cooperatively, send you fresh copies of all your lost tasks. It's very good to be able to so easily prevent a whole bunch of tasks from sitting in limbo on the server, waiting to time out.
Cheers,
Gary.
Interesting. I shall crunch on.
Thanx, B