Rerun task

lightning_anime
Joined: 8 Nov 10
Posts: 5
Credit: 2564729830
RAC: 1437414
Topic 205342

Apparently Microsoft decided that I needed my NVidia GPU drivers updated on my Win 10 Pro PC and graciously updated the drivers for me while I was not at my computer.  However, because BOINC was running at the time the drivers were updated, the GPU tasks in progress and all subsequent tasks failed.  I now have 50+ GPU tasks that ended in computation errors.  The results have not been uploaded to the server yet as network activity is suspended.

Applications: FGRPopencl-Beta-nvidia 1.18 and FGRPopencl1K-nvidia 1.18

Example task: LATeah0011L_1172.0_0_0.0_19546625_0

 

Is there a way to rerun the tasks?  I still have over a week to process them.  I can't find anything that looks like an output file or matches the task name.

 

 

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7346421687
RAC: 2223153

My suggestion for recovery is to consider those jobs gone, and concentrate on a clean restart.  I'd suspend all remaining unstarted tasks, then reboot, then release just one set of tasks and check that they run successfully to completion before unsuspending another set--continuing until you have validations.

Let the servers handle the task of handing the errored work back out to another system. 

One way or another, most of us have zipped through erroring out a cache under some circumstance or another.  

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5883
Credit: 119073187942
RAC: 24162583

lightning_anime wrote:
Is there a way to rerun the tasks?  I still have over a week to process the tasks?

Yes, there is.  I have done exactly this within the last 24 hours on a machine where the entire cache of GPU tasks became 'compute errors' (apart from two that were sitting there 'ready to report').  Before going any further, let me point out that the simplest course of action is to do what archae86 advised.  If you're interested, I'll explain what happened in my case.

My problem was because of a current heatwave.  Today's temp was 36C and tomorrow is predicted to be 39C.  The machine was still crunching CPU tasks but the GPU tasks failed because the GPU app was deemed by BOINC to be corrupt, despite the fact that GPU tasks were crunching.  From time to time BOINC seems to perform a checksum test and for some reason the check failed.  I've had this happen before.  The app isn't actually corrupt.  The routine that works out the checksum seems to get it wrong sometimes in high ambient temperature situations.

Be that as it may, the reason isn't important.  The upshot is that there is a host in distress, also with about 50 failed tasks in a ~18 hour backoff that will count down to zero before all the failed tasks will be reported (unless I click 'update' to override the backoff).  A mass failure like this seems to cause these lengthy backoffs.  I don't really know why but I'm forever grateful for it.  It allows me to retrieve the entire cache of work rather than trashing it.

The recovery procedure involves setting NNT (no new tasks) in BOINC Manager, then stopping BOINC and editing the state file (client_state.xml).  This is NOT a normal or routine procedure and should not be attempted unless you have a really good quality plain text editor and a decent understanding of the structure of the state file.  All you are going to do is cause your BOINC client to 'forget' about all the failed tasks by simply deleting them.  Each separate task is bounded by <result> .... </result> tags and you need to delete every single one of them.  On mass failures like this, it's usually pretty simple because the failed results are usually consecutively listed in the state file.  If you identify the start of the very first failed task and the end of the very last failed task, and if there are no 'good' tasks somewhere in the middle of the list, you can delete the whole lot in one simple operation.  It's pretty easy to recognise a failed task.  There is the failure message embedded in each one so they are really obvious.
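For anyone who would rather script the deletion than do it by hand, here is a minimal Python sketch of the step described above.  It is not part of the procedure as I actually performed it, just an illustration.  The function name drop_failed_results and the failure_marker argument are made up for this example; you would pass in whatever failure text you actually see inside the errored <result> blocks of your own state file.  It assumes BOINC is stopped and that you are working on the text of client_state.xml.

```python
import re

# Matches one whole <result> ... </result> block, including its indentation
# and trailing newline. <result_name> tags elsewhere are not matched.
RESULT_BLOCK = re.compile(r"[ \t]*<result>.*?</result>[ \t]*\n?", re.DOTALL)


def drop_failed_results(state_text, failure_marker):
    """Return (cleaned_text, removed_count).

    Removes every <result> block whose body contains failure_marker.
    failure_marker is a hypothetical parameter: supply the failure text
    you find in your own errored tasks.
    """
    removed = []

    def _filter(match):
        block = match.group(0)
        if failure_marker in block:
            removed.append(block)
            return ""      # forget this task
        return block       # keep good tasks untouched

    cleaned = RESULT_BLOCK.sub(_filter, state_text)
    return cleaned, len(removed)
```

If you try something like this, read the state file into a string, keep an untouched copy of the original somewhere safe, and compare the removed count against the number of errored tasks shown in the tasks tab before writing anything back.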

There are also a lot of <file> .... </file> blocks that are associated with the tasks that have a <status> of zero normally.  An error status of -161 seems to get inserted in these.  I suspect these can probably be ignored as the removal of the <result> blocks probably causes BOINC to remove the associated <file> blocks anyway.  I usually do a global search and replace to change all such -161 values to zero and it's always been successful.  Next time I might just delete the error <result> blocks and leave the -161s and see what happens :-).
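The search and replace on the -161 values could be sketched the same way.  Again this is only an illustration, with a made-up function name (reset_file_status), and it assumes the value appears in the state file as a <status>-161</status> element.  Restricting the replacement to <status> tags avoids touching any other -161 that might happen to appear elsewhere.

```python
import re


def reset_file_status(state_text, bad_status=-161):
    """Return (new_text, replaced_count).

    Rewrites <status>-161</status> to <status>0</status>.
    Only <status> elements holding exactly bad_status are changed.
    """
    pattern = re.compile(
        r"(<status>\s*)" + re.escape(str(bad_status)) + r"(\s*</status>)"
    )
    # \g<1> and \g<2> keep the original tags and whitespace intact.
    return pattern.subn(r"\g<1>0\g<2>", state_text)
```

The returned count gives a quick sanity check on how many <file> blocks were touched.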

After editing and saving the state file, I restart BOINC.  There are no complaints (if I'm careful to avoid mistakes) and the tasks tab now shows just the ones that weren't errors.  With NNT still set, I 'update' the project.  This causes the client to tell the server exactly what it has on board.  The server immediately notices all the missing tasks and graciously sends fresh copies in batches of 12 per 'update'.  It will tell you (very politely - how amazing :-) ) that it's "resending lost tasks".  Very quickly, you can have the entire cache restored to its former glory.  At that point, remove the NNT and you will be back to normal operations.

 

Cheers,
Gary.
