Status - Error while computing

BernD
Joined: 18 Mar 16
Posts: 1
Credit: 9342838
RAC: 0
Topic 198611

I had 4 tasks uploaded 05/16/16, all showed status: Compute error ... but all 4 had suffered from an automatic WIN10 update and restart. Is that kind of reboot a common problem with BOINC; or is there a fix? For now, I have turned off Win auto-update.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118567940911
RAC: 19141626

Quote:
I had 4 tasks uploaded 05/16/16, all showed status: Compute error ...


Hi BernD,

Welcome to the Einstein Project.

I had a look at your tasks list and found the 4 tasks you mention. They show on this current page of your results - for the moment anyway. I chose the first one that shows in the list as being reported at 16 May 2016, 11:55:46 UTC. I clicked on the taskID (558291648) to see what was reported.

Here is the relevant bit that was right at the bottom of the page.

2016-05-16 07:45:58.3094 (3980) [normal]: Finished main analysis.
2016-05-16 07:45:58.3094 (3980) [normal]: Recalculating statistics for the final toplist...
2016-05-16 07:50:15.0752 (3980) [normal]: Finished recalculating toplist statistics.
2016-05-16 07:50:15.0752 (3980) [debug]: Writing output ... toplist2 ... toplist3 ... done.
2016-05-16 07:50:16.3565 (3980) [debug]: resultfile '../../projects/einstein.phys.uwm.edu/h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_0' (len 88), current config file: 0
2016-05-16 07:50:16.3565 (3980) [debug]: renaming '../../projects/einstein.phys.uwm.edu/h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_0-BSGLtL' to '../../projects/einstein.phys.uwm.edu/h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_1'
2016-05-16 07:50:16.3721 (3980) [debug]: renaming '../../projects/einstein.phys.uwm.edu/h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_0-BtSGLtL' to '../../projects/einstein.phys.uwm.edu/h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_2'
FPU status flags:  COND_2 PRECISION
2016-05-16 07:50:16.3721 (3980) [normal]: done. calling boinc_finish(0).
07:50:16 (3980): called boinc_finish

upload failure:
h1_0093.60_O1C01Cl2In2__O1AS20-100I_93.70Hz_1661_0_0
-161 (not found)

h1_0093.60_O1C01Cl2In2__O1AS20-100I_93.70Hz_1661_0_1
-161 (not found)

h1_0093.60_O1C01Cl2In2__O1AS20-100I_93.70Hz_1661_0_2
-161 (not found)

As you can see, the computation actually completed successfully at 07:50:16. That is a local time, so you would need to take your timezone into account when working out how it relates to the UTC time reported on the server.
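To make the local-versus-UTC comparison concrete, here is a minimal Python sketch. It assumes the host runs at UTC-4, which is the offset mentioned later in this thread; the helper name local_to_utc is made up for illustration and is not part of BOINC.

```python
from datetime import datetime, timedelta, timezone


def local_to_utc(stamp: str, utc_offset_hours: int) -> datetime:
    """Interpret a 'YYYY-MM-DD HH:MM:SS' local timestamp at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


# Time the task log says boinc_finish was called (local time on the host).
# The -4 offset is an assumption, not something the log itself states.
finished = local_to_utc("2016-05-16 07:50:16", -4)

# Time the server shows the task as reported (already UTC).
reported = datetime(2016, 5, 16, 11, 55, 46, tzinfo=timezone.utc)

print("finished (UTC):", finished.isoformat())
print("gap before report:", reported - finished)
```

Under that assumption the task finished only a few minutes before it was reported, which fits with the computation itself having completed normally.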

What was missing appears to be the three result files. The error reported is that those three files could not be found. The really puzzling bit is that the name of this task is h1_0093.60_O1C01Cl2In2__O1AS20-100I_93.70Hz_1661_0 (right at the top of the page when you click the above link for taskID 558291648), and yet the creation of the three result files in the above snip (just before the line giving FPU status flags) shows totally different filenames. The first one is h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_0, followed by two more files with the same wrong 'base' part of the name and _1 and _2 extensions. Notice that the first frequency (0093.60) has morphed into something totally different (0093.25), as has the second (93.70Hz has become 93.35Hz), and the sequence number (_1661) has also changed (_845). If you look at all 4 error tasks, there is exactly the same pattern. So, for each error task, the 3 result files couldn't be found because the 'base' parts of the names were quite wrong.
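The name mismatch can be checked mechanically. The following is only a sketch in Python: the filenames are copied from the log excerpt and the upload failure list above, while the helper base_name and the rule "result file = task name plus _0, _1, _2" are inferred from those names rather than taken from any BOINC documentation.

```python
import re


def base_name(result_file: str) -> str:
    """Drop the trailing _<index> that distinguishes a task's result files."""
    return re.sub(r"_\d+$", "", result_file)


# Task name shown at the top of the page for taskID 558291648.
task_name = "h1_0093.60_O1C01Cl2In2__O1AS20-100I_93.70Hz_1661_0"

# Names the application actually wrote, taken from the log lines above.
written_files = [
    "h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_0",
    "h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_1",
    "h1_0093.25_O1C01Cl2In2__O1AS20-100I_93.35Hz_845_0_2",
]

# Names the upload step looked for (assumed pattern: task name + _0.._2).
expected_files = [f"{task_name}_{i}" for i in range(3)]

mismatched = [f for f in written_files if base_name(f) != task_name]

print("expected:", expected_files)
print("written with wrong base:", mismatched)
```

All three written files come out as mismatched, which is why each upload fails with -161 (not found).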

It gets even more weird. I wondered where the wrong 'base' parts of the names had come from. It turns out you completed and returned (and had validated) 4 tasks which were sent to you on May 06 and reported on May 12. Have a look earlier in your tasks list at taskIDs 558029896, 558029897, 558029899 and 558029901 respectively. You will find the exact names in those previous tasks that are being used again in the results for the failed tasks. I have no idea how the names of files reported and removed from your system 4 days earlier could suddenly be reused as the names for the current set of results. Perhaps a Dev needs to look into this and see if there is any explanation.

Christian - are you reading?? :-).

Cheers,
Gary.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 196918843
RAC: 227125

A very curious case. I checked the 4 tasks in question and found a pattern by comparing the command line used to start the application.

It seems that something happened before all four tasks were restarted at 2016-05-13 19:21 (local time, UTC-4), because after that the command line contains the wrong task name. One of the results was restarted at 2016-05-13 18:32:06, when everything was still fine.

I'm not sure what could cause this. It's possible that client_state.xml got corrupted, so that the older command line (which should not have been in there anymore) was used to restart the newer tasks. I would need to look up how this works in the Client. But I would really like to know what happened in that window of roughly one hour. Did you restart your computer? Did you install Windows updates? Did you restore from a backup, or did you revert to a previous state with Windows System Restore?

There are only four tasks reported after the faulty four and they don't restart as often as the faulty ones. So I'm not sure this is over.

After a closer look at the command line, it seems that the problem (whatever it is) only changed the name of the output file; the rest of the command line was still as it should be for this task. This means we are closer to where the problem happens, but it is still unclear how it is triggered.
