Seemingly stalled task (at 8.427 %)

HP-Z210
Joined: 14 May 18
Posts: 6
Credit: 138124
RAC: 0
Topic 215062

Good day;

This morning one task was at 8.427 % complete with an elapsed time of over 20 hours, but the time remaining was just over 2 minutes, so, although it seemed strange, I allowed it to 'finish'. When the time ran out, the task remained at 8.427 % but was using no CPU time. A second task seems to be fine, as its progress % continues to rise. I am not sure what happened with this particular task, but I suspended it, and another task promptly took its place and is now happily crunching away. So there are now 2 running and 1 suspended (stalled).

The task in question:

5/19/2018 8:55:42 AM | Einstein@Home | task h1_0364.40_O2C02Cl1In0__O2AS20-500_364.50Hz_178_0 suspended by user

Should I just abort it or is there some other info that I can provide you before doing so?

 

Thanks;

Peter

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118445231720
RAC: 25920118

It is possible for tasks to become 'stuck' occasionally from probably 'unknowable' causes.  Suspending a task may not make any difference (depends somewhat on your preference settings) as I think the default is to leave tasks in memory when suspended.  If you resume it, it will just re-commence from the same stalled 'in-memory' image (when a slot is available to re-commence it).  It's worth trying anyway.  If you resume it, BOINC will attempt to continue running it when one of the two running tasks finishes (or is itself suspended).

If it is still stuck at that point, you should fully stop BOINC, wait a few seconds and then restart BOINC.  This will guarantee that all 'in progress' tasks are restarted from the last 'checkpoint' saved on disk and not from an image saved in memory.  It may cure the problem if the 'stall' happened as a result of some glitch after the last checkpoint had been written.

If this still fails to get progress after a reasonable time, the next step should be to stop BOINC again and this time reboot the computer.  It is possible that some sort of OS glitch was the initial cause and a reboot should clear that.

Whenever this sort of thing has happened to me (and it wasn't caused by faulty hardware - that's another whole can of worms) the task has pretty much got underway again.  I can't recall a case when I've had to abort a task because it was 'stuck' permanently.  That's not really saying much because my memory is getting pretty bad these days :-).

Let us know how you go.

 

Cheers,
Gary.

HP-Z210
Joined: 14 May 18
Posts: 6
Credit: 138124
RAC: 0

Thanks;

I will try these steps today and let you know. I have one task about 20 minutes from completion and will allow that to upload first.

Peter

HP-Z210
Joined: 14 May 18
Posts: 6
Credit: 138124
RAC: 0

Hello again;

Sorry about my lack of knowledge, but I am new and have only been running the BOINC client for about 6 days.

Update on my steps to restart my task:

1: Ran boinccmd --get_tasks

Task in question showed a Final CPU time > checkpoint CPU time

3: In Boincmgr restarted the suspended task

4: Exited Boincmgr with stop running tasks checked.

5: Restarted Boincmgr

Task is now showing

Progress: 8.359 %

Status: Postponed: Waiting to acquire lock

7: Ran boinccmd --get_tasks

Task in question showed a Final CPU time = checkpoint CPU time = current CPU time

8: Opened the Event Log

5/20/2018 9:39:34 AM | Einstein@Home | task postponed 600.000000 sec: Waiting to acquire lock

5/20/2018 9:50:17 AM | Einstein@Home | task postponed 600.000000 sec: Waiting to acquire lock

5/20/2018 10:00:54 AM | Einstein@Home | task postponed 600.000000 sec: Waiting to acquire lock

9: Opened the stderr.txt file

2018-05-20 09:28:20.2395 (5256) [normal]: This program is published under the GNU General Public License, version 2
2018-05-20 09:28:20.2395 (5256) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2018-05-20 09:28:20.2395 (5256) [normal]: This Einstein@home App was built at: Apr 5 2018 14:15:53

2018-05-20 09:28:20.2495 (5256) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_O2AS20-500_1.01_windows_x86_64.exe'.
Activated exception handling...
09:28:20 (5256): Can't acquire lockfile (32) - waiting 35s
09:28:55 (5256): Can't acquire lockfile (32) - exiting
09:28:55 (5256): Error: The process cannot access the file because it is being used by another process.
(0x20)
putenv 'LAL_DEBUG_LEVEL=3'

So it appears that the task (as you indicated) has reverted to the last checkpoint. However, after about 30 minutes the status is still looping every 10 minutes, from 'Running' (with no increase in progress %) to 'Waiting to acquire lock'. The client Status line, the client Event Log and stderr.txt all show this.
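As a rough check on the "every 10 minutes" observation, the gaps between the three event log lines quoted above can be computed directly. This is only a sketch: the timestamp format string is an assumption based on how the lines appear in this post, and the helper name `timestamp` is made up for illustration.

```python
from datetime import datetime

# The three event log lines quoted earlier in this post.
log_lines = [
    "5/20/2018 9:39:34 AM | Einstein@Home | task postponed 600.000000 sec: Waiting to acquire lock",
    "5/20/2018 9:50:17 AM | Einstein@Home | task postponed 600.000000 sec: Waiting to acquire lock",
    "5/20/2018 10:00:54 AM | Einstein@Home | task postponed 600.000000 sec: Waiting to acquire lock",
]

def timestamp(line):
    """Parse the leading 'M/D/YYYY H:MM:SS AM' field of an event log line."""
    stamp = line.split(" | ")[0]
    # Format string is an assumption inferred from the quoted lines.
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")

times = [timestamp(line) for line in log_lines]

# Seconds between consecutive 'postponed' messages.
gaps = [int((later - earlier).total_seconds())
        for earlier, later in zip(times, times[1:])]
print(gaps)
```

The gaps come out at 643 and 637 seconds. That would be consistent with the 600 second postponement reported in the event log plus the 35 second lockfile wait reported in stderr.txt, with a few seconds of overhead on top, although that breakdown is my reading of the logs and not something the client states.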

I will try a system reboot next to see if the 'locked' file somehow resets.

I hope this info is not too cumbersome and is helpful in some way.

 

Peter

HP-Z210
Joined: 14 May 18
Posts: 6
Credit: 138124
RAC: 0

Good day;

I just finished a reboot, restarted the client and the problem task is now proceeding along nicely. 

Thanks for your help;

Peter

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118445231720
RAC: 25920118

Hi Peter,
I'm very happy to hear that everything is back to normal.  Thanks for setting out the steps you took.  I'll comment on a couple of things in case it might be helpful to you.  Your computers are hidden, which is absolutely fine - your choice, but it does make it impossible to investigate details about your hardware, your OS, and tasks that have been completed and returned.

When I first saw reference to boinccmd, my immediate reaction was that you must be running Linux without a GUI.  Later on in the event log snip, I see the Windows app mentioned (so you do have a perfectly serviceable GUI), so my question is, why do step 1?

HP-Z210 wrote:
1: Ran boinccmd --get_tasks

If you want to see details of all things going on with tasks, the most convenient way is to use BOINC Manager (Advanced view).  I run Linux and do use boinccmd in control scripts I write for my hosts, but the Manager interface is so much easier to use when you want to browse and control individual machines.  My suggestion is to permanently switch to the advanced view and explore what is available to you in the various menu items.  To get lots of details about a particular task, you select the Tasks tab, select a task of interest by clicking on it, and then click 'properties' to get a nicely formatted popup showing all the details.

Quote:
Task in question showed a Final CPU time > checkpoint CPU time

Do you mean Current CPU time?  That should always be > checkpoint CPU time unless you happen to be looking at the very instant the checkpoint is being written (when they would be roughly equal).  The greater the separation between the two, the more crunching time that is 'lost' if you stop BOINC and then restart from that checkpoint.
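To put a number on the 'lost' crunching time, one could subtract the checkpoint CPU time from the current CPU time for a task. The sketch below parses a block of "label: value" lines; the sample text, its field labels and its values are all hypothetical stand-ins, so check them against what boinccmd --get_tasks actually prints on your machine.

```python
def parse_fields(block):
    """Turn 'label: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

# Hypothetical sample: labels and numbers are assumptions for illustration only.
sample = """\
   name: example_task_0
   CPU time at last checkpoint: 71000.0
   current CPU time: 71240.5
"""

fields = parse_fields(sample)

# CPU seconds that would be repeated if BOINC restarted from the checkpoint now.
lost_seconds = (float(fields["current CPU time"])
                - float(fields["CPU time at last checkpoint"]))
print(f"CPU time since last checkpoint: {lost_seconds:.1f} s")
```

With these made-up numbers, about four minutes of crunching would be repeated after a restart.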

Quote:
3: In Boincmgr restarted the suspended task

I presume you selected the task and clicked a button that was saying 'Resume' before you clicked it and as a result, the status of the task changed from 'Suspended' to, perhaps 'Waiting to run', or something like that?  Did you wait to see if this task would start when one of the running tasks finished?

Quote:
4: Exited Boincmgr with stop running tasks checked.

I'm guessing you would do this (exit completely) if the 'Waiting to run' task didn't make any further progress after it actually was allowed to run?  It will continue 'waiting' until a running task finishes.  At that point, its status should change from 'Waiting' to 'Running' irrespective of whether or not it made any further progress.  Then, if you decided it really wasn't making further progress (this is where the 'properties' function I described above is very useful), it would be appropriate to stop BOINC completely, wait a bit, and then restart BOINC.

Quote:

5: Restarted Boincmgr

Task is now showing

Progress: 8.359 %

Status: Postponed: Waiting to acquire lock

7: Ran boinccmd --get_tasks

Task in question showed a Final CPU time = checkpoint CPU time = current CPU time

I have no first hand experience with lock files but here is my understanding.

On multi-core machines, it's important that two (or more) different instances of an app, running on different cores, aren't both trying to work on the same task.  Under normal conditions, when a task starts crunching, an individual 'slot directory' is created which is populated with everything needed (discrete copies or links) as appropriate.  All working and temporary files for this task (including the checkpoint file) will be created in this same directory.

To prevent any other instance of the app using this particular slot, a lock file is created which basically means the directory is in use.  A particular instance of the app 'knows' exactly which lockfile it 'owns'.  I don't fully understand how it all works, but even if you suspend a task so that a new task starts in an extra slot directory, and even if the whole machine is shut down and restarted, there can be an orderly taking up of partially completed tasks by single instances of the app once again.  Lock files are created and removed in such a way as to guarantee this (usually) :-).
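The general idea of a lock file can be shown with a small sketch. To be clear, this is a generic 'create exclusively or fail' pattern and not BOINC's actual implementation, which I haven't studied; the file name `demo.lock` and the helper names are invented for the example.

```python
import os
import tempfile

def try_acquire(lock_path):
    """Create the lock file exclusively; return a file descriptor, or None if it already exists."""
    try:
        return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None

def release(fd, lock_path):
    """Close and delete the lock file so another instance can take the slot."""
    os.close(fd)
    os.remove(lock_path)

# A temporary directory stands in for a slot directory.
slot_dir = tempfile.mkdtemp()
lock_path = os.path.join(slot_dir, "demo.lock")  # hypothetical lock file name

first = try_acquire(lock_path)    # first instance gets the slot
second = try_acquire(lock_path)   # second instance is refused
release(first, lock_path)
third = try_acquire(lock_path)    # after release, the slot can be taken again

print(first is not None, second is None, third is not None)

# Tidy up the demonstration files.
release(third, lock_path)
os.rmdir(slot_dir)
```

In this simplified model, an owner that goes away without calling `release` leaves the lock behind, and every later attempt is refused until something clears it.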

There can be abnormal events (I don't know the details) where a partly completed task can be sitting in a slot with a lock file basically saying that this task is 'owned' but where there is actually no app instance that 'owns' it.  I think it is something like this that you were seeing.  In the BOINC directory, there is a sub-directory called 'slots' and in that are further subdirs called '0' '1' '2' .... as many as needed to hold all the working files for in-progress tasks.  When tasks are finished and reported, slots are cleared and reused or deleted if no longer needed.  If a slot contains a lock file when it shouldn't, then you'll have a problem giving messages like you were seeing.
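If you want to see which slot directories currently hold a lock file, something like the following sketch could be used. The lock file name `boinc_lockfile` is my assumption and should be checked against what is actually in your slot directories; the demonstration builds a throwaway directory tree and does not touch a real BOINC installation.

```python
import os
import tempfile

def slots_with_lock(slots_dir, lock_name="boinc_lockfile"):
    """Return the names of slot subdirectories that contain a file called lock_name."""
    found = []
    for entry in sorted(os.listdir(slots_dir)):
        slot_path = os.path.join(slots_dir, entry)
        if os.path.isdir(slot_path) and os.path.exists(os.path.join(slot_path, lock_name)):
            found.append(entry)
    return found

# Throwaway stand-in for the BOINC 'slots' directory with subdirs 0, 1 and 2.
demo_slots = tempfile.mkdtemp()
for name in ("0", "1", "2"):
    os.mkdir(os.path.join(demo_slots, name))

# Pretend slot 0 holds a lock file (file name is an assumption, see above).
open(os.path.join(demo_slots, "0", "boinc_lockfile"), "w").close()

locked = slots_with_lock(demo_slots)
print(locked)
```

Bear in mind that a slot with a normally running task is expected to have a lock file too, so finding one is not a problem in itself. It only matters when nothing is actually running in that slot.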

From memory, I've seen (over many years) people talking about deleting lock files in slot directories as a solution.  I think this is needed when stopping and restarting BOINC doesn't clear the problem.  I also seem to recall that rebooting the machine has sometimes sorted things out.  If I'm reading your report correctly, this seems to have worked for you.

Since you spent a lot of time (very commendable with your short experience) in documenting what you observed and what you did, I thought the least I could do was try to interpret your report, both for your edification and for that of anyone else interested in reading this far :-).

 

Cheers,
Gary.

HP-Z210
Joined: 14 May 18
Posts: 6
Credit: 138124
RAC: 0

Hi Gary;

Thanks again for all the detail.

I am running Windows and will now make my computers visible.

1: Ran boinccmd --get_tasks

I do have the GUI running; however, in some applications the command line interface gives more info, so that is why I went there (just my own curiosity). That is where I found the 3 different CPU times (Final CPU time, checkpoint CPU time and current CPU time) referenced, whereas I could only see two in the BAM interface. I have since noticed that the Final CPU time updates along with the Current CPU time (cmd interface). I was only trying to provide extra info, albeit seemingly irrelevant given my new knowledge.

3: In Boincmgr restarted the suspended task

Yes, I clicked Resume. The task went to 'Waiting to run' because another task had started when I suspended the stalled one. However, I did not wait for any task to finish before exiting the Manager, as the time to finish both running tasks was around 10-12 hours.

 4: Exited Boincmgr with stop running tasks checked.

It was still in waiting to run when I exited. Because it was slot 0, I was pretty sure it would be the first task to start upon restarting the manager and it did. That is when the lock problem occurred. So, I exited with a partially finished, waiting to run task and two active tasks. The other two active tasks seemed to finish normally and uploaded after the slot 0 task resumed and finished (after reboot).

 

Note:

I do have the subdirs '0', '1' and '2'. The suspended task was using subdir 0 and the other two subdirs were in use by the two running tasks. A third task (using slot 2) became active after I suspended the stalled task.

 

Have a great day;

Peter
