WU seems stuck

Jim Wilkins
Joined: 1 Jun 05
Posts: 33
Credit: 28426884
RAC: 0
Topic 198348

I am running an FGRP4 WU on my iMac with BOINC 7.6.22 and Mac OS 10.11.2. It has been stuck at 85.833% complete for two wall-clock days now. The remaining time cycles between about 01:21:00 and 01:24:00. What should I do, if anything?

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Have you tried exiting BOINC and then restarting it to see if it picks up where it left off and finishes?

If that doesn't work, you could try rebooting your computer, relaunching BOINC and seeing if it continues and finishes.

If it still doesn't finish after both of those, I would consider aborting the task.
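
If you're comfortable with Terminal, the same exit/restart can be done with boinccmd. This is only a sketch assuming a default Mac install - the data directory and Manager paths below are the usual defaults, so adjust them if yours differ:

cd "/Library/Application Support/BOINC Data"     # run from here so boinccmd can read the RPC password file
BOINCCMD="/Applications/BOINCManager.app/Contents/Resources/boinccmd"
"$BOINCCMD" --get_tasks                          # lists task names, state and "fraction done"
"$BOINCCMD" --quit                               # tell the running client to exit cleanly
open -a BOINCManager                             # relaunch the Manager (it should restart the client and the task)
# If it still sits at the same percentage after a reboot as well, abort it:
# "$BOINCCMD" --task http://einstein.phys.uwm.edu/ <task name> abort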

Jim Wilkins
Joined: 1 Jun 05
Posts: 33
Credit: 28426884
RAC: 0

Zalster,

I have done both of these things. Nothing good happens. I think this is the second one of these that I have had to abort.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Looks like those are beta versions.

Think the developers are going to need to look at those and the stderr reports and try to figure out what happened.

Cartoonman
Joined: 5 May 08
Posts: 9
Credit: 96637220
RAC: 74489

I just saw the same happen to my BRP6 CUDA beta tasks. It seems that when I suspend them and then resume, the task fails to restart and just hangs. Can someone confirm?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117878791484
RAC: 34729487

Quote:

Looks like those are beta versions.

Think the developers are going to need to look at those and the stderr reports and try to figure out what happened.


Whilst this app is listed as -beta for OS X, it seems to have been running for a while without issue. I'm using it on a couple of iMacs and I'm not seeing any problems with current tasks.

In general terms and over time, quite a few tasks do fail for a variety of reasons, and most of those are due to issues at the client end which would be virtually impossible for the devs to diagnose. If a volunteer wants to attempt a diagnosis, the stderr text returned to the project can be browsed by anybody by clicking on the appropriate task ID on the website.

I had a look at this for both the aborted tasks and compared what I saw with the stderr output of a successful task. For the first aborted task, I didn't notice any particular issue. The task seemed to have reached about the 50% stage (118 skypoints out of 238 total) in about 45-50% of the normal crunch time. It just looked like the task was aborted in mid-flight.

For the most recent abort, there was some sort of issue. The stderr output was truncated at the beginning, with only those lines closest to the end of the file being returned. This happens because there is a size limit imposed on the file and, if something goes wrong and produces lots and lots of garbage, the stuff near the end of the file is deemed the most important. In this case the block of output listed below was being repeated ad nauseam, so the task really did need to be aborted.

16:19:46 (5704): [normal]: Start of BOINC application 'hsgamma_FGRP4_1.15_x86_64-apple-darwin__FGRP4-Beta'.
16:19:46 (5704): [debug]: 2.1e+15 fp, 5.1e+09 fp/s, 407832 s, 113h17m11s85
command line: hsgamma_FGRP4_1.15_x86_64-apple-darwin__FGRP4-Beta --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0138E.dat --alpha 2.34716136237 --delta -0.748223993186 --skyRadius 1.722261e-03 --ldiBins 15 --f0start 48 --f0Band 32 --firstSkyPoint 168 --numSkyPoints 24 --f1dot -5e-11 --f1dotBand 1e-12 --df1dot 4.573937176e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55946 --f0orbit 0.005 --debug 1 -o LATeah0138E_80.0_168_-4.9e-11_1_0.out
output files: 'LATeah0138E_80.0_168_-4.9e-11_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0138E_80.0_168_-4.9e-11_1_0' 'LATeah0138E_80.0_168_-4.9e-11_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0138E_80.0_168_-4.9e-11_1_1'
16:19:46 (5704): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
16:19:46 (5704): [normal]: WARNING: Resultfile '../../projects/einstein.phys.uwm.edu/LATeah0138E_80.0_168_-4.9e-11_1_0' present - doing nothing
16:19:46 (5704): [debug]: Set up communication with graphics process.
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out.cohfu: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out.cohfu: No such file or directory
mv: LATeah0138E_80.0_168_-4.9e-11_1_0.out.cohfu: No such file or directory
16:23:05 (6219): [normal]: This Einstein@home App was built at: Sep 25 2015 08:56:27

Because the start of the stderr file is lost, we can't see what was happening before the above rename failures (mv command) kicked in. However, we do know that the task had completed its first half. These tasks are bundles of two, and you can see the WARNING line that says there is an existing partial result which will not be disturbed in any way. Notice that the partial result mentioned has the extension _0, which is used for the 1st half result. The 2nd half has a _1 extension. The two files containing the first half result would have endings of _0.out and _0.out.cohfu (coherent followup).
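
If this happens again, a quick way to see whether those first-half files actually exist is to look for them in the BOINC data directory. Just a sketch, assuming the default macOS data directory (the filename pattern is taken from the stderr output above):

cd "/Library/Application Support/BOINC Data"
find projects/einstein.phys.uwm.edu slots -name 'LATeah0138E_80.0_168_-4.9e-11_1_0*' -ls
# anything with a _0, _0.out or _0.out.cohfu ending belongs to the first-half result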

Because the app was attempting a restart and was trying to move (rename) these files, I'm guessing that something bad had happened at the end of the first half which caused the result files not to be created properly. The OP would be the only person who could figure out what was happening on the machine leading up to the above that could have caused the 1st half result files to be lost or damaged. I could imagine something like this happening if some other process on the machine were to crash or cause a lockup which required some sort of hard reboot at a most inopportune time. Perhaps part of the filesystem was damaged and subsequently got truncated by the filesystem consistency checks during a reboot. As I said, I'm only guessing. It would be virtually impossible for an outsider to diagnose the true cause of the problem.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117878791484
RAC: 34729487

Quote:
I just saw the same happen to my BRP6 CUDA beta tasks. It seems that when I suspend them and then resume, the task fails to restart and just hangs. Can someone confirm?


The GPU app for BRP6 tasks on NVIDIA GPUs is totally different from the CPU app for FGRP4 tasks, so it is quite unlikely that your 'problem' is connected to that of the OP. You are more likely to get help if you start your own thread and take a bit of time to explain the circumstances surrounding the particular behaviour you were observing.

If you don't give any background information, people have to trawl through your results and try to guess what might have been happening. For example, the bulk of your GPU tasks seem to take around 8-9 ksec of elapsed time. There are some outliers on both the high and low side (e.g. 7k or 11k), and this is pretty much what you would expect. What is not normal is to see values around 29 ksec, like your task in this particular WU quorum. That task was validated, so there was no problem with the app and its ability to crunch the data, but something must have been interfering with the computation to slow it down so much.

Probably that same interference was happening to the first task in your string of aborts. It had racked up 28k of elapsed time when you aborted it; if you had allowed it to complete, it may very well have done so and been validated. I didn't notice any error messages, so my guess is that whatever was slowing it down is probably something else using your GPU. Can you tell us what sort of other things run on your host while crunching is going on in the background?

I've picked out a couple of bits from the linked stderr output of the aborted task that might help with understanding what was going on. Please note that these tasks are 'bundled' so the stderr output should show 2 separate sub-tasks starting and finishing.

Firstly, here is an example of what you expect to see at the start of the crunching phase of each subtask. There are earlier lines of 'task setup' that I've omitted.

....
[20:53:42][2608][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 360 MB (665 MB free / 1025 MB total) -> Used by this application (assuming a single GPU task): 118 MB
[20:54:39][2608][INFO ] Checkpoint committed!
[20:55:39][2608][INFO ] Checkpoint committed!
[20:56:39][2608][INFO ] Checkpoint committed!
....

When the crunching of a sub-task has completed, the following is an example of the successful completion of this stage.

....
[01:49:20][4312][INFO ] Checkpoint committed!
[01:50:29][4312][INFO ] Checkpoint committed!
[01:51:44][4312][INFO ] Checkpoint committed!
[01:51:46][4312][INFO ] Statistics: count dirty SumSpec pages 156889 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1886937
[01:51:46][4312][INFO ] Data processing finished successfully!
....

If there are no interruptions to the crunching process, you would expect to see exactly two starts and finishes of the above form, with a continuous run of "Checkpoint committed!" messages for each of the two sub-tasks. When crunching is interrupted by BOINC being stopped, tasks being suspended, or the host being rebooted, you will see the flow of messages broken into smaller sections. All the lines are time stamped, so you should be able to work out the elapsed crunching times and the total interval when crunching was stopped. Perhaps this might help you remember what you were doing with the machine at these times.

Below is an example of crunching stopping and then being restarted at a later stage.

....
[21:00:41][2608][INFO ] Checkpoint committed!
[21:01:41][2608][INFO ] Checkpoint committed!
[21:02:42][2608][INFO ] Checkpoint committed!
Activated exception handling...
[23:13:17][1572][INFO ] Starting data processing...
[23:13:17][1572][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 178 MB (847 MB free / 1025 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[23:13:17][1572][INFO ] Using CUDA device #0 "GeForce GTX 750" (512 CUDA cores / 1164.29 GFLOPS)
[23:13:17][1572][INFO ] Version of installed CUDA driver: 7050
[23:13:17][1572][INFO ] Version of CUDA driver API used: 3020
[23:13:18][1572][INFO ] Continuing work on ../../projects/einstein.phys.uwm.edu/PM0080_04991_62.bin4 at template no. 34549
....
....
....
[23:13:21][1572][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 295 MB (730 MB free / 1025 MB total) -> Used by this application (assuming a single GPU task): 117 MB
[23:14:17][1572][INFO ] Checkpoint committed!
[23:15:25][1572][INFO ] Checkpoint committed!
[23:16:25][1572][INFO ] Checkpoint committed!
....

I've left out a bunch of lines during the setup phase but notice the time gap between the last checkpoint (line 3) and the restart on line 5 - just over 2hrs 10mins. Looks like BOINC was stopped for a while or perhaps the machine was off for that period. None of this is a problem. I'm just indicating how you can use the time stamps to help remember what was going on external to crunching.
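
If it helps, here is a rough way to pull those gaps out automatically instead of reading the time stamps by eye. It's only a sketch: save the task's stderr text from the website into a file (stderr.txt is just a placeholder name) and run this in a terminal:

grep -E 'Checkpoint committed|Starting data processing|finished successfully' stderr.txt |
awk '{
    split(substr($0, 2, 8), t, ":")   # these lines all start with "[HH:MM:SS]"
    secs = t[1]*3600 + t[2]*60 + t[3]
    if (NR > 1) {
        gap = secs - prev
        if (gap < 0) gap += 86400     # crude correction for crossing midnight
        printf "%6d s since previous line: %s\n", gap, $0
    }
    prev = secs
}'

Large gaps immediately before a "Starting data processing..." line mark the periods when crunching was stopped.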

The next snippet is a further example of a break in crunching of about an hour.

....
[23:41:26][1572][INFO ] Checkpoint committed!
Activated exception handling...
[00:41:59][4312][INFO ] Starting data processing...
[00:41:59][4312][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 944 MB (81 MB free / 1025 MB total) -> Used by this application (assuming a single GPU task): 0 MB
....
....

[00:42:04][4312][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 957 MB (68 MB free / 1025 MB total) -> Used by this application (assuming a single GPU task): 13 MB
[00:42:59][4312][INFO ] Checkpoint committed!
[00:43:59][4312][INFO ] Checkpoint committed!
....

After the above restart, crunching lasted about 70 mins before the successful completion of the first sub-task was recorded.

The thing that caught my eye was the 944 MB used (81 MB free) at the start of GPU setup, followed by 957 MB used (68 MB free) when GPU setup was complete, and the fact that the application itself would only be using 13 MB. I know nothing about what these values really mean or how they are derived, but the difference compared with what you see in normally running tasks might be related to why the task was running slowly here.

The successful completion of this stage was announced by the line - "[01:51:46][4312][INFO ] Data processing finished successfully!". If you add up the elapsed times of all the processing to this point, you get a total of around 109 mins. A 'normal' task takes perhaps 135-140 mins so 110 mins for a half task is rather slow. It would appear that the slow crunching was occurring in this stage where the curious memory values were reported.

Following the above, the 2nd sub-task commenced and crunching progressed for a further 4 hours without completing. At that point the task was aborted. Crunching was obviously very slow, but the task would probably have completed successfully if left alone and not aborted.

Obviously, slow crunching like this is not acceptable. It would appear the cause is not the app itself. It very much looks like something else is causing the problem and the app is simply being prevented from running at full speed. Perhaps it's something uncommon, since most of your tasks do seem to run at the proper speed.

Unfortunately, you'll have to provide more information or do some experiments to work out what that might be. You mentioned suspending and resuming tasks. Were you suspending in order to run something GPU-intensive? If so, are you sure the GPU memory is not still tied up in some way when you restart crunching?
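
For what it's worth, nvidia-smi (it comes with the NVIDIA driver on both Windows and Linux) gives a quick way of checking that just before you resume crunching. A sketch only:

nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
nvidia-smi      # the process table at the bottom shows what is currently using the GPU

Comparing those numbers with the "Used in total / free" figures the app logs at startup might show whether something outside BOINC was holding the memory when that 944 MB figure appeared.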

Cheers,
Gary.
