Well, not literally, since I saw it happening and restarted the BOINC client... but they were "stuck" between 25% and 35% (i.e. the % counter was not moving) after more than 24 hours had elapsed, with the 'Time Left' column growing tick-for-tick with the 'Elapsed' column. I'm running 3 other projects, and none of those has exhibited this behavior. After restarting the client, 'Elapsed' dropped back to match the CPU time values, and 'Time Left' showed a more reasonable 4 hours and change instead of the 29+ hours it had grown to before the restart.
Anyone else noticing similar, or should I just chalk it up to not having rebooted that machine for a few weeks?
BRP3SSE work units running 'forever'
Hard to guess the reason without further information. Could you post (or send) the stderr.txt of the task, at least the last 25 lines or so? In case you already aborted it, please report it and post the task ID.
BM
There is no stderr.txt file in /usr/bin (where the BOINC program files are), nor in /var/log (where the BOINC log files are), nor in /var/lib/boinc (its data directory, and where I would most expect to find stderr). Still, that's not surprising to me, since no error was detected on BOINC's part. Those 2 work units just 'hung' between 25% and 35%, while work units on the other 2 cores, from the Einstein, WCG and Docking projects, kept running and completing.
I didn't document which work units they were at the time... I just noticed they had been running over 24 hours, the percentage was not incrementing on those 2 at all, and the 'Time Left' column was increasing tick-for-tick with the 'Elapsed' column, while the CPU time in the task Properties read only around 4 hours (with 'Elapsed' over 24 hours and 'Time Left' over 29 hours). I suspended and resumed them a couple of times to no effect, so finally I ran
# service boinc-client restart
When the client restarted and everything reappeared in the manager, the 'Elapsed' figures had dropped to match the CPU times and the percentage columns were incrementing again, with 'Time Left' estimating a little over an hour to go on the two problem work units.
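For next time, a rough checklist before restarting (just a sketch; it assumes boinccmd is installed alongside the client and the data directory is /var/lib/boinc, as on this box):

# dump the client's view of every task: name, state, fraction done, CPU time
boinccmd --get_tasks
# check when each slot's stderr was last written; a stalled task's file goes stale
ls -l /var/lib/boinc/slots/*/stderr.txt
# only then restart the client
service boinc-client restart

That way there would at least be a record of which tasks were stuck and which slots they were in.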
From the job_log file, I'm pretty sure it was two of these four, but I did not note which two it was at the time. Sorry.
I didn't see anyone else complaining about it, so I thought I would post my observations; if others are seeing the same thing, then we can worry about forensics.
Thanks for your reply, anyway.
As long as a task is running (or waiting to run), its stderr file is in the respective slot directory (beneath the data directory). For tasks ready to report, you'll find the stderr output in the client_state file (in the data directory).
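For example (only a sketch, assuming the /var/lib/boinc data directory from your earlier post; the slot number will vary):

# stderr of a running task sits in its slot directory
cat /var/lib/boinc/slots/*/stderr.txt
# for tasks ready to report, the output is embedded in client_state.xml,
# inside <stderr_txt>...</stderr_txt> within the matching <result> block
grep -A 25 '<stderr_txt>' /var/lib/boinc/client_state.xml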
Regards,
Gundolf
[edit]For reported tasks, you can find it online, in the "All tasks for..." lists, by clicking the link in the "Task ID" column (222617138, for instance).[/edit]
Computers aren't everything in life. (Just a little joke.)
OK... I have 2 more of these, on different machines, both quad-cores, one AMD and one Intel, so it's not processor-dependent.
From Messages on the AMD machine:
It would be nice if BOINC's Messages tab noted which slot it was, but the Properties page for that workunit on the Tasks tab says Slot 2.
Slot 2's stderr.txt has not been updated for just about 31 hours.
PM0058_02071.dm_348 is the workunit name on the Tasks tab, but stderr.txt for Slot 2 indicates the application couldn't find status.cpt for DM756, then appears to start DM761 (i.e. no mention of DM348 in stderr.txt).
So... curious, huh.
PM0058_02071.dm_348_0 is the workunit stalled at 33.006% (currently 5:45:30 CPU time; 29:33:30 Elapsed; 41:57:20 To Completion, climbing tick-for-tick with Elapsed). The other 3 cores are all crunching away with no problems... one of them is workunit PM0058_025B1.dm_96_1, by the way, which is at 76% after ~12 hours, with CPU and Elapsed times differing by about 2.5 minutes.
While collecting this info, the 'stalled' workunit incremented to 33.007%, but I just looked at the Tasks tab again and its Progress column has decremented back to 33.006%. I will go collect the same data on the Intel machine now.
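Before restarting this one, it's worth checking whether the science app is still accumulating CPU time or just sitting there. A rough sketch (the grep pattern is only a guess at the app's process name; adjust after looking at the ps output):

# is the task's process accumulating CPU time at all?
ps -eo pid,etime,time,pcpu,comm | grep -i einstein
# when were the slot's checkpoint and stderr files last touched?
stat -c '%y  %n' /var/lib/boinc/slots/2/*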
Well, this one is not a BRP3SSE work unit.
Here are some clips from the Messages tab that might be relevant (or not), and the last half or so of stderr.txt from Slot 3:
2011-03-11 13:00:26.0991 (11052) [normal]: Reading input data ... done.
% --- GPS reference time = 847063082.5000 , GPS data mid time = 847063082.5000
% --- Setup, N = 205, T = 90000s, Tobs = 56435059s, gammaRefine = 1399.000000
2011-03-11 13:00:55.9748 (11052) [normal]: INFO: No checkpoint h1_1475.35_S5R4__1112_S5GC1HFa_2_0.cpt found - starting from scratch
% --- Cpt:0, total:834, sky:1/139, f1dot:1/6
2011-03-11 13:00:55.9811 (11052) [normal]: 1/1
% --- CG:9881 FG:10423949 f1dotmin_fg:-2.931052841924e-09 df1dot_fg:4.128328706926e-13
2011-03-11 13:01:24.5713 (11052) [normal]: 1/2
c
2011-03-11 13:01:53.2790 (11052) [normal]: 1/3
2011-03-11 13:02:22.0603 (11052) [normal]: 1/4
2011-03-11 13:02:50.6205 (11052) [normal]: 1/5
c
2011-03-11 13:03:19.1333 (11052) [normal]: 1/6
2011-03-11 13:03:47.5268 (11052) [normal]: 2/1
2011-03-11 13:04:16.2886 (11052) [normal]: 2/2
c
2011-03-11 13:04:45.1018 (11052) [normal]: 2/3
2011-03-11 13:05:13.9517 (11052) [normal]: 2/4
2011-03-11 13:05:42.5781 (11052) [normal]: 2/5
c
2011-03-11 13:06:11.4893 (11052) [normal]: 2/6
2011-03-11 13:06:40.0906 (11052) [normal]: 3/1
2011-03-11 13:07:08.8550 (11052) [normal]: 3/2
c
2011-03-11 13:07:38.6075 (11052) [normal]: 3/3
2011-03-11 13:08:07.8440 (11052) [normal]: 3/4
2011-03-11 13:08:36.6970 (11052) [normal]: 3/5
c
2011-03-11 13:09:05.4515 (11052) [normal]: 3/6
2011-03-11 13:09:34.5012 (11052) [normal]: 4/1
2011-03-11 13:10:03.4345 (11052) [normal]: 4/2
c
2011-03-11 13:10:31.7881 (11052) [normal]: 4/3
2011-03-11 13:11:00.6609 (11052) [normal]: 4/4
2011-03-11 13:11:29.4923 (11052) [normal]: 4/5
c
2011-03-11 13:11:58.9732 (11052) [normal]: 4/6
2011-03-11 13:12:28.1387 (11052) [normal]: 5/1
2011-03-11 13:12:57.5720 (11052) [normal]: 5/2
c
2011-03-11 13:13:26.8594 (11052) [normal]: 5/3
2011-03-11 13:13:56.1557 (11052) [normal]: 5/4
2011-03-11 13:14:25.2750 (11052) [normal]: 5/5
c
2011-03-11 13:14:54.5280 (11052) [normal]: 5/6
2011-03-11 13:15:23.5438 (11052) [normal]: 6/1
2011-03-11 13:15:52.8144 (11052) [normal]: 6/2
c
2011-03-11 13:16:21.5323 (11052) [normal]: 6/3
2011-03-11 13:16:49.8570 (11052) [normal]: 6/4
2011-03-11 13:17:18.3290 (11052) [normal]: 6/5
c
[275 lines snipped]
2011-03-11 14:57:49.3414 (11052) [normal]: 41/3
2011-03-11 14:58:18.6670 (11052) [normal]: 41/4
2011-03-11 14:58:48.1837 (11052) [normal]: 41/5
c
2011-03-11 14:59:17.4339 (11052) [normal]: 41/6
2011-03-11 14:59:46.6041 (11052) [normal]: 42/1
2011-03-11 15:00:15.1691 (11052) [normal]: 42/2
c
2011-03-11 15:00:44.6474 (11052) [normal]: 42/3
17:52:58 (11052): No heartbeat from core client for 30 sec - exiting
[New Thread 0x636b70 (LWP 11058)]
warning: .dynamic section for "/lib/libgcc_s.so.1" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
So, different causes, same symptom?
This workunit is stalled at 29.736% (CPU Time 1:58:05; Elapsed 29:21:20; To Completion 45:18:13), and the Slot 3 stderr.txt has not been updated for 32.5 hours.
edit1: Forgot to mention that, despite the stderr.txt message above, h1_1475.35_S5R4__1112_S5GC1HFa_2_0.cpt does indeed exist in the same subdir as the stderr.txt file.
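By the way, the 'No heartbeat from core client for 30 sec' line just means the science app lost contact with the client for 30 seconds and exited, whenever that happened; the three lines after it look like output from a debugger being attached. For anyone who wants to poke at a stalled task in place before restarting anything, something like this should work (PID 11052 taken from the log above; assumes gdb and/or strace are installed and are run as root or the boinc user):

# grab a backtrace of every thread, then detach and let the task continue
gdb -p 11052 -batch -ex 'thread apply all bt'
# or watch its system calls for a few seconds (Ctrl-C to stop)
strace -p 11052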