BRP3SSE work units running 'forever'

ZoSo
ZoSo
Joined: 2 Apr 10
Posts: 14
Credit: 6182189
RAC: 0
Topic 195695

Well, not literally, since I saw it happening and restarted the boinc client... but they were "stuck" around 25% to 35% (i.e. the % counter was not moving) and over 24 hours had elapsed, with the 'time left' growing click-for-click with the 'elapsed' column. I'm running 3 other projects, and none of those have exhibited this behavior. After restarting the client they dropped back to the CPU time values for 'elapsed' and Time Left was a more-reasonable 4 hours and something instead of the 29 hours+ they had grown to before restarting boinc.

Anyone else noticing similar, or should I just chalk it up to not having rebooted that machine for a few weeks?

Bernd Machenschalk
Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4343
Credit: 252788259
RAC: 41097

BRP3SSE work units running 'forever'

Hard to guess the reason without further information. Could you post / send / whatever stderr.txt of the task, at least the last 25 lines or so? In case you already aborted it, please report it and post the id.

BM

BM

ZoSo
ZoSo
Joined: 2 Apr 10
Posts: 14
Credit: 6182189
RAC: 0

There is no stderr.txt file

There is no stderr.txt file in /usr/bin (where the boinc program files are), nor in /var/log (where the boinc.log files are), nor in /var/lib/boinc (its data dir, and where I would most-expect to find stderr). Still, that's not surprising to me, since there was no error detected on BOINC's part. They just 'hung' between 25% and 35% on those 2 work units, while work units on the other 2 cores from Einstein, WCG and Docking projects kept running/completing.

I didn't document which work units it was at the time... I just noticed they had been running over 24 hours, the percentage was not incrementing on those 2 at all, and the 'time left' column was increasing tick-for-tick with the 'elapsed' column... while the CPU time available in the task Properties was reading only around 4 hours (with the 'elapsed' over 24 hours and 'time left' over 29 hours). I suspended and resumed them a couple times to no effect, so finally I ran
# service boinc-client restart

When the client restarted and everything reappeared in the manager, the 'elapsed' column figures had dropped to match the CPU times and the percentage columns were incrementing again, with 'time left' estimating a little over one hour to go on the two problem work units.

From the job_log file, I'm pretty sure it was two of these four

1299495171 ue 79755.469396 ct 63270.820000 fe 140000000000000 nm PM0054_03331.dm_272_1 et 64450.021431
1299562171 ue 79755.469396 ct 58245.160000 fe 140000000000000 nm PM0055_004D1.dm_256_0 et 58882.905005
1299676812 ue 79755.469396 ct 60107.060000 fe 140000000000000 nm PM0056_00131.dm_308_1 et 63890.129445
1299691887 ue 79755.469396 ct 57603.500000 fe 140000000000000 nm PM0055_010C1.dm_276_0 et 59049.863559


but I did not note which two it was at the time. Sorry.

I didn't see anyone else complaining about it, so I thought I would post my observations and maybe if others were seeing the same thing then we could worry about forensics.

Thanks for your reply, anyway.

Gundolf Jahn
Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

As long as the tasks are

As long as the tasks are running (or waiting to run), the stderr file is in the respective slot directory (beneath the data directory), for tasks ready to report, you'll find the stderr output in the client_state file (in the data directory).

Gruß,
Gundolf
[edit]For reported tasks, you find it online, in the "All tasks for..." lists, by clicking on the link in the "Task ID" column (222617138 for instance).[/edit]

Computer sind nicht alles im Leben. (Kleiner Scherz)

ZoSo
ZoSo
Joined: 2 Apr 10
Posts: 14
Credit: 6182189
RAC: 0

OK... I have 2 more of these,

OK... I have 2 more of these, on different machines, both quad cores; 1 AMD and 1 intel, so it's not processor dependent.

From Messages on the AMD machine:

Fri 11 Mar 2011 06:23:49 AM EST	Einstein@Home	Started download of PM0058_02071_348.binary
Fri 11 Mar 2011 06:23:49 AM EST	Einstein@Home	Started download of PM0058_02071_349.binary
Fri 11 Mar 2011 06:23:56 AM EST	Einstein@Home	Finished upload of h1_1451.25_S5R4__931_S5GC1HFa_1_0
Fri 11 Mar 2011 06:27:55 AM EST	Einstein@Home	Finished download of PM0058_02071_349.binary
Fri 11 Mar 2011 06:27:55 AM EST	Einstein@Home	Started download of PM0058_02071_350.binary
Fri 11 Mar 2011 06:28:28 AM EST	Einstein@Home	Finished download of PM0058_02071_348.binary
Fri 11 Mar 2011 06:28:28 AM EST	Einstein@Home	Started download of PM0058_02071_351.binary
Fri 11 Mar 2011 06:29:46 AM EST	Einstein@Home	Finished download of PM0058_02071_350.binary
Fri 11 Mar 2011 06:31:03 AM EST	Einstein@Home	Finished download of PM0058_02071_351.binary
Fri 11 Mar 2011 06:50:49 AM EST	Docking	Computation for task 1iiq1hps_mod0014crossdockinghiv1_2391_466717_0 finished
Fri 11 Mar 2011 06:50:49 AM EST	Einstein@Home	Starting PM0058_02071.dm_348_0
Fri 11 Mar 2011 06:50:49 AM EST	Einstein@Home	Starting task PM0058_02071.dm_348_0 using einsteinbinary_BRP3 version 105
F

Would be nice if BOINC's Messages tab noted which slot it was. But the Properties page for that workunit on the Tasks tab says Slot 2.

Slot 2's stderr.txt has not been updated for just about 31 hours.

PM0058_02071.dm_348 is the workunit name on the Tasks tab, but stderr.txt for Slot 2 indicates the application couldn't find status.cpt for DM756, then appears to start DM761 (i.e. no mention of DM348 in stderr.txt).

------> Original WAPP file: /BOINC/projects/EinsteinAtHome/temp_working/BRP3/PM0058_02071/PM0058_02071_DM756.00
------> Sample time in microseconds: 500
------> Observation time in seconds: 2098.793
------> Time stamp (MJD): 51033.133309443452
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 143947.465401
------> DEC (J2000): -584616.194
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4814506
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 4194304
------> Trial dispersion measure: 756 cm^-3 pc
------> Scale factor: 14.9412
[06:50:52][31603][INFO ] Seed for random number generator is 1055907775.
[06:50:53][31603][INFO ] Derived global search parameters:
------> f_A probability = 0.08
------> single bin prob(P_noise > P_thr) = 1.32531e-08
------> thr1 = 18.139
------> thr2 = 21.241
------> thr4 = 26.2686
------> thr8 = 34.6478
------> thr16 = 48.9581
[06:51:51][31603][INFO ] Checkpoint committed!
[280 checkpoints - about 1/minute - snipped]
[11:28:11][31603][INFO ] Checkpoint committed!
[11:29:12][31603][INFO ] Checkpoint committed!
[11:29:47][31603][INFO ] Data processing finished successfully!
[11:29:47][31603][INFO ] Starting data processing...
[11:29:47][31603][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[11:29:47][31603][INFO ] Header contents:
------> Original WAPP file: /BOINC/projects/EinsteinAtHome/temp_working/BRP3/PM0058_02071/PM0058_02071_DM761.00
------> Sample time in microseconds: 500
------> Observation time in seconds: 2098.793
------> Time stamp (MJD): 51033.133309339057
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 143947.465401
------> DEC (J2000): -584616.194
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4814506
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 4194304
------> Trial dispersion measure: 761 cm^-3 pc
------> Scale factor: 14.9412
[11:29:50][31603][INFO ] Seed for random number generator is -1061249734.
[11:29:51][31603][INFO ] Derived global search parameters:
------> f_A probability = 0.08
------> single bin prob(P_noise > P_thr) = 1.32531e-08
------> thr1 = 18.139
------> thr2 = 21.241
------> thr4 = 26.2686
------> thr8 = 34.6478
------> thr16 = 48.9581
[11:30:13][31603][INFO ] Checkpoint committed!
[85 checkpoints snipped]
[12:57:35][31603][INFO ] Checkpoint committed!
13:53:54 (31603): No heartbeat from core client for 30 sec - exiting

So... curious, huh.
PM0058_02071.dm_348_0 is the workunit name that is stalled at 33.006% (currently 5:45:30 CPU time; 29:33:30 Elapsed; 41:57:20 To Completion and climbing tick-for-tick with Elapsed). The other 3 cores are all crunching away, no problems... one of them is workunit PM0058_025B1.dm_96_1, by the way, which is at 76% after ~12 hours, with CPU and Elapsed time differing by about 2.5 minutes.

While collecting this info the 'stalled' workunit incremented to 33.007%, but I just looked at the Tasks tab again and its Progress column has decremented back to 33.006%. I will go collect these same data on the intel machine, now.

ZoSo
ZoSo
Joined: 2 Apr 10
Posts: 14
Credit: 6182189
RAC: 0

Well, this one is not a

Well, this one is not a BRP3SSE work unit.

Here are some clips from the Messages tab that might be relevant (or not)

Fri 11 Mar 2011 01:00:03 PM EST	Einstein@Home	Starting h1_1475.35_S5R4__1112_S5GC1HFa_2
Fri 11 Mar 2011 01:00:05 PM EST	Einstein@Home	Starting task h1_1475.35_S5R4__1112_S5GC1HFa_2 using einstein_S5GC1HF version 107
Fri 11 Mar 2011 06:24:19 PM EST	Einstein@Home	Sending scheduler request: To fetch work.
Fri 11 Mar 2011 06:24:19 PM EST	Einstein@Home	Requesting new tasks
Fri 11 Mar 2011 06:24:35 PM EST	Einstein@Home	Scheduler request completed: got 1 new tasks
Fri 11 Mar 2011 06:24:35 PM EST	Einstein@Home	Got server request to delete file h1_1475.35_S5R4
Fri 11 Mar 2011 06:24:35 PM EST	Einstein@Home	Got server request to delete file l1_1475.35_S5R4
Fri 11 Mar 2011 06:24:35 PM EST	Einstein@Home	Got server request to delete file l1_1475.35_S5R7
Fri 11 Mar 2011 06:24:35 PM EST	Einstein@Home	Got server request to delete file h1_1475.35_S5R7
Fri 11 Mar 2011 06:24:37 PM EST	Einstein@Home	Started download of h1_1475.95_S5R4
Fri 11 Mar 2011 06:24:37 PM EST	Einstein@Home	Started download of h1_1475.95_S5R7
Fri 11 Mar 2011 06:32:48 PM EST	Einstein@Home	Finished download of h1_1475.95_S5R7
Fri 11 Mar 2011 06:32:48 PM EST	Einstein@Home	Started download of l1_1475.95_S5R4
Fri 11 Mar 2011 06:33:06 PM EST	Einstein@Home	Finished download of h1_1475.95_S5R4
Fri 11 Mar 2011 06:33:06 PM EST	Einstein@Home	Started download of l1_1475.95_S5R7
Fri 11 Mar 2011 06:38:56 PM EST	Einstein@Home	Finished download of l1_1475.95_S5R7
Fri 11 Mar 2011 06:40:24 PM EST	Einstein@Home	Finished download of l1_1475.95_S5R4
Fri 11 Mar 2011 07:38:50 PM EST	World Community Grid	Computation for task c4cw_target03_067535919_0 finished
Fri 11 Mar 2011 07:38:50 PM EST	World Community Grid	Restarting task c4cw_target03_067535672_0 using c4cw version 640
Fri 11 Mar 2011 07:38:52 PM EST	World Community Grid	Started upload of c4cw_target03_067535919_0_0
Fri 11 Mar 2011 07:39:01 PM EST	World Community Grid	Finished upload of c4cw_target03_067535919_0_0
Fri 11 Mar 2011 07:59:57 PM EST	Docking	Sending scheduler request: To fetch work.
Fri 11 Mar 2011 07:59:57 PM EST	Docking	Reporting 1 completed tasks, requesting new tasks
Fri 11 Mar 2011 08:00:13 PM EST	Docking	Scheduler request completed: got 1 new tasks
Fri 11 Mar 2011 08:00:15 PM EST	Docking	Started download of 1iiq1hps_mod0014crossdockinghiv1_19484_404357.inp
Fri 11 Mar 2011 08:01:20 PM EST	Docking	Finished download of 1iiq1hps_mod0014crossdockinghiv1_19484_404357.inp
Fri 11 Mar 2011 08:03:19 PM EST	Einstein@Home	Sending scheduler request: To fetch work.
Fri 11 Mar 2011 08:03:19 PM EST	Einstein@Home	Requesting new tasks
Fri 11 Mar 2011 08:03:30 PM EST	Einstein@Home	Scheduler request completed: got 1 new tasks
Fri 11 Mar 2011 08:08:19 PM EST	Einstein@Home	Resuming task h1_1475.35_S5R4__1112_S5GC1HFa_2 using einstein_S5GC1HF version 107
Fri 11 Mar 2011 08:13:30 PM EST	World Community Grid	update requested by user
Fri 11 Mar 2011 08:13:33 PM EST	World Community Grid	Sending scheduler request: Requested by user.
Fri 11 Mar 2011 08:13:33 PM EST	World Community Grid	Reporting 4 completed tasks, not requesting new tasks
Fri 11 Mar 2011 08:13:51 PM EST	World Community Grid	Scheduler request completed
Fri 11 Mar 2011 08:23:21 PM EST	Docking	Computation for task 1iiq1hps_mod0014crossdockinghiv1_15860_290197_0 finished
Fri 11 Mar 2011 08:23:21 PM EST	Docking	Resuming task 1iiq1hps_mod0014crossdockinghiv1_18029_428779_0 using charmm34 version 623

and the last half or so of stderr.txt from Slot 3:

2011-03-11 13:00:26.0029 (11052) [debug]: Flags: LAL_NDEBUG, HS_OPTIMIZATION, i386, SSE, SSE2, GNUC
2011-03-11 13:00:26.0029 (11052) [debug]: glibc version/release: 2.11.2/stable
2011-03-11 13:00:26.0031 (11052) [debug]: Set up communication with graphics process.
Code-version: %% LAL: 6.4.1.4 (CLEAN 3100deade3960bdf05a7f40afb704450505f5461)
%% LALPulsar: 1.0.0.4 (CLEAN 3100deade3960bdf05a7f40afb704450505f5461)
%% LALApps: 6.4.0.4 (CLEAN 3100deade3960bdf05a7f40afb704450505f5461)

2011-03-11 13:00:26.0991 (11052) [normal]: Reading input data ... done.
% --- GPS reference time = 847063082.5000 , GPS data mid time = 847063082.5000
% --- Setup, N = 205, T = 90000s, Tobs = 56435059s, gammaRefine = 1399.000000
2011-03-11 13:00:55.9748 (11052) [normal]: INFO: No checkpoint h1_1475.35_S5R4__1112_S5GC1HFa_2_0.cpt found - starting from scratch
% --- Cpt:0, total:834, sky:1/139, f1dot:1/6
2011-03-11 13:00:55.9811 (11052) [normal]: 1/1
% --- CG:9881 FG:10423949 f1dotmin_fg:-2.931052841924e-09 df1dot_fg:4.128328706926e-13
2011-03-11 13:01:24.5713 (11052) [normal]: 1/2
c
2011-03-11 13:01:53.2790 (11052) [normal]: 1/3
2011-03-11 13:02:22.0603 (11052) [normal]: 1/4
2011-03-11 13:02:50.6205 (11052) [normal]: 1/5
c
2011-03-11 13:03:19.1333 (11052) [normal]: 1/6
2011-03-11 13:03:47.5268 (11052) [normal]: 2/1
2011-03-11 13:04:16.2886 (11052) [normal]: 2/2
c
2011-03-11 13:04:45.1018 (11052) [normal]: 2/3
2011-03-11 13:05:13.9517 (11052) [normal]: 2/4
2011-03-11 13:05:42.5781 (11052) [normal]: 2/5
c
2011-03-11 13:06:11.4893 (11052) [normal]: 2/6
2011-03-11 13:06:40.0906 (11052) [normal]: 3/1
2011-03-11 13:07:08.8550 (11052) [normal]: 3/2
c
2011-03-11 13:07:38.6075 (11052) [normal]: 3/3
2011-03-11 13:08:07.8440 (11052) [normal]: 3/4
2011-03-11 13:08:36.6970 (11052) [normal]: 3/5
c
2011-03-11 13:09:05.4515 (11052) [normal]: 3/6
2011-03-11 13:09:34.5012 (11052) [normal]: 4/1
2011-03-11 13:10:03.4345 (11052) [normal]: 4/2
c
2011-03-11 13:10:31.7881 (11052) [normal]: 4/3
2011-03-11 13:11:00.6609 (11052) [normal]: 4/4
2011-03-11 13:11:29.4923 (11052) [normal]: 4/5
c
2011-03-11 13:11:58.9732 (11052) [normal]: 4/6
2011-03-11 13:12:28.1387 (11052) [normal]: 5/1
2011-03-11 13:12:57.5720 (11052) [normal]: 5/2
c
2011-03-11 13:13:26.8594 (11052) [normal]: 5/3
2011-03-11 13:13:56.1557 (11052) [normal]: 5/4
2011-03-11 13:14:25.2750 (11052) [normal]: 5/5
c
2011-03-11 13:14:54.5280 (11052) [normal]: 5/6
2011-03-11 13:15:23.5438 (11052) [normal]: 6/1
2011-03-11 13:15:52.8144 (11052) [normal]: 6/2
c
2011-03-11 13:16:21.5323 (11052) [normal]: 6/3
2011-03-11 13:16:49.8570 (11052) [normal]: 6/4
2011-03-11 13:17:18.3290 (11052) [normal]: 6/5
c
[275 lines snipped]
2011-03-11 14:57:49.3414 (11052) [normal]: 41/3
2011-03-11 14:58:18.6670 (11052) [normal]: 41/4
2011-03-11 14:58:48.1837 (11052) [normal]: 41/5
c
2011-03-11 14:59:17.4339 (11052) [normal]: 41/6
2011-03-11 14:59:46.6041 (11052) [normal]: 42/1
2011-03-11 15:00:15.1691 (11052) [normal]: 42/2
c
2011-03-11 15:00:44.6474 (11052) [normal]: 42/3
17:52:58 (11052): No heartbeat from core client for 30 sec - exiting
[New Thread 0x636b70 (LWP 11058)]
warning: .dynamic section for "/lib/libgcc_s.so.1" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations

So, different causes, same symptom?
Because this workunit is stalled at 29.736% (CPU Time 1:58:05; Elapsed 29:21:20; To Completion 45:18:13), the Slot 3 stderr.txt has not been updated for 32.5 hours.

edit1: Forgot to mention, that despite the stderr.txt message above, h1_1475.35_S5R4__1112_S5GC1HFa_2_0.cpt does indeed exist in the same subdir as the stderr.txt file.
____________

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.