Immediate Computation error with Gravitational Wave search O1 all-sky tuning v1.00

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7338121687
RAC: 2293546

I browsed around a little among quorum partners, and soon found that my Stoll8 host is not alone in having the new (to me) 114 (0x72) exit status with data gap or overlap.

somebody else error task 1
somebody else error task 2
somebody else error task 3

These three are all from one host owned by user Thomander.

Also, a single WU has generated this type of error on at least three different hosts:
same WU on host 1
same WU on host 2
same WU on host 3

Possibly this hints that a batch of WUs was formed in a way that is incompatible with at least a subset of currently active hosts.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

archae86, see Christian's second bullet point in the message right before your two consecutive messages about error 114. It's because of a misconfiguration in the work generator that has now been fixed. Already generated tasks have to go through the system but should clear pretty quickly.

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7338121687
RAC: 2293546

Holmis, regarding error 114: I see now. Unfortunately it is too late to edit my useless posts.

Thanks.

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7338121687
RAC: 2293546

Quote:
* The problem with missing result files was also addressed and should not happen for tasks that are sent out today.

The first of my three trial tuning tasks run today showed seemingly normal progress for ten hours, paused briefly at the indicated 99.000% completion point, then errored.

Perhaps this is just the known missing result file problem.
The WU was created 11 Feb 2016, 12:56:55 UTC
The Task was created 12 Feb 2016, 8:59:41 UTC
Here is some text from the end of stderr:

upload failure: 
  h1_0029.00_O1C01Cl1In1__O1AS20-100T_29.05Hz_246_3_1
  -161 (not found)

h1_0029.00_O1C01Cl1In1__O1AS20-100T_29.05Hz_246_3_2
-161 (not found)


Here is a link to the task page

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7338121687
RAC: 2293546

The second of my three hosts to finish a trial tuning job obtained after the first batch of fixes went in ran to 99% with normal-looking progress, then hung at exactly 99.000% claimed completion for a little under five minutes, then completed with error status, with two files indicated as "not found":

upload failure: 
  h1_0021.80_O1C01Cl1In1__O1AS20-100T_21.85Hz_171_2_1
  -161 (not found)

h1_0021.80_O1C01Cl1In1__O1AS20-100T_21.85Hz_171_2_2
-161 (not found)


Over on the Technical News thread both Betreger and robl have reported failures that look similar, and I found more just by back-tracing quorum partners to find hosts actually running this beta. What I did not find was any apparently successful completion yet, though it is early days. For myself, I plan to wait out the weekend before trying again, unless I see a post indicating better hope.
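For anyone who wants to check saved stderr text for the same symptom, here is a minimal sketch in Python. It assumes the layout seen in the excerpts above (an "upload failure:" line followed by a file name line and a code line such as "-161 (not found)"); the function name `parse_upload_failures` is made up for illustration and is not part of BOINC or the Einstein@Home app.

```python
import re

# Matches a status line such as "-161 (not found)".
CODE_LINE = re.compile(r"^(-?\d+)\s+\((.+)\)$")


def parse_upload_failures(stderr_text):
    """Return a list of (file_name, code, reason) tuples found after
    an 'upload failure:' line. Assumes the layout quoted in this thread."""
    failures = []
    in_block = False
    pending_name = None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if line.lower().startswith("upload failure"):
            in_block = True
            pending_name = None
            continue
        if not in_block or not line:
            continue
        match = CODE_LINE.match(line)
        if match and pending_name is not None:
            # A status line completes the entry started by the name line.
            failures.append((pending_name, int(match.group(1)), match.group(2)))
            pending_name = None
        elif match is None:
            # Anything that is not a status line is treated as a file name.
            pending_name = line
    return failures
```

Calling it on the text of a failed task's stderr gives one tuple per missing file, which makes it easy to compare failures across hosts.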

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 197248480
RAC: 55462

The problem with the missing result files persisted for an unknown reason. I stopped distribution of O1AS20-100T tasks again until we can assess the situation on Monday.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Yes, my first task just errored out

Here is the link

https://einsteinathome.org/task/545187574

and here is the last portion of the stderr

Quote:


2016-02-13 02:28:45.4898 (5888) [normal]: Finished main analysis.
2016-02-13 02:28:45.4898 (5888) [normal]: Recalculating statistics for the final toplist...
2016-02-13 02:37:09.0030 (4548) [normal]: This program is published under the GNU General Public License, version 2
2016-02-13 02:37:09.0030 (4548) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2016-02-13 02:37:09.0030 (4548) [normal]: This Einstein@home App was built at: Feb 11 2016 16:21:10

2016-02-13 02:37:09.0030 (4548) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_O1AS20-100T_1.01_windows_intelx86__SSE2.exe'.
Activated exception handling...
2016-02-13 02:37:09.0030 (4548) [debug]: Flags: LAL_NDEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, i386, SSE, SSE2, GNUC X86 GNUX86
2016-02-13 02:37:09.0030 (4548) [debug]: Set up communication with graphics process.
command line: projects/einstein.phys.uwm.edu/einstein_O1AS20-100T_1.01_windows_intelx86__SSE2.exe --refTime=1132729647.5 --Freq=94.95 --FreqBand=0.05 --dFreq=8.271845945e-07 --f1dot=-2.64248266531e-09 --f1dotBand=2.9e-09 --df1dot=1.3366502e-11 --gammaRefine=100 --computeBSGL --BSGLlogcorr=0 --Fstar0=65.826 --oLGX=0.001,0.001 --nCand1=10000 --SortToplist=6 --recalcToplistStats=1 -o ../../projects/einstein.phys.uwm.edu/h1_0094.85_O1C01Cl1In1__O1AS20-100T_94.95Hz_2478_4_0 --printCand1 --semiCohToplist --ephemE=../../projects/einstein.phys.uwm.edu/earth00-19-DE405.dat --ephemS=../../projects/einstein.phys.uwm.edu/sun00-19-DE405.dat --segmentList=../../projects/einstein.phys.uwm.edu/O1AS20-100T.seg --FstatMethod=ResampBest --FstatMethodRecalc=DemodBest --numSkyPartitions=2479 --partitionIndex=2478 --gridType=3 --skyGridFile=../../projects/einstein.phys.uwm.edu/skygrid_GC_m0.001_0095Hz_O1AS20-100.dat --loudestSegOutput --getMaxFperSeg --peakThrF=2.6 --DataFiles1=..\..\projects\einstein.phys.uwm.edu\h1_0094.85_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\l1_0094.85_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\h1_0094.90_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\l1_0094.90_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\h1_0094.95_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\l1_0094.95_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\h1_0095.00_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\l1_0095.00_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\h1_0095.05_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\l1_0095.05_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\h1_0095.10_O1C01Cl1In1;..\..\projects\einstein.phys.uwm.edu\l1_0095.10_O1C01Cl1In1
Code-version: %% LAL: 6.15.2.1 (CLEAN 213c43dd3817dc12ef5ecc9501ed749773e3e921)
%% LALPulsar: 1.12.0.1 (CLEAN 213c43dd3817dc12ef5ecc9501ed749773e3e921)
%% LALApps: 6.17.1.1 (CLEAN 213c43dd3817dc12ef5ecc9501ed749773e3e921)

2016-02-13 02:37:09.4398 (4548) [normal]: Reading input data ... 2016-02-13 02:37:15.5707 (4548) [normal]: Search FstatMethod used: 'ResampGeneric'
2016-02-13 02:37:15.5707 (4548) [normal]: Recalc FstatMethod used: 'DemodSSE'
2016-02-13 02:37:16.4911 (4548) [normal]: Number of segments: 12, total number of SFTs in segments: 4744
done.
% --- GPS reference time = 1132729647.5000 , GPS data mid time = 1132729647.5000
2016-02-13 02:37:16.4911 (4548) [normal]: dFreqStack = 8.271846e-007, df1dot = 1.336650e-011, df2dot = 0.000000e+000, df3dot = 0.000000e+000
% --- Setup, N = 12, T = 755999 s, Tobs = 9044863 s, gammaRefine = 100, gamma2Refine = 484, gamma3Refine = 1
2016-02-13 02:37:16.8655 (4548) [CRITICAL]: Checksum error: -6272615
% --- Cpt:25506, total:25506, sky:118/117, f1dot:1/218

2016-02-13 02:37:16.8655 (4548) [normal]: Finished main analysis.
2016-02-13 02:37:16.8655 (4548) [normal]: Recalculating statistics for the final toplist...
XLAL Error - XLALComputeExtraStatsForToplist (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/MINGW32/TARGET/windows-x32/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:51): Input toplist has zero length.
XLAL Error - XLALComputeExtraStatsForToplist (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/MINGW32/TARGET/windows-x32/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:51): Inconsistent or invalid vector length
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/MINGW32/TARGET/windows-x32/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1811): XLALComputeExtraStatsForToplist() failed with xlalErrno = 129.

XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/MINGW32/TARGET/windows-x32/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1811): Invalid pointer
2016-02-13 02:37:16.8655 (4548) [CRITICAL]: ERROR: MAIN() returned with error '-1'
FPU status flags: COND_2 PRECISION
2016-02-13 02:37:16.8655 (4548) [normal]: done. calling boinc_finish(-1).
02:37:16 (4548): called boinc_finish

1 more in progress but I think it will probably do the same.

Thanks Christian

Zalster

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 197248480
RAC: 55462

Zalster: The result log you quoted is a problem on your host, or at least a very unlucky coincidence. The app was preempted just while creating the output files. When it restarted it couldn't resume from this state, so it failed. I'll send this along to the app developers and they'll look into making this more robust.
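As a general illustration of what "more robust" can mean for output files, one common pattern is to write to a temporary file and rename it into place only once it is complete, so an interruption never leaves a half-written result behind. This is a generic sketch in Python, not the Einstein@Home app's actual code or planned fix; the function name `write_output_atomically` is hypothetical.

```python
import os
import tempfile


def write_output_atomically(path, data):
    """Write bytes to `path` so that readers see either the old file
    or the complete new one, never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # replaces any existing file in one step
    except BaseException:
        # Clean up the partial temporary file, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, a program that is stopped mid-write and restarted can simply redo the write, because the final file name only ever points at a finished file.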

Anonymous

I will repost here since my first post was in the wrong thread:

FYI:

I had an "01" job with runtime/cputime ~38500

It failed with:

116.......c
.....................................c
................................c
.................................c
..................................c
..............................c
.....................................c
........
2016-02-11 22:19:22.1356 (19492) [normal]: Finished main analysis.
2016-02-11 22:19:22.1356 (19492) [normal]: Recalculating statistics for the final toplist...
2016-02-11 22:21:58.9230 (19492) [normal]: Finished recalculating toplist statistics.
2016-02-11 22:21:58.9230 (19492) [debug]: Writing output ... toplist2 ... toplist3 ... done.
FPU status flags: COND_3 PRECISION
2016-02-11 22:21:59.6677 (19492) [normal]: done. calling boinc_finish(0).
22:21:59 (19492): called boinc_finish

upload failure:
h1_0024.55_O1C01Cl1In1__O1AS20-100T_24.6Hz_171_1_1
-161 (not found)

h1_0024.55_O1C01Cl1In1__O1AS20-100T_24.6Hz_171_1_2
-161 (not found)


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4341
Credit: 252607043
RAC: 35452

Quote:
The problem with the missing result files persisted for an unknown reason. I stopped distribution of O1AS20-100T tasks again until we can assess the situation on Monday.

This should be fixed with the new app versions 1.02 that I built and published yesterday. The remaining "tasks to send" will be distributed, and more work will be created on Monday.

BM
