Gravitational Wave search O1 all-sky tuning (O1AS20-100T)

Jonathan Jeckell

Joined: 11 Nov 04

Posts: 114

Credit: 1341974894

RAC: 914

13 Feb 2016 18:21:29 UTC

Message 136857

My Ubuntu Linux box has barfed on 3 of the 5 in its queue too (still processing the remaining 2). These things happen as we work out the bugs, but I was honestly hoping to be one of the first to help contribute to this.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117543430120

RAC: 35342239

13 Feb 2016 21:26:52 UTC

Message 136858 in response to message 136857

I found this list of tasks on one of your hosts (ID=12151761). At the time I saw it there were 3 compute errors out of 8 total - 5 still in progress. The interesting thing is that the original 5 tasks are listed as V1.00 and the 3 latest are listed as V1.02.

You should consider aborting the last 2 of the original 5 (the V1.00 ones) as they seem doomed to failure anyway. You could try one of the new ones to see if it works. If you follow the WU links for each of the new ones, you can see what has happened previously for those. This might give you some idea of your prospects for ultimate success with any of them. There is one WU that has 2 'in progress' tasks, but one of those is V1.00 (on 32bit Linux) and there is already a failed V1.00 task, also on 32bit Linux. The quorum I'm talking about is this one, but take a look at all three - it's good practice with the tools you can use when trying to work out what's going on.

Cheers,
Gary.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 316040836

RAC: 334433

14 Feb 2016 0:01:04 UTC

Message 136859

FWIW: this host on this result went belly up after an off-by-one-bin error while reading a data file.

Quote:

2016-02-12 02:46:29.5682 (30867) [normal]: Reading input data ... ERROR: data gap or overlap at first bin of SFT#0 (GPS 1128211934.000000) expected bin 90359, bin 90360 read from file '../../projects/einstein.phys.uwm.edu/h1_0050.20_O1C01Cl1In1'
XLAL Error - XLALLoadSFTs (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:882): I/O error
XLAL Error - XLALLoadMultiSFTsFromView (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1046): Failed to XLALLoadSFTs() for IFO X = 0

XLAL Error - XLALLoadMultiSFTsFromView (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1046): Internal function call failed: I/O error
XLAL Error - XLALLoadMultiSFTs (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1004): Check failed: ( multiSFTs = XLALLoadMultiSFTsFromView ( multiCatalogView, fMin, fMax )) != ((void *)0)
XLAL Error - XLALLoadMultiSFTs (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1004): Internal function call failed: I/O error
XLAL Error - XLALCreateFstatInput (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:405): Check failed: ( multiSFTs = XLALLoadMultiSFTs(SFTcatalog, minFreqFull, maxFreqFull) ) != ((void *)0)
XLAL Error - XLALCreateFstatInput (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:405): Internal function call failed: I/O error

i.e. the problem is possibly the supplied data file and not the app itself.
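For anyone who wants to pick the same detail out of their own stderr output, here is a minimal sketch in Python. It only parses the message text quoted above; the regular expression and the helper name `bin_mismatch` are my own illustration and are not part of LALSuite or the E@H app.

```python
import re

# Error line copied from the quoted stderr output above.
LOG_LINE = (
    "ERROR: data gap or overlap at first bin of SFT#0 (GPS 1128211934.000000) "
    "expected bin 90359, bin 90360 read from file "
    "'../../projects/einstein.phys.uwm.edu/h1_0050.20_O1C01Cl1In1'"
)

# Hypothetical pattern for this one message format (an assumption, not an official spec).
PATTERN = re.compile(r"expected bin (\d+), bin (\d+) read from file '([^']+)'")


def bin_mismatch(line):
    """Return (expected_bin, read_bin, filename), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


expected, read, name = bin_mismatch(LOG_LINE)
offset = read - expected  # difference between the bin found and the bin expected
print(f"{name}: expected bin {expected}, read bin {read}, offset {offset:+d}")
```

The two bin numbers differ by exactly one, which is what points at the data file (or how it was read) rather than at the search code.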

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

14 Feb 2016 9:28:08 UTC

Message 136860

I awoke to find this host with v1.02 tasks, four completed with no errors (and CPU temps up 10C, GPU temps down 15C - expected).

To my pleasant surprise - this task has a MAGIC QM wingman assigned.

Anonymous

14 Feb 2016 21:56:00 UTC

Message 136861 in response to message 136860

One "01" job completed at ~37+ and is currently in a pending state. This is a V1.02 job, running on a GTX 770 Linux machine.

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1886

Credit: 1406851261

RAC: 1173757

14 Feb 2016 21:59:59 UTC

Message 136862 in response to message 136860

Quote:

I awoke to find this host with v1.02 tasks, four completed with no errors (and CPU temps up 10C, GPU temps down 15C - expected).

To my pleasant surprise - this task has a MAGIC QM wingman assigned.

Sorry about that AgentB

That is the ONLY one of my 7 hosts not in my house and it got a "Not started by deadline - canceled"

THAT would not happen with the 6 hosts that have me staring at them 24/7

(no I never sleep)

So maybe you will have better luck in the future.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117543430120

RAC: 35342239

14 Feb 2016 23:57:35 UTC

Message 136863 in response to message 136862

Quote:

... That is the ONLY one of my 7 hosts not in my house and it got a "Not started by deadline - canceled"

What heresy!! Looks like the machine got turned off for the duration sometime after a whole bunch of O1AS tasks got downloaded!! :-). 22 canceled and 4 timed out - no response. I guess those 4 were in progress and couldn't be canceled. Shouldn't you just abort them to save the waste?

Look on the bright side - they were all V1.01 so I guess that would have been just a whole lot of wasted computing ;-). The machine has obviously now been turned back on, since all those canceled tasks have been reported and there is a new V1.02 replacement :-).

Now that V1.02 looks the goods (as long as they fix the estimates so that the DCF doesn't go absolutely bonkers) it might just be time to consider sticking my toe in the water :-).

Cheers,
Gary.

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1886

Credit: 1406851261

RAC: 1173757

15 Feb 2016 4:49:00 UTC

Message 136864 in response to message 136863

Well Gary, that is what happens when you install Boinc to do Einstein GPU tasks on your sister-in-law's laptop as payment for spending hours installing Windows 10 and all the updates for her.

And since they all automatically got set to run these new tasks, I had no way to turn that off on hers. I am surprised that hers got 27 of them while all 6 of my home hosts only got 4, and even then 3 of those went to my old 3-core, so it started running them on all 3 cores and turning off the other CPU tasks (vLHC).

WHY did Boinc decide to give her not very fast quad-core 27 tasks?

And NONE for my 8-core or any of the quad-core?

I changed hers from here (Location setting) not to get anymore but not much I can do about that one that was doing fine with the GPU tasks getting those CPU tasks all of a sudden.

I only have 3 finished so far (at home). They took over 79,000 seconds each and are all still pending.

She isn't a youngster, so a laptop with a new OS is not something she is an expert on (I guess I'm not a youngster either but I am a mad scientist)

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117543430120

RAC: 35342239

15 Feb 2016 6:04:20 UTC

Message 136865 in response to message 136864

Quote:

WHY did Boinc decide to give her not very fast quad-core 27 tasks?

Bad luck, I guess.

There are some previous BRP6 tasks crunched on the Intel GPU. The last was sent on Feb 05 and returned on Feb 07. There doesn't seem to be anything else until Feb 12. At that time the machine got 4 BRP6 followed by small batches of O1AST at roughly 1 minute intervals. Looks like the cache setting was large enough (and perhaps the estimate on the new tasks was small enough) to allow the whole 26 to be downloaded over a period of about 6 minutes. Perhaps the previous BRP6 done on the Intel GPU had been done faster than expected and had caused the DCF to be rather lower than appropriate for the GW tasks.

I'm guessing this just happened to correspond to the release of the V1.01 app together with tasks available for download and a low DCF to encourage lots of them. Just the luck of the draw. I've seen this sort of thing happen before so when there's a brand new app ready to be tested, I try to avoid the very first flush of tasks. When I want to join in, I make sure my cache setting is so low that I can't get more than a couple to start with. I have a dual core machine right now with about 2-3 hours to go on the last two FGRPB1. It's only asking for O1AST tasks now and it has a cache setting of 0.25 days. I reckon it will be ready to download and crunch as soon as they make some more available. It might even get a resend of one of the previous lot.
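A back-of-the-envelope sketch of that effect, in Python. It assumes a much simplified model: the client multiplies the raw runtime estimate by the DCF and keeps asking for tasks until the estimated work fills the cache on every core. The real BOINC work-fetch logic is more involved than this, and the function name and all the numbers below are made up purely for illustration.

```python
import math


def tasks_to_fill_cache(cache_days, n_cores, raw_estimate_s, dcf):
    """Simplified model (an assumption, not BOINC's actual algorithm):
    number of tasks whose scaled estimates fit into the work cache."""
    scaled_estimate_s = raw_estimate_s * dcf      # what the client believes one task takes
    cache_seconds = cache_days * 86400 * n_cores  # core-seconds of work the client wants
    return math.floor(cache_seconds / scaled_estimate_s)


# Hypothetical quad-core with a 1 day cache and a 6 hour raw estimate per task.
print(tasks_to_fill_cache(1.0, 4, 21600, 1.0))   # DCF at 1.0
print(tasks_to_fill_cache(1.0, 4, 21600, 0.5))   # DCF halved by fast earlier tasks

# Hypothetical dual-core with a 0.25 day cache.
print(tasks_to_fill_cache(0.25, 2, 21600, 1.0))
```

In this toy model, halving the DCF doubles the number of tasks that fit, while a small cache setting caps the download at a couple of tasks whatever the estimate does.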

Quote:

I changed hers from here (Location setting) not to get anymore but not much I can do about that one that was doing fine with the GPU tasks getting those CPU tasks all of a sudden.

You should put it back on just BRP6 for the Intel GPU. It was doing very well on those. You really don't want to run alpha test stuff on a machine you can't directly access :-).

Quote:

... (I guess I'm not a youngster either but I am a mad scientist)

We're ALL mad scientists around here mate, even if we were something else in another life ;-). Take a look at our resident 'refugee from an otherwise highly esteemed profession'. He obviously knows more about black holes than about how to cure a pain in a black hole ... I rest my case :-).

Cheers,
Gary.

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188317469

RAC: 303574

15 Feb 2016 16:24:00 UTC

Message 136866

Another Update:

The 1.02 apps solve the missing result file problem (upload failure -161) and we are now receiving all of the result files. The validator is already running and we are keeping a lookout for any validation errors.

Later this week we will grant credit to all those who suffered from the upload failure.

There will be an update to 1.03 shortly that fixes some problems with checkpointing that we found.

I'm also going to generate more work after the apps are updated so your machines can keep busy.

We are aware that runtimes seem to be "off the scale". But this was a little bit expected so we can tune the main search. The runtimes on a host seem to be consistent. Why some hosts take 6h and some 24h we don't know yet. I will dig into that when there are more successful results available to produce proper statistics.
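As a rough idea of the kind of per-host summary that would separate "consistent on one host" from "different between hosts", here is a small Python sketch. The host names and runtimes are invented for illustration and are not project data, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (host, runtime in seconds) pairs; NOT real results from the project.
results = [
    ("host_a", 21600), ("host_a", 22000), ("host_a", 21200),
    ("host_b", 86400), ("host_b", 85000), ("host_b", 87800),
]


def per_host_summary(rows):
    """Return {host: (mean runtime, relative spread)} for the given rows."""
    by_host = defaultdict(list)
    for host, runtime in rows:
        by_host[host].append(runtime)
    return {
        host: (mean(times), pstdev(times) / mean(times))
        for host, times in by_host.items()
    }


summary = per_host_summary(results)
for host, (avg, spread) in sorted(summary.items()):
    print(f"{host}: mean {avg / 3600:.1f} h, relative spread {spread:.1%}")
```

A small relative spread per host combined with very different means between hosts would match the pattern described above.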

If you find new problems with the 1.03 version please open a new thread in Problems and Bug Reports.
