Major searches FGRPB1G and O1OD1 out of work - This looks like a problem.

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1611783458

RAC: 705920

24 Feb 2019 19:36:21 UTC

Topic 218259


This has been going on for quite a while.

2/24/2019 11:29:34 AM | Einstein@Home | No work is available for Gamma-ray pulsar binary search #1 on GPUs

2/24/2019 11:29:34 AM | Einstein@Home | No work is available for Gravitational Wave All-sky search on LIGO O1 Open Data
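For anyone who wants to pull these messages out of a saved event log, here is a minimal sketch. It assumes the `date | project | message` layout and the US-style timestamp shown in the two lines above, which may differ on other systems or locales. The function name `searches_without_work` is made up for illustration and is not part of BOINC.

```python
import re
from datetime import datetime

# Assumed layout: "<timestamp> | <project> | <message>", as in the lines quoted above.
LINE_RE = re.compile(r"^(?P<ts>[^|]+?)\s*\|\s*(?P<project>[^|]*?)\s*\|\s*(?P<msg>.*)$")
NO_WORK_PREFIX = "No work is available for "


def searches_without_work(lines):
    """Return (timestamp, project, search name) for each 'no work' message."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an event log line in the assumed layout
        msg = match.group("msg")
        if msg.startswith(NO_WORK_PREFIX):
            # Assumed timestamp format: month/day/year with a 12-hour clock.
            ts = datetime.strptime(match.group("ts"), "%m/%d/%Y %I:%M:%S %p")
            found.append((ts, match.group("project"), msg[len(NO_WORK_PREFIX):]))
    return found


if __name__ == "__main__":
    log = [
        "2/24/2019 11:29:34 AM | Einstein@Home | No work is available for "
        "Gamma-ray pulsar binary search #1 on GPUs",
        "2/24/2019 11:29:34 AM | Einstein@Home | No work is available for "
        "Gravitational Wave All-sky search on LIGO O1 Open Data",
    ]
    for ts, project, search in searches_without_work(log):
        print(ts.isoformat(), project, search, sep=" | ")
```

Counting how often each search name turns up over a day gives a rough idea of how long a search has been dry.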

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5874

Credit: 118425695168

RAC: 25860704

24 Feb 2019 20:44:50 UTC

Message 169740


It's always useful to specify what the problem is, so I edited your thread title to add the information.

For FGRPB1G, there has been no 'ready to send' work for around 24 hours now. There have been occasional 'resends' for failed or expired tasks but no new 'primary' tasks. Probably tasks for the current data file LATeah1049L.dat have all been distributed and there is no replacement data file ready to take over. Based on the times that similar files have lasted, that file was ready for replacement anyway.

The server status page still shows 960 ready to send tasks for O1OD1 so I don't know why you get a message for that search as well. I seem to remember there was a fairly similar number around 12 hours ago when I last looked so perhaps something is 'stuck' and it's a different issue to the FGRPB1G case.

Cheers,
Gary.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

24 Feb 2019 21:08:49 UTC

Message 169741


I tested, and both Linux and Windows hosts were able to get an O1OD1 task.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

24 Feb 2019 22:09:20 UTC

Message 169742


A new FGRPB1G data set 1049M is being sent out now.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5874

Credit: 118425695168

RAC: 25860704

24 Feb 2019 22:39:04 UTC

Message 169748


Yes indeed, what a pleasant surprise. I was resigned to waiting for Monday morning in Hannover for a fix.

To make up for all the data file deletions that happen when hosts run out of work, either originally or when an occasional resend gets returned, I decided about 45 minutes ago to do another data file top-up run, in which each host's data files are replenished from my local cache of such files. After syncing, each host is 'encouraged' to do a work fetch attempt, just in case :-).

As luck would have it, one of the hosts got new work and scored the download of 1049M. The script logic notices this, suspends current operations and immediately deploys the new file to every other host in the fleet. This meant that having 100 hosts all trying to download the new file, plus any missing 'old' files, was completely avoided.

The most pleasing thing about this was the fact that this was a pretty thorough test of the logic behind handling this sort of event and everything seems to have worked as intended. All hosts are back crunching and I haven't seen any indication in the logs of any host getting data files from the project rather than the local file cache.
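For readers curious about the shape of that logic, here is a rough sketch of the flow described above. It is not the actual script, which is not shown in this thread. The `Host` class, `topup_run` and the `try_work_fetch` callback are all hypothetical names, and hosts are modelled as in-memory sets of file names rather than real machines.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for one machine in the fleet."""
    name: str
    files: set = field(default_factory=set)


def topup_run(hosts, local_cache, try_work_fetch):
    """Replenish each host from local_cache, then nudge it to fetch work.

    try_work_fetch(host) is an assumed callback that returns the name of a
    data file the host downloaded from the project, or None if nothing new
    arrived. Returns the list of new files that were pushed fleet-wide.
    """
    deployed = []
    for host in hosts:
        host.files |= local_cache        # sync missing files from the local cache
        new_file = try_work_fetch(host)  # 'encourage' a work fetch attempt
        if new_file and new_file not in local_cache:
            # A host scored a file the cache has never seen: pause the normal
            # top-up and push the file to every other host straight away.
            host.files.add(new_file)
            local_cache.add(new_file)
            for other in hosts:
                if other is not host:
                    other.files.add(new_file)
            deployed.append(new_file)
    return deployed


if __name__ == "__main__":
    fleet = [Host("a"), Host("b"), Host("c")]
    cache = {"LATeah1049L.dat"}
    # Pretend only host "b" is lucky enough to be sent the new data set.
    pushed = topup_run(fleet, cache, lambda h: "1049M" if h.name == "b" else None)
    print(pushed, [sorted(h.files) for h in fleet])
```

The point of the design is that the new file crosses the internet connection once, and every other host then gets it over the local network.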

Cheers,
Gary.
