Major searches FGRPB1G and O1OD1 out of work - This looks like a problem.

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1611783458
RAC: 705920
Topic 218259

This has been going on for quite a while.

2/24/2019 11:29:34 AM | Einstein@Home | No work is available for Gamma-ray pulsar binary search #1 on GPUs

2/24/2019 11:29:34 AM | Einstein@Home | No work is available for Gravitational Wave All-sky search on LIGO O1 Open Data

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118425695168
RAC: 25860704

It's always useful to specify what the problem is, so I edited your thread title to add the information.

For FGRPB1G, there has been no 'ready to send' work for around 24 hours now.  There have been occasional 'resends' for failed or expired tasks but no new 'primary' tasks.  Probably tasks for the current data file LATeah1049L.dat have all been distributed and there is no replacement data file ready to take over.  Based on the times that similar files have lasted, that file was ready for replacement anyway.

The server status page still shows 960 ready to send tasks for O1OD1 so I don't know why you get a message for that search as well.  I seem to remember there was a fairly similar number around 12 hours ago when I last looked so perhaps something is 'stuck' and it's a different issue to the FGRPB1G case.

 

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I tested, and both Linux and Windows hosts were able to get an O1OD1 task.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

A new FGRPB1G data set 1049M is being sent out now.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118425695168
RAC: 25860704

Yes indeed, what a pleasant surprise.  I was resigned to waiting for Monday morning in Hannover for a fix.

To make up for the data files that get deleted when hosts run out of work (either when they first run dry or when an occasional resend gets returned), I decided about 45 minutes ago to do another data file top-up run, in which each host's data files are replenished from my local cache of such files.  After syncing, each host is 'encouraged' to do a work fetch attempt, just in case :-).

As luck would have it, one of the hosts got new work and scored the download of 1049M.  The script logic notices this, suspends current operations and immediately deploys the new file to every other host in the fleet.  This meant that having 100 hosts all trying to download the new file, plus any missing 'old' files, was completely avoided.
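For anyone curious about the general idea, here is a minimal sketch of that kind of top-up logic. It is not the actual script: it assumes each host's data directory and the local cache are plain directories reachable from one machine, and every name in it (`top_up`, `new_files`, `missing_files`, `old_file.dat`, `new_file.dat`) is hypothetical. A real fleet would need remote copies and BOINC client control, which are left out here.

```python
import shutil
import tempfile
from pathlib import Path


def missing_files(host_dir: Path, cache_dir: Path) -> list[str]:
    """Names present in the local cache but absent on the host."""
    have = {p.name for p in host_dir.iterdir() if p.is_file()}
    return sorted(p.name for p in cache_dir.iterdir()
                  if p.is_file() and p.name not in have)


def new_files(host_dir: Path, cache_dir: Path) -> list[str]:
    """Names on the host that the local cache has not seen yet."""
    cached = {p.name for p in cache_dir.iterdir() if p.is_file()}
    return sorted(p.name for p in host_dir.iterdir()
                  if p.is_file() and p.name not in cached)


def top_up(host_dirs: list[Path], cache_dir: Path) -> list[tuple[str, str, str]]:
    """Top up every host from the cache.

    If a host turns out to hold a file the cache lacks (assumed to be a
    freshly downloaded data file), the normal top-up is interrupted: the
    file is cached and pushed to all other hosts before carrying on.
    """
    log = []
    for host in host_dirs:
        for name in new_files(host, cache_dir):
            shutil.copy2(host / name, cache_dir / name)
            for other in host_dirs:
                if other != host and not (other / name).exists():
                    shutil.copy2(cache_dir / name, other / name)
                    log.append((other.name, name, "deployed"))
        for name in missing_files(host, cache_dir):
            shutil.copy2(cache_dir / name, host / name)
            log.append((host.name, name, "topped-up"))
    return log


def demo():
    """Three stand-in hosts; host1 has just received a new data file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        cache.mkdir()
        hosts = [root / f"host{i}" for i in range(3)]
        for h in hosts:
            h.mkdir()
        (cache / "old_file.dat").write_text("old")
        (hosts[0] / "old_file.dat").write_text("old")
        (hosts[1] / "new_file.dat").write_text("new")
        log = top_up(hosts, cache)
        state = {h.name: sorted(p.name for p in h.iterdir()) for h in hosts}
        return log, state
```

In the demo, the new file found on `host1` is copied into the cache and pushed to the other two hosts before they would have had any reason to fetch it from the project, which is the effect described above.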

The most pleasing thing about this was the fact that this was a pretty thorough test of the logic behind handling this sort of event and everything seems to have worked as intended.  All hosts are back crunching and I haven't seen any indication in the logs of any host getting data files from the project rather than the local file cache.

 

Cheers,
Gary.
