Locality Scheduling (LS) has been used for many years in GW science runs. The basic aim is to conserve server bandwidth, client bandwidth and client disk space by re-using existing data files (where possible) when sending new tasks. If you look at any list of GW task names (not IDs), you can see two suffixes. The last one is usually _0 or _1, which simply denotes the two primary members of a workunit quorum. If either or both of those fail for any reason, you will see 'resends', which have _2 or higher for that suffix.
The suffix before that is usually a much bigger number. You can see some in the above list as high as _2475. I call that a sequence number - a number that declines by one as each subsequent task in that sequence is issued. With 2 tasks per quorum, there are close to 5000 tasks all told still to be issued for that particular example.
Earlier in the task name you can see two frequency values - 94.50 and 94.60 (without the Hz) - as parts of the task name. These relate to the spin frequency of an object that might be emitting a continuous GW signal. To search for such an object, data files (both h1_... and l1_...) are required covering a somewhat wider frequency range - around 12 files (24MB) in total.
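To make that concrete, here's a small Python sketch that pulls those pieces out of a task name. The name used here is invented purely for illustration (the real separators and field order may differ), but the suffixes behave as described above:

import re

# An invented task name, just to show the pieces (not a real one):
task_name = 'h1_0094.50_O1C01Cl2In3__O1AS20-100F_94.50Hz_94.60Hz_2475_1'

m = re.search(r'_(\d+\.\d+)Hz_(\d+\.\d+)Hz_(\d+)_(\d+)$', task_name)
freq_lo, freq_hi, seq, quorum = m.groups()

print('frequency values:', freq_lo, 'and', freq_hi)   # relate to the spin frequency searched
print('sequence number :', seq)                       # counts down towards zero
print('quorum member   :', quorum)                    # _0/_1 primary, _2 or higher are resends
print('tasks still to issue for this set: about', 2 * (int(seq) + 1))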
The scheduler could send a task and the large data-file payload to ~5000 different hosts at random, OR it could send them to a relatively small number of hosts (say 50) and have the other ~4950 hosts work on a range of other frequency sets. That is the basic aim of LS. The difficulty is designing a scheduler that can strike a balance between all the possible frequency 'bins' and the number of hosts feeding at each one. If there are far too few hosts at a particular bin, you can get a big delay before the second task in a quorum is sent out.
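A back-of-envelope comparison, using the rough numbers above (all approximate), shows why that matters:

files_per_set_mb = 24        # ~12 large data files, about 2MB each
tasks_in_set     = 5000      # roughly 2 x (sequence number + 1)
hosts_locality   = 50        # hosts the scheduler concentrates this set on

worst_case = tasks_in_set * files_per_set_mb     # every task goes to a 'fresh' host
with_ls    = hosts_locality * files_per_set_mb   # each chosen host downloads the set once

print(worst_case, 'MB served vs', with_ls, 'MB served')   # 120000 MB vs 1200 MB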
For LS to work, a client needs to keep its store of large data files and report what it has to the scheduler on each work request. The scheduler looks at what a client already has and sends further tasks for that data. There are lots of tasks for that same data, so no extra downloads are needed until the sequence number reaches zero. The mechanism that allows this to work is making the data files 'sticky' - the client isn't allowed to delete them when a task finishes because there could be more tasks still to come. The scheduler informs the client when the data can safely be deleted.
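Just to illustrate the idea, here's a toy Python sketch of the principle - it bears no resemblance to the real BOINC scheduler code, and all the names in it are invented:

def pick_task(host_files, pending):
    """host_files: set of frequency sets the host reports it already holds.
    pending: dict mapping frequency set -> list of unsent task names."""
    # First preference: a task whose data the host already has on disk.
    for freq_set, tasks in pending.items():
        if freq_set in host_files and tasks:
            return tasks.pop(0), []                     # nothing new to download
    # Otherwise open a new frequency set and send its (sticky) data files too.
    for freq_set, tasks in pending.items():
        if tasks:
            return tasks.pop(0), ['h1_%s_...' % freq_set, 'l1_%s_...' % freq_set]
    return None, []

# e.g. pick_task({'0094.50'}, {'0094.50': ['task_2475_0'], '0076.40': ['task_0010_0']})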
I've explained all this because I'm asking for a favour from other volunteers. I'm seeing a problem with the working of the 'sticky' mechanism and I'm wondering if it's BOINC version related or something else. I want to know if others are seeing something different. First, here is an example of what the scheduler sends to the client with a data file.
<file_info>
  <name>h1_0076.40_O1C01Cl2In3.hlkC</name>
  <url>http://einstein.phys.uwm.edu/EinsteinAtHome/download/40/h1_0076.40_O1C01Cl2In3.hlkC</url>
  <md5_cksum>94603b3b8a00a73e8e0ea7d942138341</md5_cksum>
  <nbytes>2006768</nbytes>
  <sticky/>
  <report_on_rpc/>
</file_info>
Notice there is a <sticky/> flag and a <report_on_rpc/> flag in there. The next block is an example of the information that gets inserted into the state file so the client knows how to handle a data file. It's for a different frequency, but I see the same for all frequencies.
<file_info>
  <name>h1_0050.15_O1C01Cl2In3.zQ7m</name>
  <nbytes>2006768.000000</nbytes>
  <max_nbytes>0.000000</max_nbytes>
  <md5_cksum>018cfe480b2c9114aefbd9e4b9c71b4d</md5_cksum>
  <status>1</status>
  <url>http://einstein.ligo.caltech.edu/download/20d/h1_0050.15_O1C01Cl2In3.zQ7m</url>
  <url>http://einstein-dl3.phys.uwm.edu/download/20d/h1_0050.15_O1C01Cl2In3.zQ7m</url>
  <url>http://einstein-dl2.phys.uwm.edu/download/20d/h1_0050.15_O1C01Cl2In3.zQ7m</url>
</file_info>
Notice there is no <sticky/> flag. When the tuning run was on, I know for certain that the above <file_info> ... </file_info> block example had an extra line embedded - a line with the <sticky/> flag. So something is now different. I've actually tested this by promoting and returning single tasks where there is just one task at present for a given frequency set. As soon as the task is reported, all the data files on disk get deleted and the <file_info> ... </file_info> blocks in the state file get removed.
I'm wondering if the client BOINC version has any bearing on this. I'm using 7.2.42 for Linux, downloaded from Berkeley; my distro doesn't keep BOINC in the repo. The 'favour' I'm asking is for people with more recent BOINC versions (and F type GW tasks on board) to check by browsing a copy of the state file (I don't want anyone to damage the real deal) in a plain text editor to find any <file_info> ... </file_info> entry like the above. Please don't attempt this if you don't know how or you're not totally comfortable. Please report in this thread the BOINC version and whether or not there is a <sticky/> flag in the block. I've reported this to the Devs but it would be good to know if others are seeing this as well and if the BOINC version has anything to do with it. Thank you.
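If it helps, here's a small Python sketch that does the same check automatically. Please run it only against a COPY of client_state.xml, never the live file. The h1_/l1_ filter and the assumption that the flag appears as a <sticky/> child of each <file_info> element are mine, so treat it as a rough aid rather than anything official:

import sys
import xml.etree.ElementTree as ET

# Usage: python check_sticky.py /path/to/COPY_of_client_state.xml
root = ET.parse(sys.argv[1]).getroot()

for fi in root.iter('file_info'):
    name = fi.findtext('name', default='')
    if name.startswith(('h1_', 'l1_')):               # only the large GW data files
        has_sticky = fi.find('sticky') is not None
        print('%-35s sticky: %s' % (name, 'yes' if has_sticky else 'NO'))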
Cheers,
Gary.
Locality Scheduling behaviour in the new O1AS20-100F run
Thanks for explaining, Gary.
Here is a file entry from 7.6.31 on Linux (Debian version) which has the <sticky/> flag correctly set.
I'm still curious whether other 7.2.42 users see the same as you. But then again, why did that change between the T run and the F search with the same client version? That would imply that something on the server side changed to cause this. We'll need some more info on other hosts where this is also happening, and I'll look through the logfiles.
BOINC version 7.6.22
BOINC version 7.6.22 (x64)
OS: Microsoft Windows 10 Professional x64
Thank you also from me for
Thank you also from me for spelling out some of it, especially the scheduling part, which for me was quite obscure until now. I love to look at such details, although I wouldn't know what to do with such knowledge.
My contribution to it: a pair of them, but unfortunately not your version (in light of Christian's reaction).
BOINC 7.6.22 and OS X 10.11.3 (El Capitan).
l1_0097.05_O1C01Cl2In3.NQU9
1988248.000000
0.000000
a3834cf30f908ac089ee63acb6b0dba2
1
http://einstein2.aei.uni-hannover.de/download/189/l1_0097.05_O1C01Cl2In3.NQU9
http://einstein-dl.syr.edu/download/189/l1_0097.05_O1C01Cl2In3.NQU9
http://einstein-dl2.phys.uwm.edu/download/189/l1_0097.05_O1C01Cl2In3.NQU9
I'm not sure how to separate
I'm not sure how to separate out files that had to do with the older "Tuning" run from the new ones.
I know that I went down more entries in this 121 KB .xml file than I care to count and all of them had a <sticky/> tag, but frankly my patience ran out as I realized I'd lost count of how many.
Give me a way to separate the F, I and T files and I'll dig deeper.
(For what it's worth, this was on a 7.6.22 client on Win 10 Pro 64-bit)
RE: I'm still curious if
First of all, many thanks to those who responded. I appreciate that you tried to help.
I have spent a bit of time delving into this and I think I have enough evidence to claim that something happened server side to cause this, and that it doesn't seem to be the 7.2.42 BOINC version. I have found this 'evidence' just by looking through task lists and state files on a range of my own machines.
Over quite a few hosts, tasks received on or prior to 10 Mar 04:25:35 UTC that required downloading of data files all had <file_info> ... </file_info> entries in the corresponding state files where the <sticky/> flag was not set.
By comparison, again over a number of hosts, tasks received on or after 10 Mar 06:05:34 UTC had <file_info> ... </file_info> entries in the corresponding state files where the <sticky/> flag was set.
Since all the hosts were crunching away, minding their own business and not being 'fiddled with' or modified, it would appear that something happened during this 1h40m window that caused the change in behaviour. At this point it would appear to be a server-side change.
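Just to show where the 1h40m figure comes from (assuming both timestamps fall on the same day, 10 Mar 2016):

from datetime import datetime

last_without_sticky = datetime(2016, 3, 10, 4, 25, 35)   # last task seen without the <sticky/> flag
first_with_sticky   = datetime(2016, 3, 10, 6, 5, 34)    # first task seen with the <sticky/> flag
print(first_with_sticky - last_without_sticky)           # 1:39:59, i.e. roughly 1h40m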
When I first noticed this problem, I withdrew all hosts from downloading any further O1ASF tasks. Each host was acquiring multiple frequency ranges, so there were lots of large data-file downloads that were going to be thrown away after just one task. Now that I've satisfied myself that every new data download has the <sticky/> flag in the state file, I'll switch these hosts back to getting O1ASF tasks again.
I'd like to ask a special favour of Benva, who responded earlier in this thread. Here is a link to his GW tasks showing task names. Scroll down that list and find the two completed tasks where the frequency is 89.25Hz. There are only two tasks at this frequency; after that it goes to 94.10Hz.
If the two tasks for 89.25Hz were sent with the <sticky/> flag, the large data files (both h1_... and l1_...) for 89.25 through to 89.50 should all still be there, waiting in the einstein project directory for more tasks. If there was no <sticky/> flag, those 12 data files would have been deleted. The only person who can confirm this is Benva, by seeing whether those files still exist in his einstein project directory.
Of course, other people could do exactly the same using their own list of tasks: find completed tasks of their own, sent before the cut-off time, with no more tasks of that frequency still to be crunched. This is exactly what I've done on a range of my hosts, trying to narrow down exactly when the <sticky/> flag started being correctly inserted. That time gap is bounded by the two times quoted above.
I'd love to hear if people are missing data files that they once would have had. Maybe you could narrow down the time gap even further :-).
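For anyone who'd rather let a small script do the looking, here's a rough Python sketch. The project directory path is only my guess for a typical Linux install (adjust it for your own setup), and the 0.05Hz spacing over a 0.25Hz span is simply what gives the 12 files mentioned above:

import glob, os

# Adjust this to your own BOINC data directory - the path below is just a guess.
proj_dir = os.path.expanduser('~/BOINC/projects/einstein.phys.uwm.edu')

base = 89.25                                  # first frequency from the task name
freqs = [base + 0.05 * i for i in range(6)]   # 89.25 .. 89.50 in 0.05 steps

for f in freqs:
    for det in ('h1', 'l1'):
        pattern = os.path.join(proj_dir, '%s_%07.2f_*' % (det, f))
        print('%s_%07.2f :' % (det, f), 'present' if glob.glob(pattern) else 'MISSING')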
Cheers,
Gary.
RE: I'm not sure how to
Here is the O1ASF task list for your Xeon running Windows server. Notice there are two completed tasks for a frequency of 50.50 which were sent before the critical time. If those had no <sticky/> flag, you won't be able to find the 12 large data files which have a frequency from 50.50 through to 50.75 as part of their name. You have to browse the einstein project directory to look for these files. They will start with either h1_0050.50_... or l1_0050.50_... right through to h1_0050.75_... or l1_0050.75_...
There's also a single task at 59.85Hz in your list. The data files for that one should be missing too.
Cheers,
Gary.
RE: If the two tasks for
I have no such files as h1_0089... or l1_0089... in my data directory.
Cheers,
Benva
I don't know what to look
I don't know what to look for exactly, so this is probably not of much help. It has been a long time since I looked into Einstein's data directory; this thread made me do so again. A few things then caught my eye:
• the files for the F run - 6 h and 6 l - are stamped 10 Mar 2016 00:37;
• all files for the Tuning run are still there too; those came between 22 Feb 2016 13:11 and 29 Feb 2016 02:02; I cannot say it's everything, I don't know that, but it is an awful lot;
• earth and sun .dat files are stamped 22 Feb 2016 13:11;
• the skygrid .dat is stamped 10 Mar 2016 00:37.
All files above are in the current state file, where expected with their <sticky/> flag set. There is no already-completed data left over from any of the other applications, thankfully.
That aside, there is a bunch of application executables - every one ever run on this computer - and numerous corresponding (but not all) slideshow_einstein_* files that look related. I suspect that I could delete the really old stuff in there myself; I just have not dared to so far. I have excluded BOINC Data from my (Time Machine) backup settings, so maybe I'll just create a copy somewhere else later and see what happens when I do that, but I'm rather unsure. It's not really important, although there are 285 files in that folder, taking 349.5MB on disk.
RE: RE: If the two tasks
Benva,
Thank you very much for confirming that. It means the data files for those two tasks at the 89.25 frequency were not marked as sticky and were deleted when the tasks were reported. It also means a lot to me because it indicates this was a real problem for others and not something unique to me.
It's clear the problem has been fixed, so we should now get lots of tasks for a particular frequency range without any risk of data files being deleted only to be re-downloaded later. In your case, your machine is getting lots of tasks for the frequency 87.95. You have contiguous sequence numbers from _2214 down to _2198. Locality scheduling should allow you to keep getting lower sequence numbers, all the way down to zero. The numbers won't always be contiguous - there will be other hosts feeding on that same frequency as well :-).
Once again, thank you for the response. It was good to have the confirmation.
Cheers,
Gary.
RE: I don't know what to
Here is the full list of all the O1ASF series of tasks for your machine, sorted by task name. Notice you have two frequencies, 81.90 and 96.80, so you really should have 24 large data files on board. You only have 12 because those for the 81.90 frequency would have been deleted when the task at the top of the list was returned and reported. So you have also confirmed that there was a time when the <sticky/> flag wasn't set.
The flag wouldn't have been set for those 12 files when the first tasks at 96.80 were sent, but because you kept getting more of those tasks, it would have been added when the problem was resolved. I saw exactly that happen at one stage on one of my hosts.
If you delete any file on disk that is still listed in the state file (no matter how old), BOINC will probably grumble and just download the stuff again. If you're really adventurous (highly not recommended), you can stop BOINC, delete the entry for an old app, etc., and then restart BOINC. Then you can delete the actual file on disk. You really need to know what you're doing - it's so easy to get it wrong. With OS X it's doubly difficult because of where things are hidden and because of file ownership and permissions. Caveat emptor!!!
Cheers,
Gary.