S5R3 Nearing Completion

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

16 Jul 2008 15:15:32 UTC

Topic 193765

For you folks on metered bandwidth, and/or slow connections:

Now that we are closing in on the end of the current science run, remember this might cause your host to make more frequent requests from the project for the large datapacks. The reason is there are some template frequencies where the initial set of assigned hosts needs some help in finishing up the work.

If this presents a major problem for you, the time has arrived to start thinking about switching to your backup project (or just resting) until the team gets the next science run pushed out.

Keep in mind this isn't as big a bandwidth penalty as it used to be, since the datapack size was reduced from the old 30 MB monster to the smaller 3 MB ones. So far on my hosts, when they have picked up 'onesie-twosie' frequencies, this has only required picking up one or two additional datapacks to fill in the blanks so they can run. However, the bandwidth demands might increase as we get closer to the end of the run and there is less selection of templates to assign to hosts requesting work.

Alinator

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 837759235

RAC: 994122

16 Jul 2008 16:00:27 UTC

Message 82759

Hi!

Good point!

Even though individual datafiles are ca. 3 MB in size, each one covers only 0.05 Hz of frequency bandwidth and only one of two observatories (h1_* and l1_* files). A workunit can require 16 or more of those files, so if you are especially unlucky and a newly assigned WU has no overlap at all in the frequency range with your stored datafiles, BOINC will download more than 54 MB just for a single WU.
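As a rough back-of-the-envelope sketch of that arithmetic (not anything the project itself provides): the download cost of a workunit is roughly the number of required data files that are not already on the host, times the file size. The file names, the `data_files_for_band` helper and the flat 3 MB size below are assumptions for illustration only; real files are only approximately 3 MB, so real totals will differ somewhat.

```python
def data_files_for_band(start_hz, steps, detectors=("h1", "l1")):
    """Build illustrative data file names, one per 0.05 Hz step and detector.

    The naming pattern is a guess modelled on the h1_*/l1_* prefixes;
    it is not taken from the project.
    """
    return [
        f"{det}_{start_hz + 0.05 * i:.2f}"
        for i in range(steps)
        for det in detectors
    ]


def download_estimate_mb(files_needed, files_cached, mb_per_file=3.0):
    """Estimate the download volume for one workunit in MB.

    Only files that are needed but not already stored count.
    mb_per_file is an assumed round figure, not an exact size.
    """
    missing = set(files_needed) - set(files_cached)
    return len(missing) * mb_per_file


needed = data_files_for_band(1132.05, steps=8)      # 8 steps x 2 detectors = 16 files
print(download_estimate_mb(needed, []))             # no overlap at all
print(download_estimate_mb(needed, needed[:10]))    # 10 of the 16 files already stored
```

With a flat 3 MB per file, 16 missing files come to 48 MB and 18 to 54 MB, so the exact worst case depends on how many files a given WU really needs and on the true file sizes.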

CU
Bikeman

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119914978923

RAC: 26480211

17 Jul 2008 2:35:12 UTC

Message 82760

We should be well into the final cleanup stages in less than two weeks, which means that it will probably be impossible to get anything like a decent run of tasks without having to download extra large data files each time a request for new work is made. If you need to avoid an unnecessary volume of data downloads, there are some strategies you can adopt to minimise the impact.

Alinator mentioned the most obvious - give your computer a rest for about 2 weeks, starting in about a week's time (or when you notice the sequence numbers starting to get really low).

If you would like to keep crunching, you can adopt the strategy of "stocking up while the going is good". In other words, if you are currently crunching with a cache setting of, say, 0.1 to 1.0 days (i.e. "connect to network every ..." is set somewhere in that range), at an opportune time you could set your "extra days" setting to as high as 10.0 so that you get a good stock of current work. How do you pick the opportune time? Well, you can "guess" this by looking at how the sequence numbers are moving.

Let's say you have the 1140Hz frequency skygrid file and the latest one of your current tasks is h1_1132.05_S5R3__679_S5R3b_0. The sequence number is 679 and as it is a high value, there is a reasonable chance of more tasks to come from the same frequency band. Sequence numbers (seq#s) range from over 1000 down to zero and they are generally used up from the top down and shared with quite a few other crunchers working simultaneously at various parts of the seq# range. If BOINC downloads your next task and it has a freq/seq# of say 1132.05/_675, this tells you a few things:-

* No frequency band change so (most probably) no extra large data files needed
* Small change in seq# so not many others grabbing this part of the range
* High seq# still so good chance of quite a few more to come
* Little chance of a major frequency change in the immediate future.
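If you'd rather not eyeball the task list, the comparison above can be sketched in a few lines of Python. This is only an illustration: the regular expression is inferred from the single task name quoted above and may not fit every task, and the second task name is a made-up full name for the 1132.05/_675 case.

```python
import re
from dataclasses import dataclass

# Pattern inferred from one example name only; treat it as an assumption.
TASK_NAME = re.compile(
    r"^(?P<detector>[a-z]\d)_(?P<freq>\d+\.\d+)_(?P<run>[A-Za-z0-9]+)"
    r"__(?P<seq>\d+)_(?P<tag>[A-Za-z0-9]+)_(?P<issue>\d+)$"
)


@dataclass
class Task:
    detector: str     # e.g. "h1"
    frequency: float  # frequency taken from the task name, in Hz
    seq: int          # sequence number within that frequency
    issue: int        # trailing _0, _1, ... suffix


def parse_task_name(name):
    """Split a task name into its frequency, seq# and issue suffix."""
    m = TASK_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    return Task(m["detector"], float(m["freq"]), int(m["seq"]), int(m["issue"]))


def compare(previous, latest):
    """Report whether the frequency changed and how far the seq# dropped."""
    return {
        "same_frequency": previous.frequency == latest.frequency,
        "seq_drop": previous.seq - latest.seq,
    }


old = parse_task_name("h1_1132.05_S5R3__679_S5R3b_0")
new = parse_task_name("h1_1132.05_S5R3__675_S5R3b_0")  # hypothetical full name
print(compare(old, new))  # {'same_frequency': True, 'seq_drop': 4}
```

A small `seq_drop` with `same_frequency` still true corresponds to the first two points in the list; what counts as a "high" seq# is a judgment call and is deliberately not coded here.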

When the seq# is much lower and starts to drop rapidly, it's time to grab what's left if you really wish to avoid more big downloads. Last night I saw a machine request work and be given a seq# of 186. This morning, I wanted to see how fast the seq#s were moving so I upped the extra days on that machine through the local BOINC Manager global prefs override facility and the new task came in at 79. So I immediately upped the extra days setting even further and was given 78, 77, 76, ... with no frequency change.
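For those who prefer a file to clicking through the Manager, here is a minimal sketch of what that local override amounts to. It assumes the override lives in a file called global_prefs_override.xml in the BOINC data directory, and that the two cache settings map to the tags `work_buf_min_days` ("connect to network every ...") and `work_buf_additional_days` ("extra days"). That is my understanding of the BOINC client, so check it against your own client version before relying on it. The function name is just for illustration.

```python
import xml.etree.ElementTree as ET


def build_override(connect_every_days, extra_days):
    """Return the text of a minimal local preferences override.

    Tag names are assumed from the BOINC client's override file format.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{connect_every_days:.2f}"
    ET.SubElement(root, "work_buf_additional_days").text = f"{extra_days:.2f}"
    return ET.tostring(root, encoding="unicode")


# Example: keep a short connect interval but stock up with 10 extra days of work.
print(build_override(0.1, 10.0))
```

The sketch only produces the text; writing it into the data directory and getting the client to re-read its preferences is left to you, and the Manager route described above does the same job with less risk of a typo.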

Please note that I'm not encouraging anyone to do things like this unless you have a particular need. If you are not troubled by download quotas, you are far better off to just let BOINC look after things for you.

As a final point, when the scheduler does decide to send you a totally different frequency band to work on, it quite often also decides to give you the odd "resend" task as well. In a standard quorum of two, the initial tasks have either _0 or _1 appended to their name. Anything higher can be immediately identified as a "resend" where at least one of the initial quorum tasks wasn't completed for some reason (client error, missing the deadline, etc). If you are given a resend you will probably also be given an extra 50MB of data to go with it unless you are lucky enough for it to be a resend needing the same data files as those you already have.
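The suffix rule can be written down as a tiny check. This is just a sketch of the rule as stated above for a quorum of two; the `_1` and `_2` names are hypothetical variations on the task name used earlier.

```python
def is_resend(task_name, quorum=2):
    """Return True if the trailing suffix marks the task as a resend.

    With a quorum of two, the initial tasks end in _0 or _1,
    so any suffix of 2 or higher indicates a resend.
    """
    issue = int(task_name.rsplit("_", 1)[1])
    return issue >= quorum


print(is_resend("h1_1132.05_S5R3__679_S5R3b_0"))  # False
print(is_resend("h1_1132.05_S5R3__679_S5R3b_1"))  # False
print(is_resend("h1_1132.05_S5R3__679_S5R3b_2"))  # True
```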

The scheduler does seem to make some attempt to give resends out preferentially to those with the necessary data but it also does seem to have a pretty "short fuse" before it gives up and sends the work to the first poor sucker who blunders along. I've actually seen one of my machines get about 9 tasks in succession needing a number of different datasets (about 6 from memory) before finally getting a "new" task with the prospect of more "same frequency" tasks to follow. When you take into account how many hosts there must be working in any given frequency band, the scheduler needs to have a lot more patience in waiting for the "right" host to request more work. If it was programmed to wait longer, it seems likely that this would make a considerable bandwidth reduction.

So for the next few weeks, be prepared for some data mayhem out there :-). It is starting now but probably won't become too bad for another week or so. It will last probably at least a couple more weeks after that and will then tail off as the resends dry up. In many ways at this particular time, it would be really helpful if ALL resends that can't be sent to an external host that already has the correct data, could be sent to internal (cluster) machines so as to minimise the external bandwidth chaos. I have no idea if that's achievable but it would be really nice for the bandwidth constrained out there (like myself :-) ).

Cheers,
Gary.

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

17 Jul 2008 17:50:27 UTC

Message 82761

LOL...

Well, I guess we've pretty much reviewed the whole spectrum of things to expect when one science run ends and the next one begins.

So just to reiterate the main point:

If you don't have bandwidth or connection speed restrictions the best plan is to just not even let yourself get concerned, and treat the whole process like there is no major EAH project milestone approaching at all.

:-D

Alinator

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7411301687

RAC: 1940820

17 Jul 2008 21:02:25 UTC

Message 82762 in response to message 82761

Quote:

Well, I guess we've pretty much reviewed the whole spectrum of things to expect when one science run ends and the next one begins.

So just to reiterate the main point:

If you don't have bandwidth or connection speed restrictions the best plan is to just not even let yourself get concerned, and treat the whole process like there is no major EAH project milestone approaching at all.

I'd guess that a significant fraction of those reading these forums are currently using the anonymous platform mechanism (app_info.xml) to run an app that is either currently beta, or was beta when initially available.

I believe the "do nothing" posture is inappropriate for most of these people, as it will soon get them a pure diet of cleanup leftovers, with very high com traffic, followed by a tapering off to nothing.

As I recall, when turning off app_info.xml use it is rather easy to kill off one's in-process and in-queue Work Units. So perhaps someone who feels conversant with the various decent methods could post here an outline of methods with the advantages and disadvantages of each.

To start the ball rolling, I'll mention one of the simplest:

1. disable new work download
2. run all work units in queue to completion.
3. when complete, assure all units are returned _and_ reported (hand update).
4. stop BOINC, and delete or remove app_info.xml
5. restart BOINC and allow new work download.

Advantages: (fairly) simple, safe, can run any work offered in final state, will automatically track any standard app updates, which may be frequent early in the run.

Disadvantages: requires multiple interventions and somewhat close monitoring, will run new S5R3 work on the standard app, which may be inefficient compared to the app_info app you've been using.
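For anyone who drives BOINC from the command line, the five steps can be sketched as a checklist of `boinccmd` calls. Treat this as an assumption-laden outline rather than a tested recipe: the tool and operation names are those of the `boinccmd` utility as I know it (older clients may use different names), the project URL is a placeholder, and the sketch only builds the commands without running anything. Moving app_info.xml away and restarting BOINC remain manual steps.

```python
PROJECT_URL = "http://example.invalid/project/"  # placeholder, not the real project URL


def transition_steps(project_url):
    """Return (description, command) pairs mirroring steps 1 to 5 above.

    Commands are argument lists for boinccmd; nothing is executed here.
    """
    return [
        ("1. disable new work download",
         ["boinccmd", "--project", project_url, "nomorework"]),
        ("2. wait for the queue to finish (list remaining tasks to check)",
         ["boinccmd", "--get_tasks"]),
        ("3. report finished work by hand",
         ["boinccmd", "--project", project_url, "update"]),
        ("4. stop BOINC, then move app_info.xml away manually",
         ["boinccmd", "--quit"]),
        ("5. after restarting BOINC, allow new work again",
         ["boinccmd", "--project", project_url, "allowmorework"]),
    ]


for description, command in transition_steps(PROJECT_URL):
    print(description)
    print("   " + " ".join(command))
```

Keeping the commands as data makes the ordering explicit: nothing in step 4 should happen until steps 2 and 3 show an empty, fully reported queue.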

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

17 Jul 2008 21:11:38 UTC

Message 82763

Possibly, but I'd argue the case that if you've gone to the trouble of going optimized here on EAH, then you're not the type of user who routinely falls asleep at the trigger about such matters.

Personally, I deliberately left mine set up for cleaning out S5R2 the last time, under the philosophy that the sooner it was history, the better. :-)

In any event, it's no big deal to add the info to app_info when the next run starts, once you start seeing your host go idle, and of course this assumes there's going to be an app name change which would cause the issue in the first place.

OTOH, your procedure seems reasonable for the bandwidth impaired. :-)

Alinator

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 837759235

RAC: 994122

17 Jul 2008 21:34:35 UTC

Message 82764 in response to message 82763

Quote:

In any event, it's no big deal to add the info to app_info when the next run starts, once you start seeing your host go idle, and of course this assumes there's going to be an app name change which would cause the issue in the first place.

I'm not sure I got this right, but modifying the app_info.xml so that the app accepts S5R4 workunits would be a very bad idea. S5R4 will AFAIK use a new format for the results sent back to the server which requires a change in the app, so S5R4 WUs will be incompatible with S5R3 apps. In order to continue work for S5R4, you will have to download the new apps, which is easiest to do by removing the app_info.xml.

Or is it possible to set up the app_info.xml so that you maintain two sets of executables, one for S5R3 and one for S5R4, and the scheduler can choose which one it would like to send you?

CU
Bikeman

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

17 Jul 2008 22:27:15 UTC

Message 82765 in response to message 82764

Quote:

Quote:

In any event, it's no big deal to add the info to app_info when the next run starts, once you start seeing your host go idle, and of course this assumes there's going to be an app name change which would cause the issue in the first place.

I'm not sure I got this right, but modifying the app_info.xml so that the app accepts S5R4 workunits would be a very bad idea. S5R4 will AFAIK use a new format for the results sent back to the server which requires a change in the app, so S5R4 WUs will be incompatible with S5R3 apps. In order to continue work for S5R4, you will have to download the new apps, which is easiest to do by removing the app_info.xml.

Or is it possible to set up the app_info.xml so that you maintain two sets of executables, one for S5R3 and one for S5R4, and the scheduler can choose which one it would like to send you?

CU

Bikeman

Yes, you can maintain two completely separate applications in app_info and they'll both work fine. I've done it on SAH Beta more than once.

The anonymous platform is no different in this regard from the automatic one except for the fact that the user must define the requirements themselves and is responsible for making the upgrades to the app and app_info as required.

Also, getting the new app is no big deal. You can DL it independently from BOINC right from the project download directory.

Alinator

roadrunner_gs

Joined: 7 Mar 06

Posts: 94

Credit: 3369656

RAC: 0

17 Jul 2008 23:42:13 UTC

Message 82766 in response to message 82764

Quote:

(...)
In order to continue work for S5R4, you will have to download the new apps, which is easiest to do by removing the app_info.xml.
(...)

"Houston, we've got a problem here..."
I haven't the least idea how to remove the app_info.xml on one or two clients, and I think there are some contributors who have the same problem. ;)

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119914978923

RAC: 26480211

18 Jul 2008 2:32:06 UTC

Message 82767

If you are interested in being able to support both the resends of the current run and the new tasks of the new run when they start to appear, then check out this ancient thread about how the app_info.xml mechanism actually works and what you can do with it.

At the time of the last transition from S5R2 to S5R3, I put in place on my hosts, a mechanism which allowed me to receive both old and new tasks at any time - even simultaneously. That technique was documented in that thread as well.

You can't really "prepare in advance" too much because you have to know precise names, etc, which will probably only be "set in cement" right at the time the stuff is first released. However this doesn't really matter since there will probably be (based on previous experience) quite a period of "joint availability" during which you can set aside 10mins to edit your app_info.xml file appropriately.

So what you could do as soon as the new run is "live" and the apps are available (but you are still crunching the old) is simply download copies of the new app files and place them in your E@H project directory, edit your copy of app_info.xml there, and then stop and restart BOINC. From that point on you are "new and old app ready" and will be announcing same to the server every time you contact.
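To give an idea of what the edited file might look like, here is a minimal sketch that generates an app_info.xml listing two applications side by side. The element layout (`app`, `file_info`, `app_version`, `file_ref`) follows the anonymous platform format as I understand it. Every app name, version number and executable name below is a placeholder, since the real ones are only known once the new apps are released.

```python
import xml.etree.ElementTree as ET

# Placeholder entries: (app name, version number, executable file name).
# None of these values are the real ones; substitute the released names.
EXAMPLE_APPS = [
    ("einstein_S5R3", 100, "old_run_executable"),
    ("einstein_S5R4", 100, "new_run_executable"),
]


def build_app_info(apps):
    """Return app_info.xml text declaring one app_version per application."""
    root = ET.Element("app_info")
    for app_name, version_num, executable in apps:
        # Declare the application itself.
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name

        # Declare the executable file that lives in the project directory.
        file_info = ET.SubElement(root, "file_info")
        ET.SubElement(file_info, "name").text = executable
        ET.SubElement(file_info, "executable")

        # Tie the application to its executable.
        version = ET.SubElement(root, "app_version")
        ET.SubElement(version, "app_name").text = app_name
        ET.SubElement(version, "version_num").text = str(version_num)
        file_ref = ET.SubElement(version, "file_ref")
        ET.SubElement(file_ref, "file_name").text = executable
        ET.SubElement(file_ref, "main_program")
    return ET.tostring(root, encoding="unicode")


print(build_app_info(EXAMPLE_APPS))
```

Real apps may ship with extra support files that need their own `file_info` and `file_ref` entries, so compare the output against the file list of the app you actually downloaded before restarting BOINC.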

Of course, the best scenario is to have Bernd create "special test packages" for people who already are using the app_info.xml mechanism. This would help stop mistakes in editing. People would NOT get this package unless they decided to actually download it themselves. The test package should probably contain

* The latest version of the S5R3 app for their platform
* The new S5R4 app for their platform
* The correct app_info.xml that allows the two to coexist harmoniously
* and perhaps a warning for the storage and bandwidth impaired :-).

By doing this for people already using app_info.xml, it would remove (actually only postpone) the problem of people needing to understand how to correctly disable the mechanism that Peter alluded to. It would have no consequences for people currently not using it - they would be transitioned automatically at the whim of the scheduler. Also, by including item 1 in the list (not really required by people already using app_info.xml) it would provide a mechanism for people currently not using beta apps to actually participate in and help with the cleanup of "old run" tasks - which would be a good thing since that process did actually last for quite a while last time.

Thoughts please!! :-). -- If people think it's OK we can propose it to Bernd once we have thrashed out any problems. At the moment it's off the top of my head and there's always room for improvement.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119914978923

RAC: 26480211

18 Jul 2008 3:34:01 UTC

Message 82768 in response to message 82762

Quote:

I'd guess that a significant fraction of those reading these forums are currently using the anonymous platform mechanism (app_info.xml) to run an app that is either currently beta, or was beta when initially available.

I'd guess exactly that as well.

Quote:

I believe the "do nothing" posture is inappropriate for most of these people, as it will soon get them a pure diet of cleanup leftovers, with very high com traffic, followed by a tapering off to nothing.

This is true but please don't tell them about it :-). Let them console themselves with a warm inner glow engendered by their selfless saving of the rest of E@H humanity. Then, when they can no longer bear the pain, they can be given your very good instructions :-).

Quote:

As I recall, it is rather easy in turning off app_info.xml use to kill off one's in-process and in-queue Work Units.

Absolutely!! As I see it, people really only have 3 likely options:-

1. Use the archae86 safe and sure procedure, or
2. Wait for the current result to finish and then be prepared to trash the rest of your cache, or
3. Replace the current app_info.xml with an edited version that will allow you to migrate to the new app.

Quote:

So perhaps someone who feels conversant with the various decent methods could post here an outline of methods with the advantages and disadvantages of each.

To start the ball rolling, I'll mention one of the simplest:

1. disable new work download
2. run all work units in queue to completion.
3. when complete, assure all units are returned _and_ reported (hand update).
4. stop BOINC, and delete or remove app_info.xml
5. restart BOINC and allow new work download.

Advantages: (fairly) simple, safe, can run any work offered in final state, will automatically track any standard app updates, which may be frequent early in the run.

Disadvantages: requires multiple interventions, and somewhat close monitoring, will run new S5R3 work on standard app, which may be inefficient compared to the app_info app you've been using.

Peter has very competently described option 1.

Option 2. has the advantage of not having to wait so long or intervene as many times but will lead to some wasted crunching on multi-core machines unless all the cores happen to be changing tasks at much the same time. The disadvantage is that you will dump all your remaining cache onto some other poor sucker as "resends", most likely causing a lot of bandwidth wastage. Option 2. is a "desperation" option rather than a socially acceptable option :-).

I think option 3. (with the necessary package provided by the project as described in my previous message) is the best if your main aim is to transition to the new run and simply continue, rather than dropping out of beta testing completely. Use option 1. if you just want to get back to stock crunching. The big advantage of option 3. is that it can be prepared in advance by Bernd (albeit at a time cost to him) without having to rely on the editing skills of participants.

Ultimately it's only postponing something like option 1. if you ever need to go back to stock crunching.

Cheers,
Gary.
