FGRPB1G WU distribution data

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232101326
RAC: 1158692
Topic 210695

From time to time a topic here in Cruncher's Corner, or in Problems and Bug Reports, hinges in part on the variability of the work content of the current flavor of Einstein@Home Gamma-ray pulsar binary search (FGRPB1G) work.  I intend to gather and post some data which I hope will help to clarify this topic.  My contribution will be purely observational, and will be refined over the next few weeks from some initial approximate comments and partial graphs.  I'd welcome others pitching in from actual understanding of how this works, or their own observations.

Task name structure vs. these observations.

A task as downloaded to a particular host has a name such as:

 LATeah0043L_1164.0_0_0.0_13806255_0 

If we construe this name as constituting a number of subfields delimited by underscore characters, then:

Field   Value in example   Significance for us

1.      LATeah0043L        I'll call this the series.  The value, 43 here, increments about once every twelve days.

2.      1164.0             I'll call this the frequency.  I think it is in Hz.

6.      0                  I'll call this the issue number.

When new issues (those with issue number equal to 0 or 1) roll on to a new series, the frequency starts at a very low value.  I've seen 4, and suspect that may be the minimum.  The early low-frequency units are few in number, and issue moves on to three-digit frequencies within a very few hours.  As the frequency rises, the work content (as indicated by elapsed time for completion on a particular system) goes up, with a slight upward trend in some regions punctuated by some major steps (the big jump between frequencies 1004 and 1012 is especially clear).  The number of WUs issued per frequency also goes up substantially.
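For anyone who wants to slice their own task lists the same way, here is a rough Python sketch of that split.  It is purely illustrative; the function name and field labels are mine, and the meanings of the unnamed fields remain unknown to me.

    # Minimal sketch of splitting an FGRPB1G task name into its
    # underscore-delimited fields, as described above.
    def parse_task_name(name: str) -> dict:
        """Split a name like 'LATeah0043L_1164.0_0_0.0_13806255_0'."""
        series, freq, f3, f4, f5, issue = name.split("_")
        return {
            "series": series,          # e.g. LATeah0043L
            "frequency": float(freq),  # e.g. 1164.0, apparently in Hz
            "field3": f3,              # meaning unknown to me
            "field4": f4,              # meaning unknown to me
            "field5": f5,              # meaning unknown to me
            "issue": int(issue),       # 0 or 1 for primary copies, higher for resends
        }

    print(parse_task_name("LATeah0043L_1164.0_0_0.0_13806255_0"))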

I believe that nearly all the observed WU-caused variation in elapsed time arises from this frequency effect, and not from special cases of oddball units.  However there are obvious cases of system-caused ET variation.  For my graphs, I'll not censor the outliers, so you can form your own impression of their importance.

Speaking of graphs, my long-term image storage and hosting site, Photobucket, has terminally betrayed me.  So this is my very first attempt to share an image of some data using Imgur.

The host which processed all the work in this graph is a Windows 10 system running on an i5-4690K CPU, with two GPUs, a GTX 1060 and a GTX 1070, both overclocked but currently running only one WU at a time (1X).

This graph fails miserably to show the behavior in the lowest frequency range in any detail, as my work-fetch chunks happened not to grab work while it was being distributed across most of that range.  I suspect some other gaps in the graph are artifacts of extended pauses in fetching, not a delineation of natural issue behavior.

I am currently working to build up a graph showing the issue of issue-number 0 and 1 units, plotted as WU frequency vs. days since the series began.

People making performance comparisons need to consider at least the grosser of these effects.  In particular, as WUs with frequencies from 1012 up are dominant (and don't vary so very much), simply confining measurements to WUs in that range will help a good deal.  Best, of course, would be to use WUs perfectly matched in frequency.  Even closely matched would help a lot.


archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232101326
RAC: 1158692

I have data for the frequency of distributed tasks vs. days since the beginning of LAT44 unit distribution for the first three days.

The big gap between the highest observed two-digit frequency (52, as it happens) and the lowest observed three-digit frequency (236) spans almost two hours during which I did not happen to receive any new tasks on any of the three hosts combined for this representation.  I am quite sure that tasks with frequencies between these two values exist in the LAT44 distribution, but they are clearly highly uncommon.

The graph range of three days spans about a quarter of the total distribution time I expect for LAT44 work.  I expect the slope to continue to decrease, with more and more tasks distributed per increment of frequency as the frequency goes on up to about 1204 at the end.  I have hopes of capturing better detail for the beginning of LAT45 distribution a couple of weeks from now, by adjusting work fetch parameters to get finer-grained fetching.  I'll update this graph if something interesting shows up, or in any case when LAT44 distribution is finished.
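If anyone wants to build the same sort of plot from their own records, a rough sketch follows.  It assumes a CSV export of the BoincTasks task list with columns named "Task" and "Received"; the file and column names here are just placeholders, so adjust them to match whatever your export actually produces.

    # Sketch: frequency (taken from the task name) vs. days since the first
    # LAT44 task was received, for primary (issue 0 or 1) tasks only.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("boinctasks_tasks.csv")         # placeholder file name
    df["Received"] = pd.to_datetime(df["Received"])

    parts = df["Task"].str.split("_", expand=True)
    df["frequency"] = parts[1].astype(float)
    df["issue"] = parts[5].astype(int)

    lat44 = df[df["Task"].str.startswith("LATeah0044L") & (df["issue"] <= 1)]
    days = (lat44["Received"] - lat44["Received"].min()).dt.total_seconds() / 86400

    plt.scatter(days, lat44["frequency"], s=4)
    plt.xlabel("days since first LAT44 task received")
    plt.ylabel("frequency embedded in task name (Hz)")
    plt.show()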

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Interesting, there is a step at 1000Hz; I will have a look over some of my data.

Is the first graph, showing WU time (Y axis), showing all data for all LAT series or just restricted to LATeah0043L?

Over what date range are the samples selected?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232101326
RAC: 1158692

AgentB wrote:

Is the first graph, showing WU time (Y axis), showing all data for all LAT series or just restricted to LATeah0043L?

Over what date range are the samples selected?

The WU timings graph in my initial post used data from a single (two-GPU) host, taken from the BOINCTasks history list.  It has very little LAT41, about a quarter LAT42, and about three-quarters LAT43.  All the units were reported later than mid-October, and were distributed on average about three or four days before reporting (my BOINCTasks settings specify that history is to be discarded after 14 days).  The LAT41 tasks were almost certainly re-issue work.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

archae86 wrote:
my BOINCTasks settings specify that history is to be discarded after 14 days.

I increased that to 365 days to see the summer dip.

I will be using the job_log_einstein.phys.uwm.edu.txt file, which contains all the tasks completed on that host - all 95K of them!  Not sure if that is what you are using as a source.

I'll have to filter out recent cases as I have changed drivers and switched from x3 to x5 - but there are 9K LATeah0004x records to work with.
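In case it helps anyone else mine their job log, here is a rough parser sketch.  It assumes the usual BOINC job_log line layout - a Unix timestamp followed by key/value pairs such as ue (estimated runtime), ct (CPU time), fe (flops estimate), nm (task name) and et (elapsed time) - so check a line of your own file first and adjust if it differs.

    # Rough parser for job_log_einstein.phys.uwm.edu.txt (assumed layout:
    # "<timestamp> ue <est> ct <cpu> fe <flops> nm <name> et <elapsed> ...").
    def parse_job_log(path):
        rows = []
        with open(path) as f:
            for line in f:
                tokens = line.split()
                if len(tokens) < 3:
                    continue
                rec = {"time": int(tokens[0])}
                # remaining tokens come in key/value pairs
                for key, value in zip(tokens[1::2], tokens[2::2]):
                    rec[key] = value
                rows.append(rec)
        return rows

    tasks = parse_job_log("job_log_einstein.phys.uwm.edu.txt")
    lat42 = [t for t in tasks if t.get("nm", "").startswith("LATeah0042L")]
    for t in lat42:
        freq = float(t["nm"].split("_")[1])
        print(freq, float(t.get("et", 0)))   # frequency vs. elapsed seconds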

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232101326
RAC: 1158692

AgentB wrote:

 job_log_einstein.phys.uwm.edu.txt

<snip>

 Not sure if that is what you are using as a source.

It is not.  The data I have presented in this thread is exclusively from BoincTasks.  The frequency vs. days-since-start-of-series distribution data is obtained from the Task tab, making use of the Received column.  The elapsed time vs. frequency vs. GPU data comes from the history tab.

For the purpose of the work content graph it is fortunate that I currently happen to be running my GPUs at 1X.  Depending on details of system configuration, on systems running at 2X or more, tasks with identical work content can show considerable variation in elapsed time, which leads some people here to mistakenly conclude that forms of task work inequality exist which don't.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117846694984
RAC: 34787870

archae86 wrote:
I'd welcome others pitching in from actual understanding of how this works, or their own observations.

I'm not trying to make extra work or interfere with what you are doing - I think it's a great thing you are doing for the general interest of all and the specific interest of any performance freaks :-).   So, I just want to add some information if I may.

I didn't (at first) notice your sequence of underscore delimited fields was 1., 2., 6. as my mind just saw 1., 2., 3. (the 3rd value also happened to be 0) :-).  When I read it properly, I thought it would be good to try to know about the whole 6 fields if possible.  Maybe someone in the know will fill in the details.  Fields 2 and 5 are also used in the name of the 'template' file that is downloaded with each task.

So I decided to set up a full table with all 6 fields.  I seem to recall that item 2 is a frequency in Hz perhaps related in some way to the pulsar spin frequency being searched for.  Millisecond pulsars exist but it's a bit mind boggling to think of an object that massive spinning at 1200Hz :-).  Maybe the value is some multiple of the true spin frequency.  I'm wondering if that multiple might be 4 so that spin frequencies range from about 1 to 300Hz or so.

Table of all underscore-delimited fields in the task name structure:

Field  Value         Meaning or significance for the FGRPB1G search
1.     LATeah0043L   A series of tasks (all depending on a fixed data file with the same name + .dat extension)
2.     1164.0        A frequency in Hz being searched for in the data (related to spin frequency??)
3.     0             Unknown (someone in the know please advise) :-)
4.     0.0           Unknown (someone in the know please advise) :-)
5.     13806255      Unknown (someone in the know please advise) :-)
6.     0             Issue number - starts from zero and records the sequence of issuing task copies

Quote:
When new issues (those with issue number equal to 0 or 1) roll on to a new series, the frequency starts at a very low value.  I've seen 4, and suspect that may be the minimum.

Yes, on quite a few series of tasks depending on a given fixed data file, the starting frequency has been 4.0 and the full series of frequencies goes up in steps of 8.0Hz - so 4.0, 12.0, 20.0, 28.0, .....  I don't believe there are any gaps in these steps, just low numbers of tasks at low frequencies, so it is very easy to have gaps in the observed task names.  I call tasks with issue numbers 0 and 1 'primary tasks' because these are the only numbers you will ever see if both are completed and returned by the host to which they were originally issued and subsequently validated.

A number higher than 1 signifies an additional copy (a 'resend' copy) which is only needed if a primary task fails or doesn't match during validation.  From the point of view of statistical information, there's no reason to differentiate between primary tasks and resend tasks.

Quote:
I believe that nearly all the observed WU-caused variation in elapsed time arises from this frequency effect, and not from special cases of oddball units.

That is my strong impression as well for all tasks that have the 'full' work content.  Particularly at low frequencies, there are quite a few 'short ends' that don't have the full payload and therefore run considerably faster than you would otherwise expect.  Bernd has mentioned these in the past.

Quote:
However there are obvious cases of system-caused ET variation.  For my graphs, I'll not censor the outliers, so you can form your own impression of their importance.

I see these as well and I believe it may be related to double precision.  Virtually all my results come from AMD GPUs and there are some oddball behaviours occasionally.  A host running fine for a couple of weeks will see the GPU suddenly stop crunching.  CPU tasks are still running normally but the GPU seems to have crashed.  If I plug in a monitor, there is no display but I can initiate a safe shutdown and reboot from the keyboard, so the host was still 'up' at the time.  Once restarted, GPU crunching will resume from a saved checkpoint and complete in due course.

Quite often, the restart of crunching will show a GPU task in the followup stage of crunching (i.e. >89.997% done, although the value listed may not be correct) but the elapsed time might be ridiculously low.  Less frequently it is ridiculously high.  In both situations the %done estimate is sometimes wrong (quite low).  The task completes quickly - maybe within a few tens of seconds of restarting - so it definitely must have been in the followup stage, even though the estimate would make you think not.  These tasks seem to validate OK.  My thinking is that there is something in the followup stage that the GPU sometimes can't handle, causing it to crash.  For some reason, even though there is a checkpoint that can be correctly used, the values for elapsed time and %done are wrong.  Good for creating some of my outliers :-).

 

Cheers,
Gary.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Gary Roberts wrote:
I didn't (at first) notice your sequence of underscore delimited fields was 1., 2., 6. as my mind just saw 1., 2., 3. (the 3rd value also happened to be 0) :-).  When I read it properly, I thought it would be good to try to know about the whole 6 fields if possible.  Maybe someone in the know will fill in the details.  Fields 2 and 5 are also used in the name of the 'template' file that is downloaded with each task.

I have not seen any field 3 with a non-zero value in 52K tasks (starting from LATeah0001L)!

Field 4 is a little more interesting, being non-zero in about 2% of tasks.  These tasks usually appear first in a "sequence", and have values such as -2e-12 to -9.6e-11.

It shows up in tasks of low "frequency"; I see examples from 4 through to 52Hz, and nothing after that.

Quote:
A host running fine for a couple of weeks will see the GPU suddenly stop crunching.

OK - a slight detour.  Just on this, I have a suspicion, based on only one similar observation, that there is a VRAM (or some other resource) leak.  Whilst playing around with recent drivers I spotted the amount of memory open to OpenCL being less than one task's worth (~1GB) when normally there would be 2GB or more.  Run "clinfo" (or "watch clinfo") and that might shed some light.
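If you want a quick way to keep an eye on that, something like the sketch below just filters the clinfo output for the memory lines.  The exact label text ("Global memory size", "Max memory allocation") can vary between clinfo versions and drivers, so treat those strings as assumptions.

    # Quick check of the memory clinfo reports for each OpenCL device.
    import subprocess

    out = subprocess.run(["clinfo"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Global memory size" in line or "Max memory allocation" in line:
            print(line.strip())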

I will have a look over LATeah0043L later.

 

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

AgentB wrote:
I will have a look over LATeah0043L later.

I looked at 0042 instead as that had almost all its data points at x3 with driver 17.30.

Some observations from looking at 2200 tasks:

A frequency gap (for me) between 52 and about 300 Hz with no tasks.

 

[graph: LATeah0042L]

 

A few outliers.

Not much to see - apologies for the wide graph....

edit: or the narrow forum :-P

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232101326
RAC: 1158692

Gary Roberts wrote:
From the point of view of statistical information, there's no reason to differentiate between primary tasks and resend tasks.

Regarding performance, yes, and I've not filtered the elapsed time vs. frequency graph.  However, regarding the description of issue frequency vs. time since the start of a series, there is a strong reason to restrict the data.  I restricted all the way to issue number 0 to build the habit, just in case locality scheduling on some future work might make that prudent.  For the search under discussion, allowing both 0 and 1 would work fine, but allowing higher issue numbers would greatly compromise some uses of the graph.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232101326
RAC: 1158692

I've accumulated frequency vs. days-since-distribution-began data covering the entire distribution period for LAT44, plus the beginning of LAT45, including tasks received by all three of my hosts, so long as the replication (issue number) was 0 or 1.

There is clearly a major distinction between tasks distributed with task names embedding a frequency of 4 to 52, and the higher ones. 

As an aid to reviewing this early behavior, I've made an expanded scale graph of the early portion of LAT45 task distribution, and spent time up before dawn gradually increasing requested queue depth and repeatedly doing forced updates in an attempt to fill in this portion of the behavior.

1. Sampling sparsity aside, the early region for LAT45 looks just like that for LAT44.  This is not a one-time anomaly.

2. The tasks in the 4 to 52 portion ALL have non-zero values in field 4 of the task name (or nearly all, I may have spotted an exception among hundreds), while the tasks above 52 ALL have the value 0.0 in that field.

3. While the servers spent about seven hours per frequency distributing individual frequency values in the late part of the twelve-day cycle, in the early part the time spent per frequency is far shorter, strongly suggesting that far fewer tasks are generated per frequency there.  This is most strongly true not of the 4 to 52 region, but of the region just above 52, which does not recover to a distribution duration per frequency similar to that of the 4 to 52 portion until about frequency 600 (as can be eyeballed from the slope of the graph; a rough way to estimate these per-frequency durations is sketched after this list).

4. Effects of one or more buffers in the distribution pipeline, with non-FIFO behavior, are apparent: even with the restriction to replication 0 tasks only, some out-of-order frequency distribution takes place.  This is especially dramatic in the early part of the "above 52" region, in which tasks retrieved in a single request by a single host can come from five or more frequencies.
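For what it's worth, the per-frequency distribution durations mentioned in point 3 can be estimated with something like the sketch below, where tasks is assumed to be a list of (received datetime, task name) pairs taken from your own records:

    # Estimate how long the servers spent issuing each frequency by taking
    # the span of received times within each frequency group.
    from collections import defaultdict

    def hours_per_frequency(tasks):
        by_freq = defaultdict(list)
        for received, name in tasks:
            freq = float(name.split("_")[1])
            by_freq[freq].append(received)
        return {
            freq: (max(times) - min(times)).total_seconds() / 3600.0
            for freq, times in sorted(by_freq.items())
        }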

I tried making a graph of computation elapsed time vs. task frequency for one of my hosts using LAT44 data, but I had not manipulated the task fetch properly to get a well filled-in graph.  I'm trying to do better for LAT45, and I intend to post that result a couple of weeks from now.

