RAC - is it of any use? Why bother to look at it or record it?

AgentB
AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: I'm pretty fascinated

Quote:
I'm pretty fascinated how powerful script could be.

There are a number of ports of these utilities which I use on Windows, especially at work. If I recall, http://gnuwin32.sourceforge.net/packages.html was my last foray.

Not quite as elegant as native unix, but close. When confronted with a 500 MB csv log file, and you need columns 4-7, 17 and 33 - and the last 100 rows which match some complex string, which needs tidying up - priceless. Then it loads nicely into Excel.

The first things to learn are the basic "regular expressions", which are an elegant and powerful way of describing a pattern - and yes, it looks like stuff from the Matrix!
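
To give a flavour (just a sketch - the file name, column numbers and pattern below are made up for illustration), the sort of one-liner I mean looks like this:

# keep the last 100 rows matching a pattern, pull out columns 4-7, 17
# and 33, squeeze repeated spaces and write the result out for Excel
grep -E 'ERROR [0-9]{3}' big_log.csv | tail -100 | \
    cut -d, -f4-7,17,33 | sed 's/  */ /g' > for_excel.csv

The grep pattern is one of those regular expressions - 'ERROR' followed by a space and exactly three digits.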

AgentB
AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: By counting the number

Quote:
By counting the number of returned tasks, the problem that really interests me (host appears normal but has stopped returning GPU tasks) should be 'seen' within the 8 hour cycle time, in most cases, and certainly well within 16 hours at worst.

@Gary I had a think about what value to alert on.

If you are running 7*24 and say 6 tasks, then the total time of all elapsed tasks should be above a certain level - in my case it is usually above 5.8 computing days per day, and that will even out short and long tasks. Monitoring that should trap any hangs on either the GPU or the CPU.

If I get keen I might set up a cron job to email me and have the box reboot itself when it drops below 5.
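
Something along these lines, perhaps (a rough sketch only - the job log path, threshold and mail address are placeholders, and it assumes a job log format where field 1 is the unix completion time and the field after 'et' is the elapsed time in seconds):

#!/bin/bash
# run from cron once a day: sum the elapsed time of tasks completed in
# the last 24 hours and send an alert if it falls below the threshold
joblog=/var/lib/boinc/job_log_einstein.phys.uwm.edu.txt   # placeholder path
threshold=5.0                                              # computing days per day
since=$(( $(date +%s) - 86400 ))

days=$(awk -v t=$since '$1 > t { for (i = 1; i < NF; i++) if ($i == "et") sum += $(i+1) }
                        END { printf "%.2f", sum / 86400 }' $joblog)

if [ $(echo "$days < $threshold" | bc -l) -eq 1 ]; then
    echo "Only $days computing days in the last 24h on $(hostname)" |
        mail -s "BOINC throughput alert" me@example.com
    # reboot    # uncomment (and run as root) if the box really should restart itself
fi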

You still need to monitor RAC to detect upload problems and invalids, so don't give up on the RAC.

cliff
cliff
Joined: 15 Feb 12
Posts: 176
Credit: 283452444
RAC: 0

Hi Mikey, RE: I

Hi Mikey,

Quote:

I have 15 pc's running with 11 of them crunching various Boinc Projects. Logging onto each one every morning and looking at its rac gives me a quick look at how it's doing compared to yesterday or a week ago etc. Going to the different project webpages gives me a better look at how each pc is doing compared to my other pc's at the same project.

I tend to look at the number of credits allocated on a daily basis; it's no more accurate a determination of 'my' throughput, but I'd say it gives as good an approximation of both my & my wingpersons' validated tasks as anything else.

Anyway a higher figure is much more ego boosting than a lower one:-)

I've long since gotten over the idea that I could ever outpace the top crunchers, even in the 'classic' days it was obvious that anyone with a significant number of rigs crunching 24/7 would outperform my kit:-)

So I'm happy to trundle along, and crunch whatever's available in the astronomy line of DC projects.

What I won't do is crunch for private enterprises that will then patent the results for their own profit. Those folks can buy their own kit and pay to run it.

The real problem is that far too many DC projects keep running into major problems with 'their' kit, leading to virtual shutdowns:-(
So far over the past few months S@H, E@H & A@H have all suffered from significant problems and outages. MW@H seems fairly stable, touch wood:-)

So throughput is compromised by server-side problems; even down at my level with 2 rigs there has been a definite shortage of WU and therefore diminished throughput.

Regards,

Cliff,

Been there, Done that, Still no damm T Shirt.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117990921468
RAC: 21127644

RE: @Gary I had a think

Quote:
@Gary I had a think about what value to alert on.


Thanks very much for your interest.

Essentially, I have two distinct groups of GPUs. Firstly, there are GTX550Tis, 650s, and a sole 650Ti. They are all running on relatively modern motherboards (a few Sandy Bridge, but mainly Ivy Bridge) and I can't recall any of these ever showing the 'GPU stops progressing' problem. Secondly, there is a large group of HD7850s, some of which are running on 6 year old architectures, both Intel and AMD, and the rest on Haswell and Haswell refresh. The problem shows up in the 'old' architectures and seems to be on particular motherboard models. All hosts are running in the same environment but I see the problem repeating on very particular hosts (both Intel and AMD) but not at all on other hosts - eg a group of Haswells. The cycle time for repeats of the problem is quite long - of the order of 60 days and doesn't seem to be temperature related.

With the advent of the beta-1.52 app, all my HD7850s (irrespective of host architecture) are now producing 4 concurrent tasks every 5 hours (pretty closely). With the 8 hour interval between script visits, a 7850 enabled host will have produced at least 4 uploaded GPU results as an absolute minimum. The precise 'average' production for 1.25 hours per task would be 8/1.25 which is 6.4 tasks but I reckon the 'worst case' scenario would be if there were 4 'very low progress' tasks in play at the time of a visit. At the next visit, they should all be completed and returned (if things are normal) and 4 replacements should have started and perhaps accumulated only about 2 to 3 hours progress each.

My intention is to change my control script so that when a 7850 enabled host is visited, it calculates the number of 'PM' lines in the job log for the previous 8 hours. It will then issue an alert if it finds less than 4. I haven't done it yet but it should be a pretty simple addition to the script.

EDIT: At the moment, the control script reads a hosts file at the start of each loop to tell it what hosts to visit and what RAC to expect. If I want to withdraw a host for service or repair, I can edit the hosts file to 'comment out' any particular host. The hosts file is just a 2 column CSV list of IP address final octets and the estimated RAC. A '#' as the first character of any IP octet causes that particular host to be skipped until the '#' is removed. I could easily add a third column to each CSV line whose value would be the minimum expected number of completed GPU tasks per 8 hour period. A zero would tell the script there is no GPU to worry about and anything above zero would be a real minimum. This way, different GPUs with different limits could easily be added. This limit number could also be associated with a text string, eg '4PM' rather than just '4' to allow the script to easily handle science runs other than the current BRP6 where the identity string is indeed 'PM'. Also, I could play with different GPU task concurrencies and have a 'per host' way of adjusting the limit if it needs to change.
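
Purely as an illustration (the octets, RAC figures and limits below are all made up), the hosts file with the proposed third column might look like this:

101,45000,4PM
102,45000,4PM
#103,38000,4PM
110,8000,0
121,30000,2PM

Here the '#103' line is a host temporarily withdrawn for repair, the '0' for host 110 means there is no GPU to check, and host 121 has a slower GPU with a lower limit of 2 BRP6 ('PM') tasks per 8 hours.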

One thing I did learn quite a while ago was to try to plan for future changes to science runs. I've been very lucky so far. The script has handled FGRP2/3/4 changes as well as BRP5/6 without having to edit the code logic to cope with the differences.

Quote:
You still need to monitor RAC to detect upload problems and invalids, so don't give up on the RAC.


Absolutely :-). Capability only ever gets added, doesn't it?? :-).

Cheers,
Gary.

mikey
mikey
Joined: 22 Jan 05
Posts: 12746
Credit: 1839147599
RAC: 3514

RE: Hi Mikey, RE: I

Quote:

Hi Mikey,

Quote:

I have 15 pc's running with 11 of them crunching various Boinc Projects. Logging onto each one every morning and looking at its rac gives me a quick look at how it's doing compared to yesterday or a week ago etc. Going to the different project webpages gives me a better look at how each pc is doing compared to my other pc's at the same project.

I tend to look at the number of credits allocated on a daily basis; it's no more accurate a determination of 'my' throughput, but I'd say it gives as good an approximation of both my & my wingpersons' validated tasks as anything else.

Anyway a higher figure is much more ego boosting than a lower one:-)

I've long since gotten over the idea that I could ever outpace the top crunchers, even in the 'classic' days it was obvious that anyone with a significant number of rigs crunching 24/7 would outperform my kit:-)

So I'm happy to trundle along, and crunch whatever's available in the astronomy line of DC projects.

What I won't do is crunch for private enterprises that will then patent the results for their own profit. Those folks can buy their own kit and pay to run it.

The real problem is that far too many DC projects keep running into major problems with 'their' kit, leading to virtual shutdowns:-(
So far over the past few months S@H, E@H & A@H have all suffered from significant problems and outages. MW@H seems fairly stable, touch wood:-)

So throughput is compromised by server-side problems; even down at my level with 2 rigs there has been a definite shortage of WU and therefore diminished throughput.

Regards,

Yeah I have been fortunate enough to have enough cash flow to sustain me, otherwise I too would have far fewer kits! Personal commitments constrain me to switching projects fairly often, so I can't stay at one project crunching just for it for very long. I tend to come and go, so while it says I have been here for 10 years, it is not 10 years in a row by any stretch. Most cpu type projects I can stay at until I hit my goal, but gpu ones tend to be troublesome for me. That's why rac is one of the things I keep an eye on, though total credits at each project is a bigger thing for me.

As for MW@H, they do seem to be meeting their funding goals thru various means and are even working on getting new hardware that should help them keep going.

A@H is having issues, but the Admin says he is starting to see the light at the end of the tunnel, I just hope it's not a train!

I don't know anything about E@H, but S@H got some funding recently so hopefully they also can see the light.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117990921468
RAC: 21127644

RE: My intention is to

Quote:
My intention is to change my control script so that when a 7850 enabled host is visited, it calculates the number of 'PM' lines in the job log for the previous 8 hours. It will then issue an alert if it finds less than 4. I haven't done it yet but it should be a pretty simple addition to the script.


I've made the modifications to my control script so that the 'hosts' file does contain both a configurable expected number of uploaded tasks per 8 hour loop and the search string that will be used to detect a particular type of task entry in the job log file. The modifications are being tested at the moment and so far hosts are being reported as 'OK' so they must have at least 4 tasks for HD7850s and 2 tasks for GTX650s - these being the limits I chose. I'll only really know how good this is when a GPU next stops producing and the host is flagged as 'not OK - please check' :-).

The code that does the work is quite trivial really. It is just as in the code excerpt below where

$gtasks is the calculated number of gpu tasks identified in the job log
$ip is the IP address of the particular host being processed
$joblog is the full path to the einstein job log on that host
$ttime is the unix time for 8 hours ago on that host
$nm is the two letter code that identifies particular tasks in the job log (eg PM)
$ltasks is the expected lower limit for completed tasks in 8 hours for that host.

$nm and $ltasks are read in from the hosts file and are unique to each host entry. Setting $ltasks to zero in the hosts file will turn off checking completed tasks for that host.

if [ $ltasks -gt 0 ]; then
    # count job log lines newer than $ttime that contain the task identity string
    gtasks=`ssh $ip tail -50 $joblog | awk -v t=$ttime '$1 > t' | grep $nm | wc -l`
    if [ $gtasks -lt $ltasks ]; then
        # Report the potential problem - ie make an entry in the log file and
        # flag the IP octet in colour on the console screen
        :
    fi
fi

The single line that works out the value of $gtasks uses ssh (secure shell) to make a connection to a particular IP address ($ip) and run the tail command on the remote host. The output comes back over that connection and the rest of the pipeline (awk, grep, wc) filters and counts it on the control host, giving the answer to the script for further processing. The ability of ssh connections like this to run commands on remote hosts is really useful.

If no problem is found (the number of tasks is not less than the set limit) nothing is reported and the script carries on as normal.
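
One way to obtain $ttime from the remote host's own clock (just a sketch - the real script may do this differently) is:

# ask the remote host for its current unix time, then subtract 8 hours
ttime=`ssh $ip date +%s`
ttime=$(( ttime - 28800 ))    # 28800 seconds = 8 hours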

Cheers,
Gary.

cliff
cliff
Joined: 15 Feb 12
Posts: 176
Credit: 283452444
RAC: 0

Hi Mikey, RE: A@H

Hi Mikey,

Quote:


A@H is having issues, but the Admin says he is starting to see the light at the end of the tunnel, I just hope it's not a train!

I don't know anything about E@H, but S@H got some funding recently so hopefully they also can see the light.

A@H is running more or less OK - there are now WU, and downloads are getting through without too many failures:-)

Came across an oddity with A@H & MW@H.. While A@H was task short, I attached my 2nd rig to E@H.. No problems there, then A@H delivered some WU, and since E@H was running on 2 GPU's, A@H loaded 2 tasks on CPU.. It had only ever run 1 at a time prior to running E@H.. Strange, so I looked at local settings, saw I had set CPU% to 25%.. reduced it to 22% and A@H put 1 of the tasks it was crunching into waiting mode..

Weird behaviour:=/ But when E@H finished its WU and MW@H started to crunch WU it got stranger.. A@H promptly stopped its WU and went to waiting to run.
I stopped 1 MW@H task, and A@H loaded and ran 1 of its WU..

Anyway I've tried altering the priority for each project, it doesn't seem to make any difference... So now I've set CPU% back to 25% and have 2 tasks running and 1 E@H plus 1 MW@H..

What will happen when the E@H task completes is a guesstimate:-) I suspect that A@H will go to waiting to run again as soon as MW@H loads 2 WU..

Weird since it didn't use to do that before E@H was running..

Cliff,

Been there, Done that, Still no damm T Shirt.

mikey
mikey
Joined: 22 Jan 05
Posts: 12746
Credit: 1839147599
RAC: 3514

RE: Hi Mikey, RE: A@H

Quote:

Hi Mikey,

Quote:


A@H is having issues, but the Admin says he is starting to see the light at the end of the tunnel, I just hope it's not a train!

I don't know anything about E@H, but S@H got some funding recently so hopefully they also can see the light.

A@H is running more or less OK - there are now WU, and downloads are getting through without too many failures:-)

Came across an oddity with A@H & MW@H.. While A@H was task short, I attached my 2nd rig to E@H.. No problems there, then A@H delivered some WU, and since E@H was running on 2 GPU's, A@H loaded 2 tasks on CPU.. It had only ever run 1 at a time prior to running E@H.. Strange, so I looked at local settings, saw I had set CPU% to 25%.. reduced it to 22% and A@H put 1 of the tasks it was crunching into waiting mode..

Weird behaviour:=/ But when E@H finished its WU and MW@H started to crunch WU it got stranger.. A@H promptly stopped its WU and went to waiting to run.
I stopped 1 MW@H task, and A@H loaded and ran 1 of its WU..

Anyway I've tried altering the priority for each project, it doesn't seem to make any difference... So now I've set CPU% back to 25% and have 2 tasks running and 1 E@H plus 1 MW@H..

What will happen when the E@H task completes is a guesstimate:-) I suspect that A@H will go to waiting to run again as soon as MW@H loads 2 WU..

Weird since it didn't use to do that before E@H was running..

Are you talking about the 22% and 25% numbers in the Boinc Manager under the "while processor usage is less than..." section? If so, that number is backwards from what most people think: the lower it is, the less cpu crunching you do, unless it is zero, in which case there is no restriction. That number tells Boinc to stop crunching if that amount of cpu is being used, not free. So it makes sense that UPPING the number would let another unit run.

cliff
cliff
Joined: 15 Feb 12
Posts: 176
Credit: 283452444
RAC: 0

Hi Mikey, RE: What

Hi Mikey,

Quote:

What will happen when the E@H task completes is a guesstimate:-) I suspect that A@H will go to waiting to run again as soon as MW@H loads 2 WU..

Weird since it didn't use to do that before E@H was running..

Are you talking about the 22% and 25% numbers in the Boinc Manager under the "while processor usage is less than..." section? If so, that number is backwards from what most people think: the lower it is, the less cpu crunching you do, unless it is zero, in which case there is no restriction. That number tells Boinc to stop crunching if that amount of cpu is being used, not free. So it makes sense that UPPING the number would let another unit run.

Yup, gottit in one:-) And that was a clue to some of the odd behaviour: MW@H uses more CPU with the GPU than E@H, to the point that it 'seems' A@H is CPU-starved when MW@H runs on 2 GPU's. Where E@H uses 0.02 CPUs per WU, MW@H uses 0.6xx per WU..

I'm going to up the CPU usage a bit more and see if that allows both MW@H & A@H to run at the same time with full GPU and CPU utilisation.

However, given that ambient temps are now 25C plus here, I'm going to be watching CPU core temps closely - I don't want another dead CPU just to run 1 more instance of an A@H WU..

Just adding one more project, it seems, can throw up some unexpected behaviour. It's a bit odd that this didn't occur prior to adding E@H - I can't actually see the correlation in adding another project, since E@H & MW@H do NOT run together unless I suspend 1 WU on either project to allow both to have a single GPU.

Regards

Cliff,

Been there, Done that, Still no damm T Shirt.

mikey
mikey
Joined: 22 Jan 05
Posts: 12746
Credit: 1839147599
RAC: 3514

RE: Hi Mikey, RE: What

Quote:

Hi Mikey,

Quote:

What will happen when the E@H task completes is a guesstimate:-) I suspect that A@H will go to waiting to run again as soon as MW@H loads 2 WU..

Weird since it didn't use to do that before E@H was running..

Are you talking about the 22% and 25% numbers in the Boinc Manager under the "while processor usage is less than..." section? If so, that number is backwards from what most people think: the lower it is, the less cpu crunching you do, unless it is zero, in which case there is no restriction. That number tells Boinc to stop crunching if that amount of cpu is being used, not free. So it makes sense that UPPING the number would let another unit run.

Yup, gottit in one:-) And that was a clue to some of the odd behaviour: MW@H uses more CPU with the GPU than E@H, to the point that it 'seems' A@H is CPU-starved when MW@H runs on 2 GPU's. Where E@H uses 0.02 CPUs per WU, MW@H uses 0.6xx per WU..

I'm going to up the CPU usage a bit more and see if that allows both MW@H & A@H to run at the same time with full GPU and CPU utilisation.

However, given that ambient temps are now 25C plus here, I'm going to be watching CPU core temps closely - I don't want another dead CPU just to run 1 more instance of an A@H WU..

Just adding one more project, it seems, can throw up some unexpected behaviour. It's a bit odd that this didn't occur prior to adding E@H - I can't actually see the correlation in adding another project, since E@H & MW@H do NOT run together unless I suspend 1 WU on either project to allow both to have a single GPU.

Regards

Hi Cliff,

Yeah it seems we users are the testers to see which projects play nice with each other and which ones do not - some definitely work much better together than others though!!

Upping the cpu percentage may cause more lag when you try to do other things, so go slowly and back off when it gets unbearable. It seems as though you too are headed for some Boinc only machines in your future!!
