Just recently I've been venturing into a bit of overclocking on my GTX750 and 750Ti cards. I thought it prudent to watch the output more closely, and found myself puzzling over a type of inconclusive case I don't understand.
In the type of inconclusive I believe I understand, an initial quorum contains two results which don't generate any runtime concerns on the host, nor fail any checks done by the project servers, but don't agree sufficiently closely with each other to generate validation for both. In that case a third result is sent out, and if it agrees sufficiently well with one of the two, that one "wins" and is validated, with the other discarded as invalid.
So far so good, but five of the six inconclusives in my current Parkes PMPS pending list seem to be something else. In all five of those cases there is a quorum partner listed with a validate error, no task besides mine shown as inconclusive, and an additional result has been sent out. The "validate state" entry for the task shows "Checked, but no consensus yet".
This seems odd, and a bit counter-productive. It seems to me that the validate failure of the quorum partner conveys very little information about the likely fate of my task, and calling it inconclusive mixes the "other partner returned an obvious failure" case with the "your partner and you are both OK so far as we know, but you don't agree" case.
Is this just "the way it is"?
Here are links for some example WU's which show this state at the moment (some of which will doubtless show other conditions soon as work in progress gets returned).
inconclusive 1
inconclusive 2
inconclusive 3
inconclusive 4
inconclusive 5
By contrast, this link shows a WU in the condition I had previously understood to be the standard case--two returned tasks don't agree well enough.
inconclusive as I understand it
inconclusives without a partner comparison
I have also seen that happen with validate errors by my wingmen, so my thinking is that's just the way it is.
RE: Is this just "the way
Yep, and it's been like this for as long as I can remember.
When there are two completed results available, the validator performs a 'sanity' check. If both have 'sane' values, the full comparison is performed. If it is immediately obvious that one result is rubbish, it is marked as a validate error and the other is given the 'no consensus' state. It shouldn't be put back to the 'waiting' state because that loses the information that there has been a preliminary check performed.
For maximum clarity, I guess there could be a new status category which could say something like "Checked, but partner task is invalid" in addition to the 'no consensus' status. That way there would be a clear distinction between a task that is very likely quite OK (because the companion is bad) and a task that is probably 50/50 (because either member of the pair could be bad). The way it's done does 'save' creating this extra status condition.
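If it helps to picture the flow, here's a rough sketch in Python of the logic just described. It is not the project's actual validator code; the is_sane() and results_agree() helpers are just stand-ins for the real sanity check and full comparison:
[pre]
# Sketch of the two-result decision flow described above (not real validator code).

def is_sane(result):
    # Placeholder: a real sanity check would look for obviously bogus values.
    return result is not None and all(v == v and abs(v) < 1e6 for v in result)

def results_agree(a, b, tol=1e-3):
    # Placeholder: the real comparison is application-specific.
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

def check_pair(result_a, result_b):
    """Return (state_a, state_b) for a quorum of two completed results."""
    sane_a, sane_b = is_sane(result_a), is_sane(result_b)
    if sane_a and not sane_b:
        # Partner is obviously rubbish: it fails immediately, mine waits for a new wingman.
        return "Checked, but no consensus yet", "Validate error"
    if sane_b and not sane_a:
        return "Validate error", "Checked, but no consensus yet"
    if not sane_a and not sane_b:
        return "Validate error", "Validate error"
    # Both pass the sanity check, so the full comparison is performed.
    if results_agree(result_a, result_b):
        return "Completed and validated", "Completed and validated"
    # Genuine disagreement: both stay inconclusive until a third result decides it.
    return "Checked, but no consensus yet", "Checked, but no consensus yet"

# Example: one sane result paired with an obviously bad one.
print(check_pair([1.23, 4.56], [float("nan"), 4.56]))
[/pre]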
Cheers,
Gary.
Gary Roberts wrote:When there
Thanks, Gary.
Getting back to my situation of looking at inconclusives for early warning of over-zealous overclocking, I now see a couple of diagnostic points.
If a review of the parent WU to my task shows a single partner as invalid and no other task than mine as inconclusive--rest easy. There is really no evidence in this case of anything worse than any other pending--the jury is still out.
If, on the other hand, my task is one of a pair of inconclusives, initial concern is far higher. In that case a review of the recent track record of the quorum partner may shed light. If the partner has been clean, my result is highly suspect. If the partner has delivered a high rate of errors and inconclusives, then my result has more hope of being exonerated in the end.
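Just to capture that rule of thumb for myself, here's a rough sketch. The inputs and the error-rate cut-off are purely illustrative guesses on my part, not anything the project defines:
[pre]
# Rough worry-level heuristic for one of my 'no consensus' tasks.
# other_task_states: states of the other tasks in the parent WU (excluding mine).
# partner_recent_error_rate: rough fraction of recent errors/inconclusives for the partner host.

def assess_my_inconclusive(other_task_states, partner_recent_error_rate=None):
    states = [s.lower() for s in other_task_states]
    if "validate error" in states and "inconclusive" not in states:
        # The partner failed the obvious check; mine is just another pending.
        return "rest easy - no evidence against my task yet"
    if "inconclusive" in states:
        # A genuine disagreement; the partner's track record matters.
        if partner_recent_error_rate is not None and partner_recent_error_rate > 0.1:
            return "partner looks flaky - some hope of being exonerated"
        return "partner looks clean - my result is highly suspect"
    return "jury still out"

print(assess_my_inconclusive(["Validate error", "In progress"]))
print(assess_my_inconclusive(["Inconclusive"], partner_recent_error_rate=0.02))
[/pre]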
That helps.
Thanks.
Yes, that's exactly the way
Yes, that's exactly the way to assess the likely outcome of any inconclusives you are seeing.
I set up an R7 260X over a month ago and tried it at 2x, 3x and 4x. It gave a couple of validate errors at 4x (with no improvement in the time per task) so I decided on 3x with 2 free CPU cores (old Q8400 host) for good measure. The validate errors stopped and things seemed to be going along fine.
Recently summer has arrived with a vengeance (daytime temps in the 30s C) and the ambient in the room is now about 7C warmer than when the machine was set up. Almost a week ago I was alerted by a falling RAC. On closer inspection I noticed about 10% of the BRP6 tasks were going to the 'no consensus' status for both tasks, and on checking the quorum partners there was no sign of the other host being a problem. Sure enough, all of mine in these pairs were eventually turning into invalids.
The card had a factory overclock and, being on Linux, there is no Afterburner GUI based tweaking tool but AMD do supply a CLI tool (aticonfig) so at least I could check freqs and temps. The temp was reported as 64C so I decided that was OK. The freqs were 1175MHz (core) and 1625MHz (mem). I reduced them to 1150/1600. The card has been running this way for several days and so far there are no further 'dual inconclusives'. It's amazing what a small change in freq can do :-).
It's also amazing what you can do with such an unreliable indicator as RAC. I have around 90 hosts being controlled by a script on a server machine. The original purpose was to save downloads by caching and deploying new data files for FGRP tasks. In the last couple of months, I've put a lot of time and effort into adding and improving problem detection and stats logging capabilities. The script can now detect and report on 16 separate conditions, some of which are hard errors and some are just warnings. Whilst quick reporting of hard errors was probably the main goal, the detection of more subtle situations was also on the agenda - things like hosts not fetching or returning the expected numbers of different types of tasks, or greater than usual RAC fluctuations that deserve a bit of a look.
This subtle problem of a small proportion of GPU tasks failing validation would not have been detected as quickly as it was without the help of the script. Most of the hosts never get looked at for weeks or even months at a time. I rely pretty much entirely on the script logs, with every host being checked and logged 7 times a day. One of the conditions flagged is the RAC becoming less than 90% of what is expected. The theoretical RAC for this host is 50K and on 10th Oct it had reached 48K and was still slightly rising. By around the 13th it had dropped below 45K, was still dropping and was flagged in the logs. On the website, some invalid tasks and some double inconclusives confirmed the problem. I changed the freqs as indicated and today's RAC is back up over 47K and still rising.
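For anyone wanting to build something similar, a rough sketch of that particular check might look like the following. The host name, the expected-RAC table and the current-RAC figure are illustrative only; my actual script and its data sources aren't shown here:
[pre]
# Flag a host whose RAC has fallen below 90% of its expected value.

EXPECTED_RAC = {"host_q8400": 50_000}   # hand-maintained theoretical RAC per host
THRESHOLD = 0.90                        # warn when RAC drops below 90% of expected

def check_rac(host, current_rac):
    expected = EXPECTED_RAC[host]
    if current_rac < THRESHOLD * expected:
        return f"WARNING: {host} RAC {current_rac:.0f} < {THRESHOLD:.0%} of {expected}"
    return f"OK: {host} RAC {current_rac:.0f}"

# Example: a reading like the one that flagged this host's drop below 45K.
print(check_rac("host_q8400", 44_800))
[/pre]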
Cheers,
Gary.
RE: The card had a factory
For AMD, AMDOverdriveCtrl is my GUI weapon of choice for Ubuntu/Linux. As winter is coming I may need to turn up the heating...
On my old GTX460s I found a tiny overclock started generating a couple of invalids and other weird (mainly display) problems, with almost no net RAC benefit. Another thing the RAC reflects is tasks slowing as processors start to down-clock for a reason, typically heat.
I don't think the BOINC user interface handles invalids well... there is virtually no headline feedback on the Simple (or Advanced) BOINC Manager front view about RAC delta changes or invalid tasks - just a project total in the Simple View.
Most users, I guess, install and forget (which is good), but they would like to be told when something goes awry.
Seeing invalids is the first step in fixing them; I posted recently over in the BOINC forum: Invalid tasks, how best to raise an alert. The sooner you see them, the sooner they are likely to be fixed.
Gary Roberts wrote:It's also
In monitoring my little 3-host flotilla, I've upgraded my detection considerably by constructing a substitute for RAC which includes the (fluctuating) value of pending work.
My method uses spreadsheet entries and cell calculations. It does not use RAC at all, but instead records, on each time-entry row, the total credit awarded and the number of pending units by application. Here at Einstein, where credit per WU is (almost) constant by application for the applications I use, and credit loss to invalid results is rare, valuing the pending work by a simple multiplication for each application works well. The final step is to convert delta credit between two rows to a rate, which just means subtracting the time-entry fields to get delta time and dividing.
In my case I usually make one entry each morning, and I calculate credit rates for 1 row back (which I label 1 day back--as it usually is, approximately) and 5 days back. The 5 day back one suffers less noise from random variation in completion fractions for work in active computation, while the 1 day back one reacts faster. Despite my labels, the real time divisor is not an integer number of days, but computed from the time stamps which I log at 1 minute resolution. As my logging is manual, this is necessary to allow for variations in the moment I actually capture data each day.
I don't do this on a per host basis, but with my small 3 host flotilla, the total flotilla rate values have pretty good detection power.
I'm not really pitching this to you, Gary, as it might not fit your automation model, but as the way variation in pending work alters RAC is a frequent concern here, I thought I'd share this method.
Just to jump to the chase, in Microsoft Excel the cell formula I use in the "5-day" calculation is:
=IF(ISNUMBER(A3092),(AG3092-AG3087)/(A3092-A3087),NA())
Where column A contains the observation date-time, in my case in the format [pre]17-Oct-2015 07:15[/pre]
and column AG contains the calculated sum of current awarded credit plus estimated value of pending credit. Predecessor columns contain the observed total credit awarded to date, and pending WU count by application. I've got the expected credit per WU by application hardwired in to cell formulas currently--a rather crude approach which requires tinkering each time my chosen applications change.
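For anyone who prefers a script to a spreadsheet, the same calculation could be sketched in Python roughly as below. The credit-per-WU values and the observation figures are illustrative only, not my real numbers:
[pre]
# Pending-adjusted credit rate, assuming near-constant credit per WU by application
# and negligible credit loss to invalids (as described above).
from datetime import datetime

CREDIT_PER_WU = {"BRP6": 4400, "FGRP": 1000}   # illustrative values only

def adjusted_credit(total_awarded, pending_counts):
    """Total awarded credit plus the estimated value of pending work."""
    return total_awarded + sum(CREDIT_PER_WU[app] * n for app, n in pending_counts.items())

def credit_rate(older, newer):
    """Credit per day between two (timestamp, adjusted_credit) observations."""
    (t0, c0), (t1, c1) = older, newer
    days = (t1 - t0).total_seconds() / 86400
    return (c1 - c0) / days

# Example: two morning observations roughly five days apart.
obs_old = (datetime(2015, 10, 12, 7, 10), adjusted_credit(1_234_000, {"BRP6": 50}))
obs_new = (datetime(2015, 10, 17, 7, 15), adjusted_credit(1_280_000, {"BRP6": 47}))
print(f"{credit_rate(obs_old, obs_new):.0f} credits/day")
[/pre]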
RE: For the AMD
Thanks for that - I wasn't aware it even existed. I checked and codelite is in the repos so I might try downloading and building amdovdrvctrl as suggested in the instructions.
I'm not into overclocking GPUs at all yet. A lot come with factory overclocks and it takes quite a while and a lot of fiddling to find the sweet spot. Then winter turns into summer and suddenly tasks start failing on multiple machines. I've fiddled with the AMD supplied CLI utility but in the end decided the safest option was to leave them on stock settings. The current situation was the first time I've had to downclock a factory OC.
If you go to your account page on the website, there is a 'Tasks' link to all results on the website for that user ID. I just clicked mine, expecting it to take quite a while to assemble. It came back with results 1-20 of around 23,000 surprisingly quickly. I clicked on the 'Error' state and this took a lot longer, around 15 seconds or so. I see that I have 32 'error' tasks.
I was keen to look at this because around this time yesterday there was a power glitch, enough voltage drop for long enough to make the lights flicker and to make a lot of machines crash or attempt to reboot. The beeping of computers attempting to reboot is a bit of a giveaway. I wasn't a happy camper at that point as I knew it would take quite a few hours to restore everything to proper working order. During the process I found quite a few computation errors and disks with damaged filesystems that needed 'fsck'ing manually to clean up the corruption. These comp errors were scattered over a number of machines with about half of them on just one machine and dated over a period of days to weeks beforehand. So it looks like there is a problem quite separate to the power glitch.
On checking the hostID for this particular host, it's a machine I wouldn't have expected problems on. I checked the machine and it seemed to be running OK but I did the normal thing of 'feeling the PSU' and just about burnt my fingers. The fan was frozen and it was alarmingly hot, so there's the likely reason for the errors. I replaced the PSU and then re-oiled the fan and it's now spinning very freely. I've had very good success re-oiling fans. I pencil the date in a conspicuous place for future reference and I have units still running fine around 2 years after a re-oil.
So, it is possible to use the website to very quickly assemble a fleet-wide list of errors and invalids with which to look for things going wrong. For someone with just a couple of machines, this would be pretty easy to check on a regular basis. Of course, this doesn't explain why there might be occasional validate errors or invalids. But if any pattern of a particular machine always giving some invalid results were evident, it would indicate that something on that machine was outside the comfort zone and so various hardware checks would be in order.
Cheers,
Gary.
RE: RE: For the AMD
The main use was for temperature control: you can reduce temperature (and power) without losing (much) performance by dropping voltages (to as low as 1.00-1.05V) and/or increasing fan speeds.
Checking for invalids on the website, for every project, regularly, is a chore - the invalids are just not visible unless you dig for them. It's like looking for anti-truffles; best sweep them under the carpet!
In a perfect world my BOINC manager (in a Scotty, or maybe Holly from Red Dwarf, style) would do that check each time it calls for more work, and output an MP3 saying "Captain AgentB - I think we have a problem with the Einstein number 3 app-engine, it cannae take the overclock sir, we may need to throttle back a wee bit", and then I would go to the website and search for clues.
Then I would say BOINC handles invalids well...
I do have a concern that, left unattended in hot weather, rooms get very hot and damage gets done; not everything has a thermal cut-out, and things like laptops in particular have weak cooling.
Many years ago I saw a basement full of comms kit melted together when an air-con failure occurred over a weekend. Amazingly, whilst many things fused together, only two modems (as I recall) failed at the time, but everything else was either replaced or failed within 12 months after that.
Over summer I have been running a script to log fan speeds, HDD, SSD, GPU and motherboard temps, and load stats every 10 minutes. Unfortunately my Corsair PSU has USB monitoring ports but they are pretty hard to get data from in Ubuntu (using a VM to monitor the PSU seemed a bit excessive, but it is interesting to watch power usage and fans respond as a big GPU kicks in).
My thinking is to look at how temps change and set a threshold, like 90C for 10 minutes for the AMD GPU, and if it hits that, shut down (or suspend the GPU, or down-clock, or ...) for 2 hours, then restart.
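Something like this rough sketch is what I have in mind; the temperature-reading and suspend/resume helpers are only stubs that would need wiring up to whatever monitoring tool and BOINC control method you actually use:
[pre]
# Back off for 2 hours if the GPU stays at or above 90C for 10 consecutive minutes.
import random
import time

TEMP_LIMIT_C = 90
OVER_LIMIT_SECS = 10 * 60
BACKOFF_SECS = 2 * 60 * 60
POLL_SECS = 60

def read_gpu_temp():
    return 60 + random.random() * 40          # stub: replace with a real reading

def suspend_gpu_work():
    print("GPU too hot for too long - suspending work")   # stub

def resume_gpu_work():
    print("backoff finished - resuming work")              # stub

def watch_gpu():
    over_since = None
    while True:
        if read_gpu_temp() >= TEMP_LIMIT_C:
            over_since = over_since or time.time()
            if time.time() - over_since >= OVER_LIMIT_SECS:
                suspend_gpu_work()
                time.sleep(BACKOFF_SECS)
                resume_gpu_work()
                over_since = None
        else:
            over_since = None
        time.sleep(POLL_SECS)
[/pre]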
Long term, keeping the machine healthy comes down to maintenance and monitoring.
RE: archae86 wrote: In
One for the collective cruncher noun award. I guess that makes Gary's an Armada, and mine a once-a-year two-boat race!
This is kind of a "real-time" RAC.
Actually I hadn't thought about what a "recent host health" statistic should look like, but I'll share what I monitor.
I pick an interval, usually 24 hours, and total all the elapsed time for tasks logged in the job_einstein_*.txt file in the BOINC data directory. Unfortunately this lumps invalids, valids and pendings (yes, I hate invalids) all together. Errors are discarded and not even logged in that file.
If I am running full time, say 5 concurrent tasks, then the total time, if all is well, should be around 5*24 = 120 task-crunching hours (the value is logged in seconds, but that doesn't matter).
So I divide the total time by 5 (tasks) and then by the (24hr) interval length, and that gives me a number which, when healthy, is 0.97-1.03. If it drops off then something is not working properly, e.g. power fails etc.
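In rough Python terms the calculation is something like this, assuming the usual BOINC job-log layout of a Unix timestamp followed by label/value pairs with 'et' holding the elapsed time in seconds; the file path and the concurrent-task count below are examples only:
[pre]
# Elapsed-time "health ratio" over the last 24 hours from a BOINC job log.
import time

JOB_LOG = "/var/lib/boinc-client/job_log_einstein.phys.uwm.edu.txt"  # example path
CONCURRENT_TASKS = 5
INTERVAL_SECS = 24 * 3600

def health_ratio(path=JOB_LOG):
    cutoff = time.time() - INTERVAL_SECS
    total_elapsed = 0.0
    with open(path) as log:
        for line in log:
            fields = line.split()
            if not fields or float(fields[0]) < cutoff:
                continue                                   # outside the 24h window
            pairs = dict(zip(fields[1::2], fields[2::2]))  # ue/ct/fe/nm/et pairs
            total_elapsed += float(pairs.get("et", 0))
    # Healthy values sit close to 1.0 (about 0.97-1.03 in my experience).
    return total_elapsed / (CONCURRENT_TASKS * INTERVAL_SECS)

print(f"health ratio over the last 24h: {health_ratio():.2f}")
[/pre]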
I also count the number of tasks per day by app and know that it should be about the same number each day. Of course invalids are unhealthy and don't get spotted by any of this...
As another general health stat you could use the number of valid tasks since the last error and since the last invalid task, I guess.
I don't know how BOINC calculates a "trustworthy host", but it must do something. I'll report back if I find out.
RE: Here are links for some
If you have a look at your referenced results, you'll see that some have had wingmen successfully complete the task, and your tasks were changed to "Completed and validated".
To hit some other points, I use BOINCtasks' History tab to identify hosts that have trouble with failing tasks, and TThrottle on hosts that may be susceptible to potential heat damage due to inadequate cooling, such as laptops.
Click Here to see My Detailed BOINC Stats