Select this app:
"Gamma-ray pulsar binary search #1 (GPU)"
If you don't want to run CPU tasks, uncheck the CPU option as well as this option:
"Run CPU versions of applications for which GPU versions are available:"
Some CPU is needed for E@H GPU apps. At 90% GPU load drops and calculations are run on the CPU.
Greetings. I run GPU-only work: Einstein on my AMD 390X Gaming and Milkyway on my Nvidia 1060. I have set my GPU to 0.25 in the Einstein project settings, so it runs four tasks at once on the AMD and one on the Nvidia. I have noticed that on the AMD my work has a substantial number of invalids. What could I change to reduce the invalids and still keep GPU utilization high?
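P_2 wrote: I have noticed on the AMD that my work product has a substantial amount of invalids. ... What could I change to reduce the invalids and still have a high utilization of GPU's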
You could set it to run only 2 or 1 tasks at a time and see if that brings down the number of invalids. As you are running Windows on that host, you could also easily monitor the temperatures of those GPUs (with GPU-Z or similar software). That could give some hints about how the cards are handling the load.
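For reference, here is a minimal sketch of how the GPU utilization factor maps to the number of tasks running at once. It assumes the client simply packs tasks onto a GPU until their fractions add up to at most 1; the function name `concurrent_tasks` is made up for this illustration and is not part of BOINC.

```python
import math


def concurrent_tasks(gpu_usage_per_task: float) -> int:
    """Tasks that fit on one GPU if each task claims this fraction of it.

    Assumption: tasks are packed until the fractions sum to at most 1.0.
    """
    if not 0 < gpu_usage_per_task <= 1:
        raise ValueError("GPU usage per task must be in (0, 1]")
    # The small epsilon guards against floating-point rounding in the division.
    return math.floor(1.0 / gpu_usage_per_task + 1e-9)


# Settings mentioned in this thread.
for factor in (0.25, 0.33, 0.5, 1.0):
    print(f"GPU utilization factor {factor}: {concurrent_tasks(factor)} task(s) at once")
```

So going from 0.25 to 0.5 or 1.0 is how you would get down to 2 tasks or 1 task per GPU.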
mmonnin wrote: ... At 90% GPU load drops and calculations are run on the CPU.
The calculations are performed in two stages which take approximately 90% and 10% of the total time respectively. The initial stage (performed in single precision) identifies potential candidate signals in the data. The ten most important candidates are then subjected to a re-evaluation in the 'followup' stage. This is performed in double precision. It is also performed on the GPU (not the CPU) if the GPU has double precision capability.
Most of the GPUs that are able to crunch Einstein tasks would likely have some level of double precision capability, so I imagine that performing the followup stage on the CPU would be a relatively rare event. There are big tables in Wikipedia for both NVIDIA and AMD where you can check the double precision capability of any particular GPU.
Cheers,
Gary.
Thanks for your reply. I have my preferences setting for the GPUs at .33, which runs three tasks (.33 x 3 = .99 of GPU capacity). My temps are 167 °F for the AMD and 126 °F for the CPU. These two (7 virtual CPU cores and .99 of the AMD GPU) run Einstein exclusively. The 1060 is set to 1.0, runs at 132.6 °F, uses .667 of a CPU, and runs Milkyway exclusively, using less than 5% of the GPU's VRAM. Overall, the temps for the rest of the system (MB, M.2 SSDs) are never above 120 °F. So all my components are within operating range and, in my view, are not accounting for the invalids. But a good question.
I have used this setup successfully before, running both GPUs at much higher VRAM usage and temps for extended periods in other applications.
What is a reasonable/acceptable % of invalids that the system expects across hundreds of setups? Maybe the admin could post this, or can I see that % by looking up other users' account info? I haven't tried that, but maybe I'm clicking along just fine. I'm nearly at 15 million for the project.
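Since later posts in this thread quote temperatures in Celsius, here is a small sketch that converts the Fahrenheit readings above so the two sets of numbers can be compared. The labels in the dictionary are just names chosen for this example.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


# Readings quoted in the post above (degrees Fahrenheit).
readings_f = {
    "AMD 390X": 167.0,
    "CPU": 126.0,
    "Nvidia 1060": 132.6,
    "MB / M.2 SSDs (upper bound)": 120.0,
}

for name, deg_f in readings_f.items():
    print(f"{name}: {deg_f} F = {fahrenheit_to_celsius(deg_f):.1f} C")
```

That puts the 390X at 75.0 °C and the 1060 at about 55.9 °C at the time of this post.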
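P_2 wrote: What is a reasonable/acceptable % of invalids that the system expects across 100's of setups. ... I haven't tried that but maybe I'm clicking along just fine.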
Something on the order of 0.5% invalid, with appreciable fluctuation, seems to be normal behavior here. I don't think anyone really understands why.
Your system is not "clicking along just fine". Reviewing the Status column at:
https://einsteinathome.org/host/12621048/tasks/5/0?sort=desc&order=Sent
We see many entries of "Completed, marked as invalid". That is the usual notation when your returned work unit was compared with a quorum partner and the two differed too much to be accepted, so a tie-breaker unit was sent to a third host, and that result agreed better with your original partner's than with yours.
But in your case we also see many entries marked "Validate error". In this case your returned work flunked a sanity check which is performed after the quorum is ready for checking, but before comparison between units.
On healthy systems, the ratio of Invalid results to Valid results showing on the tasks page runs a few tenths of a percent, and none of the invalid results are "validate error". On your system, in a snapshot taken as I type, the ratio of invalid results to valid results is 21%. Worse yet, I count 31 "validate error" units since the first of February (and some in that period will already have disappeared from the log because the WU has been cleared by other successful returns).
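To make the comparison concrete, here is a rough sketch of the arithmetic. The counts are hypothetical, picked only to reproduce the 21% figure, and the 1% cut-off is an assumption standing in for "a few tenths of a percent with some fluctuation"; it is not an official project threshold.

```python
def invalid_ratio(valid: int, invalid: int) -> float:
    """Ratio of invalid results to valid results, as read off the tasks page."""
    if valid <= 0:
        raise ValueError("need at least one valid result")
    return invalid / valid


def looks_healthy(valid: int, invalid: int, validate_errors: int,
                  max_ratio: float = 0.01) -> bool:
    """Rule of thumb: no validate errors and an invalid ratio under max_ratio.

    max_ratio is an assumed cut-off, not a value published by the project.
    """
    return validate_errors == 0 and invalid_ratio(valid, invalid) <= max_ratio


# Hypothetical counts: 21 invalid per 100 valid, with 31 validate errors.
print(f"{invalid_ratio(100, 21):.1%}")
print(looks_healthy(valid=100, invalid=21, validate_errors=31))
# A hypothetical healthy host: 5 invalid per 1000 valid, no validate errors.
print(looks_healthy(valid=1000, invalid=5, validate_errors=0))
```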
Your system is not healthy. The easiest way to get that result is to have the GPU running faster than it can successfully process the work under existing conditions, where "existing conditions" includes the specific application being run, the data presented to that application, the system temperature, the health of the GPU's own cooling provision, settings for fan speeds, settings for clock speeds, adequacy and good behavior of the system power supply, etc.
My personal suggestion is that you employ MSI Afterburner, or some other overclocking tool of your choice, to reduce your core clock and memory clock rates by 10% and monitor results for a day. If you see a drastic reduction in the rate of production of both types of invalid results, then you have confirmation that you are in fact running faster than you can, and you can work on the details.
Mind you--I suggest reducing the actual clock rates by 10%, not the overclock (if any). My advice applies even if you are already underclocking (though I doubt that).
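A quick sketch of that distinction, with made-up clock values: the 1050 MHz core, the +50 MHz offset, and the 1500 MHz memory clock below are placeholders, so read your card's actual running clocks from your monitoring tool before doing the sum.

```python
def reduced_clock(actual_mhz: float, reduction: float = 0.10) -> float:
    """Target clock after cutting the actual running clock by `reduction`."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return actual_mhz * (1.0 - reduction)


# Hypothetical example values, not measurements from the host in this thread.
stock_core_mhz = 1050.0
overclock_offset_mhz = 50.0
actual_core_mhz = stock_core_mhz + overclock_offset_mhz
actual_memory_mhz = 1500.0

# The 10% comes off the clock the card actually runs at, not off the offset.
print(f"core target:   {reduced_clock(actual_core_mhz):.0f} MHz")
print(f"memory target: {reduced_clock(actual_memory_mhz):.0f} MHz")
```

With these placeholder numbers the core target lands below stock, which is the point: trimming 10% of a 50 MHz offset would only take off 5 MHz.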
In giving a rough typical invalid rate of 0.5%, I should have specified that I was speaking of Einstein GPU applications. It has been a while since I've been involved in Einstein CPU application work, but I recall that healthy systems produced essentially zero invalid results.
Thank you so much for your extended reply. I have since changed the ratio from .25 to .33, and finally today to .5, and I am considering going to 1.0. I use a Gigabyte-specific underclock, which I was running at about 5%. My temps are now 61 °C and 45 °C for the 390X and 1060 respectively.
I will reduce the load further by going to 1.0 tonight, now that I have reread your post.
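Gary Roberts wrote: mmonnin wrote: ... At 90% GPU load drops and calculations are run on the CPU.
The calculations are performed in two stages which take approximately 90% and 10% of the total time respectively. ...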
Gary,
I ran across an AMD video card with RX 470-class single-precision speed but huge double-precision capacity (much higher than the RX 470). I am guessing that speeding up the last 10% would not be a major time gain? (Currently, all my AMD cards snap from 90% to 100% practically instantly on GR #1 GPU tasks.)
Tom M
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!
Tom M wrote: ... I am guessing that speeding up the last 10% would not be a major time gain?
The comment you quoted was written a long time ago when the app behaved differently to what happens today. If I remember correctly, the followup stage used to take around 20-40 seconds. These days there is hardly any delay at all - as you have noted.
A high double precision capability will have essentially no effect on crunch time.
Cheers,
Gary.
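For anyone curious about the upper bound, here is a sketch using Amdahl's law. It assumes the older 90%/10% time split quoted earlier in the thread; since the followup stage now takes hardly any time, the real gain today would be smaller still. The function name `overall_speedup` is just for this illustration.

```python
def overall_speedup(stage_fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: whole-task speedup when one stage gets faster.

    stage_fraction: share of total run time spent in the accelerated stage.
    stage_speedup:  how many times faster that stage becomes.
    """
    if not 0 <= stage_fraction <= 1:
        raise ValueError("stage_fraction must be in [0, 1]")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - stage_fraction) + stage_fraction / stage_speedup)


# Assumed split: followup (double precision) stage is 10% of total run time.
for s in (2.0, 10.0, float("inf")):
    print(f"followup {s}x faster -> whole task {overall_speedup(0.10, s):.3f}x faster")
```

Even an infinitely fast followup stage gives no more than about 1.11x overall under that old split, which fits the conclusion that double precision capability has essentially no effect on crunch time.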