Are you running the 980 and 970 on the same system? You seem to average a run time of ~3,800 seconds on your validated tasks at 1 task per card. I did a calculation of my validated times, and at 3X WUs per GPU I got a mean run time of 11,936.4 seconds for the Parkes 1.52 app.
Good scaling; it seems I have some tinkering left to do with that memory clock. All in all, I look forward to the improvements in the next iterations of the app, as the performance increases have been very notable. Great job to all involved.
Hi Manuel,
Yup, both are on the same rig :-) My other rig uses 1 x 970 and an old GTX 760.
I use that exclusively for Asteroids@Home and Milkyway@Home.
Anyway, I've always been a tad paranoid about system temps :-) Particularly since I had 2 AMD 9370 CPUs die on me last year, cause undetermined but possibly heat.
I agree the 1.52 app is a big improvement; like you, I'll be happy when it makes stock.
I am now going to see how 3X performs, as archae has already indicated that 3X was more efficient.
It seems RAC-wise I still have a ways to go: his Core i3 + GTX 970 machine gets ~100,000 RAC, so the card by itself must account for some 95% of that. With two cards it should be quite interesting to see where this machine ends up.
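For anyone who wants to try the same multiplicity: I believe the usual route on Einstein@Home is the "GPU utilization factor" project preference, but a client-side app_config.xml does the same job. A rough sketch for 3 tasks per GPU follows; the app name is my assumption and should be checked against what your BOINC client actually reports:

<!-- Sketch only: run 3 Parkes (BRP6) tasks per GPU; the app name is an assumption. -->
<app_config>
  <app>
    <name>einsteinbinary_BRP6</name>
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>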
1. I'll remind you that my previous tuning of the 970 was on Perseus work, probably running on 1.39, which is quite a different beast than 1.52. My guess is that you will see a gain from 3X, but not very much. Let us know. One of these days, I should try 4X, just to see.
2. On a sample of 180 returned WUs on 1.52, my GTX 970 got a mean elapsed time of 10,095 seconds, or 2:48:15. A formal computation of productivity, with (unrealistically) zero provision for downtime for reboots and such, gives 112,975 credits/day from that GPU alone. For reference, on 1.39 the same computation gave 71,459. I do run a couple of Einstein CPU jobs on the host, which in good times should more than make up for the GPU credit lost to short downtimes. As I post this, the actual RAC for that host is 107,398 and climbing. Three cheers to Bikeman and anyone else who was involved.
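For anyone who wants to check the arithmetic behind that figure: it is just tasks per day times credit per task. A minimal sketch, assuming 4400 credits per Parkes (BRP6) task and three tasks running concurrently on the card; both numbers are my assumptions, but they reproduce the quoted 112,975 almost exactly:

# Rough sketch of the credits/day estimate for a single GPU, assuming
# 4400 credits per Parkes (BRP6) task, 3 concurrent tasks and zero downtime.
SECONDS_PER_DAY = 86400

def gpu_credits_per_day(mean_elapsed_s, concurrency=3, credit_per_task=4400):
    tasks_per_day = concurrency * SECONDS_PER_DAY / mean_elapsed_s
    return tasks_per_day * credit_per_task

print(round(gpu_credits_per_day(10095)))  # ~112,975 credits/day on v1.52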
Archae, I did notice that you are running a slightly older driver on your GTX 970 and that the CUDA version is 6050, as opposed to the newer drivers which are CUDA 7. Now, perhaps the Parkes 1.52 app is better suited to 6050 than to 7, because although I know your memory clocks are a bit faster than mine, our shader cores both run at 1427 MHz, so similar performance could be expected. Your tasks are completing roughly 1,800 seconds faster than mine over a large sample size.
Perhaps an interesting observation for others running Maxwell cards.
I doubt my older driver is helping Parkes throughput, though it is certainly possible.
When I was doing my overclocking experiments, the Perseus Arm survey work under the then-current application gave extremely consistent results on my host (which had some affinity and priority tweaks applied to the CPU side by Process Lasso, and a lightening of the CPU task load by preferences). This allowed me to see the throughput results of a given overclock change in a tiny sample size. As I recall, the amount of GPU memory clock overclock I was able to impose had a very substantial effect, while the amount of GPU core clock increase I was able to use had a much smaller effect. Sadly I can't lay my hand on any records I kept, so can't be very specific.
Driver, host characteristics, host settings, GPU memory clock, and GPU core clock are all candidate productivity modifiers, but my estimate is that the GPU memory clock is the tallest pole in the tent in the productivity difference between your host and mine.
I confess, I've only tried once to automate my GPU overclocking on reboot, and so far have failed. Fortunately the machine has been well-behaved, and I have rebooted less than once a month since the beginning of the year, but I need to get a form of reboot overclock automation working one of these days.
As to the first part, I can imagine the GPU memory overclock is the biggest factor here as well. From your mean time you are getting tasks running in ~2 h 50 min, while from mine it's ~3 h 30 min. You had mentioned that the overclock past 3505 MHz on your machine gave you a ~16% decrease in run time, which I think is roughly in line with the ~40 min difference between our machines.
I may look into automating the memory overclock through Nvidia Inspector. I also tried that once, but it didn't work, and I haven't looked into it since, because my system has run stably since I changed to MSI's Afterburner software and increased the card's power target.
I may try some further tweaks after I see how far this current setup takes me; for now I am content with the stability and the performance.
@Automating memory OC: I also tried it, but to no avail. One problem is that no GP-GPU program may be running for this to work. Hence I created a batch file which was supposed to run upon system startup and login (BOINC takes some time to load). So far I'm still setting the OC manually... not very often, luckily.
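For the record, here is the kind of startup script I had in mind; this is an untested sketch, and the Nvidia Inspector path, command-line switch syntax, P-state index and offset values below are all assumptions that would need checking against your own install:

# Untested sketch: apply the memory OC shortly after login, before BOINC
# starts feeding the GPU. Inspector path, switch syntax, P-state index and
# offsets are assumptions -- check them against your own installation.
import subprocess
import time

INSPECTOR = r"C:\Tools\nvidiaInspector\nvidiaInspector.exe"  # hypothetical path

time.sleep(30)  # let Windows settle, but this must still beat BOINC's first GPU task

subprocess.run([
    INSPECTOR,
    "-setMemoryClockOffset:0,2,200",  # GPU 0, P-state entry 2, +200 MHz (assumed syntax)
    "-setPowerTarget:0,80",           # GPU 0, 80% power target (assumed syntax)
], check=True)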
@Performance: if you look at the memory controller utilization while running Einstein BRP tasks, you see very high values, in the range of 70-80%. That's why the memory OC helps this much. At GPU-Grid, performance already starts to be impaired at around 40% load.
@Cliff: run 2 WUs on your GPU and lower the temperature and/or power target. This makes it faster by making more consistent use of the GPU, but drops clock speeds slightly (offsetting some of the performance gain) and - more importantly - automatically lowers the voltage accordingly. This makes your card run more power efficient, saves you some money and reduces the largest chip degradation factor (voltage).
Or to use a car analogy: while transporting people at 150 mph you notice the fuel consumption and engine stress are rather high. Switching back to 1 concurrent WU is like taking fewer people with you on each trip to reduce car weight and thus save fuel. In my example you'd load your car fully but run at 120 mph (or so). Generally car analogies mostly fail.. but I think this one is actually quite good :)
Manuel wrote:
I'm not positive I want to overclock the cards too much as I need them to last me a while.
Same here: it's not the OC that wears out a GPU, it's voltage and temperature. Also power draw could wear out the power delivery, but that's usually not a problem.
In my case I run the GPU core at a +160 MHz offset (it's not superclocked but has a moderate factory OC) but keep the voltage in check via the power target. This actually stresses the card less than stock operation at a lower clock speed and higher voltage & temperature.
@Keith: thanks for the answer, I'll try to keep that utility in mind.
Well, after a week or so of tinkering and of trying different things out, I seem to have come to a good setup for the machine.
Again: I'll detail some of my system specs and then my findings, along with a brief synopsis of the reason for starting this thread.
System Specs:
CPU - Intel Core i5-4690K @ 3.9 GHz (x39 multiplier on all 4 cores)
GPU - 2x EVGA NVIDIA GTX 970 SC (GPU clock 1403/1428 MHz, memory clock 3705 MHz, driver 347.88) - stable configuration
RAM - G.Skill RipJaws 2x4 GB @ 2133 MHz
OS - Win 7 Pro
----Initial Issue----
I noticed that my graphics cards were staying in the P2 power state and thus throttling the GPU memory clock to 3005 MHz instead of running at the stock-rated 3505 MHz. This means that in memory-bound compute applications like E@H, there is a noticeable slowdown in processing times.
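If you want to check whether your own cards are affected, the P-state, memory clock and memory controller load can be read while a task is running. A quick sketch, assuming nvidia-smi is on the PATH and the card of interest is GPU 0:

# Quick check of P-state, memory clock and memory-controller load while a
# Parkes task is running (assumes nvidia-smi is on the PATH, GPU index 0).
import subprocess

out = subprocess.run(
    ["nvidia-smi", "-i", "0",
     "--query-gpu=pstate,clocks.mem,utilization.memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "P2, 3005 MHz, 75 %"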
----Fixes/Observations----
I had to remove the EVGA PrecisionX software and install MSI Afterburner. I also had to install Nvidia Inspector in order to be able to set the memory clocks for the GPUs in the P2 power state. Make sure E@H is not running, then set your memory clock to the desired speed while the card is in the P0 state. Whatever speed you set in the P0 state is the maximum speed you will be able to obtain in the P2 state. For example, if I set my P0 memory speed to 3705 MHz and then try to set my P2 memory clock higher than 3705, it will not work and the card will default to the highest clock set while in the P0 power state.
----Conclusion----
Though it is somewhat of a hassle, it's an interesting issue that seemingly affects only Maxwell cards, and advanced users willing to investigate and adjust their cards properly will see an appreciable decrease in run times for the v1.52 Parkes app. Also, 3X seems to be the most efficient use of the card's power and, along with the tweaks above, should lead to close to the highest attainable RAC for users with these cards. Once again, YMMV according to your system setup and the thermal limits your environment allows.
Good luck to all! I shall keep this thread updated as I tinker or make new observations as the application evolves. Thank you to those who have contributed and helped me thus far.
Side note: I have no information whether the GTX960 / GM206 and Titan X / GM200 are also affected. You don't find "proper memory clock measurements during compute loads" in the usual reviews. So if anyone knows more: I'd be happy to hear a little off-topic chat ;)
I am curious if the Titan X is affected also. I'd be thrilled to find some discussion about why the card designers hamstring the P2 memory speed ONLY for distributed computing and not for gaming. Is it because of power limits? I can't see any validity in that assumption, since the cards are always well short of their max power limits and can use their maximum boost speeds. I've looked around the web and forums, even the CUDA forums, and have not found any discussion of this behavior. I don't know whether the design limitation comes from Nvidia design specs or from individual card manufacturers' decisions. I would sure like to find out why.
Hi Keith,
My tuppence worth: blame Nvidia. I can't see manufacturers crippling their cards in any way, given their propensity to OC their offerings.
And then there is the NV GTX 970 memory misinformation as well; perhaps that's a clue as to why the P2 state is downclocked. My guess is it's done in the NV drivers.
I'm almost positive it has to do with the way the drivers handle CUDA tasks. My other computer was also having instability issues, and it wasn't until I changed the global setting in the Nvidia Control Panel to Prefer Maximum Performance instead of Adaptive that it settled down.
I have now been running my memory offset at +300 MHz under P2, meaning a ~3800 MHz memory clock, and have had no issues with the system rebooting and no invalids or errors in reported work.
The cards can run at full speed and beyond for GPU compute tasks, but it takes a lot of tinkering to get them there. It's a bit of a shame. Ever since Fermi cards offered such excellent all-round performance and Nvidia noticed they were undercutting themselves, they have gone out of their way to handicap consumers' ability to utilize the full capabilities of their GPUs.