Loard Nikon wrote:
re-read your posts about bringing up your GTX 460 hosts.
Thanks, I'm glad they were of use to you.
Quote:
Initially I wanted the shrouded-type cooler so it would eject the heat out of the back of the case, but I have a lot of fans in the case anyway and am now thinking that it wouldn't much matter if I had gone with the blower-type that you mentioned.
My 460 cards are Gigabyte, and although they are much longer, the fan and exhaust arrangements appear similar to the new 660 card I've ordered. There is a port on the back panel, and it catches considerable side-ejection airflow from one of the fans. So more of the card heat exits the back of the case immediately than one might think--though it certainly does share considerable air with the case as a whole.
As the new card is at least slightly lower power than the old, and has similar air handling, I expect it to behave well thermally in my particular setup. But I have appreciable case fan capacity, use an aftermarket cooler and run my CPUs at stock settings, so I am far from worst case on cooling requirements.
I've had my GTX 660 for a bit over a week now, and after initially just bringing it up, have spent some days taking productivity and power consumption data under controlled conditions for a broad range of concurrent CPU and GPU tasks.
My hope was that I'd get a moderate power consumption reduction and a moderate productivity increase compared to my previous 460 installation on the same host. My observations support somewhat better power reduction than I'd hoped, but rather less productivity improvement.
The response of GPU productivity to varying the number of CPU tasks gave me some surprises. Gary Roberts has posted that the Kepler family (in particular his favorite, the GTX 650) doesn't need you to "reserve a core". But for my rig, the truly surprising effect when running a single GPU task was that each added CPU job (going from zero through four) slightly DECREASED the GPU task average execution time. That's right--there was a bonus improvement in GPU productivity to add to the actual CPU productivity. At higher simultaneous GPU task counts this effect goes away, starting at the higher CPU task count end, which gives the overall productivity picture a character best portrayed in a graph.
I chose nearly the simplest possible installation protocol. Without draining the queue, uninstalling the graphics driver, or the like, I just suspended BOINC projects, powered down, swapped the cards and booted. First boot was very slow, and more than one application which talks to the graphics hardware complained, while I was greeted with an announcement that a driver was being installed. I ignored the complaints, thinking much might change by the next boot.
As I was not at the time running the very latest generally approved NVIDIA driver, I next ran that install, rebooted, and finally told BOINC to resume Einstein. By this time things such as SpeedFan and GPU-Z, which had been unhappy, became happy again without my intervention.
One thing about my system which may have made my simple changeover different from some others' is that it does not actually drive the monitor from the add-on graphics card, but rather from the built-in graphics on my Sandy Bridge CPU.
To keep single posts to moderate length, I'll put more details on the system configuration and measurement methods in another post, and post actual data (mostly in graphical form) in a third.
The host which I've changed from a GTX460 to a GTX660 is this one, which I call Stoll7. It runs Windows 7 Home Premium x64, with SP1 installed. The CPU is a 4-core non-HT Sandy Bridge of nominal 3.3 GHz speed running stock (no overclock, RAM fiddling, or such) with 8 GBytes of RAM. The motherboard is small and cheap, an ASRock Z68M/USB3. The Z68 chipset means it is limited to PCI Express 2.0, and thus may not support as much bus traffic as the GTX660, a PCIe 3.0 card, could benefit from. As the build is modest in peripherals (a single HD, single optical drive, single low-power SSD and not much else) and has an efficient though over-capable supply (a Seasonic X650--a gold-rated supply), my system-level power numbers may be on the low side for general-use systems, though higher than attainable in configurations purpose-built for economical crunching. The new graphics card is the GIGABYTE GV-N660OC-2GD, a 2 GByte model with some factory overclocking. For these tests I did not tamper with any of the performance settings. I had hoped for quiet fans, and never heard them above my chassis and CPU fans.
During the tests I disabled most of the background programs I usually run, but did continue to operate a copy of Process Lasso, raising the priority of the CPU support tasks which collaborate in the GPU work.
Power measurements come from a Kill-a-Watt EZ model, which, crucially, possesses a reset button enabling time-averaged readings without unplugging the meter. I recorded an "eyeball average" of the real-time watts number, and also the $/month estimated from measurement throughout a given configuration, and the kWh and hh:mm for that same period. While for my situation the kWh plus time give me the best resolution over multi-hour periods, the cost number reached useful resolution much more quickly, and allowed me to watch for any systematic change suggesting an inadvertent configuration change. My actual system power across the configurations of interest ranged from about 140 to 210 watts (the monitor is not included), and the resolution of the power measurements in the data I'm actually reporting was typically in the half-percent to one-percent range.
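Since the meter only gives accumulated kWh plus an hh:mm elapsed readout, turning those into an average wattage takes a line of arithmetic. Here is a minimal Python sketch of the calculation, with made-up readings for illustration (the 0.01 kWh display step is my assumption about the meter's resolution, not a spec I've verified):

    def average_watts(kwh, hours, minutes):
        # Average power = energy / time; 1 kWh over 1 h is 1000 W.
        return kwh * 1000.0 / (hours + minutes / 60.0)

    # Illustrative readings: 1.87 kWh accumulated over 10 h 42 m
    print(round(average_watts(1.87, 10, 42), 1))  # about 174.8 W

    # With a 0.01 kWh display step, a ten-hour run resolves power to
    # roughly 1 W--consistent with the half to one percent resolution
    # quoted above.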
I should perhaps mention that the Kill-a-Watt actually does independent measurement of the current and voltage waveforms and "does the math", so that, for example, it provides valid wattage for loads with power factors far from one, such as CFL lights and most other things with a transformer in them. As my power supply is power-factor corrected, the power factor aspect of this was not important, but my utility-provided wall-socket voltage ranged from at least 119.2 to 122.5 volts during this test, so actually measuring voltage was important to avoid degrading the wattage accuracy.
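For anyone curious why sampling both waveforms matters: a meter that simply multiplied RMS volts by RMS amps would report apparent power (VA), which overstates real watts whenever the power factor is below one. A toy sketch of the distinction with synthetic waveforms (this is just the textbook math, not a claim about the Kill-a-Watt's internals):

    import math

    def rms(samples):
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    # Synthetic sine waveforms, current lagging voltage by 45 degrees
    n = 1000
    volts = [170.0 * math.sin(2 * math.pi * t / n) for t in range(n)]
    amps = [2.0 * math.sin(2 * math.pi * t / n - math.pi / 4) for t in range(n)]

    real_power = sum(v * a for v, a in zip(volts, amps)) / n  # true watts
    apparent_power = rms(volts) * rms(amps)                   # naive V times A
    print(real_power, apparent_power, real_power / apparent_power)
    # about 120 W, 170 VA, power factor 0.71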
For these tests, the GPU tasks were exclusively BRP of the Arecibo series, "Binary Radio Pulsar Search (Arecibo) v1.33 (BRP4cuda32nv301)", and the CPU tasks were exclusively GW of the "Gravitational Wave S6 LineVeto search (extended) v1.04 (SSE2)" type. I believe that during the time period of this testing the "work content" of these jobs was very, very consistent, with any variation far below the noise arising from my measurement resolution limitations. At low simultaneous GPU and CPU job counts the elapsed times in a configuration were extremely closely clustered. I generally took at least two batches of CPU jobs to completion, with as many GPU jobs as that allowed, and did proper averaging in a spreadsheet to get my elapsed times. I generally offset the start times of jobs I used in the performance calculations by something like three minutes, and only started power measurement when a stable configuration was running, and stopped measurement before disturbing the configuration.
My productivity calculations are in terms of cobblestones/day, using a credit of 500 per GPU job and 250 per CPU job, with actual elapsed time as reported by my copy of Fred's BOINCTasks. It is possible that there is a systematic error in my productivity estimates, but I believe the configuration comparisons are valid.
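To make the estimation method concrete, here is the cobblestones/day arithmetic as a small Python sketch of the procedure described above (the elapsed times in the example are illustrative placeholders, not measured points):

    def cobblestones_per_day(n_gpu, gpu_elapsed_s, n_cpu, cpu_elapsed_s):
        # In steady state, N simultaneous tasks complete N tasks per
        # average elapsed time; credits are 500/GPU job, 250/CPU job.
        day = 86400.0
        gpu_rate = n_gpu * 500.0 * day / gpu_elapsed_s if n_gpu else 0.0
        cpu_rate = n_cpu * 250.0 * day / cpu_elapsed_s if n_cpu else 0.0
        return gpu_rate + cpu_rate

    # Illustrative: 3 GPU tasks at 5,100 s each, 4 CPU tasks at 30,000 s each
    print(round(cobblestones_per_day(3, 5100.0, 4, 30000.0)))  # about 28292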
Are you running 2 tasks at once? I haven't used any of the GTX6xx (mostly GTX460's here) but am thinking of upgrading so am interested in your observations.
Also, looking at the task times for your host (roughly 3,600 s to 5,100 s), are you reserving a CPU core to feed the GPU?
Herewith the promised graphical representation of results.
RAC vs. number of GPU tasks and number of CPU tasks
System power consumption vs. number of GPU tasks and number of CPU tasks
Power Efficiency vs. number of GPU tasks and number of CPU tasks
And lastly, two sort-of "state-space" plots, showing the available points in a power vs. RAC plot (so the points for both plots are identical, but the labels differ):
Grouped first by number of GPU tasks
Then grouped by number of CPU tasks
In the early stages I am not sure my configuration control and measurement logging were quite as good as later--and am in consequence re-taking some points. I'll re-capture the graphs and update the Photobucket posts, so the images in my immediately previous post, tagged 28 Mar 2013 21:26:29 UTC, should update if refreshed. I am currently re-doing the measurement of the point for 4 GPU tasks/3 CPU tasks, and intend to redo the 1/2 and 2/2 points as well.
I also intend to have a try at 5(!) simultaneous GPU tasks, though I don't expect that to be a desired configuration.
I'd be very happy to answer configuration and measurement method questions, as making these data useful or interesting to people here is my primary intention. I'll consider suggestions for additional measurements.
Neil Newell wrote:
Are you running 2 tasks at once? I haven't used any of the GTX6xx (mostly GTX460's here) but am thinking of upgrading so am interested in your observations.
You got your question in before I got my graphs posted. As I hope you will see in them, I have so far run configurations with 1, 2, 3, and 4 GPU tasks running simultaneously. As it is a 2 GByte card, and is not providing monitor graphics, the card probably won't run out of RAM until well after the increase in task count is counterproductive. So far it appears that the turnover point varies with CPU job count--with at least a slight improvement at zero CPU jobs even going from three to four GPU tasks, while at high CPU job counts the total system productivity clusters tightly once the GPU count is increased beyond one, but three seems the overall favorite on my configuration.
Quote:
Also, looking at the task times for your host (roughly 3,600 s to 5,100 s), are you reserving a CPU core to feed the GPU?
I have not done any task affinity work here. Process Lasso is capable of it, but to keep the comparison simple I have not used that feature during these tests. If by "reserving a CPU core" you actually mean running fewer CPU jobs than the number of CPU cores available, then, as you can see, I have run tests across the full range.
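For anyone wanting to try the same sweep, the standard BOINC mechanism for running several GPU tasks at once is an app_config.xml in the Einstein@Home project directory (this needs a reasonably recent client, 7.0.40 or later I believe). A sketch for three-at-a-time follows; the app name here is a guess from memory, so check it against your own client_state.xml before trusting it:

    <app_config>
       <app>
          <name>einsteinbinary_BRP4</name>
          <gpu_versions>
             <gpu_usage>0.33</gpu_usage>
             <cpu_usage>0.2</cpu_usage>
          </gpu_versions>
       </app>
    </app_config>

A gpu_usage of 0.33 lets three tasks share the GPU; 0.5 would give two, and 0.2 five.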
Quote:
I also intend to have a try at 5(!) simultaneous GPU tasks, though I don't expect that to be a desired configuration.
I'd be very happy to answer configuration and measurement method questions, as making these data useful or interesting to people here is my primary intention. I'll consider suggestions for additional measurements.
Very interesting results, thanks for putting them up and I'm tempted to order a power meter now!
What percentage of the power do you think is used by the GPU?
I would be interested to know if the delta in power use for 0->1 GPU tasks is the same as 1->2 tasks.
Also do you have some earlier GTX 460 figures to put on any of the graphs to show the difference in efficiency?
AgentB wrote:
Very interesting results, thanks for putting them up and I'm tempted to order a power meter now!
What percentage of the power do you think is used by the GPU?
I would be interested to know if the delta in power use for 0->1 GPU tasks is the same as 1->2 tasks.
Also do you have some earlier GTX 460 figures to put on any of the graphs to show the difference in efficiency?
I currently advocate the Kill-a-Watt meter specifically, and for this purpose, sadly, the more expensive variant with the reset button--the "EZ" mark, model number P4460. Good news--it is currently much cheaper at Amazon than I remember. (Mods--if this is too much like a commercial message, I apologize abjectly and submit meekly to whatever corrective measure you deem fit.)
I have halfway decent idle power measurements for the box as it was with the GTX460 on board (73 watts), and a less carefully taken figure with the new GTX660 on board (roughly 50 watts). While I think the 23 watt difference is pretty safely in graphics card idle, I have no in-house method to estimate how much of what is left is in the graphics card. I also don't know how much of the approximately 95 watt increment from zero to one GPU job is in the graphics card, how much in the CPU, and how much in the chipset and memory subsystems supporting the greater level of system activity caused by the graphics card. Oh, plus likely a bit greater loss in the power supply as well. I'll hazard a guess that the great majority is in the graphics card proper (guesstimate 80 to 90 watts), as my power supply is already in a rather efficient region (and actually moving into a more efficient one for that comparison), I think the CPU usage rather modest, and I doubt the system increment is more than a very few watts.
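For what it's worth, here is the back-of-envelope behind that guess, with my assumptions labeled (the 90% supply efficiency is an assumed figure typical of a gold-rated unit in this load range, not a measurement, and the few-watt system overhead is likewise a guess):

    wall_delta = 95.0      # measured wall increase, zero -> one GPU job, watts
    psu_efficiency = 0.90  # ASSUMED for a gold-rated supply at this load
    dc_delta = wall_delta * psu_efficiency  # ~85.5 W reaching the components
    system_overhead = 3.0  # ASSUMED CPU/chipset/RAM share, "a very few watts"
    print(round(dc_delta - system_overhead))  # about 82 W attributable to the card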
I took lots of GTX460 data in the past across a range of GPU and CPU task counts--but sadly that big data set is not useful here for direct comparison, as the current application differs materially. I do have decent data for the operating point I last used on the 460, which was obviously with the current applications. I was running 2 GPU and 3 CPU jobs, was burning 225 watts, and believe by the same estimation method the productivity to have been about 40,314 cobblestones/day, so a system power efficiency at that operating point of 179 RAC/watt. If I choose to match output and take as much power improvement as is available, this suggests I can run 3 GPU/1 CPU at 41,087 RAC and 175 watts, for a savings of 50 watts with a very slightly higher RAC. Of course, if I fail to restrain myself, and climb to higher output levels with rapidly diminishing power efficiency, the power advantage drops much faster than the output advantage climbs. This comparison also fails to capture that the previous box, too, doubtless offered higher power efficiency when running fewer CPU jobs than my recent level (in fact I was running it at 3+1 until around Christmas, when my winter liking for more power dissipation coincided with the "Christmas Gift" of exceptionally high CPU credits until a revised application was calibrated here at Einstein).
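For anyone wanting to check my arithmetic, the side-by-side efficiency comparison is just credit rate over wall power (all four figures come from the measurements quoted above):

    points = {
        "GTX460, 2 GPU + 3 CPU": (40314.0, 225.0),
        "GTX660, 3 GPU + 1 CPU": (41087.0, 175.0),
    }
    for label, (rac, watts) in points.items():
        print(label, round(rac / watts), "RAC/watt")
    # GTX460 point: 179 RAC/watt; GTX660 point: 235 RAC/watt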
I've extended my measurements of GTX660 performance on the same 4-core Sandy Bridge system during the last week. While it was my original intention to post updated graphs to Photobucket under the original file names, and suggest that people wishing to see extended and corrected data just refresh their browsers, I learned I had not turned off a Photobucket feature which mangles the file names.
The main extension since my previous post was to take data with 5 GPU tasks running. I also re-ran measurements for the points which looked most likely to be wrong based on kinks in curves and the like. Most of the re-measurements gave imperceptibly different results from the first trial. My impression is that the remaining kinks in this second set of graphs are real.
Here is the graph of perhaps greatest interest for pure performance reasons:
Gary Roberts's observation on the GTX650, and its generalization to the Kepler class, that one does not improve total output by "reserving a core" (as people are wont to term running fewer CPU tasks than the number of cores) is borne out here: the highest RAC point is with three GPU tasks and with all four cores running CPU jobs. Focusing on the left axis, one sees that each added parallel GPU task gets more output, even going from 4 to 5, in the case that no CPU jobs are running. However, for CPU job counts greater than zero, three GPU jobs is usually best for total output at a given number of CPU jobs.
This leads to the observation that, from a pure power efficiency point of view, adding CPU jobs to a given number of GPU jobs in nearly every case lowers power efficiency.
I'll post some more graphs regarding power tradeoffs in an additional post, along with comments and some corrections to errors I made in earlier comments.