Nvidia Pascal and AMD Polaris, starting with GTX 1080/1070, and the AMD 480

archae86
archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225641597
RAC: 1050827

Here I update my initial

Here I update my initial report on 2X running at stock clocks with 3X running, reported in a second column. Under these conditions the calculated daily credit productivity of the GTX 1070 running Einstein BRP6/CUDA55 tasks is 166,615, out of a system total of 170,155.
[pre]     2X      3X
      2       3  Number of BRP6 GPU tasks at once
      1       1  Number of 1.04 G Wave F tasks running at once
1:18:40 1:54:05  Average elapsed time for GPU tasks
6:47:47 6:46:44  Average elapsed time for CPU tasks
161,085 166,615  Daily credit rate, GPU tasks
  3,531   3,540  Daily credit rate, CPU tasks
164,616 170,155  System daily credit rate
  183.1   185.6  System power draw at the wall (watts)
   1862  1860.5  Average core clock rate (MHz)
 1901.2  1901.2  Memory clock rate (MHz, did not vary)
   64.1    64.4  Average GPU temperature (degrees C)
     68      68  Average fan speed percentage
     90      94  Average GPU load percentage
     84      87  Average memory controller load percentage
   70.8    75.4  Card power consumption, average percentage of TDP
    118     121  Approximate incremental watts attributable to Einstein work
  1,394   1,410  Credit/day per incremental watt
[/pre]
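For readers wanting to check the table's arithmetic: the GPU credit rates follow directly from the average elapsed times and the multiplicity. A minimal sketch in Python, assuming 4,400 credits per BRP6 task (a value inferred from the figures, not stated in the post):

```python
# Daily GPU credit rate from average elapsed time and multiplicity.
# Assumes 4,400 credits per BRP6 task (inferred from the table's
# numbers, not stated explicitly in the post).
CREDITS_PER_TASK = 4400

def daily_credit_rate(elapsed_hms, tasks_at_once):
    """Credits/day = (tasks completed per day) * (credits per task)."""
    h, m, s = (int(x) for x in elapsed_hms.split(":"))
    seconds = h * 3600 + m * 60 + s
    tasks_per_day = tasks_at_once * 86400 / seconds
    return tasks_per_day * CREDITS_PER_TASK

print(round(daily_credit_rate("1:18:40", 2)))  # 161085, the 2X column
print(round(daily_credit_rate("1:54:05", 3)))  # 166615, the 3X column
```

Both values reproduce the table to the nearest credit, which supports the assumed per-task award.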

Unlike the 2X case, where the elapsed times quickly reached a highly repeatable narrow range of values, the 3X elapsed times included a significant number of long outliers. On too short a sample one might be tempted to discard these outliers from the average as anomalous. That would be an error. I have about 18 hours of data here, and despite the wide distribution, the average has not varied much in the last half, so I think my number is pretty accurate.

I suspect that with a bit of manipulation of priorities and affinities using Process Lasso I might be able to get much more consistent timing behavior. Sometimes when I have done that in the past I have gotten a nice little productivity gain, sometimes not. For the moment I shall not pursue it, as I think this has more to do with the Einstein CPU support application instances interacting with the Windows 10 scheduler than it has anything specific to do with the GTX 1070.

I moved on to 4X about an hour ago, so the units currently in progress have seen some 3X and some 4X time.

I still have not taken a careful idle power measurement in this configuration. When I do, the last two numbers, which refer to "incremental watts", will change slightly.

As some thoroughly inappropriate power comparisons are being made, let me point out that the incremental power reported in watts here is:
1) not quite right because I've not gotten around to a careful measurement of system idle power, but more importantly
2) includes all added system power consumption to support the GPU running Einstein work, including the CPU support task, system memory power, system I/O power, and the power to run one Einstein Gravitational Wave CPU task (which is not included in system idle power).

This power number is not remotely suitable for comparison with GPU power reported by monitoring applications, let alone that guessed at from published TDP numbers.

Regarding the averages shown above for parameters such as GPU loading and fan speed: I rely on the averaging displayed by a copy of GPU-Z 0.8.9, which I restart once the new operating condition is stably established, at which point I also stop browsing and other substantial interactive activity. For system power consumption I use an Ensupra meter at the wall socket, which reports n.nnn kWh and hh:mm elapsed time; for suitably long running times, simple arithmetic then gives a high-resolution average wattage.
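The wall-meter arithmetic described above is just energy divided by time. A short sketch (the kWh and elapsed-time readings below are hypothetical, for illustration only):

```python
def average_watts(kwh, hours, minutes):
    """Average power in watts from a meter's kWh and hh:mm readings."""
    elapsed_hours = hours + minutes / 60
    return kwh * 1000 / elapsed_hours

# Hypothetical reading: 3.335 kWh accumulated over 18:00 elapsed
print(round(average_watts(3.335, 18, 0), 1))  # 185.3 W
```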

Janus
Janus
Joined: 10 Nov 04
Posts: 27
Credit: 23862534
RAC: 86

Just got a GTX1080 Gaming X

Just got a GTX 1080 Gaming X from MSI, and like your 1070 it is surprisingly cold (around 48°C when running 4X).
Are we not pushing these cards hard enough with the CUDA55 tasks?

Anonymous

RE: Just got a GTX1080

Quote:
Just got a GTX 1080 Gaming X from MSI, and like your 1070 it is surprisingly cold (around 48°C when running 4X).
Are we not pushing these cards hard enough with the CUDA55 tasks?

What is your ambient room temp?

I don't see any machines in your list of computers that are crunching GPU WUs, at least not on E@H.

Manuel Palacios
Manuel Palacios
Joined: 18 Jan 05
Posts: 40
Credit: 224259334
RAC: 0

RE: Here I update my

Quote:

Here I update my initial report on 2X running at stock clocks with 3X running, reported in a second column. At this condition the calculated daily credit productivity of the GTX 1070 running Einstein BRP6/CUDA55 task is 166,615, out of a system total of 170,155.

So the 1070 is ~30% faster than a 970... assuming you can grab an EVGA SC model at ~$420, compared to the price-dropped ~$290 970, it's not as impressive a jump in productivity for the price you have to pay. It would be highly favorable if one could grab it at, say, ~$379, which is the base MSRP. 30% more performance for 30% more dollars is not a bad upgrade from a 970 to a 1070, but it's not mind-blowing either. I shall await your times at 4X concurrency to do some quick maths and see whether I pull the trigger on the 1070, or perhaps add another 970 and call it a day until later in the year.

Thanks again for your posts, archae86; I'm sure many people find them very helpful. I wonder if any project admins can speak to the application development side, and whether there is any "hidden" potential to be unlocked from the Pascal architecture. The CUDA55 jump seemed quite beneficial across the board and definitely improved my output on these Maxwells.

Janus
Janus
Joined: 10 Nov 04
Posts: 27
Credit: 23862534
RAC: 86

@robl: Ambient is around

@robl: Ambient is around 21-28°C; it fluctuates a bit. I'll add the machine to the list - it is this one.
Careful about the WU timings, though - it also has an old 560 Ti in it.

archae86
archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225641597
RAC: 1050827

Here I update my previous

Here I update my previous report on 2X and 3X running at stock clocks with 4X running, reported in a third column. Under these conditions the calculated daily credit productivity of the GTX 1070 running Einstein BRP6/CUDA55 tasks is 169,790, out of a system total of 173,310.
[pre]     2X      3X      4X
      2       3       4  Number of BRP6 GPU tasks at once
      1       1       1  Number of 1.04 G Wave F tasks running at once
1:18:40 1:54:05 2:29:16  Average elapsed time for GPU tasks
6:47:47 6:46:44 6:49:05  Average elapsed time for CPU tasks
161,085 166,615 169,790  Daily credit rate, GPU tasks
  3,531   3,540   3,520  Daily credit rate, CPU tasks
164,616 170,155 173,310  System daily credit rate
  183.1   185.6   187.0  System power draw at the wall (watts)
   1862  1860.5  1860.0  Average core clock rate (MHz)
 1901.2  1901.2  1901.2  Memory clock rate (MHz, did not vary)
   64.1    64.4    65.2  Average GPU temperature (degrees C)
     68      68      69  Average fan speed percentage
     90      94      96  Average GPU load percentage
     84      87      89  Average memory controller load percentage
   70.8    75.4    73.4  Card power consumption, average percentage of TDP
 120.26   122.3   124.1  Incremental watts attributable to Einstein work
  1,369   1,386   1,396  Credit/day per incremental watt
[/pre]
Again at 4X, as with 3X and unlike 2X, the distribution of elapsed times for GPU tasks was somewhat broadened. Here I report a 22-hour sample, which I think gets the average about right.

I finally did take a careful idle power measurement, which came in at 62.9 watts, so the last two lines in the table have been adjusted in all three columns. As a side note, my first "serious" Einstein-crunching GPU was a GTX 460, which added about 50 watts to system idle power. Nvidia has made great strides (doubtless with big help from TSMC processes) in reducing the system idle power contribution of their cards.
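With the 62.9 W idle measurement in hand, the last two table lines are straightforward arithmetic; a sketch reproducing the 4X column:

```python
IDLE_WATTS = 62.9  # measured system idle power, per the post

def incremental_watts(wall_watts):
    """Added wall power attributable to the Einstein work."""
    return wall_watts - IDLE_WATTS

def credit_per_incremental_watt(daily_credit, wall_watts):
    """Daily credit divided by the incremental power it costs."""
    return daily_credit / incremental_watts(wall_watts)

print(round(incremental_watts(187.0), 1))  # 124.1, the 4X table value
# 173,310 credit/day over 124.1 W gives ~1396.5 credit/day per watt,
# matching the table's 1,396 within rounding
print(credit_per_incremental_watt(173310, 187.0))
```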

I've moved on to the overclocking matter. I stumbled about a bit in Nvidia Inspector this morning, but found my way to a condition at which the system is currently running:
1) 2X multiplicity
2) core clock not altered from stock (currently running 1860)
3) memory clock specified as +500 on the P0 entry tab and 4000 on the P2 entry tab, which to my surprise gave an actual P2 value while running Einstein of 2151.6 as reported by GPU-Z, and 4303 as reported by NVI.

The first four WUs running at this condition are completions of ones started at 4X on stock clocks. So far I've seen no downclock or other form of error. If nothing obviously bad happens I intend to leave it at this condition for about a day to form an overclocking baseline. I suspect this operating point will actually work and will beat my 4X stock-clock condition for productivity, and I intend to press upward in both memory and core clock looking for the ceiling.

I recognize that the progression of productivity suggests there may be yet more gain above 4X, but I currently consider the memory overclock matter more interesting, and for testing that I prefer a smaller multiplicity, as it gives answers more quickly.

Anonymous

RE: @robl: Ambient is

Quote:
@robl: Ambient is around 21-28C, fluctuates a bit. I'll add the machine to the list - it is this one.
Careful about the WU timings, though - it also has an old 560ti in it

I am thinking a temp of 48°C while crunching 4 WUs is excellent. Your ambient temps approximate mine (26°C), so it would be a good fit in that regard. I am waiting on the 480 results to see how it performs.

archae86
archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225641597
RAC: 1050827

The result of 22 hours of

My first try at a memory overclock on my MSI GTX 1070 FE, run for 22 hours, was entirely successful. This was a run at 2X multiplicity, and compared to the stock-clock 2X run I saw a reduction in average GPU task elapsed time from 1:18:40 to 1:10:57, with the corresponding (computed) daily GPU productivity improving from 161,085 to 178,605.

As the memory clock rate change from stock (1901.2 to 2151.6) was a little under +13.2%, I'm rather surprised that the productivity improvement came in at just under 10.9%. We all think Einstein BRP6 is strongly memory-clock-rate sensitive, but this is stronger than I imagined.
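The two percentages can be checked directly from the numbers quoted:

```python
def percent_gain(before, after):
    """Relative improvement, in percent."""
    return (after / before - 1) * 100

# Memory clock: 1901.2 -> 2151.6 MHz
print(round(percent_gain(1901.2, 2151.6), 1))  # 13.2
# Daily GPU credit: 161,085 -> 178,605
print(round(percent_gain(161085, 178605), 1))  # 10.9
```

A 10.9% throughput gain from a 13.2% memory clock increase implies the workload is close to memory-bandwidth bound at these settings.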

For the near term I intend to inch upward in memory clock rate (assuming I can get NVI to do what I want), looking for the ceiling. Shortly I intend to start a thread on Pascal overclocking, but I want to resolve some of my NVI/Pascal confusions first.

archae86
archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225641597
RAC: 1050827

Some Nvidia Inspector

Some Nvidia Inspector experimentation with my GTX 1070 got me some surprises, and I'm on a new short-term trial.

It appears that for the GTX 1070 with the current driver, P0 state clock offsets, whether set by the sliders in the NVI GUI or by switches on the command line, propagate as offsets to the P2 state.

A slight oddity is that a commanded zero core clock offset actually gives a somewhat higher core clock than default settings--as a speculation this might have to do with bypassing temperature adjustment.

Whatever the truth of all that may be, my short-term trial is of a commanded GPU clock offset of +100 and memory clock offset of +550, giving an initial observed value as reported by GPU-Z of 1961.5/2176.9, running at 2X.

If this appears to run OK, I'll leave it for a day, then try inching up looking for the ceiling later.

Manuel Palacios
Manuel Palacios
Joined: 18 Jan 05
Posts: 40
Credit: 224259334
RAC: 0

RE: Some Nvidia Inspector

Quote:

Some Nvidia Inspector experimentation with my GTX 1070 got me some surprises, and I'm on a new short-term trial.


archae86, if I'm understanding correctly, the Pascal architecture cards are underclocking the memory by 500 MHz in the P2 state. For instance, on 970s the P0 memory clock rate is the rated 7000 MHz, which NVI reports as 3505 MHz. In the P2 state, the 970s run the memory at 3005 MHz by default.

In essence, the Pascal architecture cards are behaving similarly in their memory clock structures to the Maxwell series cards. Also of note, the difference in processing time between 3005 MHz and 3505 MHz on my 970s is on the order of a 10% speed improvement on average. I was able to run the memory at 3805 MHz, a +800 MHz overclock of the P2 state, and it ran quite stably, though admittedly it would produce some errors in computation. At 3705 MHz the card produced no errors, but I reduced my memory overclock to the rated 3505 MHz for longevity and heat reasons after the change to CUDA55, which by itself produced a significant speedup in computation times for the BRP6 app.

Also, can you confirm that the 1070 has 8 Gbps memory, hence NVI reporting 4000 MHz in the P0 state? I would be interested to know how far the memory can be overclocked on the 1070, though don't ruin the memory chips! I figure anything getting close to +300 MHz over the rated clocks is pushing your luck.

Thanks again!

EDIT: P.S.: It would seem from your findings that 4X concurrency on the 1070 produces the optimal credit/day configuration, correct?
