BRP4 Intel GPU app feedback thread
Thought I'd be a good boy and post my intel GPU feedback here instead of in the tech news section :)
Setup: i7-4770K at 4 GHz, Intel HD 4600, ATI 7970 at stock. Link
Running 4xFGRP2 on CPU, 3xBRP5 on the 7970 and 1xBRP4 on the HD4600
Things that changed when I started to run the intel gpu tasks:
1. The number of CPU tasks went down from 5 to 4 (with the "use 75% of 8 cores" setting). Not surprising, since the Intel GPU task says it will take 0.5 CPUs and the 3xBRP5 take another 1.5 CPUs (see the sketch at the end of this post).
2. Run time for the CPU FGRP2 tasks went up from around 51000 secs to around 57000 secs. Fair enough.
3. Run time for the 7970 BRP5 tasks went up from around 7800 secs to around 9200 secs. Say what ?!
Point 3 surprises me. I can't explain why the 7970 tasks suddenly take so much longer. The CPU has plenty of idle time to feed the 7970, according to the Win7 task manager at least, though that might not factor in the Intel GPU load. GPU-Z shows a steady 87% load on the 7970. Any ideas?
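For reference, a minimal sketch of a core-budgeting rule that reproduces the 5 -> 4 behaviour in point 1. This is only roughly how the BOINC client counts cores (the real scheduler is more involved): it keeps starting CPU jobs while the committed core count, including the fractional CPU reservations of the GPU tasks, is still below the allowed 75% of 8 cores.

def cpu_tasks(allowed_cores, gpu_cpu_reservation):
    # Keep starting 1-CPU jobs while the committed total is below the allowance.
    committed, tasks = gpu_cpu_reservation, 0
    while committed < allowed_cores:
        committed += 1.0
        tasks += 1
    return tasks

allowed = 8 * 0.75                        # "use 75% of 8 cores" -> 6
print(cpu_tasks(allowed, 3 * 0.5))        # 3xBRP5 only: 5 CPU tasks
print(cpu_tasks(allowed, 3 * 0.5 + 0.5))  # plus the 0.5-CPU iGPU task: 4 CPU tasks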
RE: Point 3 surprises me.
iGPU's use shared memory, so a lot of the memory bandwidth is used by them.
RE: iGPU's use shared
So what you're saying is that the iGPU app will hog all memory bandwidth and impact everything else running on the machine?
Looking at the resource monitor in Win7 shows no memory activity from the iGPU app. But it might be that the Win7 resource monitor isn't the best tool for looking at memory bandwidth.
I'll experiment some with turning off various tasks when I get the time for it. Thanks for the tip.
Running 1 BRP (Arecibo) 1.34
Running 1 BRP (Arecibo) 1.34 (opencl-intel_gpu) WU at a time. The last task took ~11.21 min on an HD4000 iGPU.
p2030.20130408.G36.17-01.35.N.b2s0g0.00000_78_0 169226400 12 Jul 2013 9:01:20 UTC 12 Jul 2013 10:21:59 UTC Completed, waiting for validation 685.81 24.40 0.26 pending Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130408.G36.17-01.35.N.b1s0g0.00000_3764_0 169226156 12 Jul 2013 9:01:20 UTC 12 Jul 2013 11:01:05 UTC Completed, waiting for validation 681.53 25.08 0.27 pending Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130408.G36.17-01.35.C.b6s0g0.00000_3534_1 169225696 12 Jul 2013 9:01:20 UTC 12 Jul 2013 10:39:28 UTC Completed, waiting for validation 679.93 24.55 0.26 pending Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130408.G36.17-01.35.C.b6s0g0.00000_3506_1 169225640 12 Jul 2013 9:01:20 UTC 12 Jul 2013 11:01:05 UTC Completed, waiting for validation 681.62 25.72 0.28 pending Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
i7-3770K @4.2GHz (75% CPU presently in use, also running 2 GPUGrid WU's on a GTX660Ti and a GTX660 and 6 Albert Gravitational Wave searches).
GPUz:
GPU Core Clock 1350.0MHz
GPU Memory Clock 1066.7MHz
GPU Temperature 80.0°C
GPU Power 13.5W
GPU Load 95%
Memory Usage (dedicated) 9MB
Memory Usage (Dynamic) 391MB
The CPU has 8 MB of L3 cache, so that probably means lots of reading and writing to the DDR3 (which is a lot slower than the GDDR5 on discrete GPUs).
Noticed a 6.6% slowdown in GPU performance (at GPUGrid) as a result of running on the iGPU:
83x3-NOELIA_7MG_RUN1-0-2-RND2598_0 4586331 11 Jul 2013 | 22:58:12 UTC 12 Jul 2013 | 12:05:36 UTC Completed and validated 43,012.44 20,515.30 150,000.00
18x8-NOELIA_7MG_RUN-0-2-RND6573_0 4583986 10 Jul 2013 | 20:22:10 UTC 11 Jul 2013 | 8:06:51 UTC Completed and validated 40,345.62 17,894.67 150,000.00
Quick look at the iGPU performance compared to the CPU and discrete GPU (on same system):
p2030.20130408.G36.05-01.12.C.b0s0g0.00000_1648_0 169135420 11 Jul 2013 21:50:57 UTC 12 Jul 2013 0:01:00 UTC Completed and validated 707.73 27.55 0.30 62.50 Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20121022.G195.17-01.31.N.b3s0g0.00000_2626_1 168311486 29 Jun 2013 23:03:00 UTC 30 Jun 2013 2:48:02 UTC Completed and validated 3,186.85 3,164.54 33.65 62.50 Binary Radio Pulsar Search (Arecibo) v1.33 (BRP4X64)
PA0079_01421_129_1 168285850 29 Jun 2013 13:46:24 UTC 29 Jun 2013 18:59:39 UTC Completed and validated 17,434.88 1,566.39 16.66 5,000.00 Binary Radio Pulsar Search (Perseus Arm Survey) v1.36 (BRP4cuda32nv301)
The Intel HD4000 is about 4.5 times faster than a single CPU thread, or 65% as fast as the CPU using all 8 threads.
Going by credits/day, my mid-range NVidia GTX660 gets ~3.2 times as much credit as the iGPU and twice the credit of 8 CPU threads (@4.2 GHz).
I have seen substantial runtime variation from the Intel HD4000, ~650 to 750 sec/WU. I expect other work is the cause, just as using the iGPU impacts both CPU and discrete GPU performance.
Choosing apps and setups to find the optimal balance between the CPU, iGPU and discrete GPUs might be tricky, especially given the credit discrepancy between this project and other GPU projects, if credit is important to you.
The PCIe controller is on these processors, so PCIe bandwidth might become a problem if the CPU is too busy, and using the iGPU could in theory make it very busy...
When I suspended GPU computation on the NVidias, the iGPU temperature dropped to 72°C and the iGPU power rose slightly from 13.5 to 14 W. This suggests that when the discrete GPUs are running, the iGPU temperature rises and its performance (for the Einstein app) drops, probably due to competition.
iGPU performance dropped slightly, however (by a few %). This can be explained by the fact that more CPU tasks ran (while the two discrete GPUs were running, there had been less CPU work).
With no CPU tasks running and no discrete GPU work, the iGPU temperature plummeted to 46°C and power usage rose further to 14.4W. The GPU load went up from 95 to 96% (too small to conclude anything). Performance of the iGPU increased by ~20%.
As expected, when the iGPU was in use along with the two discrete GPUs but no CPU work was running, the iGPU temperature wasn't that high, ~55°C. Power usage was ~14.3 W. iGPU performance also remained much higher, 20% better than with 6 CPU tasks running. This shows that the discrete GPUs don't actually compete much against the iGPU for resources, and that the CPU does. So it looks like CPU cache and system memory usage slows the iGPU down when CPU tasks are running. The slowdown might be a lot smaller with only 4 CPU threads in use, and a lot worse when trying to use all 8 CPU threads for CPU work plus the iGPU and two discrete GPUs.
The iGPU does, however, impact the discrete GPUs' memory controller load. When running an Intel WU and two NVidia work units for GPUGrid, the memory controller load went from 42% down to 39% for the 660Ti, compared to not running any Intel WUs. That's a 7.7% difference in memory controller load and would account for the reduction in task performance.
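For what it's worth, a minimal sketch re-deriving the percentages quoted in this post from the run times and loads listed above (no numbers beyond those already posted):

# ~6.6% GPUGrid slowdown: NOELIA run times with vs. without the iGPU task running.
print(f"GPUGrid slowdown:      {(43012.44 - 40345.62) / 40345.62:.1%}")   # 6.6%

# HD4000 vs. a single CPU thread: BRP4 run times (opencl-intel_gpu vs. BRP4X64).
print(f"HD4000 vs. CPU thread: {3186.85 / 707.73:.1f}x")                  # 4.5x

# 660Ti memory controller load with vs. without an Intel WU running.
print(f"Memory controller:     {(42 - 39) / 39:.1%} difference")          # 7.7%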
On last update (13/7 13:06)
On the last update (13/7 13:06) my machine downloaded BRP4X64 as well as opencl-intel_gpu BRP4 tasks, even though I have the setting "Run CPU versions of applications for which GPU versions are available" set to "no". It hadn't done this before, at any point since I started running the iGPU BRP4 tasks.
With CPU usage at 100% the
With CPU usage at 100%, the iGPU's runtimes increase by around 6 or 7 times:
p2030.20130409.G36.65-02.29.S.b1s0g0.00000_1060_1 169629166 14 Jul 2013 9:45:40 UTC 16 Jul 2013 8:49:55 UTC Completed and validated 4,764.28 10.62 0.11 62.50 Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130409.G36.65-02.29.S.b1s0g0.00000_1026_0 169629098 14 Jul 2013 9:45:40 UTC 16 Jul 2013 8:49:55 UTC Completed and validated 4,475.40 11.31 0.12 62.50 Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130409.G36.65-02.29.S.b1s0g0.00000_1017_0 169629080 14 Jul 2013 9:45:40 UTC 16 Jul 2013 8:49:55 UTC Completed and validated 5,051.29 13.12 0.13 62.50 Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130409.G36.65-02.29.S.b1s0g0.00000_976_0 169628998 14 Jul 2013 9:44:32 UTC 14 Jul 2013 22:56:32 UTC Completed and validated 650.01 21.65 0.22 62.50 Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130409.G36.65-02.29.S.b1s0g0.00000_969_0 169628984 14 Jul 2013 9:44:32 UTC 14 Jul 2013 22:56:32 UTC Completed and validated 651.55 21.82 0.22 62.50 Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
p2030.20130409.G36.65-02.29.S.b1s0g0.00000_956_1 169628958 14 Jul 2013 9:44:33 UTC 15 Jul 2013 0:09:53 UTC Completed and validated 667.88 22.59 0.23 62.50 Binary Radio Pulsar Search (Arecibo) v1.34 (opencl-intel_gpu)
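A quick averaging of the task list above, assuming the three ~4,500-5,000 s results are the CPU-at-100% runs and the three ~650 s results are the baseline:

loaded   = [4764.28, 4475.40, 5051.29]   # assumed: CPU fully loaded
baseline = [650.01, 651.55, 667.88]      # assumed: CPU not fully loaded

avg = lambda xs: sum(xs) / len(xs)
print(f"{avg(loaded):.0f} s vs {avg(baseline):.0f} s -> {avg(loaded)/avg(baseline):.1f}x")
# roughly 7x, consistent with the "6 or 7 times" above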
I am seeing a speed decrease
I am seeing a speed decrease as well when running Einstein on the iGPU. I don't have much hard data, though, as my tasks changed a lot recently. Actually I can't comment on runtimes of CPU projects at all. My nVidia is usually running GPU-Grid, but again too much has changed.
However, what I can say is that when I run POEM I see a definite performance hit of ~15% for POEM (7 tasks in parallel, no other CPU tasks) and for Einstein @ iGPU.
I suppose main memory bandwidth is the key issue here. I know for sure that POEM is pretty much starved for bandwidth: when I switched from DDR3-2000 to DDR3-2400, the POEM throughput of my GPU increased by almost 10%! And I know Einstein requires quite some bandwidth: CPU performance reacts favorably to more bandwidth, more so than for most applications, and on nVidia GPUs the memory controller load is quite high (significantly higher than for GPU-Grid, which itself is already quite demanding). Running 2 Einstein WUs in parallel, I recently pushed a GT640 with overclocked memory to 99% GPU load and 99% memory controller load.
I think it's pretty safe to assume that the iGPU fights all other processes for bandwidth. It will be interesting to see if the eDRAM of the HD 5200 can improve the situation noticeably (if only we could buy them). And iGPU bandwidth requirements should scale with performance. Hence it might be a good idea to:
- downclock and undervolt the iGPU to gain energy efficiency while relieving some stress on the memory subsystem (in fact that's what I'm trying right now.. but testing will take some time)
- run less bandwidth-demanding CPU apps along with Einstein@iGPU
@Bernd: knowing you guys a bit I'm sure you already optimized the apps quite a bit. But anything that reduces the amount of memory bandwidth needed (without sacrificing performance) could help, especially if we're crunching on all CPU threads and the GPU.
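To put rough numbers on the bandwidth side of this, a minimal sketch of the theoretical peak for dual-channel DDR3 at the two speeds mentioned above (sustained bandwidth is lower, but the ratio is what matters):

# Theoretical peak: transfer rate x 8 bytes per transfer x 2 channels.
def ddr3_dual_channel_gbs(mt_per_s):
    return mt_per_s * 1e6 * 8 * 2 / 1e9

for speed in (2000, 2400):
    print(f"DDR3-{speed}: {ddr3_dual_channel_gbs(speed):.1f} GB/s")
# DDR3-2000: 32.0 GB/s, DDR3-2400: 38.4 GB/s -- about 20% more theoretical
# headroom, of which the ~10% POEM throughput gain above used roughly half.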
MrS
Scanning for our furry friends since Jan 2002
Some observation from my
Some observations from my end.
Running a Haswell 4770K @ 3.9 GHz, the default overclock set by the mainboard. I was quite surprised that it skipped the usual turbo speed steps and held that clock at any load, so I had to get a decent CPU cooler ;)
Back to the facts:
Getting the iGPU to work was quite frustrating at first. The iGPU was detected by Windows, but the monitor went black (no output at all once Windows had loaded), while the discrete card worked fine.
The trick was to remove the discrete graphics card and install the Intel drivers + OpenCL SDK once again --> double-checking the tray icon.
Everything went fine after I reinstalled the AMD card. BOINC detected both GPUs. I just needed to fix some primary-display glitches and, yes, a dummy plug was needed.
Performance is quite interesting (iGPU running at stock 1250 MHz):
7x BRP4X64 on the CPU --> iGPU runtimes around 15 min (900 sec); no change in runtime if I go down to six.
7x boincsimap on the CPU --> iGPU runtimes around 10 min (600-660 sec), even when the AMD card is on duty too (distributed.net).
Maybe it has something to do with memory bandwidth issues.
I had to disable CPU downloads for E@H via the E@H preferences, as my main goal was to spend the iGPU on this project while the CPU chuckles away at other tasks.
I've just finished my first
I've just finished my first Perseus BRP5 1.39 on the HD4000. Running 2 WUs in parallel at 1350 MHz with DDR3-2400 results in runtimes of 54,350 s, or 10.6k RAC. And that was with POEM running full blast except for 4 of the 15 h, where it was 3 POEMs and 1 GPU-Grid.
POEM does fight with Einstein for main memory bandwidth, hence for the small tasks I'm seeing:
8.8k RAC without POEM
7.9k RAC with POEM
... so don't be afraid of the longer runtimes :)
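As a sanity check on that RAC figure, a minimal sketch, assuming ~3,333 credits per BRP5 task (not stated above, but consistent with the quoted 10.6k):

runtime_s       = 54350    # per task, with 2 WUs running in parallel
parallel_wus    = 2
credit_per_task = 3333     # assumed BRP5 credit, consistent with the quoted RAC

tasks_per_day = parallel_wus * 86400 / runtime_s        # ~3.2 tasks/day
print(f"{tasks_per_day * credit_per_task:,.0f} credits/day")   # ~10,600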
MrS
Scanning for our furry friends since Jan 2002
first off, i don't really
First off, I don't really have any feedback to contribute because I aborted the first task that I found running on my HD 4000. I woke up this morning to find 3 more of them running in parallel on the iGPU, but I haven't cancelled them yet.
My question is this: why the heck am I getting these tasks in the first place when my project preferences for this specific host are set to accept nVidia GPU tasks only (and not CPU tasks, AMD GPU tasks, or Intel GPU tasks)? I've got 2 GTX 580s in this machine, each running 4 BRP tasks at a time (8 GPU tasks running simultaneously in total). They used to complete in approx. 6,700 s, but ever since the iGPU started crunching BRP tasks, dGPU BRP task run times have increased substantially and erratically (they now range from 7,000 s to 10,000 s). That is unacceptable in my book... besides, my intention has always been to use the iGPU to run the display only, while the dGPUs serve as dedicated crunchers.
If anyone can help me figure out why I'm getting Intel GPU (iGPU) BRP tasks when my project preferences are specifically set to not accept them, I would appreciate it.
TIA,
Eric