Yep, same stepping says it all. The Refresh K models are the only ones with known differences (the soldered heat spreader).
MrS
Scanning for our furry friends since Jan 2002
RE: I suspect there is some other reason lurking deep in the system.
Perhaps, and perhaps not very deep either :-).
The Haswell is an i3-4130 (2 cores / 4 threads) @ 3.4 GHz with 2 free threads.
The Haswell Refresh is a G3258 (2 cores / 2 threads) @ 3.9 GHz with 1 free thread.
I had expected that supporting 4 GPU tasks with 1 free core would be a penalty (even though the 3.9 GHz is an obvious bonus) such that there might be a detrimental effect on the CPU component of the crunch time.
When I saw the G3258 giving 618 secs average CPU time and the i3-4130 giving more than 50% higher at 960 secs, I wondered if there was something beneficial with Haswell Refresh. Perhaps the benefit is coming from 3.9 GHz compared to 3.4 GHz, although that seems like too big a difference for not that big a frequency increase. Perhaps part of the difference is not having the use of HT.
Cheers,
Gary.
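A quick back-of-the-envelope check of the figures in the post above. This is only a sketch, and it assumes that CPU time scales inversely with clock speed, which is a simplification; the variable and function names are made up for illustration.

```python
# Figures quoted in the post: average CPU time per task and CPU clock.
cpu_time_g3258 = 618.0   # seconds, G3258 @ 3.9 GHz
cpu_time_i3 = 960.0      # seconds, i3-4130 @ 3.4 GHz
clock_g3258 = 3.9        # GHz
clock_i3 = 3.4           # GHz


def percent_higher(a, b):
    """Return how much higher a is than b, in percent."""
    return (a / b - 1.0) * 100.0


# Observed gap in CPU time between the two hosts.
observed_gap = percent_higher(cpu_time_i3, cpu_time_g3258)

# Gap in clock speed between the two hosts.
clock_gap = percent_higher(clock_g3258, clock_i3)

# ASSUMPTION: CPU time is inversely proportional to clock speed.
# Under that assumption, this is the CPU time the i3-4130 would show
# if clock speed were the only difference between the hosts.
expected_i3 = cpu_time_g3258 * clock_g3258 / clock_i3

# Whatever remains is not explained by clock speed alone.
unexplained = cpu_time_i3 - expected_i3

print(f"CPU time gap:   {observed_gap:.1f}%")
print(f"Clock gap:      {clock_gap:.1f}%")
print(f"Expected on i3: {expected_i3:.1f} s, unexplained: {unexplained:.1f} s")
```

Under that assumption the clock difference accounts for only a small part of the gap, which fits the suspicion that something else (such as the sharing of cores under HT) is involved.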
RE: Perhaps part of the difference is not having the use of HT.
It depends on the project of course, but I run the WCG/CEP2 work units on both an Ivy Bridge i5-3550 (4 cores, non-hyperthreaded) and on an i7-3770 (4 cores / 8 threads, hyperthreaded). I see only about a 12% to 15% improvement in overall throughput using hyperthreading, after accounting for the difference in clock rates. With other projects, it is a bit more, but probably not over 25% in most cases.
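One way such a clock-adjusted comparison could be made is sketched below. The function name and the sample throughput and clock figures are hypothetical placeholders, not measurements from either host.

```python
def ht_gain_percent(throughput_ht, clock_ht_ghz, throughput_no_ht, clock_no_ht_ghz):
    """Hyperthreading gain in percent after dividing out the clock rates.

    Throughput can be in any unit (e.g. tasks per day) as long as both
    hosts use the same one.
    """
    per_ghz_ht = throughput_ht / clock_ht_ghz
    per_ghz_no_ht = throughput_no_ht / clock_no_ht_ghz
    return (per_ghz_ht / per_ghz_no_ht - 1.0) * 100.0


# HYPOTHETICAL numbers, for illustration only:
# 23 tasks/day at 3.4 GHz with HT versus 20 tasks/day at 3.3 GHz without.
gain = ht_gain_percent(23.0, 3.4, 20.0, 3.3)
print(f"Clock-adjusted HT gain: {gain:.1f}%")
```

Dividing by the clock rate assumes throughput scales linearly with frequency, which is only roughly true for real workloads.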
We have now gathered enough confidence in the Beta app to release it as the official one. All your work to summarize the performance characteristics of the new app helped a great deal, so thank you very much indeed. Special thanks to Gary, who suggested and initiated this very structured and focused discussion in its current form.
Next steps:
As promised earlier, we will try to put up an additional CUDA app version that will use a newer CUDA version, at least 5.5.
HB
RE: a newer CUDA version, ...
Great news. I hope you find your effort to try this rewarded by higher throughput without too painful a set of difficulties. In my dreams I hope you will try CUDA 7, which seems more likely to find better ways to use Maxwell GPUs than earlier versions, but I'll loyally test whatever you find good enough to have us try. I've started shortening my queues in anticipation.
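RE: In my dreams I hope you will try Cuda7, which seems more likely to find better ways to use Maxwell GPUs than earlier ones, but I'll loyally test whatever you find good enough to have us try it. I've started shortening my queues in anticipation.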
Me too. It should be pointed out, though it is probably obvious, that crunchers tend to go where their hardware can best be used, other things being equal. So in deciding on versions, it is not just the present user population that should be considered, but also those who will be attracted by new applications, or who might leave if the grass is greener elsewhere. In other words, you need to lead your target a little bit.
I generally don't pay much attention to CPU times, as they are fairly meaningless outside of the context of the system in question.
Elapsed times are more comparable between systems, and in some sense more easily verified (with a stopwatch, for example!).
However, this thread has piqued my interest, and I noticed the following.
If I run only BRP6 1.52 (x2) for both GPUs, the CPU times average around 980s.
If I then run additional CPU tasks (on the i3 CPU 530), namely GWS S6 Bucket 1.06 (X64) followups, the CPU times for the BRP6 tasks DROP to around 720s! That is a drop of over 25%.
Elapsed times do not appear to change either way.
The system also feels more responsive for browsing when these extra tasks are running.
Not what I expected. Has anyone else noticed this, or can anyone explain it?
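Gary wrote: The Haswell is an i3-4130 (2 cores / 4 threads) @ 3.4 GHz with 2 free threads.
The Haswell Refresh is a G3258 (2 cores / 2 threads) @ 3.9 GHz with 1 free thread.
Perhaps part of the difference is not having the use of HT.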
Ahh, that's quite a difference! The 2 CPU threads on the i3 each run on a separate physical core (that's how the OS schedulers handle HT CPUs). Whenever the Einstein app joins those 2 threads, it's guaranteed to share a core with one of them. That's OK, but it makes the CPU part take longer. On the Pentium, on the other hand, there's always a physical core free, so the CPU portion of the Einstein tasks completes quicker.
@AgentB: I've also got an explanation for you. The Einstein tasks of the optimized 1.52 app use little CPU time. If you're not running any CPU tasks along with them, the CPU will be at idle / base frequency (1600 MHz I think) when the Einstein tasks need it. Ramping up to full speed takes some time. If the Einstein work is already finished by then, or at least most of it, the average CPU clock speed will be well below the maximum clock speed.
If you run CPU tasks along with the GPU tasks, the continuous load will keep the CPU clock up and reduce execution times. This effect is probably amplified by your CPU being a bit older, so it doesn't switch power states as quickly as newer hardware. This doesn't matter, though, as either way is fast enough to support your GPU (same elapsed times) :)
MrS
Scanning for our furry friends since Jan 2002
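The clock-speed explanation above can be turned into a rough estimate. The sketch below assumes the BRP6 tasks need the same number of CPU cycles in both runs and that the CPU sits at full clock whenever extra CPU tasks are running; both are simplifying assumptions, and the function name is made up for illustration.

```python
def implied_average_clock(cpu_time_busy_s, cpu_time_quiet_s, full_clock=1.0):
    """Average clock during the 'quiet' run (GPU tasks only).

    ASSUMPTIONS: the same number of CPU cycles is needed in both runs,
    and the 'busy' run (extra CPU tasks) executes entirely at full clock.
    With full_clock=1.0 the result is a fraction of the full clock speed.
    """
    cycles = cpu_time_busy_s * full_clock   # cycles = time * clock
    return cycles / cpu_time_quiet_s        # average clock = cycles / time


# CPU times reported by AgentB: about 720 s with extra CPU tasks,
# about 980 s with only the BRP6 GPU tasks running.
fraction = implied_average_clock(720.0, 980.0)
print(f"Implied average clock in the GPU-only run: {fraction:.0%} of full speed")
```

Read this way, the reported CPU times would correspond to the CPU averaging roughly three quarters of its full clock while only the GPU tasks were running.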
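RE: ... If you're not running any CPu tasks along with them, the CPU will be at idle / base frequency (1600 MHz I think) when the Einstein tasks start. Ramping it up to full speed takes some time. If Einstein is already finished, or at least most of it, the average CPU clock speed will be well below the maximum clock speed.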
Thanks very much for pointing this out! I've sometimes seen people say that they run multiple GPU tasks and leave ALL the cores free. I'm sure that helps with both power consumption and temperature but may hinder GPU performance if all the CPU cores are likely to be running at idle frequency most of the time. I remember looking at such a host some time ago and expecting to find the fastest crunch times but actually seeing what seemed to be slightly worse performance. That all makes sense now. Thanks for the explanation.
Cheers,
Gary.
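RE: I've sometimes seen people say that they run multiple GPU tasks and leave ALL the cores free. I'm sure that helps with both power consumption and temperature but may hinder GPU performance if all the CPU cores are likely to be running at idle frequency most of the time.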
Thanks also, MrS. Every day at E@H is a school day. So the CPU is not doing more, it's just doing it slower.
With the old BRP4 tasks, the CPU load on my system was much higher, and running two GPUs would keep all four CPU threads busy running 6 tasks, around the 25% mark, so that explains why I did not see this before. With BRP4 I would notice that any CPU load had a negative effect on GPU elapsed time, but thinking about it now, a single card with a better processor would probably benefit from some CPU load to keep it lit up and feeding the GPU.
BRP6 is a totally different GPU app, so much so that I'm toying with the idea of running a third GPU in a PCIe x1 slot. Second-hand GTX 460s are getting cheap on eBay, and I have the power and cooling capacity.