Quote:
(...) the shortfall (of HT_4 --ed.) to nHT_4 is quite large.
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities
so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.
Quote:
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.
Now that is an interesting thought. Allow me to express in highly verbose form my understanding of what you have said so tersely.
In hyperthreading the OS is presented with a set of apparently equivalent CPUs. But in the current form pairs of them use the same physical hardware. So, at least in the Nehalem generation, there would be a great advantage to assigning the next execution of a thread to a virtual CPU which was not only idle itself, but whose "sibling" as you call it--the other virtual CPU in fact using the same core--was currently idle, rather than to a virtual CPU whose sibling was already using a full core resource.
I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways. That may have seemed a needless complication or a departure from the purity of apparent equivalence to those making the original design decisions.
That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon. However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems not likely to be a big market for the feature.
Quote:
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.
The Linux scheduler is HT-aware and so optimally balances out the loading and tries to avoid core thrashing and subsequent cache trashing.
No 'forcing CPU affinity' required. It's already included!
You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU).
Happy fast crunchin',
Martin
Quote:
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.
The Linux scheduler is HT-aware and so optimally balances out the loading and tries to avoid core thrashing and subsequent cache trashing.
No 'forcing CPU affinity' required. It's already included!
I've seen plenty of bad scheduling with HT support in the scheduler
(CONFIG_SCHED_SMT). Though I admit, theory does appear nice.
Quote:
You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU).
No disagreement here :)
Quote:
(edit: after I wrote this paragraph I noticed that Mike Hewson had added considerable updates to his original comments on my histogram. An appropriate modification to my claim here is to say that I think that the current overall systematic variation may be far less than in the old days, but that in any case the restricted set of results actually being compared here, being all from frequency 1373.90 and spanning a sequence number range only from 1000 to 1022, probably contributed very little measurement noise stemming from systematic execution work variation to the reported comparisons)
Exactly right. The close frequencies and sequence numbers mean the skygrid right ascension and declination are real close. My guess ( admittedly based on old parameter estimates ) is that the sequence numbers have a cycle of about 400 work units before returning to similar runtimes, for around that frequency value.
[ For those not familiar with this aspect of the discussion : at each assumed signal frequency the entire sky is examined in small areas ( one per work unit ) with more, and thus individually smaller, areas required for higher frequencies. Because the Earth is rotating around its own axis, and it is also orbiting the Sun, a signal channel from each interferometer needs to be 'de-Dopplered' accordingly for each and every choice of distant sky grid element ( a tiny area on a construct called the 'celestial sphere' ). Ultimately a signal is effectively expressed as what it would be like if it were heard at a place called the solar system 'barycenter'. There is another line of adjustment according to estimates of putative source movements too. The part of the algorithm that steps through the skygrid has to acknowledge some trigonometry to resolve a signal's components to the directions along which a particular interferometer's arms happen to lie at a given instant. In addition not all skygrid areas are equal, which is a consequence of spherical geometry not being 'flat'. In any case the work unit's runtime used to be very dependent on skygrid position, with a marked sinusoidal variation above an amount that was constant regardless of sky position. The algorithm starts stepping from, I think, the equator, but it could have been a pole as I can't remember which, and wraps around the sphere with a 'stagger' reminiscent of winding yarn around a ball. The number of steps to return for another wrap around is this cycle length of approximately 400 that I'm referring to. At lower frequencies than we are doing now, around 3 such cycles were required to cover the entire sky grid. There was also another effect 'rippling' the sinusoidal runtime vs. sequence number curve, probably ( well that was my view ) due to conversion of co-ordinates from an Earth-based equatorial view to the Earth's orbital plane or ecliptic.
The Earth's axis is rotated with respect to the ecliptic, which is why we have seasons etc. In any case method changes have made all this rather less relevant now ..... but it used to be a huge issue in comparing runtimes and relative (in)efficiencies ]
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.
Now that is an interesting thought. Allow me to express in highly verbose form my understanding of what you have said so tersely.
(nb, yes, that's my message)
Quote:
I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways.
If I were to nitpick I'd say "Hardware enables software to retrieve physical layout"... ;) sorry.
Quote:
That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon.
As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't
know how to do that in Windows.
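On the Linux side, at least, the pairs can be read straight from sysfs. The following is only a minimal sketch, assuming a kernel that exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the helper names `parse_cpu_list` and `sibling_groups` are made up for illustration, and none of this answers the Windows question.

```python
import glob


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,4' or '0-1,8-9' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(pattern="/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    """Return the distinct groups of logical CPUs that share a physical core."""
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as fh:
            groups.add(tuple(parse_cpu_list(fh.read())))
    return sorted(groups)


if __name__ == "__main__":
    # One line per physical core; a two-element tuple is an HT "pair".
    for group in sibling_groups():
        print(group)
```

Whether siblings come out numbered as neighbours or far apart depends on the machine, so it seems safer to read the list than to assume even/odd numbering.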
Quote:
However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems not likely to be a big market for the feature.
Yes... use cases, use cases, use cases, use cases (to paraphrase Steve
Ballmer). I can't see one (use case, not Steve -- ed.) either.
Quote:
... As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't know how to do that in Windows.
Is the Windows scheduler HT-aware yet?...
Aside: Also note that for some systems, the Intel CPUs can become memory-bandwidth limited for some tasks. For those cases, you can get better performance by NOT using all the cores, or by using a mix of BOINC tasks so as not to hit the limits for CPU cache and memory accesses.
That was especially true for the later multi-cores using the old Intel FSB. Has that now been eased with the more recent CPUs that no longer use a 'northbridge' for RAM access?
I can vaguely remember that MS put quite some effort into making Server 2008R2 more power efficient (I think there was a review on Anandtech about this). They achieved quite an improvement over the previous versions. And as far as I remember the optimizations include NUMA-awareness and HT-awareness in the scheduler. It may not be perfect (which software is?), but if it wasn't there I'd expect the HT_4 result to be even worse, maybe right in the middle between nHT_4 and HT_8 (without a proper calculation of probabilities).
MrS
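The "proper calculation of probabilities" can be sketched for a deliberately naive model. The assumption here (not anything measured in this thread) is that a scheduler with no HT awareness drops 4 tasks onto 8 logical CPUs uniformly at random, with logical CPUs 2k and 2k+1 sharing core k. The function name `shared_core_stats` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def shared_core_stats(cores=4, threads_per_core=2, tasks=4):
    """Enumerate every placement of `tasks` on the logical CPUs.

    Returns (probability that no core is shared,
             expected number of cores running more than one task).
    """
    logical = range(cores * threads_per_core)
    placements = list(combinations(logical, tasks))
    no_sharing = 0
    total_shared = 0
    for placement in placements:
        per_core = [0] * cores
        for cpu in placement:
            per_core[cpu // threads_per_core] += 1
        shared = sum(1 for n in per_core if n > 1)
        total_shared += shared
        if shared == 0:
            no_sharing += 1
    n = len(placements)
    return Fraction(no_sharing, n), Fraction(total_shared, n)


if __name__ == "__main__":
    p_clean, expected_shared = shared_core_stats()
    print("P(no core shared)      =", p_clean)
    print("E[cores shared by two] =", expected_shared)
```

Under that model only 8 in 35 placements (about 23%) keep all four tasks on separate cores, and on average 6/7 of a core is doubled up at any instant. A real scheduler also migrates tasks over time, so this is a rough bound on what "no HT awareness" might look like, not a prediction.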
Quote:
That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon.
As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't
know how to do that in Windows.
In the "set affinity" interface for Process Explorer it designates CPU 0 through CPU 7 on my E5620, and 0 through 3 on my Q6600.
From some previous work I formed an impression that (0,1), (2,3), (4,5), and (6,7) were core-sharing pairs on my E5620, though I'm not highly confident. At least part and perhaps all of my impression made use of reported core-to-core temperature changes in response to task shifts. An additional difficulty is that at least some temperature-reporting apps don't use CPU identification compatible with that used in this affinity interface.
After I saw tear's note yesterday, I made a sloppy trial run in which I used suspensions to limit execution to four Einstein 3.06 HF tasks, and used this affinity mechanism to restrict each to a distinct one of the four presumed pairs. It was sloppy in that I failed to monitor things closely enough to avoid some minutes in which fewer than four tasks were running, but my initial impression is fairly strongly that a large improvement over the non-affinity modified case was demonstrated. Long ago I did affinity experiments for a Q6600 with a full SETI/Einstein task load demonstrating no improvement. That, of course, was quite a different issue than this. It is not the un-needed switching of tasks from CPU to CPU that is the primary harm here, but un-needed sharing of a physical core when an idle core is available.
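The one-task-per-pair arrangement described above can be sketched in a few lines. This is an illustration only: the experiment itself was done by hand in Process Explorer on Windows, whereas `os.sched_setaffinity` is a Linux-only call, the pairing (0,1), (2,3), (4,5), (6,7) is the unconfirmed impression from the post, and the function names and PIDs are invented.

```python
import os

# Presumed core-sharing pairs on the E5620 (an assumption, not verified).
PRESUMED_PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7)]


def one_cpu_per_core(groups):
    """Pick the lowest-numbered logical CPU from each sibling group."""
    return [min(group) for group in groups]


def plan_affinity(pids, groups):
    """Map each task PID to a logical CPU on its own physical core."""
    cpus = one_cpu_per_core(groups)
    if len(pids) > len(cpus):
        raise ValueError("more tasks than physical cores")
    return dict(zip(pids, cpus))


def apply_affinity(plan):
    """Pin each PID to its single assigned CPU (Linux only)."""
    for pid, cpu in plan.items():
        os.sched_setaffinity(pid, {cpu})


if __name__ == "__main__":
    # Hypothetical PIDs of four running science tasks; print the plan only.
    print(plan_affinity([101, 102, 103, 104], PRESUMED_PAIRS))
```

Pinning each task to one logical CPU leaves the sibling idle as long as nothing else is scheduled there, which is the condition the experiment was trying to create.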
I found the nHT_4 << HT_4 results to be inconsistent with experiments I've run in the past and, like your initial reaction, surprising.
I just ran this same experiment on a Core i7-920 (OC to 3.7GHz), a Nehalem quad core with hyperthreading and 3 x 2GB of RAM under Windows 7. This is of course using the older 45 nm process versus Westmere's 32 nm process, but as you point out, they are essentially the same architecture.
My results:
nHT_4 = 13,500 seconds
HT_4 = 13,560 seconds
Maybe someone else can run this experiment and provide an additional data point.
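To put the two figures above side by side, here is a small sketch of the arithmetic; the function name `relative_difference` is invented for illustration, and the run times are the ones quoted in the post.

```python
def relative_difference(baseline_s, other_s):
    """Fractional change of other_s relative to baseline_s."""
    return (other_s - baseline_s) / baseline_s


nht_4 = 13_500  # seconds per task, HT disabled, 4 tasks
ht_4 = 13_560   # seconds per task, HT enabled, 4 tasks

print(f"HT_4 is {relative_difference(nht_4, ht_4):.2%} slower than nHT_4")
```

A gap of 60 seconds in 13,500 is under half a percent, which is probably within run-to-run noise for a single data point.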