Hyperthreading Efficiency Observations, Einstein S5 and SETI Enhanced KWSN 5.15 R-1.2

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7296538358
RAC: 2167971
Topic 191577

On the often interesting topic of hyperthreading effectiveness, I have some new data to offer on (almost) current versions.

The Applications:
SETI Enhanced, Simon's Windows P4 SSE2 32-bit V5.15 R-1.2
Einstein S5, the stock version einstein_S5R1_4.02_windows_intelx86.exe

The client:
5.3.12.tx36

The host:
Intel Gallatin 3.2 GHz, with 133 MHz dual-channel FSB RAM, Windows XP Service Pack 2

Summary results:
HT for pure SETI improved throughput by a ratio of 1.59 over nHT.
HT for pure Einstein improved throughput by a ratio of 1.20 over nHT.
HT for exact 50/50 SETI/Einstein improved throughput by a ratio of 1.22 over 50/50 nHT.

Comments:

The results are based on about one day of running in each of the five implied configurations. I believe my mix of SETI angle ranges (ARs) is at least somewhat representative, and decently matched. But your results will almost certainly vary with the specific AR and noise level of the actual Work Units.

These results should not be construed as likely to predict the behavior of all hyperthreaded hosts. Gallatin is the large-cache Northwood-descended P4 which was sold as the first P4 Extreme Edition. It has 2 MB of L3 cache. For a Gallatin system, my RAM is probably unusually slow, as my ASUS P4B533-E motherboard only supports a 133 MHz FSB, vs. the 200 MHz that is probably more common on such systems. Also, by now, Prescott-derived CPUs are probably a much higher fraction of the HT population than Northwood-derived CPUs, and may behave somewhat differently.

I've used time-proportioning, not credit-proportioning, to compute the 50/50 SETI/Einstein improvement ratio.

For anyone who backed off using HT somewhere around S41.07 in Akos's Einstein S4 apps, I suggest you consider giving HT a try again. The S5 WUs are fairly predictable, so you should be able to tell whether you are better or worse off pretty quickly.

If one wishes to run both SETI and Einstein, mismatched (which is what I mean by 50/50) HT is appreciably the most efficient way.

I use Trux tx36 solely on this one HT machine, and solely to facilitate 50/50 Einstein/SETI HT using the priority_projects feature. I'd like to learn another method to achieve this end before rev 5.3.12 becomes definitively obsolete.

While Simon brought out a new SETI enhanced release just after I started this test, his release notes make me think it likely that my results apply to his new release.

Honza
Joined: 10 Nov 04
Posts: 136
Credit: 3332354
RAC: 0

archae86, it is a good topic.

While I think the EAH/EAH and EAH/SETI results are OK, I don't quite believe the SETI/SETI ratio over nHT.
Intel HT is far from being dual-core, but your results look closer to dual-core (a ratio of 1.85, let's say) than to usual HT (a ratio of 1.2, or 20%).

For projects with WUs of different lengths, the correct approach is to run a WU off-line and then re-run it with a different setup (i.e. HT and then nHT). It also helps to 'pair' similar WUs.

If you are running trux's BOINC core, you might be interested in playing with affinity, which serves well on Xeons, which are known to have trouble with multi-threaded apps when no affinity is set.

I read the final message as: it is worth trying HT again on Einstein S5 for those who were on akos's S4 app...

Bengt Larsson
Joined: 6 Jan 06
Posts: 3
Credit: 16766
RAC: 0

Message 42573 in response to message 42572

Quote:

While I think the EAH/EAH and EAH/SETI results are OK, I don't quite believe the SETI/SETI ratio over nHT.

I did a test with the previous SETI application (in October, when I bought this computer) and I got a factor of 1.40 improvement with HT. This was on a 3 GHz Prescott with 2 MB of 2nd-level cache. Just as a data point.

Honza
Joined: 10 Nov 04
Posts: 136
Credit: 3332354
RAC: 0

40% speed-up would have been good :-)

Bengt Larsson
Joined: 6 Jan 06
Posts: 3
Credit: 16766
RAC: 0

Message 42575 in response to message 42574

Quote:
40% speed-up would have been good :-)

Yes, I thought it was pretty good. In this case, I doubted it, because I didn't try with very many workunits. On the other hand, it's plausible: SETI does FFTs, which have inherently nasty memory access patterns and bust a cache of any size. My 2nd-level cache takes 30 cycles to access. So there is quite a lot of waiting in the core; that is where the other thread can pick up. SETI is probably pretty much ideal for Hyperthreading.

There are two ways you can gain from Hyperthreading. One is if your processes do different things, such as different kinds of instructions: one does integer work and the other floating point. The other is if there are stalls in the pipeline, especially for memory access. Then another thread can fill in and make use of the processor.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7296538358
RAC: 2167971

Message 42576 in response to message 42572

Quote:
While I think the EAH/EAH and EAH/SETI results are OK, I don't quite believe the SETI/SETI ratio over nHT.

I wish I had seen your disbelief in time to document the performance vs. angle range. However, these results have almost all expired from display on the web page, and my BoincView log lacks AR. As an inferior substitute, I've plotted the credits/hour vs. credits for each result in the sample. The hyperthreaded ones use the y2 (right) axis, which halves the scale range.

Were HT and nHT equally productive, the two plotted point populations would be pretty well matched (given the adjusted axes). I think you can see they are separated by much more than 20% (or 40% for that matter).

I've looked at hyperthreading gain more than once in the past. Never have I seen SETI HT gain anywhere near so low as 20% on my machine.

If others have actual data for the current application available on other machines, it would be great to see it.

The crucial point, which is missed by generalizations such as "HT is good for about 20%, if you show more or less your data are wrong" is that HT benefit is quite highly dependent on the applications being paired (not to mention the cache size and other details of the specific CPU). Akos S41.07 (and some others) had a quite large HT harm when paired with itself on my machine. Einstein S4 paired with stock SETI non-enhanced had much better HT benefit than I show in my current measurement.

Lastly, I intentionally did not use percentage because it is unclear. I did not say 85% improvement (which can be interpreted several ways, most of them flat wrong in representing my data), I said "improved throughput by a ratio of 1.59". I should hope that was clear to all.
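To make the ambiguity concrete, here is a small sketch of two different percentages that can honestly be derived from the same throughput ratio. The function names are made up for illustration; only the 1.59 figure comes from the summary results above.

```python
def percent_more_throughput(ratio):
    """Extra work completed per unit time, as a percentage of the nHT rate."""
    return (ratio - 1.0) * 100.0


def percent_less_time_per_unit_of_work(ratio):
    """Reduction in host time needed per unit of work, as a percentage."""
    return (1.0 - 1.0 / ratio) * 100.0


seti_ratio = 1.59  # SETI HT throughput ratio quoted in the summary results

print(f"{percent_more_throughput(seti_ratio):.0f}% more throughput")
print(f"{percent_less_time_per_unit_of_work(seti_ratio):.0f}% less time per unit of work")
```

Both numbers describe the same measurement, which is why quoting the bare ratio is less likely to be misread.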

Winterknight
Joined: 4 Jun 05
Posts: 1486
Credit: 392105699
RAC: 529918

The last time I did any measurements of credits/hr on an HT machine was some time ago, but with S4 and non-enhanced Seti. The approx gains were S4/S4 1.4, Seti/Seti 1.3 and S4/Seti 1.57. Sorry I cannot repeat tests, as son moved away because of new job, could ask him to try but unless I give step by step instructions he'll only mess it up. Why did I support him so much to get a computer science degree?

The gains are better with two separate CPUs, like a dual P3.

Andy

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 583978126
RAC: 141307

Good analysis, Archae!

Concerning SETI HT efficiency: back in the days of SETI 3.03 I ran tests on one of the first HT-enabled Xeons (Northwood). I think it was 2 physical CPUs running at 1.8 GHz. The time for 2 WUs was ~5:10h each (HT disabled), while the time for 4 HT-enabled WUs was 7:30h each. So that was a speedup factor of around 1.4, and the 1.59 for the current SETI looks quite reasonable.
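The arithmetic behind that factor can be sketched as follows. This assumes the quoted times are per work unit and that the WUs in each configuration ran concurrently (2 at a time with HT off, 4 at a time with HT on); the helper names are only for illustration.

```python
def to_hours(hours, minutes):
    """Convert an h:mm duration to fractional hours."""
    return hours + minutes / 60.0


def wus_per_hour(concurrent_wus, hours_per_wu):
    """Throughput when `concurrent_wus` run side by side, each taking `hours_per_wu`."""
    return concurrent_wus / hours_per_wu


# Dual Xeon, HT disabled: 2 WUs at a time, about 5:10 each.
nht_rate = wus_per_hour(2, to_hours(5, 10))
# Same box, HT enabled: 4 WUs at a time, about 7:30 each.
ht_rate = wus_per_hour(4, to_hours(7, 30))

xeon_speedup = ht_rate / nht_rate
print(f"HT/nHT throughput ratio: {xeon_speedup:.2f}")  # prints 1.38
```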

>The gains are better with two separate cpu's like a dual P3.

That is to be expected: HT costs just 5% more die space, while a second CPU costs 100% more.

MrS

Scanning for our furry friends since Jan 2002

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7296538358
RAC: 2167971

Message 42579 in response to message 42572

Quote:
For projects with WUs of different lengths, the correct approach is to run a WU off-line and then re-run it with a different setup (i.e. HT and then nHT).

Yes, I've used that method repeatedly to provide akosf timings after Stonelord mentioned a practical method. I did not want to discard multiple days of host computing capability for this exercise.

However, in the last day I did take the trouble to run one result both HT and nHT on my Gallatin host:

It is a 0.4379 Angle Range unit with modest signal counts, so, I think, fairly representative of a substantial part of the SETI Enhanced computing load.

non Hyperthreaded CPU time 17936 seconds
Hyperthreaded CPU time 23761 seconds

Here is the non-hyperthreaded result for this representative WU.

This single instrumented test case provides a 1.51 throughput improvement ratio for SETI enhanced same-application hyperthreading on my host.

It is tempting to use the high-AR units, which execute more quickly, for such comparisons, as one "wastes" less host time. I did the same test on one such unit, with an Angle Range of 2.158:

non Hyperthreaded CPU time 3035 seconds
Hyperthreaded CPU time 5303 seconds

This single instrumented test case provides a 1.14 throughput improvement ratio for SETI enhanced same-application hyperthreading on my host.

Here is the non-hyperthreaded result for this shorter than typical work unit.
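For anyone who wants to repeat the calculation on their own host, here is a minimal sketch of how the two ratios above follow from the CPU times. It assumes that with HT on, two copies of the result run concurrently and each takes the quoted hyperthreaded CPU time, and that CPU time is close to elapsed time on a dedicated host. The function name is made up for illustration.

```python
def ht_throughput_ratio(nht_cpu_seconds, ht_cpu_seconds, ht_threads=2):
    """Throughput with HT on divided by throughput with HT off.

    nHT: one result completes every `nht_cpu_seconds`.
    HT:  `ht_threads` results complete every `ht_cpu_seconds` (assumed concurrent).
    """
    nht_rate = 1.0 / nht_cpu_seconds
    ht_rate = ht_threads / ht_cpu_seconds
    return ht_rate / nht_rate


# CPU times quoted in this post for the two instrumented results.
typical_ar = ht_throughput_ratio(17936, 23761)  # Angle Range 0.4379
high_ar = ht_throughput_ratio(3035, 5303)       # Angle Range 2.158

print(f"AR 0.4379: {typical_ar:.2f}")  # prints 1.51
print(f"AR 2.158:  {high_ar:.2f}")     # prints 1.14
```

A ratio of exactly 1.0 would mean HT gives no benefit, and anything below 1.0 means HT is costing throughput for that pairing.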

I've not studied AR distributions recently, but I accumulated data for more than a year of SETI classic run time, and in that period the high-AR units represented a rather small portion of the overall distribution. So taking these convenient-to-measure units as representative may mislead some people.
