So I see S41.06 as much faster than the unfortunate S40.12 on this machine, but still not so fast as the S40.04 it currently runs. I'll try S41.06 on my other machines, and if it looks more promising on them, retry on this machine with another work unit and more carefully controlled conditions.
RE: Continuing my practice
Same observation here. S40.04 is still the fastest cruncher on my Prescott 3.0 GHz running HT'ed (so far ;-) ).
As said earlier in this thread, I wonder where the penalty on these P4s with large caches running HT'd lies?
so far 1 case of 0 credit
So far, 1 case of 0 credit: Intel Xeon/S41.06 versus Apple/4.37 and AthlonXP/4.37.
The others (a lot) on the Xeons and AthlonMP/XPs are either pending or validated without a hitch.
I'll post some times later, when I get home.
B52,
the hyperthreading penalty on P4/Xeon seems related to the small L1 data and instruction caches on these chips, as DanNeely posted earlier.
RE: B52, Hyperthreading
Thx m8, that post must have slipped through my reading.
RE: Doesn't setting the
No, this will set BOINC to run only one project at a time.
RE: B52, Hyperthreading
And the high latency of the L2 cache, plus the slow FSB that has to feed the cores and carry all the memory reads and writes.
RE: RE: Doesn't setting
My mistake, it's under general preferences, so it should be obvious that it has nothing to do with how projects are being run.
Edited for typos.
RE: Posted 42 days ago by
Posted 42 days ago by Zap:
"AMD64 XP 3000, Newcastle core, 10% overclock. Went from 14k-plus secs with the original app, through an average of somewhat less than 6000 with A36, to now my first result with S38 in 4235 secs. Quite impressive, Akosf."

Now with S41.06 it's about 2510 secs (average of 3 z1 results). That's 5.6 times faster!!
This is so very impressive. No one here is ever gonna forget Akosf, I guess.
Validation will take some time because I'm teamed up with 30k-plus-sec crunchers.
RE: RE: B52, Hyperthreading
Thx for the answers, guys.
Please correct me if I'm wrong on this one:
the only thing, then, that will give an HT'ed P4 Prescott a boost is SSE2- or SSE3-optimized code? Or will even that not increase the speed?
Cheers
RE: Thx for the answers
Only to a very limited extent, in my opinion. The only thing that, to the best of my knowledge, would make a big impact is a decrease in the "important" dataset, as was done for S39L. But even that dataset (~11KB) doesn't fit in the 8KB L1 data cache of my Prestonias (Northwood-based Xeons), so with HT enabled there are two threads, both "wanting" 11KB of L1 at the same time, while the CPU only has 8KB to offer.
This means cache misses, flushes, reloads and fetches from the L2 cache, or even from main system RAM (worst case), all of which adds latency to the memory handling alone.
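To make the working-set point concrete, here is a minimal sketch in C (not Einstein@Home code; the 8KB and 11KB sizes come from the post above, everything else is illustrative) that strides cache-line by cache-line through two buffers of those sizes and times the accesses. On a chip with an 8KB L1 data cache, the 11KB walk should come out measurably slower:

/* Minimal sketch: time strided reads over a working set that fits an
 * 8 KB L1 data cache versus one (~11 KB) that spills out of it.
 * Build with e.g.: gcc -O2 l1demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double walk(volatile unsigned char *buf, size_t size, long iters)
{
    clock_t t0 = clock();
    unsigned long sum = 0;
    size_t off = 0;
    for (long i = 0; i < iters; i++) {
        sum += buf[off];
        off += 64;            /* advance one 64-byte cache line per step */
        if (off >= size)
            off -= size;      /* wrap around inside the working set */
    }
    (void)sum;                /* keep the loop from being optimized away */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    const long iters = 200 * 1000 * 1000;
    const size_t sizes[] = { 8 * 1024, 11 * 1024 };  /* fits L1 vs. spills */
    for (int s = 0; s < 2; s++) {
        unsigned char *buf = calloc(sizes[s], 1);
        if (!buf) return 1;
        printf("%2zu KB working set: %.2f s\n",
               (size_t)(sizes[s] / 1024), walk(buf, sizes[s], iters));
        free(buf);
    }
    return 0;
}

Running two copies of this at once on an HT-enabled P4 exaggerates the gap, since both logical CPUs then compete for the same 8KB of L1, which is exactly the situation described above.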
Another issue with HT is that the two Einstein threads are basically doing the same type of work, both claiming resources of a similar nature from a CPU that only has so many ALU and FPU execution units available.
Under ideal circumstances for HyperThreading, you would run two different threads, one claiming ALU and the other FPU execution units, with their combined datasets fitting together in L1 and/or L2 cache.
From my own "experiments" with HyperThreading, I have found combinations like running SETI + SIMAP or SETI + Distributed.net RC5 at the same time to make the most of my Xeons, resulting in crunch times for both projects that were very, very close to the times I got with HT disabled on those systems.
(hope that made sense)
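For what it's worth, the pairing idea sketches out in a few lines of Linux C: pin one integer-heavy worker and one floating-point-heavy worker onto the two logical CPUs of the same physical core. The CPU numbers, loop bodies, and iteration counts are all assumptions for illustration, not anyone's actual project code:

/* Sketch of the "mix ALU and FPU work" idea: pin an integer-heavy worker
 * and an FP-heavy worker to the two logical CPUs of one physical core.
 * CPUs 0 and 1 being HT siblings is an assumption; check
 * /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your box.
 * Build with: gcc -O2 -pthread htmix.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static volatile unsigned long int_sink;  /* keep the compiler honest */
static volatile double fp_sink;

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *alu_worker(void *arg)   /* stand-in for e.g. RC5 key crunching */
{
    (void)arg;
    pin_to_cpu(0);
    unsigned long x = 1;
    for (long i = 0; i < 500000000L; i++)
        x = x * 2654435761UL + 1;    /* integer multiply-add chain */
    int_sink = x;
    return NULL;
}

static void *fpu_worker(void *arg)   /* stand-in for an E@H-style FP loop */
{
    (void)arg;
    pin_to_cpu(1);
    double x = 1.0;
    for (long i = 0; i < 500000000L; i++)
        x = x * 1.0000001 + 1e-9;    /* floating-point multiply-add chain */
    fp_sink = x;
    return NULL;
}

int main(void)
{
    pthread_t a, f;
    pthread_create(&a, NULL, alu_worker, NULL);
    pthread_create(&f, NULL, fpu_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(f, NULL);
    puts("both workers finished");
    return 0;
}

Timing this mixed pair against two copies of the same worker gives a rough feel for how much contention for the shared execution units costs.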
RE: Another issue with HT
That makes a lot of sense. Mixing an adequate choice of DC applications on an HT system can improve the overall performance a _lot_.
Example: take my Northwood P4 2.6 GHz.
Calibrate S41.06 without HT to "100%E@H" performance.
Calibrate a GIMPS (Prime95) trial-factoring workload (up to 63 bit) without HT to "100%G" performance (enabling HT does not improve it).
Running two HT instances of S41.06 decreases performance to 75%E@H combined, so disabling HT for E@H is the usual choice for maximum throughput.
Running one GIMPS process and S41.06 hyperthreaded together yields 75%E@H _and_ 64%G throughput.
So if you have two identical (for the sake of this argument) machines and run both clients on both machines, you get 150%E@H throughput and 128%G throughput. That's a combined throughput of 278% across both projects, compared to the 200%E@H or 200%G you get running only one project.
If you are only interested in doing E@H, you need to find another user who is only interested in GIMPS-TF work. Then you can team up (each user running both workloads) and both parties benefit ;)
Tau
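Tau's bookkeeping is easy to check; here is a tiny C sketch (the 0.75 and 0.64 figures are taken from the post, the variable names are made up) that reproduces the 278%-versus-200% comparison:

/* Reproduces the throughput arithmetic from Tau's post: two identical
 * machines, each running one E@H and one GIMPS-TF process hyperthreaded. */
#include <stdio.h>

int main(void)
{
    const double eah_mixed   = 0.75;  /* per-machine E@H throughput, mixed mode */
    const double gimps_mixed = 0.64;  /* per-machine GIMPS-TF throughput, mixed */
    const int machines = 2;

    double eah_total   = machines * eah_mixed   * 100.0;  /* 150% E@H */
    double gimps_total = machines * gimps_mixed * 100.0;  /* 128% G   */

    printf("E@H: %.0f%%  GIMPS: %.0f%%  combined: %.0f%% (vs. 200%% single-project)\n",
           eah_total, gimps_total, eah_total + gimps_total);  /* 278% */
    return 0;
}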