I've recently built two identical HEDT PCs for a friend, and I've tested them with Einstein@home CPU apps (among other tests).
I thought that running 64 tasks simultaneously would greatly reduce the performance of the app, so I disabled SMT in the BIOS. Even with "only" 32 tasks running, the run times were quite high: 47,000~52,000 secs (13h~14h30m). I decided to further reduce the number of simultaneous tasks, so I set "use at most 50% of the processors". I also wrote a little batch program that periodically sets the CPU affinity of each task to the even-numbered cores (to spread the running tasks across the CPU chiplets). The run times dropped to 19,200~19,600 secs (5h20m~5h30m), while the power consumption rose by 30 W (and the CPU temperature went up by 7°C as well).
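For reference, here is a minimal C++/Win32 sketch of the same idea (my actual script was a plain batch file; the "einstein" process-name match, the 60-second interval and the 0x55555555 mask below are assumptions for illustration):

// Rough C++/Win32 sketch of the affinity loop (illustration only):
// every minute, restrict every process whose name contains "einstein"
// to the even-numbered logical processors.
#include <windows.h>
#include <tlhelp32.h>
#include <cwchar>

int main() {
    // Bits 0, 2, 4, ... set: even-numbered logical processors only.
    // With SMT off on a 32-core 3970X that is 16 cores, spread across the chiplets.
    const DWORD_PTR even_cores_mask = 0x55555555;

    for (;;) {
        HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
        if (snap != INVALID_HANDLE_VALUE) {
            PROCESSENTRY32W pe;
            pe.dwSize = sizeof(pe);
            if (Process32FirstW(snap, &pe)) {
                do {
                    // Match the Einstein@home app by executable name (assumed pattern).
                    if (wcsstr(pe.szExeFile, L"einstein") != nullptr) {
                        HANDLE proc = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pe.th32ProcessID);
                        if (proc) {
                            SetProcessAffinityMask(proc, even_cores_mask);
                            CloseHandle(proc);
                        }
                    }
                } while (Process32NextW(snap, &pe));
            }
            CloseHandle(snap);
        }
        Sleep(60 * 1000);   // re-apply periodically, like the batch loop did
    }
}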
My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.
There have been quite a few articles and reviews stating that Windows 10 is not particularly good at thread scheduling for this high-core-count part. The Enterprise version was reportedly better.
Disabling SMT would have halved the L3 memory pool available.
Don't forget that the BOINC server-side configuration for ncpus is limited to 64 cores by default. I see in the code that they are thinking about the proliferation of multi-core CPUs:
const int MAX_NCPUS = 64;
// max multiplier for daily_result_quota.
// need to change as multicore processors expand
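If I read that right, a host reporting more than 64 CPUs is simply treated as a 64-CPU host when the daily result quota is scaled, roughly like this (my paraphrase in C++, not the actual scheduler code):

// Paraphrase only - not the real BOINC scheduler code.
// The daily result quota scales with the host's CPU count,
// but the multiplier is capped at MAX_NCPUS.
#include <algorithm>

const int MAX_NCPUS = 64;

int quota_multiplier(int reported_ncpus) {
    return std::min(reported_ncpus, MAX_NCPUS);
}

// e.g. daily quota = per-CPU quota * quota_multiplier(64 /* 3970X with SMT */);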
That should be the "workstation" version, which supports 100+ threads, but this CPU has only 64. "Not particularly good" is a very polite way of saying the performance is less than halved, but in this case not the thread scheduling of Windows is to blame, as the tasks were assigned to a single thread with setting a different CPU affinity to each task.
Can you give me a link which proves it? There's no point in halving the L3 memory pool when turning off SMT. I've tried 32 tasks with SMT on, each task assigned to a different core, the performance was as low as with 32 tasks with SMT off. So my measurements don't support this idea.
It has nothing to do with my issue.
I'm running a 2950 (16-core Zen+) and seeing similar results as well. Right now I'm only running 5 CPU threads on GW work units and four dedicated to GPU (GRB). As I increase the number of cores dedicated to these CPU WUs, I find the Total Socket Power (PPT) jumps sharply, going to 97% and higher, and the time for WU completion goes up nearly exponentially. It's almost like running multiple work units on a graphics card. Note: I don't use "Precision Boost" since it's a fairly new processor and I don't want to void the warranty.
So I've left Simultaneous Multi-Threading (SMT) enabled and just limit the number of cores via BOINC to avoid "over-threading" (I made that term up), while allowing it to happen if Windows thinks it necessary - this is my main workstation. What I've done lately is turn on Local memory mode (via Ryzen Master) to keep WUs in "near" memory.
Since I ended up with 64 GB of memory (4-channel DDR4-2666), there's plenty of RAM available to each Core Complex (CCX). This appears to have increased performance on the WUs slightly; however, I haven't gotten around to increasing the number of cores dedicated to CPU WUs yet to see if it scales any better with more WUs.
I don't know how much of this relates to the Zen 2 architecture; this is just what I'm observing with mine.
I was wrong. I was thinking about the upcoming Threadripper 4000/Zen 3 CPUs with unified L3 cache.
Thank you!
I am seeing the same, but at a smaller scale, with a 2700X (16 threads and two memory channels). Throughput tapers off around 8-10 concurrent Einstein tasks; there is no use in running more in parallel. With the four memory channels of the 3970X you can run double the number of tasks. So yes, memory access is a bottleneck for these tasks - the way they are written now and the way they run (on the CPU), as independent single-threaded tasks, each with its own fairly large data set, not making much use of the cache.
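Rough theoretical numbers line up with that (peak DDR4 bandwidth, ignoring real-world efficiency; the task counts are the ones discussed in this thread):

// Back-of-the-envelope peak memory bandwidth per concurrent task.
#include <cstdio>

int main() {
    // One DDR4-2666 channel: 2666 MT/s * 8 bytes ~= 21.3 GB/s peak.
    const double per_channel = 2666e6 * 8.0 / 1e9;

    const double bw_2700x = 2 * per_channel;   // dual channel
    const double bw_3970x = 4 * per_channel;   // quad channel

    std::printf("2700X: %.1f GB/s total, %.1f GB/s per task at 9 tasks\n",
                bw_2700x, bw_2700x / 9);
    std::printf("3970X: %.1f GB/s total, %.1f GB/s per task at 16 tasks\n",
                bw_3970x, bw_3970x / 16);
    return 0;
}

Both setups end up at roughly 5 GB/s of theoretical peak per task at the point where throughput stops scaling, which fits the memory-bandwidth explanation.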
Do you possibly have any comparison with other projects? I would very much like to learn about Rosetta and WCG's molecular docking (https://www.worldcommunitygrid.org/research/scc1/overview.do). These, to my understanding, have fairly low memory requirements and hence may still shine with a higher number of crunching cores.
Many thanks!