Without going into all the details, I have found that VirtualBox sometimes interferes with the running of non-VBox projects. They were GPU projects in my case, but it might apply to the Einstein AVX tasks also. I would suggest the use of separate machines for VBox and non-VBox work, if that is possible.
That is my newest PC. It is now running 3 Einstein@home CPU tasks, one VBox task (CMS-dev) and one SETI@home GPU task. One Einstein@home CPU task and one SETI@home CPU task are waiting to run (the CPU has 4 cores). CPU usage is 100%, memory usage 57%. My other PCs are older Linux boxes, one 32-bit, running vLHC@home (VBox), SETI@home CPU and Einstein@home CPU. The oldest Linux machine, a 2008-vintage Sun workstation, is running 2 Einstein@home gravitational searches on its Opteron 1210 at 1.8 GHz.
Tullio
The first time I noticed a problem (Win7 64-bit, VBox 5 in all cases) was running Cosmology on my main PC, where I do video recording using NPVR. The recordings were interrupted with various glitches, so I banned VBox on that machine. The next time, on a different Haswell machine, I was running either Cosmology or maybe CERN (ATLAS/vLHC) and noticed my two GTX 750 Tis running very slowly on Einstein BRP4G/BRP6. A reboot fixed it for a while. Finally, I found that VBox took out Folding on still another PC, as I posted on their forum https://foldingforum.org/viewtopic.php?f=81&t=28593#p283705.
The problem may not occur with all combinations of projects, but it is too haphazard for me, so I am just not running VBox. I need to run the GPU projects at the very least; none of them use VBox, and hence all are apparently susceptible to the problem. Maybe you can find a combination that works.
Now no VBox task is running on my Windows PC. Yet the Einstein@home CPU tasks seem to run slower on the Windows PC, with its superior features, than on my old Linux box. A SETI@home GPU task ended in due time. On the 32-bit laptop, vLHC@home, SETI@home and Einstein@home coexist peacefully.
Tullio
Did these transition over from being beta to production? I actually just turned on beta apps for a bit to test an Intel GPU (that was a waste as all tasks failed to verify) and found myself getting these as well.
Then, today, I turned beta apps off again (because I'm trying to figure out if I've done something horrible with my Linux box re: GPU stuff), and when I let it get more new tasks, it got regular GPU tasks and a couple more of these.
Pardon my ignorance, but I'm "catching up" after being out of the game for a few years. (I know E@H had nothing to do with the big gravity wave discovery, but it got me excited enough to install BOINC again)
Did these transition over from being beta to production?
Yes. It's a short 'tuning' run designed to be done quickly. I'm sure the intention was to open it to as wide an 'audience' as possible, as quickly as possible.
Pardon my ignorance, but I'm "catching up" after being out of the game for a few years. (I know E@H had nothing to do with the big gravity wave discovery, but it got me excited enough to install BOINC again)
No problem at all! It's really good to see you back again :-).

Cheers,
Gary.
Hello,
are you planning to add information about the search progress to the server status page, as for the other apps? I think it's very nice to be able to follow how the search is going day by day! (and the majority of BOINC projects don't do it)
Thank you.
Hello,
are you planning to add information about the search progress to the server status page, as for the other apps? I think it's very nice to be able to follow how the search is going day by day! (and the majority of BOINC projects don't do it)
Thank you.
I would guess not for this tuning run, as all the work has already been generated; if they implemented the same type of search-progress display as for the other runs, it would immediately show as completed.
When the first "real" run starts, I'm confident that a search progress indication will be added to the server status page.
If you'd like to follow the progress, then look higher up on the page at the table "Workunits and tasks"; the interesting numbers are "Tasks to send" and "Workunits without canonical result" for "O1AS20-100T". When "Workunits without canonical result" drops to 0, the tuning run will be finished.
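For anyone who wants to automate that check, here is a rough sketch that fetches the status page and pulls out that counter. The URL and the page markup it assumes are hypothetical (check the project's actual server status page before relying on it); only the label text comes from the table described above.

```c
/* Rough sketch: fetch the server status page and report the
 * "Workunits without canonical result" count.
 * ASSUMPTIONS: the URL below and the page markup are hypothetical --
 * verify against the real server status page before use.
 * Build: cc poll_status.c -lcurl
 */
#include <ctype.h>
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct buf { char *data; size_t len; };

/* libcurl write callback: append each received chunk to a growing buffer. */
static size_t on_data(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct buf *b = userdata;
    size_t n = size * nmemb;
    char *p = realloc(b->data, b->len + n + 1);
    if (!p) return 0;                  /* a short return aborts the transfer */
    b->data = p;
    memcpy(b->data + b->len, ptr, n);
    b->len += n;
    b->data[b->len] = '\0';
    return n;
}

int main(void)
{
    struct buf b = { NULL, 0 };

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* Hypothetical address -- substitute the real server status URL. */
    curl_easy_setopt(curl, CURLOPT_URL, "https://einsteinathome.org/server_status.html");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_data);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &b);

    if (curl_easy_perform(curl) == CURLE_OK && b.data) {
        /* Crude scrape: find the row label, then the first number after it. */
        const char *label = "Workunits without canonical result";
        const char *p = strstr(b.data, label);
        if (p) {
            p += strlen(label);
            while (*p && !isdigit((unsigned char)*p)) p++;
            long remaining = strtol(p, NULL, 10);
            printf("%s: %ld%s\n", label, remaining,
                   remaining == 0 ? "  -- tuning run finished!" : "");
        } else {
            puts("Label not found -- the page layout differs from this sketch.");
        }
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    free(b.data);
    return 0;
}
```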
I would like to share my runtime results and the configurations I ran with various Intel and AMD CPUs.

Average runtime is excellent across the board on Intel i7 CPUs from the 1st to the 4th generation. My two X5690s are in an older 2U chassis and the fans are very loud when the core temps are high. Since the ambient temperature has been on the warm side the last few days, I only ran 2 threads per CPU. This core load allows a slightly higher per-core frequency of 3.73 GHz, compared to the factory 3.60 GHz all-core frequency. The 3930K has two GPUs running BRP6 and is set to a maximum of 4 threads for CPU applications. The 5960X is the only Windows system of the group, and I run Process Lasso to force the application to run no more than 1 thread per physical core. So even though this system has HT enabled, the Einstein CPU applications do not use it.
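As an aside (not from the post above): Process Lasso is Windows-only, but on Linux a similar one-thread-per-physical-core policy can be approximated with CPU affinity. A minimal launcher sketch, under the assumption that logical CPUs 0..N/2-1 sit on distinct physical cores and the HT siblings are N/2..N-1 (a common enumeration, but check `lscpu -e` first, since topologies differ):

```c
/* Minimal sketch: run a command restricted to one logical CPU per
 * physical core on a hyper-threaded machine.
 * ASSUMPTION: logical CPUs 0..ncpu/2-1 land on distinct physical cores
 * (HT siblings at k and k+ncpu/2) -- verify with `lscpu -e` first.
 * Build: cc -o pin pin.c      Run: ./pin <command> [args...]
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 2;
    }

    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    long phys = ncpu > 1 ? ncpu / 2 : 1;   /* assumed physical-core count */

    cpu_set_t set;
    CPU_ZERO(&set);
    for (long i = 0; i < phys; i++)        /* one logical CPU per core */
        CPU_SET(i, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Affinity is inherited across exec, so the child keeps the mask. */
    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}
```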
Unfortunately, the application does not perform very well on the AMD Bulldozer CPU, and this may simply be due to a limitation of that particular architecture. From everything I have read, AVX performance on Bulldozer CPUs is lackluster compared to SSE2, for example. I tested the base x64 application, which performed better than the AVX application on this particular CPU.
I would be curious to know whether there would be a performance improvement on the Bulldozer CPU if the application were recoded to use FMA4, although having to manage another application build is likely to be a hassle.
With my AMD Linux system, I used a tool called TurionPowerControl and set psmax 1 to force the maximum all-core turbo frequency of 2.7 GHz for my CPU model.
I'm a bit surprised, and suspect there is more to the difference between these two applications than mere X64-ness. 64-bit addressing unambiguously gives more efficient options for handling truly large working sets, and at the hardware level allows installation of larger memory, but I'd not expect this sort of considerable execution time benefit.
One thing not to forget is that code compiled to also run on 32-bit machines/OSes has access to fewer CPU registers. The previous GW searches used mostly hand-coded SSE/SSE2 in critical places that didn't make much (or any) use of the additional registers available in 64-bit mode, but the current GW app uses e.g. FFTW, and I suspect this module could get a boost from the extra registers in 64-bit mode.
I agree that cache size is a very significant factor, probably also because of the FFT part. IIRC, the FFTs done in the current GW search are of length about 2^17, single precision, complex-to-complex, in-place. So, order-of-magnitude-wise, the data per Fourier transform is about 2^17 x 2 x 4 bytes = 2^20 bytes, or 1 MB. And that is just for the FFT.
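To make that arithmetic concrete, here is a small standalone sketch of an FFT of the same shape, set up with single-precision FFTW. It is only an illustration of the working-set size, not the actual E@H search code:

```c
/* Sketch of the FFT shape discussed above: length ~2^17, single precision,
 * complex-to-complex, in-place.  Not the actual E@H code -- just an
 * illustration of the working-set size.  Build: cc fft.c -lfftw3f -lm
 */
#include <fftw3.h>
#include <stdio.h>

int main(void)
{
    const int n = 1 << 17;                       /* 131072 points */

    /* In-place c2c transform: input and output share one buffer. */
    fftwf_complex *buf = fftwf_malloc(sizeof(fftwf_complex) * n);
    fftwf_plan plan = fftwf_plan_dft_1d(n, buf, buf,
                                        FFTW_FORWARD, FFTW_ESTIMATE);

    /* Working set: n points x 2 floats (re, im) x 4 bytes = 1 MiB,
     * which is why per-core cache size matters so much here. */
    printf("FFT buffer: %d x %zu bytes = %zu KiB\n",
           n, sizeof(fftwf_complex), n * sizeof(fftwf_complex) / 1024);

    for (int i = 0; i < n; i++) { buf[i][0] = 1.0f; buf[i][1] = 0.0f; }
    fftwf_execute(plan);
    printf("DC bin after transform: %g\n", (double)buf[0][0]);  /* = n */

    fftwf_destroy_plan(plan);
    fftwf_free(buf);
    return 0;
}
```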
Other Einstein@Home searches, like the Fermi searches, use much longer Fourier transforms (a factor of about 16 or more, IIRC), which overflow the cache on any CPU; the GW working set, by contrast, sits right around typical per-core cache sizes, so this might explain why this search is more sensitive to cache size than other searches here.
I have three very old but rather similar (circa 2008) Core2-era hosts that a) all lack AVX support, of course, b) have fairly similar BOINC benchmark ratings, but c) differ a lot in task runtime:
Core 2 Quad Q8200 @ 2.33 GHz, 2 x 2 MB L2 cache (so only 1 MB per core), running 3 CPU tasks ==> ca 107k sec per task (!!)
Xeon E5405 @ 2.00 GHz (quad-core, dual CPU), 2 x 6 MB L2 cache per CPU (so 3 MB per core), running 7 CPU tasks ==> ca 70k sec per task
Core 2 Duo E8400 @ 3.00 GHz, 6 MB L2 cache (so 3 MB per core), running just one CPU task (the other core supports a GPU task) ==> ca 49k sec per task (and that with the 32-bit app so far). Not bad for a machine that old.
So yes, cache size (per running task) seems to matter a lot.
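To put those three data points side by side, here is a throwaway snippet tabulating L2 cache per running task against runtime (figures copied from the list above; with shared caches the per-task numbers are only approximate):

```c
/* Quick tabulation of L2 cache per running task vs. runtime, using the
 * figures quoted above (shared caches make "per task" only approximate). */
#include <stdio.h>

int main(void)
{
    struct { const char *host; double l2_total_mb; int tasks; double runtime_ks; }
    hosts[] = {
        { "Core 2 Quad Q8200",  4.0, 3, 107.0 },  /* 2 x 2 MB, shared in pairs */
        { "2 x Xeon E5405",    24.0, 7,  70.0 },  /* 2 CPUs x 2 x 6 MB         */
        { "Core 2 Duo E8400",   6.0, 1,  49.0 },  /* 6 MB, one task running    */
    };

    printf("%-20s %12s %12s\n", "host", "L2 MB/task", "runtime ks");
    for (int i = 0; i < 3; i++)
        printf("%-20s %12.2f %12.0f\n", hosts[i].host,
               hosts[i].l2_total_mb / hosts[i].tasks, hosts[i].runtime_ks);
    return 0;
}
```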
I can totally confirm this: cache size and, even more importantly, cache speed are the number one performance factors for the current GW app. At least on AMD CPUs.
First I noticed that my AMD Phenom II X4 running @ 3.6 GHz takes more time to finish one task than my Phenom II X6 running @ 3.36 GHz:
average 62k sec per task on the Phenom II X4 @ 3.6 GHz (4 tasks in parallel)
average 53k sec per task on the Phenom II X6 @ 3.36 GHz (4 tasks in parallel + 2 cores running other apps, including GPU support)
This was very interesting: these computers have almost exactly the same CPU core architecture - AMD K10.5 cores, both supporting SSE2 and neither supporting AVX - and the same cache organization and sizes (64 KB L1 per core, 512 KB L2 per core, 6 MB L3 shared in total). The only major differences are the number of cores (4 vs 6) and the core interconnect.
The PCs also share the same RAM size and speed (2x4 = 8 GB dual-channel DDR3-1600).
But the CPU with ~8% less clock speed was faster by ~17%, or ~25% faster per GHz.
First thought: WTF? Second: maybe it's the 32-bit vs 64-bit difference, because the X4 runs 32-bit Windows and the X6 runs 64-bit. But I have never seen such a big speedup from x64 - usually it's not above 5%, and here it's ~25%.
Only later did I remember that my X6 is actually overclocked by 20% via the bus clock: its nominal clock is 2.8 GHz (1055T model). And if you raise the bus clock on a Phenom II, it raises not only the core clock (as a simple multiplier change would), but the integrated north bridge and L3 cache speed at the same time.
So I ran a CPU info tool and checked the actual L3 speed:
nominal 2 GHz L3 / 3.6 GHz core on the X4
overclocked 2.2 GHz L3 / 3.36 GHz core on the X6
So the X6 actually has a +10% L3 speed bonus compared to the X4.
+10% L3 speed + x64 optimizations = 17% speed-up? Sounds plausible now!
At last I decided to do a final check: I overclocked the L3 on the X6 even more. I did it by changing only the L3 multiplier (my motherboard has a separate setting for almost every multiplier), so ALL other parameters remained the same: same core clock (3.36 GHz), same RAM clock (800/1600 MHz), same base bus clock (240 MHz) and same HT clock. Only the L3/NB clock changed, from 2.2 GHz to 2.4 GHz.
And now I see a nice GW app speed-up: from ~53k sec per task to under 49k sec per task on average.
It was almost linear in L3 cache speed: +9% L3 clock = +8% app speed.
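For anyone who wants to check the arithmetic, the quoted percentages fall straight out of the runtimes and clock speeds above:

```c
/* Sanity-checking the percentages quoted in the post above. */
#include <stdio.h>

int main(void)
{
    /* Raw comparison: X4 @ 3.60 GHz, 62 ks/task vs X6 @ 3.36 GHz, 53 ks/task. */
    double speedup = 62.0 / 53.0;                    /* ~1.17 -> ~17% faster   */
    double per_ghz = (62.0 * 3.60) / (53.0 * 3.36);  /* ~1.25 -> ~25% per GHz  */
    printf("X6 vs X4: %.0f%% faster, %.0f%% faster per GHz\n",
           (speedup - 1) * 100, (per_ghz - 1) * 100);

    /* L3 experiment: 2.2 -> 2.4 GHz L3, runtime 53 -> 49 ks, all else equal. */
    printf("L3 clock +%.0f%%  ->  app speed +%.0f%%\n",
           (2.4 / 2.2 - 1) * 100, (53.0 / 49.0 - 1) * 100);
    return 0;
}
```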
Thanks for going to all that trouble and for publishing your results.
I'm sure the Devs will be interested in your findings.
I have some old Phenom II X2s and have run them as X4s using the BIOS core unlocking feature built into certain motherboards. The boards are quite low end and don't have (that I've noticed) the ability to tweak the L3 multiplier. I'll probably take a closer look now :-). They tend to run hot anyway and the environment is rather warm so overclocking has been modest if at all. They're more productive when completely stable even if a little slower :-).
My hosts are hopefully near the end of a long hot summer, and in the much cooler months ahead there will be more scope for these sorts of experiments :-).
Once again, thanks for your contribution.

Cheers,
Gary.