Correction: the x86-64 Linux client, version 5.8.11, can be downloaded from boinc_5.8.11_x86_64-pc-linux-gnu.tgz (make sure to copy both files to the BOINC working directory). The new x64 Windows client, version 5.8.11, by Crunch3r, can be found at boinc_5.8.11_windows_amd64.zip.
I would really like to see x86_64 supported by Einstein in some form. A native app would be best, but in the short term a 32-bit app issued to 64-bit clients would still be good.
I tried just about everything to get the app working with an app info on my C2D, with no luck, and it's stopping my two fastest cores from crunching Einstein during AA6.
I'm not sure if they made it into the official version, but Akos was experimenting with hotloops that used both SSE and 387 instructions to process data in parallel. If they were put into the deployed version, disabling the 387 would be a significant performance hit for an x86-64 native app.
The current Einstein App uses this method, i.e. doing "more contributing" parts of the calculation in high precision (80bit on FPU) while doing the rest in single precision (SSE). For the current setup doing everything in single precision isn't precise enough.
This complicated way of calculation, btw, is the reason why I couldn't simply compile a (native) 64bit App of the current code.
We are working on the code for S5R2, and it looks like it will become a lot cleaner, and probably everything in the "inner loop" can be done in single precision, so it will be a little faster and it should also be easier to build native 64bit Apps (yes, we do care).
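To illustrate the general idea of mixing precisions (this is only a minimal sketch, not the actual Einstein hot loop): the inputs stay in single precision, and only the accumulation is carried in a wider type. The function names are made up for the example. Note that `long double` is the 80-bit x87 format with GCC on x86 Linux, but it is the same as `double` with MSVC, so the extra precision depends on the compiler.

```c
#include <stddef.h>

/* Everything in single precision: small terms can get lost
   once the running sum becomes large. */
float sum_single(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Mixed precision: single precision inputs, but the accumulator
   (the part that contributes most to the error) is kept wide. */
long double sum_mixed(const float *x, size_t n)
{
    long double acc = 0.0L;
    for (size_t i = 0; i < n; i++)
        acc += (long double)x[i];
    return acc;
}
```

With one large value followed by many small ones, `sum_mixed` keeps the small contributions that `sum_single` may round away.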
Good to know!
But let me correct you in that although SSE supports only single-precision, SSE2 supports double-precision too. Of course, if Einstein really needs to use x87's extended-precision 80-bit, that's the only way to go.
And in case someone is wondering whether using SSE/SSE2 code side-by-side with x87 code is faster, it isn't, as both SSE/SSE2 and x87 share the same FPU, only through different interfaces.
I know that there is SIMD support for double precision, but 1) there are (or at least were at the time of coding) many more machines that could do SSE but couldn't run SSE2 than machines that could run both, and 2) (re-)aligning the data for double precision SIMD calculation ate up all the speed we would gain from doing just the four FPU calculations in two double precision SSE2 calculations. It simply wasn't worth the effort. [Edit] Modern CPUs with their "virtually two FPUs" (another interface to the same physical unit) will combine the FPU calculations for us anyway.
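For readers who haven't seen double precision SIMD, here is a rough sketch of what "two double precision SSE2 calculations" and the alignment requirement look like with compiler intrinsics. It is not the project's code (that is hand-written assembler); `add_pd` is a made-up name, and the sketch assumes an even element count and 16-byte aligned arrays. Where the compiler doesn't define `__SSE2__`, it falls back to a plain loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* out[i] = a[i] + b[i] for n doubles.
   Assumption: n is even and a, b, out are 16-byte aligned. */
void add_pd(const double *a, const double *b, double *out, size_t n)
{
#if defined(__SSE2__)
    /* One 128-bit register holds two doubles, so each step does two adds. */
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_load_pd(a + i);   /* aligned load */
        __m128d vb = _mm_load_pd(b + i);
        _mm_store_pd(out + i, _mm_add_pd(va, vb));
    }
#else
    /* Scalar fallback for builds without SSE2. */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
#endif
}
```

If the data isn't already laid out in aligned pairs, it has to be copied or rearranged first, and that is the kind of overhead that can eat the gain.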
This is all a bit technical for me, but is it safe to assume that any CPU currently capable of 64 bit supports at least SSE2?
In other words, would the problem with SSE vs SSE2 support be irrelevant for a 64 bit app?
I too have a 64bit (Core 2 Duo) machine that normally runs 64 bit Ubuntu. It's temporarily running Windows so it can participate at Einstein but I would much rather run 64 bit Linux (so it can do a lot of work at projects that have a fast 64 bit app).
Hardware support would be 100% for SSE2, but that wouldn't change the complexity of the software and of having to maintain more concurrent versions of it. The SSE to SSE2 port would still require just as much effort to carry out, and if sufficiently different, more work to maintain as well. AFAIK the only major difference across the codebase for different platforms is the x86 versions having assembler hotloops instead of C++. Different alignment requirements would require more widespread changes, and from the Akos client days of S4 there was extremely little performance gained from the change.
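As a small illustration of the "100% for SSE2" point: SSE2 is part of the x86-64 baseline, so a 64-bit x86 build can simply assume it, while a 32-bit build has to ask the CPU. The sketch below is only an example; `have_sse2` is a made-up name, and the run-time branch relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`.

```c
/* Returns 1 if SSE2 can be used, 0 if not (or if this sketch can't tell). */
int have_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is part of the x86-64 baseline, so no check is needed. */
    return 1;
#elif defined(__i386__) && defined(__GNUC__)
    /* 32-bit x86 with GCC or Clang: ask the CPU at run time. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    /* Non-x86 CPU, or a compiler this sketch doesn't cover. */
    return 0;
#endif
}
```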
I actually made an SSE2 version once, not modifying the "hot loop", but other parts of the program (sin/cos LUT). It didn't gain much on some CPUs and was much slower on others (Akos said there _might_ be some advantage on Woodcrests). And yes, it required rearranging the data for a larger part of the program. At that time, the hassle of maintaining (and deploying) yet another version of the code wasn't worth the minimal speedup on only a few CPUs.
For the techs: for the current Apps we maintain four ("production") versions of the source code for the central function (BOINC and graphics are C++, the rest is plain vanilla C):
- Hand-coded Assembler used for all x86 CPUs capable of SSE
- Hand-coded Assembler for x87 calculations (for x86 CPUs that can't do SSE)
- An AltiVec version using Motorola's C/C++-API to AltiVec instructions
- A generic C version that runs on all other CPUs such as G3, MIPS and SPARC
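One way to picture keeping several versions of one central function side by side is a small table of function pointers. This is purely a sketch: the post doesn't say whether the real selection happens at build time or at run time, all names here are hypothetical, and every entry points at the same generic C routine because the real SSE, x87 and AltiVec versions aren't shown in this thread.

```c
#include <stddef.h>

typedef float (*hot_loop_fn)(const float *x, size_t n);

struct variant {
    const char *name;   /* label for logging */
    hot_loop_fn fn;     /* implementation of the central function */
};

enum cpu_kind { CPU_X86_SSE, CPU_X86_X87, CPU_PPC_ALTIVEC, CPU_OTHER };

/* Stand-in for the generic C version: a plain sum of squares. */
static float hot_loop_generic(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}

/* Placeholders: a real build would plug the hand-coded routines in here. */
static const struct variant variants[] = {
    [CPU_X86_SSE]     = { "sse-asm", hot_loop_generic },
    [CPU_X86_X87]     = { "x87-asm", hot_loop_generic },
    [CPU_PPC_ALTIVEC] = { "altivec", hot_loop_generic },
    [CPU_OTHER]       = { "generic", hot_loop_generic },
};

const struct variant *select_variant(enum cpu_kind kind)
{
    return &variants[kind];
}
```

Every extra entry in such a table is another implementation that has to be kept correct, tested and deployed, which is the maintenance cost mentioned above.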
Update on project applications:
* Chess960 (Linux)
* ABC (Linux)
* ABC ß (Linux & Windows)
* Predictor (Linux)
* RieselSieve (Linux)
* 32-bit Application Sent to AMD64 Clients
* HashClash (Linux & Windows)
* Leiden (Linux)
* Malaria (Linux)
* Docking (Linux)
* RieselSieve (Windows)
* WCG (Linux)
* Pirates (Linux)
For more information, see BoincStats Forum.
HTH
It's not too complicated to get the current Einstein app running under AMD64 Linux. Details are highlighted in this thread.
Metod ...
Thanks for the detailed explanation.
Then again, it shouldn't be too hard to send the 32-bit application to the 64-bit clients, as the number of projects already doing this confirms.