Both my Linux and Win64 hosts are doing well, with no invalids after dozens of units. The Windows machine has about twice the runtime of the Fedora one (but now more consistently so). Specifically, there are no units over 25 hours.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
In cleaning up the "tuning" run (which this thread is originally about) we just granted credit for the results of early app versions (before 1.04) that "lost" validation to the results of a 1.04 version.
Being able to do this was the only reason we kept these workunits in the DB; we'll now purge that run from the system.
BM
Thank you! It worked for three of the four tasks I crunched, but one did not get credit. Any reason for that?
So if the app uses FFTW, maybe it would be easy to use cuFFTW (the CUDA FFTW compatibility mode) to do some offloading.
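For readers who haven't met it: cuFFTW is NVIDIA's FFTW-compatible wrapper around cuFFT, so in the simplest case an FFTW-based program keeps its fftw_* calls and only swaps the header and the link line. A minimal sketch of the idea (not Einstein@Home code; real use would need error handling and can hit unsupported corner cases):

    /* build with e.g.:  nvcc cufftw_sketch.c -lcufftw -lcufft
       instead of:       gcc  fftw_version.c  -lfftw3 -lm      */
    #include <cufftw.h>   /* instead of <fftw3.h> */

    int main(void)
    {
        int n = 4096;     /* arbitrary example transform length */
        double       *in  = (double *)       fftw_malloc(sizeof(double) * n);
        fftw_complex *out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * (n / 2 + 1));

        fftw_plan plan = fftw_plan_dft_r2c_1d(n, in, out, FFTW_ESTIMATE);
        for (int i = 0; i < n; i++)
            in[i] = (double) i;
        fftw_execute(plan);   /* the transform now runs on the GPU via cuFFT */

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }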
Actually all our searches on Einstein@Home (Binary Pulsar Search, Fermi Gamma-Ray Pulsar Search and the GW search) use the FFT, and all of them would benefit from offloading the FFT to the GPU.
However, the Binary Radio Pulsar search code is by far the most optimized for GPU, we get a speed-up (with GPUs compared to CPU only) well greater than 10 (depending on the individual GPU and CPU of course). For the GW search, the FFT part of the computation takes only roughly half the computing time for CPUs, so offloading this to the GPU can at most speed up the computation by a factor of 2.
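(Aside, for intuition: that factor-of-2 limit is just Amdahl's law. If a fraction f of the runtime is spent in the FFT and the FFT alone is sped up by a factor s, the overall speedup is

    speedup = 1 / ((1 - f) + f / s)

which with f ~ 0.5 stays below 1 / (1 - f) = 2 even for an infinitely fast GPU FFT.)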
So currently, the best use for the GPUs on E@H is to do the Binary Radio Pulsar search, and that search only.
We may change this decision later depending on science priorities, though.
For those who experience surprisingly poor performance of the GW search on their hardware (say more than 14 hrs with a recent CPU), and who like to experiment a bit, there is a "hidden" way to force the app to try a bit harder to fine-tune the FFT computation to their particular hardware.
You can set two environment variables so that the E@H science app sees them (e.g. you could define them systemwide for Windows or in the startup options for BOINC on Linux):
env. variable               value
=====================================
LAL_FSTAT_FFT_PLAN_MODE     PATIENT
LAL_FSTAT_FFT_PLAN_TIMEOUT  120
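One possible way to set them (a sketch only; the exact place depends on how BOINC is started on your machine, since the science app simply inherits the client's environment):

On Windows, from a command prompt (then restart BOINC so the new values are picked up):

    setx LAL_FSTAT_FFT_PLAN_MODE PATIENT
    setx LAL_FSTAT_FFT_PLAN_TIMEOUT 120

On Linux, in the shell or startup script that launches the BOINC client:

    export LAL_FSTAT_FFT_PLAN_MODE=PATIENT
    export LAL_FSTAT_FFT_PLAN_TIMEOUT=120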
This will tell FFTW to spend (roughly) up to two minutes (120s) just on optimizing the FFT computation for your particular hardware. You can play around with even longer durations.
We do not expect this to have a dramatic effect on most hosts, and it can even lead to slightly worse runtimes in some cases, so we did not enable this by default. It might help, though, on some hosts where the default settings lead to very suboptimal runtimes.
HB
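For those curious what these two settings roughly translate to inside the app, here is a sketch in terms of the standard FFTW planner API (an illustration only, not the actual Einstein@Home code; the transform length n is made up):

    #include <fftw3.h>

    /* Illustration: FFTW_PATIENT planning with a ~120 s planning budget. */
    int main(void)
    {
        int n = 1 << 22;                          /* example transform length */
        double       *in  = fftw_alloc_real(n);
        fftw_complex *out = fftw_alloc_complex(n / 2 + 1);

        fftw_set_timelimit(120.0);                /* cf. LAL_FSTAT_FFT_PLAN_TIMEOUT */
        fftw_plan plan = fftw_plan_dft_r2c_1d(n, in, out,
                                              FFTW_PATIENT);  /* cf. ..._PLAN_MODE */

        for (int i = 0; i < n; i++)               /* fill input after planning;     */
            in[i] = 0.0;                          /* PATIENT overwrites the arrays  */
        fftw_execute(plan);

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }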
Does it mean you're currently not thinking about releasing a GW GPU application, because the current priority for GPUs is BRP?
Do you know how much of the GW code could be ported to GPUs, or a very approximate possible speed-up factor on GPUs?
Correct, there are no GPU plans for the O1 GW search. As I wrote, the speedup would be limited to roughly a factor of 2 for a rather straightforward offloading of the FFT (compared to a factor of >>10 for the BRP app, which is by now almost completely running on the GPU).
I'm quite sure the other parts of the computation (besides FFT) can also be ported to GPUs, but we have no plans to do that in the near future.
Thanks for the clarification.
Hoping you'll change your decision later ;-) BRP6 should be finished this year, depending on how much BRP4G work there is (which is currently quite a lot), so I suppose a new GPU app will be required...
....This will tell FFTW to spend (roughly) up to two minutes (120s) just on optimizing the FFT computation for your particular hardware. You can play around with even longer durations.
HB
Does FFTW perform this check for each work unit as it starts, or just once per hardware system, which it then remembers for that system in the future?
Thanks!