I found the problem I was running into: I forgot to remove a configure option (--enable-maintainer-mode) that is needed when building the master branch directly. I can now build FFTW 3.3.2 by patching the two files in this commit: "Double precision Neon SIMD for aarch64", and patching the configure script accordingly.
I wouldn't suggest doing that. Simply add NEON_CFLAGS="-D__ARM_NEON__" as an option to FFTW's configure and you're ready; nothing more is needed. I've tried applying these patches too, and it's not worth the trouble: the result is the same as adding this flag.
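For anyone who wants to check what that flag changes, here is a minimal sketch in C. It only reports which of the two NEON feature macros are visible at compile time. The function name neon_macro_state is made up for this example. That the FFTW NEON sources of this version test the __ARM_NEON__ spelling is an assumption based on the flag helping; __ARM_NEON is the spelling from the ARM C Language Extensions.

```c
/* Reports which NEON feature macros were defined by the compiler or on the
   command line: bit 0 for __ARM_NEON, bit 1 for the legacy __ARM_NEON__.
   neon_macro_state is a hypothetical helper, not part of FFTW. */
int neon_macro_state(void)
{
    int state = 0;
#ifdef __ARM_NEON
    state |= 1;   /* ACLE spelling */
#endif
#ifdef __ARM_NEON__
    state |= 2;   /* older spelling, the one -D__ARM_NEON__ sets */
#endif
    return state;
}
```

Compiling this once with the plain CFLAGS and once with -D__ARM_NEON__ added should show whether the second bit only appears with the extra define.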
By applying the patch and patching the configure script I can basically use the same configure command as for armv7, which sets the option too. I'm also using the same Makefile as for armv7. Overall it's fewer changes, and they are more easily incorporated into our current build script.
I had no problem with binutils but had to add the build type to the configure commands for gsl and libxml because config.guess is too old.
Yes, that is also possible. I decided to change the version instead. Both are fine, I guess.
Quote:
einsteinbinary-Makefile:
- demod_binary_resamp_cpu.c: -ffp-contract=off
I don't understand this change. You mean I should add the compiler option to the demod_binary_resamp_cpu.o target?
Yes. Without this option you get many invalid results from this line: del_t[i] = params->tau * sinValue * params->step_inv - params->S0;
I think the problem is that AArch64 supports fused multiply-subtract, while ARMv7 only has fused multiply-accumulate. So this line sometimes gives you different values, and the resampling is really sensitive to this.
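To illustrate why a fused and an unfused multiply-subtract can disagree, here is a small sketch in C. It is not the project's code: the function names are made up, and float is used only so that a double can hold the exact product and stand in for the unrounded intermediate of a fused instruction.

```c
/* Multiply, round the product to float, then subtract: what separate
   MUL and SUB instructions compute. The volatile store keeps the compiler
   from contracting the two steps into one fused instruction. */
float mul_then_sub(float a, float b, float c)
{
    volatile float product = a * b;   /* rounded to float here */
    return product - c;
}

/* Stand-in for a fused multiply-subtract: the product of two floats is
   exact in double, so the product itself is not rounded to float before
   the subtraction. */
float exact_mul_sub(float a, float b, float c)
{
    double product = (double)a * (double)b;
    return (float)(product - (double)c);
}
```

With a = b = 1 + 2^-13 and c = 1 + 2^-12, the rounded product equals c, so mul_then_sub returns exactly 0, while exact_mul_sub keeps the 2^-26 that the rounding threw away. A difference of that kind is what could push a sensitive step like the resampling to different values.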
Quote:
FFTW-Version:
3.3.3 seems to be the fastest, but 3.3.2 also works.
I'm currently running a version (via app_info) that was built using 3.3.2 to get a baseline, but will then try one using 3.3.3.
You should note here that 3.3.3 only gives you better times if wisdom is provided.
Thanks for the explanation. I will incorporate those into my next iteration of testing. Do you already have a wisdom file? If you send it to me I can put that into our git too.
First test run with "stock" app (fftw 3.3.2, no ffp-contract change):
4 tasks of which 1 valid, 1 inconclusive, 2 waiting on wingman - Runtime ~48k sec (13 h) with all 4 running at the same time
The 2 in progress are using the same app but with fftw 3.3.3 and the ffp-contract change.
I'm also not using the -mcpu=cortex-a53+simd compiler flag (yet) because I want to have a generic app. Right now we can't target specific ARM CPUs with BOINC. Let's see what performance gain I get when I activate this in one of the next tests.
That seems to be identical to my tests: ~48 ks.
Switching FFTW to out-of-place should give you results of ~25 ks.
Applying the wisdom should get you to ~17-18 ks.
The rest are my changes to the resampling.
The -ffp-contract=off makes it a bit slower. That's why I only applied it to the resampling.
Maybe we can avoid this by making some variables volatile and changing the order a little bit, to force GCC to emit a MUL and a SUB instead of an MLS. That way we would still have the speed benefits of the MLA instruction; switching to -ffp-contract=off eliminates those too.
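The volatile idea could look roughly like the sketch below. It is only a sketch: resamp_params is a hypothetical stand-in for the real parameter struct, the float field types are an assumption, and del_t_unfused is a made-up name.

```c
/* Hypothetical stand-in for the parameter struct; only the fields used in
   the del_t line are modelled, and their types are assumptions. */
typedef struct {
    float tau;
    float step_inv;
    float S0;
} resamp_params;

/* Computes tau * sinValue * step_inv - S0 with the product stored and
   rounded before the subtraction, so the compiler has to emit a separate
   multiply and subtract instead of one fused instruction. */
float del_t_unfused(const resamp_params *params, float sinValue)
{
    volatile float product = params->tau * sinValue * params->step_inv;
    return product - params->S0;
}
```

The volatile store costs an extra store and load for this one value, but the rest of the file could then be built without -ffp-contract=off. Whether that is really faster overall would have to be measured.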
And now I'm back to 48k sec when running 4 tasks concurrently, even with the wisdom file. My next test is to remove the ffp-contract change and add the mcpu change to see whether compiling for a specific CPU gives a speed improvement.
Have you switched to an out-of-place FFT? My wisdom is for out-of-place. If you use it for in-place, it will just be ignored.
Looking at your last run times, it doesn't seem that it was successful?
I would guess that your FFTW patch doesn't activate NEON properly.
Maybe you could try my method (NEON_CFLAGS="-D__ARM_NEON__"), just for testing?
If that doesn't help I can send you my build.sh and Makefile. (But they are only quick-and-dirty modifications.)
I've also finished my resampling, but I want to write a little explanation and I don't have the time. My wife has the late shift this week, so I have to work, buy food, cook, walk the dog, ... But I'll send it to you at the weekend.
Sorry for the delay.
Yes, I already have a wisdom file, but it's for an out-of-place FFT. We have 2 GB of RAM on the C2 and we need ~208 MB per task, so I think that's okay.
It was computed over 8 days directly from a "dummy" BRP app.
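As a rough illustration of why out-of-place costs more memory, here is a sketch in C of the buffer sizes for a single-precision real-to-complex transform. That the app uses this kind of transform is an assumption, the function names are made up, and the ~208 MB per task figure above is not derived from this.

```c
#include <stddef.h>

/* Bytes needed for an n-point single-precision real-to-complex transform
   with separate input and output arrays: n floats in, n/2 + 1 complex
   values (two floats each) out. */
size_t r2c_out_of_place_bytes(size_t n)
{
    return n * sizeof(float) + (n / 2 + 1) * 2 * sizeof(float);
}

/* Bytes needed when one buffer is reused: the real input is padded to
   2 * (n/2 + 1) floats so the complex output fits into the same array. */
size_t r2c_in_place_bytes(size_t n)
{
    return 2 * (n / 2 + 1) * sizeof(float);
}
```

For large n the out-of-place layout needs roughly twice the FFT buffer memory of the in-place one, which is the trade-off being accepted here.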
I have switched to out-of-place FFT now and am also trying to cool the C2 better in case it is downclocking (which I couldn't verify).