Compiling BRP for AARCH64-Linux

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188503424

RAC: 215307

17 Jun 2016 13:17:52 UTC

Message 138523 in response to message 138521

Quote:

Quote:
I found the problem I was running into. I forgot to remove a configure option (--enable-maintainer-mode) that is needed when building the master branch directly. I can now build FFTW 3.3.2 by patching the two files from this commit ("Double precision Neon SIMD for aarch64") and patching the configure script accordingly.

I wouldn't suggest doing that. Simply pass NEON_CFLAGS="-D__ARM_NEON__" as an option to FFTW's configure and you're ready; nothing more is needed. I tried applying these patches too, and it's not worth the trouble. The result is the same as adding this flag.

By applying the patch and patching the configure script I can basically use the same configure command as for armv7, which sets the option too. I'm also using the same Makefile as for armv7. Overall it's fewer changes, and they are more easily incorporated into our current build script.
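
A possible reason why the flag alone is enough (an assumption, not confirmed in this thread): the FFTW 3.3.x NEON code seems to test for the ARMv7-era macro `__ARM_NEON__`, while AArch64 compilers typically predefine only `__ARM_NEON`. A minimal sketch for checking what a given compiler defines; `neon_macro_status` and `print_neon_macro_status` are hypothetical helper names, not part of FFTW or the BRP sources:

```c
#include <stdio.h>

/* Hypothetical helper: reports which NEON feature macro, if any,
 * is visible to the preprocessor for the current build. */
const char *neon_macro_status(void)
{
#if defined(__ARM_NEON__)
    return "__ARM_NEON__ defined";
#elif defined(__ARM_NEON)
    return "only __ARM_NEON defined";
#else
    return "no NEON macro defined";
#endif
}

/* Prints the status, e.g. from the main() of a small test program. */
void print_neon_macro_status(void)
{
    printf("%s\n", neon_macro_status());
}
```

Compiling this once with and once without `-D__ARM_NEON__` shows which branch code guarded by that macro would take.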

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

17 Jun 2016 15:01:37 UTC

Message 138524 in response to message 138522

Quote:

Quote:
But besides that, here are the minimum changes to get a working copy:
build.sh:
- BINUTILS: 2.26
- GSL-Version: 1.16
- LIBXML-Version: 2.9.3

I had no problem with binutils but had to add the build type to the configure commands for gsl and libxml because config.guess is too old.

Yes, that is also possible. I decided to change the versions instead. Both are fine, I guess.

Quote:

Quote:
einsteinbinary-Makefile:
- demod_binary_resamp_cpu.c: -ffp-contract=off

I don't understand this change. You mean I should add the compiler option to the demod_binary_resamp_cpu.o target?

Yes, without this option you get many invalid results from this line:
del_t[i] = params->tau * sinValue * params->step_inv - params->S0;
I think the problem is that AARCH64 supports fused multiply-subtract, while ARMv7 only has fused multiply-accumulate. So this line sometimes gives you different values, and the resampling is really sensitive to this.
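
To illustrate why contraction can change the result: a fused multiply-subtract does not round the product before subtracting, while separate MUL and SUB instructions do. The following is a minimal sketch, not the BRP code. The function names are hypothetical, and the fused variant is only emulated by computing in double, where the product of two floats is exact.

```c
#include <stdio.h>

/* Unfused: the product is rounded to float before the subtraction,
 * as with separate MUL and SUB instructions. The volatile store keeps
 * the compiler from contracting the two operations. */
float mul_then_sub(float a, float b, float c)
{
    volatile float product = a * b;
    return product - c;
}

/* Emulated fused multiply-subtract: the product of two floats is exact
 * in double, so it enters the subtraction unrounded. Illustration only,
 * not the hardware instruction itself. */
float fused_emulated(float a, float b, float c)
{
    return (float)((double)a * (double)b - (double)c);
}

/* a*b = 1 + 2^-11 + 2^-24 exactly, which rounds to 1 + 2^-11 in float.
 * The unfused version therefore returns 0, the fused one 2^-24. */
void show_difference(void)
{
    float a = 0x1.001p+0f;  /* 1 + 2^-12 */
    float c = 0x1.002p+0f;  /* 1 + 2^-11 */
    printf("unfused: %a\n", mul_then_sub(a, a, c));
    printf("fused:   %a\n", fused_emulated(a, a, c));
}
```

In the quoted line, the last multiplication (by `params->step_inv`) and the subtraction of `params->S0` are the pair a compiler could contract in this way.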

Quote:

Quote:
FFTW-Version:
3.3.3 seems to be the fastest, but 3.3.2 also works.
I'm currently running a version (via app_info) that was built using 3.3.2 to get a baseline, but will then try one using 3.3.3 next.

Note that 3.3.3 only gives you better times if wisdom is provided.

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188503424

RAC: 215307

17 Jun 2016 16:17:32 UTC

Message 138525

Thanks for the explanation. I will incorporate those into my next iteration of testing. Do you already have a wisdom file? If you send it to me I can put it into our git too.

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

18 Jun 2016 6:40:16 UTC

Message 138526 in response to message 138525

Yes, I already have a wisdom file, but it's for an out-of-place FFT. We have 2 GB of RAM on the C2 and need ~208 MB per task, so I think that's okay.

const char * EMBEDDED_WISDOM =
"(fftw-3.3.3 fftwf_wisdom #x4a633eef #xb5a95564 #x91014bdd #x9c85ce5f"
"  (fftwf_codelet_n2fv_32_neon 1 #x31bff #x31bff #x0 #x77f8eb56 #x9b3243ea #xcfaa4341 #x0775bcaa)"
"  (fftwf_codelet_t3fv_16_neon 1 #x1040 #x1040 #x0 #xe303c5b2 #xc2ae2214 #xa72ab5f4 #x6995a04c)"
"  (fftwf_codelet_t3fv_32_neon 0 #x31bff #x31bff #x0 #x78aaf7d5 #x0ead6a1d #x9ea9500c #xfe4649ee)"
"  (fftwf_codelet_hc2cbdftv_12_neon 0 #x30bff #x30bff #x0 #xed7d2717 #x182d4499 #xb650d8bf #xc3ac709e)"
"  (fftwf_codelet_n1fv_32_neon 0 #x31bff #x31bff #x0 #x1377ac76 #x486b5979 #x85f7d06e #x99d80ab3)"
"  (fftwf_codelet_r2cb_12 2 #x30bff #x30bff #x0 #x0cd86b8c #xa5bb5bdf #xa6841a6f #x2b51bb34)"
"  (fftwf_dft_vrank_geq1_register 0 #x31bff #x31bff #x0 #x10aa5232 #xccc0b4f9 #x63b0f397 #x4046d871)"
"  (fftwf_codelet_t1fv_12_neon 0 #x1040 #x1040 #x0 #x79756f0e #x66d09426 #xc0f7c2b4 #x261a84b3)"
"  (fftwf_ct_genericbuf_register 0 #x30bff #x30bff #x0 #x09ffadfd #xdb59a068 #x0745df6d #xd58d3904)"
"  (fftwf_codelet_r2cfII_12 2 #x31bff #x31bff #x0 #x3bf1ef07 #x3d06dd3e #x565dfc8a #x2b7c20c9)"
"  (fftwf_dft_vrank_geq1_register 0 #x1040 #x1040 #x0 #x74dd935c #x94ceb996 #x09d11935 #x41c5b235)"
"  (fftwf_codelet_r2cf_12 2 #x31bff #x31bff #x0 #xf0420ce7 #xc918dcf0 #x03aac9b2 #x16107661)"
"  (fftwf_codelet_hc2cfdftv_2_neon 0 #x1040 #x1040 #x0 #xf2e21ede #xa7926244 #x904b58ef #x5516abdd)"
"  (fftwf_dft_vrank_geq1_register 0 #x1040 #x1040 #x0 #x592190c3 #x7ce845cd #xb138a247 #xbaa61ebe)"
"  (fftwf_codelet_n2fv_32_neon 1 #x30bff #x30bff #x0 #x77f8eb56 #x9b3243ea #xcfaa4341 #x0775bcaa)"
"  (fftwf_codelet_t1buv_8_neon 0 #x30bff #x30bff #x0 #x89648b87 #x4428e205 #x9c1eb28f #x0e1b59df)"
"  (fftwf_codelet_r2cfII_2 2 #x1040 #x1040 #x0 #x7d17401b #xbead8c34 #x59d0bca0 #x4ce1dcef)"
"  (fftwf_dft_vrank_geq1_register 0 #x1040 #x1040 #x0 #xfe6deb84 #x4f26ad5c #xb890d5fc #xa90ed671)"
"  (fftwf_codelet_r2cbIII_12 2 #x30bff #x30bff #x0 #xfb67d341 #x537f52c4 #xbaa6c92c #x64c28e12)"
"  (fftwf_dft_vrank_geq1_register 0 #x31bff #x31bff #x0 #x098ff363 #x2e742041 #xf8ba4623 #x3d99eadb)"
"  (fftwf_ct_genericbuf_register 0 #x31bff #x31bff #x0 #xfe3a0fe3 #xb55c134b #x0645bd4a #xf197f7c6)"
"  (fftwf_dft_vrank_geq1_register 0 #x30bff #x30bff #x0 #x792a7736 #x4fc700e1 #xe3e5f7fa #x7534e533)"
"  (fftwf_dft_vrank_geq1_register 0 #x31bff #x31bff #x0 #xb02371f5 #xa5458024 #x6d46a518 #x009c8e76)"
"  (fftwf_codelet_t3fv_32_neon 0 #x31bff #x31bff #x0 #xb8f247fc #xb8fa53ba #x7d5cec88 #x6a2cc555)"
"  (fftwf_dft_vrank_geq1_register 0 #x30bff #x30bff #x0 #x2d4f7b39 #xe89c78f3 #xf04db27c #x71312c69)"
"  (fftwf_codelet_t3fv_16_neon 1 #x1040 #x1040 #x0 #x4c7d44eb #xc8fdb88f #x5c58f633 #x11913e40)"
"  (fftwf_codelet_r2cf_2 2 #x1040 #x1040 #x0 #x7f491169 #x040dd9bd #xd46830ed #x3084e984)"
"  (fftwf_codelet_t3fv_32_neon 0 #x30bff #x30bff #x0 #xb8f247fc #xb8fa53ba #x7d5cec88 #x6a2cc555)"
"  (fftwf_codelet_t3fv_32_neon 1 #x1040 #x1040 #x0 #x6401dea5 #xb86b1548 #x336ceb05 #x5ea75d6c)"
"  (fftwf_codelet_n2fv_64_neon 1 #x1040 #x1040 #x0 #x6174f23c #xfb0fa51d #xd769129d #xfb18817d)"
"  (fftwf_codelet_hc2cfdftv_12_neon 0 #x31bff #x31bff #x0 #xf2e21ede #xa7926244 #x904b58ef #x5516abdd)"
"  (fftwf_dft_vrank_geq1_register 0 #x30bff #x30bff #x0 #xb02371f5 #xa5458024 #x6d46a518 #x009c8e76)"
"  (fftwf_codelet_n1bv_128_neon 0 #x30bff #x30bff #x0 #x5551ea8e #x0745fccd #x49992db0 #x34d9d629)"
")";

Computed over 8 days directly from a "dummy" BRP app.
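
The wisdom string starts with a version tag (`fftw-3.3.3` here). As far as I know, FFTW only accepts wisdom written by a matching library version, so a build linked against 3.3.2 would probably not use it; treat that as an assumption. A minimal sketch of a sanity check on the header; `wisdom_has_version` is a hypothetical helper, and the real import would go through FFTW's `fftwf_import_wisdom_from_string`:

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the wisdom string starts with
 * "(<expected_version> ", e.g. "(fftw-3.3.3 ", and 0 otherwise. */
int wisdom_has_version(const char *wisdom, const char *expected_version)
{
    size_t len = strlen(expected_version);

    if (wisdom[0] != '(')
        return 0;
    if (strncmp(wisdom + 1, expected_version, len) != 0)
        return 0;
    /* The version tag must be followed by a space, not more characters. */
    return wisdom[1 + len] == ' ';
}
```

For the string above, `wisdom_has_version(EMBEDDED_WISDOM, "fftw-3.3.3")` would return 1, and the same call with `"fftw-3.3.2"` would return 0.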

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188503424

RAC: 215307

18 Jun 2016 7:12:30 UTC

Message 138527

First test run with "stock" app (fftw 3.3.2, no ffp-contract change):
4 tasks of which 1 valid, 1 inconclusive, 2 waiting on wingman - Runtime ~48k sec (13 h) with all 4 running at the same time

The 2 in progress are using the same app but with fftw 3.3.3 and the ffp-contract change.

I'm also not using the -mcpu=cortex-a53+simd compiler flag (yet) because I want to have a generic app. Right now we can't target specific ARM CPUs with BOINC. Let's see what performance gain I get when I activate this in one of the next tests.

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

18 Jun 2016 8:26:39 UTC

Message 138528 in response to message 138527

Seems to be identical to my tests: ~48 ks.
Switching FFTW to out-of-place should give you results of ~25 ks.
Apply the wisdom and you should get ~17-18 ks.
The rest comes from my changes to the resampling.

The ffp-contract=off option makes it a bit slower. That's why I only applied it to the resampling.
Maybe we can avoid this by making some variables volatile and changing the order a little, to force GCC to emit a MUL and a SUB instead of an MLS. That way we would still have the speed benefits of the MLA instruction; switching to ffp-contract=off eliminates those too.
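
The volatile idea could look roughly like this. It is a sketch under assumptions, not the actual BRP code: only the field names `tau`, `step_inv` and `S0` come from the line quoted earlier, while the struct name `resamp_params`, the function `compute_del_t`, the float types and the precomputed `sin_values` array are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the BRP parameter struct; the float
 * types are assumed. */
typedef struct {
    float tau;
    float step_inv;
    float S0;
} resamp_params;

/* Fills del_t[i] from precomputed sine values. The volatile temporary
 * forces the product to be rounded and stored before the subtraction,
 * so the compiler cannot merge the last multiply with the subtract,
 * even when contraction is enabled for the rest of the file. */
void compute_del_t(float *del_t, const float *sin_values, size_t n,
                   const resamp_params *params)
{
    for (size_t i = 0; i < n; i++) {
        volatile float product =
            params->tau * sin_values[i] * params->step_inv;
        del_t[i] = product - params->S0;
    }
}
```

The volatile access adds a store and a load per iteration, so whether this ends up faster than building the whole file with -ffp-contract=off would have to be measured.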

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188503424

RAC: 215307

20 Jun 2016 9:43:14 UTC

Message 138529

And now I'm back to 48k sec when running 4 tasks concurrently, even with the wisdom file. My next test is to remove the ffp-contract change and add the mcpu change to see whether compiling for a specific CPU gives a speed improvement.

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

21 Jun 2016 7:11:49 UTC

Message 138530 in response to message 138529

Quote:

And now I'm back to 48k sec when running 4 tasks concurrently, even with the wisdom file. My next test is to remove the ffp-contract change and add the mcpu change to see whether compiling for a specific CPU gives a speed improvement.

Have you switched to an out-of-place FFT? My wisdom is for out-of-place. If you use it for in-place, it will just be ignored.

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188503424

RAC: 215307

21 Jun 2016 13:47:02 UTC

Message 138531

I have now switched to out-of-place FFT and am also trying to cool the C2 better in case it is downclocking (which I couldn't verify).

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

22 Jun 2016 8:34:43 UTC

Message 138532 in response to message 138531

Looking at your latest run times, it doesn't seem to have been successful?
I would guess that your FFTW patch doesn't activate NEON properly.

Maybe you could try my method (NEON_CFLAGS="-D__ARM_NEON__"), just for testing?
If that doesn't help I can send you my build.sh and Makefile. (But they are only modified in a quick-and-dirty way.)

I've also finished my resampling, but I want to write a little explanation and I don't have the time. My wife has the late shift this week, so I have to work, buy food, cook, walk the dog, ... But I will send it to you at the weekend.
Sorry for the delay.
