Maybe now, after almost two years, it's time to think again about support of AVX / FMA (3/4)?
With AVX, Haswell can send 256-bit vectors into the pipeline each clock tick, whereas Ivy Bridge still needed two 128-bit vectors over two clocks. Intel slides say they want to widen this to 512 bits in two years or so. Sounds like something worth using, if it's not too much hassle.
And, with all due respect, for the next few years AVX support will likely gain you more throughput than Einstein@Android.
MrS
Scanning for our furry friends since Jan 2002
Any updates after one more year?
MrS
Later BOINC clients (i.e. 7.3 and 7.4) also report the AVX feature if the CPU supports it.
Paging Bernd for any updates
Fair question.
Currently the foundations are being laid so that the next-generation GW app will use AVX (actually the best SIMD architecture available on a given host). I think we might see this on E@H later this year.
The BRP app is mainly intended for GPUs now, and I guess we won't touch the CPU code.
The FGRP app (gamma-ray pulsar search in Fermi/LAT data) could benefit from an AVX-enabled FFT.
Cheers
HB
RE: "the best SIMD architecture available on a given host"
That would be the ideal solution :)
I'm curious: what do you have to do to realize this?
MrS
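For readers wondering what "the best SIMD architecture available on a given host" could look like in practice, one common approach is runtime dispatch: detect the CPU features once at startup and pick a matching kernel through a function pointer. The following is only a minimal sketch under stated assumptions, not how the E@H apps actually do it. The names `scale_fn`, `kernel`, `pick_kernel` and the `scale_*` functions are hypothetical, the SSE2 and AVX kernels are plain-C placeholders, and the detection relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` on x86.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: multiply n floats by a, in place. */
typedef void (*scale_fn)(float *x, size_t n, float a);

/* Portable fallback that works on any host. */
static void scale_generic(float *x, size_t n, float a)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}

/* Placeholders (assumption): a real build would put these in separate
 * source files compiled with -msse2 / -mavx and write them with
 * intrinsics. Here they simply forward to the generic loop. */
static void scale_sse2(float *x, size_t n, float a) { scale_generic(x, n, a); }
static void scale_avx(float *x, size_t n, float a)  { scale_generic(x, n, a); }

/* A kernel together with a label for logging. */
typedef struct {
    const char *name;
    scale_fn    fn;
} kernel;

/* Pick the widest kernel the host reports, falling back step by step. */
kernel pick_kernel(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return (kernel){ "avx", scale_avx };
    if (__builtin_cpu_supports("sse2"))
        return (kernel){ "sse2", scale_sse2 };
#endif
    return (kernel){ "generic", scale_generic };
}
```

A caller would run `pick_kernel()` once, log `name`, and then call `fn` in the hot loop. Doing the selection once keeps the per-call cost to a single indirect call.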
The time has come. The next GW run will surely support AVX, won't it? ;-)
The new search on the advanced-generation LIGO detector data has an AVX app, although currently only for Linux.
MrS
If you implement AVX, make sure you have a way to deny 256-bit wide AVX to AMD Bulldozer and Piledriver processors and instead serve either SSE3 or 128-bit wide AVX plus FMA4 to those processors, unless you can show that 256-bit AVX meets a special case. See http://www.agner.org/optimize/ for why this should be done in most cases.
The only advantage I can see to sending 256-bit AVX to those processors is when the entire working set fits in the 256-bit registers but not in the 128-bit registers. If it fits in neither, 128-bit AVX and SSE3 are faster than 256-bit AVX because the 256-bit registers perform horrendously when they have to be written out to memory, especially on Piledriver. If it fits in both, 128-bit AVX or SSE is better because a 256-bit instruction takes two of the four shared decoders to decode, while a 128-bit instruction uses just one.
Bulldozer's set of four shared instruction decoders also has trouble with 256-bit AVX instructions, each of which must be split into two 128-bit instructions: the set can only split one such instruction per clock cycle, so a second 256-bit instruction could stall the decoder set.
Steamroller fixes these problems, so you should serve 256-bit wide AVX with optional FMA4 to this processor with no problem. I would expect the same for Excavator.
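The rule described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested policy: the names `simd_path`, `choose_path` and `choose_path_for_host` are hypothetical, and the cut-off "AMD family 15h with model below 30h means Bulldozer or Piledriver" is my reading of AMD's model numbering and should be checked against AMD's documentation before relying on it. The family and model are decoded from CPUID leaf 1 in the usual way (extended fields added to the base fields), and the host query uses `__get_cpuid` from the GCC/Clang header `<cpuid.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

typedef enum { PATH_SSE, PATH_AVX128, PATH_AVX256 } simd_path;

/* Display family from CPUID leaf 1 EAX: the extended family is only
 * added when the base family field is 0xF. */
static unsigned display_family(uint32_t eax)
{
    unsigned base = (eax >> 8) & 0xF;
    unsigned ext  = (eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Display model: the extended model forms the high nibble when the
 * base family is 0x6 or 0xF. */
static unsigned display_model(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned model       = (eax >> 4) & 0xF;
    unsigned ext_model   = (eax >> 16) & 0xF;
    if (base_family == 0x6 || base_family == 0xF)
        return (ext_model << 4) | model;
    return model;
}

/* Pure decision function, so it can be checked without the hardware. */
simd_path choose_path(const char *vendor, uint32_t leaf1_eax, int has_avx)
{
    assert(vendor != NULL);
    if (!has_avx)
        return PATH_SSE;
    /* Assumption: family 15h models below 30h are Bulldozer/Piledriver;
     * Steamroller (30h and up) and Excavator get the 256-bit path. */
    if (strcmp(vendor, "AuthenticAMD") == 0 &&
        display_family(leaf1_eax) == 0x15 &&
        display_model(leaf1_eax) < 0x30)
        return PATH_AVX128;
    return PATH_AVX256;
}

/* Query the running host; falls back to PATH_SSE when CPUID is not usable. */
simd_path choose_path_for_host(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return PATH_SSE;
    memcpy(vendor,     &ebx, 4);  /* vendor string is stored as EBX, EDX, ECX */
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return PATH_SSE;
    __builtin_cpu_init();
    return choose_path(vendor, eax, __builtin_cpu_supports("avx") != 0);
#else
    return PATH_SSE;
#endif
}
```

Keeping the decision in `choose_path` separate from the CPUID query means the special case asked for above (allowing 256-bit AVX on these chips when the working set only fits in the wider registers) could be added as one more parameter without touching the detection code.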