As a follow up to my last post, I thought that I would show everyone a pretty graph with FFMpeg's FFT performance benched against FFTW for powers of two. I think it's obvious who the clear winner is from the image below.
Just for clarification purposes, 'fftw3' means 'fftw without neon simd optimizations', 'fftw3n' means 'fftw with inline asm neon simd optimizations', and 'fftwni' means 'fftw with gcc intrinsic neon simd optimizations'.
All versions of FFTW were compiled with CFLAGS="-O3 -pipe -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -fno-tree-vectorize". You can get the source at gitorious.org/gsoc2010-fftw-neon.
Now, a few comments.
First of all, NEON intrinsics perform almost identically to inline asm in the 'codelet' department, which really just goes to show exactly what one can expect from GCC's NEON optimization process (i.e. it is not optimized).
Secondly, well optimized assembler will almost always outperform what the compiler is capable of (arguably even when there is a hardware optimizer, see OoOE). The NEON coprocessor does not have an OoOE execution unit (although Cortex-A9 has one for the multiple ARM cores), which means that optimization with NEON is entirely up to the compiler and the programmer.
Lastly, FFMpeg's implementation is only for powers of two (even with very large radices), which means that the way non-power-of-two transforms would be computed with FFMpeg, would be to first be zero-pad the input. Zero-padding is also possible with FFTW, but it must be done before making the call to fftw*_plan. In other words, FFTW does not explicitly zero-pad data (as far as I'm aware, but I could be wrong). From what I understand (based on some random memory fragment suggesting that I read this somewhere a long time ago), programs such as Matlab actually do zero-pad all data before passing it to the FFTW library.
So, for the purists out there, or those who use embedded devices with (really!) small amounts of RAM, this is where FFTW becomes very attractive. FFTW uses several well published algorithms specifically for non-power-of-two problems to efficiently compute transforms without zero-padding. An alternate interpretation is that it's better for large input signal lengths (in the order of 2^27+1), if one were that naive, since the next power-of-two could max out available RAM. That's up for some debate anyway, and I don't know very many people who would do that on an ARM device (thankfully).
Now, the next plan is to determine how to augment the existing plans that FFTW has, to use similarly optimized power-of-two algorithms to those from FFMpeg. Then, the next step would be to tackle some of the non-power-of-two algorithms and try to rewrite them in NEON assembler. According to FFTW, even those non-power-of-two algorithms are O(NlogN), so perhaps (just perhaps), they will show a similar level of performance when written directly with ASM.
In short, I now have a visual depiction of what I should be shooting for. I will keep you posted on how well that goes.
PS: If you're wondering who Mru is, he's indeed my mentor for GSOC and maintainer for ARM portions of FFMpeg.