update-2010-08-20: you can download a consolidated patch of my fftw work (includes all of my changesets) from here. Alternatively, clone my git repository. You can find build instructions here.
Below are my benchmarks from today. They're really only final in terms of the final ones I'm submitting for GSOC. Yes, you might very well notice that 'fftwff' (i.e. ffmpeg called through fftw) dominates in most cases where power of two transforms are used. However, you should also notice, that fftw provides a nice dispatching mechanism for allowing one to use ffmpeg's blazingly fast transforms in order to compute higher-order transforms with a single api. Pretty nice!
As I already mentioned, these benchmarks do not contain the neon-simd optimized strided memory transfers that I started rather late in the game, but you can probably imagine, that for any composite transform (i.e. if N != m^p, or any transform of higher dimensionality) those memory transfer routines will improve the performance of ALL versions ('fftw3' 'fftwni' and 'fftwn', 'fftw3ff') by a good factor of 2x, at least. After that, once a couple of straight-forward small-prime routines (i.e. 3, 5, 7, 11) are implemented in neon-asm, then the non-power-of-two (npot) performance will begin to approach the power-of-two (pot) performance, which is ideal. That is currently what makes fftw so attractive for x86 architectures - the fact that it provides O(N log N) performance for all lengths, not just powers of two, and without zero-padding (for most transforms of significant length).
Please feel free to check out the raw data on Google Docs (bench-1d-pot, bench-1d-npot, bench-2d-pot, bench-2d-npot, bench-3d-pot, bench-3d-npot). Also, just in case it isn't clear, when 'mflops' is on the y-axis, a higher value is better, and when log10(time(s)) is on the y-axis, a smaller value is better.
Some remarks: ffmpeg does not have api calls for any length that is not a power of two, nor does it have api calls for higher-order (e.g. NxN) transforms, which is why its missing in some of the graphs below. Furthermore, ffmpeg (and most all other libraries) do not support transforms of vector data, and since I added the ffmpeg interface to fftw, it does now :-). In this sense, calling ffmpeg through fftw leverages the excellent planning and dispatching routines of fftw to provide the absolute best possible performance. Eventually, it will be unlikely that anyone would directly used ffmpeg's transforms other than for decoding multimedia - but I think that's best left unchanged. Ffmpeg is likely the best library publicly available for multimedia.
In summary, my improvements to fftw include anything marked 'fftw3n' (neon simd, asm), 'fftw3ni' (neon simd, intrinsics), or 'fftw3ff' (neon simd, ffmpeg). Actually, even those that are marked 'fftw3' are still improved from their original values due to the added cycle counter... and you can count on me adding those missing features once my thesis is submitted ;-)