Over the last week, I did some more benchmarking, and also began working ffmpeg_fft into my benchmarks for a good target to aim for. I'm currently working out a segfault in the ffmpeg_fft library (specifically in ff_fft_permute_neon.S) and hope to have a few more pretty graphs to show for next time. I added fftwni to the benchmarks in order to compare codelets coded in inline asm versus codelets coded in neon intrinsics (also started using -O3 for -finline-functions), and the results were very close for fftwn (inline neon asm) and fftwni (inline neon intrinsics). The one exception I made was that vtrn was coded in inline asm to prevent compiler errors revolving around being unable to spill registers. Also, I'm changing the graphs to directly show cycles (or time) instead of mflops. The next major step will be working power-of-two fft 'algorithms' into the fftw planner. After that I'll be tackling the much more interesting non-power-of-two algorithms. By directly coding specific algorithms in asm, we're hoping to achieve a greater speedup than what is possible with codelets. The asm routines do not necessarily need to completely displace the codelets from the point of view of the planner (at least not without hard proof that they are always faster), so they can possibly augment the pool of algorithms available to the planner.
I will need to have some serious crunch time for my thesis as well, since the submission deadline is in 19 days (!), so you might not see as many commits in my repositories this week. Please feel free to look at the latest in main and misc.