The documentation was actually quite helpful, and it pointed me to some key areas to focus on over the next week.
- SIMD Integration
- explore how this interacts with fftw_malloc (data alignment, etc.); should be fairly straightforward
- assemble codelets for the default array format. Hopefully the 'stride' setting of load/store multiple actually works (last time I tried, it did not, to my disappointment); otherwise
- provide an Advanced / Guru interface that accepts arrays in a pre-arranged format (as well as a bijective mapper between the default and pre-arranged formats)
- FFTW's planner tries out several possible methods for computing the DFT and chooses the fastest one. The planner results can be stored as 'Wisdom' for later reuse.
- it relies on direct access to hardware cycle counters, which are present on most supported platforms (although the documentation notes that this is not the case on certain embedded platforms); otherwise gettimeofday() is used, which adds significant overhead.
- look into the cycle-counter code and see if it can be implemented with any of the ARM timers, or if an actual cycle counter exists in hardware that can be accessed by userspace
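To make the 'bijective mapper' idea from the SIMD Integration notes a bit more concrete, here is a minimal sketch in plain C. It assumes the default format is interleaved single-precision complex data (re, im, re, im, ...) and that the pre-arranged format groups four real parts followed by four imaginary parts, so that each group matches one 128-bit NEON register. That blocked layout, the LANES constant, and all function names are hypothetical and only for illustration; the real pre-arranged format may well end up looking different.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: four single-precision lanes per 128-bit NEON register. */
#define LANES 4

/* Index of the real part of complex element i in the blocked layout:
   [re0 re1 re2 re3 | im0 im1 im2 im3 | re4 re5 re6 re7 | im4 ...] */
static size_t blocked_re_index(size_t i)
{
    return (i / LANES) * (2 * LANES) + (i % LANES);
}

/* The imaginary part sits exactly one register width after the real part. */
static size_t blocked_im_index(size_t i)
{
    return blocked_re_index(i) + LANES;
}

/* Default (interleaved) -> pre-arranged (blocked).
   n is the number of complex elements and must be a multiple of LANES;
   in and out each hold 2 * n floats and must not overlap. */
static void interleaved_to_blocked(const float *in, float *out, size_t n)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; ++i) {
        out[blocked_re_index(i)] = in[2 * i];
        out[blocked_im_index(i)] = in[2 * i + 1];
    }
}

/* Pre-arranged (blocked) -> default (interleaved); the exact inverse. */
static void blocked_to_interleaved(const float *in, float *out, size_t n)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; ++i) {
        out[2 * i]     = in[blocked_re_index(i)];
        out[2 * i + 1] = in[blocked_im_index(i)];
    }
}
```

Calling interleaved_to_blocked() followed by blocked_to_interleaved() on the same length should return the original array, which is what makes the mapping bijective. If strided load/store multiple does turn out to work, this copy would only be needed at the boundary of the Advanced / Guru interface, not inside the codelets.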
I should also mention that the NEON co-processor is (of course) incapable of double-precision floating-point, which is the default for FFTW. As far as I'm aware, the only way to 'accelerate' double-precision calculations would be to do them 'softly' on the DSP (any other suggestions?). So everything I describe above applies to single-precision floating-point only. My secondary goal of leveraging the DSP for (many) double-precision computations should address this problem; otherwise VFP must be used.