Optimizing FFTW for NEON-Enabled ARM Devices: Some Insight

Well, this was a very busy weekend for me because I was moving out of my apartment here in Kiel. I'll be couch-surfing here for another two weeks before heading back to Montreal. Amid the chaos I took some time, lying down on the floor in my completely white-washed apartment with no furniture, to read some FFTW documentation and check out some of the more interesting corners of its codebase.

The documentation was actually quite helpful, and it let me identify some key areas to focus on over the next week.

SIMD Integration

Explore how SIMD integration interacts with fftw_malloc (data alignment, etc.); this should be fairly straightforward.
Assemble codelets for the default array format. Hopefully the 'stride' setting of load/store multiple actually works (the last time I tried it, it did not, to my disappointment); otherwise,
provide an Advanced / Guru interface that accepts arrays in a pre-arranged format (as well as a bijective mapper between the default and pre-arranged formats).
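
To make the 'pre-arranged format' idea a bit more concrete, here is a minimal sketch of such a bijective mapper in plain C. The blocked layout (four real parts followed by four imaginary parts), the BLOCK constant, and both function names are assumptions of mine for illustration; none of them come from FFTW. The default format is taken to be interleaved single-precision complex data (re, im, re, im, ...).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block width: four floats fill one 128-bit NEON register. */
#define BLOCK 4

/* Default layout (interleaved):   re0 im0 re1 im1 re2 im2 re3 im3 ...
 * Assumed pre-arranged layout:    re0 re1 re2 re3 im0 im1 im2 im3 ...
 * n is the number of complex elements and must be a multiple of BLOCK. */
void interleaved_to_blocked(const float *in, float *out, size_t n)
{
    assert(n % BLOCK == 0);
    for (size_t b = 0; b < n; b += BLOCK) {
        for (size_t k = 0; k < BLOCK; k++) {
            out[2 * b + k]         = in[2 * (b + k)];     /* real part */
            out[2 * b + BLOCK + k] = in[2 * (b + k) + 1]; /* imaginary part */
        }
    }
}

/* Inverse mapping: undoes interleaved_to_blocked() exactly. */
void blocked_to_interleaved(const float *in, float *out, size_t n)
{
    assert(n % BLOCK == 0);
    for (size_t b = 0; b < n; b += BLOCK) {
        for (size_t k = 0; k < BLOCK; k++) {
            out[2 * (b + k)]     = in[2 * b + k];
            out[2 * (b + k) + 1] = in[2 * b + BLOCK + k];
        }
    }
}
```

Passing an array through both functions in turn gives back the original data, which is what makes the mapping bijective. If the strided loads do work as hoped, the same shuffle would happen in registers and this extra interface would not be needed.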

FFTW Wisdom

FFTW's planner tries out several possible methods for computing the DFT and chooses the fastest one. The planner results can be stored as 'Wisdom' for later reuse.
The planner relies on direct access to hardware cycle counters, which are present on most supported platforms (although reportedly not on certain embedded platforms); otherwise gettimeofday() is used, which carries a major overhead cost.
Look into the cycle-counter code and see whether it can be implemented with any of the ARM timers, or whether an actual hardware cycle counter exists that can be accessed from userspace.
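
As a rough way to put a number on that overhead, the sketch below times a batch of gettimeofday() calls. The sketch_ticks type and all three function names are hypothetical, and POSIX clock_gettime(CLOCK_MONOTONIC) is only a portable stand-in for the hardware cycle counter; on the ARM target, sketch_getticks() would be replaced by whatever counter turns out to be readable from userspace.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical tick type; here one tick is one nanosecond. */
typedef uint64_t sketch_ticks;

/* Stand-in for a cycle-counter read, using the POSIX monotonic clock. */
sketch_ticks sketch_getticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (sketch_ticks)ts.tv_sec * UINT64_C(1000000000)
         + (sketch_ticks)ts.tv_nsec;
}

/* Ticks elapsed between two readings, as a double. */
double sketch_elapsed(sketch_ticks t1, sketch_ticks t0)
{
    return (double)(t1 - t0);
}

/* Average cost of one gettimeofday() call, in ticks (nanoseconds here).
 * Returns 0.0 when asked for zero calls. */
double gettimeofday_overhead_ns(unsigned calls)
{
    struct timeval tv;
    if (calls == 0)
        return 0.0;
    sketch_ticks t0 = sketch_getticks();
    for (unsigned i = 0; i < calls; i++)
        gettimeofday(&tv, NULL);
    sketch_ticks t1 = sketch_getticks();
    return sketch_elapsed(t1, t0) / calls;
}
```

Calling gettimeofday_overhead_ns(100000) on the target board and comparing the result with the run time of a small transform should show whether the fallback timer is precise enough for the planner's measurements.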

Perform some rough initial optimization of double-precision operations using the VFP and load/store multiple (which requires alignment).
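
Since alignment comes up both here and in the SIMD item, here is a minimal sketch of allocating and checking an aligned buffer of doubles. It uses C11 aligned_alloc as a stand-in for fftw_malloc; the 16-byte value of SIMD_ALIGN and both function names are assumptions for illustration, not what FFTW does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment requirement: 16 bytes (one 128-bit register). */
#define SIMD_ALIGN 16

/* Returns nonzero if p sits on a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocates room for n doubles on a SIMD_ALIGN boundary.
 * C11 aligned_alloc wants the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release the buffer with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + SIMD_ALIGN - 1) / SIMD_ALIGN * SIMD_ALIGN;
    return aligned_alloc(SIMD_ALIGN, bytes);
}
```

An assert(is_aligned(buf, SIMD_ALIGN)) at the top of a hand-written kernel is a cheap way to catch buffers that did not come from an aligned allocator.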

I should also mention that the NEON co-processor is (of course) incapable of double-precision floating-point, which is the default for FFTW. As far as I'm aware, the only way to 'accelerate' double-precision calculations would be to do them 'softly' in the DSP (any other suggestions?). So the NEON work described above applies to single-precision floating-point only. My secondary goal of leveraging the DSP for (many) double-precision computations should address this problem; otherwise the VFP must be used.

Monday, May 31, 2010
