Monday, May 31, 2010

Some Insight

Well, this was a very busy weekend for me because I was moving out of my apartment here in Kiel. I'll be couch-surfing here for another two weeks before heading back to Montreal. Amid the chaos I took some time, lying on the floor of my completely whitewashed apartment with no furniture, to read some FFTW documentation and check out some of the more interesting corners of its codebase.

The documentation actually turned out to be quite a good resource, and it helped me identify some key areas to focus on for the next week.

  1. SIMD Integration
    • explore how this interacts with fftw_malloc (data alignment, etc.); should be fairly straightforward
    • assemble codelets for the default array format. Hopefully the 'stride' setting of load/store multiple actually works (last time I tried, it did not, to my disappointment); otherwise,
    • provide an Advanced / Guru interface that accepts arrays in a pre-arranged format (as well as a bijective mapping between the default and pre-arranged formats)
  2. FFTW Wisdom
    • FFTW's planner tries out several possible methods for computing the DFT and chooses the fastest one. The planner results can be stored as 'Wisdom' for later reuse.
    • it relies on direct access to hardware cycle counters, which are present on most supported platforms (although they say this is not the case on certain embedded platforms); otherwise gettimeofday() is used, which carries a major cost in overhead
    • look into the cycle-counter code and see if it can be implemented with any of the ARM timers, or if an actual cycle counter exists in hardware that can be accessed by userspace
  3. Perform some very initial rough optimization of double-precision operations using the VFP and load/store multiple (requires alignment).

I should also mention that the NEON co-processor is (of course) incapable of double-precision floating-point, which is the default precision for FFTW. As far as I'm aware, the only way to 'accelerate' double-precision calculations would be to do them 'softly' in the DSP (any other suggestions?). So everything I'm speaking about above applies to single-precision floating-point only. My secondary goal of leveraging the DSP for (many) double-precision computations should address this problem; otherwise the VFP must be used.

Tuesday, May 4, 2010

Hello, GSOC world!

What better way to start off the fftw-neon blog than with a greeting in the best of K&R traditions?

In case you were wondering, the reason that this blog exists is to document my progress through the summer of 2010, working on a project funded through the Google Summer of Code. Specifically, the intention of this project is to improve performance of FFTW for NEON-enabled ARM processors.

What is FFTW, you ask? Well, to explain that, I should probably first direct interested readers to a few articles which cover the basics of the Fourier Transform (FT), the Discrete Fourier Transform (DFT), and the Fast Fourier Transform (FFT). What all of these transforms have in common is that they represent signals in the frequency domain as opposed to the time domain. What kind of signals? Well... essentially everything in nature that is measurable can be expressed as a signal: music, speech, mechanical waves in a structure, voltages in an electrical circuit, radio waves, heartbeats, earthquakes, brainwaves, and so on.

The DFT and FFT differ from the FT in that they operate on digital (i.e. discrete) information that can be processed by a digital computer. Both take the time-domain representation of a signal and convert it to a frequency-domain representation (often called a spectrum), but the FFT performs this transformation far faster than a direct DFT evaluation while producing identical results (up to rounding). The FFT takes advantage of certain symmetries in the DFT to eliminate redundant calculations; the classic formulation requires signal lengths that are a power of 2.

Finally, to answer the original question: FFTW is a finely tuned software library that leverages certain hardware in modern computers to perform the FFT even faster. This is typically achieved by performing several operations in parallel or in sequence, or by reducing the number of accesses over the main memory bus. FFTW is probably most famous for its use in Matlab, GNU Octave, and GNU Radio, although it has certainly seen several other important applications.

To date, FFTW has mainly been optimized for x86 and PowerPC processors. However, as mobile devices become more advanced (reaching GHz clock speeds), we are seeing ARM processors move out of the traditional embedded market and into more general computing devices; the mobile processors of today are computing powerhouses compared to those of 10 years ago. ARM-powered devices are also now available in the laptop / netbook form factor that has traditionally been dominated by the x86 architecture. There has even been recent mention of multi-core ARM devices moving into the server market.

This is where NEON comes into play. The ARM Cortex-A8 was the first core to implement the Advanced SIMD instruction set (NEON), which can perform several floating-point operations (and even more integer operations) in parallel. Combining that with vector instructions for loading and storing data, a significant speedup is possible over the generic code generated by GCC. Applying those optimizations to FFTW's codebase will enable the next generation of ARM devices (in both the handheld and netbook formats) to be used for scientific computing, or really any application that uses the DFT / FFT.

Well, that's it for my initial post. If you're interested in following the progress of this project, be sure to follow this blog. Alternatively, feel free to subscribe to the mailing list and monitor the git repository for new code.