Research
I am currently working on automatic code generation and optimization for parallel implementations of linear transform libraries, targeting forthcoming parallel processors and systems. Specifically, I am generating optimized libraries for the Cell Broadband Engine (Cell BE) processor. The need for an automatic code generation system is motivated on the Spiral webpage.
Most current high-performance vectorized, parallel implementations of the discrete Fourier transform (DFT) and its variants for the Cell BE are hand-coded and manually tuned for a single transform and size, typically work only on two-power sizes, and are constrained to a single complex data format (interleaved or split complex). Spiral can now generate high-performance code for DFTs of both two-power and non-two-power sizes, using various complex data formats. It can also generate code designed to run on a specified number of SPEs.
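To make the two complex data formats concrete, here is a minimal Python sketch of converting between them. It is only an illustration of the layouts, not code produced by Spiral, and the function names interleaved_to_split and split_to_interleaved are hypothetical names chosen for this example.

```python
def interleaved_to_split(buf):
    """Convert [re0, im0, re1, im1, ...] into ([re0, re1, ...], [im0, im1, ...])."""
    if len(buf) % 2 != 0:
        raise ValueError("an interleaved buffer must hold an even number of values")
    # Even positions hold real parts, odd positions hold imaginary parts.
    return list(buf[0::2]), list(buf[1::2])


def split_to_interleaved(re, im):
    """Convert separate real and imaginary arrays into one interleaved array."""
    if len(re) != len(im):
        raise ValueError("real and imaginary arrays must have the same length")
    out = [0.0] * (2 * len(re))
    out[0::2] = re  # real parts go to even positions
    out[1::2] = im  # imaginary parts go to odd positions
    return out
```

In the interleaved format each complex number occupies two adjacent slots, while the split format keeps all real parts in one array and all imaginary parts in another. A library fixed to one layout forces callers using the other to pay for a conversion like this, which is one reason generating code for either format is useful.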
Each SPE on a 3.2 GHz Cell BE processor has a theoretical peak performance of 25.6 Gflops. A sampling of the performance currently achieved using Spiral-generated code is presented here. In addition to being able to generate code for transforms of arbitrary sizes, Spiral produces code whose performance is comparable to, or vastly exceeds, that of existing libraries. Spiral-generated parallel libraries use an SPMD (Single Program, Multiple Data) model.
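The 25.6 Gflops figure can be reproduced with simple arithmetic, sketched below in Python. The breakdown assumes single precision, a 4-way SIMD unit, and one fused multiply-add (counted as 2 flops) per lane per cycle. The second function uses the common 5 n log2(n) operation-count convention for reporting DFT performance; that convention is an assumption here, since this page does not state how its results are normalized. Both function names are hypothetical.

```python
import math


def spe_peak_gflops(clock_ghz=3.2, simd_lanes=4, flops_per_lane=2):
    """Theoretical single-precision peak of one SPE, in Gflops.

    Assumes one fused multiply-add (2 flops) per SIMD lane per cycle.
    """
    return clock_ghz * simd_lanes * flops_per_lane


def dft_pseudo_gflops(n, runtime_seconds):
    """Pseudo-Gflops for a size-n complex DFT, using the 5*n*log2(n) convention.

    This is a normalized rate, not an exact count of executed operations.
    """
    return 5.0 * n * math.log2(n) / runtime_seconds / 1e9
```

With the default arguments, spe_peak_gflops() evaluates to 3.2 × 4 × 2 = 25.6. Dividing a measured pseudo-Gflops value by this peak (times the number of SPEs used) gives a rough efficiency estimate for a given run.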
The approach used by Spiral is abstracted at a level that will allow it to be applied to other shared-memory, multicore, and distributed-memory architectures.