# Algorithm and Architecture Optimizations for Large Size Two Dimensional Discrete Fourier Transform

www.spiral.net

Berkin Akin, Peter Milder, Franz Franchetti and James Hoe, Carnegie Mellon University, Pittsburgh PA USA, {bakin, pam, franzf, jhoe}@ece.cmu.edu


### Overview

Large size 2D Fast Fourier Transform

- Used in image processing and scientific computing

Typical datasets are *large* and *high precision*!

### Memory access pattern and achieved bandwidth

- The row-column 2D-FFT has large strided DRAM access patterns
- Strided accesses do not exploit DRAM row-buffer locality
- This results in *low* memory bandwidth utilization!

### 1024-by-1024 double precision 2D-FFT



- e.g. 2K-by-2K double precision 2D-FFT:
  - Input dataset: 64 MB
  - Number of operations: ~461.4 Mflop
- Does not fit on-chip, so it is stored *off-chip*
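The two figures above can be reproduced with a short calculation. This is a sketch under two assumptions that the poster does not state explicitly: each point is a complex number made of two 8-byte doubles, and the operation count uses the common 5·N·log2(N) estimate for an N-point FFT. The function name `fft2d_footprint` is hypothetical.

```python
import math

def fft2d_footprint(n, bytes_per_real=8):
    """Return (dataset bytes, estimated flops) for an n-by-n complex 2D-FFT.

    Assumptions (not stated on the poster): complex data stored as two
    reals, and the conventional 5 * N * log2(N) flop estimate.
    """
    points = n * n
    dataset_bytes = points * 2 * bytes_per_real   # real + imaginary part
    flops = 5 * points * math.log2(points)        # N = n * n total points
    return dataset_bytes, flops

size_bytes, flops = fft2d_footprint(2048)
print(size_bytes / 2**20, "MB")          # dataset size in units of 2**20 bytes
print(round(flops / 1e6, 1), "Mflop")    # estimated operation count
```

Under these assumptions the result matches the 64 MB and ~461.4 Mflop quoted above.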




- Memory bandwidth becomes the **bottleneck** for achieving high performance
- Effective *bandwidth orchestration* is required for:
  1. Performance
  2. Bandwidth Efficiency
  3. Power Efficiency

### Background

### **DFT** is matrix-vector multiplication

 $y = \text{DFT}_n \cdot x, \quad \text{DFT}_n = [e^{-2\pi i k\ell/n}]_{0 \le k, \ell < n}$ 
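As an illustration of the definition above, the following minimal sketch builds the DFT matrix entry by entry and applies it to a vector. It uses only the Python standard library; the helper names `dft_matrix` and `matvec` are hypothetical, and the O(n²) product is for clarity only, not for performance.

```python
import cmath

def dft_matrix(n):
    """n-by-n matrix with entries exp(-2*pi*i*k*l/n), 0 <= k, l < n."""
    return [[cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n)]
            for k in range(n)]

def matvec(a, x):
    """Plain matrix-vector product y = a * x."""
    return [sum(a_kl * x_l for a_kl, x_l in zip(row, x)) for row in a]

x = [1, 2, 3, 4]
y = matvec(dft_matrix(4), x)
print([complex(round(v.real, 9), round(v.imag, 9)) for v in y])
```

For this input the exact result is (10, -2+2i, -2, -2-2i), up to floating-point rounding.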

**FFT** algorithm is factorization of DFT matrix

$$\mathrm{DFT}_4 = \underbrace{(\mathrm{DFT}_2 \otimes \mathrm{I}_2)}_{\text{tensor}} \; \underbrace{T^4_2}_{\text{twiddle factors}} \; \underbrace{(\mathrm{I}_2 \otimes \mathrm{DFT}_2)}_{\text{tensor}} \; \underbrace{L^4_2}_{\text{permutation}}$$
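The factorization can be checked numerically. The sketch below assumes the standard Cooley-Tukey form for size 4: a twiddle matrix diag(1, 1, 1, ω) with ω = e^{-2πi/4}, and a stride permutation that reorders (x0, x1, x2, x3) into (x0, x2, x1, x3). All helper names are hypothetical and the matrices are plain nested lists.

```python
import cmath

def dft(n):
    return [[cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n)]
            for k in range(n)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def kron(a, b):
    """Kronecker (tensor) product of two matrices."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Twiddle factors: diag(w**(i*j)) for i, j in {0, 1}, with w = exp(-2*pi*i/4).
w = cmath.exp(-2j * cmath.pi / 4)
diag = [w ** (i * j) for i in range(2) for j in range(2)]
twiddle = [[diag[r] if r == c else 0 for c in range(4)] for r in range(4)]

# Stride permutation: output element r reads input element perm[r].
perm = [0, 2, 1, 3]
stride_perm = [[1 if c == perm[r] else 0 for c in range(4)] for r in range(4)]

product = matmul(matmul(matmul(kron(dft(2), identity(2)), twiddle),
                        kron(identity(2), dft(2))), stride_perm)
reference = dft(4)
max_err = max(abs(product[r][c] - reference[r][c])
              for r in range(4) for c in range(4))
print("max deviation from DFT_4:", max_err)
```

The deviation should be at the level of floating-point rounding, which indicates that the four factors multiply out to the dense DFT matrix.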

**2D-FFT** operates on 2D data, *e.g. images* 

## **2D-FFT algorithms**

Row column algorithm:

 $\mathrm{DFT}_{n \times n} = (\mathrm{DFT}_n \otimes \mathrm{I}_n)(\mathrm{I}_n \otimes \mathrm{DFT}_n)$ 

Row-wise and column-wise accesses!
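The row-column algorithm can be sketched as two passes of 1D transforms, which makes the two access patterns explicit. This is a minimal illustration with a naive O(n²) 1D DFT on a small input, not the authors' implementation; the function names are hypothetical. With row-major storage, the factor I_n ⊗ DFT_n corresponds to the unit-stride row pass and DFT_n ⊗ I_n to the stride-n column pass.

```python
import cmath

def dft1d(x):
    """Naive 1D DFT, used here only for clarity."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dft2d_row_column(image):
    rows_done = [dft1d(row) for row in image]                  # row-wise pass
    cols_done = [dft1d(col) for col in transpose(rows_done)]   # column-wise pass
    return transpose(cols_done)

def dft2d_direct(image):
    """2D DFT straight from the definition, as a reference."""
    n = len(image)
    return [[sum(image[r][c] * cmath.exp(-2j * cmath.pi * (k * r + l * c) / n)
                 for r in range(n) for c in range(n))
             for l in range(n)] for k in range(n)]

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
fast = dft2d_row_column(image)
ref = dft2d_direct(image)
err = max(abs(fast[k][l] - ref[k][l]) for k in range(4) for l in range(4))
print("max deviation:", err)
```

The column pass is the one that turns into large strided accesses when the data sits in DRAM in row-major order.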



### **DRAM operation**

- Need to make use of every row touched to *maximize bandwidth*
- Large strides result in *small packets* of transferred data

#### DDR2-800 Bandwidth on DE4 (per channel)

Bandwidth [GB/s] vs. Packet size [KB]













### **Solution: Algorithm and Architecture**

### **Restructured algorithm**

- Linear data mapping in DRAM causes row and column-wise accesses
- Use 2D tiled data mapping where each tile is mapped to a DRAM row
- Restructure the algorithm given 2D data mapping



- Data is accessed as *tiles*, not row and column-wise
- Row-buffer misses are minimized!
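The effect of the data mapping on row-buffer misses can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: a single DRAM bank with one open row, a 64-by-64 matrix, and a DRAM row that holds exactly 64 elements (one 8-by-8 tile). It is not a model of the DE4 memory system, and the names are hypothetical.

```python
N = 64                     # matrix is N-by-N elements (toy size)
TILE = 8                   # tile is TILE-by-TILE elements
ROW_ELEMS = TILE * TILE    # assumed capacity of one DRAM row, in elements

def count_row_misses(accesses, dram_row_of):
    """Count how often the open DRAM row changes (single bank, one open row)."""
    misses, open_row = 0, None
    for r, c in accesses:
        row = dram_row_of(r, c)
        if row != open_row:
            misses += 1
            open_row = row
    return misses

def linear_row(r, c):
    """Row-major (linear) mapping: consecutive addresses fill a DRAM row."""
    return (r * N + c) // ROW_ELEMS

def tiled_row(r, c):
    """2D tiled mapping: each tile occupies exactly one DRAM row."""
    return (r // TILE) * (N // TILE) + (c // TILE)

# Column-wise traversal of the whole matrix on the linear mapping.
column_wise = [(r, c) for c in range(N) for r in range(N)]
# Tile-by-tile traversal (down columns of tiles) on the tiled mapping.
tile_wise = [(tr * TILE + r, tc * TILE + c)
             for tc in range(N // TILE) for tr in range(N // TILE)
             for r in range(TILE) for c in range(TILE)]

linear_misses = count_row_misses(column_wise, linear_row)
tiled_misses = count_row_misses(tile_wise, tiled_row)
print("linear mapping, column-wise:", linear_misses)
print("tiled mapping, tile-wise:   ", tiled_misses)
```

In this toy model the column-wise traversal opens a new DRAM row on every access (4096 misses), while the tiled traversal opens one row per tile (64 misses).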

### From algorithm to hardware





Matching throughput to memory bandwidth:

- Achieved via fine-grain control over datapath *parallelism*
- Results in *balanced* design
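A rough way to see what matching throughput to memory bandwidth implies is to convert bandwidth into complex words per clock cycle. The sketch below uses the DE4 figures from the platform table in this poster (12 GB/s, 200 MHz) and assumes 16-byte double precision complex words, 1 GB = 10^9 bytes, and an even split of bandwidth between reads and writes. These assumptions and the function name are illustrative; the poster does not describe how the authors sized their datapath.

```python
def words_per_cycle(bandwidth_gb_s, clock_mhz, bytes_per_word=16):
    """Complex words the memory system can move per datapath clock cycle.

    Assumes 1 GB = 1e9 bytes and 16-byte double precision complex words.
    """
    bytes_per_cycle = (bandwidth_gb_s * 1e9) / (clock_mhz * 1e6)
    return bytes_per_cycle / bytes_per_word

total = words_per_cycle(12, 200)    # reads and writes combined
per_direction = total / 2           # assumed even read/write split
print("words per cycle (total):        ", total)
print("words per cycle (per direction):", per_direction)
```

A datapath that consumes far fewer words per cycle than this leaves memory bandwidth unused, while one that is much wider would stall waiting for data, which is the balance the bullets above refer to.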




Ensuring *continuous dataflow*:

- Buffers are used to smooth the flow of data

Target *application*:

- Double precision complex 2D-FFT
- Data sizes up to 2,048-by-2,048

Target *platforms*:

|                            | Core i7 960 | GTX 480  | Stratix IV EP4SGX530 (DE4) |
|----------------------------|-------------|----------|----------------------------|
| DRAM Type                  | DDR3        | GDDR5    | DDR2                       |
| # of Memory Channels       | 3           | 6        | 2                          |
| Memory BW (GB/s)           | 25.6        | 177.4    | 12                         |
| On-chip Memory (MB)        | 8           | 1.69     | 2.53                       |
| Proc. Freq. (MHz)          | 3,200       | 1,401    | 200                        |
| # of Cores                 | 4           | 480      | N/A                        |
| Technology Node (nm)       | 45          | 40       | 40                         |
| Application Infrastructure | Spiral      | CUDA 4.0 | Spiral/Verilog             |

### Raw performance:

#### 2D-FFT Raw Performance (double precision)

Performance [Gflop/s] vs. Problem Size

### Bandwidth Efficiency:

#### 2D-FFT Bandwidth Efficiency (double precision)

Bandwidth normalized performance [(Gflop/s)/(GB/s)] vs. Problem Size

### Power\* Efficiency:

\* Measured power consumption including DRAMs

#### 2D-FFT Power Efficiency (double precision)

Power normalized performance [(Gflop/s)/Watt] vs. Problem Size



The authors acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity.

**Evaluation**