implementing the fft on gpus tips & trickscomputer-graphics.se/multicore/pdf/2013/12a fft on...
TRANSCRIPT
1
LINKÖPING UNIVERSITY
Linköping, 2013
IMPLEMENTING THE FFT ON GPUs
TIPS & TRICKS
Department of Electrical
Engineering
Mario Garrido Gálvez [email protected]
2
MARIO GARRIDO
Associate Professor at ES, ISY
PhD in Electrical Engineering (Spain).
Research background:
Optimized implementation of signal processing algorithms.
Transforms (FFT, STFT,…), statistical operations
(regressions, median filter,…).
Data management (matrix transposition, interleavers,…).
Hardware designer (FPGAs, ASICs,…).
3
A STORY ABOUT GPUs
4
< 2011
Mainly FFTs on FPGAs.
Hundreds of papers in the topic since the 70’s.
Is not everything done???
OPTIMIZE
OPTIMIZE
OPTIMIZE
5
DFT / FFT
Discrete Fourier Transform / Fast Fourier Transform.
The most widely used algorithm in signal processing
- Audio and Image Processing. - 3G, 4G.
- Medical applications: EEG, ECG. - ADSL.
6
…,2011,…
FFT FPGA FFT FFT FPGA FPGA FFT FPGA
FPGA FFT FPGA FFT FFT FPGA FPGA FFT…
I should do something new!!
What about GPUs?…Shouldn’t it be the same…
7
FPGA vs GPU
FPGA GPU
Altera Cyclone II NVIDIA Fermi
8
…,2011,…
Started Master Thesis (Sreehari Ambuluri): FFTs on GPUs.
Read articles and a book on GPUs.
Asked Ingemar, Jens, Gabriel.
9
…,2012,…
Finish the Master Thesis.
The work is good.
Why not to improve it and
publish a paper?
Asked Ingemar, Jens and
Gabriel for collaboration.
10
…,2013,…
11
…,2013 BEST PAPER AWARD
12
NVIDIA
NVIDIA WE
13
LEVEL OF ABSTRACTION
The compiler decides We decide
COMPILATION
COMPILATION
DESCRIPTION
OPTIMIZATION
&
DESCRIPTION
ALGORITHM ALGORITHM =
IMPLEMENTATION IMPLEMENTATION <
DE
SIG
N P
RO
CE
SS
High level of abstraction Low level of abstraction
14
ABSTRACTION vs PERFORMANCE
z = 15 * x;
z = (x<<3)+(x<<2)+(x<<1) +x;
z = (x<<4)-x;
x z
15
x << 3
z
<< 2
<< 1
x << 4
z
LANGUAGE
DESCRIPTION HARDWARE
IMPLEMENTATION
15
UNDERSTANDING GPUs
1.- The performance is related to the computation time. The lower
the computation time, the higher the performance. Try to simplify
the operations in the algorithm.
2.- Transactions to global memory very expensive. Try to avoid or
minimize. Try to use shared memory.
3.- Threads must be synchronized if we want to share information
among them. Unless they are in the same warp. Try to reduce the
number of synchronization points.
4.- We have to calculate the index of the data processed by each
thread. Try to minimize the number of index calculations.
5.- Threads process data in parallel and the synchronization is not
possible until all the threads have finished the calculations. Balance
the load among thread.
FFT FLOW GRAPH (RADIX -2)
4 03
7
6
5
4
0
07
15
11
2
3
1
0
0
4
0
0
0
0
0
0
4
6 4
0
2
0 0
0
5
13
1
9
6
14
2
10
0
0
0
0
0
0
4
0
0
0 0
0
4
12
0
8
14
15
13
12
10
11
8
9
7
6
1
5
4
3
2
0
STAGE 1 STAGE 2 STAGE 3 STAGE 4
4
6
2
0
0
0
0
0
x[n]
n
X[k]
k
logrN
stages
butterflies
(radix-2)
rotations
Nj
e
2
N points
16
1. SIMPIFY THE ALGORITHM
17
USE RADIX-22
2. USE SHARED MEMORY
18
3. REDUCE SYNC. POINTS 2-word group 4-word group
USE WORD GROUPS
19
4. REDUCE INDEX CALCULATIONS
USE CONSTANT GEOMETRY
Conventional flow graph Constant Geometry
20
5. BALANCE LOAD AMONG THREADS
USE SCHEDULING
Unbalanced scheduling Balanced scheduling
21
CONCLUSIONS
22
Optimization:
- Depends on the details and the level of abstraction.
- Requires to understand in-depth what you are doing.
Teamwork makes a difference.
GPUs are fun.
23