High Performance Discrete Fourier Transforms on Graphics Processors
Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, John Manferdelli
Microsoft Corporation
Discrete Fourier Transforms (DFTs)
• Given an input signal of N values f(n), project it onto a basis of complex exponentials
  – Often computed using Fast Fourier Transforms (FFTs) for efficiency
• Fundamental primitive for signal processing
  – Convolutions, cryptography, computational fluid dynamics, large polynomial multiplications, image and audio processing, etc.
• A popular HPC benchmark
  – HPC Challenge benchmark
  – NAS parallel benchmarks
$F(k) = \sum_{n=0}^{N-1} f(n)\, e^{-2\pi i k n / N}$
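The projection onto complex exponentials can be written out directly; a minimal NumPy sketch of the definition (illustrative only, not the GPU implementation from these slides):

```python
import numpy as np

def dft(f):
    """Naive O(N^2) DFT: F[k] = sum_n f[n] * exp(-2*pi*i*k*n/N)."""
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # Build the N x N matrix of basis exponentials and project f onto it.
    return np.exp(-2j * np.pi * k * n / N) @ f

f = np.random.rand(8)
# An FFT computes the same transform in O(N log N) operations.
assert np.allclose(dft(f), np.fft.fft(f))
```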
DFT: Challenges
• HPC Challenge 2008
  – DFT on Cray XT3: 0.9 TFLOPS
  – HPL: 17 TFLOPS
• Complex memory access patterns
  – Limited data reuse
  – For a balanced system, if the compute-to-memory ratio doubles, the cache size must be squared for the system to be balanced again [Kung86]
• Architectural issues
  – Cache associativity, memory banks
GPU: Commodity Processor
Cell phones, consoles, PSP, desktops
Parallelism in GPUs
[Diagram: GPU architecture — a grid of thread processing clusters (TPCs), each containing streaming processors (SPs) with local memory, connected to GPU memory built from multiple DRAM partitions.]
Programmability
[Diagram: the same TPC/SP architecture, with a thread execution manager dispatching thread blocks; each block has local memory and registers.]
High-level programming abstractions: Microsoft DirectX11, OpenCL, NVIDIA CUDA, AMD CAL, etc.
Discrete Fourier Transforms
• Objectives:
  – Efficiency: achieve high performance by exploiting the memory hierarchy and high parallelism
  – Accuracy: design algorithms whose numerical accuracy is comparable to CPU libraries
  – Scalability: demonstrate scalable performance based on underlying hardware capabilities
• Focus on computing single-precision DFTs that fit in GPU memory
  – Demonstrate DFT performance of 100-300 GFLOPS per GPU for typical large sizes
  – Concepts applicable to double-precision algorithms
FFT Overview
[Diagram: FFT along columns, then FFT along rows, with a transpose in between; memory hierarchy per multiprocessor: registers (16 K), shared memory (16 KB), global memory (1 GB).]
Significant literature on FFT algorithms. Detailed survey in [Van Loan 92]
DFTs on GPUs: Challenges
• Coalescing issues
  – Access contiguous blocks of data to achieve high DRAM bandwidth
• Bank conflicts
  – Affine access patterns can map to the same banks
• Transpose overheads
  – Reduce memory access overheads
• Occupancy
  – Require several threads to hide memory latency
Outline
• FFT Algorithms
  – Global Memory
  – Shared Memory
  – Hierarchical Memory
  – Other FFT algorithms
• Experimental Results
• Conclusions and Future Work
Overview
• Global Memory Algorithm
  – Large N
  – Uses the high memory bandwidth of GPUs
• Shared Memory Algorithm
  – Small N
  – Data reuse in the shared memory of GPU multiprocessors
• Hierarchical Algorithm
  – Intermediate sizes
  – Combines data transposes with the shared memory algorithm
Global memory algorithm
• Proceeds in log_R N steps (radix R)
• Decompose N into B blocks and T threads per block such that B*T = N/R
• Each thread:
  – reads R values from global memory
  – multiplies by twiddle factors
  – performs an R-point FFT
  – writes R values back to global memory
Global Memory Algorithm
[Diagram: radix R = 4, step j = 1; threads 0-3 each read R values with stride N/R and write R values within blocks of size R^j.]
• If N/R > the coalesce width (CW), there are no coalescing issues during reads
• If R^j > CW, there are no coalescing issues during writes
• If R^j <= CW, write to shared memory, rearrange data across threads, then write to global memory with coalescing
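The per-step structure above (read R strided values, apply a twiddle factor, compute a small FFT, write back) matches a Stockham-style pass. A scalar radix-2 NumPy sketch of that pass structure, as an illustration rather than the paper's CUDA kernel:

```python
import numpy as np

def stockham_fft(x):
    """Radix-2 Stockham FFT sketch: log2(N) passes. Each pass reads
    pairs separated by a stride of N/2, applies a twiddle factor, and
    writes the results contiguously, so no bit-reversal pass is needed."""
    x = np.asarray(x, dtype=complex).copy()
    N = len(x)
    y = np.empty(N, dtype=complex)
    n, s = N, 1  # n: current sub-transform size, s: output stride
    while n > 1:
        m = n // 2
        for p in range(m):
            w = np.exp(-2j * np.pi * p / n)  # twiddle factor
            for q in range(s):
                a = x[q + s * p]
                b = x[q + s * (p + m)]
                y[q + s * (2 * p)] = a + b
                y[q + s * (2 * p + 1)] = (a - b) * w
        x, y = y, x  # ping-pong buffers between passes
        n //= 2
        s *= 2
    return x
```

On the GPU, each iteration of the inner loops maps to one thread, and a radix larger than 2 reduces the number of passes over memory.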
Shared memory algorithm
• Applied when the FFT is computed on data in the shared memory of a multiprocessor
• Each block has N*M/R threads
  – M is the number of FFTs performed together in a block
  – Each multiprocessor performs M FFTs at a time
• Similar to the global memory algorithm
  – Uses the Stockham formulation to reduce compute overheads
Shared Memory Algorithm
[Diagram: radix R = 4, step j = 1; threads 0-3 read with stride N/R and write within blocks of size R^j in shared memory.]
• If N/R > the number of banks, there are no bank conflicts during reads
• If R^j > the number of banks, there are no bank conflicts during writes
Shared Memory Algorithm
[Diagram: threads 0-7 with R = 4, step j = 1; without padding, the accesses of threads 0-7 repeatedly map to banks 0, 4, 8, and 12.]
• If R^j <= the number of banks, add padding to avoid bank conflicts
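The padding idea can be shown with a small index-mapping sketch. The bank count of 16 is an assumption for illustration (G80-class GPUs expose 16 four-byte shared-memory banks per multiprocessor); the exact padding used by the library may differ:

```python
# Assumed bank count for illustration.
NUM_BANKS = 16

def padded_index(i):
    """Insert one padding element every NUM_BANKS elements so that
    accesses whose stride is a multiple of NUM_BANKS spread across
    different banks instead of serializing on a single bank."""
    return i + i // NUM_BANKS

# Unpadded: elements 0, 16, 32, 48 all fall in bank 0 and conflict.
conflicting = {i % NUM_BANKS for i in (0, 16, 32, 48)}
# Padded: the same logical elements now land in four distinct banks.
spread = {padded_index(i) % NUM_BANKS for i in (0, 16, 32, 48)}
```

The cost is a small amount of wasted shared memory per row, traded for conflict-free strided access.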
Hierarchical FFT
• Decompose the FFT into smaller-sized FFTs
  – Evaluate them efficiently using the shared memory algorithm
  – Combine transposes with FFT computation
  – Achieve memory coalescing
[Diagram: a multiprocessor with shared memory and its SPs, above the GPU DRAM partitions.]
Hierarchical FFT
[Diagram: the input viewed as an H x W array with W = N/H, processed CW columns at a time.]
• Perform CW FFTs of size H in shared memory
Hierarchical FFT
[Diagram: the H x W array, W = N/H.]
• Perform H FFTs of size W recursively, then transpose
• In-place algorithm
• The final set of transposes can also be combined with the FFT computation
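The H x W decomposition above is the classic four-step FFT: column FFTs of size H, a twiddle correction, row FFTs of size W, and a transpose. A NumPy sketch of the decomposition (using library FFTs for the sub-transforms, unlike the GPU version):

```python
import numpy as np

def four_step_fft(x, H, W):
    """Hierarchical (four-step) FFT of length N = H*W built from
    H-point and W-point sub-FFTs."""
    N = H * W
    A = np.asarray(x, dtype=complex).reshape(H, W)
    A = np.fft.fft(A, axis=0)                # W FFTs of size H (columns)
    k1 = np.arange(H).reshape(-1, 1)
    n2 = np.arange(W)
    A *= np.exp(-2j * np.pi * k1 * n2 / N)   # twiddle correction
    A = np.fft.fft(A, axis=1)                # H FFTs of size W (rows)
    return A.T.flatten()                     # transpose into natural order
```

On the GPU, the column FFTs run in shared memory CW columns at a time, and the final transpose is folded into the last FFT stage as the slides describe.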
Other FFTs
• Non-power-of-two sizes
  – Mixed radix: uses powers of 2, 3, 5, etc.
  – Bluestein's FFT: for large prime factors
• Multi-dimensional FFTs
  – Perform FFTs independently along each dimension
• Real FFTs
  – Exploit symmetry to improve performance
  – Transformed into a complex FFT problem
• DCTs
  – Computed via a transformation to a complex FFT problem
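Bluestein's algorithm turns a DFT of arbitrary (e.g. large prime) length into a convolution of a convenient power-of-two length. A NumPy sketch of the idea, not the library's GPU implementation:

```python
import numpy as np

def bluestein_fft(x):
    """Bluestein's algorithm: an arbitrary-length DFT expressed as a
    power-of-two circular convolution with a chirp sequence, using
    2*k*n = k*k + n*n - (k-n)*(k-n)."""
    N = len(x)
    n = np.arange(N)
    w = np.exp(-1j * np.pi * n * n / N)      # chirp sequence
    M = 1 << (2 * N - 1).bit_length()        # convolution length >= 2N-1
    a = np.zeros(M, dtype=complex)
    a[:N] = x * w
    b = np.zeros(M, dtype=complex)
    b[:N] = np.conj(w)
    b[M - N + 1:] = np.conj(w[1:])[::-1]     # wrap the chirp for circular conv.
    # Circular convolution via power-of-two FFTs.
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))
    return w * conv[:N]
```

Because the padded length M is a power of two, the three inner FFTs can reuse the fast power-of-two kernels described earlier.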
Microsoft DFT Library
Key features supported in our GPU DFT library:
• Dimension: 1D, 2D, 3D
• Algorithms: shared memory, global memory, hierarchical
• Data type: single precision, real, complex
• Runtime: auto-tuning, virtualization
• Size: large prime factors; 2^a, 3^b, etc.; mixed radix; multiple transforms
Outline
• FFT Algorithms
  – Global Memory
  – Shared Memory
  – Hierarchical Memory
  – Other FFT algorithms
• Experimental Results
• Conclusions and Future Work
Experimental Methodology
• Hardware
  – Intel QX9650 3.0 GHz quad-core processor
    • Two dual-core dies
    • Each pair of cores shares a 6 MB L2 cache
  – NVIDIA GTX280 GPU
    • Driver version 177.41

Name      Multi-procs  Shader clock (MHz)  Peak perf. (GFlops)  Memory (MB)  Peak bandwidth (GiB/s)
GTX280    30           1300                930                  1024         140
8800 GTX  16           1300                520                  768          80
8800 GTS  16           1625                620                  512          60
Experimental Methodology
• Libraries
  – Our FFT library, written in CUDA
    • Tested on various GPUs
  – NVIDIA's CUFFT library (v. 1.1)
    • Results for GTX280 only
  – DX9FFT library [Lloyd et al. 2007]
    • Results for GTX280 only
  – Intel's MKL (v. 10.0.2)
    • Run on the CPU with 4 threads
Experimental Methodology
• Notation
  – N: size of the FFT
  – M: number of FFTs
• Performance
  – GFlops: 5 M N lg(N) / time
  – Minimum time over multiple runs
  – Warm caches on the CPU
• Accuracy
  – Perform the forward transform followed by the inverse
  – Compare the result to the original input
  – Root mean square error (RMSE)
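Both metrics above are simple to state in code; a sketch of the throughput formula and the round-trip accuracy check (using NumPy's FFT as a stand-in for the transform under test):

```python
import numpy as np

def gflops(N, M, seconds):
    """Throughput metric from the slides: 5*M*N*lg(N) flops,
    the standard operation count for a complex FFT."""
    return 5 * M * N * np.log2(N) / seconds / 1e9

def roundtrip_rmse(x):
    """Accuracy metric: forward then inverse transform,
    root mean square error against the original input."""
    y = np.fft.ifft(np.fft.fft(x))
    return np.sqrt(np.mean(np.abs(y - x) ** 2))
```

Note that 5 N lg(N) is a conventional normalization, so libraries using fewer actual operations (e.g. higher radices) can report more than their arithmetic throughput.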
1D Single FFT
[Plot: GFlops vs. log2 N for a single FFT (M = 1); series: Ours GTX280, Ours 8800GTX, Ours 8800GTS, CUFFT, MKL.]
1D Multi-FFT
[Plot: GFlops vs. log2 N for M = 2^23/N transforms; series: Ours GTX280 (*driver 177.11), Ours GTX280, Ours 8800GTS, MKL, CUFFT, DX9FFT.]
1D Multi-FFT (relative runtime)
[Plot: relative runtime (log scale) vs. log2 N for M = 2^23/N; series: MKL, CUFFT, Ours 8800GTS, Ours GTX280; annotations mark sizes where the entire FFT fits in the shared memory kernel and speedups of 40x, 20x, and 5x.]
1D Mixed Radix
[Plot: GFlops vs. N for N = 2^a 3^b 5^c, M = 2^23/N; series: Ours, CUFFT, MKL.]
1D Primes
[Plot: GFlops vs. N for M = 2^20/N; series: Ours, MKL, CUFFT.]

1D Large Primes
[Plot: GFlops vs. log2 N for M = 2^22/N; series: Ours, MKL.]
RMSE Error (N = 2^a)
[Plot: RMSE vs. log2 N on a log scale (1.0E-8 to 1.0E-6); series: CUFFT, Ours, Accurate, MKL.]
RMSE Error (mixed radix)
[Plot: RMSE vs. log2 N on a log scale (1.0E-7 to 1.0E-5); series: CUFFT, Ours, MKL.]
RMSE Error (primes)
[Plot: RMSE vs. N on a log scale (1.0E-7 to 1.0E-2); series: CUFFT, Ours, MKL.]
Limitations
• Current implementation
  – Works only on data in GPU memory
  – No multi-GPU support
  – No support for double precision
• Hardware issues
  – Large data sizes are needed to fully utilize the GPU
  – Slow data transfer between GPU and system memory
  – High-accuracy twiddle factors are slow
    • Use a table (especially for double precision)
  – Need to virtualize the block index
    • Fixed in Microsoft DirectX11
Outline
• FFT Algorithms
  – Global Memory
  – Shared Memory
  – Hierarchical Memory
  – Other FFT algorithms
• Experimental Results
• Conclusions and Future Work
Conclusions
• Several algorithms for performing FFTs on GPUs
  – Handle different sizes efficiently
  – The library chooses appropriate algorithms for a given size and hardware configuration
  – Optimized for memory performance
    • Combined transposes with FFT computation
  – Address numerical accuracy issues
• High performance
  – Up to 300 GFLOPS on current high-end GPUs
  – Significantly faster than existing GPU-based and CPU-based libraries for typical large sizes
Future Work
• More sophisticated auto-tuning
• Add additional functionality:
  – Double precision
  – Multi-GPU support
  – Out-of-core support for very large FFTs
• Port to DirectX11 using Compute Shaders
Future of GPGPU
• GPUs are becoming more general purpose
  – Fewer limitations. Microsoft DirectX11 API:
    • IEEE floating point support and optional double support
    • Integer instruction support
    • More programmable stages, etc.
  – Significant advances in performance
  – Higher-level programming languages
  – Uniform abstraction layer over different hardware vendors
Future of GPGPU
• Widespread adoption of GPUs in commercial applications
  – Image and media processing, signal processing, finance, etc.
• High performance computing
  – Can benefit from data-parallel programming
  – Many opportunities
  – Microsoft GPU Station at booth number 1309
Acknowledgments
• Microsoft: Chas Boyd, Craig Mundie, Ken Oien
• NVIDIA: Henry Moreton, Sumit Gupta, and David
• Peter-Pike Sloan
• Vasily Volkov