Can We Teach Computers to Write Fast Libraries?
TRANSCRIPT
Can We Teach Computers
To Write Fast Libraries?
This work was supported by
DARPA DESA program, NSF-NGS/ITR, NSF-ACR, NSF-CPA, and Intel
Markus Püschel
Collaborators:
Franz Franchetti
Yevgen Voronenko
Electrical and
Computer Engineering
Carnegie Mellon University
… and the Spiral team (only part shown)
[Image: Hiroshige, Temple in park near Osaka, ca. 1853-59.]
The Problem: Example DFT
Standard desktop computer, vendor compiler, using optimization flags
All implementations have roughly the same operation count (~4n log2(n))
Maybe the DFT is just difficult?
[Plot: Discrete Fourier Transform (DFT) on 2 x Core 2 Duo, 3 GHz (single precision); x-axis: input size, 16 to 262,144; y-axis: performance (Gflop/s), 0 to 30. Numerical Recipes code vs. the best available code; the gap is 12x to 30x.]
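Since all implementations perform roughly the same number of operations, the plotted performance is, up to that constant, just inverse runtime. A minimal sketch of the conversion behind such plots (my formulation, assuming the ~4n log2(n) count above and runtime measured in seconds):

```latex
\text{performance [Gflop/s]} \;=\; \frac{4\,n\log_2(n)}{\text{runtime [s]}} \cdot 10^{-9}
```

So a 12x to 30x gap in performance at equal operation count is exactly a 12x to 30x gap in runtime.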
The Problem: Example MMM
Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, …
What’s going on?
[Plot: Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo, 3 GHz (double precision); x-axis: matrix size, 0 to 9,000; y-axis: performance (Gflop/s), 0 to 50. Naive triple loop vs. the best code (K. Goto); the gap is 160x.]
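For reference, the "triple loop" baseline in this plot is nothing more than the textbook definition. A minimal sketch (my illustration, row-major arrays assumed):

```c
/* Naive matrix-matrix multiply C = A * B for n x n row-major matrices.
   Correct, and performs the same 2n^3 flops as tuned code, but ignores
   caches, vector units, and multiple cores: hence the 160x gap. */
void mmm_triple_loop(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```

The 160x therefore comes entirely from how the same flops are scheduled onto the memory hierarchy, vector instructions, and cores.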
Evolution of Processors (Intel)
[Plot, built up over several animation steps: the evolution of Intel processors, leading into the era of parallelism.]
Era of parallelism
High-performance library development becomes increasingly difficult.
DFT Plot: Analysis
[Plot: the DFT performance plot from before (2 x Core 2 Duo, 3 GHz), annotated with the sources of the gap between Numerical Recipes and the best code:]
Memory hierarchy: 5x
Vector instructions: 3x
Multiple threads: 2x
(5 x 3 x 2 = 30x, the full gap.)
High performance library development has become a nightmare
Current Solution
Legions of programmers implement and optimize the same functionality for every platform, and again whenever a new platform comes out.
Better Solution: Automatic Performance Tuning
Automate (parts of) the implementation or optimization.
Research efforts:
Linear algebra: PHiPAC/ATLAS (UTK), Sparsity/BeBOP (Berkeley), FLAME (UT Austin)
Tensor computations (Ohio State)
PDE/finite elements: FEniCS
Adaptive sorting (UIUC)
Fourier transform: FFTW (MIT)
Linear transforms: Spiral
… others
New compiler techniques
Proceedings of the IEEE special issue, Feb. 2005
A promising new area, but more work is needed, in particular for parallelism.
Organization
Spiral: Brief overview
Parallelization in Spiral
Results
Conclusion
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan
Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W.
Johnson, and Nick Rizzolo, SPIRAL: Code Generation for DSP Transforms,
Proceedings of the IEEE 93(2), 2005
Vision Behind Spiral
[Diagram: two pipelines from numerical problem to computing platform, each with the steps algorithm selection, implementation (yielding a C program), and compilation. Current: algorithm selection and implementation require human effort; only compilation is automated. Future: the entire chain is automated.]
C code is a singularity: the compiler has no access to the high-level information.
Challenge: conquer the high abstraction level for complete automation.
Spiral
Library generator for linear transforms (DFT, DCT, DWT, filters, …) and recently more.
Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog.
Research goal: "teach" computers to write fast libraries:
Complete automation of implementation and optimization
Conquer the "high" algorithm level for automation
When a new platform comes out: regenerate a retuned library.
When a new platform paradigm comes out (e.g., vector or CMPs): update the tool rather than rewriting the library.
Intel has started to use Spiral to generate parts of their MKL/IPP libraries.
How Spiral Works
[Diagram: inside Spiral, a problem specification (transform) goes through algorithm generation and algorithm optimization (yielding an algorithm), then implementation and code optimization (yielding C code), then compilation and compiler optimizations (yielding a fast executable). Measured performance drives a search that controls the algorithm and implementation choices.]
Spiral: complete automation of the implementation and optimization task.
Basic ideas:
Declarative representation of algorithms
Rewriting systems to generate and optimize algorithms at a high level of abstraction
What is a (Linear) Transform?
Mathematically: matrix-vector multiplication.
Example: discrete Fourier transform (DFT):
$y = \mathrm{DFT}_n\, x$, where $\mathrm{DFT}_n = [\omega_n^{k\ell}]_{0 \le k,\ell < n}$ and $\omega_n = e^{-2\pi i/n}$.
x: input vector (signal); y: output vector (signal); the transform is the matrix $\mathrm{DFT}_n$.
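As a concrete illustration (a minimal sketch, not Spiral output), the definition maps directly to an O(n^2) matrix-vector product:

```c
/* DFT by definition: y[k] = sum_l omega_n^{k*l} x[l], omega_n = e^{-2*pi*i/n}.
   O(n^2) operations; fast algorithms factor the matrix DFT_n instead. */
#include <complex.h>
#include <math.h>

void dft_by_definition(int n, const double complex *x, double complex *y) {
    for (int k = 0; k < n; k++) {
        double complex sum = 0;
        for (int l = 0; l < n; l++)
            sum += cexp(-2.0 * M_PI * I * ((double)k * l / n)) * x[l];
        y[k] = sum;
    }
}
```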
Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):
Algorithms are divide-and-conquer: Breakdown rules
Mathematical, declarative representation: SPL (signal processing language)
SPL describes the structure of the dataflow
$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\; T^4_2\; (I_2 \otimes \mathrm{DFT}_2)\; L^4_2$

$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix} = \begin{bmatrix} 1 & & 1 & \\ & 1 & & 1 \\ 1 & & -1 & \\ & 1 & & -1 \end{bmatrix} \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & -j \end{bmatrix} \begin{bmatrix} 1 & 1 & & \\ 1 & -1 & & \\ & & 1 & 1 \\ & & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & & & \\ & & 1 & \\ & 1 & & \\ & & & 1 \end{bmatrix}$

$\mathrm{DFT}_2$: Fourier transform (base case)
$I_2$: identity; $L^4_2$: permutation
$T^4_2$: diagonal matrix (twiddles)
$\otimes$: Kronecker product
Examples: Transforms (currently 55)
Examples: Breakdown Rules (currently ≈220)
Base case rules
Combining these rules yields many algorithms for every given transform
Breakdown Rules (≈220)
BSkewDFT 4
BSkewDFT_Base2
BSkewDFT_Base4
BSkewDFT_Fact
BSkewDFT_Decomp
BSkewPRDFT 1
BSkewPRDFT_Decomp
BSkewRDFT 1
BSkewPRDFT_Decomp
Circulant 10
RuleCirculant_Base
RuleCirculant_toFilt
RuleCirculant_Blocking
RuleCirculant_BlockingDense
RuleCirculant_DiagonalizeStep
RuleCirculant_DFT
RuleCirculant_RDFT
RuleCirculant_PRDFT
RuleCirculant_PRDFT1
RuleCirculant_DHT
CosDFT 2
CosDFT_Base
CosDFT_Trigonometric
DCT1 3
DCT1_Base2
DCT1_Base4
DCT1_DCT1and3
DCT2 5
DCT2_DCT2and4
DCT2_toRDFT
DCT2_toRDFT_odd
DCT2_PrimePowerInduction
DCT2_PrimeFactor
DCT3 5
DCT3_Base2
DCT3_Base3
DCT3_Base5
DCT3_Orth9
DCT3_toSkewDCT3
DCT4 8
DCT4_Base2
DCT4_Base3
DCT4_DCT2
DCT4_DCT2t
DCT4_DCT2andDST2
DCT4_DST4andDST2
DCT4_Iterative
DCT4_BSkew_Decomp
PolyDTT 4
PolyDTT_ToNormal
PolyDTT_Base2
PolyDTT_SkewBase2
PolyDTT_SkewDST3_CT
RDFT 4
RDFT_Base
RDFT_Trigonometric
RDFT_CT_Radix2
RDFT_toDCT2
RHT 2
RHT_Base
RHT_CooleyTukey
SinDFT 2
SinDFT_Base
SinDFT_Trigonometric
SkewDFT 2
SkewDFT_Base2
SkewDFT_Fact
SkewDTT 3
SkewDTT_Base2
SkewDCT3_VarSteidl
SkewDCT3_VarSteidlIterative
Toeplitz 6
RuleToeplitz_Base
RuleToeplitz_SmartBase
RuleToeplitz_PreciseBase
RuleToeplitz_Blocking
RuleToeplitz_BlockingDense
RuleToeplitz_ExpandedConvolution
UDFT_Base4
UDFT_SR_Reg
URDFT 4
URDFT1_Base1
URDFT1_Base2
URDFT1_Base4
URDFT1_CT
WHT 8
WHT_Base
WHT_GeneralSplit
WHT_BinSplit
WHT_DDL
WHT_Dirsum
WHT_tSPL_BinSplit
WHT_tSPL_STerm
WHT_tSPL_Pease
PDHT2 2
PDHT2_Base2
PDHT2_CT
PDHT3 4
PDHT3_Base2
PDHT3_CT
PDHT3_Trig
PDHT3_CT_Radix2
PDHT4 3
PDHT4_Base2
PDHT4_CT
PDHT4_Trig
PDST4 2
PDST4_Base2
PDST4_CT
PRDFT 10
PRDFT1_Base1
PRDFT1_Base2
PRDFT1_CT
PRDFT1_Complex
PRDFT1_Complex_T
PRDFT1_Trig
PRDFT1_PF
PRDFT1_CT_Radix2
PRDFT_PD
PRDFT_Rader
PRDFT2 4
PRDFT2_Base1
PRDFT2_Base2
PRDFT2_CT
PRDFT2_CT_Radix2
PRDFT3 6
PRDFT3_Base1
PRDFT3_Base2
PRDFT3_CT
PRDFT3_Trig
PRDFT3_OddToPRDFT1
PRDFT3_CT_Radix2
PRDFT4 4
PRDFT4_Base1
PRDFT4_Base2
PRDFT4_CT
PRDFT4_Trig
PolyBDFT 5
PolyBDFT_Base2
PolyBDFT_Base4
PolyBDFT_Fact
PolyBDFT_Trig
PolyBDFT_Decomp
Filt 10
RuleFilt_Base
RuleFilt_Nest
RuleFilt_OverlapSave
RuleFilt_OverlapSaveFreq
RuleFilt_OverlapAdd
RuleFilt_OverlapAdd2
RuleFilt_KaratsubaSimple
RuleFilt_Circulant
RuleFilt_Blocking
RuleFilt_KaratsubaFast
IPRDFT 5
IPRDFT1_Base1
IPRDFT1_Base2
IPRDFT1_CT
IPRDFT1_Complex
IPRDFT1_CT_Radix2
IPRDFT2 4
IPRDFT2_Base1
IPRDFT2_Base2
IPRDFT2_CT
IPRDFT2_CT_Radix2
IPRDFT3 3
IPRDFT3_Base1
IPRDFT3_Base2
IPRDFT3_CT
IPRDFT4 3
IPRDFT4_Base1
IPRDFT4_Base2
IPRDFT4_CT
IRDFT 1
IRDFT_Trigonometric
InverseDCT 1
InverseDCT_toDCT3
MDCT 1
MDCT_toDCT4
MDDFT 5
MDDFT_Base
MDDFT_Tensor
MDDFT_RowCol
MDDFT_Dimless
MDDFT_tSPL_RowCol
PDCT4 2
PDCT4_Base2
PDCT4_CT
PDHT 3
PDHT1_Base2
PDHT1_CT
PDHT1_CT_Radix2
DRDFT 1
DRDFT_tSPL_Pease
DSCirculant 2
RuleDSCirculant_Base
RuleDSCirculant_toFilt
DST1 3
DST1_Base1
DST1_Base2
DST1_DST3and1
DST2 3
DST2_Base2
DST2_toDCT2
DST2_DST2and4
DST4 2
DST4_Base
DST4_toDCT4
DTT 15
DTT_C3_Base2
DTT_IC3_Base2
DTT_C3_Base3
DTT_C3_Base3
DTT_S3_Base2
DTT_S3Poly_Base2
DTT_S3Poly_Base3
DTT_S3_Base2
DTT_C4_Base2
DTT_C4_Base2_LS
DTT_C4Poly_Base2
DTT_S4_Base2
DTT_S4_Base2_LS
DTT_S4Poly_Base2
DTT_DIF_T
DWT 5
DWT_Filt
DWT_Base2
DWT_Base4
DWT_Base8
DWT_Lifting
DWTper 4
DWTper_Mallat_2
DWTper_Mallat
DWTper_Polyphase
DWTper_Lifting
DCT5 1
DCT5_Rader
DFT 22
DFT_Base
DFT_Canonize
DFT_CT
DFT_CT_Mincost
DFT_CT_DDL
DFT_CosSinSplit
DFT_Rader
DFT_Bluestein
DFT_GoodThomas
DFT_PFA_SUMS
DFT_PFA_RaderComb
DFT_SplitRadix
DFT_DCT1andDST1
DFT_DFT1and3
DFT_CT_Inplace
DFT_SplitRadix_Inplace
DFT_PD
DFT_SR_Reg
DFT_PRDFT
DFT_GT_CT
DFT_tSPL_CT
DFT_DFTBR
DFT2 2
DFT2_CT
DFT2_PRDFT2
DFT3 5
DFT3_Base
DFT3_CT
DFT3_CT_Radix2
DFT3_OddToDFT1
DFT3_PRDFT3
DFT4 3
DFT4_Base
DFT4_CT
DFT4_PRDFT4
DFTBR 1
DFTBR_HW
DHT 3
DHT_Base
DHT_DCT1andDST1
DHT_Radix2
Cooley-Tukey FFT as a breakdown rule:
$\mathrm{DFT}_{mn} \to (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m$
Breakdown rules in Spiral:
"Teach" Spiral the existing algorithm knowledge (~200 journal papers)
Also include many new ones (algebraic theory)
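To make the rule concrete, here is a minimal sketch (my illustration, not Spiral-generated code) of the radix-2 special case m = 2, n = N/2 as a recursive C routine; each factor of the formula becomes one stage:

```c
/* Cooley-Tukey, radix 2: DFT_N = (DFT_2 (x) I_n) T^N_n (I_2 (x) DFT_n) L^N_2,
   with n = N/2 and N a power of two. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

void dft_rec(int N, const double complex *x, double complex *y) {
    if (N == 1) { y[0] = x[0]; return; }
    int n = N / 2;
    double complex *t = malloc((size_t)N * sizeof *t);
    for (int i = 0; i < n; i++) {   /* L^N_2: even/odd split */
        t[i]     = x[2*i];
        t[n + i] = x[2*i + 1];
    }
    dft_rec(n, t, y);               /* I_2 (x) DFT_n: two half-size DFTs */
    dft_rec(n, t + n, y + n);
    for (int i = 0; i < n; i++) {   /* T^N_n (twiddles), then DFT_2 (x) I_n */
        double complex w = cexp(-2.0 * M_PI * I * i / N) * y[n + i];
        y[n + i] = y[i] - w;
        y[i]     = y[i] + w;
    }
    free(t);
}
```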
SPL to Sequential Code
Example: tensor product $y = (I_m \otimes A_n)\, x$
Correct code: easy. Fast code: very difficult.
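A minimal sketch of that direct mapping (my illustration, assuming a generic kernel A_n applied blockwise):

```c
/* y = (I_m (x) A_n) x: apply the n x n kernel A to m contiguous blocks.
   This is the "correct code: easy" part; making it fast is the hard part. */
void i_tensor_a(int m, int n, const double *x, double *y,
                void (*A)(int n, const double *x, double *y)) {
    for (int i = 0; i < m; i++)
        A(n, x + (long)i * n, y + (long)i * n);  /* block i: y_i = A_n x_i */
}
```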
Program Generation in Spiral (Sketched)
[Diagram: Transform (user specified) → fast algorithm in SPL (many choices) → Σ-SPL [PLDI 05] → C code. This process is iterated to search for the fastest version.]
But that's not all: parallelization, vectorization, loop optimizations, constant folding, scheduling, …
Optimization happens at all abstraction levels.
Organization
Spiral: Brief overview
Parallelization in Spiral
Results
Conclusion
SPL to Shared Memory Code: Basic Idea [SC 06]
Governing construct: the tensor product $y = (I_p \otimes A)\, x$ is p-way embarrassingly parallel and load-balanced.
[Diagram: four copies of A applied to four contiguous blocks of x, one per processor 0 to 3.]
Problematic construct: permutations produce false sharing.
[Diagram: a permutation scattering elements of x across cache lines of y.]
Task: rewrite formulas to extract tensor products and keep contiguous blocks.
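As a sketch of what the "good" construct compiles to (a minimal illustration assuming OpenMP and a generic block kernel A, not Spiral's actual generated code):

```c
/* y = (I_p (x) A_n) x with one contiguous block per processor.
   Load-balanced; no two threads write the same cache line as long as
   n * sizeof(double) is a multiple of the line size mu. */
#include <omp.h>

void smp_i_tensor_a(int p, int n, const double *x, double *y,
                    void (*A)(int n, const double *x, double *y)) {
    #pragma omp parallel for num_threads(p) schedule(static)
    for (int i = 0; i < p; i++)
        A(n, x + (long)i * n, y + (long)i * n);
}
```

Because each thread writes only its own contiguous block of y, this has exactly the no-false-sharing property the rewriting aims for.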
Step 1: Shared Memory Tags
Identify the crucial hardware parameters:
Number of processors: p
Cache line size: μ
Introduce them as tags in SPL: a tagged formula A means that A is to be optimized for p processors and cache line size μ.
Step 2: Identify "Good" Formulas
Load-balanced, avoiding false sharing:
Tagged operators (no further rewriting necessary)
Definition: a formula is fully optimized for (p, μ) if it is one of the above or of the form shown on the slide [formula not transcribed], where A and B are fully optimized.
Step 3: Identify Rewriting Rules
Goal: transform formulas into fully optimized formulas.
Formulas are rewritten and tags are propagated; there may be choices.
Simple Rewriting Example
[Formula not transcribed: a tagged construct is rewritten into a fully optimized form; see the reconstruction below.]
This corresponds to loop splitting + loop exchange.
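The slide's formula is not in the transcript; as a representative reconstruction (a standard tensor identity of the kind used here, not necessarily the exact rule shown), a construct $A_m \otimes I_n$ can be rewritten to expose the p-way parallel form:

```latex
A_m \otimes I_n \;\to\;
\left(L^{mp}_m \otimes I_{n/p}\right)\,
\left(I_p \otimes (A_m \otimes I_{n/p})\right)\,
\left(L^{mp}_p \otimes I_{n/p}\right)
```

The middle factor is the embarrassingly parallel tensor product; the stride permutations $L$ are what loop splitting and loop exchange implement.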
Parallelization by Rewriting
Fully optimized (load-balanced, no false sharing) in the sense of our definition.
Same Approach for Other Parallel Paradigms
Vectorization: [IPDPS 02, VecPar 06]
Message passing (MPI): [ISPA 06] (with Bonelli, Lorenz, Ueberhuber, TU Vienna)
Cg/OpenGL for GPUs (with Shen, TU Denmark)
Verilog for FPGAs: [DAC 05] (with Milder, Hoe, CMU)
Going Beyond Transforms
Transform = linear operator with one vector input and one vector output.
Key ideas:
Generalize to (possibly nonlinear) operators with several inputs and several outputs
Generalize SPL (including the tensor product) to OL (operator language)
Generalize the rewriting systems for parallelization
We have first results beyond transforms!
Examples shown as OL formulas on the slide: Cooley-Tukey FFT in OL, Viterbi in OL, Mat-Mat-Mult.
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
Benchmarks
[Grid of kernels vs. platforms: kernel: DFT; platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU.]
All Spiral code shown is "push-button" generated from scratch ("click").
Benchmark: Vector and SMP
[Plot: DFT (single precision) on 3 GHz 2 x Core 2 Extreme; x-axis: input size, 256 to 262,144; y-axis: performance (Gflop/s), 0 to 30. Lines: Spiral 5.0 SPMD (2 and 4 processors), Spiral 5.0 sequential, Intel IPP 5.0, FFTW 3.2 alpha SMP (2 and 4 processors), FFTW 3.2 alpha sequential. Memory footprint < L1 cache of one processor; peak of 25 Gflop/s!]
4-way vectorized + up to 4-threaded + adapted to the memory hierarchy.
Benchmark: DCT4, Multiples of 32, 4-way Vectorized
[Plot: DCT4 (single precision) on 2.66 GHz Core 2 (4-way 32-bit SSE); x-axis: input size, 32 to 256; y-axis: performance (Gflop/s), 0 to 10. Lines: Spiral DCT4, FFTW 3.1.2 DCT4 (k11), IPP 5.1 DCT2 (?).]
A less popular transform usually means no good code is available.
Benchmark: Cell (1 processor = SPE)
[Plot: DFT (single precision) on 3.2 GHz Cell BE (single SPE); x-axis: input size, 16 to 1,024; y-axis: performance (Gflop/s), 0 to 16.]
Generated using the simulator; run at Mercury (thanks to Robert Cooper).
Joint work with Th. Peter (ETH Zurich), S. Chellappa, M. Telgarsky, J. Moura (CMU).
Benchmark: GPU
[Plot: WHT (single precision) on 3.6 GHz Pentium 4 with Nvidia 7900 GTX; x-axis: log2(input size), 8 to 24; y-axis: performance (Gflop/s), 0 to 6. Lines: Spiral CPU, Spiral GPU, CPU + GPU.]
Joint work with H. Shen, TU Denmark.
Benchmark: FPGA
[Plot: DFT 256 on Xilinx Virtex-2 Pro FPGA; x-axis: area (slices), 0 to 25,000; y-axis: inverse throughput (gap) in μs, 0 to 8; lower left is better. Points: Xilinx LogiCore 3.2 vs. Spiral.]
Competitive with professional designs, with a much larger set of performance/area trade-offs (Pareto-optimal designs).
Joint work with P. Milder, J. Hoe (CMU).
Benchmarks
[Grid of kernels vs. platforms, now filled in further: kernels: DFT, filter, GEMM, Viterbi; platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU.]
All Spiral code shown is "push-button" generated from scratch ("click").
Beyond Transforms: Viterbi Decoding
[Plot: Viterbi decoding (8-bit) on 2.66 GHz Core 2 Duo; x-axis: log2(size of state machine), 6 to 13; y-axis: performance (Gbutterflies/s), 0 to 2.5. Lines: Spiral 16-way vectorized, Spiral scalar, Karn's Viterbi decoder (hand-tuned assembly).]
1 butterfly = ~22 ops
Vectorized using practically the same rules as for the DFT.
Joint work with S. Chellappa, CMU. Karn's decoder: http://www.ka9q.net/code/fec/
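For readers unfamiliar with the unit on the y-axis: one Viterbi butterfly is an add-compare-select pair. A minimal sketch (my illustration with a hypothetical interface, not Spiral's or Karn's code; survivor-path bookkeeping omitted):

```c
/* Add-compare-select butterfly: predecessor states s and s + S/2 both
   reach successor states 2s and 2s+1; for a rate-1/2 code the two branch
   metrics b0, b1 swap roles between the two successors. */
static inline void acs_butterfly(unsigned m0, unsigned m1,
                                 unsigned b0, unsigned b1,
                                 unsigned *out0, unsigned *out1) {
    unsigned a = m0 + b0, b = m1 + b1;
    *out0 = a < b ? a : b;      /* survivor metric for state 2s   */
    unsigned c = m0 + b1, d = m1 + b0;
    *out1 = c < d ? c : d;      /* survivor metric for state 2s+1 */
}
```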
Beyond Transforms: Matrix-Matrix-Multiply
Work with F. de Mesmay, CMU
[Plot: MMM (square matrices, real, double precision) on 2 x Core 2 Duo, 3 GHz; x-axis: matrix side, 1 to 16,384; y-axis: performance (Gflop/s), 0 to 50, with the theoretical peak marked. Lines: Spiral generated (scalar), ATLAS generated (scalar), ATLAS contributed (pthreads + SSE), Goto (pthreads + SSE), Spiral generated (pthreads + SSE).]
All measurements with warm cache; note that ATLAS code is optimized for cold cache.
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
Spiral Summary
Spiral: the computer implements and optimizes transforms.
Rigorous formal framework
Support for different platform paradigms (vector, SMP, GPU, FPGA, …)
Optimization at a high abstraction level through rewriting + empirical search over alternatives
The generated code is often faster than human-written code
We have extended the framework beyond transforms
What we have learned:
Declarative representation of algorithms (mathematical DSL)
Optimization at a high, "right" level of abstraction using rewriting
It makes sense to use math to represent and optimize math functionality
It makes sense to "teach" the computer algorithms and math (this knowledge does not become obsolete)
Domain-specific techniques are necessary to get the fastest code
One needs techniques from different disciplines …
Attacking the Parallel Library Challenge: Automation in High-Performance Library Development
[Diagram: high-performance parallel library development at the center, drawing on programming languages, program generation, symbolic computation, rewriting, and compilers on the software side, and on scientific computing, algorithms, and mathematics.]
We need to work together.