spiral overview: automatic generation of dsp …jjohnson/2009-10/winter/cs650/lectures/... ·...
TRANSCRIPT
Carnegie Mellon, Drexel, UIUC
Jeremy Johnson (& SPIRAL Team)Drexel University
* The work is supported by DARPA DESA, NSF, and Intel.Material for presentation provided by Franz Francheti, Yevgen Voronenko, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel and the rest of the SPIRAL team.
SPIRAL Overview: Automatic Generation of DSP Algorithms & More*
CS 650 Program Generation and Opt.
Carnegie Mellon, Drexel, UIUC
Outline Introduction Challenge and goal
PreSPIRAL Formulas, programs, and rewrite rules
Multilinear Algebra and Block Recursive Algorithms
SPIRAL (straightline code) SPL and Ruletrees
SPIRAL (loops) ∑-SPL
SPIRAL (vector, parallel, and more) Tagged SPL
SPIRAL (library generation)
Operator Language Beyond transforms
Carnegie Mellon, Drexel, UIUC
Outline
Introduction Challenge and goal
PreSPIRAL
SPIRAL (straightline code)
SPIRAL (loops)
SPIRAL (vector, parallel, and more)
SPIRAL (library generation)
Operator Language
Carnegie Mellon, Drexel, UIUC
Challenge of Obtaining Efficient Code
Memory hierarchy: 5x
Vector instructions: 3x
Multiple threads: 2x
High performance library development has become a nightmare
Carnegie Mellon, Drexel, UIUC
Spiral Overview Research Goal: “Teach” computers to write fast libraries Complete automation of implementation and optimization Including vectorization, parallelization
Functionality: Linear transforms (discrete Fourier transform, filters, wavelets) BLAS SAR imaging En/decoding (Viterbi, Ebcot in JPEG2000) … more
Platforms: Desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid
Collaboration with Intel (Kuck, Tang, Sabanin) Parts of MKL/IPP generated with Spiral IPP 6.0: ippg domain for Spiral generated code
Carnegie Mellon, Drexel, UIUC
SPIRAL: Abstraction Levels
Transform
Ruletree
SPL
Σ-SPL
C Code…for (i=0; i<8; i++) {t[2i+4]=x[i-4]+x[i+4];… }
problem specification
easy manipulation for search
vector/parallel optimizations
loop optimizations
C compiler
trad
ition
alhu
man
dom
ain
tool
sav
aila
ble
Carnegie Mellon, Drexel, UIUC
Program Generation in Spiral (Sketched)Transformuser specified
C Code:
Fast algorithmin SPL many choices
∑-SPL:[PLDI 05]
Iteration of this process to search for the fastest
But that’s not all …
parallelizationvectorization
loop optimizations
constant foldingscheduling……
Optimization at allabstraction levels
Carnegie Mellon, Drexel, UIUC
Outline
Introduction
PreSPIRAL Formulas, programs, and rewrite rules [CSSP 90]
Multilinear Algebra and Block Recursive Algorithms [JSC 91]
SPIRAL (straightline code)
SPIRAL (loops)
SPIRAL (vector, parallel, and more)
SPIRAL (library generation)
Operator Language
Carnegie Mellon, Drexel, UIUC
DSP Algorithms & Matrix Factorizations
−
−
−−
=
−−−−−−
1000001001000001
1100110000110011
000010000100001
1010010110100101
111111
111111
iii
ii
Fourier transform
Identity Permutation
Diagonal matrix (twiddles)
Kronecker product
Input vector
Output vector
input vector (signal)output vector (signal) transform = matrix
Carnegie Mellon, Drexel, UIUC
Rules and Dynamic Programming
DFT(16)
DFT(2) DFT(8)
DFT(16)
DFT(4) DFT(4)
Identity PermutationKronecker product
DFT(16)
DFT(8) DFT(2)
Diagonal matrix (twiddles)
Dynamic Programming (DP) searches over different applications of CT Rule using best transform found for recursive calls
Carnegie Mellon, Drexel, UIUC
Formulas to Code
t(0) = x(0); t(1) = x(2); t(2) = x(1); t(3) = x(3)u(0) = t(0) + t(1); u(1) = t(0) – t(1);u(2) = t(2) + t(3); u(3) = t(2) – t(3);v(0) = u(0); v(1) = u(1); v(2) = u(2); v(3) = i*u(3);y(0) = v(0) + v(2); y(2) = v(0) - v(2);y(1) = v(1) + v(3); y(3) = v(1) - v(3);
Simplify →u(0) = x(0) + x(2); u(1) = x(0) – x(1);u(2) = x(1) + x(3); u(3) = x(1) – x(3);u(3) = i*u(3)y(0) = u(0) + u(2); y(2) = u(0) - u(2);y(1) = u(1) + u(3); y(3) = u(1) - u(3);
Carnegie Mellon, Drexel, UIUC
Formulas and Architectures
Parallel OperationAAAA
x y
Processor 0
Processor 1
Processor 2
Processor 3
Vector Operation
Task: Rewrite formulas to introducevarying amounts of vectorization and parallelismand modify access patterns
A4 A4 A4 A4
A4 A4 A4 A4
Carnegie Mellon, Drexel, UIUC
Multilinear Algebra & Block Recursive Algs
Matrix Multiplication
Strassen’s Algorithm
Carnegie Mellon, Drexel, UIUC
Outline
Introduction
PreSPIRAL
SPIRAL (straightline code) SPL and Ruletrees
SPIRAL (loops)
SPIRAL (vector, parallel, and more)
SPIRAL (library generation)
Operator Language
Carnegie Mellon, Drexel, UIUC
16
SPL (Signal Processing Language) [PLDI 01] SPL expresses transform algorithms as structured sparse
matrix factorization
Examples:
SPL grammar in Backus-Naur form
Carnegie Mellon, Drexel, UIUC
17
Compiling SPL to Code Using Templates
for i=0..n-1for j=0..m-1
y[i+n*j]=x[m*i+j]
y[0:1:n-1] = call A(x[0:1:n-1])y[n:1:n+m-1] = call B(x[n:1:n+m-1])
for i=0..n-1y[im:1:im+m-1] = call B(x[im:1:im+m-1])
for i=0..n-1y[im:1:im+m-1] = call B(x[i:n:i+m-1])
Carnegie Mellon, Drexel, UIUC
18
2B 33 0F ...
...for (int j=0; j<=3; j++) {
y[j] = C1*x[j] + C2*x[j+4];y[j+4] = C1*x[j] – C2*x[j+4];
}...
...for (int j=0; j<=3; j++) {
y[2*j] = x[j] + x[j+4];y[2*j+1] = x[j] - x[j+4];
} ...
SPIRAL Architecture [IEEE 05]
Formula Generator
SPL Compiler
Sear
ch E
ngin
e
AdaptedImplementation
Timer
C Compiler
0A FF C4 ...
100 ns80 ns
SPIR
AL
DFT_8.c80 ns
Approach: Empirical search over alternative recursive algorithms
Transform
Carnegie Mellon, Drexel, UIUC
DSP Algorithms: Transforms & Breakdown Rules
Combining these rules yields many algorithms for every given transform
Carnegie Mellon, Drexel, UIUC
Outline
Introduction
PreSPIRAL
SPIRAL (straightline code)
SPIRAL (loops) ∑-SPL
SPIRAL (vector, parallel, and more)
SPIRAL (library generation)
Operator Language
Carnegie Mellon, Drexel, UIUC
21
void sub(double *y, double *x) {double t[8];for (int i=0; i<=7; i++)
t[(i/4)+2*(i%4)] = x[i];for (int i=0; i<4; i++){
y[2*i] = t[2*i] + t[2*i+1];y[2*i+1] = t[2*i] - t[2*i+1];
}}
Problem: Fusing Permutations and Loops
void sub(double *y, double *x) {for (int j=0; j<=3; j++){
y[2*j] = x[j] + x[j+4];y[2*j+1] = x[j] - x[j+4];
}}
direct mapping
C compiler cannot do this
State-of-the-artSPIRAL: Hardcoded with templatesFFTW: Hardcoded in the infrastructure
Two passes over the working setComplex index computation
One pass over the working setSimple index computation
How does hardcoding scale?
Carnegie Mellon, Drexel, UIUC
22
New Approach for Loop Merging [PLDI 05]
SPL To Σ-SPL
Loop Merging
Index Simplification
SPL formula
Σ-SPL formula
Σ-SPL formula
Σ-SPL formula
Σ-SPL Compiler
Code
Formula Generator
SPL Compiler
Sear
ch E
ngin
e
Timer
C Compiler
AdaptedImplementation
Transform
New SPL Compiler
Carnegie Mellon, Drexel, UIUC
23
Σ−SPL Four central constructs: Σ, G, S, Perm Σ (sum) – makes loops explicit Gf (gather) – reads data using the index mapping f Sf (scatter) – writes data using the index mapping f Permf – permutes data using the index mapping f
Every Σ-SPL formula still represents a matrix factorization
Input Output
j=0
j=1
j=2
j=3
F2
Example:
Carnegie Mellon, Drexel, UIUC
24
Loop Merging With Rewriting Rules
SPL To Σ-SPL
Loop Merging
Index Simplification
SPL
Σ-SPL
Σ-SPL
Σ-SPL
Σ-SPL Compiler
Code
for (int j=0; j<=3; j++) {y[2*j] = x[j] + x[j+4];y[2*j+1] = x[j] - x[j+4];
}
F2
XY
Rules:
F2
XY T
Carnegie Mellon, Drexel, UIUC
Outline
Introduction
PreSPIRAL
SPIRAL (straightline code)
SPIRAL (loops)
SPIRAL (vector, parallel, and more) Tagged SPL
SPIRAL (library generation)
Operator Language
Carnegie Mellon, Drexel, UIUC
SPL to Shared Memory Code: Basic Idea [SC 06]
Governing construct: tensor product
p-way embarrassingly parallel, load-balanced
AAAA
x y
Processor 0
Processor 1
Processor 2
Processor 3
Problematic construct: permutations produce false sharing
Task: Rewrite formulas to extract tensor product + keep contiguous blocks
x y
Carnegie Mellon, Drexel, UIUC
Parallelization by Rewriting
Load-balancedNo false sharing
coarseplatform
model
Carnegie Mellon, Drexel, UIUC
Same Approach for Other Parallel ParadigmsVectorization [VecPar 06]Message Passing [ISPA 06]
Cg/OpenGL for GPUs Verilog for FPGAs [DAC 08]
MPI
Carnegie Mellon, Drexel, UIUC
Example Results
Multicore [SC 06]
Code written by the computer is faster (for many sizes) than any human-written code
Carnegie Mellon, Drexel, UIUC
Outline
Introduction
PreSPIRAL
SPIRAL (straightline code)
SPIRAL (loops)
SPIRAL (vector, parallel, and more)
SPIRAL (library generation) From Rules to complete library infrastructure
Operator Language
Carnegie Mellon, Drexel, UIUC
FFTW
Tasks To Be Automatedblue = hand-written, red = generated
Library infrastructure
2 3 4 5 64326
CodeletsCodelet
generator
Recursive functions neededVectorization (partially)ThreadingAdaptation mechanism (plan)
Our GoalLibrary infrastructure
2 3 4 5 64326
Codelets (10–20 types)
Recursive functions neededVectorizationThreadingAdaptation mechanism (plan)
(10–20 types)
Spiral
Carnegie Mellon, Drexel, UIUC
How Library Generation Works [Vor 08]
Library Structure
Parallelization / Vectorization
Recursion Step Closure
High-performance library
Transforms + Breakdown rules
recursion step closureas Σ-SPL formulas
Library Target(FFTW, VSIPL, IPP FFT, ...)
Library Implementation
Build library plan
Hot/cold partition
Generate target code
32
Carnegie Mellon, Drexel, UIUC
Cooley-Tukey Fast Fourier Transform (FFT)
Breakdown Rules to Library Code
2 extra functions needed 33
void dft(int n, cplx X[], cplx Y[]) {k = choose_factor(n); m = n/k;
Z = permute(X)
for i=0 to k-1dft_subvec(m, Z, Y, …)
for i=0 to n-1 Y[i] = Y[i]*T[i];
for i=0 to m-1 dft_strided(k, Y, Y, …)
}
DFT
Naive implementation
k=4
Carnegie Mellon, Drexel, UIUC
Cooley-Tukey Fast Fourier Transform (FFT)
Breakdown Rules to Library Code
void dft(int n, cplx X[], cplx Y[]) {k = choose_factor(n); m = n/k;
for i=0 to k-1dft_strided2(m, X, Y, …)
for i=0 to m-1dft_strided3_scaled(k, Y, Y, T, …)
}
2 extra functions needed 34
void dft(int n, cplx X[], cplx Y[]) {k = choose_factor(n); m = n/k;Z = permute(X)
for i=0 to k-1dft_subvec(m, Z, Y, …)
for i=0 to n-1 Y[i] = Y[i]*T[i];
for i=0 to m-1 dft_strided(k, Y, Y, …)
}
DFT
Optimized implementation Naive implementation
2 extra functions neededHow to discover these specialized variants automatically?
Carnegie Mellon, Drexel, UIUC
Computing Recursion Step Closure Input: transform T and a breakdown rule
Output: spawned recursion steps + Σ-SPL implementation
Algorithm:1. Apply the breakdown rule
2. Convert to Σ-SPL
3. Apply loop merging + index simplification rules.
4. Extract recursion steps
5. Repeat until closure is reached
35Parameterization (not shown) derives the independent parameter setfor each recursion step
Carnegie Mellon, Drexel, UIUC
Recursion Step Closure Examples
DFT (scalar)
DCT4 (vectorized)
36
4 mutually recursive functions- computed automatically- described using Σ-SPL formulas
17 mutually recursive functions
Carnegie Mellon, Drexel, UIUC
Benchmark: Complex DFT
0
2
4
6
8
10
12
14
16
4 16 64 256 1k 4k 16k 64k 256k 1Minput size
Complex DFT [Intel Core i7, 2.66 GHz, 4 cores]Performance, Gflop/s
Spiral, 4 threads
Intel IPP 6.0 Spiral, 1 thread
FFTW 3.2
Carnegie Mellon, Drexel, UIUC
Examples of Generated Libraries
• 2-way vectorized, 2-threaded• Most are faster than hand-written libs• Code size: 8–120 KLOC or 0.5–5 MB• Generation time: 1–3 hours
DFT
RDFT DHT
DCT2 DCT3 DCT4
Filter Wavelet
Total: 300 KLOC / 13.3 MB of code generated in < 20 hoursfrom a few simple algorithm specs
Intel IPP library 6.0 includes Spiral generated code 38
Carnegie Mellon, Drexel, UIUC
Outline
Introduction
PreSPIRAL
SPIRAL (straightline code)
SPIRAL (loops)
SPIRAL (vector, parallel, and more)
SPIRAL (library generation)
Operator Language Beyond transforms
Carnegie Mellon, Drexel, UIUC
Operators [DSL 09]
Definition Operator: Multiple complex vectors ! multiple complex vectors Higher-dimensional data is linearized Operators are potentially nonlinear
Example: Matrix-matrix-multiplication (MMM)
A
B
C
Carnegie Mellon, Drexel, UIUC
Example: Matrix Multiplication (MMM)Breakdown rules:capture various forms of blocking
Carnegie Mellon, Drexel, UIUC
Matrix Multiplication Library
MKL 10.0
GotoBLAS 1.26
Spiral-generated library
MKL 10.0
GotoBLAS 1.26
Spiral-generatedlibrary
0123456789
2 4 8 16 32 64 128 256 512
performance [Gflop/s] Dual Intel Xeon 5160, 3Ghz
Rank-k Update, double precision, k=4
Input size02468
1012141618
2 4 8 16 32 64 128 256 512
performance [Gflop/s] Dual Intel Xeon 5160, 3GhzRank-k Update, single precision, k=4
Input size
Spiral-generated library
MKL 10.0
Spiral-generated library
MKL 10.0
Carnegie Mellon, Drexel, UIUC
References
[CSSP 90] J. R. Johnson, R. W Johnson, D. Rodriguez, and R. Tolimieri, A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures, Circuits Systems Signal Process., (9)4: 449-500, 1990.
[JSC 91] R. W. Johnson, C. H. Huang and J. R. Johnson, Multilinear algebra and parallel programming, The Journal of Supercomputing, Vol. 5, Oct. 1991.
[IEEE 05] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proc. of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", Vol. 93, No. 2, pp. 232- 275, 2005.
[IPDPS 02] Kang Chen and Jeremy Johnson, A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 58-63, 2002.
[SC 06] Franz Franchetti, Yevgen Voronenko and Markus Püschel, FFT Program Generation for Shared Memory: SMP and Multicore, Proc. Supercomputing (SC), 2006.
[ISPA 06] Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber, Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers, Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes In Computer Science, Springer, Vol. 4330, pp. 818-832, 2006.
Carnegie Mellon, Drexel, UIUC
References [VecPar 08] Franz Franchetti, Yevgen Voronenko and Markus Püschel, A Rewriting
System for the Vectorization of Signal Transforms, Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, pp. 363-377, 2006.
[DAC 08] Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel, Formal Datapath Representation and Manipulation for Implementing DSP Transforms, Proc. Design Automation Conference (DAC), 2008.
[IPDPS 04] Jeremy Johnson and Kang Chen, A Self-Adapting Distributed Memory Package for Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004.
[PLDI 01] Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua, SPL: A Language and Compiler for DSP Algorithms, Proc. Programming Languages Design and Implementation (PLDI), pp. 298-308, 2001.
[PLDI 05] Franz Franchetti, Yevgen Voronenko and Markus Püschel, Formal Loop Merging for Signal Transforms, Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326 , 2005.
[Vor 08] Yevgen Voronenko, Library Generation for Linear Transforms, PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008.
[DSL 09] Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel, Operator Language: A Program Generation Framework for Fast Kernels, Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, pp. 385-410, 2009,