spiral overview: automatic generation of dsp …jjohnson/2009-10/winter/cs650/lectures/... ·...

Carnegie Mellon, Drexel, UIUC

Jeremy Johnson (& SPIRAL Team)Drexel University

* The work is supported by DARPA DESA, NSF, and Intel.Material for presentation provided by Franz Francheti, Yevgen Voronenko, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel and the rest of the SPIRAL team.

SPIRAL Overview: Automatic Generation of DSP Algorithms & More*

CS 650 Program Generation and Opt.


Outline Introduction Challenge and goal

PreSPIRAL Formulas, programs, and rewrite rules

Multilinear Algebra and Block Recursive Algorithms

SPIRAL (straightline code) SPL and Ruletrees

SPIRAL (loops) ∑-SPL

SPIRAL (vector, parallel, and more) Tagged SPL

SPIRAL (library generation)

Operator Language Beyond transforms


Outline

Introduction Challenge and goal

PreSPIRAL

SPIRAL (straightline code)

SPIRAL (loops)

SPIRAL (vector, parallel, and more)


Operator Language


Challenge of Obtaining Efficient Code

Memory hierarchy: 5x

Vector instructions: 3x

Multiple threads: 2x

High performance library development has become a nightmare


Spiral Overview Research Goal: “Teach” computers to write fast libraries Complete automation of implementation and optimization Including vectorization, parallelization

Functionality: Linear transforms (discrete Fourier transform, filters, wavelets) BLAS SAR imaging En/decoding (Viterbi, Ebcot in JPEG2000) … more

Platforms: Desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid

Collaboration with Intel (Kuck, Tang, Sabanin) Parts of MKL/IPP generated with Spiral IPP 6.0: ippg domain for Spiral generated code


SPIRAL: Abstraction Levels

Transform

Ruletree

SPL

Σ-SPL

C Code…for (i=0; i<8; i++) {t[2i+4]=x[i-4]+x[i+4];… }

problem specification

easy manipulation for search

vector/parallel optimizations

loop optimizations

C compiler

trad

ition

alhu

man

dom

ain

tool

sav

aila

ble


Program Generation in Spiral (Sketched)Transformuser specified

C Code:

Fast algorithmin SPL many choices

∑-SPL:[PLDI 05]

Iteration of this process to search for the fastest

But that’s not all …

parallelizationvectorization

loop optimizations

constant foldingscheduling……

Optimization at allabstraction levels


Outline

Introduction

PreSPIRAL Formulas, programs, and rewrite rules [CSSP 90]

Multilinear Algebra and Block Recursive Algorithms [JSC 91]


SPIRAL (loops)



Operator Language


DSP Algorithms & Matrix Factorizations

−

−

−−

=

−−−−−−

1000001001000001

1100110000110011

000010000100001

1010010110100101

111111

111111

iii

ii

Fourier transform

Identity Permutation

Diagonal matrix (twiddles)

Kronecker product

Input vector

Output vector

input vector (signal)output vector (signal) transform = matrix


Rewrite Rules & Formula Manipulation


Rules and Dynamic Programming

DFT(16)

DFT(2) DFT(8)

DFT(16)

DFT(4) DFT(4)

Identity PermutationKronecker product

DFT(16)

DFT(8) DFT(2)

Diagonal matrix (twiddles)

Dynamic Programming (DP) searches over different applications of CT Rule using best transform found for recursive calls


Formulas to Code

t(0) = x(0); t(1) = x(2); t(2) = x(1); t(3) = x(3)u(0) = t(0) + t(1); u(1) = t(0) – t(1);u(2) = t(2) + t(3); u(3) = t(2) – t(3);v(0) = u(0); v(1) = u(1); v(2) = u(2); v(3) = i*u(3);y(0) = v(0) + v(2); y(2) = v(0) - v(2);y(1) = v(1) + v(3); y(3) = v(1) - v(3);

Simplify →u(0) = x(0) + x(2); u(1) = x(0) – x(1);u(2) = x(1) + x(3); u(3) = x(1) – x(3);u(3) = i*u(3)y(0) = u(0) + u(2); y(2) = u(0) - u(2);y(1) = u(1) + u(3); y(3) = u(1) - u(3);


Formulas and Architectures

Parallel OperationAAAA

x y

Processor 0

Processor 1

Processor 2

Processor 3

Vector Operation

Task: Rewrite formulas to introducevarying amounts of vectorization and parallelismand modify access patterns

A4 A4 A4 A4

A4 A4 A4 A4


Multilinear Algebra & Block Recursive Algs

Matrix Multiplication

Strassen’s Algorithm


Outline

Introduction

PreSPIRAL

SPIRAL (straightline code) SPL and Ruletrees

SPIRAL (loops)



Operator Language


16

SPL (Signal Processing Language) [PLDI 01] SPL expresses transform algorithms as structured sparse

matrix factorization

Examples:

SPL grammar in Backus-Naur form


17

Compiling SPL to Code Using Templates

for i=0..n-1for j=0..m-1

y[i+n*j]=x[m*i+j]

y[0:1:n-1] = call A(x[0:1:n-1])y[n:1:n+m-1] = call B(x[n:1:n+m-1])

for i=0..n-1y[im:1:im+m-1] = call B(x[im:1:im+m-1])

for i=0..n-1y[im:1:im+m-1] = call B(x[i:n:i+m-1])


18

2B 33 0F ...

...for (int j=0; j<=3; j++) {

y[j] = C1*x[j] + C2*x[j+4];y[j+4] = C1*x[j] – C2*x[j+4];

}...

...for (int j=0; j<=3; j++) {

y[2*j] = x[j] + x[j+4];y[2*j+1] = x[j] - x[j+4];

} ...

SPIRAL Architecture [IEEE 05]

Formula Generator

SPL Compiler

Sear

ch E

ngin

e

AdaptedImplementation

Timer

C Compiler

0A FF C4 ...

100 ns80 ns

SPIR

AL

DFT_8.c80 ns

Approach: Empirical search over alternative recursive algorithms

Transform


DSP Algorithms: Transforms & Breakdown Rules

Combining these rules yields many algorithms for every given transform


Outline

Introduction

PreSPIRAL


SPIRAL (loops) ∑-SPL



Operator Language


21

void sub(double *y, double *x) {double t[8];for (int i=0; i<=7; i++)

t[(i/4)+2*(i%4)] = x[i];for (int i=0; i<4; i++){

y[2*i] = t[2*i] + t[2*i+1];y[2*i+1] = t[2*i] - t[2*i+1];

}}

Problem: Fusing Permutations and Loops

void sub(double *y, double *x) {for (int j=0; j<=3; j++){

y[2*j] = x[j] + x[j+4];y[2*j+1] = x[j] - x[j+4];

}}

direct mapping

C compiler cannot do this

State-of-the-artSPIRAL: Hardcoded with templatesFFTW: Hardcoded in the infrastructure

Two passes over the working setComplex index computation

One pass over the working setSimple index computation

How does hardcoding scale?


22

New Approach for Loop Merging [PLDI 05]

SPL To Σ-SPL

Loop Merging

Index Simplification

SPL formula

Σ-SPL formula

Σ-SPL formula

Σ-SPL formula

Σ-SPL Compiler

Code

Formula Generator

SPL Compiler

Sear

ch E

ngin

e

Timer

C Compiler

AdaptedImplementation

Transform

New SPL Compiler


23

Σ−SPL Four central constructs: Σ, G, S, Perm Σ (sum) – makes loops explicit Gf (gather) – reads data using the index mapping f Sf (scatter) – writes data using the index mapping f Permf – permutes data using the index mapping f

Every Σ-SPL formula still represents a matrix factorization

Input Output

j=0

j=1

j=2

j=3

F2

Example:


24

Loop Merging With Rewriting Rules

SPL To Σ-SPL

Loop Merging

Index Simplification

SPL

Σ-SPL

Σ-SPL

Σ-SPL

Σ-SPL Compiler

Code

for (int j=0; j<=3; j++) {y[2*j] = x[j] + x[j+4];y[2*j+1] = x[j] - x[j+4];

}

F2

XY

Rules:

F2

XY T


Outline

Introduction

PreSPIRAL


SPIRAL (loops)

SPIRAL (vector, parallel, and more) Tagged SPL


Operator Language


SPL to Shared Memory Code: Basic Idea [SC 06]

Governing construct: tensor product

p-way embarrassingly parallel, load-balanced

AAAA

x y

Processor 0

Processor 1

Processor 2

Processor 3

Problematic construct: permutations produce false sharing

Task: Rewrite formulas to extract tensor product + keep contiguous blocks

x y


Parallelization by Rewriting

Load-balancedNo false sharing

coarseplatform

model


Same Approach for Other Parallel ParadigmsVectorization [VecPar 06]Message Passing [ISPA 06]

Cg/OpenGL for GPUs Verilog for FPGAs [DAC 08]

MPI


Example Results

Multicore [SC 06]

Code written by the computer is faster (for many sizes) than any human-written code


Outline

Introduction

PreSPIRAL


SPIRAL (loops)


SPIRAL (library generation) From Rules to complete library infrastructure

Operator Language


FFTW

Tasks To Be Automatedblue = hand-written, red = generated

Library infrastructure

2 3 4 5 64326

CodeletsCodelet

generator

Recursive functions neededVectorization (partially)ThreadingAdaptation mechanism (plan)

Our GoalLibrary infrastructure

2 3 4 5 64326

Codelets (10–20 types)

Recursive functions neededVectorizationThreadingAdaptation mechanism (plan)

(10–20 types)

Spiral


How Library Generation Works [Vor 08]

Library Structure

Parallelization / Vectorization

Recursion Step Closure

High-performance library

Transforms + Breakdown rules

recursion step closureas Σ-SPL formulas

Library Target(FFTW, VSIPL, IPP FFT, ...)

Library Implementation

Build library plan

Hot/cold partition

Generate target code

32


Cooley-Tukey Fast Fourier Transform (FFT)

Breakdown Rules to Library Code

2 extra functions needed 33

void dft(int n, cplx X[], cplx Y[]) {k = choose_factor(n); m = n/k;

Z = permute(X)

for i=0 to k-1dft_subvec(m, Z, Y, …)

for i=0 to n-1 Y[i] = Y[i]*T[i];

for i=0 to m-1 dft_strided(k, Y, Y, …)

}

DFT

Naive implementation

k=4


Cooley-Tukey Fast Fourier Transform (FFT)

Breakdown Rules to Library Code

void dft(int n, cplx X[], cplx Y[]) {k = choose_factor(n); m = n/k;

for i=0 to k-1dft_strided2(m, X, Y, …)

for i=0 to m-1dft_strided3_scaled(k, Y, Y, T, …)

}

2 extra functions needed 34

void dft(int n, cplx X[], cplx Y[]) {k = choose_factor(n); m = n/k;Z = permute(X)

for i=0 to k-1dft_subvec(m, Z, Y, …)

for i=0 to n-1 Y[i] = Y[i]*T[i];

for i=0 to m-1 dft_strided(k, Y, Y, …)

}

DFT

Optimized implementation Naive implementation

2 extra functions neededHow to discover these specialized variants automatically?


Computing Recursion Step Closure Input: transform T and a breakdown rule

Output: spawned recursion steps + Σ-SPL implementation

Algorithm:1. Apply the breakdown rule

2. Convert to Σ-SPL

3. Apply loop merging + index simplification rules.

4. Extract recursion steps

5. Repeat until closure is reached

35Parameterization (not shown) derives the independent parameter setfor each recursion step


Recursion Step Closure Examples

DFT (scalar)

DCT4 (vectorized)

36

4 mutually recursive functions- computed automatically- described using Σ-SPL formulas

17 mutually recursive functions


Benchmark: Complex DFT

0

2

4

6

8

10

12

14

16

4 16 64 256 1k 4k 16k 64k 256k 1Minput size

Complex DFT [Intel Core i7, 2.66 GHz, 4 cores]Performance, Gflop/s

Spiral, 4 threads

Intel IPP 6.0 Spiral, 1 thread

FFTW 3.2


Examples of Generated Libraries

• 2-way vectorized, 2-threaded• Most are faster than hand-written libs• Code size: 8–120 KLOC or 0.5–5 MB• Generation time: 1–3 hours

DFT

RDFT DHT

DCT2 DCT3 DCT4

Filter Wavelet

Total: 300 KLOC / 13.3 MB of code generated in < 20 hoursfrom a few simple algorithm specs

Intel IPP library 6.0 includes Spiral generated code 38


Outline

Introduction

PreSPIRAL


SPIRAL (loops)



Operator Language Beyond transforms


Operators [DSL 09]

Definition Operator: Multiple complex vectors ! multiple complex vectors Higher-dimensional data is linearized Operators are potentially nonlinear

Example: Matrix-matrix-multiplication (MMM)

A

B

C


Operator Language


Example: Matrix Multiplication (MMM)Breakdown rules:capture various forms of blocking


Matrix Multiplication Library

MKL 10.0

GotoBLAS 1.26

Spiral-generated library

MKL 10.0

GotoBLAS 1.26

Spiral-generatedlibrary

0123456789

2 4 8 16 32 64 128 256 512

performance [Gflop/s] Dual Intel Xeon 5160, 3Ghz

Rank-k Update, double precision, k=4

Input size02468

1012141618

2 4 8 16 32 64 128 256 512

performance [Gflop/s] Dual Intel Xeon 5160, 3GhzRank-k Update, single precision, k=4

Input size


MKL 10.0


MKL 10.0


References

[CSSP 90] J. R. Johnson, R. W Johnson, D. Rodriguez, and R. Tolimieri, A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures, Circuits Systems Signal Process., (9)4: 449-500, 1990.

[JSC 91] R. W. Johnson, C. H. Huang and J. R. Johnson, Multilinear algebra and parallel programming, The Journal of Supercomputing, Vol. 5, Oct. 1991.

[IEEE 05] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proc. of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", Vol. 93, No. 2, pp. 232- 275, 2005.

[IPDPS 02] Kang Chen and Jeremy Johnson, A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 58-63, 2002.

[SC 06] Franz Franchetti, Yevgen Voronenko and Markus Püschel, FFT Program Generation for Shared Memory: SMP and Multicore, Proc. Supercomputing (SC), 2006.

[ISPA 06] Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber, Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers, Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes In Computer Science, Springer, Vol. 4330, pp. 818-832, 2006.


References [VecPar 08] Franz Franchetti, Yevgen Voronenko and Markus Püschel, A Rewriting

System for the Vectorization of Signal Transforms, Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, pp. 363-377, 2006.

[DAC 08] Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel, Formal Datapath Representation and Manipulation for Implementing DSP Transforms, Proc. Design Automation Conference (DAC), 2008.

[IPDPS 04] Jeremy Johnson and Kang Chen, A Self-Adapting Distributed Memory Package for Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004.

[PLDI 01] Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua, SPL: A Language and Compiler for DSP Algorithms, Proc. Programming Languages Design and Implementation (PLDI), pp. 298-308, 2001.

[PLDI 05] Franz Franchetti, Yevgen Voronenko and Markus Püschel, Formal Loop Merging for Signal Transforms, Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326 , 2005.

[Vor 08] Yevgen Voronenko, Library Generation for Linear Transforms, PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008.

[DSL 09] Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel, Operator Language: A Program Generation Framework for Fast Kernels, Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, pp. 385-410, 2009,

spiral overview: automatic generation of dsp …jjohnson/2009-10/winter/cs650/lectures/... ·...

Documents