
Carnegie Mellon, Drexel, UIUC

Jeremy Johnson (& SPIRAL Team), Drexel University

* The work is supported by DARPA DESA, NSF, and Intel. Material for presentation provided by Franz Franchetti, Yevgen Voronenko, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel and the rest of the SPIRAL team.

SPIRAL Overview: Automatic Generation of DSP Algorithms & More*

CS 650 Program Generation and Opt.

Outline

Introduction
  Challenge and goal

PreSPIRAL
  Formulas, programs, and rewrite rules
  Multilinear Algebra and Block Recursive Algorithms

SPIRAL (straightline code)
  SPL and Ruletrees

SPIRAL (loops)
  Σ-SPL

SPIRAL (vector, parallel, and more)
  Tagged SPL

SPIRAL (library generation)

Operator Language
  Beyond transforms

Introduction: Challenge and goal

Challenge of Obtaining Efficient Code

Memory hierarchy: 5x

Vector instructions: 3x

Multiple threads: 2x

High performance library development has become a nightmare

Spiral Overview

Research Goal: “Teach” computers to write fast libraries
  Complete automation of implementation and optimization
  Including vectorization, parallelization

Functionality:
  Linear transforms (discrete Fourier transform, filters, wavelets)
  BLAS
  SAR imaging
  En/decoding (Viterbi, EBCOT in JPEG2000)
  … more

Platforms: Desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid

Collaboration with Intel (Kuck, Tang, Sabanin)
  Parts of MKL/IPP generated with Spiral
  IPP 6.0: ippg domain for Spiral generated code

SPIRAL: Abstraction Levels

Transform: problem specification

Ruletree: easy manipulation for search

SPL: vector/parallel optimizations

Σ-SPL: loop optimizations

C Code: C compiler
  …for (i=0; i<8; i++) {t[2i+4]=x[i-4]+x[i+4];… }

Side labels in the figure: traditional human domain; tools available

Program Generation in Spiral (Sketched)

Transform (user specified)

Fast algorithm in SPL (many choices)

Σ-SPL [PLDI 05]

C Code

Optimizations named in the figure: parallelization, vectorization, loop optimizations, constant folding, scheduling, …

Iteration of this process to search for the fastest

But that’s not all … Optimization at all abstraction levels

PreSPIRAL

Formulas, programs, and rewrite rules [CSSP 90]

Multilinear Algebra and Block Recursive Algorithms [JSC 91]

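DSP Algorithms & Matrix Factorizations

Transform = matrix: the output vector (signal) is the transform matrix applied to the input vector (signal), $y = T x$.

Example, the Fourier transform of size 4 as a product of structured sparse matrices:

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

$$
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
$$

Building blocks: Fourier transform ($\mathrm{DFT}$), identity ($I$), permutation ($L$), diagonal matrix of twiddles ($T$), Kronecker product ($\otimes$).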

Rewrite Rules & Formula Manipulation


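Rules and Dynamic Programming

Ruletrees for DFT(16), each applying the Cooley-Tukey (CT) rule at the root:

DFT(16) → DFT(2), DFT(8)

DFT(16) → DFT(4), DFT(4)

DFT(16) → DFT(8), DFT(2)

Building blocks of the CT rule: identity, permutation, Kronecker product, diagonal matrix (twiddles)

Dynamic Programming (DP) searches over the different applications of the CT rule, reusing the best implementation found for each recursive call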
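
The DP search over CT splits can be sketched in C as follows. This is only an illustration of the memoized structure, not SPIRAL's search engine: SPIRAL times generated code, whereas here a hypothetical operation-count model stands in for the timer. The names `dp_best_cost`, `best_cost`, and `best_split` are assumptions made for this sketch.

```c
#include <limits.h>

#define MAX_LOG2 20

/* Memo tables indexed by e, where the transform size is n = 2^e (1 <= e <= MAX_LOG2). */
static long best_cost[MAX_LOG2 + 1];   /* 0 means "not computed yet" */
static int  best_split[MAX_LOG2 + 1];  /* chosen a in the split 2^e = 2^a * 2^(e-a) */

/* Hypothetical cost model standing in for a real timer:
   DFT(2) costs 2 operations, and one CT step for n = m*k costs
   k * cost(DFT(m)) + m * cost(DFT(k)) + n twiddle multiplications. */
long dp_best_cost(int e)
{
    if (e == 1) return 2;                        /* base case: DFT(2) */
    if (best_cost[e] != 0) return best_cost[e];  /* reuse the best result found earlier */

    long best = LONG_MAX;
    for (int a = 1; a < e; a++) {                /* try every CT split 2^a * 2^(e-a) */
        int b = e - a;
        long c = (1L << b) * dp_best_cost(a)
               + (1L << a) * dp_best_cost(b)
               + (1L << e);
        if (c < best) { best = c; best_split[e] = a; }
    }
    best_cost[e] = best;
    return best;
}
```

Under this symmetric model all splits of a given size tie, so the sketch shows only how sub-results are reused. Replacing the cost expression with a measured runtime of generated code would give the kind of search the slide describes.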

Formulas to Code

t(0) = x(0); t(1) = x(2); t(2) = x(1); t(3) = x(3);
u(0) = t(0) + t(1); u(1) = t(0) - t(1);
u(2) = t(2) + t(3); u(3) = t(2) - t(3);
v(0) = u(0); v(1) = u(1); v(2) = u(2); v(3) = i*u(3);
y(0) = v(0) + v(2); y(2) = v(0) - v(2);
y(1) = v(1) + v(3); y(3) = v(1) - v(3);

Simplify →

u(0) = x(0) + x(2); u(1) = x(0) - x(2);
u(2) = x(1) + x(3); u(3) = x(1) - x(3);
u(3) = i*u(3);
y(0) = u(0) + u(2); y(2) = u(0) - u(2);
y(1) = u(1) + u(3); y(3) = u(1) - u(3);
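
As a hedged illustration, the simplified straight-line code can be written as compilable C. The sketch takes u(1) = x(0) - x(2), which is what substituting t(0) = x(0) and t(1) = x(2) into the first listing yields, and it keeps the slide's twiddle factor i. The function name `dft4` and the split real/imaginary array layout are assumptions, not part of the original slides.

```c
/* 4-point DFT as straight-line code.
   Complex numbers are stored as separate real (re) and imaginary (im) arrays;
   this layout and the name dft4 are illustrative assumptions. */
void dft4(const double xre[4], const double xim[4], double yre[4], double yim[4])
{
    /* butterflies on the permuted input: u = (I2 (x) F2) L x */
    double u0r = xre[0] + xre[2], u0i = xim[0] + xim[2];
    double u1r = xre[0] - xre[2], u1i = xim[0] - xim[2];
    double u2r = xre[1] + xre[3], u2i = xim[1] + xim[3];
    double u3r = xre[1] - xre[3], u3i = xim[1] - xim[3];

    /* twiddle: u3 = i * u3, using (a + bi) * i = -b + ai */
    double t = u3r;
    u3r = -u3i;
    u3i = t;

    /* final butterflies: y = (F2 (x) I2) u */
    yre[0] = u0r + u2r; yim[0] = u0i + u2i;
    yre[2] = u0r - u2r; yim[2] = u0i - u2i;
    yre[1] = u1r + u3r; yim[1] = u1i + u3i;
    yre[3] = u1r - u3r; yim[3] = u1i - u3i;
}
```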

Formulas and Architectures

Parallel Operation
  [Figure: four blocks A map x to y, one block per processor (Processor 0 to Processor 3)]

Vector Operation
  [Figure: blocks labeled A4]

Task: Rewrite formulas to introduce varying amounts of vectorization and parallelism and modify access patterns


Multilinear Algebra & Block Recursive Algs

Matrix Multiplication

Strassen’s Algorithm

SPIRAL (straightline code): SPL and Ruletrees

SPL (Signal Processing Language) [PLDI 01]

SPL expresses transform algorithms as structured sparse matrix factorization

Examples:

SPL grammar in Backus-Naur form

Compiling SPL to Code Using Templates

for i=0..n-1
  for j=0..m-1
    y[i+n*j]=x[m*i+j]

y[0:1:n-1] = call A(x[0:1:n-1])
y[n:1:n+m-1] = call B(x[n:1:n+m-1])

for i=0..n-1
  y[im:1:im+m-1] = call B(x[im:1:im+m-1])

for i=0..n-1
  y[im:1:im+m-1] = call B(x[i:n:i+m-1])
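
Two of the templates above can be made concrete in C. This is a minimal sketch under stated assumptions: the first template is read as a stride permutation, the block template is read as applying B to n consecutive blocks, and B is fixed to the 2-point butterfly so the example stays self-contained. The names `stride_perm`, `f2`, and `blocks_f2` are hypothetical.

```c
/* Stride permutation template: y[i + n*j] = x[m*i + j]. */
void stride_perm(int n, int m, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            y[i + n * j] = x[m * i + j];
}

/* B is fixed to the 2-point butterfly F2 for illustration (so m = 2). */
static void f2(const double *x, double *y)
{
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

/* Block template: y[i*m : i*m+m-1] = B(x[i*m : i*m+m-1]) for i = 0..n-1, with m = 2. */
void blocks_f2(int n, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        f2(x + 2 * i, y + 2 * i);
}
```

Calling `stride_perm` into a temporary array and then `blocks_f2` on that array gives the two-pass structure that a direct template expansion produces.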

SPIRAL Architecture [IEEE 05]

Approach: Empirical search over alternative recursive algorithms

[Figure: the SPIRAL feedback loop. Components: Transform (input), Formula Generator, SPL Compiler, C Compiler, Timer, Search Engine; output: adapted implementation. Sample annotations: machine code bytes (2B 33 0F ..., 0A FF C4 ...), timings (100 ns, 80 ns), and the result DFT_8.c at 80 ns.]

Candidate code fragments shown in the figure:

...
for (int j=0; j<=3; j++) {
  y[j] = C1*x[j] + C2*x[j+4];
  y[j+4] = C1*x[j] - C2*x[j+4];
}
...

...
for (int j=0; j<=3; j++) {
  y[2*j] = x[j] + x[j+4];
  y[2*j+1] = x[j] - x[j+4];
}
...


DSP Algorithms: Transforms & Breakdown Rules

Combining these rules yields many algorithms for every given transform

SPIRAL (loops): Σ-SPL

Problem: Fusing Permutations and Loops

Direct mapping (two passes over the working set, complex index computation):

void sub(double *y, double *x) {
  double t[8];
  for (int i=0; i<=7; i++)
    t[(i/4)+2*(i%4)] = x[i];
  for (int i=0; i<4; i++) {
    y[2*i] = t[2*i] + t[2*i+1];
    y[2*i+1] = t[2*i] - t[2*i+1];
  }
}

Fused version (one pass over the working set, simple index computation); a C compiler cannot derive this from the direct mapping:

void sub(double *y, double *x) {
  for (int j=0; j<=3; j++) {
    y[2*j] = x[j] + x[j+4];
    y[2*j+1] = x[j] - x[j+4];
  }
}

State of the art:
  SPIRAL: Hardcoded with templates
  FFTW: Hardcoded in the infrastructure

How does hardcoding scale?

New Approach for Loop Merging [PLDI 05]

New SPL Compiler:

SPL formula → [SPL To Σ-SPL] → Σ-SPL formula → [Loop Merging] → Σ-SPL formula → [Index Simplification] → Σ-SPL formula → [Σ-SPL Compiler] → Code

[Figure: the new SPL compiler takes the place of the SPL Compiler in the SPIRAL loop of Transform, Formula Generator, SPL Compiler, C Compiler, Timer, and Search Engine, which produces the adapted implementation.]

Σ-SPL

Four central constructs: Σ, G, S, Perm
  Σ (sum): makes loops explicit
  G_f (gather): reads data using the index mapping f
  S_f (scatter): writes data using the index mapping f
  Perm_f: permutes data using the index mapping f

Every Σ-SPL formula still represents a matrix factorization
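
A hedged C sketch of how the constructs read as loop code. It assumes a formula of the form Σ_{j=0..3} S_j F2 G_j in which the gather reads x[j] and x[j+4] and the scatter writes y[2j] and y[2j+1], matching the loop `y[2*j] = x[j] + x[j+4]` that appears elsewhere in these slides. The helper names `gather_idx`, `scatter_idx`, and `sigma_spl_example` are hypothetical.

```c
/* Index mappings for iteration j and local position k in {0, 1}. */
static int gather_idx(int j, int k)  { return j + 4 * k; }  /* reads x[j], x[j+4]    */
static int scatter_idx(int j, int k) { return 2 * j + k; }  /* writes y[2j], y[2j+1] */

/* y = sum over j = 0..3 of S_j * F2 * G_j * x, written as a single loop. */
void sigma_spl_example(const double x[8], double y[8])
{
    for (int j = 0; j < 4; j++) {        /* Sigma: the loop is explicit      */
        double a = x[gather_idx(j, 0)];  /* G: gather the two inputs         */
        double b = x[gather_idx(j, 1)];
        y[scatter_idx(j, 0)] = a + b;    /* F2 kernel, then S: scatter       */
        y[scatter_idx(j, 1)] = a - b;
    }
}
```

Because the index mappings are ordinary functions of j, composing a permutation into the gather only changes `gather_idx`; no extra pass over the data is needed.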

Example:

[Figure: input and output vectors; each iteration j=0, j=1, j=2, j=3 applies F2]

Loop Merging With Rewriting Rules

SPL → [SPL To Σ-SPL] → Σ-SPL → [Loop Merging] → Σ-SPL → [Index Simplification] → Σ-SPL → [Σ-SPL Compiler] → Code

Resulting code:

for (int j=0; j<=3; j++) {
  y[2*j] = x[j] + x[j+4];
  y[2*j+1] = x[j] - x[j+4];
}

Rules:

[Figure: rewriting rules and dataflow diagrams over F2, X, Y, and T]

SPIRAL (vector, parallel, and more): Tagged SPL

SPL to Shared Memory Code: Basic Idea [SC 06]

Governing construct: tensor product

p-way embarrassingly parallel, load-balanced

[Figure: four blocks A map x to y, one block per processor (Processor 0 to Processor 3)]

Problematic construct: permutations produce false sharing

Task: Rewrite formulas to extract tensor product + keep contiguous blocks
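
The tensor-product case can be sketched in C as a loop over independent contiguous blocks. This is an illustration under assumptions: the slides do not say how threads are created, so the OpenMP directive appears only as a comment, and the names `kernel_fn`, `parallel_blocks`, and `running_sum` are hypothetical.

```c
/* Kernel type: applies some transform A to a block of m elements. */
typedef void (*kernel_fn)(int m, const double *x, double *y);

/* y = (I_p (x) A) x: block i of x (m contiguous elements) is transformed
   independently, so each of the p iterations can run on its own processor
   and no two iterations write to the same block of y. */
void parallel_blocks(int p, int m, kernel_fn A, const double *x, double *y)
{
    /* With OpenMP one could place "#pragma omp parallel for" here (assumption). */
    for (int i = 0; i < p; i++)
        A(m, x + i * m, y + i * m);
}

/* Hypothetical example kernel: running sum over the block. */
void running_sum(int m, const double *x, double *y)
{
    double s = 0.0;
    for (int k = 0; k < m; k++) {
        s += x[k];
        y[k] = s;
    }
}
```

Each iteration touches one contiguous block of x and y. A permutation in front of this loop would instead spread one processor's accesses across the vector, which is the situation the slide flags as producing false sharing.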


Parallelization by Rewriting

Load-balanced, no false sharing

Coarse platform model

Same Approach for Other Parallel Paradigms

Vectorization [VecPar 06]

Message Passing [ISPA 06] (MPI)

Cg/OpenGL for GPUs

Verilog for FPGAs [DAC 08]


Example Results

Multicore [SC 06]

Code written by the computer is faster (for many sizes) than any human-written code

SPIRAL (library generation): From rules to complete library infrastructure

Tasks To Be Automated (blue = hand-written, red = generated)

FFTW
  Library infrastructure: recursive functions needed, vectorization (partially), threading, adaptation mechanism (plan)
  Codelets (10–20 types) for small sizes (2, 3, 4, 5, ...), produced by a codelet generator

Our Goal (Spiral)
  Library infrastructure: recursive functions needed, vectorization, threading, adaptation mechanism (plan)
  Codelets (10–20 types) for small sizes (2, 3, 4, 5, ...)

How Library Generation Works [Vor 08]

Input: Transforms + Breakdown rules; Library Target (FFTW, VSIPL, IPP FFT, ...)

Library Structure
  Recursion Step Closure
  Parallelization / Vectorization
  Result: recursion step closure as Σ-SPL formulas

Library Implementation
  Build library plan
  Hot/cold partition
  Generate target code

Output: High-performance library

Cooley-Tukey Fast Fourier Transform (FFT)

Breakdown Rules to Library Code

Naive implementation (2 extra functions needed):

void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  Z = permute(X)
  for i=0 to k-1
    dft_subvec(m, Z, Y, …)
  for i=0 to n-1
    Y[i] = Y[i]*T[i];
  for i=0 to m-1
    dft_strided(k, Y, Y, …)
}

Optimized implementation (2 extra functions needed):

void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  for i=0 to k-1
    dft_strided2(m, X, Y, …)
  for i=0 to m-1
    dft_strided3_scaled(k, Y, Y, T, …)
}

[Figure: DFT dataflow for k=4]

How to discover these specialized variants automatically?

Computing Recursion Step Closure

Input: transform T and a breakdown rule

Output: spawned recursion steps + Σ-SPL implementation

Algorithm:

1. Apply the breakdown rule

2. Convert to Σ-SPL

3. Apply loop merging + index simplification rules

4. Extract recursion steps

5. Repeat until closure is reached

Parameterization (not shown) derives the independent parameter set for each recursion step

Recursion Step Closure Examples

DFT (scalar): 4 mutually recursive functions
  computed automatically
  described using Σ-SPL formulas

DCT4 (vectorized): 17 mutually recursive functions

Benchmark: Complex DFT

[Plot: Complex DFT on Intel Core i7, 2.66 GHz, 4 cores. y-axis: performance in Gflop/s (0 to 16); x-axis: input size (4 to 1M). Series: Spiral, 4 threads; Spiral, 1 thread; Intel IPP 6.0; FFTW 3.2.]

Examples of Generated Libraries

Transforms: DFT, RDFT, DHT, DCT2, DCT3, DCT4, Filter, Wavelet

• 2-way vectorized, 2-threaded
• Most are faster than hand-written libs
• Code size: 8–120 KLOC or 0.5–5 MB
• Generation time: 1–3 hours

Total: 300 KLOC / 13.3 MB of code generated in < 20 hours from a few simple algorithm specs

Intel IPP library 6.0 includes Spiral generated code

Operator Language: Beyond transforms

Operators [DSL 09]

Definition
  Operator: Multiple complex vectors → multiple complex vectors
  Higher-dimensional data is linearized
  Operators are potentially nonlinear

Example: Matrix-matrix-multiplication (MMM)

[Figure: matrices A, B, and C]

Operator Language

Example: Matrix Multiplication (MMM)

Breakdown rules: capture various forms of blocking
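
One form of blocking can be sketched in C as follows. This is a plain loop-nest illustration of the idea that a large MMM breaks down into MMMs on smaller blocks; it is not generated Spiral code, and the name `mmm_blocked`, the row-major layout, and the block-size parameter `nb` are assumptions.

```c
/* C += A * B for n x n row-major matrices, computed block by block.
   nb is the block size; n is assumed to be a multiple of nb. */
void mmm_blocked(int n, int nb, const double *A, const double *B, double *C)
{
    for (int i0 = 0; i0 < n; i0 += nb)
        for (int j0 = 0; j0 < n; j0 += nb)
            for (int k0 = 0; k0 < n; k0 += nb)
                /* smaller MMM on nb x nb blocks: the recursive call of a breakdown rule */
                for (int i = i0; i < i0 + nb; i++)
                    for (int j = j0; j < j0 + nb; j++) {
                        double s = C[i * n + j];
                        for (int k = k0; k < k0 + nb; k++)
                            s += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = s;
                    }
}
```

Different choices of which dimensions to block, and in what order, correspond to different breakdown rules; a search can then pick among them per platform.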

Matrix Multiplication Library

[Plots: Rank-k update, k=4, on Dual Intel Xeon 5160, 3 GHz. Performance in Gflop/s versus input size (2 to 512). Double precision: y-axis 0 to 9. Single precision: y-axis 0 to 18. Series labels: Spiral-generated library, MKL 10.0, GotoBLAS 1.26.]

References

[CSSP 90] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures, Circuits Systems Signal Process., 9(4): 449-500, 1990.

[JSC 91] R. W. Johnson, C. H. Huang and J. R. Johnson, Multilinear algebra and parallel programming, The Journal of Supercomputing, Vol. 5, Oct. 1991.

[IEEE 05] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proc. of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", Vol. 93, No. 2, pp. 232-275, 2005.

[IPDPS 02] Kang Chen and Jeremy Johnson, A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 58-63, 2002.

[SC 06] Franz Franchetti, Yevgen Voronenko and Markus Püschel, FFT Program Generation for Shared Memory: SMP and Multicore, Proc. Supercomputing (SC), 2006.

[ISPA 06] Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber, Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers, Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes In Computer Science, Springer, Vol. 4330, pp. 818-832, 2006.

[VecPar 06] Franz Franchetti, Yevgen Voronenko and Markus Püschel, A Rewriting System for the Vectorization of Signal Transforms, Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, pp. 363-377, 2006.

[DAC 08] Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel, Formal Datapath Representation and Manipulation for Implementing DSP Transforms, Proc. Design Automation Conference (DAC), 2008.

[IPDPS 04] Jeremy Johnson and Kang Chen, A Self-Adapting Distributed Memory Package for Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004.

[PLDI 01] Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua, SPL: A Language and Compiler for DSP Algorithms, Proc. Programming Languages Design and Implementation (PLDI), pp. 298-308, 2001.

[PLDI 05] Franz Franchetti, Yevgen Voronenko and Markus Püschel, Formal Loop Merging for Signal Transforms, Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326, 2005.

[Vor 08] Yevgen Voronenko, Library Generation for Linear Transforms, PhD thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008.

[DSL 09] Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel, Operator Language: A Program Generation Framework for Fast Kernels, Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, pp. 385-410, 2009.