Can We Teach Computers to Write Fast Libraries?
TRANSCRIPT
Can We Teach Computers
To Write Fast Libraries?
This work was supported by
DARPA DESA program, NSF-NGS/ITR, NSF-ACR, NSF-CPA, and Intel
Markus Püschel
Collaborators:
Franz Franchetti
Yevgen Voronenko
Electrical and
Computer Engineering
Carnegie Mellon University
… and the Spiral team (only part shown)
[Image: Hiroshige, Temple in park near Osaka, ca. 1853-59.]
The Problem: Example DFT
Standard desktop computer, vendor compiler, using optimization flags
All implementations have roughly the same operation count (~4n log2(n))
Maybe the DFT is just difficult?
[Plot: Discrete Fourier Transform (DFT) on 2 x Core 2 Duo, 3 GHz (single precision); x-axis: input size, 16 to 262,144; y-axis: performance (Gflop/s), 0 to 30. Numerical Recipes code vs. the best available code; the gap is 12x to 30x.]
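Since all implementations perform roughly the same number of operations, the plotted performance is, up to that constant, just inverse runtime. A minimal sketch of the conversion behind such plots (my formulation, assuming the ~4n log2(n) count above and runtime measured in seconds):

```latex
\text{performance [Gflop/s]} \;=\; \frac{4\,n\log_2(n)}{\text{runtime [s]}} \cdot 10^{-9}
```

So a 12x to 30x gap in performance at equal operation count is exactly a 12x to 30x gap in runtime.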
The Problem: Example MMM
Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, …
What’s going on?
[Plot: Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo, 3 GHz (double precision); x-axis: matrix size, 0 to 9,000; y-axis: performance (Gflop/s), 0 to 50. Naive triple loop vs. the best code (K. Goto); the gap is 160x.]
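For reference, the "triple loop" baseline in this plot is nothing more than the textbook definition. A minimal sketch (my illustration, row-major arrays assumed):

```c
/* Naive matrix-matrix multiply C = A * B for n x n row-major matrices.
   Correct, and performs the same 2n^3 flops as tuned code, but ignores
   caches, vector units, and multiple cores: hence the 160x gap. */
void mmm_triple_loop(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```

The 160x therefore comes entirely from how the same flops are scheduled onto the memory hierarchy, vector instructions, and cores.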
Evolution of Processors (Intel)
[Plot, built up over several animation steps: the evolution of Intel processors, leading into the era of parallelism.]
Era of parallelism
High-performance library development becomes increasingly difficult.
DFT Plot: Analysis
[Plot: the DFT performance plot from before (2 x Core 2 Duo, 3 GHz), annotated with the sources of the gap between Numerical Recipes and the best code:]
Memory hierarchy: 5x
Vector instructions: 3x
Multiple threads: 2x
(5 x 3 x 2 = 30x, the full gap.)
High performance library development has become a nightmare
Current Solution
Legions of programmers implement and optimize the same functionality for every platform, and again whenever a new platform comes out.
Better Solution: Automatic Performance Tuning
Automate (parts of) the implementation or optimization.
Research efforts:
Linear algebra: PHiPAC/ATLAS (UTK), Sparsity/BeBOP (Berkeley), FLAME (UT Austin)
Tensor computations (Ohio State)
PDE/finite elements: FEniCS
Adaptive sorting (UIUC)
Fourier transform: FFTW (MIT)
Linear transforms: Spiral
… others
New compiler techniques
Proceedings of the IEEE special issue, Feb. 2005
A promising new area, but more work is needed, in particular for parallelism.
Organization
Spiral: Brief overview
Parallelization in Spiral
Results
Conclusion
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan
Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W.
Johnson, and Nick Rizzolo, SPIRAL: Code Generation for DSP Transforms,
Proceedings of the IEEE 93(2), 2005
Vision Behind Spiral
[Diagram: two pipelines from numerical problem to computing platform, each with the steps algorithm selection, implementation (yielding a C program), and compilation. Current: algorithm selection and implementation require human effort; only compilation is automated. Future: the entire chain is automated.]
C code is a singularity: the compiler has no access to the high-level information.
Challenge: conquer the high abstraction level for complete automation.
Spiral
Library generator for linear transforms (DFT, DCT, DWT, filters, …) and recently more.
Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog.
Research goal: "teach" computers to write fast libraries:
Complete automation of implementation and optimization
Conquer the "high" algorithm level for automation
When a new platform comes out: regenerate a retuned library.
When a new platform paradigm comes out (e.g., vector or CMPs): update the tool rather than rewriting the library.
Intel has started to use Spiral to generate parts of their MKL/IPP libraries.
How Spiral Works
[Diagram: inside Spiral, a problem specification (transform) goes through algorithm generation and algorithm optimization (yielding an algorithm), then implementation and code optimization (yielding C code), then compilation and compiler optimizations (yielding a fast executable). Measured performance drives a search that controls the algorithm and implementation choices.]
Spiral: complete automation of the implementation and optimization task.
Basic ideas:
Declarative representation of algorithms
Rewriting systems to generate and optimize algorithms at a high level of abstraction
What is a (Linear) Transform?
Mathematically: matrix-vector multiplication.
Example: discrete Fourier transform (DFT):
$y = \mathrm{DFT}_n\, x$, where $\mathrm{DFT}_n = [\omega_n^{k\ell}]_{0 \le k,\ell < n}$ and $\omega_n = e^{-2\pi i/n}$.
x: input vector (signal); y: output vector (signal); the transform is the matrix $\mathrm{DFT}_n$.
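As a concrete illustration (a minimal sketch, not Spiral output), the definition maps directly to an O(n^2) matrix-vector product:

```c
/* DFT by definition: y[k] = sum_l omega_n^{k*l} x[l], omega_n = e^{-2*pi*i/n}.
   O(n^2) operations; fast algorithms factor the matrix DFT_n instead. */
#include <complex.h>
#include <math.h>

void dft_by_definition(int n, const double complex *x, double complex *y) {
    for (int k = 0; k < n; k++) {
        double complex sum = 0;
        for (int l = 0; l < n; l++)
            sum += cexp(-2.0 * M_PI * I * ((double)k * l / n)) * x[l];
        y[k] = sum;
    }
}
```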
Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):
Algorithms are divide-and-conquer: Breakdown rules
Mathematical, declarative representation: SPL (signal processing language)
SPL describes the structure of the dataflow
$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\; T^4_2\; (I_2 \otimes \mathrm{DFT}_2)\; L^4_2$

$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix} = \begin{bmatrix} 1 & & 1 & \\ & 1 & & 1 \\ 1 & & -1 & \\ & 1 & & -1 \end{bmatrix} \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & -j \end{bmatrix} \begin{bmatrix} 1 & 1 & & \\ 1 & -1 & & \\ & & 1 & 1 \\ & & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & & & \\ & & 1 & \\ & 1 & & \\ & & & 1 \end{bmatrix}$

$\mathrm{DFT}_2$: Fourier transform (base case)
$I_2$: identity; $L^4_2$: permutation
$T^4_2$: diagonal matrix (twiddles)
$\otimes$: Kronecker product
Examples: Transforms (currently 55)
Examples: Breakdown Rules (currently ≈220)
Base case rules
Combining these rules yields many algorithms for every given transform
Breakdown Rules (≈220)
BSkewDFT 4
BSkewDFT_Base2
BSkewDFT_Base4
BSkewDFT_Fact
BSkewDFT_Decomp
BSkewPRDFT 1
BSkewPRDFT_Decomp
BSkewRDFT 1
BSkewPRDFT_Decomp
Circulant 10
RuleCirculant_Base
RuleCirculant_toFilt
RuleCirculant_Blocking
RuleCirculant_BlockingDense
RuleCirculant_DiagonalizeStep
RuleCirculant_DFT
RuleCirculant_RDFT
RuleCirculant_PRDFT
RuleCirculant_PRDFT1
RuleCirculant_DHT
CosDFT 2
CosDFT_Base
CosDFT_Trigonometric
DCT1 3
DCT1_Base2
DCT1_Base4
DCT1_DCT1and3
DCT2 5
DCT2_DCT2and4
DCT2_toRDFT
DCT2_toRDFT_odd
DCT2_PrimePowerInduction
DCT2_PrimeFactor
DCT3 5
DCT3_Base2
DCT3_Base3
DCT3_Base5
DCT3_Orth9
DCT3_toSkewDCT3
DCT4 8
DCT4_Base2
DCT4_Base3
DCT4_DCT2
DCT4_DCT2t
DCT4_DCT2andDST2
DCT4_DST4andDST2
DCT4_Iterative
DCT4_BSkew_Decomp
PolyDTT 4
PolyDTT_ToNormal
PolyDTT_Base2
PolyDTT_SkewBase2
PolyDTT_SkewDST3_CT
RDFT 4
RDFT_Base
RDFT_Trigonometric
RDFT_CT_Radix2
RDFT_toDCT2
RHT 2
RHT_Base
RHT_CooleyTukey
SinDFT 2
SinDFT_Base
SinDFT_Trigonometric
SkewDFT 2
SkewDFT_Base2
SkewDFT_Fact
SkewDTT 3
SkewDTT_Base2
SkewDCT3_VarSteidl
SkewDCT3_VarSteidlIterative
Toeplitz 6
RuleToeplitz_Base
RuleToeplitz_SmartBase
RuleToeplitz_PreciseBase
RuleToeplitz_Blocking
RuleToeplitz_BlockingDense
RuleToeplitz_ExpandedConvolution
UDFT_Base4
UDFT_SR_Reg
URDFT 4
URDFT1_Base1
URDFT1_Base2
URDFT1_Base4
URDFT1_CT
WHT 8
WHT_Base
WHT_GeneralSplit
WHT_BinSplit
WHT_DDL
WHT_Dirsum
WHT_tSPL_BinSplit
WHT_tSPL_STerm
WHT_tSPL_Pease
PDHT2 2
PDHT2_Base2
PDHT2_CT
PDHT3 4
PDHT3_Base2
PDHT3_CT
PDHT3_Trig
PDHT3_CT_Radix2
PDHT4 3
PDHT4_Base2
PDHT4_CT
PDHT4_Trig
PDST4 2
PDST4_Base2
PDST4_CT
PRDFT 10
PRDFT1_Base1
PRDFT1_Base2
PRDFT1_CT
PRDFT1_Complex
PRDFT1_Complex_T
PRDFT1_Trig
PRDFT1_PF
PRDFT1_CT_Radix2
PRDFT_PD
PRDFT_Rader
PRDFT2 4
PRDFT2_Base1
PRDFT2_Base2
PRDFT2_CT
PRDFT2_CT_Radix2
PRDFT3 6
PRDFT3_Base1
PRDFT3_Base2
PRDFT3_CT
PRDFT3_Trig
PRDFT3_OddToPRDFT1
PRDFT3_CT_Radix2
PRDFT4 4
PRDFT4_Base1
PRDFT4_Base2
PRDFT4_CT
PRDFT4_Trig
PolyBDFT 5
PolyBDFT_Base2
PolyBDFT_Base4
PolyBDFT_Fact
PolyBDFT_Trig
PolyBDFT_Decomp
Filt 10
RuleFilt_Base
RuleFilt_Nest
RuleFilt_OverlapSave
RuleFilt_OverlapSaveFreq
RuleFilt_OverlapAdd
RuleFilt_OverlapAdd2
RuleFilt_KaratsubaSimple
RuleFilt_Circulant
RuleFilt_Blocking
RuleFilt_KaratsubaFast
IPRDFT 5
IPRDFT1_Base1
IPRDFT1_Base2
IPRDFT1_CT
IPRDFT1_Complex
IPRDFT1_CT_Radix2
IPRDFT2 4
IPRDFT2_Base1
IPRDFT2_Base2
IPRDFT2_CT
IPRDFT2_CT_Radix2
IPRDFT3 3
IPRDFT3_Base1
IPRDFT3_Base2
IPRDFT3_CT
IPRDFT4 3
IPRDFT4_Base1
IPRDFT4_Base2
IPRDFT4_CT
IRDFT 1
IRDFT_Trigonometric
InverseDCT 1
InverseDCT_toDCT3
MDCT 1
MDCT_toDCT4
MDDFT 5
MDDFT_Base
MDDFT_Tensor
MDDFT_RowCol
MDDFT_Dimless
MDDFT_tSPL_RowCol
PDCT4 2
PDCT4_Base2
PDCT4_CT
PDHT 3
PDHT1_Base2
PDHT1_CT
PDHT1_CT_Radix2
DRDFT 1
DRDFT_tSPL_Pease
DSCirculant 2
RuleDSCirculant_Base
RuleDSCirculant_toFilt
DST1 3
DST1_Base1
DST1_Base2
DST1_DST3and1
DST2 3
DST2_Base2
DST2_toDCT2
DST2_DST2and4
DST4 2
DST4_Base
DST4_toDCT4
DTT 15
DTT_C3_Base2
DTT_IC3_Base2
DTT_C3_Base3
DTT_C3_Base3
DTT_S3_Base2
DTT_S3Poly_Base2
DTT_S3Poly_Base3
DTT_S3_Base2
DTT_C4_Base2
DTT_C4_Base2_LS
DTT_C4Poly_Base2
DTT_S4_Base2
DTT_S4_Base2_LS
DTT_S4Poly_Base2
DTT_DIF_T
DWT 5
DWT_Filt
DWT_Base2
DWT_Base4
DWT_Base8
DWT_Lifting
DWTper 4
DWTper_Mallat_2
DWTper_Mallat
DWTper_Polyphase
DWTper_Lifting
DCT5 1
DCT5_Rader
DFT 22
DFT_Base
DFT_Canonize
DFT_CT
DFT_CT_Mincost
DFT_CT_DDL
DFT_CosSinSplit
DFT_Rader
DFT_Bluestein
DFT_GoodThomas
DFT_PFA_SUMS
DFT_PFA_RaderComb
DFT_SplitRadix
DFT_DCT1andDST1
DFT_DFT1and3
DFT_CT_Inplace
DFT_SplitRadix_Inplace
DFT_PD
DFT_SR_Reg
DFT_PRDFT
DFT_GT_CT
DFT_tSPL_CT
DFT_DFTBR
DFT2 2
DFT2_CT
DFT2_PRDFT2
DFT3 5
DFT3_Base
DFT3_CT
DFT3_CT_Radix2
DFT3_OddToDFT1
DFT3_PRDFT3
DFT4 3
DFT4_Base
DFT4_CT
DFT4_PRDFT4
DFTBR 1
DFTBR_HW
DHT 3
DHT_Base
DHT_DCT1andDST1
DHT_Radix2
Cooley-Tukey FFT as a breakdown rule:
$\mathrm{DFT}_{mn} \to (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m$
Breakdown rules in Spiral:
"Teach" Spiral the existing algorithm knowledge (~200 journal papers)
Also include many new ones (algebraic theory)
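To make the rule concrete, here is a minimal sketch (my illustration, not Spiral-generated code) of the radix-2 special case m = 2, n = N/2 as a recursive C routine; each factor of the formula becomes one stage:

```c
/* Cooley-Tukey, radix 2: DFT_N = (DFT_2 (x) I_n) T^N_n (I_2 (x) DFT_n) L^N_2,
   with n = N/2 and N a power of two. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

void dft_rec(int N, const double complex *x, double complex *y) {
    if (N == 1) { y[0] = x[0]; return; }
    int n = N / 2;
    double complex *t = malloc((size_t)N * sizeof *t);
    for (int i = 0; i < n; i++) {   /* L^N_2: even/odd split */
        t[i]     = x[2*i];
        t[n + i] = x[2*i + 1];
    }
    dft_rec(n, t, y);               /* I_2 (x) DFT_n: two half-size DFTs */
    dft_rec(n, t + n, y + n);
    for (int i = 0; i < n; i++) {   /* T^N_n (twiddles), then DFT_2 (x) I_n */
        double complex w = cexp(-2.0 * M_PI * I * i / N) * y[n + i];
        y[n + i] = y[i] - w;
        y[i]     = y[i] + w;
    }
    free(t);
}
```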
SPL to Sequential Code
Example: tensor product $y = (I_m \otimes A_n)\, x$
Correct code: easy. Fast code: very difficult.
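A minimal sketch of that direct mapping (my illustration, assuming a generic kernel A_n applied blockwise):

```c
/* y = (I_m (x) A_n) x: apply the n x n kernel A to m contiguous blocks.
   This is the "correct code: easy" part; making it fast is the hard part. */
void i_tensor_a(int m, int n, const double *x, double *y,
                void (*A)(int n, const double *x, double *y)) {
    for (int i = 0; i < m; i++)
        A(n, x + (long)i * n, y + (long)i * n);  /* block i: y_i = A_n x_i */
}
```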
Program Generation in Spiral (Sketched)
[Diagram: Transform (user specified) → fast algorithm in SPL (many choices) → Σ-SPL [PLDI 05] → C code. This process is iterated to search for the fastest version.]
But that's not all: parallelization, vectorization, loop optimizations, constant folding, scheduling, …
Optimization happens at all abstraction levels.
Organization
Spiral: Brief overview
Parallelization in Spiral
Results
Conclusion
SPL to Shared Memory Code: Basic Idea [SC 06]
Governing construct: the tensor product $y = (I_p \otimes A)\, x$ is p-way embarrassingly parallel and load-balanced.
[Diagram: four copies of A applied to four contiguous blocks of x, one per processor 0 to 3.]
Problematic construct: permutations produce false sharing.
[Diagram: a permutation scattering elements of x across cache lines of y.]
Task: rewrite formulas to extract tensor products and keep contiguous blocks.
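As a sketch of what the "good" construct compiles to (a minimal illustration assuming OpenMP and a generic block kernel A, not Spiral's actual generated code):

```c
/* y = (I_p (x) A_n) x with one contiguous block per processor.
   Load-balanced; no two threads write the same cache line as long as
   n * sizeof(double) is a multiple of the line size mu. */
#include <omp.h>

void smp_i_tensor_a(int p, int n, const double *x, double *y,
                    void (*A)(int n, const double *x, double *y)) {
    #pragma omp parallel for num_threads(p) schedule(static)
    for (int i = 0; i < p; i++)
        A(n, x + (long)i * n, y + (long)i * n);
}
```

Because each thread writes only its own contiguous block of y, this has exactly the no-false-sharing property the rewriting aims for.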
Step 1: Shared Memory Tags
Identify the crucial hardware parameters:
Number of processors: p
Cache line size: μ
Introduce them as tags in SPL: a tagged formula A means that A is to be optimized for p processors and cache line size μ.
Step 2: Identify "Good" Formulas
Load-balanced, avoiding false sharing:
Tagged operators (no further rewriting necessary)
Definition: a formula is fully optimized for (p, μ) if it is one of the above or of the form shown on the slide [formula not transcribed], where A and B are fully optimized.
Step 3: Identify Rewriting Rules
Goal: transform formulas into fully optimized formulas.
Formulas are rewritten and tags are propagated; there may be choices.
Simple Rewriting Example
[Formula not transcribed: a tagged construct is rewritten into a fully optimized form; see the reconstruction below.]
This corresponds to loop splitting + loop exchange.
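The slide's formula is not in the transcript; as a representative reconstruction (a standard tensor identity of the kind used here, not necessarily the exact rule shown), a construct $A_m \otimes I_n$ can be rewritten to expose the p-way parallel form:

```latex
A_m \otimes I_n \;\to\;
\left(L^{mp}_m \otimes I_{n/p}\right)\,
\left(I_p \otimes (A_m \otimes I_{n/p})\right)\,
\left(L^{mp}_p \otimes I_{n/p}\right)
```

The middle factor is the embarrassingly parallel tensor product; the stride permutations $L$ are what loop splitting and loop exchange implement.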
Parallelization by Rewriting
Fully optimized (load-balanced, no false sharing) in the sense of our definition.
Same Approach for Other Parallel Paradigms
Vectorization: [IPDPS 02, VecPar 06]
Message passing (MPI): [ISPA 06] (with Bonelli, Lorenz, Ueberhuber, TU Vienna)
Cg/OpenGL for GPUs (with Shen, TU Denmark)
Verilog for FPGAs: [DAC 05] (with Milder, Hoe, CMU)
Going Beyond Transforms
Transform = linear operator with one vector input and one vector output.
Key ideas:
Generalize to (possibly nonlinear) operators with several inputs and several outputs
Generalize SPL (including the tensor product) to OL (operator language)
Generalize the rewriting systems for parallelization
We have first results beyond transforms!
Examples shown as OL formulas on the slide: Cooley-Tukey FFT in OL, Viterbi in OL, Mat-Mat-Mult.
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
Benchmarks
[Grid of kernels vs. platforms: kernel: DFT; platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU.]
All Spiral code shown is "push-button" generated from scratch ("click").
Benchmark: Vector and SMP
[Plot: DFT (single precision) on 3 GHz 2 x Core 2 Extreme; x-axis: input size, 256 to 262,144; y-axis: performance (Gflop/s), 0 to 30. Lines: Spiral 5.0 SPMD (2 and 4 processors), Spiral 5.0 sequential, Intel IPP 5.0, FFTW 3.2 alpha SMP (2 and 4 processors), FFTW 3.2 alpha sequential. Memory footprint < L1 cache of one processor; peak of 25 Gflop/s!]
4-way vectorized + up to 4-threaded + adapted to the memory hierarchy.
Benchmark: DCT4, Multiples of 32, 4-way Vectorized
[Plot: DCT4 (single precision) on 2.66 GHz Core 2 (4-way 32-bit SSE); x-axis: input size, 32 to 256; y-axis: performance (Gflop/s), 0 to 10. Lines: Spiral DCT4, FFTW 3.1.2 DCT4 (k11), IPP 5.1 DCT2 (?).]
A less popular transform usually means no good code is available.
Benchmark: Cell (1 processor = SPE)
[Plot: DFT (single precision) on 3.2 GHz Cell BE (single SPE); x-axis: input size, 16 to 1,024; y-axis: performance (Gflop/s), 0 to 16.]
Generated using the simulator; run at Mercury (thanks to Robert Cooper).
Joint work with Th. Peter (ETH Zurich), S. Chellappa, M. Telgarsky, J. Moura (CMU).
Benchmark: GPU
[Plot: WHT (single precision) on 3.6 GHz Pentium 4 with Nvidia 7900 GTX; x-axis: log2(input size), 8 to 24; y-axis: performance (Gflop/s), 0 to 6. Lines: Spiral CPU, Spiral GPU, CPU + GPU.]
Joint work with H. Shen, TU Denmark.
Benchmark: FPGA
[Plot: DFT 256 on Xilinx Virtex-2 Pro FPGA; x-axis: area (slices), 0 to 25,000; y-axis: inverse throughput (gap) in μs, 0 to 8; lower left is better. Points: Xilinx LogiCore 3.2 vs. Spiral.]
Competitive with professional designs, with a much larger set of performance/area trade-offs (Pareto-optimal designs).
Joint work with P. Milder, J. Hoe (CMU).
Benchmarks
[Grid of kernels vs. platforms, now filled in further: kernels: DFT, filter, GEMM, Viterbi; platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU.]
All Spiral code shown is "push-button" generated from scratch ("click").
Beyond Transforms: Viterbi Decoding
[Plot: Viterbi decoding (8-bit) on 2.66 GHz Core 2 Duo; x-axis: log2(size of state machine), 6 to 13; y-axis: performance (Gbutterflies/s), 0 to 2.5. Lines: Spiral 16-way vectorized, Spiral scalar, Karn's Viterbi decoder (hand-tuned assembly).]
1 butterfly = ~22 ops
Vectorized using practically the same rules as for the DFT.
Joint work with S. Chellappa, CMU. Karn's decoder: http://www.ka9q.net/code/fec/
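For readers unfamiliar with the unit on the y-axis: one Viterbi butterfly is an add-compare-select pair. A minimal sketch (my illustration with a hypothetical interface, not Spiral's or Karn's code; survivor-path bookkeeping omitted):

```c
/* Add-compare-select butterfly: predecessor states s and s + S/2 both
   reach successor states 2s and 2s+1; for a rate-1/2 code the two branch
   metrics b0, b1 swap roles between the two successors. */
static inline void acs_butterfly(unsigned m0, unsigned m1,
                                 unsigned b0, unsigned b1,
                                 unsigned *out0, unsigned *out1) {
    unsigned a = m0 + b0, b = m1 + b1;
    *out0 = a < b ? a : b;      /* survivor metric for state 2s   */
    unsigned c = m0 + b1, d = m1 + b0;
    *out1 = c < d ? c : d;      /* survivor metric for state 2s+1 */
}
```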
Beyond Transforms: Matrix-Matrix-Multiply
Work with F. de Mesmay, CMU
[Plot: MMM (square matrices, real, double precision) on 2 x Core 2 Duo, 3 GHz; x-axis: matrix side, 1 to 16,384; y-axis: performance (Gflop/s), 0 to 50, with the theoretical peak marked. Lines: Spiral generated (scalar), ATLAS generated (scalar), ATLAS contributed (pthreads + SSE), Goto (pthreads + SSE), Spiral generated (pthreads + SSE).]
All measurements with warm cache; note that ATLAS code is optimized for cold cache.
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
Spiral Summary
Spiral: the computer implements and optimizes transforms.
Rigorous formal framework
Support for different platform paradigms (vector, SMP, GPU, FPGA, …)
Optimization at a high abstraction level through rewriting + empirical search over alternatives
The generated code is often faster than human-written code
We have extended the framework beyond transforms
What we have learned:
Declarative representation of algorithms (mathematical DSL)
Optimization at a high, "right" level of abstraction using rewriting
It makes sense to use math to represent and optimize math functionality
It makes sense to "teach" the computer algorithms and math (this knowledge does not become obsolete)
Domain-specific techniques are necessary to get the fastest code
One needs techniques from different disciplines …
Attacking the Parallel Library Challenge: Automation in High-Performance Library Development
[Diagram: high-performance parallel library development at the center, drawing on programming languages, program generation, symbolic computation, rewriting, and compilers on the software side, and on scientific computing, algorithms, and mathematics.]
We need to work together.