Carnegie Mellon

High Performance Linear Transform Program Generation for the Cell BE

Vas Chellappa, Franz Franchetti, Markus Püschel

Electrical & Computer Engineering, Carnegie Mellon University

Sponsors: DARPA‐DESA, NSF, ARO, and Mercury Inc.

Cell Broadband Engine

Multicore CPU (8 SPEs + 1 PPE)

SPEs: SIMD cores designed for numerical computing

256KB "local store" per SPE (scratchpad-like)

Programmer-driven DMA

204 Gflop/s peak

[Figure: Cell BE chip. SPEs, each with a local store (LS), connected through the EIB to main memory.]

How do we harness the Cell's impressive peak performance?
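
The 204 Gflop/s peak can be reproduced with a back-of-the-envelope calculation. The sketch below relies on values that are not stated on the slide and are assumptions here: a 3.2 GHz clock, 4-way single-precision SIMD, and one fused multiply-add (counted as 2 flops) issued per SPE per cycle.

$$
8 \ \text{SPEs} \times 4 \ \text{lanes} \times 2 \ \tfrac{\text{flops}}{\text{lane} \cdot \text{cycle}} \times 3.2 \ \text{GHz} = 204.8 \ \text{Gflop/s}
$$

Under these assumptions each SPE contributes 25.6 Gflop/s, so code that is not vectorized or that leaves SPEs idle cannot come close to the peak.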

DFT on the Cell BE

[Figure: DFT performance on the Cell BE. Series: Spiral generated (this paper), FFTC, FFTW, Numerical Recipes. Annotation: 350x.]

Platform-tuned code is 350x faster. But hard to write!

Overview

Background, Spiral Overview

Generating DFTs for the Cell

Performance Results

Concluding Remarks

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005

“Fitting” Dataflow to Hardware

[Figure: three DFT dataflow diagrams drawn as numbered stages. Panels: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), parallel execution (multicore, Core 0 and Core 1).]

To "fit" DFT to architecture:

Various traversals

Various factorizations

How to map dataflow to architecture automatically?

“Fitting” Dataflow to Platform (contd.)

[Figure: numbered dataflow stages (1 to 5) mapped onto Core 0 and Core 1]

Intuition: rewrite formulas to obtain suitable dataflow

Program Generation in Spiral

Transform (user specified)

→ Fast algorithm in SPL (many choices): parallelization, vectorization

→ ∑-SPL: loop optimizations

→ C Code: constant folding, scheduling

Optimization at all abstraction levels

Iteration of this process to search for the fastest

But that's not all ...

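Common Abstraction: SPL

SPL: Tensor-product representation

E.g.: Cooley-Tukey fast Fourier transform (FFT):

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix}
=
\begin{bmatrix} 1 & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & 1 \\ 1 & \cdot & -1 & \cdot \\ \cdot & 1 & \cdot & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & \cdot & \cdot & j \end{bmatrix}
\begin{bmatrix} 1 & 1 & \cdot & \cdot \\ 1 & -1 & \cdot & \cdot \\ \cdot & \cdot & 1 & 1 \\ \cdot & \cdot & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 \end{bmatrix}
$$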

Tensor products in SPL represent loop structures
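
To make the "tensor products are loops" reading concrete, here is a minimal sketch in portable C (not Spiral output). It assumes the size-4 factorization on this slide reads DFT_4 = (DFT_2 ⊗ I_2) T (I_2 ⊗ DFT_2) L, with T = diag(1, 1, 1, j) and L a stride-2 permutation. All function names are hypothetical. I ⊗ A becomes a loop over contiguous blocks, and A ⊗ I becomes a loop over strided slices.

```c
#include <complex.h>
#include <stddef.h>

typedef double complex cplx;

/* DFT_2 butterfly on two elements that are `stride` apart. */
static void dft2(const cplx *in, cplx *out, size_t stride) {
    cplx a = in[0], b = in[stride];
    out[0] = a + b;
    out[stride] = a - b;
}

/* y = (I_m (x) DFT_2) x: loop over m contiguous blocks of length 2. */
static void i_tensor_dft2(size_t m, const cplx *x, cplx *y) {
    for (size_t i = 0; i < m; i++)
        dft2(x + 2 * i, y + 2 * i, 1);
}

/* y = (DFT_2 (x) I_n) x: loop over n interleaved slices at stride n. */
static void dft2_tensor_i(size_t n, const cplx *x, cplx *y) {
    for (size_t i = 0; i < n; i++)
        dft2(x + i, y + i, n);
}

/* DFT_4 via the four factors, applied right to left. */
void dft4_cooley_tukey(const cplx x[4], cplx y[4]) {
    cplx t[4], u[4];
    t[0] = x[0]; t[1] = x[2]; t[2] = x[1]; t[3] = x[3]; /* L: read at stride 2 */
    i_tensor_dft2(2, t, u);                             /* I_2 (x) DFT_2 */
    u[3] *= I;                                          /* T = diag(1, 1, 1, j) */
    dft2_tensor_i(2, u, y);                             /* DFT_2 (x) I_2 */
}

/* Reference: y_k = sum_l j^(k*l) x_l, i.e. the dense matrix on the slide. */
void dft4_direct(const cplx x[4], cplx y[4]) {
    const cplx w[4] = {1, I, -1, -I};
    for (int k = 0; k < 4; k++) {
        y[k] = 0;
        for (int l = 0; l < 4; l++)
            y[k] += w[(k * l) % 4] * x[l];
    }
}
```

Calling `dft4_cooley_tukey` and `dft4_direct` on the same input should give the same four outputs. The factored version uses two short loops of butterflies instead of a dense 4x4 multiply.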

Mapping DFTs to the Cell

Objective: High-performance transform library for Cell BE

[Figure: Cell BE chip with a DFT mapped onto the SPEs, their local stores (LS), the EIB, and main memory]

Cell's architectural paradigms:

Vectorization: vectorize DFT for vector length ν

Parallelization: parallelize DFT across p SPEs, and use a DMA packet size of μ

Multibuffering: optimize DFT for throughput (s DFTs required)

Tags guide formula rewriting

SPL to Parallel Code

Natural parallel construct in SPL:

[Figure: block-diagonal matrix of A blocks mapping x to y, one A block per processor (Processor 0 to Processor 3)]

Independent, load-balanced, communication-free operation

Parallelizing other constructs in SPL:

Permutations require message exchange (on-chip DMA comm.)

[Figure: permutation mapping x to y across processors]

Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
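
The two cases on this slide can be sketched in plain C. This assumes the natural parallel construct is the block-diagonal form I_p ⊗ A, with block `id` owned by processor `id`. The SPEs are simulated by a sequential loop, and every name below is hypothetical. The second function counts how many elements a processor must fetch from other processors when a p x n array is transposed, which is why permutations need on-chip DMA.

```c
#include <stddef.h>

/* Kernel type: computes y = A x on one block of n floats. */
typedef void (*kernel_fn)(const float *x, float *y, size_t n);

/* Work of processor `id` for y = (I_p (x) A) x: it touches only block `id`. */
void parallel_block(kernel_fn A, size_t n, size_t id,
                    const float *x, float *y) {
    A(x + id * n, y + id * n, n);
}

/* Sequential stand-in for launching p SPE threads (assumption). */
void apply_i_tensor_a(kernel_fn A, size_t p, size_t n,
                      const float *x, float *y) {
    for (size_t id = 0; id < p; id++)
        parallel_block(A, n, id, x, y);
}

/* Example kernel: a 2-point butterfly (n is expected to be 2). */
void butterfly2(const float *x, float *y, size_t n) {
    (void)n;
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

/* Transpose of a p x n row-major array, with input row r owned by
 * processor r and the output split into p contiguous blocks of n.
 * Output index o = c*p + r reads input element (r, c), so its source
 * owner is o % p. Returns how many of processor `id`'s outputs come
 * from another processor. */
size_t remote_reads(size_t p, size_t n, size_t id) {
    size_t count = 0;
    for (size_t o = id * n; o < (id + 1) * n; o++) {
        if (o % p != id)
            count++;
    }
    return count;
}
```

With p = 4 and n = 2, `apply_i_tensor_a` never reads across block boundaries, so no communication is needed. `remote_reads(2, 4, 0)` shows that even a small transposition pulls half of its data from the other processor.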

SPL to Streaming Code

Streaming: Overlapping computation with communication

On-chip (SPE ↔ SPE) and off-chip (SPE ↔ Main memory)

Idea: tensor loops become multi-buffered loops

[Figure: i'th iteration of a loop over A blocks mapping x to y: Write A_{i-1}, Compute A_i, Read A_{i+1}]

(Trickier for other SPL constructs)

Useful for:

Throughput-optimized code

Large, out-of-chip sizes

Idea: rewrite algorithm at SPL level to achieve largest DMA packets
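
The read/compute/write schedule in the figure can be sketched as a multi-buffered loop. This is an illustration under assumptions, not the generated code: DMA is replaced by synchronous `memcpy` stand-ins, three rotating buffers are used, the block size `BLOCK` is arbitrary, and the kernel just doubles its input. On a real SPE the get and put would be asynchronous MFC transfers, which is what lets them overlap with the compute step.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 4 /* elements per transfer (hypothetical value) */

/* Synchronous stand-ins for asynchronous DMA get/put (assumption). */
static void dma_get(float *ls, const float *mem) {
    memcpy(ls, mem, BLOCK * sizeof *ls);
}
static void dma_put(float *mem, const float *ls) {
    memcpy(mem, ls, BLOCK * sizeof *ls);
}

/* Example kernel A: scale one block in place. */
static void kernel(float *b) {
    for (size_t k = 0; k < BLOCK; k++)
        b[k] *= 2.0f;
}

/* y = (I_s (x) A) x, streamed through three rotating local buffers. */
void stream_tensor(size_t s, const float *x, float *y) {
    float buf[3][BLOCK];
    if (s == 0)
        return;
    dma_get(buf[0], x); /* prologue: read block 0 */
    for (size_t i = 0; i < s; i++) {
        if (i + 1 < s) /* read block i+1 */
            dma_get(buf[(i + 1) % 3], x + (i + 1) * BLOCK);
        if (i > 0)     /* write block i-1 */
            dma_put(y + (i - 1) * BLOCK, buf[(i - 1) % 3]);
        kernel(buf[i % 3]); /* compute block i */
    }
    dma_put(y + (s - 1) * BLOCK, buf[(s - 1) % 3]); /* epilogue */
}
```

In each iteration the three steps touch three different buffers, so with real asynchronous DMA none of them would have to wait on another. Larger blocks mean fewer, bigger transfers, which matches the slide's point about achieving the largest DMA packets.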

Generating Cell Code

Transform (user specified)

→ Fast algorithm in SPL

→ Tag-guided rewriting, yielding an algorithm that is:

SIMD kernel optimized for memory hierarchy

Streamed from memory for throughput

All-to-all communication (on-chip)

Load balanced across p SPEs

→ Loop operations in ∑-SPL

→ Cell-specific optimized C code (intrinsics, DMA etc.)

Generated Code Sample

    /* Complex-to-complex DFT size 64 on 2 SPEs */
    void dft_c2c_64(float *X, float *Y, int spuid)
    {
        int i;

        // Block 1 (IxA)L
        for (i = 0; i <= 7; i++) // Right most gather
        { DMA_GATHER(gath_func(X,i), gath_func(T1,i), 4); } // uses spu_mfcdma()
        spu_mfcstat(MFC_TAG_UPDATE_ALL); // Wait on gather
        // compute vectorized DFT kernel of size m
        for (i = 0; i <= 7; i++) // Scatter at interface
        { DMA_SCATTER(scat_func(T1,i), scat_func(T2,i), 4); }
        all_to_all_synchronization_barrier(); // uses mailbox msgs

        // Block 2 (AxI)
        /* Gather is a no operation since the scatter above accounted for it */
        // compute vectorized DFT kernel of size n
        for (i = 0; i <= 7; i++) // Left most scatter
        { DMA_SCATTER(scat_func(T1,i), scat_func(Y,i), 4); }
        all_to_all_synchronization_barrier();
    }

Slide callouts on the code: DMA, parallelized, vectorized

DFT 2^16: 4,000+ lines of code!

Problem Space: Options

Base (Vectorized): vectorization assumed

Parallelization:

Single DFT parallelized across multiple SPEs

Multiple independent DFTs on multiple SPEs

Multiple parallelized independent DFTs

Main Memory Operations (only for small DFTs):

Latency optimized (default)

Throughput, multibuffered

[Figure: one diagram per option showing DFTs placed on SPEs]

Problem Space: Combinations

Latency-optimized usage scenarios:

Single DFT from main memory

Throughput-optimized usage scenarios:

Parallel, multibuffered DFT

Independent DFTs multibuffered in parallel

[Figure: one diagram per scenario showing DFTs placed on SPEs]

Devise rewrite rules for tags. Nestings describe all scenarios

[Figure: performance of a single DFT parallelized across SPEs; series: 8-SPEs, 4-SPEs, 2-SPEs, 1-SPE]

[Figure: DFT performance comparison; series: Spiral: 8-SPEs, FFTC, FFTW, Spiral: 1-SPE]

4.5x faster than FFTW, 1.63x faster than FFTC

More Performance Results

[Figure: single-SPE DFT code performance; series: Spiral, Mercury, IBM SDK, Chow]

Split/interleaved complex formats

Non-2-power sizes

Double precision (PowerXCell 8i)

Other Linear Transforms

Discrete Sine, Cosine transforms, DFT with real inputs (single‐SPE)

2‐D DFTs

Out-of-core sizes

Limited to 2D DFTs on 1-SPE (for now)

More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of International Conference on Supercomputing (ICS), 2009

Conclusion

Automatic generation of transform libraries

High performance

Variety of scenarios, formats

High performance on Cell requires:

Vectorization, multi-core parallelization, streaming, DMA code

Future processors likely to have similar paradigms, tradeoffs

Spiral approach:

Common abstraction of transform, algorithm, architecture (SPL)

Rewrite rules to go from transform to architecture

[Figure: architecture space and algorithm space]
