Carnegie Mellon
High Performance Linear Transform Program Generation for the Cell BE
Vas Chellappa, Franz Franchetti, Markus Püschel
Electrical & Computer Engineering, Carnegie Mellon University
Sponsors: DARPA‐DESA, NSF, ARO, and Mercury Inc.
Cell Broadband Engine
Multicore CPU (8 SPEs + 1 PPE)
SPEs: SIMD cores designed for numerical computing
256 KB "local store" (LS) per SPE (scratchpad-like)
Programmer-driven DMA
204 Gflop/s peak
[Figure: Cell BE chip: eight SPEs, each with a local store (LS), connected through the EIB to main memory]
How do we harness the Cell's impressive peak performance?
DFT on the Cell BE
[Figure: DFT performance plot comparing Spiral-generated code (this paper), FFTC, FFTW, and Numerical Recipes; a 350x gap is marked]
Platform-tuned code is 350x faster. But hard to write!
Overview
Background, Spiral Overview
Generating DFTs for the Cell
Performance Results
Concluding Remarks
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005
“Fitting” Dataflow to Hardware
[Figure: the stages of a DFT dataflow arranged three ways]
Iterative algorithm (programming ease)
Recursive algorithm (memory hierarchy)
Parallel execution (multicore; stages split between Core 0 and Core 1)
To "fit" DFT to architecture:
Various traversals
Various factorizations
How to map dataflow to architecture automatically?
“Fitting” Dataflow to Platform (contd.)
[Figure: dataflow blocks 1 to 5 reordered and assigned to Core 0 and Core 1]
Intuition: rewrite formulas to obtain suitable dataflow
Program Generation in Spiral
Optimization at all abstraction levels
Transform (user specified)
Fast algorithm in SPL (many choices)
  parallelization, vectorization
∑-SPL
  loop optimizations
C code
  constant folding, scheduling, ...
Iteration of this process to search for the fastest
But that's not all ...
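Common Abstraction: SPL
SPL: Tensor-product representation
E.g.: Cooley-Tukey fast Fourier transform (FFT):
$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix}
=
\begin{bmatrix} 1 & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & 1 \\ 1 & \cdot & -1 & \cdot \\ \cdot & 1 & \cdot & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & \cdot & \cdot & j \end{bmatrix}
\begin{bmatrix} 1 & 1 & \cdot & \cdot \\ 1 & -1 & \cdot & \cdot \\ \cdot & \cdot & 1 & 1 \\ \cdot & \cdot & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 \end{bmatrix}
$$
$$
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
$$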
Tensor products in SPL represent loop structures
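As a rough illustration of how tensor products turn into loops, the C sketch below applies the four factors of the size-4 Cooley-Tukey factorization one after another, right to left. The function names, the `cplx` type, and the use of double precision are assumptions made for this example; this is not Spiral output. The sign convention (DFT_4 entries are powers of j) is the one used in the factorization of this slide.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;   /* hypothetical complex type */

static cplx c_add(cplx a, cplx b) { return (cplx){ a.re + b.re, a.im + b.im }; }
static cplx c_sub(cplx a, cplx b) { return (cplx){ a.re - b.re, a.im - b.im }; }
static cplx c_mul(cplx a, cplx b)
{
    return (cplx){ a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
}

/* DFT_2 butterfly on two elements that lie `stride` apart. */
static void dft2(const cplx *in, cplx *out, int stride)
{
    cplx a = in[0], b = in[stride];
    out[0] = c_add(a, b);
    out[stride] = c_sub(a, b);
}

/* y = (DFT_2 (x) I_2) T (I_2 (x) DFT_2) L x, factors applied right to left. */
void dft4_cooley_tukey(const cplx x[4], cplx y[4])
{
    const cplx j = { 0.0, 1.0 };
    cplx t[4], u[4];

    for (int i = 0; i < 2; i++)           /* L: read the input at stride 2 */
        for (int k = 0; k < 2; k++)
            t[2 * i + k] = x[2 * k + i];

    for (int i = 0; i < 2; i++)           /* I_2 (x) DFT_2: loop over contiguous blocks */
        dft2(&t[2 * i], &u[2 * i], 1);

    u[3] = c_mul(u[3], j);                /* T = diag(1, 1, 1, j) */

    for (int i = 0; i < 2; i++)           /* DFT_2 (x) I_2: loop over stride-2 butterflies */
        dft2(&u[i], &y[i], 2);
}

/* Reference by definition: y[k] = sum over n of j^(k*n) * x[n]. */
void dft4_direct(const cplx x[4], cplx y[4])
{
    const cplx w[4] = { {1, 0}, {0, 1}, {-1, 0}, {0, -1} };   /* powers of j */
    assert(x != y);                       /* the direct sum is not safe in place */
    for (int k = 0; k < 4; k++) {
        cplx acc = { 0.0, 0.0 };
        for (int n = 0; n < 4; n++)
            acc = c_add(acc, c_mul(w[(k * n) % 4], x[n]));
        y[k] = acc;
    }
}
```

The two tensor products differ only in how the loop indexes memory: I_2 (x) DFT_2 walks contiguous blocks, while DFT_2 (x) I_2 walks interleaved elements at stride 2.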
Overview
Background, Spiral Overview
Generating DFTs for the Cell
Performance Results
Concluding Remarks
Mapping DFTs to the Cell
Objective: High‐performance transform library for Cell BE
[Figure: Cell BE chip with a DFT mapped onto the SPEs, their local stores (LS), the EIB, and main memory]
Cell's architectural paradigms:
Vectorization: Vectorize DFT for vector length ν
Parallelization: Parallelize DFT across p SPEs, and use a DMA packet size of μ
Multibuffering: Optimize DFT for throughput (s DFTs required)
Tags guide formula rewriting
SPL to Parallel Code
Natural parallel construct in SPL:
[Figure: block-diagonal matrix with copies of A on the diagonal, one per processor (Processor 0 to Processor 3), mapping x to y]
Independent, load-balanced, communication-free operation
Parallelizing other constructs in SPL:
Permutations require message exchange (on-chip DMA comm.)
Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
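To make the two cases concrete, the sketch below contrasts a block-parallel step with a stride permutation over a vector that is split into contiguous blocks, one per processor. It assumes the natural parallel construct is the tensor product I_p (x) A (p independent copies of A, one per processor). All names here (`kernel_fn`, `butterfly2`, `block_parallel_step`, `stride_perm`, `remote_moves`) are hypothetical, and plain array indexing stands in for the DMA transfers that real Cell code would use.

```c
#include <assert.h>

/* Hypothetical kernel type: computes y = A x on one block of length n. */
typedef void (*kernel_fn)(const float *x, float *y, int n);

/* Example kernel for illustration only: a size-2 butterfly. */
void butterfly2(const float *x, float *y, int n)
{
    assert(n == 2);
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

/* Share of y = (I_p (x) A) x done by worker `id`: it reads and writes
 * only its own block, so no communication is needed. */
void block_parallel_step(kernel_fn A, const float *x, float *y, int n, int id)
{
    A(x + id * n, y + id * n, n);
}

/* Stride permutation: y[i * (len / s) + j] = x[j * s + i]. */
void stride_perm(const float *x, float *y, int len, int s)
{
    assert(s > 0 && len % s == 0 && x != y);
    int n = len / s;
    for (int i = 0; i < s; i++)
        for (int j = 0; j < n; j++)
            y[i * n + j] = x[j * s + i];
}

/* Number of elements of the stride permutation whose source and
 * destination fall into different blocks when the vector is split into
 * p contiguous blocks. Each such element needs a message exchange. */
int remote_moves(int len, int s, int p)
{
    assert(s > 0 && p > 0 && len % s == 0 && len % p == 0);
    int n = len / s, block = len / p, remote = 0;
    for (int i = 0; i < s; i++)
        for (int j = 0; j < n; j++)
            if ((j * s + i) / block != (i * n + j) / block)
                remote++;
    return remote;
}
```

For a vector of length 8 split across 2 processors, the block-parallel step moves nothing between blocks, whereas `remote_moves(8, 2, 2)` counts 4 of the 8 elements crossing a block boundary.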
SPL to Streaming Code
Streaming: Overlapping computation with communication
On-chip (SPE ↔ SPE) and off-chip (SPE ↔ Main memory)
Idea: tensor loops become multi-buffered loops
[Figure: i'th iteration of the loop over A, mapping x to y: Write A_{i-1}, Compute A_i, Read A_{i+1}]
(Trickier for other SPL constructs)
Useful for:
Throughput-optimized code
Large, out-of-chip sizes
Idea: rewrite algorithm at SPL level to achieve largest DMA packets
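A minimal sketch of such a multi-buffered loop follows, under several assumptions: `dma_get` and `dma_put` are hypothetical stand-ins implemented with `memcpy`, the kernel `compute` is a placeholder that doubles each element, and the block size is arbitrary. Because `memcpy` is synchronous, the sketch shows only the buffer rotation (write A_{i-1}, read A_{i+1}, compute A_i); on the Cell the transfers would be asynchronous DMA and would need explicit waits before a buffer is reused.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 4   /* elements per block; illustrative value */

/* Hypothetical stand-ins for DMA between main memory and local store. */
static void dma_get(float *ls, const float *mem) { memcpy(ls, mem, BLOCK * sizeof *ls); }
static void dma_put(float *mem, const float *ls) { memcpy(mem, ls, BLOCK * sizeof *ls); }

/* Placeholder for the kernel A: doubles each element in place. */
static void compute(float *ls)
{
    for (int k = 0; k < BLOCK; k++)
        ls[k] *= 2.0f;
}

/* Streams nblocks blocks of x through three rotating buffers into y. */
void streamed_loop(const float *x, float *y, int nblocks)
{
    assert(nblocks > 0);
    float buf[3][BLOCK];

    dma_get(buf[0], x);                                        /* prologue: read A_0 */
    for (int i = 0; i < nblocks; i++) {
        if (i > 0)
            dma_put(y + (i - 1) * BLOCK, buf[(i - 1) % 3]);    /* write A_{i-1} */
        if (i + 1 < nblocks)
            dma_get(buf[(i + 1) % 3], x + (i + 1) * BLOCK);    /* read A_{i+1} */
        compute(buf[i % 3]);                                   /* compute A_i */
    }
    dma_put(y + (nblocks - 1) * BLOCK, buf[(nblocks - 1) % 3]); /* epilogue */
}
```

Three buffers are used so that the block being written out, the block being computed, and the block being read in never share storage within one iteration.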
Generating Cell Code
Transform (user specified)
Fast algorithm in SPL
Rewriting (tag guided):
  SIMD kernel optimized for memory hierarchy
  Streamed from memory for throughput
  All-to-all communication (on-chip)
  Load balanced across p SPEs
Loop operations in ∑-SPL
Cell-specific optimized C code (intrinsics, DMA etc.)
Generated Code Sample
/* Complex-to-complex DFT size 64 on 2 SPEs */
void dft_c2c_64(float *X, float *Y, int spuid)
{
    int i;

    // Block 1 (IxA)L
    for (i = 0; i <= 7; i++) // Right-most gather
    { DMA_GATHER(gath_func(X,i), gath_func(T1,i), 4) } // uses spu_mfcdma()
    spu_mfcstat(MFC_TAG_UPDATE_ALL); // Wait on gather

    // compute vectorized DFT kernel of size m

    for (i = 0; i <= 7; i++) // Scatter at interface
    { DMA_SCATTER(scat_func(T1,i), scat_func(T2,i), 4) }
    all_to_all_synchronization_barrier(); // uses mailbox msgs

    // Block 2 (AxI)
    /* Gather is a no operation since the scatter above accounted for it */

    // compute vectorized DFT kernel of size n

    for (i = 0; i <= 7; i++) // Left-most scatter
    { DMA_SCATTER(scat_func(T1,i), scat_func(Y,i), 4) }
    all_to_all_synchronization_barrier();
}
[Slide callouts mark the DMA, parallelized, and vectorized parts of the code.]
DFT 2^16: 4,000+ lines of code!
Problem Space: Options
[Figure: SPE/DFT diagrams, one per option]
Base (vectorized): Vectorization assumed
Parallelization: Single DFT parallelized across multiple SPEs
Main memory operations (only for small DFTs)
Latency optimized (default)
Multiple independent DFTs on multiple SPEs
Throughput, Multiple parallelized independent DFTs
Multibuffered independent DFTs
Problem Space: Combinations
Throughput-optimized usage scenarios
Latency-optimized usage scenarios
[Figure: SPE/DFT diagrams for the combinations listed below]
Parallel, multibuffered DFT
Single DFT from main memory
Independent DFTs multibuffered in parallel
Devise rewrite rules for tags. Nestings describe all scenarios
Overview
Background, Spiral Overview
Generating DFTs for the Cell
Performance Results
Concluding Remarks
[Figure: DFT performance plot with curves for 8 SPEs, 4 SPEs, 2 SPEs, and 1 SPE]
[Figure: DFT performance plot comparing Spiral on 8 SPEs, FFTC, FFTW, and Spiral on 1 SPE]
4.5x faster than FFTW, 1.63x faster than FFTC
More Performance Results
Single-SPE DFT code
[Figure: performance plot comparing Spiral, Mercury, IBM SDK, and Chow]
Split/interleaved complex formats
Non-2-power sizes
Double precision (PowerXCell 8i)
Other Linear Transforms
Discrete Sine, Cosine transforms, DFT with real inputs (single‐SPE)
2‐D DFTs
Out-of-core sizes
Limited to 2D DFTs on 1-SPE (for now)
More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of International Conference on Supercomputing (ICS) 2009
Overview
Background, Spiral Overview
Generating DFTs for the Cell
Performance Results
Concluding Remarks
Conclusion
Automatic generation of transform libraries
High performance
Variety of scenarios, formats
High performance on Cell requires:
Vectorization, multi-core parallelization, streaming, DMA code
Future processors likely to have similar paradigms, tradeoffs
Spiral approach:
Common abstraction of transform, algorithm, architecture (SPL)
Rewrite rules to go from transform to architecture
[Figure: architecture space and algorithm space]