TRANSCRIPT
BERKELEY PAR LAB
Bridging the Performance-Productivity Gap with Selective Embedded Just-In-Time Specialization
Shoaib Kamil (CSAIL, MIT)
Armando Fox, Katherine Yelick (UPCRC, EECS Dept., UC Berkeley)
DSMC 2012
Productivity-Performance Gap
• Domain scientists want to write code in high-level languages that match the domain:
  `x = A \ b`  or  `model = gmm(….)`
• They should not have to worry about typing, parallelism, etc.:
  `implicit def s2r[A,_,I<:Seq[A]](xs: I) {… }`
Productivity-Performance Gap
• Domain scientists want to write code in high-level languages that match the domain
• For best performance, they must rely on an efficiency programmer
• Optimized code is highly dependent on the platform

Productivity-level code: roughly 1/10 the lines of code, 1/100 the performance. Efficiency-level code: 10-100x the lines of code, ~100x the performance. (Bryan Catanzaro & PALLAS Group)
“Our” Pattern Language (OPL-2010) (Kurt Keutzer, Tim Mattson) organizes the software stack into layers of patterns, from what applications express down to what the implementation deals with:

• Applications
• Structural Patterns: Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph, Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer
• Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo
• Concurrent Algorithm Strategy Patterns: Task-Parallelism, Divide-and-Conquer, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation
• Implementation Strategy Patterns: program structure (SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue) and data structure (Distributed-Array, Shared-Data, Shared-Queue, Shared-Map, Partitioned-Graph)
• Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Task-Graph, Message-Passing, Collective-Comm., Transactional-Memory, Point-To-Point-Sync. (mutual exclusion), Collective-Sync. (barrier), Memory-Sync/Fence, Transactions
• Concurrency Foundation constructs (not expressed as patterns): Thread creation/destruction, Process creation/destruction
Example: Stencil Computations

```c
for (int i = 1; i < nx - 1; i++)
  for (int j = 1; j < ny - 1; j++)
    output[i][j] = f(output[i][j], neighbors(input, i, j));
```

• The function f() changes from application to application
• Tuning of the loops requires information about the input set
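The loop nest above can be sketched in plain Python. This is a hypothetical illustration, not part of any stencil library: `apply_stencil` and the averaging choice of `f` are made-up names for this sketch.

```python
def apply_stencil(input_grid, output_grid, f):
    # Serial sweep over the interior points of a 2-D grid; f combines
    # the old output value with the 4-point von Neumann neighborhood.
    nx, ny = len(input_grid), len(input_grid[0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            nbrs = [input_grid[i - 1][j], input_grid[i + 1][j],
                    input_grid[i][j - 1], input_grid[i][j + 1]]
            output_grid[i][j] = f(output_grid[i][j], nbrs)

# One application-specific choice of f: ignore the old value and
# average the neighbors.
grid = [[float(i + j) for j in range(4)] for i in range(4)]
out = [[0.0] * 4 for _ in range(4)]
apply_stencil(grid, out, lambda old, nbrs: sum(nbrs) / len(nbrs))
print(out[1][1])  # 2.0
```

Note that only interior points are written; the boundary of the output grid is left untouched, which is why tuning needs to know the grid extents.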
What is an embedded DSL?
• A DSL compiler that reuses a host language's syntax
  – Common example: macro rewriting, as in Lisp
  – Difficulty depends on the host language's capabilities
• Leverages the capabilities of the host language
• Often does not share the host language's semantics
  – Contrast with APIs & libraries
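As an illustration of "host syntax, new semantics" (this is a toy sketch, not Asp's actual API), Python operator overloading can make ordinary expressions build a tree that is later lowered to backend code instead of being evaluated:

```python
class Expr:
    # A toy embedded DSL: Python's own syntax builds an expression
    # tree instead of computing a value.
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

    def __add__(self, other):
        return Expr("+", self, other)

    def __mul__(self, other):
        return Expr("*", self, other)

    def emit(self):
        # Lower the tree to a C-like string, as a DSEL compiler might.
        if self.left is None:
            return self.op  # leaf: a variable name
        return f"({self.left.emit()} {self.op} {self.right.emit()})"

x, y = Expr("x"), Expr("y")
print((x + y * x).emit())  # (x + (y * x))
```

The user writes what looks like arithmetic; the "compiler" decides how to execute it, which is exactly the leverage an embedded DSL gets over a plain library.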
“Stovepipes”: Using DSELs to Architect Applications
[Diagram: App 1, App 2, and App 3 each express their computation once; per-domain "stovepipes" (Dense, Sparse, Graph Traversal) map it onto Multicore, GPU, or "Cloud" backends.]

A single program expresses the computation; "stovepipes" turn that computation into optimized code at run time.
Overview
• Motivation: Productivity-Performance Gap
• SEJITS Methodology
• Asp & DSELs for Python
• Mechanisms for DSEL Implementation
• Future/Current Work
• Conclusions
Selective Embedded JIT Specialization

[Diagram: a productivity app (.py) runs in the interpreter; calls such as f() and B.h() are intercepted by the Asp Framework, whose DSEL compiler emits .c source, invokes cc/ld to produce a .so, caches it, and loads it to run on the OS/HW.]
Selective Embedded Just-In-Time Specialization (SEJITS)
• Domain scientists (aka productivity programmers) write code in embedded DSLs
• Efficiency programmers create embedded DSLs instead of one-off libraries or application optimization
• Separation of concerns
• “Invisible” to productivity programmers
– Except it runs fast
SEJITS Methodology
• Goal: productive, portable performance
• Add DSEL support to productivity languages
  – Leverage features of modern “scripting” languages
  – Leverage existing libraries for these languages
• Use external parallelizing/optimizing compilers
  – Leverage existing expertise of efficiency programmers
  – Leverage existing high-performance external libraries
• Use auto-tuning
  – Search over multiple implementations
Auto-tuning: Empirical Search for Best Performance
• Determining the best low-level code a priori is difficult, even for expert programmers
• Idea: generate many parameterized versions of a kernel
• Run all of them on the target machine and choose the fastest
• Usually run at install-time
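The search loop can be sketched in a few lines of Python. This is a toy: real auto-tuners such as PHiPAC generate and compile low-level variants, whereas here the "variants" are just differently unrolled Python loops, and `make_variant`/`autotune` are made-up names.

```python
import time

def make_variant(unroll):
    # Hypothetical parameterized kernel: sum a list with a given
    # unroll factor (a stand-in for generated low-level code).
    def kernel(data):
        total, n, i = 0.0, len(data), 0
        while i + unroll <= n:
            for k in range(unroll):
                total += data[i + k]
            i += unroll
        while i < n:          # cleanup loop for the remainder
            total += data[i]
            i += 1
        return total
    return kernel

def autotune(variants, data, trials=3):
    # Empirical search: time every variant on the target machine
    # and keep the fastest (best-of-N timing to reduce noise).
    best, best_time = None, float("inf")
    for v in variants:
        elapsed = []
        for _ in range(trials):
            start = time.perf_counter()
            v(data)
            elapsed.append(time.perf_counter() - start)
        if min(elapsed) < best_time:
            best, best_time = v, min(elapsed)
    return best

variants = [make_variant(u) for u in (1, 2, 4, 8)]
data = list(range(10000))
fastest = autotune(variants, data)
```

Every variant computes the same answer; only the timing decides which one survives, which is why the search is usually run once, at install time.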
Auto-tuning Matrix Multiply
[Figures from the Bilmes et al. PHiPAC tech report (p. 25): matrix-multiply performance, MFLOPS vs. square matrix size. Figure 10: single precision on a Sparcstation-20/61 (PHiPAC vs. C, 3 nested loops). Figure 11: single precision on a 100 MHz SGI Indigo R4K (PHiPAC vs. SGEMM from SGI's libblas_mips2_serial.a vs. C, 3 nested loops). Figure 12: double precision on an HP 712/80i (PHiPAC vs. vendor DGEMM from the pa1.1 version of libvec.a vs. FORTRAN, 3 nested loops). Legend: naïve code, vendor-provided expert code, auto-tuned code; higher is better.]
Asp is SEJITS for Python
• Proof-of-concept framework for “easy-to-build” embedded parallel DSLs
• Pragmatic choice: Python is widely used in the scientific community
• DSL implementers can use some or all of the building blocks provided
http://sejits.org
Implemented DSELs/Libraries
| DSEL/Library | Platforms |
| --- | --- |
| Stencil/Structured Grid | x86+OpenMP |
| Semantic Graphs: Filtering & Semiring Operations in KDT | x86+MPI |
| Parallel Map | x86+processes, cloud |
| Gaussian Mixture Modeling | CUDA, Cilk Plus |
| CA Matrix Powers for CA Krylov Subspace Methods | x86+pthreads |
| Bag of Little Bootstraps* | x86+Cilk Plus, cloud via Spark |
| GraphLab DSEL for Machine Learning via Graphs* | x86+pthreads |
| CA Parallel Recursive Structural Pattern* | x86+Cilk Plus |
Stencil DSEL Performance
• 11x faster than an auto-parallelizing compiler; ~2.5x faster than the state-of-the-art non-auto-tuning DSL
• Geometric mean of 93% of attainable peak performance
Stencil DSEL Performance
Communication-Avoiding Recursive Matrix Multiply
• Applies to recursive algorithms with a particular branching factor relative to memory usage
• Chooses when to perform parallel steps vs. serial steps
• The optimal choice attains lower bounds on communication for matrix multiply
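The parallel-vs-serial choice can be sketched with a toy recursive multiply. This is an illustration only: `carma_like` and `matmul` are made-up names, and it splits only the row dimension, whereas the real algorithm splits the largest of the three dimensions at every level and tunes the split depth.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul(A, B):
    # Base case: naive triple loop computing C = A x B.
    k, m = len(A[0]), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(k)) for j in range(m)]
            for row in A]

def carma_like(A, B, parallel_levels):
    # Halve the row dimension and recurse. The first parallel_levels
    # recursion levels fork the two halves in parallel (breadth-first);
    # deeper levels run serially (depth-first).
    if parallel_levels == 0 or len(A) < 2:
        return matmul(A, B)
    half = len(A) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        top = pool.submit(carma_like, A[:half], B, parallel_levels - 1)
        bot = pool.submit(carma_like, A[half:], B, parallel_levels - 1)
        return top.result() + bot.result()

A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
assert carma_like(A, B, 2) == matmul(A, B)
```

Forking too deep oversubscribes the machine; recursing serially too early leaves parallelism unused. The number of parallel levels is the knob whose optimal setting attains the communication lower bound.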
CARMA Performance on NUMA Machine
Beat MKL by 10x, using MKL.
Lipshitz, Schwartz, Eliahu, Spillinger, Demmel, K.
Mechanism: Code Templates
• Code snippets in the backend language, interspersed with Python
• For “simple” code generation

```
void vec_add(float *x, float *y) {
  % for i in range(vectorsize):
  x[${i}] += y[${i}];
  % endfor
}
```

With vectorsize = 3, this generates:

```c
void vec_add(float *x, float *y) {
  x[0] += y[0];
  x[1] += y[1];
  x[2] += y[2];
}
```
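The template above uses Mako-style syntax (`% for`, `${i}`). As a self-contained stand-in for a template engine, the same unrolled C can be produced with a few lines of plain Python; `render_vec_add` is a made-up name for this sketch.

```python
def render_vec_add(vectorsize):
    # Emit a fully unrolled C function, mimicking what the template
    # above expands to for a given vectorsize.
    body = "\n".join(f"  x[{i}] += y[{i}];" for i in range(vectorsize))
    return f"void vec_add(float *x, float *y) {{\n{body}\n}}"

print(render_vec_add(3))
```

The point of templates is exactly this: the loop bound is known at specialization time, so the generated C contains no loop at all.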
Mechanism: Phased Transformations
1. Parse user code to a Python AST
2. Convert to domain-specific IR
3. Optimize the IR
4. Convert to a backend AST
5. Optimize the backend AST
6. Write out source files
7. Call the external compiler
8. Load & run the shared library
9. Return the value
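The first phases can be sketched with Python's standard `ast` module. The + to * rewrite below is a toy stand-in for a real domain-specific lowering, and the final `exec` stands in for the compile-and-load steps (Asp would instead emit C, call the external compiler, and load the .so):

```python
import ast

# Phase 1: parse the user code to a Python AST (Asp obtains the
# source text via introspection of the live function).
source = "def kernel(a, b):\n    return a + b\n"
tree = ast.parse(source)

# Phases 2-3: a toy "domain-specific" transformation that rewrites
# every addition into a multiplication.
class AddToMul(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

tree = ast.fix_missing_locations(AddToMul().visit(tree))

# Final phases, stand-in: compile and run the transformed code.
ns = {}
exec(compile(tree, "<specialized>", "exec"), ns)
print(ns["kernel"](3, 4))  # 12
```

Each phase consumes and produces a tree, which is what makes the pipeline easy to extend with new optimizations.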
Example Code
```python
from stencil_kernel import *

class Laplacian3D(StencilKernel):
    def kernel(self, in_grid, out_grid):
        for x in self.interior_points(out_grid):
            for y in self.neighbors(in_grid, x, 1):
                out_grid[x] += (1.0/6.0) * in_grid[y]
```
Optimized Output
[Screenshot of generated code, annotated with the applied optimizations: cache blocking, parallelization, unrolling/register blocking.]
Future/Current Work
• Improved auto-tuning via machine learning
• SEJITS + fast hardware prototyping & co-tuning
– CHISEL project
• Composition in pattern-based frameworks
• Multi-level debugging
• Synthesizing optimized code (versus compiling)
Related Work
• Delite (Stanford)
• PetaBricks (MIT)
  – Smart auto-tuning for algorithmic choice
• Auto-tuning compilers (Mary Hall)
  – User-guided auto-tuning for general compilers
  – Difficult to automate due to the domain knowledge required
• Auto-tuning motifs
  – PHiPAC, FFTW, ATLAS, OSKI, Spiral, & more
Conclusions
• High-performance, productive programming is possible with the SEJITS approach
• The approach also makes it easier to write auto-tuners
• Much work is in progress to make it even easier to use
• BSD-licensed and available at http://www.sejits.org/
Acknowledgements
• Armando Fox, Katherine Yelick
• Par Lab professors: Krste Asanović, Ras Bodik, James Demmel, Armando Fox, Kurt Keutzer, John Kubiatowicz, David Patterson, Koushik Sen, David Wessel, Katherine Yelick
• Grad students: Scott Beamer, Derrick Coetzee, Henry Cook, Michael Driscoll, Ekaterina Gonina, Jeffrey Morlan, Jonathan Harper, Erin Carson, Nick Knight
• LBNL/external: Aydin Buluc, Sam Williams, Adam Lugowski, John Gilbert, Leonid Oliker, John Shalf
• Intel/Microsoft: Burton Smith, Tim Mattson, Henry Gabb, Robert Geva, Juan Vargas
• Many undergrads
This work was performed at the UC Berkeley Parallel Computing Laboratory (Par Lab), supported by DARPA (contract #FA8750-10-1-0191) and by the Universal Parallel Computing Research Centers (UPCRC) awards from Microsoft Corp. (Award #024263) and Intel Corp. (Award #024894), with matching funds from the UC Discovery Grant (#DIG07-10227) and additional support from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Oracle, and Samsung.
BACKUP SLIDES
Introspect to Get AST
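The introspection step can be demonstrated with the standard `inspect` and `ast` modules; the `kernel` function here is just a placeholder example, not a real DSEL kernel.

```python
import ast
import inspect

def kernel(a, b):
    # Placeholder for a productivity-level DSEL function.
    return a + b

# Recover the source text of the live function object, then parse it
# into a Python AST that a DSEL compiler can walk and transform.
source = inspect.getsource(kernel)
tree = ast.parse(source)

func = tree.body[0]
print(type(func).__name__, func.name)  # FunctionDef kernel
```

This is why SEJITS needs no language extensions: the interpreter already carries enough metadata to hand the DSEL compiler a full syntax tree at run time.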
Transform into IR
Domain Specific Constructs
Transform into Platform AST & Optimize
• Bulk of the performance expert's knowledge lives in this phase
• Uses Asp's infrastructure for common transformations
• Can generate many variants at once (for auto-tuning)