Composition and Reuse with Compiled Domain-Specific Languages
Pervasive Parallelism (PPL) Lab (http://ppl.stanford.edu) and LAMP (http://lamp.epfl.ch) Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun
Tiark Rompf, Aleksander Prokopec, Vojin Jovanovic, Martin Odersky
Compiled Embedded Parallel DSLs Domain-Specific Languages for Heterogeneous Parallelism
Contact information
Open Source Repository: http://stanford-ppl.github.com/Delite/
Group Website: http://ppl.stanford.edu
E-mail Addresses (Students) [email protected],
E-mail Address (Faculty Advisor) [email protected]
DSL Re-Use
Delite and Lightweight Modular Staging
OptiLA: Linear Algebra
• Dense/Sparse Vector/Matrix
• Can be re-used & extended by other DSLs
OptiML: Machine Learning
• Extends OptiLA
• Adds ML-specific constructs: e.g., TrainingSet, untilconverged
OptiCollections: Parallel Collections
• Arrays, Maps: parallel collection interface using Delite ops
• Designed to be used by other DSLs
OptiQL: Data Querying
• In-memory collection-based querying
• Utilizes OptiCollections’ HashMap
OptiMesh: Mesh-based PDEs
• Liszt on Delite
• More advanced DSL analyses and optimizations
OptiGraph: Small-world Graphs
• Green-Marl on Delite
• Shares analyses / transformations with OptiMesh
DSL Interoperability
Evaluation
OptiQL {
  // customers: Array[Customer]
  // orders: Array[Order]
  val orders = customers Join(orders) WhereEq(_.c_custkey, _.o_custkey) Select((c,o) => new Result {
    val nationKey = c.c_nationkey
    val acctBalance = c.c_acctbal
    val price = o.o_totalprice
  })
  orderData.set(orders)
}
OptiML {
  // run linear regression on price
  val data = Matrix(orderData.get)
  val x = data.sliceCols(0,1)
  val y = data.getCol(2)
  theta.set(linreg.weighted(x,y).toArray)
}
println("theta: " + theta.get)
DSL              Delite Ops                                         Generic Optimizations
OptiML           Map, ZipWith, Reduce, Foreach, Hash, Sort          CSE, DCE, code motion, fusion
OptiQL           Map, Reduce, Filter, Sort, Hash, Join              CSE, DCE, code motion, fusion
OptiGraph        ForeachReduce, Map, Reduce, Filter                 CSE, DCE, code motion
OptiMesh         ForeachReduce                                      CSE, DCE, code motion
OptiCollections  Map, FlatMap, Reduce, Filter, Sort, Hash, ZipWith  CSE, DCE, code motion, fusion
Delite: Reusable infrastructure for rapid development of compiled embedded DSLs
We modified the scala-virtualized compiler to introduce scopes: coarse-grained execution blocks that are compiled, staged, and executed independently. Scopes can communicate via global heap objects.
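The communication pattern can be sketched in plain Scala. This is only an illustration of two independently run blocks exchanging data through a shared heap cell, not the Delite or scala-virtualized API. The names Global, queryBlock, analysisBlock, and meanPrice are hypothetical, and a simple mean stands in for the regression step of the poster's example.

```scala
// Sketch only: plain Scala, no staging. All names are hypothetical.
object ScopeSketch {
  // A heap cell that outlives any single block; blocks share nothing else.
  final class Global[T] {
    private var slot: Option[T] = None
    def set(value: T): Unit = { slot = Some(value) }
    def get: T = slot.getOrElse(throw new IllegalStateException("read before set"))
  }

  case class Customer(custKey: Int, acctBal: Double)
  case class Order(custKey: Int, totalPrice: Double)

  val orderData = new Global[Vector[(Double, Double)]]
  val meanPrice = new Global[Double]

  // Block 1 (query-style): join customers and orders, publish the rows.
  def queryBlock(customers: Vector[Customer], orders: Vector[Order]): Unit = {
    val joined = for {
      c <- customers
      o <- orders
      if c.custKey == o.custKey
    } yield (c.acctBal, o.totalPrice)
    orderData.set(joined)
  }

  // Block 2 (analysis-style): read the published rows, publish a result.
  def analysisBlock(): Unit = {
    val prices = orderData.get.map(_._2)
    meanPrice.set(prices.sum / prices.size)
  }

  def run(): Double = {
    queryBlock(
      Vector(Customer(1, 10.0), Customer(2, 20.0)),
      Vector(Order(1, 100.0), Order(2, 300.0), Order(3, 50.0)))
    analysisBlock()
    meanPrice.get
  }
}
```

Because the blocks only touch each other through the Global cells, each one could in principle be compiled and optimized under its own domain assumptions.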
[Figure: Spectrum of compiled domain-specific languages. Front-end stages: Lexer, Parser, Type checker. Back-end stages: Analysis, Optimization, Code gen. Each side is either embedded or external (Embedded front-end, External front-end, Embedded back-end, External back-end). Points on the spectrum: Stand-alone (not composable), Front-end embedding (not composable), Back-end embedding (composable), Full embedding (composable).]
Why Compiled Domain-Specific Languages? How do we compose them?
[Figure: Delite compilation pipeline. Labels: multi-view IR (generic ops, parallel ops, domain ops, data structures); traversals (generic analyses, generic transformers, generic optimization; domain-specific analyses, domain-specific transformers, domain-specific optimization); code generators; generated code (Scala, CUDA, OpenCL, ...); Delite Execution Graph.]
Example DSL: Vectors
val (v1,v2) = (Vector.rand(1000), Vector.rand(1000))
val a = v1+v2
println(a)
Lightweight Modular Staging: Embedded DSL compilation with metaprogramming
• All application types are wrapped in an abstract type constructor, Rep[T]
• The DSL returns type-safe placeholders instead of immediate results, allowing it to build an intermediate representation, optimize it, and generate code
• The DSL is embedded in Scala, a general purpose programming language
• The syntax is legal Scala
• When the program is executed, it builds an intermediate representation instead of computing a result (DSL compilation)
• The DSL implementation uses Delite ops to internally represent operations as parallel patterns:
  case class VectorPlus extends DeliteOpZipWith { def func = (a,b) => a + b }
• Delite generates code from the DSL IR to multiple platforms (C++, Scala, CUDA, OpenCL)
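The staging idea in the bullets above can be sketched in a few lines of plain Scala. This is a toy illustration, not the LMS or Delite API: Const, Sym, Plus, fold, and emit are hypothetical names, and Rep[T] is modeled here as a concrete sealed trait rather than an abstract type constructor.

```scala
// Toy staging sketch: running the program builds an IR, which is then
// optimized and turned into code. All names are hypothetical.
object StagingSketch {
  // Rep[T] is a placeholder for "a value of type T in the generated program".
  sealed trait Rep[T] {
    // Operators build IR nodes instead of computing a result.
    def +(rhs: Rep[T])(implicit num: Numeric[T]): Rep[T] = Plus(this, rhs)
  }
  final case class Const[T](value: T) extends Rep[T]
  final case class Sym[T](name: String) extends Rep[T]
  final case class Plus[T](lhs: Rep[T], rhs: Rep[T]) extends Rep[T]

  // A DSL program: legal Scala whose execution yields an IR tree.
  def program(x: Rep[Int]): Rep[Int] = x + (Const(1) + Const(2))

  // One optimization pass: fold additions of two constants.
  def fold[T](e: Rep[T])(implicit num: Numeric[T]): Rep[T] = e match {
    case Plus(a, b) =>
      (fold(a), fold(b)) match {
        case (Const(x), Const(y)) => Const(num.plus(x, y))
        case (fa, fb)             => Plus(fa, fb)
      }
    case other => other
  }

  // A trivial code generator that emits an expression string.
  def emit[T](e: Rep[T]): String = e match {
    case Const(v)   => v.toString
    case Sym(n)     => n
    case Plus(a, b) => s"(${emit(a)} + ${emit(b)})"
  }
}
```

Calling emit(fold(program(Sym[Int]("x")))) goes through the same three phases the bullets describe: build the IR, optimize it, generate code. A real back end would emit Scala, C++, CUDA, or OpenCL rather than a string expression.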
[Figure: IR for the vector example (nodes: Vector.rand, Vector.rand, VectorPlus, Println) and the generated targets: Scala code, C++ code, CUDA code, OpenCL code.]
• We want programs that satisfy the 3 P’s:
  • Performance, Productivity, Portability
• Very hard to achieve with a general-purpose language: compilers lack semantic information necessary to efficiently translate to parallel hardware
• Domain-specific languages have been shown to be capable of providing the 3 P’s in some domains, but:
  • Embedded DSLs are easy to build and compose, but don’t perform well
  • Stand-alone DSLs are hard to build and compose, but perform well
• We propose using compiled DSLs embedded in a common back-end compiler
Two main ways:
• Open World: fine-grained cooperative composition
  • Embedded DSLs that are designed to be composed
  • No whole program analyses or optimizations that rely on domain assumptions or restricted semantics
  • DSL composition reduces to object composition in the host language:
    trait Vectors extends MathOps with ScalarOps with VectorOps
    trait Matrices extends Vectors with MatrixOps
• Closed World: coarse-grained isolated composition
  • Embedded DSLs that require isolation to guarantee safety of optimizations
  • 3 steps: 1) independently parse DSL blocks and apply domain-specific optimizations, 2) lower each DSL block to a common, high-level IR, 3) optimize and code generate resulting IR
Mix-in composition
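A minimal sketch of open-world composition as trait mix-in follows. The trait names mirror the poster's one-line example, but the method bodies and the Vec and Mat aliases are hypothetical and are only there to make the sketch runnable.

```scala
// Sketch only: each "DSL" is a trait, and composing DSLs is mixing traits.
object MixinSketch {
  type Vec = Vector[Double]
  type Mat = Vector[Vec]

  trait MathOps   { def square(x: Double): Double = x * x }
  trait ScalarOps { def scale(k: Double, v: Vec): Vec = v.map(_ * k) }
  trait VectorOps {
    def plus(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  }
  trait MatrixOps {
    def times(m: Mat, v: Vec): Vec =
      m.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
  }

  // Same shape as the poster's example: a larger DSL extends a smaller one.
  trait Vectors  extends MathOps with ScalarOps with VectorOps
  trait Matrices extends Vectors with MatrixOps

  // An application mixes in the composed DSL and sees all of its operations.
  object Program extends Matrices {
    val m: Mat = Vector(Vector(1.0, 2.0), Vector(3.0, 4.0))
    val v: Vec = Vector(1.0, 1.0)
    val result: Vec = plus(times(m, v), scale(square(2.0), v))
  }
}
```

Nothing isolates the traits from each other here, which is why this style suits DSLs that do not depend on whole-program, domain-restricted analyses.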
[Figure: OptiQL vs. LINQ, queries Q1 and Q2 on 5GB of data. y-axis: Normalized Execution Time; x-axis: 1P, 8P. Bar labels give speedup over LINQ at 1 thread.]
Query  Threads  OptiQL mean time  LINQ mean time  OptiQL speedup  LINQ speedup
Q1     1        0.91              54.69           60.1            1.0
Q1     8        0.188             23.43           291             2.3
Q2     1        182.95            668.52          3.7             1.0
Q2     8        31.71             144.18          21.1            4.6
Q1 time by optimization (time, speedup over the following entry): group/soa 0.91 (4.46), fusion 4.06 (2.12), staging 8.6 (2.16), array 18.59 (2.07), lift date 38.4 (1.42), iterable 54.69
[Figure: OptiCollections vs. Scala Parallel Collections on 75 MB and 463 MB inputs. y-axis: Normalized Execution Time; x-axis: 1P, 2P, 4P, 8P. Bar labels give speedup over Scala Parallel Collections at 1P.]
Input   System                      1P    2P    4P    8P
75 MB   Scala Parallel Collections  1.0   1.7   2.4   2.8
75 MB   OptiCollections             3.4   4.3   6.8   10.5
463 MB  Scala Parallel Collections  1.0   1.9   3.4   5.5
463 MB  OptiCollections             1.6   3.5   6.1   9.0
OptiCollections mean times (1, 2, 4, 8 threads): 75 MB: 2.1414, 1.6862, 1.0656, 0.6824; 463 MB: 28.042, 12.982, 7.454, 5.052
[Figure: OptiGraph vs. Green Marl (C++). y-axis: Normalized Execution Time; x-axis: 1P, 2P, 4P, 8P. Panel "100k nodes x 800k edges": bar labels 1.7, 1.7, 1.7; the extracted raw timings include 0.05 at 1 thread and 0.03 at 2, 4, and 8 threads. Panel "8M nodes x 64M edges": bar labels 1.0, 1.8, 2.9, 3.3 and .75, 1.5, 2.9, 3.6 (legend order: Green Marl, OptiGraph).]
[Figure: OptiMesh vs. Liszt on Shallow Water and Scalar Convection. y-axis: Normalized Execution Time; bars: Liszt CPU (1P), Liszt GPU, OptiMesh GPU. Bar labels give speedup over Liszt CPU.]
Benchmark          System        Raw time  Speedup
Shallow Water      Liszt CPU     73.9821   1.0
Shallow Water      Liszt GPU     2.199043  33.6
Shallow Water      OptiMesh GPU  2.6       28.5
Scalar Convection  Liszt CPU     5.03738   1.0
Scalar Convection  Liszt GPU     1.39132   3.6
Scalar Convection  OptiMesh GPU  1.641     3.1
[Figure: two bar charts of Normalized Execution Time at 1P and 8P for Scala Library, OptiDSLs No Cross Opt, and OptiDSLs. Bar labels as extracted: first chart 1.0, 3.9, 8.0, 44.9, 8.1, 54.4; second chart 1.0, 6.4, 17.3, 51.1, 31.4, 79.9.]
[Figure: horizontal bar chart of Normalized Execution Time (0 to 1) at 1 P and 8 P for OptiDSLs, OptiDSLs No Cross Opt, and Scala Library, with bars split into OptiQL, OptiGraph, and OptiML. Bar labels as extracted: 4.2, 7.2, 1.6, 14.1, 19.7.]
OptiQL vs. LINQ
OptiCollections vs. Scala
OptiGraph vs. Green Marl
OptiMesh vs. Liszt
Open World Composition
Small dataset
Large dataset
TPCH Q1
TPCH Q2
Shallow Water Simulation
Scalar Convection
Reverse web-link benchmark
Page Rank
Closed World Composition
Twitter Data Analysis