composition and reuse with compiled domain-specific languages · composition and reuse with...

1
Composition and Reuse with Compiled Domain-Specific Languages Pervasive Parallelism (PPL) Lab (http://ppl.stanford.edu ) and LAMP (http://lamp.epfl.ch ) Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun Tiark Rompf, Aleksander Prokopec, Vojin Jovanovic, Martin Odersky Compiled Embedded Parallel DSLs Domain-Specific Languages for Heterogeneous Parallelism Contact information Open Source Repository http://stanfordppl.github.com/Delite/ Group Website http://ppl.stanford.edu E-mail Addresses (Students) [email protected], E-mail Address (Faculty Advisor) [email protected] DSL Re-Use Delite and Lightweight Modular Staging OptiLA: Linear Algebra Dense/Sparse Vector/Matrix Can be re-used & extended by other DSLs OptiML: Machine Learning Extends OptiLA Adds ML-specific constructs: e.g., TrainingSet, untilconverged OptiCollections: Parallel Collections Arrays, Maps - Parallel collection interface using Delite ops Designed to be used by other DSLs OptiQL: Data Querying In-memory collection-based querying Utilizes OptiCollections’ HashMap OptiMesh: Mesh-based PDEs Liszt on Delite More advanced DSL analyses and optimizations OptiGraph: Small-world Graphs Green-Marl on Delite Shares analyses / transformations with OptiMesh DSL Interoperability Evaluation OptiQL { // customers: Array[Customer] // orders: Array[Order] val orders = customers Join(orders) WhereEq(_.c_custkey, _.o_custkey) Select((c,o) => new Result { val nationKey = c.c_nationkey val acctBalance = c.c_acctbal val price = o.o_totalprice }) orderData.set(orders) } OptiML { // run linear regression on price val data = Matrix(orderData.get) val x = data.sliceCols(0,1) val y = data.getCol(2) theta.set(linreg.weighted(x,y).toArray) } println("theta: " + theta.get) DSL Delite Ops Generic Optimizations OptiML Map, ZipWith, Reduce, Foreach, Hash, Sort CSE, DCE, code motion, fusion OptiQL Map, Reduce, Filter, Sort, Hash, Join CSE, DCE, code motion, fusion OptiGraph ForeachReduce, Map, Reduce, Filter CSE, DCE, code motion OptiMesh ForeachReduce CSE, DCE, code motion OptiCollections Map, FlatMap, Reduce, Filter, Sort, Hash, ZipWith CSE, DCE, code motion, fusion Delite: Reusable infrastructure for rapid development of compiled embedded DSLs We modified the scala-virtualized compiler to introduce scopes, coarse-grained execution blocks that are compiled, staged and executed independently Scopes can communicate via global heap objects Lexer Parser Type checker Embedded front7end External front7end Full embedding (composable) Front7end embedding (not composable) Spectrum of compiled domain1specific languages Stand7alone (not composable) Back7end embedding (composable) Analysis OpCmizaCon Code gen Embedded back7end External back7end Why Compiled Domain-Specific Languages? How do we compose them? Delite Execu+on Graph Scala CUDA OpenCL gen ops parallel domain mul$view IR traversals generated code ... Code generators data structures Gen analyses Gen transformers Generic op+miza+on DS analyses DS transformers DS op+miza+on Example DSL: Vectors val (v1,v2) = (Vector.rand(1000), Vector.rand(1000)) val a = v1+v2 println(a) Lightweight Modular Staging: Embedded DSL compilation with metaprogramming All application types are wrapped in an abstract type constructor, Rep[T] The DSL returns type-safe placeholders instead of immediate results, allowing it to build an intermediate representation, optimize it, and generate code The DSL is embedded in Scala, a general purpose programming language The syntax is legal Scala When the program is executed, it builds an intermediate representation instead of computing a result (DSL compilation) The DSL implementation uses Delite ops to internally represent operations as parallel patterns: case class VectorPlus extends DeliteOpZipWith { def func = (a,b) => a + b } Delite generates code from the DSL IR to multiple platforms (C++, Scala, CUDA, OpenCL) Vector.rand Vector.rand VectorPlus Println Scala code C++ code CUDA code OpenCL code We want programs that satisfy the 3 P’s: Performance, Productivity, Portability Very hard to achieve with a general-purpose language: compilers lack semantic information necessary to efficiently translate to parallel hardware Domain-specific languages have been shown to be capable of providing the 3 P’s in some domains, but: Embedded DSLs are easy to build and compose, but don’t perform well Stand-alone DSLs are hard to build and compose, but perform well We propose using compiled DSLs embedded in a common back-end compiler Two main ways: Open World: fine-grained cooperative composition Embedded DSLs that are designed to be composed No whole program analyses or optimizations that rely on domain assumptions or restricted semantics DSL composition reduces to object composition in the host lang trait Vectors extends MathOps with ScalarOps with VectorOps trait Matrices extends Vectors with MatrixOps Closed World: coarse-grained isolated composition Embedded DSLs that require isolation to guarantee safety of optimizations 3 steps: 1) independently parse DSL blocks and apply domain- specific optimizations, 2) lower each DSL block to a common, high-level IR, 3) optimize and code generate resulting IR Mix-in composition 1.0 2.3 60.1 291 0 0.2 0.4 0.6 0.8 1 1.2 1P 8P Normalized Execution Time 5GB 1.0 4.6 3.7 21.1 0 0.2 0.4 0.6 0.8 1 1.2 1P 8P LINQ OptiQL 5GB 1.0 1.9 3.4 5.5 1.6 3.5 6.1 9.0 0 0.2 0.4 0.6 0.8 1 1.2 1P 2P 4P 8P Scala Parallel Collections OptiCollections 463 MB 1.0 1.7 2.4 2.8 3.4 4.3 6.8 10.5 0 0.2 0.4 0.6 0.8 1 1.2 1P 2P 4P 8P Normalized Execution Time 75 MB 1.7 1.7 1.7 0 0.2 0.4 0.6 0.8 1 1P 2P 4P 8P Normalized Execution Time 100k nodes x 800k edges 1.0 1.8 2.9 3.3 .75 1.5 2.9 3.6 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1P 2P 4P 8P Green Marl OptiGraph 8M nodes x 64M edges 1.0 33.6 28.5 0 0.2 0.4 0.6 0.8 1 1.2 Normalized Execution Time 1P GPU GPU 1.0 3.6 3.1 0 0.2 0.4 0.6 0.8 1 1.2 Liszt CPU Liszt GPU OptiMesh 1P GPU GPU 1.0 3.9 8.0 44.9 8.1 54.4 0 0.2 0.4 0.6 0.8 1 1.2 1P 8P Normalized Execution Time Scala Library OptiDSLs No Cross Opt OptiDSLs 1.0 6.4 17.3 51.1 31.4 79.9 0 0.2 0.4 0.6 0.8 1 1.2 1P 8P Normalized Execution Time Scala Library OptiDSLs No Cross Opt OptiDSLs 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 OptiDSLs OptiDSLs No Cross Opt Scala Library OptiDSLs OptiDSLs No Cross Opt Scala Library Normalized Execution Time OptiQL OptiGraph OptiML 4.2 7.2 1.6 14.1 19.7 1 P 8 P OptiQL vs. LINQ OptiCollections vs. Scala OptiGraph vs. Green Marl OptiMesh vs. Liszt Open World Composition Small dataset Large dataset TPCH Q1 TPCH Q2 Shallow Water Simulation Scalar Convection Reverse web-link benchmark Page Rank Closed World Composition Twitter Data Analysis

Upload: duongngoc

Post on 20-Aug-2018

226 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Composition and Reuse with Compiled Domain-Specific Languages · Composition and Reuse with Compiled Domain-Specific Languages Pervasive Parallelism (PPL) Lab () and LAMP () ... •

Composition and Reuse with Compiled Domain-Specific Languages

Pervasive Parallelism (PPL) Lab (http://ppl.stanford.edu) and LAMP (http://lamp.epfl.ch) Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun

Tiark Rompf, Aleksander Prokopec, Vojin Jovanovic, Martin Odersky

Compiled Embedded Parallel DSLs Domain-Specific Languages for Heterogeneous Parallelism

Contact information Open Source Repository  http://stanford-­‐ppl.github.com/Delite/   Group Website http://ppl.stanford.edu  

E-mail Addresses (Students) [email protected],  

E-mail Address (Faculty Advisor) [email protected]  

DSL Re-Use

Delite and Lightweight Modular Staging

OptiLA: Linear Algebra • Dense/Sparse Vector/Matrix • Can be re-used & extended by other DSLs

OptiML: Machine Learning • Extends OptiLA • Adds ML-specific constructs: e.g., TrainingSet, untilconverged

OptiCollections: Parallel Collections • Arrays, Maps - Parallel collection interface using Delite ops • Designed to be used by other DSLs

OptiQL: Data Querying • In-memory collection-based querying • Utilizes OptiCollections’ HashMap

OptiMesh: Mesh-based PDEs • Liszt on Delite • More advanced DSL analyses and optimizations

OptiGraph: Small-world Graphs • Green-Marl on Delite • Shares analyses / transformations with OptiMesh

DSL Interoperability

Evaluation

OptiQL  {      //  customers:  Array[Customer]      //  orders:  Array[Order]      val  orders  =  customers  Join(orders)            WhereEq(_.c_custkey,  _.o_custkey)          Select((c,o)  =>  new  Result  {                val  nationKey  =  c.c_nationkey                val  acctBalance  =  c.c_acctbal              val  price  =  o.o_totalprice          })      orderData.set(orders)  }    OptiML  {      //  run  linear  regression  on  price      val  data  =  Matrix(orderData.get)      val  x  =  data.sliceCols(0,1)      val  y  =  data.getCol(2)      theta.set(linreg.weighted(x,y).toArray)  }    println("theta:  "  +  theta.get)    

DSL Delite Ops Generic Optimizations

OptiML Map, ZipWith, Reduce, Foreach, Hash, Sort

CSE, DCE, code motion, fusion

OptiQL Map, Reduce, Filter, Sort, Hash, Join

CSE, DCE, code motion, fusion

OptiGraph ForeachReduce, Map, Reduce, Filter

CSE, DCE, code motion

OptiMesh ForeachReduce CSE, DCE, code motion

OptiCollections Map, FlatMap, Reduce, Filter, Sort, Hash, ZipWith

CSE, DCE, code motion, fusion

Delite: Reusable infrastructure for rapid development of compiled embedded DSLs

We modified the scala-virtualized compiler to introduce scopes, coarse-grained execution blocks that are compiled, staged and executed independently Scopes can communicate via global heap objects

Lexer% Parser% Type%checker%

Embedded%front7end%

External%front7end%Full%embedding%(composable)%Front7end%embedding%(not%composable)%

Spectrum)of)compiled)domain1specific)languages)

Stand7alone%(not%composable)% Back7end%embedding%(composable)%

Analysis% OpCmizaCon% Code%gen%

Embedded%back7end%

External%back7end%

Why Compiled Domain-Specific Languages? How do we compose them?

Delite&Execu+on&Graph&

Scala&CUDA&OpenCL&

gen&ops&parallel&

domain&

mul$view)IR) traversals) generated)code)

."."."

Code

&gen

erators&

data&structures&

Gen&analyses&

Gen&transformers&

Generic&op+miza+on&

DS&analyses&

DS&transformers&

DS&op+miza+on&

Example DSL: Vectors val  (v1,v2)  =  (Vector.rand(1000),  Vector.rand(1000))  val  a  =  v1+v2  println(a)  

Lightweight Modular Staging: Embedded DSL compilation with metaprogramming •  All application types are wrapped in an abstract type constructor, Rep[T]

•  The DSL returns type-safe placeholders instead of immediate results, allowing it to build an intermediate representation, optimize it, and generate code

•  The DSL is embedded in Scala, a general purpose programming language

•  The syntax is legal Scala

•  When the program is executed, it builds an intermediate representation instead of computing a result (DSL compilation)

•  The DSL implementation uses Delite ops to internally represent operations as parallel patterns: •  case  class  VectorPlus  extends  DeliteOpZipWith  {          def  func  =  (a,b)  =>  a  +  b        }  

•  Delite generates code from the DSL IR to multiple platforms (C++, Scala, CUDA, OpenCL)

Vector.rand Vector.rand

VectorPlus

Println

Scala code

C++ code

CUDA code

OpenCL code

•  We want programs that satisfy the 3 P’s: •  Performance, Productivity, Portability

•  Very hard to achieve with a general-purpose language: compilers lack semantic information necessary to efficiently translate to parallel hardware

•  Domain-specific languages have been shown to be capable of providing the 3 P’s in some domains, but: •  Embedded DSLs are easy to build and compose, but

don’t perform well •  Stand-alone DSLs are hard to build and compose, but

perform well •  We propose using compiled DSLs embedded in a

common back-end compiler

Two main ways: •  Open World: fine-grained cooperative composition

•  Embedded DSLs that are designed to be composed •  No whole program analyses or optimizations that rely on

domain assumptions or restricted semantics •  DSL composition reduces to object composition in the host lang  trait  Vectors  extends  MathOps  with  ScalarOps  with  VectorOps  trait  Matrices  extends  Vectors  with  MatrixOps  

•  Closed World: coarse-grained isolated composition •  Embedded DSLs that require isolation to guarantee safety of

optimizations •  3 steps: 1) independently parse DSL blocks and apply domain-

specific optimizations, 2) lower each DSL block to a common, high-level IR, 3) optimize and code generate resulting IR

Mix-in composition

OptiQL LINQ OptiQL LINQ OptiQL LINQ OptiQL LINQ0.91 52.73 0.188 19.29 182.7 663.76 31.54 154.56

55.65 22.83 182.03 668.7 31.82 140.0255.7 28.17 182.46 673.11 31.71 137.96

184.21 31.82183.36 31.68

0.91 54.69333 0.188 23.43 182.952 668.5233 31.714 144.18 (raw)0.016638 1 0.003437 0.428389 0.273666 1 0.047439 0.215669 (time)60.10256 1 290.922 2.334329 3.654091 1 21.07975 4.636727 (speedup)

total speedupgroup/soa 0.91 4.461538fusion 4.06 2.118227staging 8.6 2.161628array 18.59 2.065627lift date 38.4 1.424219iterable 54.69

OptiQLQ1 Q2

1 thread 8 threads 1 thread 8 threads

1.0

2.3

60.1 291 0

0.2

0.4

0.6

0.8

1

1.2

1P 8P

No

rmal

ize

d E

xecu

tio

n T

ime

5GB 1.0

4.6 3.7

21.1 0

0.2

0.4

0.6

0.8

1

1.2

1P 8P

LINQ

OptiQL

5GB

1 2 4 8 1 2 4 82.1414 1.6862 1.0656 0.6824 28.95 13.67 7.45 4.93

28.04 12.47 7.46 527.57 13.4 7.47 4.9927.62 12.71 7.36 5.0328.03 12.66 7.53 5.31

2.1414 1.6862 1.0656 0.6824 28.042 12.982 7.454 5.0520.297541 0.234292 0.148062 0.094817 0.614857 0.284647 0.163439 0.1107723.360885 4.268177 6.753941 10.5466 1.626394 3.513121 6.118505 9.02758

463 MB75 MBOptiCollections

1.0

1.9

3.4

5.5

1.6

3.5 6.1

9.0

0

0.2

0.4

0.6

0.8

1

1.2

1P 2P 4P 8P

Scala ParallelCollectionsOptiCollections

463 MB 1.0

1.7

2.4 2.8

3.4 4.3

6.8 10.5

0

0.2

0.4

0.6

0.8

1

1.2

1P 2P 4P 8P

No

rma

lize

d E

xecu

tio

n T

ime

75 MB

C++1 2 4 8 1 2 4 80.05 0.03 0.03 0.030.05 0.03 0.03 0.030.05 0.03 0.03 0.030.05 0.03 0.03 0.030.05 0.03 0.03 0.03

0.05 0.03 0.03 0.03 0.09 0.05 0.03 0.03 0.03 (raw)1 0.6 0.6 0.6 1.8 1 0.6 0.6 0.6 (time)1 1.666667 1.666667 1.666667 0.555556 1 1.666667 1.666667 1.666667 (speedup)

0.555556 0.333333 0.333333 0.333333 1 0.555556 0.333333 0.333333 0.333333 (time rel. c++)1.8 3 3 3 1 1.8 3 3 3 (speedup rel. c++)

100k x 800kGreen MarlOptiGraph

1.7 1.7 1.7

0

0.2

0.4

0.6

0.8

1

1P 2P 4P 8P

No

rma

lize

d E

xe

cuti

on

Tim

e

100k nodes x 800k edges

1.0

1.8

2.9 3.3

.75

1.5

2.9 3.6

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1P 2P 4P 8P

Green Marl

OptiGraph

8M nodes x 64M edges

GPU Liszt GPU Liszt CPU GPU Liszt GPU Liszt CPU2.6 2.199043 73.9821 1.641 1.39132 5.03738 (raw)

0.035144 0.029724 1 0.325765 0.276199 1 (time) 0.035144 0.02972428.45465 33.64287 1 3.069701 3.620576 1 (speedup) 28.45465 33.64287

OptiMeshShallow Water Scalar Convection

1.0

33.6 28.5 0

0.2

0.4

0.6

0.8

1

1.2

No

rma

lize

d E

xe

cuti

on

Tim

e

28.5

33.6

0

0.01

0.02

0.03

0.04

Shallow Water

No

rma

lize

d E

xe

cuti

on

Tim

e

OptiMesh GPU

3.1

3.6

0

0.1

0.2

0.3

0.4

Scalar Convection

Liszt GPU

1P GPU GPU

1.0

3.6 3.1

0

0.2

0.4

0.6

0.8

1

1.2

Liszt CPU

Liszt GPU

OptiMesh

1P GPU GPU

1.0

3.9

8.0 44.9

8.1 54.4

0

0.2

0.4

0.6

0.8

1

1.2

1P 8P

No

rmal

ize

d E

xecu

tio

n T

ime

Scala Library

OptiDSLs No Cross Opt

OptiDSLs

1.0

6.4

17.3 51.1 31.4 79.9 0

0.2

0.4

0.6

0.8

1

1.2

1P 8P

No

rma

lize

d E

xecu

tio

n T

ime

Scala Library

OptiDSLs No Cross Opt

OptiDSLs

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

OptiDSLs

OptiDSLsNo Cross Opt

Scala Library

OptiDSLs

OptiDSLsNo Cross Opt

Scala Library

Normalized Execution Time

OptiQL

OptiGraph

OptiML

4.2

7.2

1.6

14.1

19.7

1 P

8 P

OptiQL vs. LINQ OptiCollections vs. Scala

OptiGraph vs. Green Marl OptiMesh vs. Liszt

Open World Composition

Small dataset Large dataset

TPCH Q1 TPCH Q2

Shallow Water Simulation Scalar Convection

Reverse web-link benchmark

Page Rank

Closed World Composition

Twitter Data Analysis