Apache SystemML Architecture by Niketan Pansare
TRANSCRIPT
SystemML Architecture: Niketan Pansare, Berthold Reinwald
July 25th, 2016
Agenda
• High-level Design & APIs
• Architecture Overview
• Tooling
• Important links
From http://systemml.apache.org/
Agenda
• High-level Design & APIs
• Architecture Overview
  • Language
  • Compiler
  • Runtime
• Two examples:
  • Simple DML expression with an example dataset
  • Linear Regression with varying data sizes
• Tooling
• Important links
SystemML Design

DML Scripts: DML (Declarative Machine Learning Language)
Data:
1. On disk / HDFS
2. RDD / DataFrame
3. double[][]

Execution back ends: Hadoop (since 2010) or Spark (since 2015) cluster (scale-out), and in-memory single node (scale-up, since 2012).

SystemML generates hybrid execution plans*, e.g.:
CP + b sb _mVar1
SPARK mapmm X _mVar1 _mVar2 RIGHT false NONE
CP * y _mVar2 _mVar3
SystemML Design

Command line API* (also MLContext*): -exec hadoop targets the Hadoop cluster (scale-out).
SystemML Design

For the in-memory single node (scale-up), two options:
1. -exec singlenode
2. Use the standalone jar (preserves rewrites, but may spawn local MR jobs)
SystemML Design

MLContext API (Java/Python/Scala), targeting the Spark cluster:
https://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html
SystemML Design

JMLC API, for the in-memory single node:
https://apache.github.io/incubator-systemml/jmlc.html
From DML to Execution Plan

DML scripts and data enter SystemML, which generates hybrid execution plans* spanning the in-memory single node (scale-up) and the Hadoop or Spark cluster (scale-out), for example:
CP + b sb _mVar1
SPARK mapmm X _mVar1 _mVar2 RIGHT false NONE
CP * y _mVar2 _mVar3
From DML to Execution Plan

Internally, SystemML is layered into Language, Compiler, and Runtime. Assuming an example dataset X: 100M x 500, y: 100M x 1, b/sb: 500 x 1, the compiler produces the hybrid execution plan*:
CP + b sb _mVar1
SPARK mapmm X _mVar1 _mVar2 RIGHT false NONE
CP * y _mVar2 _mVar3
SystemML Compilation Chain
• Parsing
  • Parse input DML/PyDML using ANTLR v4 (see Dml.g4 and Pydml.g4)
  • Perform syntactic validation
  • Construct DMLProgram (=> list of statement and function blocks)
• Live Variable Analysis
  • Classic dataflow analysis
  • A variable is "live" if it holds a value that may be needed in the future
  • Dead code elimination
• Semantic Validation
• Dataflow in DAGs of operations on matrices, frames, and scalars
• Choosing from alternative execution plans based on memory and cost estimates
• Operator ordering & selection; hybrid plans
* Discussed later in Tooling

spark-submit --master yarn-client --driver-memory 20G --num-executors 4 --executor-memory 40G --executor-cores 24 SystemML.jar -f test.dml -explain hops
• Low-level physical execution plan (LOP DAGs)
  • Over key-value pairs for MR
  • Over RDDs for Spark
• "Piggybacking" operations into a minimal number of Map-Reduce jobs
Example runtime plan with CP and Spark instructions:
CP + b sb _mVar1
SPARK mapmm X.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE RIGHT false NONE
CP * y _mVar2 _mVar3
SystemML Runtime
• Hybrid runtime
  • CP: single-machine operations & orchestration of jobs
  • MR: generic Map-Reduce jobs & operations
  • SP: Spark jobs
• Numerically stable operators
• Dense / sparse matrix representation
• Multi-level buffer pool (caching) to evict in-memory objects
• Dynamic recompilation for initial unknowns

The control program ties together the runtime program, buffer pool, ParFor optimizer/runtime, recompiler, CP/Spark/MR instructions, DFS and in-memory/FS IO, generic MR jobs, and the MatrixBlock library (single-/multi-threaded).
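The dense/sparse representation choice mentioned above can be sketched in plain Python. This is an illustrative toy, not SystemML's MatrixBlock code: the CSR-style encoding, the helper names, and the 0.4 sparsity turn point are all assumptions for the example.

```python
# Illustrative sketch (not SystemML's MatrixBlock): a dense row-major block
# vs. a minimal CSR-style sparse encoding, plus a sparsity-based format choice.

def to_csr(block):
    """Encode a dense row-major block as (values, column indices, row pointers)."""
    vals, cols, rowptr = [], [], [0]
    for row in block:
        for j, x in enumerate(row):
            if x != 0.0:
                vals.append(x)
                cols.append(j)
        rowptr.append(len(vals))  # one pointer entry per finished row
    return vals, cols, rowptr

def choose_format(block, turn_point=0.4):
    """Pick 'sparse' when the fraction of non-zeros falls below the threshold
    (the 0.4 turn point here is an assumption for illustration)."""
    rows, cols = len(block), len(block[0])
    nnz = sum(1 for row in block for x in row if x != 0.0)
    return "sparse" if nnz / (rows * cols) < turn_point else "dense"

block = [
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
print(choose_format(block))  # 2/9 non-zeros -> "sparse"
print(to_csr(block))         # ([3.0, 1.0], [2, 0], [0, 1, 1, 2])
```

The point of the CSR layout is that storage scales with the number of non-zeros rather than rows x columns, which is why a runtime benefits from switching formats per block.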
From DML to Execution Plan

The same Language / Compiler / Runtime stack, now applied to LinearRegression.dml with varying data sizes.
A Data Scientist: Linear Regression

Model: X w ≈ y, where X holds the explanatory/independent variables, y is the predicted/dependent variable, and w is the model.

Optimization problem: w = argmin_w ||Xw - y||^2 + λ||w||^2

Conjugate Gradient Method (iterate until convergence):
• Initialize, and start off with the (negative) gradient as the initial direction
• For each step:
  1. Compute the step size and move to the optimal point along the chosen direction (update w)
  2. Recompute the gradient
  3. Project it onto the subspace conjugate* to all prior directions
  4. Use this as the next direction, and check the accuracy measures for convergence
(* conjugate = orthogonal given A as the metric, where A = X^T X + λI)
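The iteration above can be sketched in NumPy. The function name `linreg_cg` and its defaults are illustrative, not SystemML's LinearRegCG script; note that A = X^T X + λI is applied implicitly via two matrix-vector products per step rather than materialized.

```python
import numpy as np

def linreg_cg(X, y, lam=0.01, max_iter=100, tol=1e-9):
    """Conjugate gradient for (X^T X + lam*I) w = X^T y.
    Illustrative sketch: A is applied as t(X) %*% (X %*% p)."""
    n = X.shape[1]
    w = np.zeros(n)
    b = X.T @ y
    r = b.copy()               # residual = b - A w (w starts at 0)
    p = r.copy()               # initial direction = (negative) gradient
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = X.T @ (X @ p) + lam * p   # never materialize X^T X
        alpha = rs_old / (p @ Ap)      # optimal step along direction p
        w += alpha * p                 # move to the optimum on this line
        r -= alpha * Ap                # recompute the gradient/residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # accuracy measure / convergence check
            break
        p = r + (rs_new / rs_old) * p  # next direction, conjugate to priors
        rs_old = rs_new
    return w

# tiny usage check on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = np.arange(1.0, 6.0)
w = linreg_cg(X, X @ w_true, lam=0.0)
print(np.allclose(w, w_true, atol=1e-6))
```

With an n-column X, CG converges in at most n steps in exact arithmetic, which is why each iteration only needs the two matrix-vector products shown.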
SystemML: Run LinRegCG on Spark

Cluster: 20 GB driver on 16 cores, 6 x 55 GB executors. With y fixed at 100M x 1 and X at 100M rows, the chosen plan changes as the number of columns of X grows:
• X: 100M x 10 (8 GB): multithreaded single node; the data fits in the driver.
• X: 100M x 100 (80 GB): hybrid plan with RDD caching and a fused operator; X is pinned in the executors' RDD cache and a fused tMMv runs per partition:
  ... x.persist(); ... X.mapValues(tMMv).reduce() ...
• X: 100M x 1,000 (800 GB): hybrid plan with RDD out-of-core and the fused operator; the same tMMv plan, but the RDD cache spills to disk.
• X: 100M x 10,000 (8 TB): hybrid plan with RDD out-of-core and different operators; two matrix-vector multiplications (Mv, then tvM) with broadcast, mapToPair, and reduceByKey.
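The fused vs. non-fused plans above can be sketched with NumPy row blocks standing in for RDD partitions. The function names `tmmv_fused` and `tmmv_two_ops` are illustrative, not SystemML operators; the assumption is that the fused operator computes t(X) %*% (X %*% v) in one pass over X, while the non-fused plan scans X twice.

```python
import numpy as np

def tmmv_fused(X_blocks, v):
    """One pass over the row blocks of X: each block contributes
    t(Xb) %*% (Xb %*% v); the partial results are summed (the reduce)."""
    return sum(Xb.T @ (Xb @ v) for Xb in X_blocks)

def tmmv_two_ops(X_blocks, v):
    """Two passes: first u = X %*% v (Mv), then t(X) %*% u (tvM),
    which requires scanning the blocks of X a second time."""
    u_parts = [Xb @ v for Xb in X_blocks]                      # pass 1: Mv
    return sum(Xb.T @ u for Xb, u in zip(X_blocks, u_parts))   # pass 2: tvM

rng = np.random.default_rng(1)
X_blocks = [rng.standard_normal((100, 10)) for _ in range(4)]
v = rng.standard_normal(10)
print(np.allclose(tmmv_fused(X_blocks, v), tmmv_two_ops(X_blocks, v)))
```

Both variants compute the same t(X) X v; the fused form matters at scale because it reads each cached (or spilled) partition of X only once per CG iteration.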
SystemML's Compilation Chain / Overview of Tools

• EXPLAIN hops
• EXPLAIN runtime
• EXPLAIN *_recompile
• STATS
• DEBUG

HOP (high-level operator), LOP (low-level operator)

[Matthias Boehm et al: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 2014]
Explain (Understanding Execution Plans)
• Overview
  • Shows the generated execution plan (at different compilation steps)
  • Introduced 05/2014 for internal usage
  • Important tool for understanding/debugging optimizer choices!
• Usage
  • hadoop jar SystemML.jar -f test.dml -explain [hops | runtime | hops_recompile | runtime_recompile]
  • hops: program with HOP DAGs after optimization
  • runtime (default): program with generated runtime instructions
  • hops_recompile: see hops, plus the HOP DAG after every recompile
  • runtime_recompile: see runtime, plus generated runtime instructions after every recompile
Explain: Understanding HOP DAGs (simple DML)

spark-submit --master yarn-client --driver-memory 20G --num-executors 4 --executor-memory 40G --executor-cores 24 SystemML.jar -f test.dml -explain hops

Each HOP line in the output (-explain hops, -explain hops_recompile) shows:
• HOP ID
• HOP opcode
• HOP input data dependencies (via HOP IDs)
• HOP output matrix characteristics (rlen, clen, brlen, bclen, nnz)
• HOP memory estimates (all inputs, intermediates, output -> operation mem)
• HOP execution type (CP/SP/MR)
• Optional: indicators of reblock/checkpointing (caching) of HOP outputs
For Spark operations, the broadcast memory budget is also shown.
Explain: Understanding HOP DAGs (entire script)
• Example DML Script (Simplified LinregDS)

X = read($1);
y = read($2);
intercept = $3;
lambda = $4;
if( intercept == 1 ) {
  ones = matrix(1, nrow(X), 1);
  X = append(X, ones);
}
I = matrix(1, ncol(X), 1);
A = t(X) %*% X + diag(I*lambda);
b = t(X) %*% y;
beta = solve(A, b);
write(beta, $5);

Invocation: hadoop jar SystemML.jar -f linregds.dml -args X y 0 0 beta
Scenario: X: 100,000 x 1,000, sparsity 1.0; y: 100,000 x 1, sparsity 1.0 (800 MB, 200+ GFLOP)
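As a reading aid, the script above can be transliterated line by line into NumPy. This is a sketch of the math the script expresses, not SystemML's runtime; the function name `linreg_ds` is made up for the example.

```python
import numpy as np

def linreg_ds(X, y, intercept=0, lam=0.0):
    """NumPy transliteration of the Simplified LinregDS DML script."""
    if intercept == 1:
        ones = np.ones((X.shape[0], 1))        # ones = matrix(1, nrow(X), 1)
        X = np.hstack([X, ones])               # X = append(X, ones)
    A = X.T @ X + lam * np.eye(X.shape[1])     # A = t(X) %*% X + diag(I*lambda)
    b = X.T @ y                                # b = t(X) %*% y
    return np.linalg.solve(A, b)               # beta = solve(A, b)

# tiny usage check on synthetic data
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 3))
w_true = np.array([2.0, -1.0, 0.5])
beta = linreg_ds(X, X @ w_true)
print(np.allclose(beta, w_true))
```

Unlike the CG variant, this "direct solve" script materializes the k x k matrix t(X) %*% X, which is exactly what the tsmm instruction in the runtime plan below computes.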
Explain: Understanding HOP DAGs (2)
• Explain Hops

15/07/05 17:18:06 INFO api.DMLScript: EXPLAIN (HOPS):
# Memory Budget local/remote = 57344MB/1434MB/1434MB
# Degree of Parallelism (vcores) local/remote = 24/144/72
PROGRAM
--MAIN PROGRAM
----GENERIC (lines 1-4) [recompile=false]
------(10) PRead X [100000,1000,1000,1000,100000000] [0,0,763 -> 763MB], CP
------(11) TWrite X (10) [100000,1000,1000,1000,100000000] [763,0,0 -> 763MB], CP
------(21) PRead y [100000,1,1000,1000,100000] [0,0,1 -> 1MB], CP
------(22) TWrite y (21) [100000,1,1000,1000,100000] [1,0,0 -> 1MB], CP
------(24) TWrite intercept [0,0,-1,-1,-1] [0,0,0 -> 0MB], CP
------(26) TWrite lambda [0,0,-1,-1,-1] [0,0,0 -> 0MB], CP
----GENERIC (lines 11-16) [recompile=false]
------(42) TRead X [100000,1000,1000,1000,100000000] [0,0,763 -> 763MB], CP
------(52) r(t) (42) [1000,100000,1000,1000,100000000] [763,0,763 -> 1526MB]
------(53) ba(+*) (52,42) [1000,1000,1000,1000,-1] [1526,8,8 -> 1541MB], CP
------(43) TRead y [100000,1,1000,1000,100000] [0,0,1 -> 1MB], CP
------(59) ba(+*) (52,43) [1000,1,1000,1000,-1] [764,0,0 -> 764MB], CP
------(60) b(solve) (53,59) [1000,1,1000,1000,-1] [8,8,0 -> 15MB], CP
------(66) PWrite beta (60) [1000,1,-1,-1,-1] [0,0,0 -> 0MB], CP

The output shows the cluster characteristics, the program structure (incl. recompile flags), and the unrolled HOP DAG. Note: the if branch (lines 6-9) and the regularization were removed by rewrites.
Explain: Understanding Runtime Plans (1)
• Explain Runtime (simplified filenames, removed rmvar)

15/07/05 17:18:53 INFO api.DMLScript: EXPLAIN (RUNTIME):
# Memory Budget local/remote = 57344MB/1434MB/1434MB
# Degree of Parallelism (vcores) local/remote = 24/144/72
PROGRAM ( size CP/MR = 25/0 )
--MAIN PROGRAM
----GENERIC (lines 1-4) [recompile=false]
------CP createvar pREADX X false binaryblock 100000 1000 1000 1000 100000000
------CP createvar pREADy y false binaryblock 100000 1 1000 1000 100000
------CP assignvar 0.SCALAR.INT.true intercept.SCALAR.INT
------CP assignvar 0.0.SCALAR.DOUBLE.true lambda.SCALAR.DOUBLE
------CP cpvar pREADX X
------CP cpvar pREADy y
----GENERIC (lines 11-16) [recompile=false]
------CP createvar _mVar2 .../_t0/temp1 true binaryblock 1000 1000 1000 1000 -1
------CP tsmm X.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE LEFT 24
------CP createvar _mVar3 .../_t0/temp2 true binaryblock 1 100000 1000 1000 100000
------CP r' y.MATRIX.DOUBLE _mVar3.MATRIX.DOUBLE
------CP createvar _mVar4 .../_t0/temp3 true binaryblock 1 1000 1000 1000 -1
------CP ba+* _mVar3.MATRIX.DOUBLE X.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 24
------CP createvar _mVar5 .../_t0/temp4 true binaryblock 1000 1 1000 1000 -1
------CP r' _mVar4.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE
------CP createvar _mVar6 .../_t0/temp5 true binaryblock 1000 1 1000 1000 -1
------CP solve _mVar2.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE
------CP write _mVar6.MATRIX.DOUBLE .../beta.SCALAR.STRING.true textcell.SCALAR.STRING.true

This is literally a string representation of the runtime instructions.

IBM Research
Stats (Profiling Runtime Statistics)
• Overview
  • Profiles and shows aggregated runtime statistics of potential bottlenecks
  • Introduced 01/2014 for internal usage, an extension of the buffer pool stats from 01/2013
  • Important tool for understanding runtime characteristics and for profiling/tuning system internals by developers
• Usage
  • hadoop jar SystemML.jar -f test.dml -stats

The SystemML statistics output includes: total execution time; buffer pool stats; dynamic recompilation stats; JVM stats (JIT, GC); heavy-hitter instructions (incl. buffer pool times); and, optionally, parfor stats (if the program contains parfors).
Debug (Script Debugging)
• Overview
  • Script-level debugging by end users (and developers)
  • Introduced 09/2014 as the result of an intern project
  • gdb-inspired command-line debugger interface
• Usage
  • hadoop jar SystemML.jar -f test.dml -debug
Important Links
• Website: http://systemml.apache.org/
• Interested in SystemML?
  • Go to https://github.com/apache/incubator-systemml and "Star it"
• Want to contribute to SystemML?
  • See http://apache.github.io/incubator-systemml/contributing-to-systemml.html
  • List of issues: https://issues.apache.org/jira/browse/SYSTEMML/
  • Ask any of our PMC members for suggestions
• Want to try out SystemML?
  • Laptop: http://apache.github.io/incubator-systemml/quick-start-guide.html (does not require a Hadoop/Spark installation)
  • Spark cluster: http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html (includes a Jupyter/Zeppelin demo)
Thank You