
Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs
Gloria Kim (Rice University), Akihiro Hayashi (Rice University), Vivek Sarkar (Georgia Tech)


Page 1: Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs

Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs
Gloria Kim (Rice University), Akihiro Hayashi (Rice University), Vivek Sarkar (Georgia Tech)

1

Page 2: Java For HPC?

Java For HPC?

Dual-socket Intel Xeon E5-2666 v3, 18 cores, 2.9 GHz, 60 GiB DRAM

2

4K x 4K Matrix Multiply:

Implementation | GFLOPS | Absolute speedup
Python | 0.005 | 1
Java | 0.058 | 11
C | 0.253 | 47
Parallel loops | 1.969 | 366
Parallel divide and conquer | 36.168 | 6,724
+ vectorization | 124.945 | 23,230
+ AVX intrinsics | 335.217 | 62,323
Strassen | 361.681 | 67,243

Credits: Charles Leiserson, Ken Kennedy Award Lecture @ Rice University, Karthik Murthy @ PACT2015

Any ways to accelerate Java programs?

Page 3: Java 8 Parallel Streams APIs

Java 8 Parallel Streams APIs
- Explicit parallelism with lambda expressions

IntStream.range(0, N)
         .parallel()
         .forEach(i -> <lambda>);
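As a concrete illustration (a minimal, self-contained sketch; the array names and problem size are chosen only for this example), the lambda below performs an element-wise vector add over a parallel index range:

import java.util.stream.IntStream;

public class VecAddExample {
    public static void main(String[] args) {
        final int N = 1 << 20;
        final double[] a = new double[N], b = new double[N], c = new double[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        // Explicit parallelism: iterations of the lambda may run on different worker threads
        IntStream.range(0, N)
                 .parallel()
                 .forEach(i -> a[i] = b[i] + c[i]);

        System.out.println(a[N - 1]);  // prints 3.0 * (N - 1)
    }
}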

3

Page 4: Explicit Parallelism with Java

Explicit Parallelism with Java
- High-level parallel programming with Java offers opportunities for:
  - preserving portability: low-level parallel/accelerator programming is not required
  - enabling the compiler to perform parallel-aware optimizations and code generation

[Diagram: Java 8 programs (SW) use the Java 8 Parallel Stream API, which targets multi-core CPUs, many-core GPUs, and FPGAs (HW).]

4

Page 5: Challenges for GPU Code Generation

Challenges for GPU Code Generation

5

IntStream.range(0, N).parallel().forEach(i -> <lambda>);

The IntStream call above is a standard Java API call for parallelism; generating GPU code for it raises three challenges:

Challenge 1: Supporting Java features on GPUs (exception semantics, virtual method calls)
Challenge 2: Accelerating Java programs on GPUs (kernel optimizations, data transfer (DT) optimizations)
Challenge 3: CPU/GPU selection (selecting the faster device among CPUs and GPUs)

Credits : Checklist by Phil Laver from the Noun Project, Rocket by Luis Prado from the Noun Project Rocket by Maxwell Painter Karpylev from the Noun Project, Running by Dillon Arloff from the Noun Project


Page 6: Related Work: Java + GPU

Related Work: Java + GPU

6

System | Language | JIT | GPU Kernel | Device Selection
JCUDA | Java | - | CUDA | GPU only
Lime | Lime | yes | Override map/reduce | Static
Firepile | Scala | yes | reduce | Static
JaBEE | Java | yes | Override run | GPU only
Aparapi | Java | yes | map | Static
Hadoop-CL | Java | yes | Override map/reduce | Static
RootBeer | Java | yes | Override run | Not described
HJ-OpenCL | HJ | - | forall / lambda | Static
PPPJ09 (auto) | Java | yes | For-loop | Dynamic with regression
Our Work | Java | yes | Parallel Stream | Dynamic with machine learning

None of these approaches considers the Java 8 Parallel Stream APIs together with dynamic device selection based on machine learning.

Page 7: JIT Compilation for GPU

JIT Compilation for GPU
- IBM Java 8 Compiler
  - Built on top of the production version of the IBM Java 8 runtime environment (J9 VM)

7

[Diagram: a method is interpreted on the JVM for its first N invocations; after the Nth invocation, native code is generated, targeting multi-core CPUs and many-core GPUs, and the compiled code is used from the (N+1)th invocation onward.]

Page 8: The JIT Compilation Flow

The JIT Compilation Flow

8

[Compilation flow: Java bytecode is translated to our IR, and parallel streams are identified. Our JIT compiler then produces the CPU binary through the existing optimizations and target machine code generation; for the GPU path, it applies GPU optimizations and emits NVVM IR, which NVIDIA's tool chain (libNVVM) compiles into the GPU binary executed by the GPU runtime.]

Page 9: Performance Evaluations

Performance Evaluations

9

[Figure: speedup relative to sequential Java (log scale) per benchmark, comparing 160 worker threads (fork/join) and the GPU; higher is better. CPU: IBM POWER8 @ 3.69 GHz, GPU: NVIDIA Tesla K40m.]

Page 10: Runtime CPU/GPU Selection

Runtime CPU/GPU Selection

10

[Diagram: after a method's Nth invocation, native code is generated; at the (N+1)th invocation the runtime must choose between the multi-core CPUs and the many-core GPUs. Open question: which one is faster?]

Page 11: Our Approach: ML-based Performance Heuristics

Our approach: ML-based Performance Heuristics
- A binary prediction model is constructed with supervised machine learning techniques

11

bytecode"App A

Prediction Model

JIT compilerfeature 1 data 1

bytecode"App Adata 2

bytecode"App Bdata 3

feature 2

feature 3

ML

Training run with JIT Compiler Offline Model Construction

feature extraction

feature extraction

feature extraction

Java Runtime

CPU GPU
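A minimal sketch of how such labeled training samples could be gathered offline, assuming placeholder methods (extractFeatures, runOnCPU, runOnGPU) that stand in for the actual JIT compiler and runtime hooks: each kernel is timed on both devices and labeled with the faster one.

import java.util.ArrayList;
import java.util.List;

public class TrainingDataCollector {
    // One labeled sample: a feature vector plus the faster device.
    record Sample(double[] features, String fasterDevice) {}

    interface Kernel {
        double[] extractFeatures();  // placeholder: features taken from the compiler's IR
        void runOnCPU();             // placeholder: 160 worker threads (fork/join)
        void runOnGPU();             // placeholder: the generated GPU kernel
    }

    static Sample collect(Kernel k) {
        long t0 = System.nanoTime(); k.runOnCPU(); long cpuTime = System.nanoTime() - t0;
        long t1 = System.nanoTime(); k.runOnGPU(); long gpuTime = System.nanoTime() - t1;
        return new Sample(k.extractFeatures(), cpuTime <= gpuTime ? "CPU" : "GPU");
    }

    static List<Sample> collectAll(List<Kernel> kernels) {
        List<Sample> samples = new ArrayList<>();
        for (Kernel k : kernels) samples.add(collect(k));
        return samples;  // fed to the ML algorithm to build the binary prediction model
    }
}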

Page 12: Features of Program

Features of program
- Loop range (parallel loop size)
- The dynamic number of instructions in the IR:
  - memory accesses
  - arithmetic operations
  - math methods
  - branch instructions
  - other types of instructions

12

Page 13: Features of Program (Cont'd)

Features of program (Cont'd)
- The dynamic number of array accesses:
  - coalesced access (a[i])
  - offset access (a[i+c])
  - stride access (a[c*i])
  - other access (a[b[i]])

13

[Figure: measured memory bandwidth (GB/s) as a function of the offset or stride size c, for offset accesses array[i+c] and stride accesses array[c*i].]
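To make the four access categories concrete, here is a small illustrative sketch (array names and sizes are invented for this example) showing how each pattern can appear inside a parallel lambda:

import java.util.stream.IntStream;

public class AccessPatterns {
    public static void main(String[] args) {
        final int n = 1024, c = 4;
        double[] a = new double[n], out = new double[n];
        double[] offs = new double[n + c], str = new double[c * n];
        int[] idx = new int[n];  // indirection array for the "other" access pattern

        IntStream.range(0, n).parallel().forEach(i -> {
            double coalesced = a[i];        // coalesced access: a[i]
            double offset    = offs[i + c]; // offset access:    a[i + c]
            double stride    = str[c * i];  // stride access:    a[c * i]
            double other     = a[idx[i]];   // other access:     a[b[i]]
            out[i] = coalesced + offset + stride + other;
        });
        System.out.println(out[0]);
    }
}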

Page 14: An Example of a Feature Vector

An example of a feature vector

14

"features" : { "range": 256, "ILs" : { "Memory": 9, "Arithmetic": 7, "Math": 0, "Branch": 1, "Other": 1 }, "Array Accesses" : { "Coalesced": 3, "Offset": 0, "Stride": 0, "Random": 0}, }

IntStream.range(0, N)
         .parallel()
         .forEach(i -> { a[i] = b[i] + c[i]; });
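A minimal sketch of how this feature vector might be represented in code and flattened into the 10-element numeric array a classifier consumes; the class and field names here are hypothetical, not the actual implementation:

public class KernelFeatures {
    long range;                                    // parallel loop size
    int memory, arithmetic, math, branch, other;   // dynamic IR instruction counts
    int coalesced, offset, stride, random;         // dynamic array-access counts

    // Flatten into the 10-feature vector used for CPU/GPU prediction.
    double[] toVector() {
        return new double[] {
            range, memory, arithmetic, math, branch, other,
            coalesced, offset, stride, random
        };
    }

    public static void main(String[] args) {
        KernelFeatures f = new KernelFeatures();
        f.range = 256; f.memory = 9; f.arithmetic = 7; f.branch = 1; f.other = 1;
        f.coalesced = 3;  // matches the a[i] = b[i] + c[i] example above
        System.out.println(java.util.Arrays.toString(f.toVector()));
    }
}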

Page 15: Offline Prediction Model Construction

Offline Prediction Model Construction
- Obtained 291 samples by running 11 applications with different data sets
  - An additional 41 samples were obtained by running 3 additional applications
- The choice is either the GPU or 160 worker threads on the CPU

15

bytecode"App A

Prediction Model

feature 1 data 1

bytecode"App Adata 2

bytecode"App Bdata 3

feature 2

feature 3

ML Algorithms

feature extraction

feature extraction

feature extraction

Java Runtime

Training run with JIT Compiler Offline Model Construction

Page 16: Platform

Platform
- CPU: IBM POWER8 @ 3.69 GHz
  - 20 cores
  - 8 SMT threads per core = up to 160 threads
  - 256 GB of RAM
- GPU: NVIDIA Tesla K40m
  - 12 GB of global memory

16

Page 17: Applications

Applications

17

Application | Source | Field | Max Size | Data Type
BlackScholes | - | Finance | 4,194,304 | double
Crypt | JGF | Cryptography | Size C (N=50M) | byte
SpMM | JGF | Numerical Computing | Size C (N=500K) | double
MRIQ | Parboil | Medical | Large (64^3) | float
Gemm | Polybench | Numerical Computing | 2K x 2K | int
Gesummv | Polybench | Numerical Computing | 2K x 2K | int
Doitgen | Polybench | Numerical Computing | 256x256x256 | int
Jacobi-1D | Polybench | Numerical Computing | N=4M, T=1 | int
Matrix Multiplication | - | - | 2K x 2K | double
Matrix Transpose | - | - | 2K x 2K | double
VecAdd | - | - | 4M | double

Page 18: Explored ML Algorithms

Explored ML Algorithms
- Support Vector Machines (LIBSVM)
- Decision Trees (Weka 3)
- Logistic Regression (Weka 3)
- Multilayer Perceptron (Weka 3)
- k-Nearest Neighbors (Weka 3)
- Decision Stumps (Weka 3)
- Naïve Bayes (Weka 3)
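For the Weka-based classifiers, model construction and evaluation could look roughly like the sketch below; the ARFF file name is an assumption, while J48, Evaluation, and DataSource are standard Weka 3 classes:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        // "features.arff" is a hypothetical file holding the feature vectors labeled CPU/GPU.
        Instances data = DataSource.read("features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 5, new Random(1));  // 5-fold cross-validation
        System.out.printf("5-fold CV accuracy: %.2f%%%n", eval.pctCorrect());

        tree.buildClassifier(data);  // final model built on all training data
        System.out.println(tree);    // J48 prints a human-readable decision tree
    }
}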

18

Page 19: Top 3 ML Algorithms, Full Set of Features (10 Features)

Top 3 ML Algorithms, Full Set of Features (10 Features)

19

Accuracy (%), higher is better:

Algorithm | 5-fold CV | Original training data | Unknown testing data
LIBSVM | 98.28 | 99.66 | 98.28
J48 Tree | 98.28 | 98.97 | 98.68
Logistic Regression | 97.60 | 98.97 | 92.68

Page 20: Further Explorations and Analyses by Feature Subsetting

Further Explorations and Analyses by Feature Subsetting
- Research questions:
  - Are the 10 features the best combination (in terms of accuracy and runtime prediction overheads)?
  - Which features are more important?

20

Page 21: Feature Subsetting for Further Analyses: 10 Algorithms x 1024 Subsets

Feature Subsetting for further analyses: 10 Algorithms x 1024 Subsets

21

1. Loop range (parallel loop size)
2. # of memory accesses
3. # of arithmetic operations
4. # of math methods
5. # of branch instructions
6. # of other types of instructions
7. # of coalesced memory accesses
8. # of offset memory accesses
9. # of stride memory accesses
10. # of other types of array accesses

[Diagram: each of the 1024 feature subsets drawn from the 10 features is evaluated with each ML algorithm (SVM, decision trees, logistic regression, multilayer perceptron, kNN, decision stumps, naïve Bayes), producing an accuracy per (algorithm, subset) pair.]
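A minimal sketch of the subset enumeration itself: each of the 2^10 = 1024 bitmasks selects a feature subset, which is then evaluated with every classifier. The evaluate method is a placeholder for the per-algorithm accuracy measurement, not an actual API.

public class FeatureSubsets {
    static final String[] FEATURES = {
        "range", "memory", "arithmetic", "math", "branch",
        "other", "coalesced", "offset", "stride", "otherArray"
    };

    // Placeholder: train and score one classifier using only the selected feature columns.
    static double evaluate(String algorithm, boolean[] selected) { return 0.0; }

    public static void main(String[] args) {
        String[] algorithms = { "SVM", "DecisionTree", "LogisticRegression",
                                "MultilayerPerceptron", "kNN", "DecisionStump", "NaiveBayes" };
        for (int mask = 0; mask < (1 << FEATURES.length); mask++) {  // all 1024 subsets
            boolean[] selected = new boolean[FEATURES.length];
            for (int f = 0; f < FEATURES.length; f++) selected[f] = (mask & (1 << f)) != 0;
            for (String alg : algorithms) {
                double accuracy = evaluate(alg, selected);
                // record (alg, mask, accuracy) for the subsetting analysis
            }
        }
    }
}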

Page 22: The Impact of Subsetting (Full Set vs. the Best Subset)

The impact of Subsetting (full set vs. the best subset)

22

Accuracy (%) from 5-fold cross-validation, higher is better:

Algorithm | Best subset of features | Full set of features
LIBSVM | 99.656 | 98.282
Logistic Regression | 98.625 | 97.595
J48 Tree | 98.282 | 98.282

Feature subsetting can yield higher accuracies.

Page 23: The Impact of Subsetting (Full Set vs. the Best Subset)

The impact of Subsetting (full set vs. the best subset)

23

Accuracy (%), higher is better:

Algorithm (best subset) | 5-fold CV | Original training data | Unknown testing data
LIBSVM (3 features) | 99.66 | 99.66 | 99.66
Logistic Regression (4 features) | 98.63 | 98.97 | 82.93
J48 Tree (2 features) | 98.28 | 98.28 | 92.68

Subsets with only a few features achieve high accuracy.

Page 24: An Example of the Subsetting Analyses

An example of Subsetting analyses

24

- With LIBSVM, 99 subsets achieved the highest accuracy.

Feature | # of models with this feature | % of models with this feature
range | 99 | 100.0%
stride | 96 | 97.0%
arithmetic | 65 | 65.7%
Other 2 | 56 | 56.6%
memory | 56 | 56.6%
offset | 55 | 55.6%
branch | 54 | 54.5%
math | 46 | 46.5%
Other 1 | 43 | 43.4%

Page 25: Runtime Prediction Overheads

Runtime Prediction Overheads

Model | LIBSVM | Logistic Regression | J48 Decision Trees
Full set (10 features) | 2.278 usec | 0.158 usec | 0.020 usec
Best subset | 2.107 usec (3 features) | 0.106 usec (4 features) | 0.020 usec (2 features)

25

- Subsetting can reduce prediction overheads
- However, the prediction overhead is very small compared to the kernel execution time; based on our analysis, kernels usually take a few milliseconds

Page 26: Lessons Learned

Lessons Learned
- ML-based CPU/GPU selection is a promising way to choose the faster device
- LIBSVM, Logistic Regression, and the J48 Decision Tree are the machine learning techniques that produce models with the best accuracy
- Range, coalesced and other types of array accesses, and arithmetic instructions are particularly important features

26

Page 27: Lessons Learned (Cont'd)

Lessons Learned (Cont'd)
- While LIBSVM (non-linear classification) shows excellent prediction accuracy, its runtime prediction overhead is relatively large compared to the other algorithms; however, these overheads are negligible in general
- The J48 Decision Tree shows accuracy comparable to LIBSVM, and its output is more human-readable and easier to fine-tune (see the sketch below)
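To illustrate why a decision tree is easy to read and hand-tune, a learned tree of this kind can be transcribed almost directly into code. The thresholds and the two chosen features below are invented purely for illustration; they are not the paper's actual J48 model.

public class DeviceSelector {
    /** Returns true if the GPU is predicted to be the faster device for this kernel. */
    static boolean chooseGPU(long range, int coalescedAccesses) {
        if (range <= 32_768) {
            return false;   // small parallel loops: launch overhead dominates -> CPU
        } else if (coalescedAccesses > 0) {
            return true;    // large loops with coalesced accesses -> GPU
        } else {
            return false;   // large but uncoalesced -> CPU (160 worker threads)
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseGPU(4_194_304, 3) ? "GPU" : "CPU");
    }
}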

27

Page 28: Conclusions

Conclusions
- ML-based CPU/GPU selection is a promising way to choose the faster device

Future Work
- Evaluate end-to-end performance improvements
- Explore the possibility of applying our technique to OpenMP, OpenACC, and OpenCL
- Perform experiments on recent versions of GPUs

28

Page 29: Further Readings

Further Readings
- Code generation and optimizations: "Compiling and Optimizing Java 8 Programs for GPU Execution." Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents, Vivek Sarkar. 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2015.
- Performance heuristics (SVM-based): "Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection." Akihiro Hayashi, Kazuaki Ishizaki, Gita Koblents, Vivek Sarkar. 12th International Conference on the Principles and Practice of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ), September 2015.

29

Page 30: Backup Slides

Backup Slides

30

Page 31: Performance Breakdown

Performance Breakdown

31

[Figure: execution time normalized to ALL (lower is better), broken down into H2D, D2H, and kernel time, for BlackScholes, Crypt, SpMM, Series, MRIQ, MM, Gemm, and Gesummv under the configurations BASE, +=LV, +=DT, +=ALIGN, and ALL (+=ROC).]

Page 32: Precisions and Recalls with Cross-Validation

Precisions and Recalls with cross-validation

32

Model | Precision (CPU) | Recall (CPU) | Precision (GPU) | Recall (GPU)
Range | 79.0% | 100% | 0% | 0%
+=nIRs | 97.8% | 99.1% | 96.5% | 91.8%
+=dIRs | 98.7% | 100% | 100% | 95.0%
+=Arrays | 98.7% | 100% | 100% | 95.0%
ALL | 96.7% | 100% | 100% | 86.9%

Higher is better. All prediction models except Range rarely make a bad decision.

Page 33: Java Features on GPUs

Java Features on GPUs
- Exceptions
  - Explicit exception checking on GPUs: ArrayIndexOutOfBoundsException, NullPointerException, ArithmeticException (division by zero only)
  - Further optimizations: loop versioning to eliminate redundant exception checking on GPUs, based on [Artigas+, ICS'00]
- Virtual method calls
  - Direct devirtualization or guarded devirtualization [Ishizaki+, OOPSLA'00]

33

Challenge 1: Supporting Java Features on GPUs

Page 34: Loop Versioning Example

Loop Versioning Example

34

IntStream.range(0, N)
         .parallel()
         .forEach(i -> { Array[i] = i / N; });

// Speculative checking on the CPU
if (Array != null && N != 0 && N <= Array.length) {
    // Safe GPU execution: no per-iteration exception checks are needed
} else {
    // Fallback execution with normal Java exception semantics
}

The loop body Array[i] = i / N may cause an ArrayIndexOutOfBoundsException, an ArithmeticException (division by zero), or a NullPointerException; the speculative check above rules out all three before the exception-check-free GPU version is executed.
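Putting the pieces together, a hedged sketch of the loop-versioning pattern: a single speculative check on the CPU guards a GPU-eligible version of the loop that needs no per-iteration exception checks, while the fallback keeps full Java exception semantics. The launchOnGPU method is a placeholder for the JIT-generated kernel, not an actual API.

import java.util.stream.IntStream;

public class LoopVersioning {
    // Placeholder for the JIT-compiled GPU kernel with exception checks removed.
    static void launchOnGPU(double[] array, int n) { /* generated code */ }

    static void compute(double[] array, int n) {
        // Speculative check hoisted out of the loop (loop versioning).
        if (array != null && n != 0 && n <= array.length) {
            launchOnGPU(array, n);  // safe: no exceptions can occur inside the kernel
        } else {
            // Fallback version with normal Java exception semantics.
            IntStream.range(0, n).parallel().forEach(i -> array[i] = i / n);
        }
    }
}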

Challenge 1: Supporting Java Features on GPUs

Page 35: Optimizing GPU Execution

Optimizing GPU Execution
- Kernel optimizations
  - Optimizing the alignment of Java arrays on GPUs
  - Utilizing the read-only cache in recent GPUs
- Data transfer optimizations
  - Redundant data transfer elimination
  - Partial transfer if an array subscript is affine

35

Challenge 2: Accelerating Java programs on GPUs

Page 36: The Alignment Optimization Example

The Alignment Optimization Example

36

[Diagram: with the naive alignment strategy, the object header shifts the array body so that a 128-byte block such as a[0:31] spans two 128-byte memory transactions; with our alignment strategy, the array body starts on a 128-byte boundary, so a[0:31] needs only one memory transaction.]
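The arithmetic behind the figure can be sketched as follows, assuming 128-byte memory transactions, 4-byte array elements, and an illustrative 16-byte object header (the exact header size is an assumption):

public class AlignmentTransactions {
    // Number of 128-byte memory transactions touched by [baseAddress, baseAddress + bytes).
    static long transactions(long baseAddress, long bytes) {
        long first = baseAddress / 128;
        long last  = (baseAddress + bytes - 1) / 128;
        return last - first + 1;
    }

    public static void main(String[] args) {
        long bytes = 32 * 4;  // a[0..31] with 4-byte elements = 128 bytes
        // Naive layout: the array body starts right after the object header.
        System.out.println(transactions(16, bytes));   // 2 transactions
        // Aligned layout: the array body starts on a 128-byte boundary.
        System.out.println(transactions(128, bytes));  // 1 transaction
    }
}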

Challenge 2: Accelerating Java programs on GPUs

Page 37: Performance Evaluations

Performance Evaluations

37

[Figure: the same performance evaluation as on Page 9: speedup relative to sequential Java (log scale) per benchmark, comparing 160 worker threads (fork/join) and the GPU; higher is better. CPU: IBM POWER8 @ 3.69 GHz, GPU: NVIDIA Tesla K40m.]

Challenge 2: Accelerating Java programs on GPUs

Page 38: How to Evaluate Prediction Models: 5-fold Cross-Validation

How to evaluate prediction models: 5-fold cross-validation
- Overfitting problem: the prediction model may be tailored to the eleven applications if training data = testing data
- To avoid the overfitting problem: calculate the accuracy of the prediction model using 5-fold cross-validation

38

[Diagram: the 291 training samples are split into Subsets 1-5; in each fold, a prediction model is trained on four subsets (e.g., Subsets 2-5) and tested on the remaining one (e.g., Subset 1), giving an accuracy per fold (X%, Y%, ...).]
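A minimal sketch of the 5-fold protocol itself, independent of any ML library; trainAndScore is a placeholder for building a model on the training folds and measuring its accuracy on the held-out fold:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FiveFoldCV {
    // Placeholder: build a model on 'train' and return its accuracy on 'test'.
    static double trainAndScore(List<double[]> train, List<double[]> test) { return 0.0; }

    static double crossValidate(List<double[]> samples, int folds) {
        List<double[]> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new java.util.Random(1));
        double sum = 0.0;
        for (int f = 0; f < folds; f++) {
            List<double[]> train = new ArrayList<>(), test = new ArrayList<>();
            for (int i = 0; i < shuffled.size(); i++) {
                (i % folds == f ? test : train).add(shuffled.get(i));
            }
            sum += trainAndScore(train, test);  // accuracy on the held-out fold f
        }
        return sum / folds;  // averaged cross-validation accuracy
    }
}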