
Page 1: Transparent GPU Exploitation for Java

Keynote at The Fourth International Symposium on Computing and Networking (CANDAR’16)

Kazuaki Ishizaki

IBM Research – Tokyo

Transparent GPU Exploitation for Java


Page 2: Transparent GPU Exploitation for Java

My Research History

1992-1995 Static compiler for High Performance Fortran

1996-now Just-in-time compiler for IBM Developers Kit for Java
– 1996-2000 Benchmark and GUI applications
– 2000-2010 Web and enterprise applications
– 2012- Analytics applications

2014- Java language with GPUs

2015- Apache Spark (in-memory data processing framework) with GPUs


Page 3: Transparent GPU Exploitation for Java

My Research History

1990-1992 My master's thesis with FPGAs
– Used the XC3000 series with a schematic editor; Verilog and VHDL had just become available

1992-1995 Static compiler for High Performance Fortran

1996-now Just-in-time compiler for IBM Developers Kit for Java


Page 4: Transparent GPU Exploitation for Java


What has Happened in HPC from 1995 to 2016

Program is becoming simpler

Hardware is becoming complicated

Hardware: 1995 – fast scalar processors; 2016 – commodity processors with hardware accelerators

Applications: 1995 – weather, wind, fluid, and physics simulations; 2016 – machine learning and deep learning with big data

Program: 1995 – complicated and hardware-dependent code; 2016 – simple and clean code (e.g., MapReduce on Hadoop)

Users: 1995 – limited to programmers who are well-educated for HPC; 2016 – data scientists who are not familiar with hardware

Hardware example: PowerPC, GPU

Page 5: Transparent GPU Exploitation for Java

Quiz: Can this program be executed in parallel?


void test(float a[], int idx[], int N) {
  for (int i = 0; i < N; i++) {
    a[idx[i]] = i;
  }
}

Page 6: Transparent GPU Exploitation for Java

Can this program be executed in parallel? Answer: It depends on idx[]


void test(float a[], int idx[], int N) {
  for (int i = 0; i < N; i++) {
    a[idx[i]] = i;
  }
}

idx = {0, 1, 2, 3, …}: execute in parallel
idx = {0, 1, 0, 3, …}: execute sequentially

Page 7: Transparent GPU Exploitation for Java

How Can We Know idx[]?

(Word-based) transactional memory

Parallelization analysis at
– Compilation time: not easy
– Runtime: requires much time


void test(float a[], int idx[], int N) {
  for (int i = 0; i < N; i++) {
    a[idx[i]] = i;
  }
}
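To make the cost of a runtime answer concrete, here is a minimal sketch (not from the slides) that checks at run time whether idx[] is duplicate-free and only then runs the loop in parallel; it assumes 0 <= idx[i] < N, and the check itself is an extra O(N) pass over the data:

import java.util.stream.IntStream;

public class IdxCheck {
  // Returns true if idx[0..N-1] contains no duplicates, i.e. every
  // a[idx[i]] = i writes to a distinct element (no output dependence).
  static boolean isParallelizable(int[] idx, int N) {
    boolean[] seen = new boolean[N]; // assumes 0 <= idx[i] < N
    for (int i = 0; i < N; i++) {
      if (seen[idx[i]]) return false; // duplicate index found
      seen[idx[i]] = true;
    }
    return true;
  }

  static void test(float[] a, int[] idx, int N) {
    if (isParallelizable(idx, N)) {
      IntStream.range(0, N).parallel().forEach(i -> a[idx[i]] = i);
    } else {
      for (int i = 0; i < N; i++) a[idx[i]] = i;
    }
  }
}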

Page 8: Transparent GPU Exploitation for Java

What We Want To Ask the Programmer

The programmer usually knows everything


void test(float a[], int idx[], int N) {
  #pragma parallel
  for (int i = 0; i < N; i++) {
    a[idx[i]] = i;
  }
}

idx = {0, 1, 2, 3, …}
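In Java 8 the same intent can be stated with the standard parallel stream API instead of a pragma; a minimal sketch (not from this slide) of how the quiz loop looks when the programmer asserts the parallelism, assuming as above that idx[] has no duplicates:

import java.util.stream.IntStream;

class ParallelTest {
  // The programmer expresses the parallelism by choosing .parallel();
  // the runtime no longer has to prove that idx[] is duplicate-free.
  static void test(float[] a, int[] idx, int N) {
    IntStream.range(0, N).parallel().forEach(i -> a[idx[i]] = i);
  }
}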

Page 9: Transparent GPU Exploitation for Java

What We Do Not Want To Ask the Programmer

What hardware will this program use?

–CPU?

–GPU?

–FPGA?

–ASIC?


Page 10: Transparent GPU Exploitation for Java

My Recent Interest

How a system generates hardware accelerator code from a program with high-level abstraction
– Expected (practical) result: people execute programs without knowing the usage of the hardware accelerator
– Challenge: how to optimize code for a certain hardware accelerator without hardware-specific information
– On-going research: GPU exploitation from Java programs, and GPU exploitation in Apache Spark

Joint work with Akihiro Hayashi (Rice University), Alon Shalev Housfater (IBM Canada), Hiroshi Inoue (IBM Research – Tokyo), Madhusudanan Kandasamy (IBM India), Gita Koblents (IBM Canada), Moriyoshi Ohara (IBM Research – Tokyo), Vivek Sarkar (Rice University), and Jan Wroblewski (intern, IBM Research – Tokyo)

Page 11: Transparent GPU Exploitation for Java

GPU Exploitation from Java Program

Page 12: Transparent GPU Exploitation for Java

Why Java for GPU Programming?

High productivity
– Safety and flexibility
– Good program portability among different machines: "write once, run anywhere"
– One of the most popular programming languages

Hard to use CUDA and OpenCL for non-expert programmers

Many computation-intensive applications in non-HPC areas
– Data analytics and data science (Hadoop, Spark, etc.)
– Security analysis (events in log files)
– Natural language processing (messages in social network systems)

CUDA is a programming language for GPUs offered by NVIDIA

(Photo: https://www.flickr.com/photos/dlato/5530553658)

Page 13: Transparent GPU Exploitation for Java

How We Write a GPU Program

Five steps
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on cores
4. Copy back data from GPU device memory to CPU main memory
5. Free GPU device memory

Diagram: a CPU (dozens of cores per socket, main memory up to 1 TB/socket) and a GPU (thousands of cores, device memory up to 16 GB) are connected by PCIe or NVLink, over which data is copied.

Page 14: Transparent GPU Exploitation for Java

How We Optimize a GPU Program

Exploit faster memory
• Read-only cache
• Shared memory (SMEM)

Reduce data copy between CPU main memory and GPU device memory over PCIe or NVLink

(Memory hierarchy figure from a GTC presentation by NVIDIA)

Five steps
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on cores
4. Copy back data from GPU device memory to CPU main memory
5. Free GPU device memory

Page 15: Transparent GPU Exploitation for Java

Less Code Makes GPU Programming Easier

The current programming model requires programmers to explicitly write operations for
– managing device memories
– copying data between CPU and GPU
– expressing parallelism
– exploiting faster memory

Java 8 enables programmers to just focus on
– expressing parallelism

// CUDA code for CPU
void fooCUDA(float *A, float *B, int N) {
  int sizeN = N * sizeof(float);
  cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);
  cudaMemcpy(d_A, A, sizeN, Host2Device);
  GPU<<<N, 1>>>(d_A, d_B, N);
  cudaMemcpy(B, d_B, sizeN, Device2Host);
  cudaFree(d_B); cudaFree(d_A);
}

// CUDA code for GPU
__global__ void GPU(float* d_A, float* d_B, int N) {
  int i = threadIdx.x;
  if (N <= i) return;
  d_B[i] = __ldg(&d_A[i]) * 2.0; // __ldg() for read-only cache
}

// Java
void fooJava(float A[], float B[], int N) {
  // similar to for (int i = 0; i < N; i++)
  IntStream.range(0, N).parallel().forEach(i -> {
    B[i] = A[i] * 2.0;
  });
}

Page 16: Transparent GPU Exploitation for Java

Goal

Build a Java just-in-time (JIT) compiler to generate high-performance GPU code from a parallel loop construct

Accomplishments
– Implementing four performance optimizations
– Offering performance evaluations on POWER8 with a GPU
– Supporting Java language features (see [PACT2015])
– Predicting performance on CPU and GPU [PPPJ2015]
– Available in IBM Java 8 for ppc64le and x86_64: https://www.ibm.com/developerworks/java/jdk/java8/

Page 17: Transparent GPU Exploitation for Java

Parallel Programming in Java 8

Express parallelism by using the parallel stream API among iterations of a lambda expression (index variable: i)

class Par {
  void foo(float[] a, float[] b, float[] c, int n) {
    java.util.stream.IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] * 2.0;
      c[i] = a[i] * 3.0;
    });
  }
}

The reference implementation of Java 8 can execute this on multiple CPU threads, e.g. over time: i = 0 on thread 0, i = 3 on thread 1, i = 4 on thread 2, i = 1 on thread 3, i = 2 on thread 0.

Page 18: Transparent GPU Exploitation for Java

Portability among Different Hardware

A just-in-time compiler in the IBM Java 8 runtime generates native instructions
– for a target machine, including GPUs, from Java bytecode
– for GPUs, exploiting device-specific capabilities more easily than OpenCL

> javac Par.java
> java Par

Diagram: a Java program (.java) is compiled to Java bytecode (.class, .jar); the IBM Java 8 runtime executes the bytecode on the target machine through its interpreter and just-in-time compiler, and compiles the parallel stream construct (IntStream.range(0, n).parallel().forEach(i -> { ... })) for the GPU.

Page 19: Transparent GPU Exploitation for Java

Overview of Our JIT Compiler

The Java bytecode sequence is divided into two intermediate representation (IR) parts
– Lambda expression: generate GPU code using the NVIDIA tool chain (right-hand side of the figure)
– Others: generate CPU code using the conventional JIT compiler (left-hand side of the figure)

Diagram: Java bytecode containing parallel stream code (IntStream.range(0, n).parallel().forEach(i -> { ... c[i] = a[i] ... })) enters the conventional Java JIT compiler, which detects the parallel stream APIs. The non-lambda part becomes IR for CPUs and goes through the CPU native code generator, producing a CPU binary that manages device memory, copies data, and launches the GPU binary. The lambda part becomes IR for GPUs, passes through additional modules for GPU optimizations, and goes through the GPU native code generator (by NVIDIA), producing an NVIDIA GPU binary for the lambda expression.

Page 20: Transparent GPU Exploitation for Java

Optimizations for GPU in Our JIT Compiler

Optimizing alignment of Java arrays on GPUs
– Reduce the number of memory transactions to GPU global memory

Using the read-only cache
– Reduce the number of memory transactions to GPU global memory

Optimizing data copy between CPU and GPU
– Reduce the amount of data copied

Eliminating redundant exception checks
– Reduce the number of instructions in the GPU binary

Page 21: Transparent GPU Exploitation for Java

Reducing the Number of Memory Transactions to GPU Global Memory

Align the starting address of an array body in GPU global memory with a memory transaction boundary

IntStream.range(0, n).parallel().forEach(i -> {
  ... = a[i] ...; // a[] : float
  ...;
});

Diagram: with a 128-byte memory transaction boundary, the naive alignment strategy places the object header at the boundary, so a[0]-a[31] straddle two transactions (two memory transactions for a[0:31]); with our alignment strategy, the array body itself starts at the boundary, so a[0]-a[31] fit in one transaction (one memory transaction for a[0:31]).
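A minimal sketch (not from the slides) of the padding arithmetic behind this optimization, assuming a hypothetical 16-byte object header and the 128-byte transaction size shown above; the header is shifted so that the array body starts on a transaction boundary:

public class AlignSketch {
  static final int TRANSACTION = 128; // bytes per GPU global-memory transaction
  static final int HEADER = 16;       // assumed size of the Java array header in the device copy

  // Padding (in bytes) to insert before the object header so that the
  // array body (header start + HEADER) lands on a 128-byte boundary.
  static int padding(long deviceBase) {
    long bodyStart = deviceBase + HEADER;
    int misalign = (int) (bodyStart % TRANSACTION);
    return misalign == 0 ? 0 : TRANSACTION - misalign;
  }

  public static void main(String[] args) {
    // With the body aligned, a[0..31] (32 floats = 128 bytes) needs one transaction;
    // without the padding it straddles two.
    System.out.println(padding(0L)); // prints 112, so the body starts at byte 128
  }
}

This is also the reason the generated CPU code on a later slide allocates a.length*sizeof(float)+128 bytes and copies to d_a + align - Jhdrsz.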

Page 22: Transparent GPU Exploitation for Java

Using Read-Only Cache

Prepare two versions of GPU code and execute
1. if a != b and a != c: use the read-only cache for a[i]
2. otherwise: use no read-only cache for a[i]

// original
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
});

// versioned code
void foo(float[] a, float[] b, float[] c, int n) {
  if ((a[] != b[]) && (a[] != c[])) {
    // 1. use the read-only cache (ROa[i])
    IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = ROa[i] * 2.0;
      c[i] = ROa[i] * 3.0;
    });
  } else {
    // 2. execute code w/o a read-only cache
  }
}

// Equivalent to CUDA code
__device__ foo(*a, *b, *c, N) {
  b[i] = __ldg(&a[i]) * 2.0;
  c[i] = __ldg(&a[i]) * 3.0;
}

Page 23: Transparent GPU Exploitation for Java

Optimizing Data Copy between CPU and GPU

Eliminate data copy from the GPU if an array (e.g. a[]) is not updated in the GPU binary [Jablin11][Pai12]

Copy only a read or write set if the array index form is 'i + constant' (the set is contiguous)

sz = (n – 0) * sizeof(float)
cudaMemCopy(&a[0], d_a, sz, H2D); // copy only a read set
// eliminated: b[] and c[] are not read on the GPU
//   cudaMemCopy(&b[0], d_b, sz, H2D);
//   cudaMemCopy(&c[0], d_c, sz, H2D);
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i]...;
  c[i] = a[i]...;
});
// eliminated: a[] is not updated on the GPU
//   cudaMemcpy(a, d_a, sz, D2H);
cudaMemcpy(&b[0], d_b, sz, D2H); // copy only a write set
cudaMemcpy(&c[0], d_c, sz, D2H); // copy only a write set

Page 24: Transparent GPU Exploitation for Java

Optimizing Data Copy between CPU and GPU

Eliminate data copy between CPU and GPU [Pai12]
– if an array (e.g. a[] and b[]) that was accessed on the GPU is not accessed on the CPU

// Data copy for a[] from CPU to GPU
for (int t = 0; t < T; t++) {
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    b[idx] = a[...];
  });
  // No data copy for b[] between GPU and CPU
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    a[idx] = b[...];
  });
  // No data copy for a[] between GPU and CPU
}
// Data copy for a[] and b[] from GPU to CPU

Page 25: Transparent GPU Exploitation for Java

Eliminating Redundant Exception Checks

Generate GPU code without exception checks by using
– loop versioning [Artigas00], which guarantees a safe region by using pre-condition checks on the CPU

// original
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i]...;
  c[i] = a[i]...;
});

// versioned code on the CPU
if (// check cond. for NullPointerException
    a != null && b != null && c != null &&
    // check cond. for ArrayIndexOutOfBoundsException
    n <= a.length && n <= b.length && n <= c.length) {
  ... <<<...>>> GPUbinary(...) ...
} else {
  // execute this construct on the CPU
  // to produce an exception
  // under the original exception semantics
}

// GPU binary for the safe region: no exception check is required
{
  i = ...;
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
}

Page 26: Transparent GPU Exploitation for Java

Automatically Optimized for CPU and GPU

CPU code
– handles GPU device memory management and data copying
– checks whether the optimized CPU and GPU code can be executed

GPU code is optimized by
– using the read-only cache
– eliminating exception checks

// CPU code
if (a != null && b != null && c != null &&
    n <= a.length && n <= b.length && n <= c.length &&
    (a[] != b[]) && (a[] != c[])) {
  cudaMalloc(d_a, a.length*sizeof(float)+128);
  if (b != a) cudaMalloc(d_b, b.length*sizeof(float)+128);
  if (c != a && c != b) cudaMalloc(d_c, c.length*sizeof(float)+128);

  int sz = (n – 0) * sizeof(float), szh = sz + Jhdrsz;
  cudaMemCopy(a, d_a + align - Jhdrsz, szh, H2D);

  <<...>> GPU(d_a, d_b, d_c, n) // launch GPU

  cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);
  cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);
  cudaFree(d_a); if (b != a) cudaFree(d_b); if (c != a && c != b) cudaFree(d_c);
} else {
  // execute CPU binary
}

// GPU code
__global__ void GPU(float *a, float *b, float *c, int n) {
  // no exception checks
  i = ...
  b[i] = ROa[i] * 2.0;
  c[i] = ROa[i] * 3.0;
}

Page 27: Transparent GPU Exploitation for Java

Benchmark Programs

Prepare sequential and parallel stream API versions in Java

Name / Summary / Data size / Type
– Blackscholes: financial application that calculates the price of put and call options / 4,194,304 virtual options / double
– MM: a standard dense matrix multiplication, C = A.B / 1,024 x 1,024 / double
– Crypt: cryptographic application [Java Grande Benchmarks] / N = 50,000,000 / byte
– Series: the first N Fourier coefficients of the function [Java Grande Benchmarks] / N = 1,000,000 / double
– SpMM: sparse matrix multiplication [Java Grande Benchmarks] / N = 500,000 / double
– MRIQ: 3D image benchmark for MRI [Parboil benchmarks] / 64x64x64 / float
– Gemm: matrix multiplication, C = α.A.B + β.C [PolyBench] / 1,024 x 1,024 / int
– Gesummv: scalar, vector, and matrix multiplication [PolyBench] / 4,096 x 4,096 / int

Page 28: Transparent GPU Exploitation for Java

Performance Improvements of GPU Version Over Sequential and Parallel CPU Versions

Achieve 127.9x on geomean and 2067.7x for Series over 1 CPU thread

Achieve 3.3x on geomean and 32.8x for Series over 160 CPU threads

Degrade performance for SpMM and Gesummv against 160 CPU threads

Experimental environment: two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256 GB memory, one NVIDIA Kepler K40m GPU at 876 MHz with 12 GB global memory (ECC off), Ubuntu 14.10, CUDA 5.5, and a modified IBM Java 8 runtime for PowerPC

Page 29: Transparent GPU Exploitation for Java

Performance Impact of Each Optimization

Optimizations are applied cumulatively (breakdown of the execution time)
– BASE: our four optimizations disabled
– LV: loop versioning
– DC: data copy optimization
– ALIGN: alignment optimization
– ROC: read-only cache

Observations
– MM: LV/DC/ALIGN/ROC are very effective
– BlackScholes: DC is effective
– MRIQ: LV/ALIGN/ROC are effective
– SpMM and Gesummv: data transfer time for large arrays is dominant

Page 30: Transparent GPU Exploitation for Java

Performance Comparison with Hand-Coded CUDA

Achieve 0.83x on geomean over CUDA

Chart: speedup relative to CUDA (higher is better): BlackScholes 0.85, MM 0.45, Crypt 1.51, Series 0.92, SpMM 0.74, MRIQ 0.11, Gemm 1.19, Gesummv 3.47

Reasons for the differences
– Crypt, Gemm, and Gesummv: our usage of the read-only cache
– BlackScholes: usage of larger CUDA thread blocks (1,024 vs. 128 threads per block)
– SpMM: overhead of exception checks
– MRIQ: missing '-use-fast-math' compile option
– MM: no usage of shared memory with loop tiling

Page 31: Transparent GPU Exploitation for Java

GPU Version is Slower than the Parallel CPU Version for Some Benchmarks

Can we choose an appropriate device (CPU or GPU) to avoid performance degradation?
– We want to make sure to achieve equal or better performance

Page 32: Transparent GPU Exploitation for Java

Machine-Learning-Based Performance Heuristics

Construct a binary prediction model offline by supervised machine learning with support vector machines (SVMs)
– Features:
  – Loop range
  – Dynamic number of instructions (memory accesses, arithmetic operations, …)
  – Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])
  – Data transfer size (CPU to GPU, GPU to CPU)

Diagram: for each application and data set (App A with data1, App A with data2, App B with data3), features are extracted from the bytecode and fed to LIBSVM to build the prediction model; the Java runtime then uses the model to choose between CPU and GPU.
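A minimal sketch (not from the slides) of how such a binary CPU-vs-GPU model can be trained and queried with LIBSVM's Java API; the feature values, the labels, and the toNodes() helper are made up for illustration:

import libsvm.*;

public class DeviceSelector {
  // Feature vector: loop range, #instructions, #array accesses, transfer size (bytes)
  static svm_node[] toNodes(double[] features) {
    svm_node[] nodes = new svm_node[features.length];
    for (int i = 0; i < features.length; i++) {
      nodes[i] = new svm_node();
      nodes[i].index = i + 1;        // LIBSVM feature indices start at 1
      nodes[i].value = features[i];
    }
    return nodes;
  }

  public static void main(String[] args) {
    // Offline training: label +1 means "GPU was faster", -1 means "CPU was faster"
    double[][] samples = { {1_000_000, 4, 3, 8_000_000}, {1_000, 2, 1, 8_000} };
    double[] labels = { +1, -1 };

    svm_problem prob = new svm_problem();
    prob.l = samples.length;
    prob.y = labels;
    prob.x = new svm_node[samples.length][];
    for (int i = 0; i < samples.length; i++) prob.x[i] = toNodes(samples[i]);

    svm_parameter param = new svm_parameter();
    param.svm_type = svm_parameter.C_SVC;
    param.kernel_type = svm_parameter.RBF;
    param.gamma = 0.5; param.C = 1; param.cache_size = 100; param.eps = 1e-3;

    svm_model model = svm.svm_train(prob, param);

    // At run time: extract the same features for a new loop and ask the model
    double prediction = svm.svm_predict(model, toNodes(new double[]{500_000, 4, 3, 4_000_000}));
    System.out.println(prediction > 0 ? "run on GPU" : "run on CPU");
  }
}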

Page 33: Transparent GPU Exploitation for Java

Most Predictions are Correct

Used 291 cases to build the model

Succeeded in predicting the cases of performance degradation on GPU

Failed to predict BlackScholes

Chart: prediction results (the values 1.8 -> 1.0, 0.8 -> 1.0, and 0.4 -> 1.0 appear in the original figure)

Page 34: Transparent GPU Exploitation for Java

Related Work

Our research enables memory and communication optimizations with machine-learning-based device selection

Comparison (work: language / exception support / JIT compiler / how to write the GPU kernel / data copy optimization / GPU memory optimization / device selection)
– JCUDA: Java / no / no / CUDA / manual / manual / GPU only
– JaBEE: Java / no / yes / override run method / no / no / GPU only
– Aparapi: Java / no / yes / override run method or lambda / no / no / static
– Hadoop-CL: Java / no / yes / override map/reduce method / no / no / static
– Rootbeer: Java / no / yes / override run method / not described / no / not described
– [PPPJ09]: Java / yes / yes / Java for-loop / not described / no / dynamic with regression
– HJ-OpenCL: Habanero-Java / yes / yes / forall constructs / yes / no / static
– Our work: Java / yes / yes / standard parallel stream API / yes / read-only cache and alignment / dynamic with machine learning

Page 35: Transparent GPU Exploitation for Java

Future Work

Exploiting shared memory
– Not easy to predict performance
– Requires non-lightweight analysis to identify reuse

Supporting additional Java operations

Page 36: Transparent GPU Exploitation for Java

GPU Exploitation in Apache Spark

Page 37: Transparent GPU Exploitation for Java

What is Apache Spark?

A framework for distributed computing that transforms distributed immutable in-memory structures using a set of parallel operations, e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures: RDD (Resilient Distributed Dataset), DataFrame, Dataset
– Scala is the primary language for programming on Spark

Provides domain-specific libraries

Diagram: the Spark runtime (written in Java and Scala) runs on the Java Virtual Machine and offers libraries on top of it: Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), and MLlib (machine learning); a driver sends tasks to executors, which process data from a data source (HDFS, DB, file, etc.) and return results.

Open source: http://spark.apache.org/ (latest version at the time of the talk: 2.0.3, released in 2016/11)

Page 38: Transparent GPU Exploitation for Java

How a Program Works on Apache Spark

Parallel operations can be executed among partitions

Within a partition, data can be processed sequentially

case class Pt(x: Int, y: Int)
val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))
val cnt: Int = ds2.reduce((p1, p2) => p1.x + p2.x)

Diagram: ds1 has two partitions, {(1,5), (2,6)} and {(3,7), (4,8)}; the map p.x+1, p.y*2 produces the ds2 partitions {(2,10), (3,12)} and {(4,14), (5,16)}; the reduce p1.x + p2.x sums the x values within each partition (2+3 = 5, 4+5 = 9) and then across partitions (5+9 = 14), so cnt = 14.

Page 39: Transparent GPU Exploitation for Java

How We Can Run a Program Faster on a GPU

Assign many parallel computations to cores

Make memory accesses coalesced
– A column-oriented layout results in better performance
– [Che2011] reports about a 3x performance improvement of GPU kernel execution for kmeans with a column-oriented layout over a row-oriented layout

Diagram: for Pt(x: Int, y: Int) with the values Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8), a row-oriented layout stores 1 5 2 6 3 7 4 8 while a column-oriented layout stores 1 2 3 4 5 6 7 8; assuming 4 consecutive data elements can be coalesced by the GPU hardware, loading four Pt.x and four Pt.y takes 2 memory accesses to GPU device memory with the column-oriented layout vs. 4 with the row-oriented layout.
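A minimal sketch (not from the slides) of what the column-oriented representation amounts to for this Pt example; the Pt class and the conversion helper are illustrative only:

public class ColumnarSketch {
  static final class Pt { final int x, y; Pt(int x, int y) { this.x = x; this.y = y; } }

  // Row-oriented: an array of Pt objects. Column-oriented: one primitive array per field,
  // so that threads reading only x (or only y) touch contiguous memory and can coalesce.
  static int[][] toColumns(Pt[] rows) {
    int[] xs = new int[rows.length], ys = new int[rows.length];
    for (int i = 0; i < rows.length; i++) { xs[i] = rows[i].x; ys[i] = rows[i].y; }
    return new int[][] { xs, ys };
  }

  public static void main(String[] args) {
    Pt[] rows = { new Pt(1, 5), new Pt(2, 6), new Pt(3, 7), new Pt(4, 8) };
    int[][] cols = toColumns(rows);
    // cols[0] = {1, 2, 3, 4} (all x values), cols[1] = {5, 6, 7, 8} (all y values)
    System.out.println(java.util.Arrays.toString(cols[0]) + " " + java.util.Arrays.toString(cols[1]));
  }
}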

Page 40: Transparent GPU Exploitation for Java

Idea to Transparently Exploit GPUs on Apache Spark

Generate GPU code from a set of parallel operations
– Already achieved in the work described in the previous part

Physically put the distributed immutable in-memory structures (e.g. Dataset) in a column-oriented representation
– Dataset is statically typed, but the physical layout is not specified in the program

Page 41: Transparent GPU Exploitation for Java


Overview of GPU Exploitation on Apache Spark

User's Spark Program

case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.reduce((p1, p2) => p1.x + p2.x)

Diagram: each partition of ds1 (stored column-oriented, x = {1, 2}, y = {5, 6} and x = {3, 4}, y = {7, 8}) is transferred from the CPU to the GPU; a GPU kernel built from native code applies +1 to x and *2 to y, and the resulting ds2 columns (x = {2, 3}, y = {10, 12} and x = {4, 5}, y = {14, 16}) are transferred back to the CPU.

Page 42: Transparent GPU Exploitation for Java


Overview of GPU Exploitation on Apache Spark

Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on the GPU

Transparent
– Map the parallelism in the program into GPU native code

User's Spark Program

case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.reduce((p1, p2) => p1.x + p2.x)

Diagram: as on the previous slide, but a GPU manager keeps the Dataset in columnar storage (the columns are contiguous in memory address order) and drives the GPU native code; the GPU can exploit parallelism both among partitions in a Dataset and within a partition of a Dataset.

Page 43: Transparent GPU Exploitation for Java

Exploit Parallelism Between GPU Kernels

Overlap data transfers and computations among different GPU kernels on a GPU

Diagram: a timeline in which three Spark workers for the GPU each pipeline their work, so that data transfers from CPU to GPU, GPU kernel executions, and data transfers from GPU to CPU for different kernels overlap in time.
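The slides do not show code for this step; as a purely conceptual sketch (not the actual implementation, which overlaps transfers and kernels on the GPU itself), the same pipelining idea can be illustrated with plain Java executors, where hypothetical transfer and compute stages for successive partitions overlap in time:

import java.util.concurrent.*;

// Conceptual sketch only: three pipeline stages run in separate executors so that,
// while one partition is being computed, the next one can already be transferred.
class PipelineSketch {
  static final ExecutorService toGpu = Executors.newSingleThreadExecutor();
  static final ExecutorService compute = Executors.newSingleThreadExecutor();
  static final ExecutorService toCpu = Executors.newSingleThreadExecutor();

  static CompletableFuture<int[]> process(int[] partition) {
    return CompletableFuture
        .supplyAsync(() -> partition, toGpu)                // stage 1: stand-in for CPU -> GPU transfer
        .thenApplyAsync(PipelineSketch::runKernel, compute) // stage 2: stand-in for the GPU kernel
        .thenApplyAsync(result -> result, toCpu);           // stage 3: stand-in for GPU -> CPU transfer
  }

  static int[] runKernel(int[] data) { // stand-in for the real GPU kernel
    int[] out = new int[data.length];
    for (int i = 0; i < data.length; i++) out[i] = data[i] + 1;
    return out;
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<int[]> p1 = process(new int[]{1, 2});
    CompletableFuture<int[]> p2 = process(new int[]{3, 4}); // overlaps with p1's later stages
    System.out.println(java.util.Arrays.toString(p1.get()) + " " + java.util.Arrays.toString(p2.get()));
    toGpu.shutdown(); compute.shutdown(); toCpu.shutdown();
  }
}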

Page 44: Transparent GPU Exploitation for Java


How We Write a Program and What is Executed

Write a program using a relational operation for DataFrame or a lambda expression for Dataset.

Catalyst performs optimization and code generation for the program.

The Java bytecode corresponding to the generated Java code is executed.

// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)

// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())

data = Seq(Pt(1, 5), Pt(2, 6))

Diagram: the frontend API (DataFrame since v1.3, Dataset since v1.6) is translated by Catalyst into Java code and then into Java bytecode; the backend computation runs over row-oriented data (1 5, 2 6) in the Java heap.

"Catalyst" is the code name for the optimizer and code generator in Apache Spark

Page 45: Transparent GPU Exploitation for Java


How a Program is Executed on a GPU

For DataFrame and Dataset, the enhanced Catalyst generates Java code optimized for the GPU.

A just-in-time compiler in the Java virtual machine can then generate GPU code.

// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)

// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())

data = Seq(Pt(1, 5), Pt(2, 6))

Diagram: the same frontend API (DataFrame since v1.3, Dataset since v1.6) is translated by the enhanced Catalyst into optimized Java code, from which GPU code is generated automatically; the backend computation runs over column-oriented data in GPU device memory.

Page 46: Transparent GPU Exploitation for Java

Pseudo Java Code by Current Catalyst

Catalyst performs an optimization that merges multiple parallel operations (selectExpr() and agg(sum())) into one loop

DataFrame program for Spark:

val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())

Generated pseudo Java code (corresponds to selectExpr() and the local sum()):

int sum = 0;
while (rowIterator.hasNext()) { // iterator-based access
  Row row = rowIterator.next(); // for df1
  int x = row.getInteger(0);
  // selectExpr(x + 1)
  int x_new = x + 1; // for df2
  sum += x_new;
}

Diagram: df1 holds the row-oriented values x = {-1, 0, 1}, which the generated code reads sequentially; x_new = {0, 1, 2} and sum = 3.

Page 47: Transparent GPU Exploitation for Java

Pseudo Java Code by Enhanced Catalyst

Get column0 from the column-oriented storage

The for-loop can be executed in a parallel reduction manner

Generated pseudo Java code:

Column column0 = df1.getColumn(0); // df1
int sum = 0;
for (int i = 0; i < column0.numRows; i++) {
  int x = column0.getInteger(i);
  // selectExpr(x + 1)
  int x_new = x + 1; // for df2
  sum += x_new;
}

Diagram: df1 holds the column-oriented values x = {-1, 0, 1}; x_new = {0, 1, 2} and sum = 3.
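Because each iteration only reads column0 and adds into sum, the loop above can also be written as a parallel reduction with the Java 8 stream API; a minimal sketch (not from the slides), with a small Column interface standing in for the columnar accessor used in the pseudo code:

import java.util.stream.IntStream;

// Minimal sketch: 'Column' stands in for the columnar accessor in the pseudo code above.
class ParallelReductionSketch {
  interface Column { int numRows(); int getInteger(int i); }

  static int sumPlusOne(Column column0) {
    return IntStream.range(0, column0.numRows())
        .parallel()
        .map(i -> column0.getInteger(i) + 1) // selectExpr(x + 1)
        .sum();                              // agg(sum()) as a parallel reduction
  }

  public static void main(String[] args) {
    int[] x = { -1, 0, 1 }; // same data as df1 on this slide
    Column col = new Column() {
      public int numRows() { return x.length; }
      public int getInteger(int i) { return x[i]; }
    };
    System.out.println(sumPlusOne(col)); // prints 3
  }
}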

Page 48: Transparent GPU Exploitation for Java

Generate GPU Code Transparently from Spark Program

Copy column-oriented storage into GPU

Execute add and reduction in one GPU kernel

Generated CPU code:

Column column0 = df1.getColumn(0);
int nRows = column0.numRows;
cudaMalloc(&d_c0, nRows*4);
cudaMemcpy(d_c0, column0, nRows*4, H2D);
int sum = 0;
cudaMalloc(&d_sum, 4);
cudaMemcpy(d_sum, &sum, 4, H2D);
<<...>> GPU(d_c0, d_sum, nRows) // launch GPU
cudaMemcpy(&sum, d_sum, 4, D2H);
cudaFree(d_sum); cudaFree(d_c0);

DataFrame program for Spark:

val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())

// GPU code
__global__ void GPU(int *d_c0, int *d_sum, long size) {
  long ix = … // 0, 1, 2
  if (size <= ix) return;
  int x = d_c0[ix];
  int x_new = x + 1;
  reduction(d_sum, x_new);
}

Diagram: d_c0 holds x = {-1, 0, 1}; the GPU kernel computes x_new = {0, 1, 2} in parallel and reduces them into d_sum = 3.

Page 49: Transparent GPU Exploitation for Java

Related Work

Spark With Accelerated Tasks [Grossman2016]
– Generates GPU code from the lambda function passed to map() on an RDD
– Very similar to the enhanced Catalyst in using columnar storage to transparently exploit GPUs; however, it works only for RDD with map()

val inputRDD = cl(sc.objectFile[Int](hdfsPath))
val doubledRDD = inputRDD.map(i => 2 * i)

GPU Columnar (proposed by Kiran Lonikar)
– Generates GPU code from a program using the select() method on a DataFrame
– Very similar to the enhanced Catalyst in using columnar storage to transparently exploit GPUs

Page 50: Transparent GPU Exploitation for Java

Takeaway

How a system generates hardware accelerator code from a program with high-level abstraction
– Most programmers are not ninja programmers
– A compiler can transform a program for hardware features, but does not want to do trial and error at runtime
– How can compilers and hardware build a good relationship?

(Not covered today) What can we do for deep learning?
– Current deep learning frameworks use GPUs by calling libraries (e.g. cuDNN/cuRNN by NVIDIA)
– How will systems support the rapid evolution of deep learning? New neural network structures are still being proposed