Keynote at The Fourth International Symposium on Computing and Networking (CANDAR’16)
Kazuaki Ishizaki
IBM Research – Tokyo
Transparent GPU Exploitation for Java
My Research History
1992-1995 Static compiler for High Performance Fortran
1996-now Just-in-time compiler for IBM Developers Kit for Java
– 1996-2000 Benchmark and GUI applications
– 2000-2010 Web and Enterprise applications
– 2012- Analytics applications
2014- Java language with GPUs
2015- Apache Spark (in-memory data processing framework) with GPUs
2 Transparent GPU Exploitation for Java / Kazuaki Ishizaki
My Research History
1990-1992 Master's thesis with FPGA
– Used XC3000 series with schematic editor
– Verilog and VHDL had just become available
1992-1995 Static compiler for High Performance Fortran
1996-now Just-in-time compiler for IBM Developers Kit for Java
Making Hardware Accelerator Easier to Use
What Has Happened in HPC from 1995 to 2016
Programs are becoming simpler
Hardware is becoming more complicated
Hardware
– 1995: Fast scalar processors
– 2016: Commodity processors with hardware accelerators
Applications
– 1995: Weather, wind, fluid, and physics simulations
– 2016: Machine learning and deep learning with big data
Program
– 1995: Complicated and hardware-dependent code
– 2016: Simple and clean code (e.g. MapReduce by Hadoop)
Users
– 1995: Limited to programmers who are well educated in HPC
– 2016: Data scientists who are not familiar with hardware
Hardware example (figure): PowerPC, GPU
Quiz: Can this program be executed in parallel?
void test(float a[], int idx[], int N) {
  for (int i = 0; i < N; i++) {
    a[idx[i]] = i;
  }
}
Answer: It depends on idx[]
Can this program be executed in parallel?
– idx = {0, 1, 2, 3, …}: execute in parallel
– idx = {0, 1, 0, 3, …}: execute sequentially
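The dependence on idx[] can be made concrete with a runtime check. If idx[] has no duplicate entries, no two iterations write the same element, so the loop can run in parallel. The following is a minimal sketch, not the method used in this work. The names IndexCheck and hasDistinctIndices are hypothetical. The check itself scans all of idx[], which illustrates why such analysis costs time at runtime.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class IndexCheck {
    // Hypothetical helper: true when the first n entries of idx are valid
    // indices into an array of length len and no index appears twice,
    // so the writes a[idx[i]] never collide.
    static boolean hasDistinctIndices(int[] idx, int n, int len) {
        boolean[] seen = new boolean[len];
        for (int i = 0; i < n; i++) {
            int k = idx[i];
            if (k < 0 || k >= len || seen[k]) {
                return false;
            }
            seen[k] = true;
        }
        return true;
    }

    // Produces the same result as the sequential loop from the quiz; it runs
    // in parallel only when the check shows the iterations are independent.
    static void test(float[] a, int[] idx, int n) {
        if (hasDistinctIndices(idx, n, a.length)) {
            IntStream.range(0, n).parallel().forEach(i -> a[idx[i]] = i);
        } else {
            for (int i = 0; i < n; i++) {
                a[idx[i]] = i;
            }
        }
    }

    public static void main(String[] args) {
        float[] a = new float[4];
        test(a, new int[] {0, 1, 0, 3}, 4); // duplicate index 0: sequential path
        System.out.println(Arrays.toString(a)); // [2.0, 1.0, 0.0, 3.0]
    }
}
```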
How Can We Know idx[]?
(Word-based) transactional memory
Parallelization analysis at
– Compilation time: not easy
– Runtime: requires much time
What We Want To Ask Programmer
Programmer usually knows everything
void test(float a[], int idx[], int N) {
  #pragma parallel
  for (int i = 0; i < N; i++) {
    a[idx[i]] = i;
  }
}
idx = {0, 1, 2, 3, …}
What We Do Not Want To Ask Programmer
What hardware will this program use?
– CPU?
– GPU?
– FPGA?
– ASIC?
My Recent Interest
How a system generates hardware accelerator code from a program with high-level abstraction
– Expected (practical) result: people execute a program without knowing that a hardware accelerator is used
– Challenge: how to optimize code for a certain hardware accelerator without specific information
– On-going research: GPU exploitation from Java programs, and GPU exploitation in Apache Spark
Joint work with Akihiro Hayashi *, Alon Shalev Housfater -, Hiroshi Inoue +, Madhusudanan Kandasamy , Gita Koblents -, Moriyoshi Ohara +, Vivek Sarkar *, and Jan Wroblewski (intern) +
+ IBM Research – Tokyo, - IBM Canada, IBM India, * Rice University
GPU Exploitation from Java Program
Why Java for GPU Programming?
High productivity
– Safety and flexibility
– Good program portability among different machines: “write once, run anywhere”
– One of the most popular programming languages
Hard to use CUDA and OpenCL for non-expert programmers
Many computation-intensive applications in non-HPC areas
– Data analytics and data science (Hadoop, Spark, etc.)
– Security analysis (events in log files)
– Natural language processing (messages in social network systems)
Image from https://www.flickr.com/photos/dlato/5530553658
CUDA is a programming language for GPU offered by NVIDIA
How We Write GPU Program
Five steps
1. Allocate GPU device memory
2. Copy data on CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on cores
4. Copy back data on GPU device memory to CPU main memory
5. Free GPU device memory
Figure: CPU (dozens of cores/socket, main memory up to 1TB/socket) and GPU (thousands of cores, device memory up to 16GB), with data copy over PCIe or NVLink
How We Optimize GPU Program
Exploit faster memory
• Read-only cache (read only)
• Shared memory (SMEM)
Reduce data copy
Five steps
1. Allocate GPU device memory
2. Copy data on CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on cores
4. Copy back data on GPU device memory to CPU main memory
5. Free GPU device memory
Figure (from GTC presentation by NVIDIA): CPU (dozens of cores/socket, main memory up to 1TB/socket) and GPU (thousands of cores, device memory up to 16GB), with data copy over PCIe or NVLink
Less Code Makes GPU Programming Easy
The current programming model requires programmers to explicitly write operations for
– managing device memories
– copying data between CPU and GPU
– expressing parallelism
– exploiting faster memory
Java 8 enables programmers to just focus on
– expressing parallelism
// code for GPU
__global__ void GPU(float* d_A, float* d_B, int N) {
  int i = blockIdx.x; // <<<N, 1>>> launches N blocks of one thread each
  if (N <= i) return;
  d_B[i] = __ldg(&d_A[i]) * 2.0f; // __ldg() for read-only cache
}
void fooCUDA(float *A, float *B, int N) {
  float *d_A, *d_B;
  int sizeN = N * sizeof(float);
  cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);
  cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
  GPU<<<N, 1>>>(d_A, d_B, N);
  cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
  cudaFree(d_B); cudaFree(d_A);
}
void fooJava(float A[], float B[], int N) {
  // similar to for (int i = 0; i < N; i++)
  IntStream.range(0, N).parallel().forEach(i -> {
    B[i] = A[i] * 2.0f;
  });
}
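For reference, the Java version can be tried on any standard Java 8 runtime, where the parallel stream runs on CPU threads. The sketch below is a self-contained variant of fooJava; the class name FooJava and the test data are illustrative assumptions. Whether the lambda is offloaded to a GPU depends on the runtime and is not shown here.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class FooJava {
    // Doubles each element of a into b. Iterations are independent, so the
    // parallel stream may run them in any order on any thread.
    static void fooJava(float[] a, float[] b, int n) {
        IntStream.range(0, n).parallel().forEach(i -> b[i] = a[i] * 2.0f);
    }

    public static void main(String[] args) {
        float[] a = {1.0f, 2.0f, 3.0f, 4.0f};
        float[] b = new float[a.length];
        fooJava(a, b, a.length);
        System.out.println(Arrays.toString(b)); // [2.0, 4.0, 6.0, 8.0]
    }
}
```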
Goal
Build a Java just-in-time (JIT) compiler to generate high performance GPU code from a parallel loop construct
Accomplishments
– Implementing four performance optimizations
– Offering performance evaluations on POWER8 with a GPU
– Supporting Java language features (see [PACT2015])
– Predicting performance on CPU and GPU [PPPJ2015]
Available in IBM Java 8 ppc64le and x86_64
– https://www.ibm.com/developerworks/java/jdk/java8/
Parallel Programming in Java 8
Express parallelism among iterations of a lambda expression (index variable: i) by using the parallel stream API
class Par {
  void foo(float[] a, float[] b, float[] c, int n) {
    java.util.stream.IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] * 2.0f;
      c[i] = a[i] * 3.0f;
    });
  }
}
The reference implementation of Java 8 can execute this on multiple CPU threads
Figure (timeline): i = 0 on thread 0, i = 3 on thread 1, i = 4 on thread 2, i = 1 on thread 3, i = 2 on thread 0
Portability among Different Hardware
A just-in-time compiler in the IBM Java 8 runtime generates native instructions
– for a target machine, including GPUs, from Java bytecode
– for GPUs, exploiting device-specific capabilities more easily than OpenCL
Figure: a Java program (.java) is compiled with "> javac Par.java" into Java bytecode (.class, .jar) and run with "> java Par" on the IBM Java 8 runtime (interpreter and just-in-time compiler) on the target machine. The part compiled for GPU is
IntStream.range(0, n).parallel().forEach(i -> { ...
});
Overview of Our JIT Compiler
Java bytecode sequence is divided into two intermediate representation (IR) parts
– Lambda expression: generate GPU code using NVIDIA tool chain (right hand side)
– Others: generate CPU code using conventional JIT compiler (left hand side)
// Parallel stream code
IntStream.range(0, n).parallel()
  .forEach(i -> { ...c[i] = a[i]...});
Figure: Java bytecode enters the conventional Java JIT compiler, where parallel stream API detection splits it into IR for CPUs and IR for GPUs (... c[i] = a[i] ...). The CPU native code generator emits the CPU binary for managing device memory, copying data, and launching the GPU binary. The additional modules for GPU (GPU optimizations and the GPU native code generator by NVIDIA) emit the NVIDIA GPU binary for the lambda expression.
Optimizations for GPU in Our JIT Compiler
Optimizing alignment of Java arrays on GPUs
– Reduce # of memory transactions to a GPU global memory
Using read-only cache
– Reduce # of memory transactions to a GPU global memory
Optimizing data copy between CPU and GPU
– Reduce amount of data copy
Eliminating redundant exception checks
– Reduce # of instructions in GPU binary
Reducing # of memory transactions to GPU global memory
Aligning the starting address of an array body in GPU global memory with a memory transaction boundary
IntStream.range(0,n).parallel().forEach(i->{
  ...= a[i]...; // a[] : float
  ...;
});
Figure: memory addresses 0, 128, 256, and 384 mark 128-byte memory transaction boundaries
– Naive alignment strategy: the object header starts at a boundary, so a[0]-a[31], a[32]-a[63], and a[64]-a[95] each straddle a boundary: two memory transactions for a[0:31]
– Our alignment strategy: the array body starts at a boundary: one memory transaction for a[0:31]
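The transaction counts in the figure follow from simple arithmetic. The sketch below counts how many 128-byte segments a run of bytes touches. The class name Transactions and the 16-byte object header are assumptions for illustration only; the actual header size depends on the Java runtime.

```java
final class Transactions {
    static final int BOUNDARY = 128; // bytes per memory transaction (from the slide)

    // Number of BOUNDARY-byte segments touched by `bytes` bytes starting at `start`.
    static long count(long start, long bytes) {
        long first = start / BOUNDARY;
        long last = (start + bytes - 1) / BOUNDARY;
        return last - first + 1;
    }

    public static void main(String[] args) {
        long chunk = 32L * Float.BYTES; // a[0:31] as float = 128 bytes
        long header = 16;               // hypothetical object header size
        // Naive: array body starts right after the header at address 16.
        System.out.println(count(header, chunk)); // 2
        // Aligned: array body starts on a 128-byte boundary.
        System.out.println(count(128, chunk));    // 1
    }
}
```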
Using Read-Only Cache
Prepare two versions of GPU code and execute version 1 if a != b and a != c
1. Use read-only cache for a[i]
2. Use no read-only cache for a[i]
// original
IntStream.range(0,n).parallel().forEach(i->{
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
});
void foo(float[] a, float[] b, float[] c, int n) {
  if ((a[] != b[]) && (a[] != c[])) {
    // 1. execute code with a read-only cache (RO)
    IntStream.range(0, n).parallel().forEach( i -> {
      b[i] = ROa[i] * 2.0;
      c[i] = ROa[i] * 3.0;
    });
  } else {
    // 2. execute code w/o a read-only cache
  }
}
// Equivalent to CUDA code
__device__ foo(*a, *b, *c, N) {
  b[i] = __ldg(&a[i]) * 2.0;
  c[i] = __ldg(&a[i]) * 3.0;
}
Optimizing Data Copy between CPU and GPU
Eliminate data copy from GPU if an array (e.g. a[]) is not updated in GPU binary [Jablin11][Pai12]
Copy only a read or write set if an array index form is ‘i + constant’ (the set is contiguous)
sz = (n - 0) * sizeof(float)
cudaMemcpy(&a[0], d_a, sz, H2D); // copy only a read set
cudaMemcpy(&b[0], d_b, sz, H2D);
cudaMemcpy(&c[0], d_c, sz, H2D);
IntStream.range(0, n).parallel().forEach( i -> {
  b[i] = a[i]...;
  c[i] = a[i]...;
});
cudaMemcpy(a, d_a, sz, D2H); // can be eliminated: a[] is not updated on GPU
cudaMemcpy(&b[0], d_b, sz, D2H); // copy only a write set
cudaMemcpy(&c[0], d_c, sz, D2H); // copy only a write set
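When the index form is ‘i + constant’, the read or write set is a contiguous range, so its byte offset and length can be computed directly from the loop bounds. The sketch below shows that arithmetic only. CopyRange and bytesToCopy are hypothetical names, not part of the JIT compiler described here.

```java
final class CopyRange {
    // For accesses a[i + c] with i in [lo, hi), the touched elements are the
    // contiguous range [lo + c, hi + c). Returns {byteOffset, byteLength}.
    static long[] bytesToCopy(int lo, int hi, int c, int elemSize) {
        long firstElem = (long) lo + c;
        long count = Math.max(0L, (long) hi - lo);
        return new long[] {firstElem * elemSize, count * elemSize};
    }

    public static void main(String[] args) {
        // Loop over i in [0, 100) reading a[i + 4] from a float array:
        // only 400 bytes starting at byte offset 16 need to be copied,
        // regardless of the total array length.
        long[] r = bytesToCopy(0, 100, 4, Float.BYTES);
        System.out.println(r[0] + " " + r[1]); // 16 400
    }
}
```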
Optimizing Data Copy between CPU and GPU
Eliminate data copy between CPU and GPU [Pai12]
– if an array (e.g., a[] and b[]), which was accessed on GPU, is not accessed on CPU
// Data copy for a[] from CPU to GPU
for (int t = 0; t < T; t++) {
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    b[idx] = a[...];
  });
  // No data copy for b[] between GPU and CPU
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    a[idx] = b[...];
  });
  // No data copy for a[] between GPU and CPU
}
// Data copy for a[] and b[] from GPU to CPU
Eliminating Redundant Exception Checks
Generate GPU code without exception checks by using
– loop versioning [Artigas00], which guarantees a safe region by using pre-condition checks on CPU
IntStream.range(0,n).parallel().forEach(i->{
  b[i] = a[i]...;
  c[i] = a[i]...;
});
if (
  // check cond. for NullPointerException
  a != null && b != null && c != null &&
  // check cond. for ArrayIndexOutOfBoundsException
  a.length >= n && b.length >= n && c.length >= n) {
  ...
  <<<...>>> GPUbinary(...)
  ...
} else {
  // execute this construct on CPU
  // to produce an exception
  // under the original exception semantics
}
GPU binary for {
  // safe region:
  // no exception check is required
  i = ...;
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
}
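The structure of loop versioning can be sketched in plain Java. The pre-condition checks select a version that can never raise an exception; otherwise a fallback runs and raises the exception at the offending iteration. This is a CPU-only illustration under assumed names (LoopVersioning, foo). A real JVM still performs its own checks on the fast path, whereas the GPU binary described above omits them.

```java
import java.util.stream.IntStream;

final class LoopVersioning {
    // Returns true when the check-free (safe region) version was selected.
    static boolean foo(float[] a, float[] b, float[] c, int n) {
        if (a != null && b != null && c != null
                && a.length >= n && b.length >= n && c.length >= n) {
            // Safe region: every index in [0, n) is valid for a, b, and c.
            IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
            });
            return true;
        }
        // Fallback: sequential loop that raises NullPointerException or
        // ArrayIndexOutOfBoundsException at the first offending access.
        for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        }
        return false;
    }

    public static void main(String[] args) {
        float[] a = {1.0f, 2.0f};
        System.out.println(foo(a, new float[2], new float[2], 2)); // true
    }
}
```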
Automatically Optimized for CPU and GPU
CPU code
– handles GPU device memory management and data copying
– checks whether optimized CPU and GPU code can be executed
GPU code is optimized
– Using read-only cache
– Eliminating exception checks
// CPU
if (a != null && b != null && c != null &&
    a.length >= n && b.length >= n && c.length >= n &&
    (a[] != b[]) && (a[] != c[])) {
  cudaMalloc(d_a, a.length*sizeof(float)+128);
  if (b!=a) cudaMalloc(d_b, b.length*sizeof(float)+128);
  if (c!=a && c!=b) cudaMalloc(d_c, c.length*sizeof(float)+128);
  int sz = (n - 0) * sizeof(float), szh = sz + Jhdrsz;
  cudaMemcpy(a, d_a + align - Jhdrsz, szh, H2D);
  <<...>> GPU(d_a, d_b, d_c, n) // launch GPU
  cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);
  cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);
  cudaFree(d_a); if (b!=a) cudaFree(d_b); if (c!=a && c!=b) cudaFree(d_c);
} else {
  // execute CPU binary
}
// GPU
__global__ void GPU(float *a, float *b, float *c, int n) {
  // no exception checks
  i = ...
  b[i] = ROa[i] * 2.0;
  c[i] = ROa[i] * 3.0;
}
Benchmark Programs
Prepare sequential and parallel stream API versions in Java
Name | Summary | Data size | Type
BlackScholes | Financial application that calculates the price of put and call options | 4,194,304 virtual options | double
MM | A standard dense matrix multiplication: C = A.B | 1,024 x 1,024 | double
Crypt | Cryptographic application [Java Grande Benchmarks] | N = 50,000,000 | byte
Series | The first N Fourier coefficients of the function [Java Grande Benchmarks] | N = 1,000,000 | double
SpMM | Sparse matrix multiplication [Java Grande Benchmarks] | N = 500,000 | double
MRIQ | 3D image benchmark for MRI [Parboil benchmarks] | 64x64x64 | float
Gemm | Matrix multiplication: C = α.A.B + β.C [PolyBench] | 1,024 x 1,024 | int
Gesummv | Scalar, vector, and matrix multiplication [PolyBench] | 4,096 x 4,096 | int
Performance Improvements of GPU Version Over Sequential and Parallel CPU Versions
Achieve 127.9x on geomean and 2067.7x for Series over 1 CPU thread
Achieve 3.3x on geomean and 32.8x for Series over 160 CPU threads
Degrade performance for SpMM and Gesummv against 160 CPU threads
Environment
– Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory
– One NVIDIA Kepler K40m GPU at 876 MHz with 12-GB global memory (ECC off)
– Ubuntu 14.10, CUDA 5.5
– Modified IBM Java 8 runtime for PowerPC
Performance Impact of Each Optimization
MM: LV/DC/ALIGN/ROC are very effective
BlackScholes: DC is effective
MRIQ: LV/ALIGN/ROC are effective
SpMM and Gesummv: data transfer time for large arrays is dominant
Figure: breakdown of the execution time, with optimizations applied cumulatively
– BASE: our four optimizations disabled
– LV: loop versioning
– DC: data copy
– ALIGN: alignment optimization
– ROC: read-only cache
Performance Comparison with Hand-Coded CUDA
Achieve 0.83x on geomean over CUDA
– Crypt, Gemm, and Gesummv: usage of a read-only cache
– BlackScholes: CUDA version uses more threads per block (1024 vs. 128)
– SpMM: overhead of exception checks
– MRIQ: lack of the ‘-use-fast-math’ compile option
– MM: lack of usage of shared memory with loop tiling
Figure: speedup relative to CUDA (higher is better)
BlackScholes 0.85, MM 0.45, Crypt 1.51, Series 0.92, SpMM 0.74, MRIQ 0.11, Gemm 1.19, Gesummv 3.47
GPU Version is Slower than Parallel CPU Version
Can we choose an appropriate device (CPU or GPU) to avoid performance degradation?
– Want to make sure to achieve equal or better performance
Machine-learning-based Performance Heuristics
Construct a binary prediction model offline by supervised machine learning with support vector machines (SVMs)
– Features
  Loop range
  Dynamic number of instructions (memory access, arithmetic operation, …)
  Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])
  Data transfer size (CPU to GPU, GPU to CPU)
Figure: bytecode of each application with each data set (App A with data1, App A with data2, App B with data3) goes through feature extraction (feature 1, feature 2, feature 3). LIBSVM builds the prediction model, which the Java runtime uses to choose CPU or GPU
Most Predictions are Correct
Use 291 cases to build model
Succeeded in predicting cases of performance degradations on GPU
Failed to predict BlackScholes
Figure: prediction results (annotations: 1.8->1.0, 0.8->1.0, 0.4->1.0)
Related Work
Our research enables memory and communication optimizations with machine-learning-based device selection
Work | Language | Exception support | JIT compiler | How to write GPU kernel | Data copy optimization | GPU memory optimization | Device selection
JCUDA | Java | × | × | CUDA | Manual | Manual | GPU only
JaBEE | Java | × | √ | Override run method | × | × | GPU only
Aparapi | Java | × | √ | Override run method/Lambda | × | × | Static
Hadoop-CL | Java | × | √ | Override map/reduce method | × | × | Static
Rootbeer | Java | × | √ | Override run method | Not described | × | Not described
[PPPJ09] | Java | √ | √ | Java for-loop | Not described | × | Dynamic with regression
HJ-OpenCL | Habanero-Java | √ | √ | Forall constructs | √ | × | Static
Our work | Java | √ | √ | Standard parallel stream API | √ | ROCache / alignment | Dynamic with machine learning
Future Work
Exploiting shared memory
– Not easy to predict performance
– Requires non-lightweight analysis for identifying reuse
Supporting additional Java operations
GPU Exploitation in Apache Spark
What is Apache Spark?
Framework that processes distributed computing by transforming distributed immutable in-memory structures using a set of parallel operations, e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures: RDD (Resilient Distributed Dataset), DataFrame, Dataset
– Scala is the primary language for programming on Spark
Provides domain specific libraries
Open source: http://spark.apache.org/
Latest version is 2.0.2 released in 2016/11
Figure: Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), and MLlib (machine learning) sit on the Spark Runtime (written in Java and Scala), which runs on the Java Virtual Machine. The Driver sends tasks to Executors and receives results; each Executor holds Data loaded from a Data Source (HDFS, DB, File, etc.)
How Program Works on Apache Spark
Parallel operations can be executed among partitions
In a partition, data can be processed sequentially
case class Pt(x: Int, y: Int)
val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))
val cnt: Int = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2) // sum of x; reduce() on Dataset[Pt] itself would have to return a Pt
Figure: ds1 holds Pt(1, 5), Pt(2, 6) in one partition and Pt(3, 7), Pt(4, 8) in the other. map (p.x+1, p.y*2) produces ds2 with Pt(2, 10), Pt(3, 12) and Pt(4, 14), Pt(5, 16). reduce (p1.x + p2.x) sums x within each partition (2 + 3 = 5 and 4 + 5 = 9) and then across partitions (5 + 9 = 14 = cnt)
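The two-level execution in the figure (sequential inside a partition, parallel among partitions) can be imitated in plain Java without Spark. The sketch below is an analogy only and does not use any Spark API. The class PartitionedReduce and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

final class PartitionedReduce {
    static final class Pt {
        final int x, y;
        Pt(int x, int y) { this.x = x; this.y = y; }
    }

    // Within one partition, elements are processed sequentially:
    // map each Pt to Pt(x + 1, y * 2), then sum the new x values.
    static int reducePartition(List<Pt> partition) {
        int sum = 0;
        for (Pt p : partition) {
            Pt q = new Pt(p.x + 1, p.y * 2); // the map() step
            sum += q.x;                      // the reduce() step
        }
        return sum;
    }

    // Partitions are independent, so they can be processed in parallel
    // and their partial sums combined afterwards.
    static int reduceAll(List<List<Pt>> partitions) {
        return partitions.parallelStream()
                .mapToInt(PartitionedReduce::reducePartition)
                .sum();
    }

    public static void main(String[] args) {
        List<Pt> p1 = Arrays.asList(new Pt(1, 5), new Pt(2, 6));
        List<Pt> p2 = Arrays.asList(new Pt(3, 7), new Pt(4, 8));
        System.out.println(reducePartition(p1));              // 5
        System.out.println(reducePartition(p2));              // 9
        System.out.println(reduceAll(Arrays.asList(p1, p2))); // 14
    }
}
```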
How We Can Run Program Faster on GPU
Assign many parallel computations into cores
Make memory accesses coalesce
– Column-oriented layout results in better performance
– [Che2011] reports about 3x performance improvement of GPU kernel execution of kmeans with column-oriented layout over row-oriented layout
Figure: Pt(x: Int, y: Int) with Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8), where four cores load four Pt.x and then four Pt.y. Assumption: 4 consecutive data elements can be coalesced using GPU hardware
– Row-oriented layout (x1 y1 x2 y2 x3 y3 x4 y4 = 1 5 2 6 3 7 4 8): 4 memory accesses to GPU device memory
– Column-oriented layout (x1 x2 x3 x4 y1 y2 y3 y4 = 1 2 3 4 5 6 7 8): 2 memory accesses to GPU device memory
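The "2 vs. 4 memory accesses" count can be reproduced with a small model. Under the slide's assumption that 4 consecutive elements are served by one coalesced access, the sketch below counts the distinct 4-element segments touched when four threads each load one field. The class name Coalescing and the flat element positions are illustrative assumptions, not a model of real GPU hardware.

```java
import java.util.HashSet;
import java.util.Set;

final class Coalescing {
    static final int WIDTH = 4; // elements served by one coalesced access (slide's assumption)

    // Counts distinct WIDTH-element segments touched when `threads` threads
    // each load the element at position base + t * stride.
    static int accesses(int base, int stride, int threads) {
        Set<Integer> segments = new HashSet<>();
        for (int t = 0; t < threads; t++) {
            segments.add((base + t * stride) / WIDTH);
        }
        return segments.size();
    }

    public static void main(String[] args) {
        // Row-oriented: x1 y1 x2 y2 x3 y3 x4 y4 -> x at 0,2,4,6 and y at 1,3,5,7
        int row = accesses(0, 2, 4) + accesses(1, 2, 4);
        // Column-oriented: x1 x2 x3 x4 y1 y2 y3 y4 -> x at 0..3 and y at 4..7
        int col = accesses(0, 1, 4) + accesses(4, 1, 4);
        System.out.println(row + " vs " + col); // 4 vs 2
    }
}
```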
Idea to Transparently Exploit GPUs on Apache Spark
Generate GPU code from a set of parallel operations
– Already achieved in our other research
Physically put distributed immutable in-memory structures (e.g. Dataset) in column-oriented representation
– Dataset is statically typed, but physical layout is not specified in program
Overview of GPU Exploitation on Apache Spark
User’s Spark Program
case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
Figure: each partition of ds1 (columns x = 1, 2, 3, 4 and y = 5, 6, 7, 8) is transferred from CPU to GPU. A GPU kernel running native code computes x + 1 and y * 2. The resulting ds2 columns (x = 2, 3, 4, 5 and y = 10, 12, 14, 16) are transferred back to CPU
Overview of GPU Exploitation on Apache Spark
Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on GPU
Transparent
– Map parallelism in program into GPU native code
GPU can exploit parallelism both among partitions in Dataset and within a partition of Dataset
User’s Spark Program
case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
Figure: the program drives GPU native code. A GPU manager transfers each ds1 partition, held in columnar storage (x and y columns laid out by memory address), between CPU and GPU. The GPU kernel computes x + 1 and y * 2 to produce ds2 (x = 2, 3, 4, 5 and y = 10, 12, 14, 16)
Exploit Parallelism Between GPU Kernels
Overlap data transfers and computations among different GPU kernels on a GPU
Figure (timeline): three Spark workers for GPU each issue a sequence of data transfers from CPU to GPU, GPU kernels, and data transfers from GPU to CPU. Data transfers of one worker overlap with GPU kernels of another
How We Write Program And What is Executed
Write a program using a relational operation for DataFrame or a lambda expression for Dataset.
Catalyst performs optimization and code generation for the program.
The corresponding Java bytecode for the generated Java code is executed.
“Catalyst” is a code-name for optimizer and code generator in Apache Spark
data = Seq(Pt(1, 5), Pt(2, 6))
// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())
// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
Figure: the frontend API (DataFrame, Dataset) feeds Catalyst, which generates Java code. The backend computation is Catalyst-generated Java bytecode operating on row-oriented data (1 5, 2 6) in the Java heap
How Program is Executed on GPU
For DataFrame and Dataset, enhanced Catalyst generates Java code optimized for GPU.
A just-in-time compiler in Java virtual machine can generate GPU code.
data = Seq(Pt(1, 5), Pt(2, 6))
// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())
// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
Figure: the frontend API (DataFrame, Dataset) feeds enhanced Catalyst, which generates optimized Java code. The backend computation is automatically generated GPU code operating on column-oriented data (1 2, 5 6) in GPU device memory
Pseudo Java Code by Current Catalyst
Perform optimization that merges multiple parallel operations (selectExpr() and agg(sum())) into one loop
// DataFrame program for Spark
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
// Generated pseudo Java code (by Catalyst)
int sum = 0;
while (rowIterator.hasNext()) { // iterator-based access
  Row row = rowIterator.next(); // for df1
  int x = row.getInteger(0);
  // selectExpr(x + 1)
  int x_new = x + 1; // for df2
  sum += x_new;
}
Generated code corresponds to selectExpr() and local sum()
Figure: row-oriented df1 (x = -1, 0, 1) is read sequentially; x_new = 0, 1, 2; sum = 3
Pseudo Java Code by Enhanced Catalyst
Get column0 from column-oriented storage
For-loop can be executed in a parallel reduction manner
// Generated pseudo Java code (by enhanced Catalyst)
Column column0 = df1.getColumn(0); // df1
int sum = 0;
for (int i = 0; i < column0.numRows; i++) {
  int x = column0.getInteger(i);
  // selectExpr(x + 1)
  int x_new = x + 1; // for df2
  sum += x_new;
}
Figure: column-oriented df1 (x = -1, 0, 1); x_new = 0, 1, 2; sum = 3
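The difference between the two generated loops can be seen in runnable form. Iterator-based access forces one row after another, while index-based access over a column can be expressed as a parallel reduction. The sketch below uses plain Java arrays in place of Spark's Row and Column types; the class ColumnSum and its methods are hypothetical stand-ins, not Catalyst output.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.stream.IntStream;

final class ColumnSum {
    // Iterator-based, row-at-a-time access: inherently sequential.
    static int sumRows(Iterator<int[]> rows) {
        int sum = 0;
        while (rows.hasNext()) {
            int x = rows.next()[0]; // field 0 of the row
            sum += x + 1;           // selectExpr(x + 1) fused with the sum
        }
        return sum;
    }

    // Index-based access over one column: each i is independent, so the
    // loop can be written as a parallel reduction.
    static int sumColumn(int[] column0) {
        return IntStream.range(0, column0.length)
                .parallel()
                .map(i -> column0[i] + 1)
                .sum();
    }

    public static void main(String[] args) {
        Iterator<int[]> rows =
                Arrays.asList(new int[] {-1}, new int[] {0}, new int[] {1}).iterator();
        System.out.println(sumRows(rows));                     // 3
        System.out.println(sumColumn(new int[] {-1, 0, 1}));   // 3
    }
}
```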
Generate GPU Code Transparently from Spark Program
Copy column-oriented storage into GPU
Execute add and reduction in one GPU kernel
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
// Generated CPU code
Column column0 = df1.getColumn(0);
int nRows = column0.numRows;
cudaMalloc(&d_c0, nRows*4);
cudaMemcpy(d_c0, column0, nRows*4, H2D);
int sum = 0;
cudaMalloc(&d_sum, 4);
cudaMemcpy(d_sum, &sum, 4, H2D);
<<...>> GPU(d_c0, d_sum, nRows) // launch GPU
cudaMemcpy(&sum, d_sum, 4, D2H);
cudaFree(d_sum); cudaFree(d_c0);
// GPU code
__global__ void GPU(int *d_c0, int *d_sum, long size) {
  long ix = … // 0, 1, 2
  if (size <= ix) return;
  int x = d_c0[ix];
  int x_new = x + 1;
  reduction(d_sum, x_new);
}
Figure: d_c0 (x = -1, 0, 1) is processed in parallel; x_new = 0, 1, 2; d_sum = 3
Related Work
Spark With Accelerated Tasks [Grossman2016]
– Generate GPU code from lambda function in map() in RDD
– Very similar to enhanced Catalyst in using columnar storage to transparently exploit GPUs. However, it works only for RDD with map()
val inputRDD = cl(sc.objectFile[Int]( hdfsPath ))
val doubledRDD = inputRDD.map(i => 2 * i)
GPU Columnar (proposed by Kiran Lonikar)
– Generate GPU code from program using select() method in DataFrame
– Very similar to enhanced Catalyst in using columnar storage to transparently exploit GPUs
Takeaway
How a system generates hardware accelerator code from a program with high-level abstraction
– Most programmers are not Ninja programmers
– Compiler can transform program for hardware features, but does not want to do trial and error at runtime
– How can compiler and hardware build a good relationship?
(Not discussed today) What can we do for deep learning?
– Current deep learning frameworks use GPU by calling libraries (e.g. cuDNN/cuRNN by NVIDIA)
– How will system support rapid evolution in deep learning? New neural network structures are still being proposed