Inside a GPGPU Managed Platform with an Auto-tuning JIT Compiler
Chris Jang
[email protected]
http://fastkor.wordpress.com/


TRANSCRIPT

Page 1:

Inside a GPGPU Managed Platform with an Auto-tuning JIT Compiler

Chris Jang

[email protected]

http://fastkor.wordpress.com/

Page 2:

Three questions

Auto-tuning on the GPU. . .

1. What is it?

2. Why is it important?

3. How is it done?

15th SIAM Conference on Parallel Processing for Scientific Computing, MS6 Towards Smart Auto-tuning for HPC – p.1

Page 3:

What is GPGPU auto-tuning?

◮ Automated performance tuning

◮ Without programming. . .

⊲ Find code that runs fast

⊲ Adapt code to hardware

⊲ Optimize code for SDK/driver

Page 4:

Why is auto-tuning important?

Same GPGPU algorithm... over 100x speed difference

DGEMM 400x400 memory buffers (ATI HD 5870): histogram from EM auto-tuning over model family

Page 5:

Why is auto-tuning important?

Same GPGPU algorithm... adapts to GPU hardware

DGEMM 400x400 memory buffers (NVIDIA GTX 480): histogram from EM auto-tuning over model family

Page 6:

Why is auto-tuning important?

Same GPGPU algorithm... optimizes for SDK/driver

DGEMM 400x400 memory buffers (ATI SDK 2.1 and 2.2): histogram from EM auto-tuning over model family

Page 7:

Why is auto-tuning important?

Same GPGPU algorithm... no programming; fast code, optimization, adaptation

◮ ATI HD 5870 (Evergreen)

◮ NVIDIA GTX 480 (Fermi)

◮ SDK 2.1 and SDK 2.2 for ATI OpenCL

Page 8:

How is auto-tuning done?

This is really two questions. . .

1. How are fast math kernels found?

2. How to integrate with a virtual machine JIT?

Page 9:

How are fast math kernels found?

1. Parameterized kernel family

◮ Designed by hand

◮ Compiler generated

2. Search using statistical optimization

◮ Expectation-Maximization works!

◮ Maximize expected throughput (GFLOPS)

◮ Examples: GEMM and GEMV
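The EM search itself is not spelled out on the slides, but its shape can be sketched. The following is a simplified, hypothetical stand-in (closer to a cross-entropy method than full EM): keep a categorical weight distribution per tuning parameter, sample and benchmark configurations, then re-fit the weights to the best-scoring samples. benchmark() here is a made-up throughput surface, not a real kernel timing, and the parameter names are illustrative.

```cpp
#include <algorithm>
#include <cstdlib>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical throughput surface: peaks at blocking=4, unroll=2.
double benchmark(int blocking, int unroll) {
    return 100.0 - 10.0 * (blocking - 4) * (blocking - 4)
                 -  5.0 * (unroll - 2) * (unroll - 2);
}

struct Search {
    std::vector<int> blockVals;           // alpha-like parameter values
    std::vector<int> unrollVals;          // beta-like parameter values
    std::vector<double> pBlock, pUnroll;  // categorical weights

    Search() : pBlock(4, 1.0), pUnroll(3, 1.0) {
        int bv[] = {1, 2, 4, 8};
        int uv[] = {1, 2, 4};
        blockVals.assign(bv, bv + 4);
        unrollVals.assign(uv, uv + 3);
    }

    // Draw an index proportionally to the current weights.
    int sample(const std::vector<double>& w) const {
        double total = 0;
        for (size_t i = 0; i < w.size(); i++) total += w[i];
        double r = (std::rand() / (double)RAND_MAX) * total;
        for (size_t i = 0; i < w.size(); i++)
            if ((r -= w[i]) <= 0) return (int)i;
        return (int)w.size() - 1;
    }

    // Alternate sampling (E-like step) and re-fitting to elites (M-like step).
    std::pair<int, int> run(int rounds, int samplesPerRound) {
        double bestScore = -1e300;
        std::pair<int, int> best(0, 0);
        for (int r = 0; r < rounds; r++) {
            std::vector<std::tuple<double, int, int> > scored;
            for (int s = 0; s < samplesPerRound; s++) {
                int bi = sample(pBlock), ui = sample(pUnroll);
                double gf = benchmark(blockVals[bi], unrollVals[ui]);
                if (gf > bestScore) {
                    bestScore = gf;
                    best = std::make_pair(blockVals[bi], unrollVals[ui]);
                }
                scored.push_back(std::make_tuple(gf, bi, ui));
            }
            std::sort(scored.rbegin(), scored.rend());   // best first
            std::vector<double> nb(4, 0.1), nu(3, 0.1);  // smoothed re-fit
            for (int s = 0; s < samplesPerRound / 4; s++) {
                nb[std::get<1>(scored[s])] += 1.0;
                nu[std::get<2>(scored[s])] += 1.0;
            }
            pBlock = nb;
            pUnroll = nu;
        }
        return best;  // best configuration actually benchmarked
    }
};
```

The re-fit concentrates probability mass on configurations that measured well, so later rounds spend benchmark time near the peak of the throughput surface.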

Page 10:

How to integrate with a VM JIT?

Why ask this question. . .

◮ Fast GPGPU kernels are not enough

⊲ Memory management: I/O can dominate performance

⊲ GPGPU programming is still too hard

◮ Natural solution is managed platform for GPGPU

⊲ Application virtual machine

⊲ Garbage collection

⊲ Just-in-time auto-tuning compiler

⊲ High-level programming language

Page 11:

GPGPU kernels are not enough

Let’s see some examples. . .

◮ Memory management: I/O can dominate performance

◮ GPGPU programming is still too hard

Page 12:

Memory management is important

Memory objects can be very expensive

SGEMM 1600x1600 AB images, C buffer (ATI HD 5870): histogram from EM auto-tuning over model family

Page 13:

Memory costs can be huge

Order of magnitude difference in peak throughput!

◮ 1340 GFLOPS if no I/O, all memory on GPU

◮ 660 GFLOPS if read back result

◮ 140 GFLOPS if send input data

◮ 120 GFLOPS if send input data and read back result

SGEMM 1600x1600 AB images, C buffer (ATI HD 5870)
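The order-of-magnitude swing above follows from a simple additive cost model: effective throughput = flops / (kernel time + I/O time). A sketch of that model, with assumed I/O times (roughly 6 ms for readback, 62 ms for send plus readback) chosen only to land near the slide's 660 and 120 GFLOPS figures; they are illustrations, not measurements.

```cpp
// Effective throughput under an additive cost model: the kernel and the
// PCIe transfers are serialized, so I/O time dilutes GFLOPS directly.
double effectiveGflops(double flops, double kernelSec, double ioSec) {
    return flops / (kernelSec + ioSec) / 1e9;
}
```

For SGEMM at N=1600, flops = 2*N^3 ≈ 8.2e9 and the kernel-only rate of 1340 GFLOPS implies about 6.1 ms of compute; a few milliseconds of transfer on top of that is enough to halve or decimate the effective rate.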

Page 14:

What are the memory costs?

◮ Data transfer over the PCIe bus

⊲ Bus slower than RAM

⊲ DMA not always available

⊲ Bus controllers not all the same (mainboard/GPU)

⊲ May need to copy from application memory to driver memory

◮ Memory allocation on GPU

⊲ Textures (images) expensive

⊲ Memory buffers cheap

Page 15:

Familiar CPU programming

Add two arrays on the CPU...

float a[1000], b[1000], c[1000];

void fooCPU(float* c, float* a, float* b)
{
    for (int i = 0; i < 1000; i++)
        c[i] = a[i] + b[i];
}

Page 16:

Unfamiliar GPGPU programming

Add two arrays on the GPU...

__kernel void fooGPU(__global float* c,
                     __global float* a,
                     __global float* b)
{
    c[get_global_id(0)]
        = a[get_global_id(0)]
        + b[get_global_id(0)];
}

size_t global[1], local[1];
global[0] = 1000;
local[0] = 1;
clEnqueueNDRangeKernel(..., global, local, ...);

Page 17:

That looked easy!

It is easy for two reasons. . .

◮ Simple algorithm (entry-wise addition)

◮ No performance tuning

Productivity issues. . .

◮ GPGPU code too verbose

◮ Need to hide boilerplate in libraries

◮ Additional resource management for GPGPU

Page 18:

GPGPU as a managed platform

GPGPU auto-tuning solves some problems. . .

◮ finds fast code and adapts without programming

◮ ... but does not manage memory for us

◮ ... and does not fix the low productivity of C-like GPGPU languages

The natural solution is a managed platform:

◮ Application virtual machine

◮ Garbage collection

◮ Just-in-time auto-tuning compiler

◮ High-level programming language

Page 19:

An implementation of these ideas

Chai... a managed platform and language for GPGPU

◮ Free, open source

◮ Inspired by PeakStream

◮ Array programming language as C++ DSL

◮ Under development since 2010

Page 20:

Chai language summary

◮ Array is fundamental type

◮ Arithmetic and predicate operators

◮ Selection, gathering, indexes

◮ Data creation

◮ RNG (CPU only for now)

◮ Auto-tuned matrix multiply

◮ Reductions

◮ Math, common, relational functions (POSIX/OpenCL)

◮ Extensible at runtime

Page 21:

Sample: Data parallel reduction

double cpuA[P][N * N], cpuB[P][N * N];
vector<double*> dataA, dataB; // parallel
for (int i = 0; i < P; i++) {
    dataA.push_back(cpuA[i]);
    dataB.push_back(cpuB[i]);
}

Arrayf64 D;
{
    Arrayf64 A = make2(N, N, dataA);
    Arrayf64 B = make2(N, N, dataB);
    D = sum(matmul(A, B));
}

const vector<double> d = D.read_scalar(P);

Page 22:

Sample: Managed and unmanaged

for (size_t i = 0; i < 5; i++)
{
    // managed on GPU
    Arrayf64 B;
    {
        Arrayf64 A = make1(10, cpuA);
        B = exp(A);
    }
    B.read1(cpuB, 10 * sizeof(double));

    // unmanaged on CPU
    cpuB[i] += i;
}

Page 23:

Sample: Conjugate-gradient (1/2)

Arrayf32 A = Arrayf32::make2(N, N, cpuA);
Arrayf32 b = Arrayf32::make1(N, cpub);
Arrayf32 residuals = b - matmul(A, x);
Arrayf32 p = residuals;
Arrayf32 newRR = dot_product(residuals, residuals);
for (iter = 0; iter < N; iter++) {
    Arrayf32 oldRR = newRR;
    Arrayf32 newX, newP, newResiduals;
    Arrayf32 Ap = matmul(A, p);
    Arrayf32 dp = dot_product(p, Ap);
    newX = x + p * oldRR / dp;

(PeakStream demo code)

Page 24:

Sample: Conjugate-gradient (2/2)

    newResiduals = residuals - Ap * oldRR / dp;
    newRR = dot_product(newResiduals, newResiduals);
    newP = newResiduals + p * newRR / oldRR;
    p = newP;
    residuals = newResiduals;
    float oldRRcpu = oldRR.read_scalar();
    if (oldRRcpu <= TOLERANCE)
        break;
    x = newX;
}

(PeakStream demo code)

Page 25:

Basic concepts

◮ An odd couple: CPU and GPU

◮ Arithmetic intensity

◮ GPU memory hierarchy

◮ Problem blocking

Page 26:

An odd couple: CPU and GPU

Software design rules must be different. . .

                            CPU               GPU
ALUs                        few               many
memory                      cache controller  program code
control flow                MIMD              SIMD
programming                 mixed paradigm    kernel/functional
dominant cost               time              space
effective memory bandwidth  high              low

Page 27:

Arithmetic intensity

Number crunching relative to amount of data:

    O(time) / O(space)

GPUs have:

◮ Huge number of ALUs

◮ Low effective memory bandwidth

which means:

◮ O(time) should be large

◮ O(space) should be small
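A worked example of why this favors matrix multiply over entry-wise addition: GEMM does O(N^3) arithmetic on O(N^2) data, so its intensity grows with N, while array addition does O(N) work on O(N) data and stays constant. A small sketch of the two ratios:

```cpp
// Arithmetic intensity = O(time) / O(space), here counted as
// floating-point operations per array element touched.

// GEMM: 2*N^3 flops over 3*N^2 elements (A, B, C) -- grows with N.
double intensityGemm(double n) { return (2.0 * n * n * n) / (3.0 * n * n); }

// Entry-wise add: N flops over 3*N elements (a, b, c) -- constant in N.
double intensityAdd(double n) { return n / (3.0 * n); }
```

At N=1600 the GEMM ratio is over a thousand flops per element, while the addition never rises above one third, which is why the latter is bandwidth-bound on any GPU.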

Page 28:

GPU memory hierarchy

Six kinds of memory...

host, driver, global, texture, local, private

global    RAM on the GPU
texture   Very fast but accessed with coordinates
local     Fast scratchpad for running kernels
private   Register variables for running kernels

Page 29:

Problem blocking

inner blocking: kernel registers, arithmetic intensity
outer blocking: GPU core work groups, local memory
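A minimal CPU-side sketch of inner blocking (the outer, work-group level is omitted): a 2x2 register block computes four results per pass over a row/column pair, reusing each loaded value twice and so raising arithmetic intensity. This illustrates the idea only; it is not the slides' actual GPU kernels, and N must be even here.

```cpp
#include <vector>

typedef std::vector<float> Mat;  // row-major N x N

// Reference: one result at a time, no data reuse in registers.
Mat matmulNaive(const Mat& a, const Mat& b, int n) {
    Mat c(n * n, 0.0f);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
    return c;
}

// Inner blocking: 4 loads feed 4 multiply-adds per k step (n must be even).
Mat matmulBlocked(const Mat& a, const Mat& b, int n) {
    Mat c(n * n, 0.0f);
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j += 2) {
            float c00 = 0, c01 = 0, c10 = 0, c11 = 0;  // register block
            for (int k = 0; k < n; k++) {
                float a0 = a[i*n + k], a1 = a[(i+1)*n + k];
                float b0 = b[k*n + j], b1 = b[k*n + j + 1];
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
            }
            c[i*n + j]         = c00; c[i*n + j + 1]     = c01;
            c[(i+1)*n + j]     = c10; c[(i+1)*n + j + 1] = c11;
        }
    return c;
}
```

Both versions accumulate in the same k order, so they produce identical results; only the loads-per-flop ratio changes.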

Page 30:

Basic concepts summary

GPGPU performance needs:

1. High arithmetic intensity

2. Efficient data movement through memory hierarchy

3. Optimal blocking

◮ Maximize arithmetic intensity

◮ Minimize register pressure (GPU thread scheduling)

Page 31:

Auto-tuning

1. In isolation. . .

◮ Expectation-Maximization

◮ Curse of dimensionality

◮ Memoization and journaling

2. With the JIT. . .

◮ Everything fails

◮ Dynamic auto-tuning

◮ Auto-tune everything?

Page 32:

Expectation-Maximization

α parameters: outer and inner blocking
β parameters: loop order, miscellaneous

Page 33:

Curse of dimensionality

Auto-tuning doesn’t find everything. . .

                  Auto-tuned   Fixed
problem blocking      X
loop order            X
miscellaneous         X
variable types                     X
dimensions                         X
data layout                        X

Why not? Curse of Dimensionality!

Page 34:

Curse of dimensionality

EM search fails to converge, so use brute force. . .

◮ Fixed parameters in outer loop

◮ Tuned parameters found with EM auto-tuning

◮ Expensive, but only have to do once

Page 35:

Memoization and journaling

Memoization. . .

◮ Faster search (typical recursion trick)

◮ Auto-tuning takes a long time (hours, days)

◮ Thousands of kernel variations

Journaling. . .

◮ Don’t lose time if something hangs or crashes

◮ Resume search from where left off
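A sketch of how memoization and journaling might fit together: each benchmarked result is cached in memory by parameter key and appended to a journal file, so a crashed or hung run can replay the journal and resume where it left off. The file format, class, and benchmark() stand-in below are assumptions for illustration, not the slides' actual implementation.

```cpp
#include <fstream>
#include <map>
#include <string>

struct Tuner {
    std::map<std::string, double> memo;  // parameter key -> GFLOPS
    std::string journalPath;

    explicit Tuner(const std::string& path) : journalPath(path) {
        std::ifstream in(path.c_str());  // resume: replay the journal
        std::string key;
        double gflops;
        while (in >> key >> gflops) memo[key] = gflops;
    }

    double measure(const std::string& key) {
        std::map<std::string, double>::iterator it = memo.find(key);
        if (it != memo.end()) return it->second;  // memoized: skip re-run
        double gflops = benchmark(key);           // may hang or crash...
        memo[key] = gflops;                       // ...so journal the result
        std::ofstream out(journalPath.c_str(), std::ios::app);
        out << key << ' ' << gflops << '\n';
        return gflops;
    }

    // Stand-in for compiling and timing one kernel variation.
    static double benchmark(const std::string& key) {
        return 100.0 + (double)key.size();
    }
};
```

Because every result hits the journal before the next (potentially fatal) benchmark runs, a restart loses at most the single variation that killed the process.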

Page 36:

Everything fails

Something will crash or hang eventually: compiler, driver, or kernel

◮ failures impossible to predict accurately

◮ auto-tuning is a stress test for technology stack

◮ auto-tuning even more important:

⊲ learn the good kernels ahead-of-time

⊲ application VM and JIT avoid bad ones at runtime

Page 37:

Dynamic auto-tuning

Cold start ahead-of-time:

◮ Learn structure of compute device

◮ Expensive statistical optimization

◮ Pay this cost once up-front and re-use results

Warm start just-in-time:

◮ Adapt to runtime state of GPU

◮ Relatively cheap as structure already known

◮ Acceptable JIT compiler warm up overhead

Page 38:

Auto-tune everything?

No, only auto-tune high arithmetic intensity kernels. . .

◮ Good return on investment

⊲ JIT work is not cheap

⊲ Too much JIT limits throughput (Amdahl’s law)

◮ Converges easily as convexity is strong enough

⊲ Side-effect of good ROI

◮ Low arithmetic intensity kernels will always be slow

⊲ Side-effect of bad ROI
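The Amdahl's-law point can be made concrete: if only a fraction p of total runtime is in auto-tuned kernels sped up by factor s, the overall speedup is 1 / ((1 - p) + p / s), so time spent in the JIT itself (shrinking p) caps throughput no matter how fast the tuned kernels get. A one-line sketch, with illustrative numbers:

```cpp
// Amdahl's law: whole-program speedup when a fraction p of the work is
// accelerated by factor s; the remaining (1 - p) runs at original speed.
double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }
```

With p = 0.9 even an infinitely fast kernel yields at most a 10x overall speedup, which is why only high-arithmetic-intensity kernels are worth the JIT's tuning effort.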

Page 39:

There’s not enough time. . .

◮ Application virtual machine

⊲ C++ domain-specific language

⊲ Bytecode stack machine

⊲ Tracing JIT and interpreter

⊲ Gather/scatter scheduler

⊲ JIT generates OpenCL

◮ Memory management

⊲ Reference counted garbage collection

⊲ Tracing JIT enqueues, interpreter swizzles

⊲ Sticky continuation

◮ JIT middle-end. . .
