Enabling Heterogeneous Computing in Java with Graal

Juan Fumero, Michel Steuwer, Christophe Dubach

The University of Edinburgh

7 July 2015, Truffle Workshop



1 Introduction

2 API

3 Runtime Code Generation

4 Data Management

5 Results

6 Conclusion


Heterogeneous Computing

• NBody App (NVIDIA SDK): ~105x speedup over sequential

• LU Decomposition (Rodinia Benchmark): ~10x over 32 OpenMP threads


Cool, but how to program?


Example in OpenCL

    // create host buffers
    int *A, ....
    // Initialization
    ...
    // platform
    cl_uint numPlatforms = 0;
    cl_platform_id *platforms;
    status = clGetPlatformIDs(0, NULL, &numPlatforms);
    platforms = (cl_platform_id *) malloc(numPlatforms * sizeof(cl_platform_id));
    status = clGetPlatformIDs(numPlatforms, platforms, NULL);
    cl_uint numDevices = 0;
    cl_device_id *devices;
    status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
    // Allocate space for each device
    devices = (cl_device_id *) malloc(numDevices * sizeof(cl_device_id));
    // Fill in devices
    status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
    cl_context context;
    context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
    cl_command_queue cmdQ;
    cmdQ = clCreateCommandQueue(context, devices[0], 0, &status);
    cl_mem d_A, d_B, d_C;
    d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, A, &status);
    d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, B, &status);
    d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
    ...
    // Check errors
    ...


Example in OpenCL

    const char *sourceFile = "kernel.cl";
    source = read_source(sourceFile);
    program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &status);
    cl_int buildErr;
    buildErr = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
    // Create a kernel
    kernel = clCreateKernel(program, "vecadd", &status);

    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
    status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
    status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

    size_t globalWorkSize[1] = { ELEMENTS };
    size_t local_item_size[1] = { 5 };

    clEnqueueNDRangeKernel(cmdQ, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

    clEnqueueReadBuffer(cmdQ, d_C, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

    // Free memory


OpenCL example

    kernel void vecadd(
        global int *a,
        global int *b,
        global int *c) {

        int idx = get_global_id(0);
        c[idx] = a[idx] * b[idx];
    }

• Hello world app: ~250 lines of code (including error checking)

• Low-level and specific code

• Requires knowledge about the target architecture

• If the GPU/accelerator changes, tuning is required


OpenCL programming is hard and error-prone!!


Higher levels of abstraction



Similar works

• Sumatra API (discontinued): Stream API for HSAIL

• AMD Aparapi: Java API for OpenCL

• NVIDIA Nova: functional programming language for CPU/GPU

• Copperhead: subset of Python that can be executed on heterogeneous platforms


Our Approach

Three levels of abstraction:

• Parallel Skeletons: API based on functional programming style (map/reduce)

• High-level optimising library which rewrites operations to target specific hardware

• OpenCL code generation and runtime with data management for heterogeneous architectures


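Our Approach: Overview

[Figure: compilation and execution pipeline. The application, written against the ArrayFunction API, goes through Java source compilation to Java bytecode. During Java execution, a call such as dotP.apply(input) triggers OpenCL kernel generation (using the Graal API) followed by OpenCL execution; the OpenCL kernel runs on the accelerator through JOCL and the output is returned to Java.]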


Example: Saxpy

    // Computation function
    ArrayFunc<Tuple2<Float, Float>, Float> mult = new MapFunction<>(t -> 2.5f * t._1() + t._2());

    // Prepare the input (each element must be allocated before it is used)
    Tuple2<Float, Float>[] input = new Tuple2[size];
    for (int i = 0; i < input.length; ++i) {
        input[i] = new Tuple2<>((float) (i * 0.323), (float) (i + 2.0));
    }

    // Computation
    Float[] output = mult.apply(input);

If an accelerator is enabled, the map expression is automatically rewritten into lower-level operations:

    map(λ) = MapAccelerator(λ) = CopyIn().computeOCL(λ).CopyOut()
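The rewrite of map(λ) into a copy-in, compute, copy-out chain can be illustrated with plain function composition. This is only a sketch under stated assumptions: the class MapRewriteSketch, the interface FloatUnary and the three stage methods are hypothetical names chosen for illustration, the "device" is an ordinary Java array, and no OpenCL is involved.

```java
import java.util.function.Function;

/** Hypothetical element-wise function on floats (not part of the slides' API). */
interface FloatUnary {
    float apply(float x);
}

/** Sketch of map(lambda) = CopyIn().computeOCL(lambda).CopyOut() as a function chain. */
final class MapRewriteSketch {

    // Stands in for CopyIn: a copy of the host array plays the role of a device buffer.
    static float[] copyIn(float[] host) {
        return host.clone();
    }

    // Stands in for computeOCL: applies the lambda to every element (plain Java loop).
    static Function<float[], float[]> computeOCL(FloatUnary lambda) {
        return device -> {
            float[] result = new float[device.length];
            for (int i = 0; i < device.length; ++i) {
                result[i] = lambda.apply(device[i]);
            }
            return result;
        };
    }

    // Stands in for CopyOut: copies the "device" result back to a host array.
    static float[] copyOut(float[] device) {
        return device.clone();
    }

    // The rewritten map: the three stages composed in order.
    static Function<float[], float[]> mapAccelerator(FloatUnary lambda) {
        Function<float[], float[]> in = MapRewriteSketch::copyIn;
        return in.andThen(computeOCL(lambda)).andThen(MapRewriteSketch::copyOut);
    }
}
```

For example, MapRewriteSketch.mapAccelerator(x -> 2.5f * x + 1f).apply(new float[] {0f, 2f}) yields an array holding 1.0 and 6.0. Keeping the stages separate is what would let a runtime swap only the middle stage for a real device kernel.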


Our Approach: Overview

[Figure: class diagram with the classes Function, ArrayFunc, Map, MapThreads, MapOpenCL and Reduce, ... Three apply() implementations are shown:]

    // sequential map
    apply() { for (i = 0; i < size; ++i) out[i] = f.apply(in[i]); }

    // map with Java threads
    apply() { for (thread : threads) thread.performMapSeq(); }

    // map with OpenCL
    apply() { copyToDevice(); execute(); copyToHost(); }
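The sequential and multi-threaded apply() variants from the diagram can be sketched as a small class hierarchy. The classes below (ArrayFuncSketch, MapSketch, MapThreadsSketch) are hypothetical stand-ins and work on double[] with DoubleUnaryOperator to stay self-contained; the inheritance relation and the static chunking scheme are assumptions, not taken from the slides.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical base type: anything that transforms an array. */
abstract class ArrayFuncSketch {
    abstract double[] apply(double[] in);
}

/** Sequential map: one loop over all elements. */
class MapSketch extends ArrayFuncSketch {
    protected final DoubleUnaryOperator f;

    MapSketch(DoubleUnaryOperator f) {
        this.f = f;
    }

    @Override
    double[] apply(double[] in) {
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; ++i) {
            out[i] = f.applyAsDouble(in[i]);
        }
        return out;
    }
}

/** Threaded map: each thread runs the sequential loop on its own chunk. */
class MapThreadsSketch extends MapSketch {
    private final int nThreads;

    MapThreadsSketch(DoubleUnaryOperator f, int nThreads) {
        super(f);
        this.nThreads = Math.max(1, nThreads);
    }

    @Override
    double[] apply(double[] in) {
        double[] out = new double[in.length];
        Thread[] threads = new Thread[nThreads];
        int chunk = (in.length + nThreads - 1) / nThreads; // ceiling division
        for (int t = 0; t < nThreads; ++t) {
            final int lo = Math.min(in.length, t * chunk);
            final int hi = Math.min(in.length, lo + chunk);
            threads[t] = new Thread(() -> {
                for (int i = lo; i < hi; ++i) {
                    out[i] = f.applyAsDouble(in[i]);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join(); // join also makes the thread's writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return out;
    }
}
```

Because both classes share the same apply() signature, caller code does not change when the execution strategy does, which is the point of hiding the strategy behind the function object.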


Runtime Code Generation: Workflow

Java source: Map.apply(f)

Java bytecode:

    ...
    10: aload_2
    11: iload_3
    12: aload_0
    13: getfield
    16: aaload
    18: invokeinterface #apply
    23: aastore
    24: iinc
    27: iload_3
    ...

Graal VM steps, from bytecode to OpenCL kernel:

1. Type inference

2. IR generation: CFG + Dataflow (Graal IR)

3. Optimizations

4. Kernel generation

OpenCL Kernel:

    void kernel(global float* input, global float* output) { ...; ...; }


OpenCL code generated

    double lambda0(float p0) {
        double cast_1 = (double) p0;
        double result_2 = cast_1 * 2.0;
        return result_2;
    }
    kernel void lambdaComputationKernel(
        global float *p0,
        global int *p0_index_data,
        global double *p1,
        global int *p1_index_data) {
        int p0_dim_1 = 0; int p1_dim_1 = 0;
        int gs = get_global_size(0);
        int loop_1 = get_global_id(0);
        for ( ; ; loop_1 += gs) {
            int p0_len_dim_1 = p0_index_data[p0_dim_1];
            bool cond_2 = loop_1 < p0_len_dim_1;
            if (cond_2) {
                float auxVar0 = p0[loop_1];
                double res = lambda0(auxVar0);
                p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
            } else { break; }
        }
    }
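The loop in the generated kernel strides by the global size, so each work-item handles elements gid, gid + gs, gid + 2*gs and so on. The sketch below emulates that access pattern in plain Java, running the work-items one after another; GridStrideSketch and its methods are hypothetical names, and the index-data arrays of the real kernel are replaced by the Java array length for brevity.

```java
/** Emulates the generated kernel's stride loop on the CPU (illustration only). */
final class GridStrideSketch {

    // Same computation as the generated lambda0: widen to double and multiply by 2.
    static double lambda0(float p0) {
        return (double) p0 * 2.0;
    }

    // Body executed by one emulated work-item with id gid out of gs work-items.
    static void workItem(int gid, int gs, float[] p0, double[] p1) {
        for (int loop = gid; ; loop += gs) {
            if (loop < p0.length) {
                p1[loop] = lambda0(p0[loop]);
            } else {
                break;
            }
        }
    }

    // Launches globalSize emulated work-items sequentially; globalSize must be > 0.
    static double[] run(float[] input, int globalSize) {
        double[] out = new double[input.length];
        for (int gid = 0; gid < globalSize; ++gid) {
            workItem(gid, globalSize, input, out);
        }
        return out;
    }
}
```

With this pattern the kernel stays correct when the input has more elements than there are work-items, since every index is visited by exactly one work-item.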


Investigation of runtime for BS

Black-Scholes benchmark: Float[] ⟹ Tuple2<Float, Float>[]

[Figure: stacked bar showing the amount of total runtime in % (axis from 0.0 to 1.0), broken down into Java overhead, Marshaling, CopyToGPU, GPU Execution, CopyToCPU and Unmarshaling.]

• Un/marshaling the data takes up to 90% of the time

• The computation step should be dominant

This is not acceptable. Can we do better?
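To make the cost concrete, the sketch below shows what marshaling and unmarshaling amount to: every boxed object is visited and copied into (or rebuilt from) a flat primitive array that a device can consume. The FloatPair class is a hypothetical stand-in for Tuple2<Float, Float>, and the interleaved layout is an assumption made for illustration.

```java
/** Illustrative marshaling between boxed Java objects and flat primitive arrays. */
final class MarshalSketch {

    /** Hypothetical stand-in for Tuple2<Float, Float>. */
    static final class FloatPair {
        final Float first;
        final Float second;

        FloatPair(Float first, Float second) {
            this.first = first;
            this.second = second;
        }
    }

    // Marshaling: unbox every field and copy it into one interleaved float array.
    static float[] marshal(FloatPair[] in) {
        float[] flat = new float[2 * in.length];
        for (int i = 0; i < in.length; ++i) {
            flat[2 * i] = in[i].first;
            flat[2 * i + 1] = in[i].second;
        }
        return flat;
    }

    // Unmarshaling: box every primitive result back into a Java object.
    static Float[] unmarshal(float[] flat) {
        Float[] out = new Float[flat.length];
        for (int i = 0; i < flat.length; ++i) {
            out[i] = flat[i];
        }
        return out;
    }
}
```

Both directions touch every element and allocate or dereference one object per element, which is pure overhead relative to the kernel itself.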


Custom Array Type

[Figure: data layout of PArray<Tuple2<Float,Double>>. Programmer's view: an array of Tuple2 elements, each holding a float and a double, at indices 0, 1, 2, ..., n-1. Graal-OCL VM view: a FloatBuffer holding all float fields and a DoubleBuffer holding all double fields, each indexed 0, 1, 2, ..., n-1.]

With this layout, un/marshal operations are not necessary
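A minimal sketch of such a layout, assuming one java.nio buffer per tuple field: elements are written straight into the primitive buffers, so there is no object array to convert later. TupleArraySketch and its methods are hypothetical and are not the PArray implementation; heap buffers are used here for simplicity.

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

/** Array of (float, double) pairs stored as two primitive buffers (illustration only). */
final class TupleArraySketch {
    private final FloatBuffer firsts;   // all float fields, index 0..n-1
    private final DoubleBuffer seconds; // all double fields, index 0..n-1
    private final int size;

    TupleArraySketch(int size) {
        this.size = size;
        this.firsts = FloatBuffer.allocate(size);
        this.seconds = DoubleBuffer.allocate(size);
    }

    // Stores element i by writing each field into its own buffer (absolute put).
    void put(int i, float first, double second) {
        firsts.put(i, first);
        seconds.put(i, second);
    }

    float first(int i) {
        return firsts.get(i);
    }

    double second(int i) {
        return seconds.get(i);
    }

    int size() {
        return size;
    }

    // The buffers are already in a transfer-friendly primitive form.
    FloatBuffer firstBuffer() {
        return firsts;
    }

    DoubleBuffer secondBuffer() {
        return seconds;
    }
}
```

The programmer still reads and writes logical tuples through put, first and second, while the storage underneath is already flat.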


Example of JPAI

    ArrayFunc<Tuple2<Float, Double>, Double> f = new MapFunction<>(t -> 2.5f * t._1() + t._2());

    PArray<Tuple2<Float, Double>> input = new PArray<>(size);

    for (int i = 0; i < size; ++i) {
        input.put(i, new Tuple2<>((float) i, (double) i + 2));
    }

    PArray<Double> output = f.apply(input);


Setup

• 5 Applications

• Comparison with:
    • Java Sequential - Graal compiled code
    • AMD and Nvidia GPUs
    • Java Array vs. Custom PArray
    • Java threads


Java Threads Execution

[Figure: bar chart of speedup vs. Java sequential (axis from 0 to 6) for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo, each with small and large input sizes. Series: number of Java threads (1, 2, 4, 8, 16).]

CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz


OpenCL GPU Execution

[Figure: bar chart of speedup vs. Java sequential (log scale, 0.1 to 1000) for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo, each with small and large input sizes. Series: Nvidia Marshalling, Nvidia Optimized, AMD Marshalling, AMD Optimized. Two bars are labelled 0.004.]

GPUs: AMD Radeon R9 295, NVIDIA Geforce GTX Titan Black


OpenCL GPU Execution

[Figure: the speedup chart of the previous slide (speedup vs. Java sequential, log scale, Nvidia and AMD, Marshalling vs. Optimized), annotated with speedups of 10x, 12x and 70x.]


.zip(Conclusions).map(Future)

Present

• We have presented an API to enable heterogeneous computing in Java

• Custom array type to reduce overheads when transferring the data

• Runtime system to run heterogeneous applications within Java

Future

• Runtime data type specialization

• Code generation for multiple devices

• Runtime scheduling (Where is the best place to run the code?)


Thanks so much for your attention

This work was supported by a grant from:

Juan Jose [email protected]
