Using GPUs to Handle Big Data with Java

© 2014 IBM Corporation

Tim Ellison – IBM Java Technology Center

October 2nd, 2014

Using GPUs to Handle Big Data with Java (CON3285)

Important Disclaimers

THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.

WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.

ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED ON HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES.

ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A GUIDE.

IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM, WITHOUT NOTICE.

IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.

NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF:

- CREATING ANY WARRANTY OR REPRESENTATION FROM IBM, ITS AFFILIATED COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS

About me

Based in the Java Technology Centre, Hursley UK

Working on various runtime technologies for >20 years

Experience of open source communities

Currently focused on class library design and delivery

Overall technical lead for IBM Java 8 SE

@tpellison


Goals of this talk

Provide an introduction to programming with GPUs

Show how we can bring a popular GPU programming model to Java

Demonstrate the effect of GPUs on solving Big Data problems

Describe seamlessly enabling the Java implementation to use GPUs


Introduction to Graphics Processing Units

GPUs are no longer solely targeted at single-purpose graphical operations such as rendering and texture mapping

Modern high-end GPUs are general purpose “stream processors”

Provide substantially more FLOPS per dollar and per watt than CPUs

Programming language extensions allow a flow of control from the CPU to the single instruction, multiple thread (SIMT) engine


[Chart: theoretical GFLOP/s]

With their massively parallel architecture, GPUs far exceed the computational power of CPUs when operating on large sets of floating point numbers.

How can we bring this capability to Java?

Harnessing the power of GPUs

http://docs.nvidia.com/cuda/cudacprogrammingguide


Introduction to Programming GPUs

Typical scenario for heterogeneous programming:

– Host computer with CPU(s) and GPU(s) installed on PCIe bus

– Programmer identifies a parallelizable, compute-intensive routine and codes it for the GPU

– Data and control flow between the CPU host and the GPU device, under control of the host

We have focused on Nvidia CUDA as the programming model for exploiting GPUs.

Particularly suited to scientific and numerical analysis problems (e.g. linear algebra)


CUDA – a simple example

Vector addition

Serially on the CPU...

// N: number of elements, defined elsewhere
void add(int *a, int *b, int *c) {
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

On the CPU ...

void cuda_add(int *a, int *b, int *c) {
    int *dev_a, *dev_b, *dev_c;
    int len = N * sizeof(int);

    // Allocate device memory for the two inputs and the result
    cudaMalloc((void**)&dev_a, len);
    cudaMalloc((void**)&dev_b, len);
    cudaMalloc((void**)&dev_c, len);

    // Copy the inputs from host to device
    cudaMemcpy(dev_a, a, len, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, len, cudaMemcpyHostToDevice);

    // Launch the kernel: N blocks, one thread per block
    add<<<N,1>>>(dev_a, dev_b, dev_c);

    // Copy the result from device back to host
    cudaMemcpy(c, dev_c, len, cudaMemcpyDeviceToHost);

    // Release device memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}

...and GPU

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;        // each block handles one element
    c[tid] = a[tid] + b[tid];
}


A simple example







__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    c[tid] = a[tid] + b[tid];
}

– add<<<N,1>>>: language extension to invoke our kernel with 'N' blocks executing in parallel, one thread per block
– int tid = blockIdx.x: identify your block number
– c[tid] = a[tid] + b[tid]: add the single element whose index matches your block number


A simple example – the effect at runtime







Grid of 'N' blocks executing in parallel

__global__ void add(int *a,... c[0] = a[0] + b[0];}
__global__ void add(int *a,... c[1] = a[1] + b[1];}
__global__ void add(int *a,... c[2] = a[2] + b[2];}
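The grid-of-blocks idea can be mimicked on the CPU to make the indexing concrete. The following is only an emulation in plain Java, not GPU code: the class and method names (GridEmulation, addKernel, launch) are invented for illustration, and a parallel stream stands in for the N blocks.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** CPU-only emulation of launching a kernel over a grid of N blocks (illustrative names). */
final class GridEmulation {

    /** Plays the role of the kernel body: one "block" handles the element at its own index. */
    static void addKernel(int blockIdx, int[] a, int[] b, int[] c) {
        int tid = blockIdx;
        c[tid] = a[tid] + b[tid];
    }

    /** Plays the role of add<<<N,1>>>: run n independent blocks, in no particular order. */
    static void launch(int n, int[] a, int[] b, int[] c) {
        IntStream.range(0, n).parallel().forEach(block -> addKernel(block, a, b, c));
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {10, 20, 30};
        int[] c = new int[3];
        launch(3, a, b, c);
        System.out.println(Arrays.toString(c)); // prints [11, 22, 33]
    }
}
```

Because every block writes a different element of c, the blocks need no coordination, which is what makes this workload a good fit for the GPU.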


Goal: Bringing GPU programming into Java

There are times when you want this low level GPU control from Java

Produce an API that reflects the concepts familiar in CUDA programming

Make use of Java exceptions, automatic resource management, etc.

Handle copying data to/from the GPU, flow of control from Java to GPU and back, etc.

Ability to invoke existing GPU module code from Java applications, e.g. Thrust

New Java APIs

CudaDevice – a CUDA capable GPU device
CudaBuffer – a region of memory on the GPU
CudaModule – user library of kernels to load into GPU
CudaKernel – for launching a device function
CudaFunction – a kernel's entry point
CudaEvent – for timing and synchronization
CudaException – for when something goes wrong


Fundamental types in CUDA4J

[Diagram: the CUDA4J types and how they relate, spanning Java, the device execution engine and device memory]

Types shown: CudaDevice, CudaModule, CudaKernel, CudaFunction, CudaGrid, CudaBuffer, CudaStream, CudaEvent, CudaGlobal, CudaLinker, CudaSurface, CudaTexture

Diagram notes:
– A PTX module holds the device functions: .func add { … }, .func foo { … }, .func bar { … }
– CudaLinker: used to combine multiple cubin/fatbin/PTXs into a single module
– Marked types correspond to a HW feature in the GPU
– Device events
– Arrows show either a relationship for generating an instance or a relationship as an argument


Explicit GPU vector addition in Java

[Code listing; callouts, in order:]
– native module containing the kernel
– first GPU device
– grid of kernels that will execute this task
– move data from Java heap to device
– invoke the task
– move the result back to the Java heap


Limitations and considerations

Allows developers to code explicitly for the GPU
– These are new APIs that give close control of the device
– Uses familiar concepts and paradigms for GPU experts
– Convenience and productivity improvements from language
– Fundamental building blocks for higher level algorithms

Requires the developer to identify suitable GPU workloads
– Re-code routines to operate on data in parallel
– Minimize branching flow of control in kernels

Amortizing overhead of moving work to GPU
– Time taken to copy data between host and device over PCIe
– Overhead of switching flow of control from CPU to GPU

Flickr: Gareth Halfacree


Applying GPUs to Big Data problems

Large retailers are collecting petabytes of information based on customer transactions.

Can we identify customers with similar behavior, such that retailers can create customized products, effective marketing campaigns, etc?

Businesses that react quickly to changes in market segments have an advantage.

Challenge: Can we bring the power of GPUs to big data problems written in Java?

Problem: Identify 20 – 25 year olds who are interested in smart phone products

Solution: Put a K-means algorithm onto the GPU to identify clusters in a large dataset
– Apache Mahout is a machine learning library for clustering and collaborative filtering built on Apache Hadoop


K-means clustering – finding groups of data points

Problem: Iteratively refine the location of 'K' loci, identifying clusters of 'N' data points, where each data point has 'D' dimensions

Approach: For each data point, which of the loci is closest?
– Finding the optimal clustering is NP-hard; each iteration of the heuristic requires N x K x D independent operations
– Adjust K loci and iterate towards a stable answer

The map computation can be parallelized (at a finer granularity than managed by Apache Hadoop)

A new GPU-enabled mapper class was created:
– batches of points are transferred from Java to a GPU for mapping
– a partial reduction is returned to the host
– a single Apache Mahout class was modified to use our new mapper class

Wikimedia: Chire
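To make the N x K x D cost concrete, here is a minimal CPU-only sketch of one K-means iteration in plain Java. It is not the Apache Mahout mapper or the GPU kernel from this talk; the class and method names (KMeansStep, assign, updateLoci) are invented for illustration, and squared Euclidean distance is assumed as the distance measure.

```java
import java.util.stream.IntStream;

/** Sketch of one K-means iteration: assign N points to K loci, then move the loci. */
final class KMeansStep {

    /** Squared Euclidean distance between two D-dimensional vectors. */
    static double squaredDistance(double[] p, double[] q) {
        double sum = 0.0;
        for (int d = 0; d < p.length; d++) {
            double diff = p[d] - q[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Map step: for each point, the index of the closest locus (N x K x D work). */
    static int[] assign(double[][] points, double[][] loci) {
        int[] nearest = new int[points.length];
        // Each point is independent of the others, so the outer loop can run in parallel.
        IntStream.range(0, points.length).parallel().forEach(n -> {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int k = 0; k < loci.length; k++) {
                double dist = squaredDistance(points[n], loci[k]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = k;
                }
            }
            nearest[n] = best;
        });
        return nearest;
    }

    /** Reduce step: each locus moves to the mean of the points assigned to it. */
    static double[][] updateLoci(double[][] points, int[] nearest, double[][] loci) {
        int dims = loci[0].length;
        double[][] sums = new double[loci.length][dims];
        int[] counts = new int[loci.length];
        for (int n = 0; n < points.length; n++) {
            counts[nearest[n]]++;
            for (int d = 0; d < dims; d++) {
                sums[nearest[n]][d] += points[n][d];
            }
        }
        for (int k = 0; k < loci.length; k++) {
            for (int d = 0; d < dims; d++) {
                // A locus with no assigned points keeps its previous position.
                sums[k][d] = counts[k] == 0 ? loci[k][d] : sums[k][d] / counts[k];
            }
        }
        return sums;
    }
}
```

Calling assign and then updateLoci repeatedly, until the assignments stop changing, gives the "iterate towards a stable answer" loop. The per-locus sums and counts computed in updateLoci are the kind of partial reduction a mapper could return to the host.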


Mapper speed-up

GPU vs. CPU

– Speed-up factor just doing the map/reduce for different values of K and D

– Baseline is standard Apache Mahout implementation on CPU

– Comparison excludes Apache Hadoop framework and I/O

End-to-end speed-up solving the problem was ~8x

[Chart: IBM Power 8 with Nvidia K40m GPU. Axes: number of loci being considered (K), number of variables per data point (D), speed-up multiplier for GPU compared to CPU]


GPU-enabling standard Java SE APIs

Natural question after seeing the good speed-ups using explicit programming …

What areas of the standard Java API implementation are suitable for off-loading onto GPU?

We picked two candidates initially:

java.util.Arrays.sort(int[] a) and friends
– GPU modules exist that do efficient sorting

java.nio.charset.CharsetEncoder
– data-driven character set mapping of Strings

IBM Developer Kits for Java: http://ibm.com/java/jdk


GPU-enhanced sorting from Java – heuristics

We employ heuristics that determine if the work should be off-loaded to the GPU.

Overhead of moving data to GPU, invoking kernel, and returning results means small sorts (<~20k elements) are faster on the CPU.

Host may have multiple GPUs. Are any available for the task?

Is there space for conducting the sort on the device?

Arrays.sort(myData);
– Is the problem large enough? If no, sort on CPU
– Is there a GPU currently available? If no, sort on CPU
– Is the device capable? If no, sort on CPU
– Yes to all three: sort on GPU
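The decision chain above can be sketched in plain Java. This is not IBM's implementation: the Gpu interface, the FakeGpu stand-in and the exact threshold are hypothetical, and "capable" is assumed here to mean only that the device has room for the array.

```java
import java.util.Arrays;

/** Sketch of an off-load decision for sorting; every name and threshold is hypothetical. */
final class SortOffload {

    /** Assumed break-even size, taken from the "~20k elements" figure above. */
    static final int MIN_GPU_ELEMENTS = 20_000;

    /** Hypothetical view of a GPU: is it free, and how much memory does it have left? */
    interface Gpu {
        boolean isAvailable();
        long freeBytes();
        void sort(int[] data);
    }

    enum Target { CPU, GPU }

    /** Sorts in place and reports where the work ran; any "no" falls back to the CPU. */
    static Target sort(int[] data, Gpu[] gpus) {
        if (data.length >= MIN_GPU_ELEMENTS) {                 // is the problem large enough?
            long needed = (long) data.length * Integer.BYTES;
            for (Gpu gpu : gpus) {
                // Is this GPU available, and does it have space for the sort?
                if (gpu.isAvailable() && gpu.freeBytes() >= needed) {
                    gpu.sort(data);
                    return Target.GPU;
                }
            }
        }
        Arrays.sort(data);
        return Target.CPU;
    }

    /** Stand-in device for demonstration only; it "sorts" by delegating to the CPU. */
    static final class FakeGpu implements Gpu {
        private final boolean available;
        private final long freeBytes;

        FakeGpu(boolean available, long freeBytes) {
            this.available = available;
            this.freeBytes = freeBytes;
        }

        public boolean isAvailable() { return available; }
        public long freeBytes() { return freeBytes; }
        public void sort(int[] data) { Arrays.sort(data); }
    }
}
```

The caller still just asks for a sorted array. The choice of device stays behind the method, which is the property that lets Arrays.sort be GPU-enabled without application changes.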


GPU-enabled array sort method



GPU-enabled charset conversion

Did not achieve speed-ups on charset encoding/decoding

Cost of moving the data to/from GPU outweighed the benefit

We have spent effort optimizing the CPU version with a number of JIT specials

Rule of thumb on the system we have is that you need to execute 'tens of instructions per data point' on the GPU in order to see a benefit over the CPU

Flickr: rifqidahlgren


Beyond specific APIs – Java 8 streams

Streams allow developers to express computation as aggregate parallel operations on data

For example:

IntStream.range(0, N).parallel().forEach(i -> c[i] = a[i] + b[i]);

creates a stream whose operations can be executed in parallel

What if we could recognize the terminal operation and conduct it on the GPU?
– Reuses standard Java idioms, so no code changes required
– No knowledge of GPU programming model required by the application developer
– But no low-level manipulation of the device – the Java implementation has the controls
– Future smarts introduced into the JIT do not require application code changes
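For reference, a complete program built around the stream expression looks like this. It uses only the standard java.util.stream API and runs on any Java 8 runtime. The class name StreamVectorAdd is illustrative, and nothing in the code asks for a GPU: any off-loading would be the runtime's decision.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Vector addition written as a Java 8 parallel stream (class name is illustrative). */
final class StreamVectorAdd {

    /** Returns a new array c where c[i] = a[i] + b[i]. */
    static int[] add(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int[] c = new int[a.length];
        // Each index is written exactly once, so the iterations are independent.
        IntStream.range(0, a.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
        return c;
    }

    public static void main(String[] args) {
        int[] sum = add(new int[]{1, 2, 3}, new int[]{10, 20, 30});
        System.out.println(Arrays.toString(sum)); // prints [11, 22, 33]
    }
}
```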


JIT optimized GPU acceleration

Early steps

– Recognize a limited set of operations within the lambda expressions
   • notably no object references maintained on GPU

– Default grid dimensions and operating parameters for the GPU workload

– Redundant/pessimistic data transfer between host and device
   • Not using GPU shared memory

– Limited heuristics about when to invoke the GPU and when to generate CPU instructions

As the JIT compiles a stream expression we can identify candidates for GPU off-loading
– Arrays copied to and from the device implicitly
– Java operations mapped to GPU kernel operations
– Preserves the standard Java syntax and semantics

[Diagram: bytecodes → intermediate representation → optimizer → either CPU code generator → CPU native, or GPU code generator → PTX ISA]


JIT / GPU optimization of Lambda expression

JIT recognized Java code for matrix multiplication using Java 8 parallel stream

Speed-up factor when run on a GPU enabled host
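The code measured on this slide is not reproduced in the deck, so the following is only a guess at its general shape: matrix multiplication over row-major arrays, written as a Java 8 parallel stream. The class name, the flat array layout and the one-stream-element-per-output-cell decomposition are all assumptions, and no claim is made that a particular JIT recognizes this exact form.

```java
import java.util.stream.IntStream;

/** Sketch of n x n matrix multiplication as a parallel stream (layout is an assumption). */
final class StreamMatMul {

    /** Multiplies two n x n matrices stored row-major in flat arrays. */
    static double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        // One stream element per output cell; cells do not depend on each other.
        IntStream.range(0, n * n).parallel().forEach(idx -> {
            int row = idx / n;
            int col = idx % n;
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[row * n + k] * b[k * n + col];
            }
            c[idx] = sum;
        });
        return c;
    }
}
```

Each output cell performs n multiply-adds on data that is already in the arrays, which fits the "tens of instructions per data point" rule of thumb better than simple vector addition does.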



Conclusions

✔ GPUs provide a powerful co-processor suited to handling large amounts of data

✔ Workloads need to be carefully chosen to exploit the GPU's specific capabilities

✔ Custom APIs make the GPU accessible to Java application developers

✔ IBM's Java implementation is introducing heuristics to off-load calls to standard library APIs from CPU to GPU

✔ JIT optimization of user code will increasingly move your application code to custom co-processors, including GPU

✔ We have seen good speed-ups of around 8x to 10x with real world Big Data applications running on GPU devices


http://ibm.biz/javaone2014



Copyright and Trademarks

© IBM Corporation 2014. All Rights Reserved.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., and registered in many jurisdictions worldwide.

Other product and service names might be trademarks of IBM or other companies.

A current list of IBM trademarks is available on the Web – see the IBM “Copyright and trademark information” page at URL: www.ibm.com/legal/copytrade.shtml
