Heterogeneous² Computing with rCUDA
Federico Silla
Universitat Politècnica de València
HPC Advisory Council Swiss Conference 2018
Outline
- What is heterogeneous² computing?
- What is rCUDA?
Basics of GPU computing
Basic behavior of CUDA.
Remark: GPUs can only be used within the node they are attached to.
[Figure: a CUDA application using the GPU attached to its own node]
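That call pattern is worth making concrete. Below is a minimal sketch of a complete CUDA program following the flow the slide describes (the kernel and sizes are illustrative, not taken from the talk): memory is allocated on the GPU, inputs are copied in, a kernel runs, and results are copied back.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative kernel: element-wise vector addition.
    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes);                              // allocate on the GPU
        cudaMalloc(&db, bytes);
        cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // host-to-device copies
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
        vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);       // kernel launch
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // device-to-host copy
        printf("c[0] = %f\n", hc[0]);                        // expect 3.0

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }

Every one of these CUDA calls assumes a GPU in the local node; rCUDA's contribution, described next, is to run exactly this kind of program against a GPU located elsewhere.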
A different approach: remote GPU virtualization
A software technology that enables a more flexible use of GPUs in computing facilities.
[Figure: a node with no GPU transparently uses GPUs installed in other nodes]
Basics of rCUDA
rCUDA … remote CUDA. rCUDA is a development by Universitat Politècnica de València.
Access to the remote GPU is transparent to applications: no source code modification is needed.
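In practice (a sketch of the usual rCUDA mechanism, not a claim made on this slide): the rCUDA client is a library exposing the same interface as the CUDA runtime, so the unmodified application binary is simply pointed at it, typically via LD_LIBRARY_PATH. Each intercepted CUDA call is forwarded over the network to the rCUDA server daemon on the node that owns the GPU, executed on the real device, and its results are sent back. Which remote GPUs a client sees is selected through environment variables such as RCUDA_DEVICE_COUNT and RCUDA_DEVICE_0 (names as in the rCUDA user guide; the exact syntax may vary between versions).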
rCUDA vision
rCUDA allows a new vision of GPU deployment, moving from the usual cluster configuration (the physical configuration, with GPUs attached to individual nodes) … to the following one (the logical configuration, where the GPUs form a pool accessible from any node).
[Figure: physical configuration versus logical configuration of a GPU cluster]
Performance of rCUDA
[Figure: execution time with CUDA versus rCUDA; lower is better. Testbeds: K20 GPU with FDR InfiniBand, and K40 GPU with EDR InfiniBand]
[Figure: execution time of CUDA-MEME and BarraCUDA; lower is better. P100 GPU and EDR InfiniBand]
Outline
- Benefits of rCUDA?
Providing many GPUs to applications with rCUDA
MonteCarlo multi-GPU program running on 14 NVIDIA Tesla K20 GPUs (K20 GPUs and FDR InfiniBand).
[Figure: execution time as the number of GPUs grows; lower is better]
The same approach scales much further: 64 GPUs!
[Figure: the application running with 64 remote GPUs]
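This works because a multi-GPU CUDA program normally just enumerates devices and spreads work across them; with rCUDA, the device count can include GPUs living in other nodes, so the same loop spans the cluster. A minimal sketch of that structure (illustrative code, not the MonteCarlo program from the slide):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative kernel: each GPU processes its own chunk of data.
    __global__ void work(float *chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] = chunk[i] * 2.0f + 1.0f;
    }

    int main(void) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);        // under rCUDA this may report remote GPUs
        printf("visible GPUs: %d\n", ndev);

        const int n = 1 << 20;            // elements per GPU (illustrative)
        float **dbuf = (float **)malloc(ndev * sizeof(float *));

        for (int d = 0; d < ndev; d++) {
            cudaSetDevice(d);             // select GPU d, local or remote
            cudaMalloc(&dbuf[d], n * sizeof(float));
            cudaMemset(dbuf[d], 0, n * sizeof(float));
            work<<<(n + 255) / 256, 256>>>(dbuf[d], n);  // launches proceed concurrently
        }
        for (int d = 0; d < ndev; d++) {  // wait for every GPU, then clean up
            cudaSetDevice(d);
            cudaDeviceSynchronize();
            cudaFree(dbuf[d]);
        }
        free(dbuf);
        return 0;
    }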
Server consolidation with rCUDA
[Figure: GPU utilization (%) per GPU, from 0 to 100; GPUs 1, 3, 7, 9, 12, 13 and 14 are marked off once their workloads have been consolidated onto the remaining servers]
Server consolidation with rCUDA
- The GPU-Blast application is migrated up to 5 times among K40 GPUs.
- The aggregated volume of GPU data is 1300 MB (consisting of 9 memory regions).
- The "Reference" line is the execution time of the application when using CUDA with a local GPU and without any migration.
[Figure: execution time versus number of migrations, compared against the Reference line; lower is better]
Outline
- Heterogeneous² environments
rCUDA availability
rCUDA is available for the x86, POWER and ARM processors.
Outline
- Performance of rCUDA on POWER systems
From x86 to POWER with rCUDA
Several testbeds used:
[Figure: three testbeds (#1, #2, #3) combining native CUDA with rCUDA client/server pairs; the machines involved are x86 @ 2.1 GHz with DDR3 1600 MHz and x86 @ 3.5 GHz with DDR4 2400 MHz, and the network fabric is EDR InfiniBand]
Performance of data movements to/from GPU
[Figure: host-to-device (H2D) and device-to-host (D2H) bandwidth for CUDA and rCUDA; higher is better]
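Bandwidth plots like these typically come from a copy microbenchmark. A minimal sketch of how the H2D direction can be measured (pinned host memory plus CUDA events; buffer size and iteration count are illustrative, not the ones used in the talk):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const size_t bytes = 64UL << 20;   // 64 MiB per transfer (illustrative)
        const int iters = 20;

        float *host, *dev;
        cudaMallocHost(&host, bytes);      // pinned memory, needed for peak bandwidth
        cudaMalloc(&dev, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);

        cudaEventRecord(t0);
        for (int i = 0; i < iters; i++)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // H2D copy
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double gbps = (double)bytes * iters / (ms / 1000.0) / 1e9;
        printf("H2D bandwidth: %.2f GB/s\n", gbps);
        // The D2H direction is measured the same way with cudaMemcpyDeviceToHost.

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(dev); cudaFreeHost(host);
        return 0;
    }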
Performance of data movements among GPUs
[Figure: peer-to-peer (P2P) bandwidth between GPUs for CUDA and for two rCUDA scenarios; higher is better]
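The device-to-device transfers behind these P2P numbers can be exercised with the standard CUDA peer API, sketched below (illustrative; whether the copy travels directly over PCIe/NVLink, is staged through the host, or crosses the network in the rCUDA case is resolved underneath the same calls):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const size_t bytes = 64UL << 20;   // 64 MiB (illustrative)
        float *buf0, *buf1;

        cudaSetDevice(0);
        cudaMalloc(&buf0, bytes);
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);        // can GPU 0 address GPU 1 directly?
        if (can) cudaDeviceEnablePeerAccess(1, 0);  // enable direct P2P when supported

        cudaSetDevice(1);
        cudaMalloc(&buf1, bytes);

        // Copy from GPU 0 to GPU 1; CUDA stages the copy through the host
        // when direct peer access is unavailable.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();
        printf("copied %zu bytes GPU0 -> GPU1\n", bytes);

        cudaSetDevice(0); cudaFree(buf0);
        cudaSetDevice(1); cudaFree(buf1);
        return 0;
    }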
Application performance
Several applications have been analyzed in this study:
1. BarraCUDA
2. CUDA-MEME
3. CUDASW++
4. GPU-Blast
5. Gromacs
6. GPU-LIBSVM
7. Magma
8. NAMD
Application performance
Unfortunately, we could not run all the applications on the Minsky system:
1. BarraCUDA: includes (x86) intrinsics headers, so it cannot be built on POWER
2. CUDA-MEME: successfully compiled and executed
3. CUDASW++: includes (x86) intrinsics headers, so it cannot be built on POWER
4. GPU-Blast: we were not able to compile it
5. Gromacs: successfully compiled and executed
6. GPU-LIBSVM: successfully compiled and executed
7. Magma: successfully compiled and executed
8. NAMD: we were not able to compile it
Application performance
[Figure: execution time of the applications with CUDA and with rCUDA; lower is better]
Outline
- Throughput instead of performance
One rCUDA box serves multiple clients
[Figure: many client nodes sharing the GPUs of a single rCUDA server]
One rCUDA box serves multiple clients
Applications used:
1. BarraCUDA
2. CUDA-MEME
3. CUDASW++
4. GPU-Blast
5. Gromacs
6. Magma
[Figure: execution time as several clients share the server's GPUs; lower is better]
One rCUDA box serves multiple clients
[Figure: sharing the remote GPUs reduces overall execution time by 58%; the intervals labeled "GPU assigned but not used" show GPUs reserved by a job while sitting idle]
Outline
- Performance of rCUDA on ARM systems
From ARM to x86 with rCUDA
[Figure: testbed including a Cavium ThunderX ARM system acting as the rCUDA client]
Application performance
Work in progress. A couple of applications have already been analyzed:
1. CloverLeaf: a mini-app that solves the compressible Euler equations on a Cartesian grid
2. Flow: a mini-app that implements a 2D hydrodynamics simulator
Application performance: CloverLeaf
[Figure: execution time; lower is better. Left: single-node executions; right: estimation over multiple nodes]
Application performance: CloverLeaf
[Figure: execution time; lower is better. Single-node executions and estimation over multiple nodes]
Power comparison using TDP figures: ThunderX TDP = 80 watts, P100 TDP = 250 watts, Xeon TDP = 140 watts.
40 ThunderX-only nodes give 40 × 80 = 3200 watts, versus one ThunderX node plus a remote GPU server with 3 P100 GPUs and 2 Xeons: 80 + 3 × 250 + 2 × 140 = 1110 watts.
Application performance: Flow
[Figure: execution time; lower is better. Left: single-node executions; right: estimation over multiple nodes]
Application performance: Flow
[Figure: execution time; lower is better. Single-node executions and estimation over multiple nodes]
Power comparison using TDP figures: ThunderX TDP = 80 watts, P100 TDP = 250 watts, Xeon TDP = 140 watts.
60 ThunderX-only nodes give 60 × 80 = 4800 watts, versus one ThunderX node plus a remote GPU server with 3 P100 GPUs and 2 Xeons: 80 + 3 × 250 + 2 × 140 = 1110 watts.
Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_
More than 900 requests worldwide
rCUDA is a development by Universitat Politècnica de València, Spain:
Tony Díaz · Pablo Higueras · Javier Prades · Jaime Sierra · Cristian Peñaranda · Federico Silla · Carlos Reaño