Heterogeneous² Computing with rCUDA
Federico Silla
Universitat Politècnica de València
HPC Advisory Council Swiss Conference 2018
Outline
- What is heterogeneous² computing?
- What is rCUDA?
Basics of GPU computing
Basic behavior of CUDA.
Remark: GPUs can only be used within the node they are attached to.
[Figure: a CUDA application using the GPU attached to its own node]
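That call pattern is worth making concrete. Below is a minimal sketch of a complete CUDA program following the flow the slide describes (the kernel and sizes are illustrative, not taken from the talk): memory is allocated on the GPU, inputs are copied in, a kernel runs, and results are copied back.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative kernel: element-wise vector addition.
    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes);                              // allocate on the GPU
        cudaMalloc(&db, bytes);
        cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // host-to-device copies
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
        vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);       // kernel launch
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // device-to-host copy
        printf("c[0] = %f\n", hc[0]);                        // expect 3.0

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }

Every one of these CUDA calls assumes a GPU in the local node; rCUDA's contribution, described next, is to run exactly this kind of program against a GPU located elsewhere.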
A different approach: remote GPU virtualization
A software technology that enables a more flexible use of GPUs in computing facilities.
[Figure: a node with no GPU transparently uses GPUs installed in other nodes]
Basics of rCUDA
rCUDA … remote CUDA. rCUDA is a development by Universitat Politècnica de València.
Access to the remote GPU is transparent to applications: no source code modification is needed.
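In practice (a sketch of the usual rCUDA mechanism, not a claim made on this slide): the rCUDA client is a library exposing the same interface as the CUDA runtime, so the unmodified application binary is simply pointed at it, typically via LD_LIBRARY_PATH. Each intercepted CUDA call is forwarded over the network to the rCUDA server daemon on the node that owns the GPU, executed on the real device, and its results are sent back. Which remote GPUs a client sees is selected through environment variables such as RCUDA_DEVICE_COUNT and RCUDA_DEVICE_0 (names as in the rCUDA user guide; the exact syntax may vary between versions).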
rCUDA vision
rCUDA allows a new vision of GPU deployment, moving from the usual cluster configuration (the physical configuration, with GPUs attached to individual nodes) … to the following one (the logical configuration, where the GPUs form a pool accessible from any node).
[Figure: physical configuration versus logical configuration of a GPU cluster]
Performance of rCUDA
[Figure: execution time with CUDA versus rCUDA; lower is better. Testbeds: K20 GPU with FDR InfiniBand, and K40 GPU with EDR InfiniBand]
[Figure: execution time of CUDA-MEME and BarraCUDA; lower is better. P100 GPU and EDR InfiniBand]
Outline
- Benefits of rCUDA?
Providing many GPUs to applications with rCUDA
MonteCarlo multi-GPU program running on 14 NVIDIA Tesla K20 GPUs (K20 GPUs and FDR InfiniBand).
[Figure: execution time as the number of GPUs grows; lower is better]
The same approach scales much further: 64 GPUs!
[Figure: the application running with 64 remote GPUs]
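This works because a multi-GPU CUDA program normally just enumerates devices and spreads work across them; with rCUDA, the device count can include GPUs living in other nodes, so the same loop spans the cluster. A minimal sketch of that structure (illustrative code, not the MonteCarlo program from the slide):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative kernel: each GPU processes its own chunk of data.
    __global__ void work(float *chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] = chunk[i] * 2.0f + 1.0f;
    }

    int main(void) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);        // under rCUDA this may report remote GPUs
        printf("visible GPUs: %d\n", ndev);

        const int n = 1 << 20;            // elements per GPU (illustrative)
        float **dbuf = (float **)malloc(ndev * sizeof(float *));

        for (int d = 0; d < ndev; d++) {
            cudaSetDevice(d);             // select GPU d, local or remote
            cudaMalloc(&dbuf[d], n * sizeof(float));
            cudaMemset(dbuf[d], 0, n * sizeof(float));
            work<<<(n + 255) / 256, 256>>>(dbuf[d], n);  // launches proceed concurrently
        }
        for (int d = 0; d < ndev; d++) {  // wait for every GPU, then clean up
            cudaSetDevice(d);
            cudaDeviceSynchronize();
            cudaFree(dbuf[d]);
        }
        free(dbuf);
        return 0;
    }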
Server consolidation with rCUDA
[Figure: GPU utilization (%) per GPU, from 0 to 100; GPUs 1, 3, 7, 9, 12, 13 and 14 are marked off once their workloads have been consolidated onto the remaining servers]
Server consolidation with rCUDA
- The GPU-Blast application is migrated up to 5 times among K40 GPUs.
- The aggregated volume of GPU data is 1300 MB (consisting of 9 memory regions).
- The "Reference" line is the execution time of the application when using CUDA with a local GPU and without any migration.
[Figure: execution time versus number of migrations, compared against the Reference line; lower is better]
Outline
- Heterogeneous² environments
rCUDA availability
rCUDA is available for the x86, POWER and ARM processors.
Outline
- Performance of rCUDA on POWER systems
From x86 to POWER with rCUDA
Several testbeds used:
[Figure: three testbeds (#1, #2, #3) combining native CUDA with rCUDA client/server pairs; the machines involved are x86 @ 2.1 GHz with DDR3 1600 MHz and x86 @ 3.5 GHz with DDR4 2400 MHz, and the network fabric is EDR InfiniBand]
Performance of data movements to/from GPU
[Figure: host-to-device (H2D) and device-to-host (D2H) bandwidth for CUDA and rCUDA; higher is better]
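Bandwidth plots like these typically come from a copy microbenchmark. A minimal sketch of how the H2D direction can be measured (pinned host memory plus CUDA events; buffer size and iteration count are illustrative, not the ones used in the talk):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const size_t bytes = 64UL << 20;   // 64 MiB per transfer (illustrative)
        const int iters = 20;

        float *host, *dev;
        cudaMallocHost(&host, bytes);      // pinned memory, needed for peak bandwidth
        cudaMalloc(&dev, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);

        cudaEventRecord(t0);
        for (int i = 0; i < iters; i++)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // H2D copy
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double gbps = (double)bytes * iters / (ms / 1000.0) / 1e9;
        printf("H2D bandwidth: %.2f GB/s\n", gbps);
        // The D2H direction is measured the same way with cudaMemcpyDeviceToHost.

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(dev); cudaFreeHost(host);
        return 0;
    }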
Performance of data movements among GPUs
[Figure: peer-to-peer (P2P) bandwidth between GPUs for CUDA and for two rCUDA scenarios; higher is better]
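The device-to-device transfers behind these P2P numbers can be exercised with the standard CUDA peer API, sketched below (illustrative; whether the copy travels directly over PCIe/NVLink, is staged through the host, or crosses the network in the rCUDA case is resolved underneath the same calls):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const size_t bytes = 64UL << 20;   // 64 MiB (illustrative)
        float *buf0, *buf1;

        cudaSetDevice(0);
        cudaMalloc(&buf0, bytes);
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);        // can GPU 0 address GPU 1 directly?
        if (can) cudaDeviceEnablePeerAccess(1, 0);  // enable direct P2P when supported

        cudaSetDevice(1);
        cudaMalloc(&buf1, bytes);

        // Copy from GPU 0 to GPU 1; CUDA stages the copy through the host
        // when direct peer access is unavailable.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();
        printf("copied %zu bytes GPU0 -> GPU1\n", bytes);

        cudaSetDevice(0); cudaFree(buf0);
        cudaSetDevice(1); cudaFree(buf1);
        return 0;
    }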
Application performance
Several applications have been analyzed in this study:
1. BarraCUDA
2. CUDA-MEME
3. CUDASW++
4. GPU-Blast
5. Gromacs
6. GPU-LIBSVM
7. Magma
8. NAMD
Application performance
Unfortunately, we could not run all the applications on the Minsky system:
1. BarraCUDA: includes (x86) intrinsics headers, so it cannot be built on POWER
2. CUDA-MEME: successfully compiled and executed
3. CUDASW++: includes (x86) intrinsics headers, so it cannot be built on POWER
4. GPU-Blast: we were not able to compile it
5. Gromacs: successfully compiled and executed
6. GPU-LIBSVM: successfully compiled and executed
7. Magma: successfully compiled and executed
8. NAMD: we were not able to compile it
Application performance
[Figure: execution time of the applications with CUDA and with rCUDA; lower is better]
Outline
- Throughput instead of performance
One rCUDA box serves multiple clients
[Figure: many client nodes sharing the GPUs of a single rCUDA server]
One rCUDA box serves multiple clients
Applications used:
1. BarraCUDA
2. CUDA-MEME
3. CUDASW++
4. GPU-Blast
5. Gromacs
6. Magma
[Figure: execution time as several clients share the server's GPUs; lower is better]
One rCUDA box serves multiple clients
[Figure: sharing the remote GPUs reduces overall execution time by 58%; the intervals labeled "GPU assigned but not used" show GPUs reserved by a job while sitting idle]
Outline
- Performance of rCUDA on ARM systems
From ARM to x86 with rCUDA
[Figure: testbed including a Cavium ThunderX ARM system acting as the rCUDA client]
Application performance
Work in progress. A couple of applications have already been analyzed:
1. CloverLeaf: a mini-app that solves the compressible Euler equations on a Cartesian grid
2. Flow: a mini-app that implements a 2D hydrodynamics simulator
Application performance: CloverLeaf
[Figure: execution time; lower is better. Left: single-node executions; right: estimation over multiple nodes]
Application performance: CloverLeaf
[Figure: execution time; lower is better. Single-node executions and estimation over multiple nodes]
Power comparison using TDP figures: ThunderX TDP = 80 watts, P100 TDP = 250 watts, Xeon TDP = 140 watts.
40 ThunderX-only nodes give 40 × 80 = 3200 watts, versus one ThunderX node plus a remote GPU server with 3 P100 GPUs and 2 Xeons: 80 + 3 × 250 + 2 × 140 = 1110 watts.
Application performance: Flow
[Figure: execution time; lower is better. Left: single-node executions; right: estimation over multiple nodes]
Application performance: Flow
[Figure: execution time; lower is better. Single-node executions and estimation over multiple nodes]
Power comparison using TDP figures: ThunderX TDP = 80 watts, P100 TDP = 250 watts, Xeon TDP = 140 watts.
60 ThunderX-only nodes give 60 × 80 = 4800 watts, versus one ThunderX node plus a remote GPU server with 3 P100 GPUs and 2 Xeons: 80 + 3 × 250 + 2 × 140 = 1110 watts.
Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_
More than 900 requests worldwide
rCUDA is a development by Universitat Politècnica de València, Spain:
Tony Díaz · Pablo Higueras · Javier Prades · Jaime Sierra · Cristian Peñaranda · Federico Silla · Carlos Reaño