rCUDA:
desde máquinas virtuales a
clústers mixtos CPU-GPU
Federico Silla
Universitat Politècnica de València
HPC ADMINTECH 2018
rCUDA:
from virtual machines to
hybrid CPU-GPU clusters
HPC ADMINTECH 2018
Federico Silla
Universitat Politècnica de València
What is rCUDA?
Outline
F. Silla @ HPC ADMINTECH 2018
Basics of GPU computing
Remark:
GPUs can only be used within
the node they are attached to
Basic behavior of CUDA
GPU
F. Silla @ HPC ADMINTECH 2018
Remark:
GPUs can only be used within
the node they are attached to
GPU
Basics of GPU computing
Basic behavior of CUDA
F. Silla @ HPC ADMINTECH 2018
No GPU
A different approach: remote GPU virtualization
F. Silla @ HPC ADMINTECH 2018
A software technology that enables a more flexible
use of GPUs in computing facilities
rCUDA is a development by Universitat Politècnica de València
rCUDA … remote CUDA
A different approach: remote GPU virtualization
No GPU
Access to remote GPU is
transparent to applications:
no source code
modification is needed
rCUDA is a development by Universitat Politècnica de València
Basics or rCUDA
rCUDA is a development by Universitat Politècnica de València
Basics or rCUDA
Access to remote GPU is
transparent to applications:
no source code
modification is needed
rCUDA is a development by Universitat Politècnica de València
Basics or rCUDA
Access to remote GPU is
transparent to applications:
no source code
modification is needed
F. Silla @ HPC ADMINTECH 2018
rCUDA supports RDMA transfers
Physical
configuration
rCUDA allows a new vision of
a GPU deployment, moving from
the usual cluster configuration …
… to the following one:
Logical
configuration
F. Silla @ HPC ADMINTECH 2018
rCUDA envision
Perfomance of rCUDA?
Outline
F. Silla @ HPC ADMINTECH 2018
Loweris better
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
F. Silla @ HPC ADMINTECH 2018
Performance of rCUDA
CUDA-MEMEBarraCUDA
Loweris better
Loweris betterP100 GPU and EDR InfiniBand
F. Silla @ HPC ADMINTECH 2018
Performance of rCUDA
rCUDA scenario 1
rCUDA scenario 2
rCUDACUDA
F. Silla @ HPC ADMINTECH 2018
Performance of data movements among GPUs
Higher is better
F. Silla @ HPC ADMINTECH 2018
Performance of data movements among GPUs
rCUDACUDA
F. Silla @ HPC ADMINTECH 2018
Performance of data movements to/from GPUs
CPU to GPU
GPU to CPU
Higheris better
F. Silla @ HPC ADMINTECH 2018
Performance of data movements to/from GPUs
Higheris better
F. Silla @ HPC ADMINTECH 2018
Performance of data movements to/from GPUs
CPU to GPU
GPU to CPU
Higheris better
F. Silla @ HPC ADMINTECH 2018
Performance of data movements to/from GPUs
CPU to GPU
GPU to CPU
New communication module in progress
F. Silla @ HPC ADMINTECH 2018
Performance of data movements to/from GPUs
Benefits of rCUDA?
Outline
F. Silla @ HPC ADMINTECH 2018
Benefits of rCUDA:
1. Many GPUs for an application
2. Server consolidation
3. GPU acceleration for virtual machines
4. Increased cluster throughput
Outline
F. Silla @ HPC ADMINTECH 2018
F. Silla @ HPC ADMINTECH 2018
Providing many GPUs to an application with rCUDA
Loweris better
MonteCarlo multi-GPU program running in 14 NVIDIA Tesla K20 GPUs
K20 GPUs and FDR InfiniBand
F. Silla @ HPC ADMINTECH 2018
Providing many GPUs to an application with rCUDA
64 GPUs !!
F. Silla @ HPC ADMINTECH 2018
Providing many GPUs to an application with rCUDA
K20 GPUs
Work in progress!!
non-optimized (yet) version of rCUDA!!!
GPU
1
GPU
2
GPU
3
GPU
4
GPU
5GPU
6
GPU
7
GPU
8
GPU
9
GPU
10
GPU
11GPU
12
GPU
13
GPU
14
GPU
15
GPU
16
F. Silla @ HPC ADMINTECH 2018
Providing many GPUs to an application with rCUDA
1
3
7
9
12
13
14
GPU utilization (%)
20 40 60 80 1000
off
off
off
off
off
off
off
F. Silla @ HPC ADMINTECH 2018
Server consolidation with rCUDA
• The GPU-Blast application is migrated up to 5 times among K40 GPUs
• The aggregated volume of GPU data is 1300 MB (consisting of 9 memory regions)
The “Reference” line is the execution time of the application when using CUDA with a local GPU and without any migration
Loweris better
F. Silla @ HPC ADMINTECH 2018
Server consolidation with rCUDA
• How to access the GPU in the native domain from inside of virtual machines?
F. Silla @ HPC ADMINTECH 2018
Virtual machines may need access to GPUs
• The GPU is assigned by using PCI passthrough exclusively to a single virtual machine
• Concurrent usage of the GPU is not possible
F. Silla @ HPC ADMINTECH 2018
Virtual machines may need access to GPUs
• If InfiniBand is available, the rCUDA server can be placed in another node
• Several GPUs can be provided to the VMs, either in a single remote node or in several remote nodes
High performance network fabric available
F. Silla @ HPC ADMINTECH 2018
Using rCUDA to access the GPU
• When InfiniBand is not available, the rCUDA server may be placed in the native domain and the rCUDAclient would be placed inside the VMs
High performance network is not
available
• The virtual network provided by the hypervisor would be used to exchange data between the rCUDA clients and the rCUDA server
• This configuration allows the use of more than one GPU at the host
F. Silla @ HPC ADMINTECH 2018
Using rCUDA to access the GPU
F. Silla @ HPC ADMINTECH 2018
Using rCUDA to access the GPU
...
...
One rCUDA box serves multiple clients
F. Silla @ HPC ADMINTECH 2018
Increased cluster throughput
1. BarraCUDA
2. CUDA-MEME
3. CUDASW++
4. GPU-Blast
5. Gromacs
6. Magma
Lower
is better
- 58%
F. Silla @ HPC ADMINTECH 2018
Increased cluster throughput
GPU assigned but not used
GPU assigned but not used
F. Silla @ HPC ADMINTECH 2018
Increased cluster throughput
One more benefit:
Heterogeneous2 environments
Outline
F. Silla @ HPC ADMINTECH 2018
rCUDA is available for the x86,
POWER and ARM processors
F. Silla @ HPC ADMINTECH 2018
rCUDA availability
Performance of rCUDA
on ARM systems
Outline
F. Silla @ HPC ADMINTECH 2018
ThunderX
F. Silla @ HPC ADMINTECH 2018
From ARM to x86 with rCUDA
Work in progress. A couple of applications have been already
analyzed:
1. Cloverleaf: a mini-app that solves the compressible Euler
equations on a Cartesian grid
2. Flow: a mini-app that implements a 2D hydrodynamics
simulator
F. Silla @ HPC ADMINTECH 2018
Application performance
Lower
is
better
Single node
executions
Estimation over
multiple nodes
F. Silla @ HPC ADMINTECH 2018
Application performance: Cloverleaf
Rough energy estimation:
ThunderX TDP = 80 watts
P100 TDP = 250 watts
Xeon TDP = 140 watts
40*80 versus 1*80+3*250+2*140
3200 watts versus 1110 watts
Application performance: Cloverleaf
Lower
is
better
Single node
executions
Estimation over
multiple nodes
F. Silla @ HPC ADMINTECH 2018
Lower
is
better
Single node
executions
F. Silla @ HPC ADMINTECH 2018
Application performance: Flow
Estimation over
multiple nodes
Rough energy estimation:
ThunderX TDP = 80 watts
P100 TDP = 250 watts
Xeon TDP = 140 watts
60*80 versus 1*80+3*250+2*140
4800 watts versus 1110 watts
Application performance: Flow
Lower
is
better
Single node
executions
F. Silla @ HPC ADMINTECH 2018
Estimation over
multiple nodes
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
Hig
h d
en
sit
y A
RM
-ba
se
d n
od
es
F. Silla @ HPC ADMINTECH 2018
Hybrid CPU-GPU clusters
F. Silla @ HPC ADMINTECH 2018
Hybrid CPU-GPU clusters
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
Hig
h d
en
sit
y A
RM
-ba
se
d n
od
es
rCUDA serversrCUDA clients
F. Silla @ HPC ADMINTECH 2018
Hybrid CPU-GPU clusters
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
No GPU
Hig
h d
en
sit
y A
RM
-ba
se
d n
od
es
Get a free copy of rCUDA at
http://www.rcuda.net
@rcuda_
More than 900 requests world wide
rCUDA is a development by Universitat Politècnica de València, Spain
·Tony Díaz · Pablo Higueras · Javier Prades · Jaime Sierra
· Cristian Peñaranda · Federico Silla · Carlos Reaño
rCUDA is a development by Universitat Politècnica de València, Spain