TRANSCRIPT
MVAPICH2 and GPUDirect RDMA
Dhabaleswar K. (DK) Panda, Hari Subramoni and Sreeram Potluri
The Ohio State University
E-mail: {panda, subramon, potluri}@cse.ohio-state.edu
https://mvapich.cse.ohio-state.edu/
Presentation at HPC Advisory Council, June 2013
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: Number of Clusters (0-500) and Percentage of Clusters (0-100%) plotted against the Top500 list timeline]
Large-scale InfiniBand Installations
• 224 IB Clusters (44.8%) in the November 2012 Top500 list (http://www.top500.org)
• Installations in the Top 40 (16 systems):
  – 147,456 cores (SuperMUC) in Germany (6th)
  – 204,900 cores (Stampede) at TACC (7th)
  – 77,184 cores (Curie thin nodes) at France/CEA (11th)
  – 120,640 cores (Nebulae) at China/NSCS (12th)
  – 72,288 cores (Yellowstone) at NCAR (13th)
  – 125,980 cores (Pleiades) at NASA/Ames (14th)
  – 70,560 cores (Helios) at Japan/IFERC (15th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (17th)
  – 138,368 cores (Tera-100) at France/CEA (20th)
  – 122,400 cores (Roadrunner) at LANL (22nd)
  – 53,504 cores (PRIMERGY) at Australia/NCI (24th)
  – 78,660 cores (Lomonosov) in Russia (26th)
  – 137,200 cores (Sunway Blue Light) in China (28th)
  – 46,208 cores (Zin) at LLNL (29th)
  – 33,664 cores (MareNostrum) at Spain/BSC (36th)
  – 32,256 cores (SGI Altix X) at Japan/CRIEPI (39th)
  – More are getting installed!
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 173,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters:
    • 7th-ranked 204,900-core cluster (Stampede) at TACC
    • 14th-ranked 125,980-core cluster (Pleiades) at NASA
    • 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
MVAPICH2 1.9 and MVAPICH2-X 1.9
• Released on 05/06/13
• Major features and enhancements:
  – Based on MPICH-3.0.3
    • Support for all MPI-3 features
  – Support for single-copy intra-node communication using the Linux-supported CMA (Cross Memory Attach) mechanism (see the sketch after this list)
    • Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  – Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart library (SCR)
    • Support for application-level checkpointing
    • Support for hierarchical system-level checkpointing
  – Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  – Improved and tuned MPI communication from GPU device memory
  – Improved job startup time
    • New runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  – Revamped build system with support for parallel builds
• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 1.9, including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d
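For reference, CMA's single-copy primitive is the Linux process_vm_readv()/process_vm_writev() syscall pair, which lets one process copy directly between its own buffer and another process's address space without staging through a shared-memory buffer. The following is a minimal sketch of the mechanism itself, not MVAPICH2 internals; the peer's pid and buffer address are assumed to have been exchanged beforehand (e.g., over shared memory):

    /* Minimal CMA sketch (not MVAPICH2 source): copy len bytes from a
     * peer process's buffer in one kernel-mediated copy.
     * Requires Linux >= 3.2 and glibc >= 2.15. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <stdio.h>

    ssize_t cma_read(pid_t remote_pid, void *remote_addr,
                     void *local_buf, size_t len)
    {
        struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        /* One syscall, one copy: no intermediate shared-memory bounce. */
        ssize_t n = process_vm_readv(remote_pid, &local, 1, &remote, 1, 0);
        if (n < 0)
            perror("process_vm_readv");
        return n;
    }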
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
One-way Latency: MPI over IB
[Charts: Small Message Latency and Large Message Latency, Latency (us) vs. Message Size (bytes), for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR; small-message latency callouts: 1.66, 1.56, 1.64, 1.82, 0.99 and 1.09 us]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch. FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch. ConnectIB Dual FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch.
Bandwidth: MPI over IB
[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth, Bandwidth (MBytes/sec) vs. Message Size (bytes), for the same six configurations; unidirectional callouts: 3280, 3385, 1917, 1706, 6343 and 12485 MB/s; bidirectional callouts: 3341, 3704, 4407, 11643, 6521 and 21025 MB/s]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch. FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch. ConnectIB Dual FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch.
eXtended Reliable Connection (XRC) and Hybrid Mode: Memory Usage and Performance on NAMD (1024 cores)
• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
• Both UD and RC/XRC have benefits; the hybrid mode gives the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD: MV2_USE_ONLY_UD=1; Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1 (a usage sketch follows the charts)
M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
[Charts: memory (MB/process) vs. number of connections (1-16K) for MVAPICH2-RC and MVAPICH2-XRC; normalized NAMD time on the apoa1, er-gre, f1atpase and jac datasets for MVAPICH2-RC and MVAPICH2-XRC; time (us) vs. number of processes (128-1024) for UD, Hybrid and RC, with improvements of 26%, 40%, 30% and 38%]
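The runtime parameters above are ordinarily exported in the job script or on the mpirun command line. As a minimal illustration, and with the assumption that setting the MV2_* variables programmatically before MPI_Init has the same effect as exporting them (MVAPICH2 reads them during initialization):

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* Enable the hybrid UD/RC transport; MV2_USE_ONLY_UD=1 would
         * force UD instead. RC is the default when neither is set. */
        setenv("MV2_HYBRID_ENABLE_THRESHOLD", "1", 1);

        MPI_Init(&argc, &argv);
        /* ... application ... */
        MPI_Finalize();
        return 0;
    }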
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)
Latest MVAPICH2 1.9 on Intel Sandy Bridge
[Charts: latency (us) vs. message size (0-1K bytes) for intra-socket (0.19 us) and inter-socket (0.45 us) communication; bandwidth (MB/s) vs. message size for inter-socket and intra-socket transfers with CMA, shared memory and LiMIC, each reaching about 12,000 MB/s]
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Highly Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
• MVAPICH2 with GPUDirect RDMA
• Conclusion
Hardware Multicast-aware MPI_Bcast on Stampede
[Charts: broadcast latency (us) vs. message size for small messages (2-512 bytes) and large messages (2K-128K) at 102,400 cores, and latency (us) vs. number of nodes for a 16-byte message and a 32-KByte message, comparing Default and Multicast]
ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (Sandy Bridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload
[Charts: application run time (s) vs. data size (512-800) and run time (s) vs. number of processes (64-512), comparing PCG-Default and Modified-PCG-Offload; normalized performance vs. HPL problem size (N) as a % of total memory (10-70) for HPL-Offload, HPL-1ring and HPL-Host]
• Modified P3DFFT with Offload-Alltoall performs up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast performs up to 4.5% better than the default version (512 processes)
• Modified Preconditioned Conjugate Gradient solver with Offload-Allreduce performs up to 21.8% better than the default version
K. Kandalla, et al., "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT," ISC 2011
K. Kandalla, et al., "Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL," HotI 2011
K. Kandalla, et al., "Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers," IPDPS '12
A sketch of the non-blocking collective pattern these designs exploit follows.
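For reference, the overlap pattern the offloaded designs enable can be sketched with the standard MPI-3 non-blocking collective interface; the compute routine here is a hypothetical placeholder, and with collective offload the HCA progresses the collective while the CPU computes:

    #include <mpi.h>

    /* Hypothetical placeholder for work independent of the reduction. */
    void compute_independent_work(void);

    void overlapped_allreduce(double *local, double *global, int n)
    {
        MPI_Request req;

        /* Start the reduction; the call returns immediately. */
        MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Overlap communication with computation. */
        compute_independent_work();

        /* Block only when the result is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }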
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Highly Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
• MVAPICH2 with GPUDirect RDMA
• Conclusion
Multi-Level Checkpointing with ScalableCR (SCR)
[Diagram: checkpoint levels ordered by checkpoint cost and resiliency, from low to high]
• LLNL's Scalable Checkpoint/Restart library
• Can be used for application-guided and application-transparent checkpointing
• Effective utilization of the storage hierarchy:
  – Local: store checkpoint data on the node's local storage, e.g. local disk or ramdisk
  – Partner: write to local storage and to a partner node
  – XOR: write the file to local storage while small sets of nodes collectively compute and store parity redundancy data (RAID-5 style)
  – Stable Storage: write to the parallel file system
A sketch of the application-guided SCR interface follows.
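A minimal sketch of the application-guided path, using the core SCR API names as documented by the library (SCR_Init/SCR_Finalize bracketing the MPI lifetime are assumed; the checkpoint writer is a hypothetical placeholder):

    #include <stdio.h>
    #include <scr.h>   /* LLNL Scalable Checkpoint/Restart library */

    /* Hypothetical placeholder: write this rank's state to 'path'. */
    void write_my_state(const char *path);

    void maybe_checkpoint(int rank)
    {
        int need = 0;
        SCR_Need_checkpoint(&need);   /* let SCR weigh cost vs. failure risk */
        if (!need)
            return;

        SCR_Start_checkpoint();
        char name[64], file[SCR_MAX_FILENAME];
        snprintf(name, sizeof(name), "ckpt.%d", rank);
        SCR_Route_file(name, file);   /* SCR chooses the level: local,
                                         partner, XOR or stable storage */
        write_my_state(file);
        SCR_Complete_checkpoint(1);   /* 1 = this rank's files are valid */
    }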
Application-guided Multi-Level Checkpointing
[Chart: checkpoint writing time (s) for PFS, MVAPICH2+SCR (Local), MVAPICH2+SCR (Partner) and MVAPICH2+SCR (XOR)]
• Checkpoint-writing phase times of a representative SCR-enabled MPI application
• 512 MPI processes (8 procs/node)
• Approx. 51 GB checkpoints
Transparent Multi-Level Checkpointing
[Chart: checkpointing time (ms) for MVAPICH2-CR (PFS) vs. MVAPICH2+SCR (Multi-Level), broken down into Suspend N/W, Reactivate N/W and Write Checkpoint]
• ENZO cosmology application, Radiation Transport workload
• Using MVAPICH2's CR protocol instead of the application's built-in CR mechanism
• 512 MPI processes (8 procs/node)
• Approx. 12.8 GB checkpoints
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
Scalable OpenSHMEM/UPC and Hybrid (MPI, UPC and OpenSHMEM) Designs
• Based on the OpenSHMEM reference implementation (http://openshmem.org/) and UPC version 2.14.2 (http://upc.lbl.gov/)
  – The reference implementations provide a design over GASNet and do not take advantage of all OFED features
• Goal: design scalable and high-performance OpenSHMEM and UPC over OFED
• Designing a hybrid MPI + OpenSHMEM/UPC model
  – Current model: separate runtimes for OpenSHMEM/UPC and MPI
    • Possible deadlock if both runtimes are not progressed
    • Consumes more network resources
  – Our approach: a single unified runtime for MPI and OpenSHMEM/UPC
• Available in MVAPICH2-X 1.9
[Diagram: a hybrid (UPC/OpenSHMEM + MPI) application issues UPC/OpenSHMEM calls and MPI calls through the corresponding interfaces of a unified communication runtime over the InfiniBand network]
A minimal hybrid program sketch follows.
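A minimal sketch of what a hybrid program looks like under the unified runtime, mixing standard MPI with OpenSHMEM 1.0-era calls; the interleaving shown, and the initialization order, are assumptions about typical MVAPICH2-X usage rather than code from this deck:

    #include <mpi.h>
    #include <shmem.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        start_pes(0);                    /* OpenSHMEM init; one runtime
                                            underneath with MVAPICH2-X */
        int me = _my_pe(), npes = _num_pes();

        /* One-sided PGAS traffic on the symmetric heap... */
        int *flag = (int *)shmalloc(sizeof(int));
        *flag = 0;
        int one = 1;
        shmem_int_put(flag, &one, 1, (me + 1) % npes);
        shmem_barrier_all();

        /* ...mixed freely with two-sided MPI collectives. */
        int sum = 0;
        MPI_Allreduce(flag, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        shfree(flag);
        MPI_Finalize();
        return 0;
    }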
Performance of Hybrid (OpenSHMEM+MPI) Applications
[Charts: execution time (s) vs. number of processes for 2D Heat Transfer Modeling (32-512) and Graph-500 (32-256), and network resource utilization (MB) vs. number of processes, comparing Hybrid-GASNet and Hybrid-OSU]
• Improved performance for hybrid applications:
  – 34% improvement for 2D Heat Transfer Modeling with 512 processes
  – 45% improvement for Graph500 with 256 processes
• Our approach with a single runtime consumes 27% less network resources
J. Jose, K. Kandalla, M. Luo and D. K. Panda, "Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation," Int'l Conference on Parallel Processing (ICPP '12), September 2012
Hybrid MPI+OpenSHMEM Graph500 Design
[Charts: execution time (s) at 2,048 and 8,192 processes; traversed edges per second (TEPS) vs. scale (26-29) and TEPS vs. number of processes (1,024-8,192), covering strong and weak scalability; each compares MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM)]
• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design:
  – At 2,048 processes: 1.9X improvement over MPI-CSR (the best-performing MPI version) and 2.7X over MPI-Simple (which has the same communication characteristics)
  – At 8,192 processes: 2.4X improvement over MPI-CSR and 7.6X over MPI-Simple
J. Jose, S. Potluri, K. Tomko and D. K. Panda, "Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models," International Supercomputing Conference (ISC '13), June 2013 (Monday, Hall 5, 5:40-6:00 PM)
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
InfiniBand + GPU systems (Past)
• Many HPC systems today combine GPUs with high-speed networks such as InfiniBand
• Problem: lack of a common memory registration mechanism
  – Each device has to pin the host memory it will use
  – Many operating systems do not allow multiple devices to register the same memory pages
• Previous solution: use a different buffer for each device and copy data between them
GPU-Direct
• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices register a common host buffer
  – The GPU copies data to this buffer, and the network adapter can read directly from it (or vice versa)
• Note that GPU-Direct does not allow you to bypass host memory
[Diagram: GPU, CPU and NIC connected over PCIe, with the NIC attached to the switch]
Sample Code - Without MPI integration
• Naïve implementation with standard MPI and CUDA
• High productivity but poor performance

At Sender:
    cudaMemcpy(sbuf, sdev, . . .);
    MPI_Send(sbuf, size, . . .);

At Receiver:
    MPI_Recv(rbuf, size, . . .);
    cudaMemcpy(rdev, rbuf, . . .);
Sample Code - User Optimized Code
• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• Code at the sender side (and mirrored at the receiver side)
• User-level copying may not match the internal MPI design
• High performance but poor productivity

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
    for (j = 0; j < pipeline_len; j++) {
        result = cudaErrorNotReady;      /* reset before polling block j */
        while (result != cudaSuccess) {
            result = cudaStreamQuery(. . .);
            if (j > 0) MPI_Test(. . .);
        }
        MPI_Isend(sbuf + j * blksz, blksz, . . .);
    }
    MPI_Waitall(. . .);
Can this be done within the MPI Library?
• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g., enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer that automatically provides optimizations inside the MPI library, without user tuning
• A new design was incorporated in MVAPICH2 to support this functionality
Sample Code - MVAPICH2-GPU
• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity

At Sender:
    MPI_Send(s_device, size, . . .);

At Receiver:
    MPI_Recv(r_device, size, . . .);

All pipelining happens inside MVAPICH2. A complete minimal example follows.
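A self-contained sketch of the same pattern for two ranks (file and buffer names are illustrative; assumes an MVAPICH2 build with CUDA support and, for the 1.8/1.9 series, MV2_USE_CUDA=1 set at run time):

    /* gpu_sendrecv.c: MPI_Send/MPI_Recv directly on GPU buffers,
     * relying on CUDA UVA and MVAPICH2's GPU-aware pipeline. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (4 * 1024 * 1024)

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *dbuf;
        cudaMalloc((void **)&dbuf, N);  /* device pointer passed to MPI as-is */

        if (rank == 0) {
            cudaMemset(dbuf, 7, N);
            MPI_Send(dbuf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(dbuf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d bytes into GPU memory\n", N);
        }

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }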
MPI Two-sided Communication Performance
[Chart: time (us) vs. message size (32K-4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better]
• 45% improvement over a naïve user-level implementation (Memcpy+Send) for 4 MB messages
• 24% improvement over an advanced user-level implementation (MemcpyAsync+Isend) for 4 MB messages
H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters," ISC '11
Application-Level Evaluation (LBM and AWP-ODC)
[Charts: 1D LBM-CUDA step time (s) vs. domain size (256x256x256 to 512x512x512) for MPI and MPI-GPU, with improvements of 13.7%, 12.0%, 11.8% and 9.4%; AWP-ODC total execution time (s) with 1 GPU/process per node and 2 GPUs/processes per node, with improvements of 11.1% and 7.9%; lower is better]
• LBM-CUDA (courtesy: Carlos Rosale, TACC): Lattice Boltzmann Method for multiphase flows with large density ratios; 1D LBM-CUDA with one process/GPU per node, 16 nodes, 4 groups data grid
• AWP-ODC (courtesy: Yifeng Cui, SDSC): a seismic modeling code, Gordon Bell Prize finalist at SC 2010; 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070s, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node
Optimizing Collective Communication
[Diagram: an MPI_Alltoall decomposes into N² point-to-point transfers, each consisting of a DMA from device to host, an RDMA transfer to the remote node over the network, and a DMA from host to device]
• Pipelined point-to-point communication optimizes each individual transfer
• There is a further need for optimization at the algorithm (collective) level
A pairwise-exchange sketch of this decomposition follows.
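For reference, the decomposition the diagram depicts can be sketched as a pairwise exchange built from point-to-point calls on GPU buffers; the XOR pairing assumes a power-of-two number of processes, and production algorithms inside the library differ:

    #include <mpi.h>

    /* Pairwise alltoall on device buffers: each MPI_Sendrecv is itself
     * a pipelined DMA + RDMA + DMA transfer inside MVAPICH2. */
    void alltoall_pairwise(char *d_send, char *d_recv, int bytes_per_peer,
                           MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int step = 0; step < size; step++) {
            int peer = rank ^ step;   /* power-of-two pairing assumption */
            MPI_Sendrecv(d_send + (size_t)peer * bytes_per_peer,
                         bytes_per_peer, MPI_CHAR, peer, 0,
                         d_recv + (size_t)peer * bytes_per_peer,
                         bytes_per_peer, MPI_CHAR, peer, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }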
Alltoall Latency Performance (Large Messages)
[Chart: time (us) vs. message size (128K-2M) for No MPI-Level Optimization vs. Collective-Level Optimization, showing up to 46% improvement; lower is better]
• 8-node Westmere cluster with NVIDIA Tesla C2050 GPUs and IB QDR
A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, "MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits," Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension, non-contiguous in the others
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
[Diagram: halo data exchange between neighboring domains]
Datatype Support in MPI
• Native datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enables the MPI library to optimize non-contiguous data

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• What happens if the non-contiguous data is in GPU device memory?
• Enhanced MVAPICH2:
  – Uses datatype-specific CUDA kernels to pack data in chunks
  – Pipelines pack/unpack, CUDA copies, and RDMA transfers
H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2," IEEE Cluster '11, Sept. 2011
A concrete halo-column sketch follows.
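A concrete sketch: sending one column of an N x N row-major grid that resides in GPU memory. With the enhanced MVAPICH2, the same vector datatype applied to a device pointer is packed by CUDA kernels inside the library; the grid dimensions and tag here are illustrative:

    #include <mpi.h>

    /* Send column 'col' of an N x N row-major double grid that lives in
     * GPU device memory (d_grid is a cudaMalloc'ed pointer). */
    void send_gpu_column(double *d_grid, int N, int col, int dest)
    {
        MPI_Datatype column;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column); /* N blocks of 1,
                                                          stride N */
        MPI_Type_commit(&column);

        MPI_Send(d_grid + col, 1, column, dest, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column);
    }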
Application-Level Evaluation (LBMGPU-3D)
[Chart: 3D LBM-CUDA total execution time (s) vs. number of GPUs (8-64) for MPI and MPI-GPU, with improvements of 5.6%, 8.2%, 13.5% and 15.5%; lower is better]
• LBM-CUDA (courtesy: Carlos Rosale, TACC): Lattice Boltzmann Method for multiphase flows with large density ratios; 3D LBM-CUDA with one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070s, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
Multi-GPU Configurations
[Diagram: two processes on one node, each driving a different GPU behind the I/O hub, sharing the host memory and HCA]
• Multi-GPU node architectures are becoming common
• Until CUDA 3.2:
  – Communication between processes was staged through the host
  – Shared memory (pipelined)
  – Network loopback (asynchronous)
• CUDA 4.0:
  – Inter-Process Communication (IPC)
  – Host bypass, handled by a DMA engine
  – Low latency and asynchronous
  – Requires creation, exchange and mapping of memory handles, which adds overhead (a sketch of the raw handshake follows)
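For reference, the raw CUDA IPC handshake that MVAPICH2 hides looks roughly like this; the exporting and importing sides are separate processes, and shipping the handle between them (e.g., over MPI or a socket) is left out:

    #include <cuda_runtime.h>

    /* Exporting process: create a shareable handle for a device buffer. */
    cudaIpcMemHandle_t export_buffer(void *d_buf)
    {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);
        return handle;   /* ship to the peer out of band */
    }

    /* Importing process: map the peer's buffer and copy from it. */
    void import_and_copy(cudaIpcMemHandle_t handle, void *d_dst, size_t len)
    {
        void *d_peer = NULL;
        cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);

        /* DMA-engine copy, bypassing the host entirely. */
        cudaMemcpy(d_dst, d_peer, len, cudaMemcpyDeviceToDevice);

        /* Handle creation/mapping is the overhead MVAPICH2 amortizes. */
        cudaIpcCloseMemHandle(d_peer);
    }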
Comparison of Costs
[Chart: copy latency (usec) between two processes on one node: CUDA IPC copy, 3 usec; copy via host, 49 usec; CUDA IPC copy plus handle creation and mapping overhead, 228 usec]
• Comparison of bare copy costs between two processes on one node, each using a different GPU (outside MPI)
• MVAPICH2 takes advantage of CUDA IPC while hiding the handle creation and mapping overheads from the user
Two-sided Communication Performance
[Charts: small-message latency (usec, 1 byte-1K), large-message latency (usec, 4K-4M) and bandwidth (MBps, 1 byte-1M) vs. message size, comparing SHARED-MEM and CUDA IPC, with improvements of 70%, 46% and 78% across the charts]
Already available in MVAPICH2 1.8 and 1.9.
One-sided Communication Performance (get + active synchronization vs. send/recv)
[Charts: small-message latency (usec, 1 byte-1K), large-message latency (usec, 4K-4M) and bandwidth (MBps) vs. message size, comparing SHARED-MEM-1SC, CUDA-IPC-1SC and CUDA-IPC-2SC, with improvements of 30% and 27%]
• One-sided semantics harness better performance than two-sided semantics
• Support for one-sided communication from GPUs will be available in future releases of MVAPICH2
A host-side sketch of the get + active-synchronization pattern follows.
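For reference, the get + active synchronization pattern being measured, sketched with standard MPI one-sided calls on host buffers (GPU-resident windows are the future-release feature noted above, so this only illustrates the semantics):

    #include <mpi.h>

    /* Every rank except 'target' gets 'count' doubles from 'target'
     * using active-target (fence) synchronization. */
    void get_with_fence(double *win_buf, double *result, int count,
                        int rank, int target, MPI_Comm comm)
    {
        MPI_Win win;
        MPI_Win_create(win_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);           /* open the access epoch */
        if (rank != target)
            MPI_Get(result, count, MPI_DOUBLE,
                    target, 0, count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);           /* close it; data is now valid */

        MPI_Win_free(&win);
    }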
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
OpenACC
• OpenACC is gaining popularity; several sessions during GTC
• A set of compiler directives (#pragma) to offload specific loops or parallelizable sections of code onto accelerators:

    #pragma acc region
    {
        for (i = 0; i < size; i++) {
            A[i] = B[i] + C[i];
        }
    }

• Routines to allocate/free memory on accelerators:

    buffer = acc_malloc(MYBUFSIZE);
    acc_free(buffer);

• Supported for C, C++ and Fortran
• A long list of clauses: copy, copyout, private, independent, etc.
Using MVAPICH2 with OpenACC 1.0
• Use acc_malloc to allocate device memory; no changes to MPI calls
  – MVAPICH2 detects the device pointer and optimizes data movement
  – Delivers the same performance as with CUDA

    A = acc_malloc(sizeof(int) * N);
    ……
    #pragma acc parallel loop deviceptr(A) . . .
    //compute for loop
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    ……
    acc_free(A);
Using MVAPICH2 with the new OpenACC 2.0
• Use acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler, once it is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Expected to deliver the same performance as with CUDA

    A = malloc(sizeof(int) * N);
    ……
    #pragma acc data copyin(A) . . .
    {
        #pragma acc parallel loop . . .
        //compute for loop
        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    ……
    free(A);
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable Blocking and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
GPU-Direct RDMA with CUDA 5.0
• Fastest possible communication between the GPU and other PCIe devices
• The network adapter can directly read/write data from/to GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication
[Diagram: the InfiniBand HCA accesses GPU memory directly across the chipset, alongside the CPU and system memory]
Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA
• A preliminary driver for GPU-Direct RDMA is being developed by NVIDIA and Mellanox
• OSU has done an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver:
  – Hybrid design
  – Takes advantage of GPU-Direct RDMA for short messages
  – Uses the host-based buffered design in current MVAPICH2 for large messages
  – Alleviates the Sandy Bridge chipset bottleneck
MVAPICH2-GDR Alpha will be demonstrated at the ISC '13 exhibition floor (Mellanox Technologies, Booth #326)
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA: GPU-GPU Internode MPI Latency
[Charts: small-message latency (us, 1 byte-4K) and large-message latency (us, 16K-4M) for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; small-message latency drops from 19.78 us to 6.12 us, a 69% improvement; lower is better]
Based on MVAPICH2 1.9. Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA. CUDA 5.0, OFED 1.5.4.1 with the GPU-Direct-RDMA patch.
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA: GPU-GPU Internode MPI Uni-directional Bandwidth
[Charts: small-message bandwidth (MB/s, 1 byte-4K) and large-message bandwidth (MB/s, 16K-4M) for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; up to 3x improvement for small messages and 26% for large messages; higher is better]
Based on MVAPICH2 1.9. Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA. CUDA 5.0, OFED 1.5.4.1 with the GPU-Direct-RDMA patch.
Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA: GPU-GPU Internode MPI Bi-directional Bandwidth
[Charts: small-message bi-bandwidth (MB/s, 1 byte-4K) and large-message bi-bandwidth (MB/s, 16K-4M) for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; up to 3.13x improvement for small messages and 51% for large messages; higher is better]
Based on MVAPICH2 1.9. Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA. CUDA 5.0, OFED 1.5.4.1 with the GPU-Direct-RDMA patch.
Execution Time of HSG Application
[Charts: total time (s) vs. problem size (64, 128, 256) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR, with 2 MPI processes per node (improvements of 68%, 57% and 21%) and with 1 MPI process per node (38% improvement); lower is better]
• Application run on two GPU nodes
• GDR benefits small- and medium-message exchanges; good for strong scaling
• Context-switching overheads appear in cudaMemcpy with multiple processes per GPU; GDR avoids these overheads
Based on MVAPICH2 1.9; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0, OFED 1.5.4.1 with the GPU-Direct-RDMA patch.
Execution Time of HSG Application (Cont'd)
[Chart: total time (s) vs. problem size (64, 128, 256) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR, with 4 MPI processes per node, showing improvements of 78%, 74% and 38%; lower is better]
• Application run on two GPU nodes
Based on MVAPICH2 1.9; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0, OFED 1.5.4.1 with the GPU-Direct-RDMA patch.
MVAPICH2 Release with GPUDirect RDMA Hybrid
• Further tuning and optimizations (such as collectives) remain to be done
• The MVAPICH2 release with GPUDirect RDMA support will be timed with the release of the OFED driver with GDR support by Mellanox and NVIDIA
Conclusions
• MVAPICH2/MVAPICH2-X offer solutions to address challenges on large-scale HPC clusters
• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Point-to-point communication, collective communication and datatype processing are all addressed
• Takes advantage of CUDA features such as IPC and GPUDirect RDMA
• Optimizations live under the hood of the MPI calls, hiding all the complexity from the user
• High productivity and high performance
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu
MVAPICH User Group (MUG) Meeting, August 26-27, 2013, Columbus, Ohio, U.S.A.
• The MUG meeting will provide an open forum for all attendees (users, researchers, system administrators, engineers, and students) to share their knowledge about MVAPICH2/MVAPICH2-X on large-scale systems and diverse applications.
• The event includes:
  – Talks from experts in the field
  – Presentations from the MVAPICH team on tuning and optimization strategies
  – Troubleshooting guidelines
  – Contributed presentations
  – An open-mic session
  – Interactive one-on-one sessions with the MVAPICH developers
Call for Presentations
• The MVAPICH team requests the submission of presentations from MVAPICH2 and MVAPICH2-X users to be included in the event.
• Presentation submission deadline: July 1, 2013
• Notification of acceptance: July 8, 2013
• Advance registration deadline: July 15, 2013
The preliminary program has been posted at mug.mvapich.cse.ohio-state.edu/program/