TRANSCRIPT
High-Performance and Scalable Designs of Programming Models for Exascale Systems
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at HPC Advisory Council Switzerland Conference (2015)
High-End Computing (HEC): PetaFlop to ExaFlop
• 100-200 PFlops expected in 2016-2017
• 1 EFlops expected in 2020-2024?
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top500 over time; commodity clusters now account for about 86% of the list]
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters: high-performance interconnects with <1 usec latency and >100 Gbps bandwidth
• Accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip) are becoming common in high-end systems
• Together these push the envelope for Exascale computing
Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A
Large-scale InfiniBand Installations
• 225 IB clusters (45%) in the November 2014 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
– 519,640 cores (Stampede) at TACC (7th)
– 72,800 cores Cray CS-Storm in US (10th)
– 160,768 cores (Pleiades) at NASA/Ames (11th)
– 72,000 cores (HPC2) in Italy (12th)
– 147,456 cores (SuperMUC) in Germany (14th)
– 76,032 cores (Tsubame 2.5) at Japan/GSIC (15th)
– 194,616 cores (Cascade) at PNNL (18th)
– 110,400 cores (Pangea) at France/Total (20th)
– 37,120 cores T-Platform A-Class at Russia/MSU (22nd)
– 50,544 cores (Occigen) at France/GENCI-CINES (26th)
– 73,584 cores (Spirit) at USA/Air Force (31st)
– 77,184 cores (Curie thin nodes) at France/CEA (33rd)
– 65,320 cores iDataPlex DX360M4 at Germany/Max-Planck (34th)
– 120,640 cores (Nebulae) at China/NSCS (35th)
– 72,288 cores (Yellowstone) at NCAR (36th)
– 70,560 cores (Helios) at Japan/IFERC (38th)
– 38,880 cores (Cartesius) at Netherlands/SURFsara (45th)
– 23,040 cores (QB-2) at LONI/US (46th)
– 138,368 cores (Tera-100) at France/CEA (47th)
– and many more!
• Scientific Computing
– Message Passing Interface (MPI), including MPI + OpenMP, is the
Dominant Programming Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• UPC, OpenSHMEM, CAF, etc.
– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce)
– Spark is emerging for in-memory computing
– Memcached is also used for Web 2.0
6
Two Major Categories of Applications
HPCAC Switzerland Conference (Mar '15)
Towards Exascale System (Today and Target)
(Today = Tianhe-2, 2015; Target = 2020-2024)
• System peak: 55 PFlop/s → 1 EFlop/s (~20x)
• Power: 18 MW (3 GFlops/W) → ~20 MW (50 GFlops/W) (O(1), ~15x)
• System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB (~50x)
• Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF (O(1))
• Node concurrency: 24-core CPU + 171-core CoP → O(1k) or O(10k) (~5x - ~50x)
• Total node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x - ~60x)
• System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)
• Total concurrency: 3.12M (12.48M threads, 4/core) → O(billion) for latency hiding (~100x)
• MTTI: few/day → many/day (O(?))
Courtesy: Prof. Jack Dongarra
Parallel Programming Models Overview
• Shared Memory Model (e.g., SHMEM, DSM): processes P1..Pn share a single memory
• Distributed Memory Model (e.g., MPI, the Message Passing Interface): each process has its own memory and exchanges messages
• Partitioned Global Address Space (PGAS) Model (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...): per-process memories presented as a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• Energy and Power Challenge
– Hard to solve power requirements for data movement
• Memory and Storage Challenge
– Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
– Management of a very large amount of concurrency (billions of threads)
• Resiliency Challenge
– Low voltage devices (for low power) introduce more faults
HPCAC Switzerland Conference (Mar '15) 9
Basic Design Challenges for Exascale Systems
• Power required for data movement operations is one of
the main challenges
• Non-blocking collectives
– Overlap computation and communication
• Much improved One-sided interface
– Reduce synchronization of sender/receiver
• Manage concurrency
– Improved interoperability with PGAS (e.g. UPC, Global Arrays,
OpenSHMEM, CAF)
• Resiliency
– New interface for detecting failures
HPCAC Switzerland Conference (Mar '15) 10
How does MPI Plan to Meet Exascale Challenges?
• Major features
– Non-blocking Collectives
– Improved One-Sided (RMA) Model
– MPI Tools Interface
• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
• MPI 3.1 and 4.0 are underway
HPCAC Switzerland Conference (Mar '15) 11
Major New Features in MPI-3
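To make the non-blocking collectives concrete, here is a minimal sketch (not from the talk) of overlapping computation with MPI_Iallreduce, one of the MPI-3 features listed above; the buffer names and the do_independent_work() stub are illustrative only.

#include <mpi.h>

/* Sketch: overlap application work with an MPI-3 non-blocking Allreduce. */
static void do_independent_work(void) { /* computation that does not need the reduced result */ }

void overlap_example(const double *local, double *global, int n, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the reduction; it can progress while we compute. */
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    do_independent_work();

    /* Complete the collective before using 'global'. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}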
Partitioned Global Address Space (PGAS) Models
• Key features
- Simple shared-memory abstractions
- Light-weight one-sided communication (see the sketch below)
- Easier to express irregular communication
• Different approaches to PGAS
- Languages
• Unified Parallel C (UPC)
• Co-Array Fortran (CAF)
• X10
• Chapel
- Libraries
• OpenSHMEM
• Global Arrays
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
– MPI across address spaces
– PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared
memory programming models
• Can co-exist with OpenMP for offloading computation
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
HPCAC Switzerland Conference (Mar '15) 13
MPI+PGAS for Exascale Architectures and Applications
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based
on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Exascale Roadmap*:
– “Hybrid Programming is a practical way to
program exascale systems”
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
[Diagram: an HPC application composed of Kernels 1..N, each initially written in MPI; selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS based on their communication characteristics. A small code sketch follows this slide.]
HPCAC Switzerland Conference (Mar '15) 14
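A minimal sketch of what such a hybrid kernel can look like (assuming a unified runtime such as MVAPICH2-X that allows MPI and OpenSHMEM calls in one program; the kernel itself is illustrative): one-sided OpenSHMEM operations handle the irregular updates, while a regular phase is expressed with an MPI collective.

#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(void)
{
    static long counter = 0;                 /* symmetric variable */

    MPI_Init(NULL, NULL);
    shmem_init();                            /* whether both init calls are needed depends on the runtime */

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Irregular, one-sided phase: atomically increment a counter on a neighbor PE. */
    shmem_long_inc(&counter, (me + 1) % npes);
    shmem_barrier_all();

    /* Regular, bulk-synchronous phase expressed with an MPI collective. */
    long local = counter, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (me == 0) printf("total = %ld over %d PEs\n", total, npes);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}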
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Layered diagram]
• Application kernels/applications
• Programming models (middleware): MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Communication library or runtime for programming models:
– Point-to-point communication (two-sided & one-sided)
– Collective communication
– Synchronization & locks
– I/O & file systems
– Fault tolerance
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and OmniPath)
• Multi-/many-core architectures and accelerators (NVIDIA and MIC)
Co-design opportunities and challenges exist across all of these layers, with performance, scalability and fault-resilience as the goals.
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable Collective communication
– Offload
– Non-blocking
– Topology-aware
– Power-aware
• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …)
• Virtualization
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale
16 HPCAC Switzerland Conference (Mar '15)
• Extremely low memory footprint
– Memory per core continues to decrease
• D-L-A Framework
– Discover
• Overall network topology (fat-tree, 3D, …)
• Network topology for processes for a given job
• Node architecture
• Health of network and node
– Learn
• Impact on performance and scalability
• Potential for failure
– Adapt
• Internal protocols and algorithms
• Process mapping
• Fault-tolerance solutions
– Low overhead techniques while delivering performance, scalability and fault-tolerance
Additional Challenges for Designing Exascale Software Libraries
17 HPCAC Switzerland Conference (Mar '15)
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2012
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Used by more than 2,350 organizations in 75 countries
– More than 242,000 downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘14 ranking)
• 7th ranked 519,640-core cluster (Stampede) at TACC
• 11th ranked 160,768-core cluster (Pleiades) at NASA
• 15th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
– Available with software stacks of many IB, HSE, and server vendors including Linux
Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Stampede at TACC (7th in Nov 2014, 519,640 cores, 5.168 PFlops)
MVAPICH2 Software
18 HPCAC Switzerland Conference (Mar '15)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely low memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Virtualization
Overview of A Few Challenges being Addressed by MVAPICH2 Project for Exascale
19 HPCAC Switzerland Conference (Mar '15)
Latency & Bandwidth: MPI over IB with MVAPICH2
[Charts: small-message latency and unidirectional bandwidth across IB generations: Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, Ivy-ConnectIB-DualFDR, Haswell-ConnectIB-SingleFDR]
• Small-message latency: 0.99-1.82 us depending on adapter and platform (reported values: 1.66, 1.56, 1.64, 1.82, 0.99, 1.09, 1.12, 1.22 us)
• Unidirectional bandwidth: roughly 1,706-12,810 MBytes/sec across configurations (reported values: 1706, 1917, 3280, 3385, 6304, 6343, 12485, 12810 MBytes/sec)
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch
FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-Dual FDR: 2.6 GHz octa-core (SandyBridge) / 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-Single FDR: 2.8 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch
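For reference, point-to-point numbers like the ones above are usually gathered with a ping-pong test; the following is a minimal sketch in that spirit (not the OSU micro-benchmarks themselves; message size and iteration count are arbitrary).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    enum { SIZE = 4, ITERS = 10000 };
    char buf[SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {           /* rank 0 sends first, then waits for the echo */
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {    /* rank 1 echoes the message back */
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - start) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}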
MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))
Latest MVAPICH2 2.1rc2 on Intel Ivy-bridge
[Charts: small-message latency (intra-socket vs. inter-socket) and bandwidth for intra-socket and inter-socket transfers with CMA, shared-memory (Shmem) and LiMIC channels]
• Small-message latency: 0.18 us intra-socket, 0.45 us inter-socket
• Peak bandwidths of 14,250 MB/s (intra-socket) and 13,749 MB/s (inter-socket)
MPI-3 RMA Get/Put with Flush Performance
Latest MVAPICH2 2.1rc2, Intel Sandy-bridge with Connect-IB (single-port)
[Charts: inter-node and intra-socket Get/Put latency and bandwidth]
• Inter-node small-message latency: 2.04 us (Get) and 1.56 us (Put)
• Intra-socket small-message latency: as low as 0.08 us
• Inter-node bandwidth: up to 6,881 (Get) and 6,876 (Put) MBytes/sec
• Intra-socket bandwidth: up to about 14,926 and 15,364 MBytes/sec
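The flush-based measurements above correspond to passive-target MPI-3 RMA of the following form; this is a generic sketch (window size, displacement and target rank are illustrative), not code from the talk.

#include <mpi.h>

/* Called collectively: create a window, then Put into 'target' and flush so the
 * update is remotely complete before the epoch ends. */
void rma_put_flush(double *winbuf, int n, int target, MPI_Comm comm)
{
    MPI_Win win;
    double value = 42.0;

    MPI_Win_create(winbuf, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);   /* passive-target epoch */
    MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);                      /* Put is complete at the target here */
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
}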
• Memory usage for 32K processes with 8-cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
HPCAC Switzerland Conference (Mar '15)
Minimizing Memory Footprint with XRC and Hybrid Mode
Memory Usage Performance on NAMD (1024 cores)
23
• Both UD and RC/XRC have benefits
• Hybrid for the best of both
• Available since MVAPICH2 1.7 as integrated interface
• Runtime Parameters: RC - default;
• UD - MV2_USE_ONLY_UD=1
• Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1
M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable
Connection,” Cluster ‘08
[Charts: (a) memory per process vs. number of connections for MVAPICH2-RC vs. MVAPICH2-XRC; (b) normalized NAMD execution time (1024 cores) on the apoa1, er-gre, f1atpase and jac datasets for RC vs. XRC; (c) time vs. number of processes (128-1024) for UD, Hybrid and RC, where the Hybrid mode yields 26-40% benefits]
Minimizing Memory Footprint further by DC Transport
[Diagram: processes P0-P7 spread across Nodes 0-3, communicating over the IB network through DC transport]
• Constant connection cost (One QP for any peer)
• Full Feature Set (RDMA, Atomics etc)
• Separate objects for send (DC Initiator) and receive (DC Target)
– DC Target identified by “DCT Number”
– Messages routed with (DCT Number, LID)
– Requires same “DC Key” to enable communication
• Initial study done in MVAPICH2
• DCT support available in Mellanox OFED
[Charts: (left) normalized NAMD execution time (apoa1, large dataset) at 160, 320 and 620 processes for RC, DC-Pool, UD and XRC; (right) connection memory (KB) for an Alltoall benchmark at 80-640 processes for RC, DC-Pool, UD and XRC, where DC-Pool keeps the footprint nearly constant while RC grows steeply with process count]
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, International Supercomputing Conference (ISC '14)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely low memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Virtualization
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
25 HPCAC Switzerland Conference (Mar '15)
HPCAC Switzerland Conference (Mar '15) 26
Hardware Multicast-aware MPI_Bcast on Stampede
[Charts: MPI_Bcast latency on Stampede at 102,400 cores for Default vs. hardware Multicast, for small messages (2-512 bytes) and large messages (2 KB-128 KB), and latency vs. number of nodes for 16-byte and 32-KByte messages]
ConnectX-3-FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel, PCI Gen3, with Mellanox IB FDR switch
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Hardware
[Charts: normalized HPL performance (HPL-Offload, HPL-1ring, HPL-Host) vs. HPL problem size (N) as a percentage of total memory; application run-time vs. data size; and run-time of PCG-Default vs. Modified-PCG-Offload at 64-512 processes]
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
• Modified Pre-conditioned Conjugate Gradient (PCG) solver with Offload-Allreduce does up to 21.8% better than the default version
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
HPCAC Switzerland Conference (Mar '15) 28
Network-Topology-Aware Placement of Processes
How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information? What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes, and Default vs. Topology-Aware placement for a 2048-core run, showing a 15% improvement]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable
InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12 . BEST Paper and BEST STUDENT Paper
Finalist
• Reduce network topology discovery time from O(Nhosts^2) to O(Nhosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
29
Power and Energy Savings with Power-Aware Collectives Performance and Power Comparison: MPI_Alltoall with 64 processes on 8 nodes
K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, “Designing Power Aware Collective Communication Algorithms for
Infiniband Clusters”, ICPP ‘10
CPMD Application Performance and Power Savings
• About 32% savings in power with roughly 5% impact on performance (see charts)
HPCAC Switzerland Conference (Mar '15)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely low memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Virtualization
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
30 HPCAC Switzerland Conference (Mar '15)
[Diagram: GPU and network adapter (NIC) both attached to the CPU over PCIe, connected through a switch; data is staged through host memory]
At Sender:
cudaMemcpy(s_hostbuf, s_devbuf, size, cudaMemcpyDeviceToHost);   /* stage GPU data in host memory */
MPI_Send(s_hostbuf, size, . . .);
At Receiver:
MPI_Recv(r_hostbuf, size, . . .);
cudaMemcpy(r_devbuf, r_hostbuf, size, cudaMemcpyHostToDevice);   /* copy received data to the GPU */
• Data movement in applications with standard MPI and CUDA interfaces
High Productivity and Low Performance
31 HPCAC Switzerland Conference (Mar '15)
MPI + CUDA - Naive
[Diagram: GPU and NIC attached to the CPU over PCIe, connected through a switch; data is pipelined through host memory in chunks]
At Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blk_sz, s_devbuf + j * blk_sz, blk_sz, …);   /* async copy of block j to host */
for (j = 0; j < pipeline_len; j++) {
    /* wait for block j to reach host memory, progressing outstanding sends meanwhile */
    do {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    } while (result != cudaSuccess);
    MPI_Isend(s_hostbuf + j * blk_sz, blk_sz, . . .);
}
MPI_Waitall(…);
<<Similar at receiver>>
• Pipelining at user level with non-blocking MPI and CUDA interfaces
Low Productivity and High Performance
32 HPCAC Switzerland Conference (Mar '15)
MPI + CUDA - Advanced
GPU-Aware MPI Library: MVAPICH2-GPU
At Sender:
MPI_Send(s_devbuf, size, …);    /* s_devbuf is in GPU device memory */
At Receiver:
MPI_Recv(r_devbuf, size, …);    /* r_devbuf is in GPU device memory */
(data movement is handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
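A minimal sketch of the usage shown above, assuming a CUDA-aware MPI library such as MVAPICH2-GPU (ranks, message size and datatype are illustrative): device pointers are passed directly to MPI.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;
    float *devbuf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, n * sizeof(float));   /* buffer in GPU device memory */

    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* library moves the GPU data */
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}

With MVAPICH2, GPU support is enabled at run time (e.g., MV2_USE_CUDA=1, as listed with the HOOMD-blue results later in the talk).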
CUDA-Aware MPI: MVAPICH2 1.8-2.1 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
34 HPCAC Switzerland Conference (Mar '15)
• OFED with support for GPUDirect RDMA is
developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect
RDMA
– Hybrid design using GPU-Direct RDMA
• GPUDirect RDMA and Host-based pipelining
• Alleviates P2P bandwidth bottlenecks on
SandyBridge and IvyBridge
– Support for communication using multi-rail
– Support for Mellanox Connect-IB and ConnectX
VPI adapters
– Support for RoCE with Mellanox ConnectX VPI
adapters
GPU-Direct RDMA (GDR) with CUDA
[Diagram: the IB adapter accesses GPU memory directly through the chipset, bypassing system memory, on SNB E5-2670 / IVB E5-2680V2 platforms]
• SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
• IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s
Performance of MVAPICH2 with GPU-Direct-RDMA: Latency (GPU-GPU Internode MPI Latency)
MVAPICH2-GDR-2.1a; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA
[Chart: small-message latency (0 bytes-2 KB) for MV2-GDR2.1a, MV2-GDR2.0b and MV2 without GDR]
• MV2-GDR2.1a achieves 2.37 usec small-message latency, up to 2.5X and 8.6X better than the other two configurations
Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth (GPU-GPU Internode MPI Uni-Directional Bandwidth)
MVAPICH2-GDR-2.1a; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA
[Chart: small-message bandwidth (1 byte-4 KB) for MV2-GDR2.1a, MV2-GDR2.0b and MV2 without GDR]
• MV2-GDR2.1a improves small-message bandwidth by up to 2.2X and 10x over the other two configurations
Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth (GPU-GPU Internode MPI Bi-directional Bandwidth)
MVAPICH2-GDR-2.1a; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA
[Chart: small-message bi-directional bandwidth (1 byte-4 KB) for MV2-GDR2.1a, MV2-GDR2.0b and MV2 without GDR]
• MV2-GDR2.1a improves small-message bi-directional bandwidth by up to 2x and 10x over the other two configurations
Performance of MVAPICH2 with GPU-Direct-RDMA: MPI-3 RMA (GPU-GPU Internode MPI Put latency, RMA put operation Device to Device)
MVAPICH2-GDR-2.1a; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA
[Chart: small-message Put latency (0 bytes-8 KB) for MV2-GDR2.1a and MV2-GDR2.0b]
• MPI-3 RMA provides flexible synchronization and completion primitives
• MV2-GDR2.1a achieves 2.88 us Put latency, up to 7X better than MV2-GDR2.0b
HPCAC Switzerland Conference (Mar '15) 40
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Charts: average TPS vs. number of GPU nodes (4-64) for weak scalability with 2K particles/GPU and strong scalability with 256K particles]
• MVAPICH2-2.1a with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/#mv2gdr/
• System software requirements
• Mellanox OFED 2.1 or later
• NVIDIA Driver 331.20 or later
• NVIDIA CUDA Toolkit 6.0 or later
• Plugin for GPUDirect RDMA
– http://www.mellanox.com/page/products_dyn?product_family=116
– Strongly Recommended : use the new GDRCOPY module from NVIDIA
• https://github.com/NVIDIA/gdrcopy
• Has optimized designs for point-to-point communication using GDR
• Contact MVAPICH help list with any questions related to the package
Using MVAPICH2-GPUDirect Version
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely low memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Virtualization
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
42 HPCAC Switzerland Conference (Mar '15)
MPI Applications on MIC Clusters
[Diagram: execution modes on Xeon (multi-core centric) and Xeon Phi (many-core centric)]
• Host-only: the MPI program runs on the Xeon host only
• Offload (/reverse offload): the MPI program runs on the host, with computation offloaded to the Xeon Phi (a sketch follows this slide)
• Symmetric: MPI programs run on both the host and the Xeon Phi
• Coprocessor-only: the MPI program runs on the Xeon Phi only
• Flexibility in launching MPI jobs on clusters with Xeon Phi
43 HPCAC Switzerland Conference (Mar '15)
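For the offload (host-centric) mode above, a minimal sketch looks like the following; it assumes the Intel compiler's offload pragmas of that era, and the array size and compute loop are illustrative.

#include <mpi.h>

#define N 1024
float a[N], b[N];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* The MPI rank runs on the Xeon host; this block is offloaded to the Xeon Phi. */
    #pragma offload target(mic) in(a) out(b)
    {
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * a[i];
    }

    /* MPI communication stays on the host in this mode. */
    MPI_Allreduce(MPI_IN_PLACE, b, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}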
Data Movement on Intel Xeon Phi Clusters
[Diagram: two nodes, each with two CPUs connected by QPI and MICs attached over PCIe, linked by an IB adapter; an MPI process can run on either the CPUs or the MICs]
Communication paths between MPI processes:
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB adapter on remote socket
... and more
• Critical for runtimes to optimize data movement, hiding the complexity
• Connected as PCIe devices – Flexibility but Complexity
HPCAC Switzerland Conference (Mar '15)
45
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
– Coprocessor-only Mode
– Symmetric Mode
• Internode Communication
– Coprocessor-only Mode
– Symmetric Mode
• Multi-MIC Node Configurations
HPCAC Switzerland Conference (Mar '15)
MIC-Remote-MIC P2P Communication with Proxy-based Communication
[Charts: MIC-to-remote-MIC point-to-point latency (large messages, 8 KB-2 MB) and bandwidth for intra-socket and inter-socket configurations, with peak bandwidths of about 5,236 and 5,594 MB/sec]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
[Charts: 32-node Allgather (16H + 16M) small-message latency, 32-node Allgather (8H + 8M) large-message latency, and 32-node Alltoall (8H + 8M) large-message latency for MV2-MIC vs. MV2-MIC-Opt; P3DFFT execution time (communication + computation) on 32 nodes (8H + 8M), size 2K*2K*1K. Improvements of 76%, 58% and 55% are highlighted]
48
Latest Status on MVAPICH2-MIC
HPCAC Switzerland Conference (Mar '15)
• Running on three major systems
– Stampede
– Blueridge (Virginia Tech)
– Beacon (UTK)
• MVAPICH2-MIC 2.0 was released (12/02/14)
• Try it out!!
– http://mvapich.cse.ohio-state.edu/downloads/#mv2mic
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Extremely low memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Virtualization
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
49 HPCAC Switzerland Conference (Mar '15)
HPCAC Switzerland Conference (Mar '15) 50
MPI Tools Interface
• Introduced to expose internals of the MPI library to tools and applications
• Generalized interface – no defined variables in the standard
• Variables can differ between
• MPI implementations
• Compilations of same MPI library (production vs debug)
• Executions of the same application/MPI library
• An implementation may provide no variables at all
• Two types of variables supported
• Control Variables (CVARS)
• Typically used to configure and tune MPI internals
• Environment variables, configuration parameters and toggles
• Performance Variables (PVARS)
• Insights into performance of an MPI library
• Highly-implementation specific
• Memory consumption, timing information, resource-usage, data transmission info.
• Per-call basis or an entire MPI job
MPI_T_init_thread()
MPI_T_cvar_get_info(MV2_EAGER_THRESHOLD)
if (msg_size < MV2_EAGER_THRESHOLD + 1KB)
MPI_T_cvar_write(MV2_EAGER_THRESHOLD, +1024)
MPI_Send(..)
MPI_T_finalize()
HPCAC Switzerland Conference (Mar '15) 51
Co-designing Applications to use MPI-T
Typical MPI-T usage flow: Initialize MPI-T → get the number of variables → query metadata, then:
• Performance Variables: allocate session → allocate handle → read/write/reset and start/stop the variable → free handle → free session
• Control Variables: allocate handle → read/write the variable → free handle
Finally, finalize MPI-T.
Example: optimizing the eager limit dynamically (see the sketch after this slide)
R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold, and D. K. Panda, Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface, EuroMPI '14
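A minimal MPI_T sketch along the lines of the flow above; the control-variable name would be supplied by the caller (e.g., the eager-threshold variable from the previous slide). Whether a given variable exists and is writable depends on the MPI library, and error handling is omitted.

#include <string.h>
#include <mpi.h>

/* Find a control variable by name, bind a handle to it, and read its value. */
int read_cvar(const char *name, int *value)
{
    int provided, num, i, count;
    MPI_T_cvar_handle handle;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&num);

    for (i = 0; i < num; i++) {
        char cname[256];
        int namelen = sizeof(cname), verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, cname, &namelen, &verbosity, &dtype, &enumtype,
                            NULL, NULL, &bind, &scope);
        if (strcmp(cname, name) == 0) {
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, value);      /* MPI_T_cvar_write() would update it */
            MPI_T_cvar_handle_free(&handle);
            break;
        }
    }
    MPI_T_finalize();
    return (i < num) ? 0 : -1;
}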
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely low memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Virtualization
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
52 HPCAC Switzerland Conference (Mar '15)
MVAPICH2-X for Hybrid MPI + PGAS Applications
[Diagram: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue OpenSHMEM, MPI, UPC and CAF calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM and CAF, available with MVAPICH2-X 1.9 onwards!
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM,
MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard
compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
– Scalable Inter-node and intra-node communication – point-to-point and collectives
OpenSHMEM Application Evaluation
[Charts: execution time at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA) and MV2X-SHMEM on the Heat Image and Daxpy applications]
• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2DHeat Image at 512 processes (sec):
- UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169
• Execution time for DAXPY at 512 processes (sec):
- UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2
J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries
on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM’14), March 2014
HPCAC Switzerland Conference (Mar '15)
HPCAC Switzerland Conference (Mar '15) 55
Hybrid MPI+OpenSHMEM Graph500 Design Execution Time
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,
International Supercomputing Conference (ISC’13), June 2013
[Charts: billions of traversed edges per second (TEPS) for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM) Graph500 designs: weak scalability over Scale 26-29 and strong scalability over 1,024-8,192 processes; execution time at 4K-16K processes, where the Hybrid design is 7.6X-13X faster than MPI-Simple]
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design
• 8,192 processes - 2.4X improvement over MPI-CSR
- 7.6X improvement over MPI-Simple
• 16,384 processes - 1.5X improvement over MPI-CSR
- 13X improvement over MPI-Simple
56
Hybrid MPI+OpenSHMEM Sort Application Execution Time
Strong Scalability
• Performance of Hybrid (MPI+OpenSHMEM) Sort Application
• Execution Time
- 4TB Input size at 4,096 cores: MPI – 2408 seconds, Hybrid: 1172 seconds
- 51% improvement over MPI-based design
• Strong Scalability (configuration: constant input size of 500GB)
- At 4,096 cores: MPI – 0.16 TB/min, Hybrid – 0.36 TB/min
- 55% improvement over MPI based design
[Charts: execution time (seconds) for input sizes of 500 GB-4 TB at 512-4,096 processes and sort rate (TB/min) vs. number of processes, MPI vs. Hybrid designs, showing the 51% and 55% improvements noted above]
HPCAC Switzerland Conference (Mar '15)
J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. K. Panda, Designing Scalable Out-of-core
Sorting with Hybrid MPI+PGAS Programming Models, PGAS’14, Oct 2014
MVAPICH2-X Support of OpenUH CAF
• Support for multiple networks through different conduits: the MVAPICH2-X conduit is available in the MVAPICH2-X release and supports CAF and hybrid MPI+CAF on InfiniBand
• Optimized collective designs
[Diagram: CAF applications → compiler-generated code → OpenUH LibCAF → GASNet, with MPI, MVAPICH2-X and IBV conduits, plus a collective-optimization component]
J. Lin, K. Hamidouche, X. Lu, M. Li and D. K. Panda, High-performance Co-array Fortran support with MVAPICH2-X:
Initial experience and evaluation, HIPS’15
57
HPCAC Switzerland Conference (Mar '15) 58
Performance Evaluations for One-sided Communication
[Charts: Put and Get bandwidth vs. message size and latency (ms) for GASNet-IBV, GASNet-MPI and MV2X (UH CAF test-suite); NAS-CAF class D execution times for bt, cg, ep, ft, mg and sp at 256 processes]
• Micro-benchmark improvement (MV2X vs. GASNet-IBV, UH CAF test-suite)
– Put bandwidth: 3.5X improvement on 4KB; Put latency: reduce 29% on 4B
• Application performance improvement (NAS-CAF one-sided implementation)
– Reduce the execution time by 12% (BT.D.256), 18% (EP.D.256), 9% (SP.D.256)
HPCAC Switzerland Conference (Mar '15) 59
Performance Evaluations for Collective Communication
[Charts: CO_REDUCE and CO_BROADCAST bandwidth on 64 cores vs. message size (4 bytes-2 MB) for GASNet-IBV, GASNet-MPI, MV2X-Default, MV2X-UCR and MV2X-Hybrid]
• Bandwidth improvement (MV2X-Hybrid vs. GASNet-IBV, UH CAF test-suite)
– CO_REDUCE: 2.3X improvement on 1KB, 1.1X improvement on 128KB
– CO_BROADCAST: 4.0X improvement on 1KB, 1.3X improvement on 1MB
– CAF support is available in MVAPICH2-X 2.1rc2
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely low memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
– Topology-aware
– Power-aware
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Virtualization
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
60 HPCAC Switzerland Conference (Mar '15)
• Virtualization has many benefits
– Job migration
– Compaction
• Not very popular in HPC due to overhead associated with Virtualization
• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters
• Initial designs of MVAPICH2 with SR-IOV support with Openstack
• Will be available publicly soon
Can HPC and Virtualization be Combined?
61 HPCAC Switzerland Conference (Mar '15)
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based
Virtualized InfiniBand Clusters? EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X .Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC
Clouds, CCGrid’15
• Native vs. SR-IOV at basic IB level
• Compared to native latency, 0-24% overhead
• Compared to native bandwidth, 0-17% overhead
Inter-node Inter-VM pt2pt Latency Inter-node Inter-VM pt2pt Bandwidth
IB-Level Performance – Inter-node Pt2Pt Latency & Bandwidth
[Charts: inter-node inter-VM point-to-point latency and bandwidth for IB-Native vs. IB-SR-IOV, 2 bytes-2 MB, with per-point overheads between 0% and 24% annotated]
MPI-Level Performance – Inter-node Pt2Pt Latency & Bandwidth
• EC2 C3.2xlarge instances
• Similar performance with MV2-SR-IOV-Def
• Compared to Native, similar overhead as basic IB level
• Compared to EC2, up to 29X and 16X performance speedup on Latency &
Bandwidth
[Charts: inter-node inter-VM point-to-point latency and bandwidth for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, MV2-Native and MV2-EC2, 1 byte-4 MB]
• EC2 C3.2xlarge instances
• Compared to MV2-SR-IOV-Def, up to 84% and 158% performance improvement on
Latency & Bandwidth
• Compared to Native, 3-7% overhead for Latency, 3-8% overhead for Bandwidth
• Compared to EC2, up to 160X and 28X performance speedup on Latency & Bandwidth
Intra-node Inter-VM pt2pt Latency Intra-node Inter-VM pt2pt Bandwidth
MPI-Level Performance – Intra-node Pt2Pt Latency & Bandwidth
[Charts: intra-node inter-VM point-to-point latency and bandwidth for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, MV2-Native and MV2-EC2, 1 byte-4 MB]
[Charts: Graph500 execution time for problem sizes (scale, edgefactor) from (20,10) to (22,20) and NAS Class C benchmarks (FT, EP, LU, BT with 64 processes) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]
• Compared to Native, 1-9% overhead for NAS
• Compared to Native, 4-9% overhead for Graph500
HPCAC Switzerland Conference (Mar '15) 65
Application-Level Performance (8 VM * 8 Core/VM)
NAS Graph500
HPCAC Switzerland Conference (Mar '15)
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
• Large-scale instrument
– Targeting Big Data, Big Compute, Big Instrument research
– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
• Reconfigurable instrument
– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use
• Connected instrument
– Workload and Trace Archive
– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
– Partnerships with users
• Complementary instrument
– Complementing GENI, Grid’5000, and other testbeds
• Sustainable instrument
– Industry connections
http://www.chameleoncloud.org/
66
• Performance and memory scalability toward 900K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Enhanced optimization for GPGPU and coprocessor support
– Extending the GPGPU support (GPU-Direct RDMA) with CUDA 7.0 and beyond
– Support for Intel MIC (Knights Landing)
• Taking advantage of the Collective Offload framework
– Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware communication
• Support for MPI Tools Interface (as in MPI 3.0)
• Checkpoint-Restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and Accelerators
• Virtualization
MVAPICH2 – Plans for Exascale
67 HPCAC Switzerland Conference (Mar '15)
• Exascale systems will be constrained by
– Power
– Memory per core
– Data movement cost
– Faults
• Programming Models and Runtimes need to be designed for
– Scalability
– Performance
– Fault-resilience
– Energy-awareness
– Programmability
– Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Looking into the Future ….
68 HPCAC Switzerland Conference (Mar '15)
• Accelerating Big Data Processing with Hadoop, Spark and
Memcached
• (March 25th, 10:45-11:30am)
69
More Details on Accelerating Big Data will be covered ..
HPCAC Switzerland Conference (Mar '15)
70
Call For Papers & Participation
International Workshop on Communication
Architectures at Extreme Scale (Exacomm 2015)
In conjunction with International Supercomputing
Conference (ISC 2015)
Frankfurt, Germany, July 16th, 2015
http://web.cse.ohio-state.edu/~subramon/ExaComm15/exacomm15.html
HPCAC Switzerland Conference (Mar '15)
Funding Acknowledgments
Funding Support by
Equipment Support by
71
HPCAC Switzerland Conference (Mar '15)
Personnel Acknowledgments
72
Current Students
– A. Awan (Ph.D.)
– A. Bhat (M.S.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist – S. Sur
Current Post-Doc
– J. Lin
Current Programmer
– J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– M. Li (Ph.D.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associates
– K. Hamidouche
– X. Lu
Past Programmers
– D. Bureddy
– H. Subramoni
Current Research Specialist
– M. Arnold
HPCAC Switzerland Conference (Mar '15)
Thank You!
The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/
73
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/