High-Performance and Scalable Designs of Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council China Conference (Nov 2014)
High-End Computing (HEC): PetaFlop to ExaFlop

• Expected to have an ExaFlop system in 2020-2024!

[Figure: projected performance growth, with 100 PFlops expected in 2015 and 1 EFlops in 2018?]
Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
Parallel Programming Models Overview

[Figure: three abstract machine models - Shared Memory Model (SHMEM, DSM) with processes P1-P3 sharing one memory; Distributed Memory Model (MPI - Message Passing Interface) with each process owning its memory; Partitioned Global Address Space (PGAS - Global Arrays, UPC, Chapel, X10, CAF, ...) with per-process memories forming a logical shared memory.]

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
Partitioned Global Address Space (PGAS) Models

• Key features
  - Simple shared-memory abstractions
  - Lightweight one-sided communication (see the sketch below)
  - Easier to express irregular communication
• Different approaches to PGAS
  - Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  - Libraries
    • OpenSHMEM
    • Global Arrays
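To make the "lightweight one-sided communication" point concrete, here is a minimal illustrative sketch (not from the slides), assuming the OpenSHMEM v1.0-style API: each PE allocates a symmetric buffer and writes its rank into its right neighbor's copy with a single put, with no matching receive on the target.

```c
/* Minimal OpenSHMEM (v1.0-style API) sketch: each PE writes its rank into
 * the symmetric buffer of its right neighbor with a one-sided put. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);                       /* initialize the OpenSHMEM runtime */
    int me   = _my_pe();
    int npes = _num_pes();

    /* Symmetric allocation: a remotely accessible buffer exists on every PE */
    int *dest = (int *)shmalloc(sizeof(int));
    *dest = -1;
    shmem_barrier_all();

    int right = (me + 1) % npes;
    shmem_int_put(dest, &me, 1, right); /* one-sided: no receive on the target */
    shmem_barrier_all();

    printf("PE %d received %d\n", me, *dest);
    shfree(dest);
    return 0;
}
```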
Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (a hybrid sketch appears below)
• Benefits:
  – Best of the Distributed Memory Computing Model
  – Best of the Shared Memory Computing Model
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"

[Figure: an HPC application decomposed into kernels 1..N, with some kernels (e.g., Kernel 2 and Kernel N) re-written in PGAS while the remaining kernels stay in MPI.]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420.
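As a concrete, hypothetical illustration of the model (not code from the talk), the sketch below mixes both runtimes in one program: an MPI collective stands in for an "MPI kernel" and an OpenSHMEM one-sided put for a "PGAS kernel". It assumes a unified runtime such as MVAPICH2-X in which MPI ranks and OpenSHMEM PEs map one-to-one; initialization order and interoperability details depend on the implementation.

```c
/* Hypothetical hybrid MPI+OpenSHMEM sketch: MPI drives the application while
 * one kernel uses OpenSHMEM one-sided communication over the same processes. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);              /* assumption: OpenSHMEM PEs map 1:1 to MPI ranks */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "MPI kernel": a regular collective */
    int local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* "PGAS kernel": irregular, one-sided update of a neighbor's symmetric buffer */
    int *flag = (int *)shmalloc(sizeof(int));
    *flag = 0;
    shmem_barrier_all();
    shmem_int_put(flag, &sum, 1, (rank + 1) % size);
    shmem_barrier_all();

    printf("Rank %d: allreduce sum = %d, neighbor flag = %d\n", rank, sum, *flag);
    shfree(flag);
    MPI_Finalize();
    return 0;
}
```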
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: layered software stack - Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.); middleware in the form of a Communication Library or Runtime for Programming Models, covering point-to-point communication (two-sided and one-sided), collective communication, synchronization and locks, I/O and file systems, and fault tolerance; Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni Scale); Multi/Many-core Architectures; and Accelerators (NVIDIA and MIC). Co-design opportunities and challenges across these layers target performance, scalability, and fault-resilience.]
Challenges in Designing (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …)
• Virtualization
MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for GPGPUs and MIC
  – Used by more than 2,250 organizations in 74 countries
  – More than 225,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th ranked 519,640-core cluster (Stampede) at TACC
    • 13th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • 23rd ranked 96,192-core cluster (Pleiades) at NASA, and many others
  – Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
• Virtualization
Latency & Bandwidth: MPI over IB with MVAPICH2

[Figure: small-message latency and unidirectional bandwidth vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR. Small-message latencies range from 0.99 us to 1.82 us; peak unidirectional bandwidths range from about 1,706 MB/s (DDR) to 12,810 MB/s (ConnectIB Dual-FDR).]

Platforms: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR - 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch.
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

[Figure: intra-node latency and bandwidth vs. message size on an Intel Ivy Bridge node with the latest MVAPICH2 2.1a. Small-message latency: 0.18 us intra-socket, 0.45 us inter-socket. Bandwidth for the CMA, shared-memory, and LiMIC channels peaks at roughly 13,749-14,250 MB/s across the intra-socket and inter-socket cases.]
MPI-3 RMA Get/Put with Flush Performance

[Figure: inter-node and intra-socket Get/Put latency and bandwidth vs. message size with the latest MVAPICH2 2.1a on Intel Sandy Bridge with Connect-IB (single port). Inter-node small-message Get/Put latencies of about 1.56-2.04 us; intra-socket small-message latency of 0.08 us; inter-node Get/Put bandwidth of about 6,876-6,881 MB/s; intra-socket Get/Put bandwidth of about 14,926-15,364 MB/s.]
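The numbers above measure one-sided Get/Put completed with a flush. The following minimal sketch (an illustration, not the benchmark code) shows the MPI-3 pattern being exercised: a passive-target lock, an MPI_Put, and MPI_Win_flush to force remote completion inside the epoch.

```c
/* Minimal MPI-3 RMA sketch: passive-target Put completed with MPI_Win_flush.
 * Hypothetical illustration of the pattern measured on the slide above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double buf = 0.0;
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int target = (rank + 1) % size;
    double val = (double)rank;

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);   /* passive-target epoch */
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);                      /* remote completion of the Put */
    MPI_Win_unlock(target, win);

    MPI_Barrier(MPI_COMM_WORLD);

    /* Synchronize the public and private window copies before the local read */
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
    printf("Rank %d: window value = %.0f\n", rank, buf);
    MPI_Win_unlock(rank, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```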
Minimizing Memory Footprint with XRC and Hybrid Mode

• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
• Both UD and RC/XRC have benefits
• Hybrid mode gives the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC - default; UD - MV2_USE_ONLY_UD=1; Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1

[Figure: memory per process vs. number of connections (1 to 16K) for MVAPICH2-RC vs. MVAPICH2-XRC; NAMD (1024 cores) normalized time for the apoa1, er-gre, f1atpase, and jac datasets with RC vs. XRC; and time vs. number of processes (128-1024) for the UD, Hybrid, and RC modes, with annotated improvements of 26%, 40%, 30%, and 38%.]

M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08.
Dynamic Connected (DC) Transport in MVAPICH2

[Figure: processes P0-P7 on four nodes communicating over the IB network through DC initiator/target objects.]

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Initial study done in MVAPICH2
• DCT support available in Mellanox OFED

[Figure: NAMD (apoa1, large dataset) normalized execution time for RC, DC-Pool, UD, and XRC at 160-620 processes, and per-process connection memory for Alltoall at 80-640 processes, where RC grows from roughly 10 KB to 97 KB and XRC from roughly 1 KB to 5 KB, while DC-Pool (1-2 KB) and UD (about 10 KB) remain nearly constant.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14).
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
• Virtualization
MVAPICH2-GPU: CUDA-Aware MPI

• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks

[Figure: node with CPU, chipset, GPU, and GPU memory attached to an InfiniBand HCA.]
CUDA-Aware MPI: MVAPICH2 1.8-2.1 Releases

• Support for MPI communication from NVIDIA GPU device memory (illustrated in the sketch below)
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
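To show what CUDA-awareness means at the application level, here is a small hypothetical sketch: device pointers from cudaMalloc are passed directly to MPI_Send/MPI_Recv, and a CUDA-aware build of MVAPICH2 (run with MV2_USE_CUDA=1) moves the data through the host-based pipeline or GPUDirect RDMA internally. With a non-CUDA-aware library, the same transfer would require explicit cudaMemcpy staging through host buffers.

```c
/* Hypothetical CUDA-aware MPI sketch: pass GPU device pointers directly to MPI.
 * Run with at least 2 ranks and a CUDA-aware MPI library (e.g., MVAPICH2-GDR). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));   /* buffer in GPU device memory */
    cudaMemset(d_buf, 0, n * sizeof(float));

    if (rank == 0) {
        /* Device pointer handed straight to MPI; no cudaMemcpy to the host */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank < 2)
        printf("Rank %d done (GPU-GPU transfer of %d floats)\n", rank, n);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```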
Performance of MVAPICH2 with GPU-Direct RDMA: Latency

[Figure: GPU-GPU inter-node MPI small-message latency vs. message size (0 bytes to 2 KB) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR. MV2-GDR 2.0 achieves 2.18 us small-message latency, about 71% better than MV2-GDR 2.0b and 90% better than MV2 without GDR.]

Platform: MVAPICH2-GDR 2.0, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.
Performance of MVAPICH2 with GPU-Direct RDMA: Bandwidth

[Figure: GPU-GPU inter-node MPI unidirectional bandwidth vs. message size (1 byte to 4 KB) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR. For small messages, MV2-GDR 2.0 delivers about 10x the bandwidth of MV2 without GDR and 2.2x that of MV2-GDR 2.0b.]

Platform: MVAPICH2-GDR 2.0, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.
Performance of MVAPICH2 with GPU-Direct RDMA: Bi-Bandwidth

[Figure: GPU-GPU inter-node MPI bi-directional bandwidth vs. message size (1 byte to 4 KB) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR. For small messages, MV2-GDR 2.0 delivers about 10x the bi-bandwidth of MV2 without GDR and 2x that of MV2-GDR 2.0b.]

Platform: MVAPICH2-GDR 2.0, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.
Performance of MVAPICH2 with GPU-Direct RDMA: MPI-3 RMA

• MPI-3 RMA provides flexible synchronization and completion primitives

[Figure: GPU-GPU inter-node MPI_Put (device-to-device) small-message latency vs. message size (0 bytes to 8 KB) for MV2-GDR 2.0 and MV2-GDR 2.0b. MV2-GDR 2.0 achieves 2.88 us latency, an 83% improvement.]

Platform: MVAPICH2-GDR 2.0, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.
Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• MV2-GDR 2.0 (released on 08/23/14): try it out!
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average TPS for HOOMD-blue on 4-64 GPU nodes - weak scalability with 2K particles/GPU and strong scalability with 256K particles.]
Using the MVAPICH2-GPUDirect Version

• MVAPICH2-2.0 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 2.1 or later
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 6.0 or later
  – Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
  – Strongly recommended: use the new GDRCOPY module from NVIDIA (https://github.com/drossetti/gdrcopy)
• Has optimized designs for point-to-point communication using GDR
• Contact the MVAPICH help list with any questions related to the package
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
• Virtualization
MPI Applications on MIC Clusters

• Flexibility in launching MPI jobs on clusters with Xeon Phi

[Figure: spectrum from multi-core-centric to many-core-centric execution - Host-only (MPI program on the Xeon only), Offload / reverse offload (MPI program on the host with computation offloaded to the Xeon Phi), Symmetric (MPI programs on both the Xeon and the Xeon Phi), and Coprocessor-only (MPI program on the Xeon Phi only).]
MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload Mode
• Intra-node Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Inter-node Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Multi-MIC Node Configurations
MIC-Remote-MIC P2P Communication

[Figure: intra-socket and inter-socket MIC-to-remote-MIC point-to-point performance - large-message latency (8 KB to 2 MB) and bandwidth (1 byte to 1 MB), with peak bandwidths of about 5,236 MB/s and 5,594 MB/s.]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Figure: latency of MV2-MIC vs. MV2-MIC-Opt for 32-node Allgather with 16 hosts + 16 MICs (small messages, 1 byte to 1 KB) and with 8 hosts + 8 MICs (large messages, 8 KB to 1 MB); 32-node Alltoall with 8 hosts + 8 MICs (large messages, 4 KB to 512 KB); and P3DFFT execution time (communication vs. computation) on 32 nodes (8H + 8M) with a 2K x 2K x 1K problem size. The optimized designs show improvements of up to 76%, 58%, and 55%.]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014.
Latest Status on MVAPICH2-MIC

• Running on three major systems
  – Stampede: module swap mvapich2 mvapich2-mic/20130911
  – Blueridge (Virginia Tech): module swap mvapich2 mvapich2-mic/1.9
  – Beacon (UTK): module unload intel-mpi; module load mvapich2-mic/1.9
• A new version based on MVAPICH2 2.0 is being worked on and will be available soon
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
• Virtualization
MVAPICH2-X for Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, and hybrid (MPI + PGAS) applications issue OpenSHMEM, MPI, and UPC calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP.]

• Unified communication runtime for MPI, UPC, and OpenSHMEM, available with MVAPICH2-X 1.9 onwards!
  – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant
  – Scalable inter-node and intra-node communication - point-to-point and collectives
OpenSHMEM Application Evaluation

[Figure: execution time of Heat Image and Daxpy at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA), and MV2X-SHMEM.]

• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2D Heat Image at 512 processes (sec):
  - UH-SHMEM - 523, OMPI-SHMEM - 214, Scalable-SHMEM - 193, MV2X-SHMEM - 169
• Execution time for DAXPY at 512 processes (sec):
  - UH-SHMEM - 57, OMPI-SHMEM - 56, Scalable-SHMEM - 9.2, MV2X-SHMEM - 5.2

J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM '14), March 2014.
Hybrid MPI+OpenSHMEM Graph500 Design

[Figure: Graph500 performance of the MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs - traversed edges per second (TEPS) for strong scalability (scale 26-29) and weak scalability (1,024-8,192 processes), and execution time at 4K-16K processes with annotated 7.6X and 13X improvements for the hybrid design.]

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes
    • 2.4X improvement over MPI-CSR
    • 7.6X improvement over MPI-Simple
  – 16,384 processes
    • 1.5X improvement over MPI-CSR
    • 13X improvement over MPI-Simple

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.
Hybrid MPI+OpenSHMEM Sort Application

[Figure: execution time (seconds) for MPI vs. Hybrid at the 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K input/process configurations, and strong-scalability sort rate (TB/min) at 512-4,096 processes with a constant 500GB input.]

• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
• Execution time
  – 4TB input size at 4,096 cores: MPI - 2408 seconds, Hybrid - 1172 seconds
  – 51% improvement over the MPI-based design
• Strong scalability (configuration: constant input size of 500GB)
  – At 4,096 cores: MPI - 0.16 TB/min, Hybrid - 0.36 TB/min
  – 55% improvement over the MPI-based design

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. K. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS '14, Oct 2014.
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime
• Virtualization
Can HPC and Virtualization be Combined?

• Virtualization has many benefits
  – Job migration
  – Compaction
• Not very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters
• Initial designs of MVAPICH2 with SR-IOV support with OpenStack
• Will be available publicly soon
Intra-node Inter-VM Point-to-Point Latency and Bandwidth

• 1 VM per core
• MVAPICH2-SR-IOV-IB brings only 3-7% (latency) and 3-8% (bandwidth) overheads compared to MVAPICH2 over native InfiniBand verbs (MVAPICH2-Native-IB)

[Figure: latency (us) and bandwidth (MB/s) vs. message size (1 byte to 1 MB) for MVAPICH2-SR-IOV-IB and MVAPICH2-Native-IB.]
Performance Evaluation with NAS and Graph500

• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
• MVAPICH2-SR-IOV-IB brings 4-8% and 4-9% overheads for the NAS benchmarks and Graph500, respectively, compared to MVAPICH2-Native-IB

[Figure: execution time of MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB for the NAS class C benchmarks (BT, MG, LU, EP, SP, FT at 64 processes) and Graph500 problem sizes (20,10) through (24,10).]
Performance Evaluation with LAMMPS

• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
• MVAPICH2-SR-IOV-IB brings 7% and 9% overheads for the LJ and CHAIN benchmarks in LAMMPS, respectively, compared to MVAPICH2-Native-IB

[Figure: execution time of MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB for LJ-64-scaled and CHAIN-64-scaled.]

J. Zhang, X. Lu, J. Jose, R. Shi and Dhabaleswar K. (DK) Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar 2014, August 2014.

J. Zhang, X. Lu, J. Jose, R. Shi, M. Li and Dhabaleswar K. (DK) Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC '14, Dec. 2014.
MVAPICH2/MVAPICH2-X – Plans for Exascale

• Performance and memory scalability toward 900K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Enhanced optimization for GPGPU and coprocessor support
  – Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.5 and beyond
  – Support for Intel MIC (Knights Landing)
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and accelerators
• High-performance virtualization support
Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Hadoop (HDFS, HBase, MapReduce)
  – Spark is emerging for in-memory computing
  – Memcached is also used for Web 2.0
Can High-Performance Interconnects Benefit Big Data Middleware?

• Most of the current Big Data middleware uses the Ethernet infrastructure with sockets
• Concerns for performance and scalability
• Usage of high-performance networks is beginning to draw interest from many companies
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
• Can HPC clusters with high-performance networks be used for Big Data middleware?
• Initial focus: Hadoop, HBase, Spark, and Memcached
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

[Figure: layered stack - Applications; Big Data middleware (HDFS, MapReduce, HBase, Spark, and Memcached), with possible upper-level changes; programming models (Sockets, with other protocols such as RDMA as alternatives); a communication and I/O library covering point-to-point communication, threaded models and synchronization, QoS, fault-tolerance, I/O and file systems, virtualization, and benchmarks; networking technologies (InfiniBand, 1/10/40GigE, and intelligent NICs); commodity computing system architectures (multi- and many-core architectures and accelerators); and storage technologies (HDD and SSD).]
Design Overview of HDFS with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges the Java-based HDFS with the communication library written in native code
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support

[Figure: HDFS applications either go through the Java socket interface to a 1/10 GigE or IPoIB network, or (for write and other operations in the OSU design) through the Java Native Interface (JNI) and verbs to RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...).]
Communication Times in HDFS

• Cluster with HDD DataNodes
  – 30% improvement in communication time over IPoIB (QDR)
  – 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes

[Figure: HDFS communication time (s) vs. file size (2GB-10GB) for 10GigE, IPoIB (QDR), and OSU-IB (QDR); OSU-IB reduces communication time by 30% over IPoIB.]

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014.
Design Overview of MapReduce with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges the Java-based MapReduce with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support

[Figure: MapReduce components (JobTracker, TaskTracker, Map, Reduce) either go through the Java socket interface to a 1/10 GigE or IPoIB network, or through JNI and verbs (OSU design) to RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...).]
Performance Evaluation of Sort and TeraSort

• For a 240GB Sort on 64 nodes (512 cores)
  – 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For a 320GB TeraSort on 64 nodes (1K cores)
  – 38% improvement over IPoIB (FDR) with HDD used for HDFS

[Figure: job execution time of IPoIB, UDA-IB, and OSU-IB - Sort on the OSU cluster (QDR) at 60/120/240 GB on 16/32/64 nodes, and TeraSort on TACC Stampede (FDR) at 80/160/320 GB on 16/32/64 nodes.]
Design Overview of Spark with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges the Scala-based Spark with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking and out-of-order data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014.
Preliminary Results of the Spark-RDMA Design - GroupBy

• Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
  – 18% improvement over IPoIB (QDR) for 10GB data size
• Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
  – 20% improvement over IPoIB (QDR) for 20GB data size

[Figure: GroupBy time (sec) vs. data size for 10GigE, IPoIB, and RDMA - 4-10 GB on the 4-node/32-core cluster and 8-20 GB on the 8-node/64-core cluster.]
The High-Performance Big Data (HiBD) Project

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
• User base: 77 organizations from 13 countries
• RDMA for Apache HBase and Spark
Future Plans of the OSU High-Performance Big Data (HiBD) Project

• Upcoming releases of RDMA-enhanced packages will support
  – Hadoop 2.x MapReduce & RPC
  – Spark
  – HBase
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
  – HDFS
  – MapReduce
  – RPC
• Advanced designs with upper-level changes and optimizations
  – e.g., MEM-HDFS
Concluding Remarks

• InfiniBand with the RDMA feature is gaining momentum in HPC systems, with the best performance and growing usage
• As the HPC community moves to Exascale, new solutions are needed in the MPI and hybrid MPI+PGAS stacks to support GPUs, accelerators, and virtualization
• Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X, and their performance benefits
• New solutions are also needed to re-design software libraries for Big Data environments to take advantage of RDMA
• Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems
Additional Challenges and Results will be Presented in ...

• Keynote Talk, Accelerating Big Data Processing on Modern HPC Clusters
  – Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE) Workshop (Nov 6th, 2:05-2:55pm)
• Keynote Talk, Challenges in Designing Communication Libraries for Exascale Computing Systems
  – HPC China Conference (Nov 7th, 8:30-9:00am)
• Keynote Talk, Networking Technologies for Exascale Computing Systems: Opportunities and Challenges
  – HPC China Interconnect (Nov 7th, 2:00-2:30pm)
Funding Acknowledgments

• Funding support by: (sponsor logos on the original slide)
• Equipment support by: (equipment vendor logos on the original slide)
Personnel Acknowledgments
Current Students
– A. Bhat (M.S.)
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist – S. Sur
Current Post-Docs
– K. Hamidouche
– J. Lin
Current Programmers
– M. Arnold
– J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associate
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2/MVAPICH2-X Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Multiple Positions Available in My Group

• Looking for bright and enthusiastic personnel to join as
  – Post-Doctoral Researchers (博士后)
  – PhD Students (博士研究生)
  – MPI Programmer/Software Engineer (MPI软件开发工程师)
  – Hadoop/Big Data Programmer/Software Engineer (大数据软件开发工程师)
• If interested, please contact me at this conference and/or send an e-mail to [email protected] or [email protected]
![Page 59: High-Performance and Scalable Designs of Programming ......High-End Computing (HEC): PetaFlop to ExaFlop HPCAC China, Nov. '14 2 Expected to have an ExaFlop system in 2020-2024! 100](https://reader034.vdocuments.us/reader034/viewer/2022051905/5ff6ebbc3847ad2d7d5b24d1/html5/thumbnails/59.jpg)
59
Thanks, Q&A?
谢谢,请提问?
HPCAC China, Nov. '14