achieving high performance on aws hpc cloud using...
TRANSCRIPT
© 2019 © 2019
The Ohio State [email protected]
http://www.cse.ohio-state.edu/~panda
11/19/19
Achieving High Performance on AWS HPC Cloud using MVAPICH2-AWS
Presenter: Dhabaleswar K (DK) Panda
© 2019
Agenda
• Overview of the MVAPICH2 Project• Overview of Amazon Elastic Fabric Adapter (EFA)• Designing MVAPICH2 for EFA• Performance Results• Software Release and Future Plans
© 2019
Overview of the MVAPICH2 MPI Library ProjectHigh Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
• MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (SC ‘02)• MVAPICH2-X (MPI + PGAS), Available since 2011• Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014• Support for Virtualization (MVAPICH2-Virt), Available since 2015• Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
• Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
• Used by more than 3,050 organizations in 89 countries• More than 615,000 (> 0.6 million) downloads from the OSU site directly• Empowering many TOP500 clusters (June ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448, 448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Neurion) in South Korea and many others
• Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
• http://mvapich.cse.ohio-state.eduEmpowering Top500 systems for over a decade Partner in the 5th ranked TACC Frontera System
© 2019
0
100000
200000
300000
400000
500000
600000Se
p-04
Feb-
05Ju
l-05
Dec
-05
May
-06
Oct
-06
Mar
-07
Aug-
07Ja
n-08
Jun-
08N
ov-0
8Ap
r-09
Sep-
09Fe
b-10
Jul-1
0D
ec-1
0M
ay-1
1O
ct-1
1M
ar-1
2Au
g-12
Jan-
13Ju
n-13
Nov
-13
Apr-1
4Se
p-14
Feb-
15Ju
l-15
Dec
-15
May
-16
Oct
-16
Mar
-17
Aug-
17Ja
n-18
Jun-
18N
ov-1
8Ap
r-19
Num
ber o
f Dow
nloa
ds
Timeline
MV
0.9.
4
MV2
0.9
.0
MV2
0.9
.8
MV2
1.0
MV
1.0
MV2
1.0.
3
MV
1.1
MV2
1.4
MV2
1.5
MV2
1.6
MV2
1.7
MV2
1.8
MV2
1.9
MV2
-GD
R 2.
0b
MV2
-MIC
2.0
MV2
-GD
R 2
.3.2
MV2
-X2.
3rc
2
MV2
Virt
2.2
MV2
2.3
.2
OSU
INAM
0.9
.3
MV2
-Azu
re 2
.3.2
MV2
-AW
S2.
3
MVAPICH2 Release Timeline and Downloads
© 2019
Architecture of MVAPICH2 Software Family
High Performance Parallel Programming Models
Message Passing Interface(MPI)
PGAS(UPC, OpenSHMEM, CAF, UPC++)
Hybrid --- MPI + X(MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication RuntimeDiverse APIs and Mechanisms
Point-to-point
Primitives
Collectives Algorithms
Energy-
Awareness
Remote Memory Access
I/O and
File Systems
Fault
ToleranceVirtualization Active
MessagesJob Startup
Introspection & Analysis
Support for Modern Networking Technology(InfiniBand, iWARP, RoCE, Omni-Path, EFA)
Support for Modern Multi-/Many-core Architectures(Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
Transport Protocols Modern Features
RC XRC UD DC UMR ODPSR-IOV
Multi Rail
Transport MechanismsShared
Memory CMA IVSHMEM
Modern Features
NVLink CAPI*XPMEM
© 2019
Agenda
• Overview of the MVAPICH2 Project• Overview of Amazon Elastic Fabric Adapter (EFA)• Designing MVAPICH2 for EFA• Performance Results• Software Release and Future Plans
© 2019
Amazon Elastic Fabric Adapter (EFA)
• Enhanced version of Elastic Network Adapter (ENA)• Allows OS bypass, up to 100 Gbps bandwidth• Network aware multi-path routing• Exposed through libibverbs and libfabric interfaces• Introduces new Queue-Pair (QP) type
• Scalable Reliable Datagram (SRD)• Also supports Unreliable Datagram (UD)• No support for Reliable Connected (RC)
© 2019
Comparison with IB Transport Types and Trade-offs
Attribute ReliableConnection
ReliableDatagram
Dynamic Connected
ScalableReliable
Datagram (SRD)
UnreliableConnection
UnreliableDatagram
RawDatagram
Scalability(M processes, N
nodes)
M2N QPs per HCA
M QPs per HCA
M QPs per HCA
M QPs per HCA
M2N QPs per HCA
M QPs per HCA
1 QP per HCA
Rel
iabi
lity
Corrupt data detected Yes
Data DeliveryGuarantee Data delivered exactly once No guarantees
Data Order Guarantees Per connection
One source to multiple
destinationsPer connection No
Unordered, duplicate data
detectedNo No
Data Loss Detected Yes No No
Error Recovery
Errors (retransmissions, alternate path, etc.) handled by transport layer. Client only involved in handling fatal errors (links broken,
protection violation, etc.)
Errors are reported to responder
None None
© 2019
Scalable Reliable Datagrams (SRD): Features & Limitations
• Similar to IB Reliable Datagram– No limit on number of outstanding messages
per context
• Out of order delivery– No head-of-line blocking
– Bad fit for MPI, can suit other workloads
• Packet spraying over multiple ECMP paths– No hotspots
– Fast and transparent recovery from network failures
• Congestion control designed for large scale– Minimize jitter and tail latency
© 2019
Verbs level evaluation of EFA performance
• SRD adds 8-10% overhead compared to UD • Due to hardware based acks used for reliability
• Instance type: c5n.18xlarge• CPU: Intel Xeon Platinum 8124M @ 3.00GHz
© 2019
Agenda
• Overview of the MVAPICH2 Project• Overview of Amazon Elastic Fabric Adapter (EFA)• Designing MVAPICH2 for EFA• Performance Results• Software Release and Future Plans
© 2019
Designing MPI libraries for EFAP1 P2 P3
Shared Memory
P1 P2 P3
Memory Memory Memory
P1 P2 P3
Memory Memory Memory
Shared Memory ModelSHMEM, DSM
Distributed Memory Model MPI (Message Passing Interface)
Partitioned Global Address Space (PGAS)OpenSHMEM, UPC, UPC++, CAF …
Logical shared memory
• MPI offer various communication primitives• Point-to-point, Collective, Remote Memory Access• Provides strict guarantees about reliability and ordering• Allows message sizes much larger than allowed by the network
• How to address these semantic mismatches between the network and programming model in a scalable and high-performance manner?
© 2019
Handled three Major Challenges
• Reliable and in-order delivery• Zero-copy transmission of large messages• Handling out-of-order packets for zero-copy transfers
S. Chakraborty, S. Xu, H. Subramoni and D. K. Panda, Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Adapter, Hot Interconnect, 2019
© 2019
Agenda
• Overview of the MVAPICH2 Project• Overview of Amazon Elastic Fabric Adapter (EFA)• Designing MVAPICH2 for EFA• Performance Results• Software Release and Future Plans
© 2019
Experimental Setup
• Instance type: c5n.18xlarge• CPU: Intel Xeon Platinum 8124M @ 3.00GHz• Cores: 2 Sockets, 18 cores / socket• KVM Hypervisor, 192 GB RAM, One EFA adapter / node• MVAPICH2 version: Latest MVAPICH2-X + SRD support• OpenMPI version: Open MPI v4.0.2 with libfabric 1.8• OMB version: OSU Micro-Benchmarks 5.6.2
© 2019
Point-to-Point Performance
• Up to 1.8x better than OpenMPI on large message sizes
0
5
10
15
20
25
0 2 8 32 128 512 2K
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
Inter-node Latency – Small
MVAPICH2-XOpenMPI-4.0.2
0
500
1000
1500
2000
2500
8K 32K 128K 512K 2M
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
Inter-node Latency – Large
MVAPICH2-X
OpenMPI-4.0.2
© 2019
Collectives (4-Node-128-Processes)
1
10
100
1000
10000
100000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_bcast
MVAPICH2-X
OpenMPI-4.0.2
1
10
100
1000
10000
100000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_allreduce
MVAPICH2-X
OpenMPI-4.0.2
• Up to 26x lower all-reduce latency with MVAPICH2-X than with OpenMPI• Up to 15x lower bcast latency with MVAPICH2-X than with OpenMPI
26x better 15x better
© 2019
Collectives (4-Node-128-Processes) Cont’d
1
10
100
1000
10000
100000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_scatter
MVAPICH2-X
OpenMPI-4.0.2
1
10
100
1000
10000
100000
1000000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_reduce
MVAPICH2-X
OpenMPI-4.0.2
• Up to 38x lower reduce latency with MVAPICH2-X than with OpenMPI• Up to 18x lower scatter latency with MVAPICH2-X than with OpenMPI
38x better
18x better
© 2019
Collectives (16-Node-256-Processes)
1
10
100
1000
10000
100000
1000000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_allreduce
MVAPICH2-X
OpenMPI-4.0.2
1
10
100
1000
10000
100000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_bcast
MVAPICH2-X
OpenMPI-4.0.2
• Up to 87x lower allreduce latency with MVAPICH2-X than with OpenMPI• Up to 5.5x lower bcast latency with MVAPICH2-X than with OpenMPI
87x better
5.5x better
© 2019
Collectives (16-Node-256-Processes) Cont’d
1
10
100
1000
10000
100000
1000000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_reduce
MVAPICH2-X
OpenMPI-4.0.2
1
10
100
1000
10000
100000
1000000
Late
ncy
(mic
rose
cond
s)
Message Size (Bytes)
osu_scatterMVAPICH2-XOpenMPI-4.0.2
• Up to 37x lower reduce latency with MVAPICH2-X than with OpenMPI• Up to 83x lower scatter latency with MVAPICH2-X than with OpenMPI
37x better 83x better
© 2019
Application Performance
05
1015202530
72(2x36) 144(4x36) 288(8x36) 576(16x36)
Exec
utio
n Ti
me
(Sec
onds
)
Processes (Nodes X PPN)
CloverLeafMVAPICH2-X
OpenMPI 40% better
• Up to 23% better performance with 3D-stencil on 16 nodes• Up to 10% performance improvement for Cloverleaf on 16 nodes
S. Chakraborty, S. Xu, H. Subramoni and D. K. Panda, Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Adapter, Hot Interconnect, 2019
0
200
400
600
800
1000
1200
1 4 16 64 256 1K 4K
Exec
utio
n tim
e (u
s)
Message Size (Bytes)
3DStencil – 16 nodes
MVAPICH2-XOpenMPI
© 2019
Agenda
• Overview of the MVAPICH2 Project• Overview of Amazon Elastic Fabric Adapter (EFA)• Designing MVAPICH2 for EFA• Performance Results• Software Release and Future Plans
© 2019
Software Release and Future Plans• MVAPICH2-X for AWS 2.3 released on 04/12/2019
• Includes support for SRD and XPMEM based transports
• Available for download from http://mvapich.cse.ohio-state.edu/downloads/
• Detailed User Guide: http://mvapich.cse.ohio-state.edu/userguide/mv2x-aws/
• Working on
• Additional optimizations and tuning, a new version will be released soon
• Making it available it in an integrated manner in the AWS portal
• Making it available through AWS Market Place• Commercial Support available for End-Users, ISVs, and Organizations
• Through X-ScaleSolutions (http://x-scalesolutions.com)
© 2019
Thank You!
Network-Based Computing Laboratoryhttp://nowlab.cse.ohio-state.edu/
The MVAPICH2 Projecthttp://mvapich.cse.ohio-state.edu/
Follow us on
https://twitter.com/mvapich