achieving high performance on aws hpc cloud using...

© 2019 © 2019

The Ohio State [email protected]

http://www.cse.ohio-state.edu/~panda

11/19/19

Achieving High Performance on AWS HPC Cloud using MVAPICH2-AWS

Presenter: Dhabaleswar K (DK) Panda

mailto:[email protected]

© 2019

Agenda

• Overview of the MVAPICH2 Project• Overview of Amazon Elastic Fabric Adapter (EFA)• Designing MVAPICH2 for EFA• Performance Results• Software Release and Future Plans

© 2019

Overview of the MVAPICH2 MPI Library ProjectHigh Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

• MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (SC ‘02)• MVAPICH2-X (MPI + PGAS), Available since 2011• Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014• Support for Virtualization (MVAPICH2-Virt), Available since 2015• Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

• Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

• Used by more than 3,050 organizations in 89 countries• More than 615,000 (> 0.6 million) downloads from the OSU site directly• Empowering many TOP500 clusters (June ‘19 ranking)

• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 5th, 448, 448 cores (Frontera) at TACC

• 8th, 391,680 cores (ABCI) in Japan

• 15th, 570,020 cores (Neurion) in South Korea and many others

• Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

• http://mvapich.cse.ohio-state.eduEmpowering Top500 systems for over a decade Partner in the 5th ranked TACC Frontera System

http://mvapich.cse.ohio-state.edu/

© 2019

0

100000

200000

300000

400000

500000

600000Se

p-04

Feb-

05Ju

l-05

Dec

-05

May

-06

Oct

-06

Mar

-07

Aug-

07Ja

n-08

Jun-

08N

ov-0

8Ap

r-09

Sep-

09Fe

b-10

Jul-1

0D

ec-1

0M

ay-1

1O

ct-1

1M

ar-1

2Au

g-12

Jan-

13Ju

n-13

Nov

-13

Apr-1

4Se

p-14

Feb-

15Ju

l-15

Dec

-15

May

-16

Oct

-16

Mar

-17

Aug-

17Ja

n-18

Jun-

18N

ov-1

8Ap

r-19

Num

ber o

f Dow

nloa

ds

Timeline

MV

0.9.

4

MV2

0.9

.0

MV2

0.9

.8

MV2

1.0

MV

1.0

MV2

1.0.

3

MV

1.1

MV2

1.4

MV2

1.5

MV2

1.6

MV2

1.7

MV2

1.8

MV2

1.9

MV2

-GD

R 2.

0b

MV2

-MIC

2.0

MV2

-GD

R 2

.3.2

MV2

-X2.

3rc

2

MV2

Virt

2.2

MV2

2.3

.2

OSU

INAM

0.9

.3

MV2

-Azu

re 2

.3.2

MV2

-AW

S2.

3

MVAPICH2 Release Timeline and Downloads

© 2019

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models

Message Passing Interface(MPI)

PGAS(UPC, OpenSHMEM, CAF, UPC++)

Hybrid --- MPI + X(MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication RuntimeDiverse APIs and Mechanisms

Point-to-point

Primitives

Collectives Algorithms

Energy-

Awareness

Remote Memory Access

I/O and

File Systems

Fault

ToleranceVirtualization Active

MessagesJob Startup

Introspection & Analysis

Support for Modern Networking Technology(InfiniBand, iWARP, RoCE, Omni-Path, EFA)

Support for Modern Multi-/Many-core Architectures(Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)

Transport Protocols Modern Features

RC XRC UD DC UMR ODPSR-IOV

Multi Rail

Transport MechanismsShared

Memory CMA IVSHMEM

Modern Features

NVLink CAPI*XPMEM

© 2019

Agenda


© 2019

Amazon Elastic Fabric Adapter (EFA)

• Enhanced version of Elastic Network Adapter (ENA)• Allows OS bypass, up to 100 Gbps bandwidth• Network aware multi-path routing• Exposed through libibverbs and libfabric interfaces• Introduces new Queue-Pair (QP) type

• Scalable Reliable Datagram (SRD)• Also supports Unreliable Datagram (UD)• No support for Reliable Connected (RC)

© 2019

Comparison with IB Transport Types and Trade-offs

Attribute ReliableConnection

ReliableDatagram

Dynamic Connected

ScalableReliable

Datagram (SRD)

UnreliableConnection

UnreliableDatagram

RawDatagram

Scalability(M processes, N

nodes)

M2N QPs per HCA

M QPs per HCA

M QPs per HCA

M QPs per HCA

M2N QPs per HCA

M QPs per HCA

1 QP per HCA

Rel

iabi

lity

Corrupt data detected Yes

Data DeliveryGuarantee Data delivered exactly once No guarantees

Data Order Guarantees Per connection

One source to multiple

destinationsPer connection No

Unordered, duplicate data

detectedNo No

Data Loss Detected Yes No No

Error Recovery

Errors (retransmissions, alternate path, etc.) handled by transport layer. Client only involved in handling fatal errors (links broken,

protection violation, etc.)

Errors are reported to responder

None None

© 2019

Scalable Reliable Datagrams (SRD): Features & Limitations

• Similar to IB Reliable Datagram– No limit on number of outstanding messages

per context

• Out of order delivery– No head-of-line blocking

– Bad fit for MPI, can suit other workloads

• Packet spraying over multiple ECMP paths– No hotspots

– Fast and transparent recovery from network failures

• Congestion control designed for large scale– Minimize jitter and tail latency

© 2019

Verbs level evaluation of EFA performance

• SRD adds 8-10% overhead compared to UD • Due to hardware based acks used for reliability

• Instance type: c5n.18xlarge• CPU: Intel Xeon Platinum 8124M @ 3.00GHz

© 2019

Agenda


© 2019

Designing MPI libraries for EFAP1 P2 P3

Shared Memory

P1 P2 P3

Memory Memory Memory

P1 P2 P3

Memory Memory Memory

Shared Memory ModelSHMEM, DSM

Distributed Memory Model MPI (Message Passing Interface)

Partitioned Global Address Space (PGAS)OpenSHMEM, UPC, UPC++, CAF …

Logical shared memory

• MPI offer various communication primitives• Point-to-point, Collective, Remote Memory Access• Provides strict guarantees about reliability and ordering• Allows message sizes much larger than allowed by the network

• How to address these semantic mismatches between the network and programming model in a scalable and high-performance manner?

© 2019

Handled three Major Challenges

• Reliable and in-order delivery• Zero-copy transmission of large messages• Handling out-of-order packets for zero-copy transfers

S. Chakraborty, S. Xu, H. Subramoni and D. K. Panda, Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Adapter, Hot Interconnect, 2019

© 2019

Agenda


© 2019

Experimental Setup

• Instance type: c5n.18xlarge• CPU: Intel Xeon Platinum 8124M @ 3.00GHz• Cores: 2 Sockets, 18 cores / socket• KVM Hypervisor, 192 GB RAM, One EFA adapter / node• MVAPICH2 version: Latest MVAPICH2-X + SRD support• OpenMPI version: Open MPI v4.0.2 with libfabric 1.8• OMB version: OSU Micro-Benchmarks 5.6.2

© 2019

Point-to-Point Performance

• Up to 1.8x better than OpenMPI on large message sizes

0

5

10

15

20

25

0 2 8 32 128 512 2K

Late

ncy

(mic

rose

cond

s)

Message Size (Bytes)

Inter-node Latency – Small

MVAPICH2-XOpenMPI-4.0.2

0

500

1000

1500

2000

2500

8K 32K 128K 512K 2M

Late

ncy

(mic

rose

cond

s)


Inter-node Latency – Large

MVAPICH2-X

OpenMPI-4.0.2

© 2019

Collectives (4-Node-128-Processes)

1

10

100

1000

10000

100000

Late

ncy

(mic

rose

cond

s)


osu_bcast

MVAPICH2-X

OpenMPI-4.0.2

1

10

100

1000

10000

100000

Late

ncy

(mic

rose

cond

s)


osu_allreduce

MVAPICH2-X

OpenMPI-4.0.2

• Up to 26x lower all-reduce latency with MVAPICH2-X than with OpenMPI• Up to 15x lower bcast latency with MVAPICH2-X than with OpenMPI

26x better 15x better

© 2019

Collectives (4-Node-128-Processes) Cont’d

1

10

100

1000

10000

100000

Late

ncy

(mic

rose

cond

s)


osu_scatter

MVAPICH2-X

OpenMPI-4.0.2

1

10

100

1000

10000

100000

1000000

Late

ncy

(mic

rose

cond

s)


osu_reduce

MVAPICH2-X

OpenMPI-4.0.2

• Up to 38x lower reduce latency with MVAPICH2-X than with OpenMPI• Up to 18x lower scatter latency with MVAPICH2-X than with OpenMPI

38x better

18x better

© 2019

Collectives (16-Node-256-Processes)

1

10

100

1000

10000

100000

1000000

Late

ncy

(mic

rose

cond

s)


osu_allreduce

MVAPICH2-X

OpenMPI-4.0.2

1

10

100

1000

10000

100000

Late

ncy

(mic

rose

cond

s)


osu_bcast

MVAPICH2-X

OpenMPI-4.0.2

• Up to 87x lower allreduce latency with MVAPICH2-X than with OpenMPI• Up to 5.5x lower bcast latency with MVAPICH2-X than with OpenMPI

87x better

5.5x better

© 2019

Collectives (16-Node-256-Processes) Cont’d

1

10

100

1000

10000

100000

1000000

Late

ncy

(mic

rose

cond

s)


osu_reduce

MVAPICH2-X

OpenMPI-4.0.2

1

10

100

1000

10000

100000

1000000

Late

ncy

(mic

rose

cond

s)


osu_scatterMVAPICH2-XOpenMPI-4.0.2

• Up to 37x lower reduce latency with MVAPICH2-X than with OpenMPI• Up to 83x lower scatter latency with MVAPICH2-X than with OpenMPI

37x better 83x better

© 2019

Application Performance

05

1015202530

72(2x36) 144(4x36) 288(8x36) 576(16x36)

Exec

utio

n Ti

me

(Sec

onds

)

Processes (Nodes X PPN)

CloverLeafMVAPICH2-X

OpenMPI 40% better

• Up to 23% better performance with 3D-stencil on 16 nodes• Up to 10% performance improvement for Cloverleaf on 16 nodes

S. Chakraborty, S. Xu, H. Subramoni and D. K. Panda, Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Adapter, Hot Interconnect, 2019

0

200

400

600

800

1000

1200

1 4 16 64 256 1K 4K

Exec

utio

n tim

e (u

s)


3DStencil – 16 nodes

MVAPICH2-XOpenMPI

© 2019

Agenda


© 2019

Software Release and Future Plans• MVAPICH2-X for AWS 2.3 released on 04/12/2019

• Includes support for SRD and XPMEM based transports

• Available for download from http://mvapich.cse.ohio-state.edu/downloads/

• Detailed User Guide: http://mvapich.cse.ohio-state.edu/userguide/mv2x-aws/

• Working on

• Additional optimizations and tuning, a new version will be released soon

• Making it available it in an integrated manner in the AWS portal

• Making it available through AWS Market Place• Commercial Support available for End-Users, ISVs, and Organizations

• Through X-ScaleSolutions (http://x-scalesolutions.com)

http://mvapich.cse.ohio-state.edu/downloads/

http://mvapich.cse.ohio-state.edu/userguide/mv2x-aws/

http://x-scalesolutions.com/

© 2019

Thank You!

Network-Based Computing Laboratoryhttp://nowlab.cse.ohio-state.edu/

The MVAPICH2 Projecthttp://mvapich.cse.ohio-state.edu/

Follow us on

https://twitter.com/mvapich

http://nowlab.cse.ohio-state.edu/

https://twitter.com/mvapich

achieving high performance on aws hpc cloud using...

Documents