
Page 1:

High-Performance and Scalable Designs of Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Switzerland Conference (2015) by Dhabaleswar K. (DK) Panda

Page 2:

High-End Computing (HEC): PetaFlop to ExaFlop

HPCAC Switzerland Conference (Mar '15) 2

100-200 PFlops in 2016-2017; 1 EFlops in 2020-2024?

Page 3:

HPCAC Switzerland Conference (Mar '15) 3

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Figure: Number of Clusters (0-500) and Percentage of Clusters (0-100%) in the Top500 over time (x-axis: Timeline); clusters account for 86% of systems in the latest list.]

Page 4:

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous

• InfiniBand very popular in HPC clusters

• Accelerators/Coprocessors becoming common in high-end systems

• Pushing the envelope for Exascale computing

[Figure: Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A.]

HPCAC Switzerland Conference (Mar '15) 4

Page 5:

• 225 IB Clusters (45%) in the November 2014 Top500 list

(http://www.top500.org)

• Installations in the Top 50 (21 systems):

HPCAC Switzerland Conference (Mar '15)

Large-scale InfiniBand Installations

519,640 cores (Stampede) at TACC (7th)
72,800 cores Cray CS-Storm in US (10th)
160,768 cores (Pleiades) at NASA/Ames (11th)
72,000 cores (HPC2) in Italy (12th)
147,456 cores (SuperMUC) in Germany (14th)
76,032 cores (Tsubame 2.5) at Japan/GSIC (15th)
194,616 cores (Cascade) at PNNL (18th)
110,400 cores (Pangea) at France/Total (20th)
37,120 cores T-Platform A-Class at Russia/MSU (22nd)
50,544 cores (Occigen) at France/GENCI-CINES (26th)
73,584 cores (Spirit) at USA/Air Force (31st)
77,184 cores (Curie thin nodes) at France/CEA (33rd)
65,320 cores iDataPlex DX360M4 at Germany/Max-Planck (34th)
120,640 cores (Nebulae) at China/NSCS (35th)
72,288 cores (Yellowstone) at NCAR (36th)
70,560 cores (Helios) at Japan/IFERC (38th)
38,880 cores (Cartesius) at Netherlands/SURFsara (45th)
23,040 cores (QB-2) at LONI/US (46th)
138,368 cores (Tera-100) at France/CEA (47th)
and many more!

5

Page 6:

• Scientific Computing

– Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model

– Many discussions towards Partitioned Global Address Space (PGAS)

• UPC, OpenSHMEM, CAF, etc.

– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)

• Big Data/Enterprise/Commercial Computing

– Focuses on large data and data analysis

– Hadoop (HDFS, HBase, MapReduce)

– Spark is emerging for in-memory computing

– Memcached is also used for Web 2.0

6

Two Major Categories of Applications

HPCAC Switzerland Conference (Mar '15)

Page 7:

Towards Exascale System (Today and Target)

Systems: 2015 (Tianhe-2) vs. 2020-2024, and the difference between today and Exascale

• System peak: 55 PFlop/s -> 1 EFlop/s (~20x)
• Power: 18 MW (3 GFlops/W) -> ~20 MW (50 GFlops/W), O(1) (~15x)
• System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) -> 32-64 PB (~50x)
• Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) -> 1.2 or 15 TF (O(1))
• Node concurrency: 24-core CPU + 171-core CoP -> O(1k) or O(10k) (~5x - ~50x)
• Total node interconnect BW: 6.36 GB/s -> 200-400 GB/s (~40x - ~60x)
• System size (nodes): 16,000 -> O(100,000) or O(1M) (~6x - ~60x)
• Total concurrency: 3.12M (12.48M threads, 4/core) -> O(billion) for latency hiding (~100x)
• MTTI: Few/day -> Many/day (O(?))

Courtesy: Prof. Jack Dongarra

7 HPCAC Switzerland Conference (Mar '15)

Page 8:

HPCAC Switzerland Conference (Mar '15) 8

Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1-P3:
- Shared Memory Model (SHMEM, DSM): all processes share one memory
- Distributed Memory Model (MPI - Message Passing Interface): each process has its own memory
- Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...): private memories plus a logical shared memory]

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

Page 9:

• Energy and Power Challenge

– Hard to solve power requirements for data movement

• Memory and Storage Challenge

– Hard to achieve high capacity and high data rate

• Concurrency and Locality Challenge

– Management of very large amounts of concurrency (billions of threads)

• Resiliency Challenge

– Low voltage devices (for low power) introduce more faults

HPCAC Switzerland Conference (Mar '15) 9

Basic Design Challenges for Exascale Systems

Page 10:

• Power required for data movement operations is one of the main challenges

• Non-blocking collectives

– Overlap computation and communication

• Much improved One-sided interface

– Reduce synchronization of sender/receiver

• Manage concurrency

– Improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM, CAF)

• Resiliency

– New interface for detecting failures

HPCAC Switzerland Conference (Mar '15) 10

How does MPI Plan to Meet Exascale Challenges?

Page 11:

• Major features

– Non-blocking Collectives

– Improved One-Sided (RMA) Model

– MPI Tools Interface

• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

• MPI 3.1 and 4.0 are underway

HPCAC Switzerland Conference (Mar '15) 11

Major New Features in MPI-3
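To make the non-blocking collective support listed above concrete, here is a minimal sketch in C that overlaps an MPI_Iallreduce with independent computation; the compute_independent() helper and the buffer size are illustrative placeholders, not part of the MPI standard or of any particular library.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    /* Stand-in for application work that does not depend on the reduction. */
    static void compute_independent(double *buf, int n) {
        for (int i = 0; i < n; i++)
            buf[i] *= 2.0;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[N], global[N], other[N];
        for (int i = 0; i < N; i++) {
            local[i] = rank + i;
            other[i] = i;
        }

        /* MPI-3 non-blocking collective: start the reduction, return immediately. */
        MPI_Request req;
        MPI_Iallreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

        /* Overlap: independent work proceeds while the collective makes progress. */
        compute_independent(other, N);

        /* Complete the collective before its result is used. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0)
            printf("global[0] = %.1f\n", global[0]);

        MPI_Finalize();
        return 0;
    }

The same pattern applies to the other MPI-3 non-blocking collectives (e.g. MPI_Ibcast, MPI_Ialltoall) that back the collective-offload results shown later in this talk.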

Page 12:

12

Partitioned Global Address Space (PGAS) Models

HPCAC Switzerland Conference (Mar '15)

• Key features

- Simple shared memory abstractions

- Light weight one-sided communication

- Easier to express irregular communication

• Different approaches to PGAS

- Languages

• Unified Parallel C (UPC)
• Co-Array Fortran (CAF)
• X10
• Chapel

- Libraries

• OpenSHMEM
• Global Arrays
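To illustrate the lightweight one-sided communication these models provide, below is a small OpenSHMEM sketch in C in which every PE deposits a value directly into a table on PE 0 with no matching receive; it follows the OpenSHMEM 1.0 interface (start_pes, shmem_int_put, shmem_barrier_all) and is a generic example, not tied to any particular implementation. The MAX_PES bound is an illustrative assumption.

    #include <stdio.h>
    #include <shmem.h>

    #define MAX_PES 256            /* illustrative upper bound on job size */

    /* Symmetric array: allocated at the same address on every PE. */
    static int table[MAX_PES];

    int main(void) {
        start_pes(0);              /* OpenSHMEM 1.0 initialization */

        int me = _my_pe();
        int npes = _num_pes();

        /* One-sided put: write my value into slot 'me' of the table on PE 0.
           The target PE takes no action to receive the data. */
        int value = 100 + me;
        if (me < MAX_PES)
            shmem_int_put(&table[me], &value, 1, 0);

        shmem_barrier_all();       /* ensure all puts have completed */

        if (me == 0)
            for (int i = 0; i < npes && i < MAX_PES; i++)
                printf("table[%d] = %d\n", i, table[i]);

        return 0;
    }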

Page 13:

• Hierarchical architectures with multiple address spaces

• (MPI + PGAS) Model

– MPI across address spaces

– PGAS within an address space

• MPI is good at moving data between address spaces

• Within an address space, MPI can interoperate with other shared memory programming models

• Can co-exist with OpenMP for offloading computation

• Applications can have kernels with different communication patterns

• Can benefit from different models

• Re-writing complete applications can be a huge effort

• Port critical kernels to the desired model instead

HPCAC Switzerland Conference (Mar '15) 13

MPI+PGAS for Exascale Architectures and Applications

Page 14:

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics

• Benefits:

– Best of Distributed Computing Model

– Best of Shared Memory Computing Model

• Exascale Roadmap*:

– "Hybrid Programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420

[Figure: an HPC application composed of kernels 1..N; most kernels use MPI while, e.g., Kernel 2 and Kernel N are re-written in PGAS.]

HPCAC Switzerland Conference (Mar '15) 14

Page 15:

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design opportunities and challenges across layers -
Application Kernels/Applications;
Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.;
Communication Library or Runtime for Programming Models (middleware): point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, fault tolerance;
Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC).
Cross-cutting concerns: performance, scalability, fault-resilience.]

15 HPCAC Switzerland Conference (Mar '15)

Page 16:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)

• Scalable Collective communication

– Offload

– Non-blocking

– Topology-aware

– Power-aware

• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)

– Multiple end-points per node

• Support for efficient multi-threading

• Integrated Support for GPGPUs and Accelerators

• Fault-tolerance/resiliency

• QoS support for communication and I/O

• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …)

• Virtualization

Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

16 HPCAC Switzerland Conference (Mar '15)

Page 17:

• Extreme Low Memory Footprint – Memory per core continues to decrease

• D-L-A Framework

– Discover

• Overall network topology (fat-tree, 3D, …)

• Network topology for processes for a given job

• Node architecture

• Health of network and node

– Learn

• Impact on performance and scalability

• Potential for failure

– Adapt

• Internal protocols and algorithms

• Process mapping

• Fault-tolerance solutions

– Low overhead techniques while delivering performance, scalability and fault-tolerance

Additional Challenges for Designing Exascale Software Libraries

17 HPCAC Switzerland Conference (Mar '15)

Page 18:

• High Performance open-source MPI Library for InfiniBand, 10Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2012

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Used by more than 2,350 organizations in 75 countries

– More than 242,000 downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘14 ranking)

• 7th ranked 519,640-core cluster (Stampede) at TACC

• 11th ranked 160,768-core cluster (Pleiades) at NASA

• 15th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many IB, HSE, and server vendors including Linux distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Stampede at TACC (7th in Nov 2014, 519,640 cores, 5.168 PFlops)

MVAPICH2 Software

18 HPCAC Switzerland Conference (Mar '15)

Page 19:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)

– Extremely minimum memory footprint

• Collective communication – Hardware-multicast-based

– Offload and Non-blocking

– Topology-aware

– Power-aware

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

Overview of A Few Challenges being Addressed by MVAPICH2 Project for Exascale

19 HPCAC Switzerland Conference (Mar '15)

Page 20:

HPCAC Switzerland Conference (Mar '15)

Latency & Bandwidth: MPI over IB with MVAPICH2

[Figure: small-message MPI latency and uni-directional bandwidth over InfiniBand with MVAPICH2 across adapter/platform generations (Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, Ivy-ConnectIB-DualFDR, Haswell-ConnectIB-SingleFDR). Reported small-message latencies: 1.66, 1.56, 1.64, 1.82, 0.99, 1.09, 1.12, and 1.22 us. Reported peak uni-directional bandwidths: 3280, 3385, 1917, 1706, 6343, 12485, 12810, and 6304 MBytes/sec.]

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Single FDR - 2.8 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch

20

Page 21:

MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.1rc2, Intel Ivy-bridge

[Figure: intra-node latency and bandwidth. Small-message latency: 0.18 us intra-socket, 0.45 us inter-socket. Intra-socket and inter-socket bandwidth compared for the CMA, Shmem, and LiMIC channels, with peaks of 14,250 MB/s and 13,749 MB/s.]

21 HPCAC Switzerland Conference (Mar '15)

Page 22:

MPI-3 RMA Get/Put with Flush Performance

Latest MVAPICH2 2.1rc2, Intel Sandy-bridge with Connect-IB (single-port)

[Figure: inter-node Get/Put latency (2.04 us and 1.56 us for small messages), intra-socket Get/Put latency (0.08 us), intra-socket Get/Put bandwidth (up to ~15,364 MBytes/sec), and inter-node Get/Put bandwidth (up to ~6,881 MBytes/sec).]

22 HPCAC Switzerland Conference (Mar '15)
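To show the kind of one-sided operation measured above, here is a minimal MPI-3 RMA sketch in C that creates a window, issues an MPI_Put inside a passive-target epoch, and uses MPI_Win_flush for remote completion; it is a generic MPI-3 example written for this summary, not the OSU benchmark code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Expose one integer per process through an RMA window. */
        int win_buf = -1;
        MPI_Win win;
        MPI_Win_create(&win_buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Passive-target epoch covering all ranks. */
        MPI_Win_lock_all(0, win);

        if (rank == 0 && size > 1) {
            int value = 42;
            MPI_Put(&value, 1, MPI_INT, 1 /* target rank */,
                    0 /* target displacement */, 1, MPI_INT, win);
            MPI_Win_flush(1, win);          /* wait for remote completion */
        }

        MPI_Barrier(MPI_COMM_WORLD);        /* everyone knows the put is done */
        MPI_Win_sync(win);                  /* sync public/private window copies */

        if (rank == 1)
            printf("rank 1: win_buf = %d\n", win_buf);

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }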

Page 23:

• Memory usage for 32K processes with 8-cores per node can be 54 MB/process (for connections)

• NAMD performance improves when there is frequent communication to many peers

HPCAC Switzerland Conference (Mar '15)

Minimizing Memory Footprint with XRC and Hybrid Mode

Memory Usage Performance on NAMD (1024 cores)

23

• Both UD and RC/XRC have benefits

• Hybrid for the best of both

• Available since MVAPICH2 1.7 as integrated interface

• Runtime Parameters: RC - default;

• UD - MV2_USE_ONLY_UD=1

• Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1

M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08

[Figures: memory usage per process vs. number of connections (MVAPICH2-RC vs. MVAPICH2-XRC); NAMD normalized time on the apoa1, er-gre, f1atpase, and jac datasets (RC vs. XRC); and time vs. number of processes for the UD, Hybrid, and RC transports, with annotated improvements of 26%, 40%, 30%, and 38%.]

Page 24:

HPCAC Switzerland Conference (Mar '15) 24

Minimizing Memory Footprint further by DC Transport

[Figure: four nodes (Node 0-3), each hosting two MPI processes (P0-P7), connected through an IB network.]

• Constant connection cost (One QP for any peer)

• Full Feature Set (RDMA, Atomics etc)

• Separate objects for send (DC Initiator) and receive (DC Target)

– DC Target identified by “DCT Number”

– Messages routed with (DCT Number, LID)

– Requires same “DC Key” to enable communication

• Initial study done in MVAPICH2

• DCT support available in Mellanox OFED

[Figures: normalized execution time for NAMD (apoa1, large data set) at 160-620 processes, and connection memory footprint (KB) for Alltoall at 80-640 processes, comparing the RC, DC-Pool, UD, and XRC transports.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)

Page 25:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)

– Extremely minimum memory footprint

• Collective communication – Hardware-multicast-based

– Offload and Non-blocking

– Topology-aware

– Power-aware

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

25 HPCAC Switzerland Conference (Mar '15)

Page 26:

HPCAC Switzerland Conference (Mar '15) 26

Hardware Multicast-aware MPI_Bcast on Stampede

[Figures: MPI_Bcast latency with and without hardware multicast on Stampede - small and large messages at 102,400 cores, and scalability of 16-byte and 32-KByte broadcasts across node counts; the Multicast design consistently outperforms the Default design.]

ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

Page 27:

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Hardware

[Figures: application run-time vs. data size and PCG run-time vs. number of processes (PCG-Default vs. Modified-PCG-Offload); HPL normalized performance (HPL-Offload, HPL-1ring, HPL-Host) vs. HPL problem size (N) as % of total memory.]

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)

• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS '12

27 HPCAC Switzerland Conference (Mar '15)

Page 28:

HPCAC Switzerland Conference (Mar '15) 28

Network-Topology-Aware Placement of Processes

How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information? What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figures: overall performance and split-up of physical communication for MILC on Ranger - performance for varying system sizes, and Default vs. Topology-Aware placement for a 2048-core run (15% improvement).]

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. Best Paper and Best Student Paper Finalist.

• Reduce network topology discovery time from O(N_hosts^2) to O(N_hosts)

• 15% improvement in MILC execution time @ 2048 cores

• 15% improvement in Hypre execution time @ 1024 cores

Page 29:

29

Power and Energy Savings with Power-Aware Collectives

[Figures: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes, and CPMD application performance and power savings (annotated values of 5% and 32%).]

K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10

HPCAC Switzerland Conference (Mar '15)

Page 30:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)

– Extremely minimum memory footprint

• Collective communication – Hardware-multicast-based

– Offload and Non-blocking

– Topology-aware

– Power-aware

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

30 HPCAC Switzerland Conference (Mar '15)

Page 31:

[Figure: GPU and CPU connected over PCIe; CPU to NIC; NICs connected through a switch.]

At Sender:

cudaMemcpy(s_hostbuf, s_devbuf, . . .);

MPI_Send(s_hostbuf, size, . . .);

At Receiver:

MPI_Recv(r_hostbuf, size, . . .);

cudaMemcpy(r_devbuf, r_hostbuf, . . .);

• Data movement in applications with standard MPI and CUDA interfaces

High Productivity and Low Performance

31 HPCAC Switzerland Conference (Mar '15)

MPI + CUDA - Naive

Page 32:

[Figure: GPU and CPU connected over PCIe; CPU to NIC; NICs connected through a switch.]

At Sender:

for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);

for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, . . .);
}
MPI_Waitall(…);

<<Similar at receiver>>

• Pipelining at user level with non-blocking MPI and CUDA interfaces

Low Productivity and High Performance

32 HPCAC Switzerland Conference (Mar '15)

MPI + CUDA - Advanced

Page 33:

At Sender:

MPI_Send(s_devbuf, size, …);

At Receiver:

MPI_Recv(r_devbuf, size, …);

(data movement handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement

• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)

• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity


33 HPCAC Switzerland Conference (Mar '15)

GPU-Aware MPI Library: MVAPICH2-GPU
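The pattern above can be written as an ordinary MPI program once the library is CUDA-aware; the sketch below, in C with the CUDA runtime API, passes device pointers directly to MPI_Send/MPI_Recv. It assumes a CUDA-aware MPI library such as MVAPICH2-GPU and is a minimal illustration, not code from the MVAPICH2 distribution.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)   /* number of floats exchanged (illustrative size) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the communication buffer directly in GPU memory. */
        float *devbuf;
        cudaMalloc((void **)&devbuf, N * sizeof(float));

        if (rank == 0) {
            cudaMemset(devbuf, 0, N * sizeof(float));
            /* Device pointer handed straight to MPI: the CUDA-aware library
               moves the data (pipelining or GPUDirect RDMA) internally. */
            MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into GPU memory\n", N);
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }

With MVAPICH2, CUDA support of this kind is typically enabled at run time (for example via MV2_USE_CUDA=1, which also appears among the run parameters on the HOOMD-blue slide later in this deck).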

Page 34:

CUDA-Aware MPI: MVAPICH2 1.8-2.1 Releases

• Support for MPI communication from NVIDIA GPU device memory

• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)

• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node

• Optimized and tuned collectives for GPU device buffers

• MPI datatype support for point-to-point and collective communication from GPU device buffers

34 HPCAC Switzerland Conference (Mar '15)

Page 35:

• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox

• OSU has a design of MVAPICH2 using GPUDirect RDMA

– Hybrid design using GPU-Direct RDMA

• GPUDirect RDMA and Host-based pipelining

• Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge

– Support for communication using multi-rail

– Support for Mellanox Connect-IB and ConnectX VPI adapters

– Support for RoCE with Mellanox ConnectX VPI adapters

HPCAC Switzerland Conference (Mar '15) 35

GPU-Direct RDMA (GDR) with CUDA

[Figure: GPU, CPU, chipset, system memory, GPU memory, and IB adapter. P2P write / read bandwidth: 5.2 GB/s / <1.0 GB/s on SNB E5-2670; 6.4 GB/s / 3.5 GB/s on IVB E5-2680V2.]

Page 36:

36

Performance of MVAPICH2 with GPU-Direct-RDMA: Latency GPU-GPU Internode MPI Latency

HPCAC Switzerland Conference (Mar '15)

MVAPICH2-GDR-2.1a Intel Ivy Bridge (E5-2680 v2) node with 20 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct-RDMA

[Figure: GPU-GPU inter-node MPI small-message latency for MV2-GDR2.1a, MV2-GDR2.0b, and MV2 without GDR; 2.37 usec small-message latency, with annotated improvements of 2.5X and 8.6X.]

Page 37:

37

Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth GPU-GPU Internode MPI Uni-Directional Bandwidth

HPCAC Switzerland Conference (Mar '15)

MVAPICH2-GDR-2.1a Intel Ivy Bridge (E5-2680 v2) node with 20 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct-RDMA

[Figure: GPU-GPU inter-node MPI uni-directional bandwidth for small messages, MV2-GDR2.1a vs. MV2-GDR2.0b vs. MV2 without GDR, with annotated improvements of 2.2X and 10x.]

Page 38:

38

Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth GPU-GPU Internode MPI Bi-directional Bandwidth

HPCAC Switzerland Conference (Mar '15)

MVAPICH2-GDR-2.1a Intel Ivy Bridge (E5-2680 v2) node with 20 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct-RDMA

[Figure: GPU-GPU inter-node MPI bi-directional bandwidth for small messages, MV2-GDR2.1a vs. MV2-GDR2.0b vs. MV2 without GDR, with annotated improvements of 2x and 10x.]

Page 39:

39

Performance of MVAPICH2 with GPU-Direct-RDMA: MPI-3 RMA GPU-GPU Internode MPI Put latency (RMA put operation Device to Device )

HPCAC Switzerland Conference (Mar '15)

MVAPICH2-GDR-2.1a Intel Ivy Bridge (E5-2680 v2) node with 20 cores

NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct-RDMA

MPI-3 RMA provides flexible synchronization and completion primitives

[Figure: GPU-GPU inter-node MPI Put latency (RMA put, device to device) for small messages, MV2-GDR2.1a vs. MV2-GDR2.0b; 2.88 us latency, a 7X improvement.]

Page 40:

HPCAC Switzerland Conference (Mar '15) 40

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)

• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0

MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768

MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1

MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figures: HOOMD-blue average TPS vs. number of GPU nodes (4-64) - weak scalability with 2K particles/GPU and strong scalability with 256K particles.]

Page 41:

• MVAPICH2-2.1a with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/#mv2gdr/

• System software requirements

• Mellanox OFED 2.1 or later

• NVIDIA Driver 331.20 or later

• NVIDIA CUDA Toolkit 6.0 or later

• Plugin for GPUDirect RDMA

– http://www.mellanox.com/page/products_dyn?product_family=116

– Strongly Recommended : use the new GDRCOPY module from NVIDIA

• https://github.com/NVIDIA/gdrcopy

• Has optimized designs for point-to-point communication using GDR

• Contact MVAPICH help list with any questions related to the package

[email protected]

Using MVAPICH2-GPUDirect Version

41 HPC Advisory Council Stanford Workshop (Feb '15)

Page 42:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)

– Extremely minimum memory footprint

• Collective communication – Hardware-multicast-based

– Offload and Non-blocking

– Topology-aware

– Power-aware

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

42 HPCAC Switzerland Conference (Mar '15)

Page 43:

MPI Applications on MIC Clusters

[Figure: MPI job launch modes on clusters with Xeon and Xeon Phi, from multi-core centric to many-core centric - Host-only; Offload (or reverse offload) with an MPI program on the host and offloaded computation on the Phi; Symmetric with MPI programs on both the host and the Phi; and Coprocessor-only.]

• Flexibility in launching MPI jobs on clusters with Xeon Phi

43 HPCAC Switzerland Conference (Mar '15)

Page 44:

Data Movement on Intel Xeon Phi Clusters

[Figure: two nodes, each with two CPUs connected by QPI, MICs attached over PCIe, and an IB adapter connecting the nodes; MPI processes run on both CPUs and MICs.]

MPI processes can communicate along many paths: 1. Intra-Socket; 2. Inter-Socket; 3. Inter-Node; 4. Intra-MIC; 5. Intra-Socket MIC-MIC; 6. Inter-Socket MIC-MIC; 7. Inter-Node MIC-MIC; 8. Intra-Socket MIC-Host; 9. Inter-Socket MIC-Host; 10. Inter-Node MIC-Host; 11. Inter-Node MIC-MIC with IB adapter on remote socket; and more.

44

• Critical for runtimes to optimize data movement, hiding the complexity

• Connected as PCIe devices – Flexibility but Complexity

HPCAC Switzerland Conference (Mar '15)

Page 45:

45

MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload Mode

• Intranode Communication

• Coprocessor-only Mode

• Symmetric Mode

• Internode Communication

• Coprocessors-only

• Symmetric Mode

• Multi-MIC Node Configurations

HPCAC Switzerland Conference (Mar '15)

Page 46:

MIC-Remote-MIC P2P Communication with Proxy-based Communication

46 HPCAC Switzerland Conference (Mar '15)

[Figures: MIC-to-remote-MIC point-to-point latency (large messages) and bandwidth with proxy-based communication, for intra-socket and inter-socket placement of the IB adapter; peak bandwidths of 5236 and 5594 MB/sec.]

Page 47:

47

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

HPCAC Switzerland Conference (Mar '15)

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014

[Figures: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M), comparing MV2-MIC and MV2-MIC-Opt; and P3DFFT communication/computation time on 32 nodes (8H + 8M, size 2K x 2K x 1K). Annotated improvements: 76%, 58%, and 55%.]

Page 48:

48

Latest Status on MVAPICH2-MIC

HPCAC Switzerland Conference (Mar '15)

• Running on three major systems

– Stampede

– Blueridge (Virginia Tech)

– Beacon (UTK)

• MVAPICH2-MIC 2.0 was released (12/02/14)

• Try it out!!

– http://mvapich.cse.ohio-state.edu/downloads/#mv2mic

Page 49:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)

– Extremely minimum memory footprint

• Collective communication – Hardware-multicast-based

– Offload and Non-blocking

– Topology-aware

– Power-aware

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

49 HPCAC Switzerland Conference (Mar '15)

Page 50:

HPCAC Switzerland Conference (Mar '15) 50

MPI Tools Interface

• Introduced to expose MPI-library internals to tools and applications

• Generalized interface – no defined variables in the standard

• Variables can differ between

• MPI implementations

• Compilations of same MPI library (production vs debug)

• Executions of the same application/MPI library

• An implementation may provide no variables at all

• Two types of variables supported

• Control Variables (CVARS)

• Typically used to configure and tune MPI internals

• Environment variables, configuration parameters and toggles

• Performance Variables (PVARS)

• Insights into performance of an MPI library

• Highly-implementation specific

• Memory consumption, timing information, resource-usage, data transmission info.

• Per-call basis or an entire MPI job

Page 51:

MPI_T_init_thread()

MPI_T_cvar_get_info(MV2_EAGER_THRESHOLD)

if (msg_size < MV2_EAGER_THRESHOLD + 1KB)

MPI_T_cvar_write(MV2_EAGER_THRESHOLD, +1024)

MPI_Send(..)

MPI_T_finalize()

HPCAC Switzerland Conference (Mar '15) 51

Co-designing Applications to use MPI-T

[Figure: typical MPI-T usage flows. Performance variables: Initialize MPI-T -> Get # variables -> Query metadata -> Allocate session -> Allocate handle -> Start/Stop var -> Read/Write/Reset -> Free handle -> Free session -> Finalize MPI-T. Control variables: Allocate handle -> Read/Write var -> Free handle.]

Example: Optimizing the eager limit dynamically ->

R. Rajachan, J. Perkins, K. Hamidouche, M. Arnold, and D. K. Panda, Understanding the Memory-Utilization of MPI Library: Challenges and Designs in Implementing the MPI_T Interface, EuroMPI '14
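Expanding the pseudocode on this slide into callable MPI-3 tool-interface code, the sketch below (in C) looks up a control variable by name, allocates a handle, and reads and rewrites its value. The variable name MV2_EAGER_THRESHOLD is taken from the slide's pseudocode and is assumed for illustration; the cvar names MVAPICH2 actually exports, and whether a given cvar is writable at run time, depend on the release.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Return the index of the control variable named 'wanted', or -1. */
    static int find_cvar(const char *wanted) {
        int num = 0;
        MPI_T_cvar_get_num(&num);
        for (int i = 0; i < num; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            if (strcmp(name, wanted) == 0)
                return i;
        }
        return -1;
    }

    int main(int argc, char **argv) {
        int threadlevel;
        MPI_Init(&argc, &argv);
        MPI_T_init_thread(MPI_THREAD_SINGLE, &threadlevel);

        int idx = find_cvar("MV2_EAGER_THRESHOLD");  /* assumed cvar name */
        if (idx >= 0) {
            MPI_T_cvar_handle handle;
            int count = 0;
            MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);

            int value = 0;
            MPI_T_cvar_read(handle, &value);
            printf("eager threshold: %d\n", value);

            value += 1024;                     /* raise the eager limit */
            MPI_T_cvar_write(handle, &value);  /* legal only if the cvar is writable */

            MPI_T_cvar_handle_free(&handle);
        }

        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }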

Page 52:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)

– Extremely minimum memory footprint

• Collective communication – Hardware-multicast-based

– Offload and Non-blocking

– Topology-aware

– Power-aware

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

52 HPCAC Switzerland Conference (Mar '15)

Page 53:

MVAPICH2-X for Hybrid MPI + PGAS Applications

HPCAC Switzerland Conference (Mar '15)

[Figure: MPI, OpenSHMEM, UPC, CAF, or Hybrid (MPI + PGAS) applications issue OpenSHMEM, MPI, UPC, and CAF calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE, and iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2011) onwards!

– http://mvapich.cse.ohio-state.edu

• Feature Highlights

– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC

– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)

– Scalable inter-node and intra-node communication - point-to-point and collectives


53
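A hybrid MPI + OpenSHMEM program, of the kind the unified runtime above is built for, can mix both interfaces in one executable; the sketch below (C) uses an OpenSHMEM put for a neighbor exchange and an MPI collective for a global reduction. It is a minimal illustration that assumes a runtime such as MVAPICH2-X which allows both models in the same job; the exact initialization ordering requirements are runtime-specific.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    static int neighbor_val = 0;   /* symmetric variable, target of the put */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        start_pes(0);              /* OpenSHMEM 1.0 initialization */

        int me = _my_pe();
        int npes = _num_pes();

        /* PGAS part: one-sided put of my rank to the right neighbor. */
        int mine = me;
        shmem_int_put(&neighbor_val, &mine, 1, (me + 1) % npes);
        shmem_barrier_all();

        /* MPI part: global reduction over the values received via OpenSHMEM. */
        int sum = 0;
        MPI_Allreduce(&neighbor_val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (me == 0)
            printf("sum of neighbor values = %d\n", sum);

        MPI_Finalize();
        return 0;
    }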

Page 54:

OpenSHMEM Application Evaluation

54

0

100

200

300

400

500

600

256 512

Tim

e (

s)

Number of Processes

UH-SHMEM

OMPI-SHMEM (FCA)

Scalable-SHMEM (FCA)

MV2X-SHMEM

0

10

20

30

40

50

60

70

80

256 512

Tim

e (

s)

Number of Processes

UH-SHMEM

OMPI-SHMEM (FCA)

Scalable-SHMEM (FCA)

MV2X-SHMEM

Heat Image Daxpy

• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA

• Execution time for 2DHeat Image at 512 processes (sec):

- UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169

• Execution time for DAXPY at 512 processes (sec):

- UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2

J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM'14), March 2014

HPCAC Switzerland Conference (Mar '15)

Page 55:

HPCAC Switzerland Conference (Mar '15) 55

Hybrid MPI+OpenSHMEM Graph500 Design Execution Time

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC'13), June 2013

[Figures: Graph500 weak scalability (billions of traversed edges per second (TEPS) vs. scale 26-29) and strong scalability (TEPS vs. 1024-8192 processes) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM).]

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

[Figure: Graph500 execution time at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); the Hybrid design is up to 7.6X and 13X faster than MPI-Simple.]

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design

• 8,192 processes - 2.4X improvement over MPI-CSR

- 7.6X improvement over MPI-Simple

• 16,384 processes - 1.5X improvement over MPI-CSR

- 13X improvement over MPI-Simple

Page 56:

56

Hybrid MPI+OpenSHMEM Sort Application Execution Time

Strong Scalability

• Performance of Hybrid (MPI+OpenSHMEM) Sort Application

• Execution Time

- 4TB Input size at 4,096 cores: MPI – 2408 seconds, Hybrid: 1172 seconds

- 51% improvement over MPI-based design

• Strong Scalability (configuration: constant input size of 500GB)

- At 4,096 cores: MPI – 0.16 TB/min, Hybrid – 0.36 TB/min

- 55% improvement over MPI based design

[Figures: sort rate (TB/min) vs. number of processes (512-4,096) and execution time vs. input data size and process count (500GB-512 to 4TB-4K), MPI vs. Hybrid, showing 51% and 55% improvements.]

HPCAC Switzerland Conference (Mar '15)

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. K. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS'14, Oct 2014

Page 57:

MVAPICH2-X Support of OpenUH CAF

• Support for multiple networks through different conduits: the MVAPICH2-X conduit is available in the MVAPICH2-X release and supports CAF and Hybrid MPI+CAF on InfiniBand

• Optimized Collective designs

HPCAC Switzerland Conference (Mar '15)

[Figure: CAF applications and compiler-generated code run on the OpenUH LibCAF runtime over GASNet, which can use the MPI conduit, the MVAPICH2-X conduit (with collective optimization), or the IBV conduit.]

J. Lin, K. Hamidouche, X. Lu, M. Li and D. K. Panda, High-performance Co-array Fortran support with MVAPICH2-X: Initial experience and evaluation, HIPS'15

57

Page 58:

HPCAC Switzerland Conference (Mar '15) 58

Performance Evaluations for One-sided Communication

[Figures: Put and Get bandwidth and latency for GASNet-IBV, GASNet-MPI, and MV2X, and NAS-CAF execution times for bt.D.256, cg.D.256, ep.D.256, ft.D.256, mg.D.256, and sp.D.256.]

• Micro-benchmark improvement (MV2X vs. GASNet-IBV, UH CAF test-suite)

– Put bandwidth: 3.5X improvement on 4KB; Put latency: reduce 29% on 4B

• Application performance improvement (NAS-CAF one-sided implementation)

– Reduce the execution time by 12% (BT.D.256), 18% (EP.D.256), 9% (SP.D.256)


Page 59:

HPCAC Switzerland Conference (Mar '15) 59

Performance Evaluations for Collective Communication

[Figures: CO_REDUCE and CO_BROADCAST bandwidth on 64 cores vs. message size (4 bytes to 2 MB) for GASNet-IBV, GASNet-MPI, MV2X-Default, MV2X-UCR, and MV2X-Hybrid.]

• Bandwidth improvement (MV2X-Hybrid vs. GASNet-IBV, UH CAF test-suite)

– CO_REDUCE: 2.3X improvement on 1KB, 1.1X improvement on 128KB

– CO_BROADCAST: 4.0X improvement on 1KB, 1.3X improvement on 1MB

– CAF support is available in MVAPICH2-X 2.1rc2


Page 60:

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)

– Extremely minimum memory footprint

• Collective communication – Hardware-multicast-based

– Offload and Non-blocking

– Topology-aware

– Power-aware

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

60 HPCAC Switzerland Conference (Mar '15)

Page 61:

• Virtualization has many benefits – Job migration

– Compaction

• Not very popular in HPC due to overhead associated with Virtualization

• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters

• Initial designs of MVAPICH2 with SR-IOV support with Openstack

• Will be available publicly soon

Can HPC and Virtualization be Combined?

61 HPCAC Switzerland Conference (Mar '15)

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar'14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14

J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid'15

Page 62:

• Native vs. SR-IOV at basic IB level

• Compared to native latency, 0-24% overhead

• Compared to native bandwidth, 0-17% overhead

Inter-node Inter-VM pt2pt Latency Inter-node Inter-VM pt2pt Bandwidth

IB-Level Performance – Inter-node Pt2Pt Latency & Bandwidth

HPCAC Switzerland Conference (Mar '15) 62

[Figures: inter-node inter-VM point-to-point latency and bandwidth, IB-Native vs. IB-SR-IOV, with annotated overheads between 0% and 24% (latency) and between 0% and 17% (bandwidth).]

Page 63:

MPI-Level Performance – Inter-node Pt2Pt Latency & Bandwidth

• EC2 C3.2xlarge instances

• Similar performance with MV2-SR-IOV-Def

• Compared to Native, similar overhead as basic IB level

• Compared to EC2, up to 29X and 16X performance speedup on Latency & Bandwidth

HPCAC Switzerland Conference (Mar '15)

[Figures: inter-node inter-VM point-to-point latency and bandwidth for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, MV2-Native, and MV2-EC2, with annotated speedups of up to 29X (latency) and 16X (bandwidth) over EC2 and overheads of 0-24% vs. native.]

63

Page 64:

• EC2 C3.2xlarge instances

• Compared to MV2-SR-IOV-Def, up to 84% and 158% performance improvement on Latency & Bandwidth

• Compared to Native, 3-7% overhead for Latency, 3-8% overhead for Bandwidth

• Compared to EC2, up to 160X and 28X performance speedup on Latency & Bandwidth

Intra-node Inter-VM pt2pt Latency Intra-node Inter-VM pt2pt Bandwidth

MPI-Level Performance – Intra-node Pt2Pt Latency & Bandwidth

HPCAC Switzerland Conference (Mar '15) 64

[Figures: intra-node inter-VM point-to-point latency and bandwidth for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, MV2-Native, and MV2-EC2, with annotated speedups of up to 160X (latency) and 28X (bandwidth) over EC2 and overheads of 3-8% vs. native.]

Page 65:

[Figures: Graph500 execution time vs. problem size (scale, edgefactor) and NAS Class C execution time (FT-64-C, EP-64-C, LU-64-C, BT-64-C) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native, with annotated SR-IOV overheads of 4-9% (Graph500) and 1-9% (NAS).]

• Compared to Native, 1-9% overhead for NAS

• Compared to Native, 4-9% overhead for Graph500

HPCAC Switzerland Conference (Mar '15) 65

Application-Level Performance (8 VM * 8 Core/VM)

NAS Graph500

Page 66:

HPCAC Switzerland Conference (Mar '15)

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument

– Targeting Big Data, Big Compute, Big Instrument research

– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network

• Reconfigurable instrument

– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use

• Connected instrument

– Workload and Trace Archive

– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others

– Partnerships with users

• Complementary instrument

– Complementing GENI, Grid’5000, and other testbeds

• Sustainable instrument

– Industry connections

http://www.chameleoncloud.org/

66

Page 67:

• Performance and Memory scalability toward 900K-1M cores – Dynamically Connected Transport (DCT) service with Connect-IB

• Enhanced Optimization for GPGPU and Coprocessor Support – Extending the GPGPU support (GPU-Direct RDMA) with CUDA 7.0 and Beyond

– Support for Intel MIC (Knights Landing)

• Taking advantage of Collective Offload framework – Including support for non-blocking collectives (MPI 3.0)

• RMA support (as in MPI 3.0)

• Extended topology-aware collectives

• Power-aware communication

• Support for MPI Tools Interface (as in MPI 3.0)

• Checkpoint-Restart and migration support with in-memory checkpointing

• Hybrid MPI+PGAS programming support with GPGPUs and Accelerators

• Virtualization

MVAPICH2 – Plans for Exascale

67 HPCAC Switzerland Conference (Mar '15)

Page 68:

• Exascale systems will be constrained by – Power

– Memory per core

– Data movement cost

– Faults

• Programming Models and Runtimes need to be designed for – Scalability

– Performance

– Fault-resilience

– Energy-awareness

– Programmability

– Productivity

• Highlighted some of the issues and challenges

• Need continuous innovation on all these fronts

Looking into the Future ….

68 HPCAC Switzerland Conference (Mar '15)

Page 69:

• Accelerating Big Data Processing with Hadoop, Spark and Memcached

• (March 25th, 10:45-11:30am)

69

More Details on Accelerating Big Data will be covered ..

HPCAC Switzerland Conference (Mar '15)

Page 70:

70

Call For Papers & Participation

International Workshop on Communication Architectures at Extreme Scale (ExaComm 2015)

In conjunction with the International Supercomputing Conference (ISC 2015)

Frankfurt, Germany, July 16th, 2015

http://web.cse.ohio-state.edu/~subramon/ExaComm15/exacomm15.html

HPCAC Switzerland Conference (Mar '15)

Page 71:

HPCAC Switzerland Conference (Mar '15)

Funding Acknowledgments

Funding Support by

Equipment Support by

71

Page 72:

HPCAC Switzerland Conference (Mar '15)

Personnel Acknowledgments

72

Current Students

– A. Awan (Ph.D.)

– A. Bhat (M.S.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist – S. Sur

Current Post-Doc

– J. Lin

Current Programmer

– J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– M. Li (Ph.D.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associates

– K. Hamidouche

– X. Lu

Past Programmers

– D. Bureddy

– H. Subramoni

Current Research Specialist

– M. Arnold

Page 73:

[email protected]

HPCAC Switzerland Conference (Mar '15)

Thank You!

The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/

73

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/