
MVAPICH2 and GPUDirect RDMA

Dhabaleswar K. (DK) Panda, Hari Subramoni and Sreeram Potluri

The Ohio State University

E-mail: {panda, subramon, potluri}@cse.ohio-state.edu

https://mvapich.cse.ohio-state.edu/

Presentation at HPC Advisory Council, June 2013


Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: Number of Clusters and Percentage of Clusters in the Top500 over time (Timeline)]

Large-scale InfiniBand Installations

• 224 IB Clusters (44.8%) in the November 2012 Top500 list (http://www.top500.org)

• Installations in the Top 40 (16 systems):
  – 147,456 cores (SuperMUC) in Germany (6th)
  – 204,900 cores (Stampede) at TACC (7th)
  – 77,184 cores (Curie thin nodes) at France/CEA (11th)
  – 120,640 cores (Nebulae) at China/NSCS (12th)
  – 72,288 cores (Yellowstone) at NCAR (13th)
  – 125,980 cores (Pleiades) at NASA/Ames (14th)
  – 70,560 cores (Helios) at Japan/IFERC (15th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (17th)
  – 138,368 cores (Tera-100) at France/CEA (20th)
  – 122,400 cores (Roadrunner) at LANL (22nd)
  – 53,504 cores (PRIMERGY) at Australia/NCI (24th)
  – 78,660 cores (Lomonosov) in Russia (26th)
  – 137,200 cores (Sunway Blue Light) in China (28th)
  – 46,208 cores (Zin) at LLNL (29th)
  – 33,664 cores (MareNostrum) at Spain/BSC (36th)
  – 32,256 cores (SGI Altix X) at Japan/CRIEPI (39th)
  – More are getting installed!

MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 173,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th ranked 204,900-core cluster (Stampede) at TACC
    • 14th ranked 125,980-core cluster (Pleiades) at NASA
    • 17th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu

• Partner in the U.S. NSF-TACC Stampede system

MVAPICH2 1.9 and MVAPICH2-X 1.9

• Released on 05/06/13

• Major Features and Enhancements
  – Based on MPICH-3.0.3
    • Support for all MPI-3 features
  – Support for single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach)
    • Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  – Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
    • Support for application-level checkpointing
    • Support for hierarchical system-level checkpointing
  – Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  – Improved and tuned MPI communication from GPU device memory
  – Improved job startup time
    • Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  – Revamped build system with support for parallel builds

• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 1.9, including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d

Outline

• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming models

• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype processing
  – Multi-GPU Configurations
  – MPI + OpenACC

• MVAPICH2 with GPUDirect RDMA

• Conclusion

One-way Latency: MPI over IB

[Charts: Small Message Latency and Large Message Latency (us) vs. Message Size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR; labeled small-message latencies of 1.66, 1.56, 1.64, 1.82, 0.99 and 1.09 us]

DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCIe Gen2, with IB switch. FDR: 2.6 GHz Octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch. ConnectIB-Dual FDR: 2.6 GHz Octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch.

Bandwidth: MPI over IB

[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth (MBytes/sec) vs. Message Size (bytes) for the same six configurations; labeled unidirectional bandwidths of 3280, 3385, 1917, 1706, 6343 and 12485 MBytes/sec, and bidirectional bandwidths of 3341, 3704, 4407, 11643, 6521 and 21025 MBytes/sec]

DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCIe Gen2, with IB switch. FDR: 2.6 GHz Octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch. ConnectIB-Dual FDR: 2.6 GHz Octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch.

eXtended Reliable Connection (XRC) and Hybrid Mode

• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
• Both UD and RC/XRC have benefits
• Hybrid mode gives the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD: MV2_USE_ONLY_UD=1; Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1

M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
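For example (launcher syntax and defaults may vary across MVAPICH2 versions and sites, so treat this purely as an illustration), these parameters are normally passed as environment variables at job launch, e.g. mpirun_rsh -np 1024 -hostfile hosts MV2_HYBRID_ENABLE_THRESHOLD=1 ./app.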

[Charts: Memory (MB/process) vs. Connections for MVAPICH2-RC and MVAPICH2-XRC; Normalized Time on NAMD (1024 cores) for the apoa1, er-gre, f1atpase and jac datasets with MVAPICH2-RC and MVAPICH2-XRC; Time (us) vs. Number of Processes (128-1024) for UD, Hybrid and RC, with labeled improvements of 26%, 40%, 30% and 38%]

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

• Latest MVAPICH2 1.9, Intel Sandy Bridge

[Charts: Latency (us) vs. Message Size (Bytes) for Intra-Socket (0.19 us) and Inter-Socket (0.45 us); Bandwidth (MB/s) vs. Message Size (Bytes) for inter-socket and intra-socket CMA, Shmem and LiMIC, each reaching about 12,000 MB/s]


Hardware Multicast-aware MPI_Bcast on Stampede

• ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (Sandy Bridge) Intel, PCIe Gen3, with Mellanox IB FDR switch

[Charts: Latency (us) of Default vs. Multicast MPI_Bcast for Small Messages and Large Messages at 102,400 cores, and for 16 Byte and 32 KByte messages vs. Number of Nodes]

Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version

[Charts: P3DFFT Application Run-Time (s) vs. Data Size (PCG-Default vs. Modified-PCG-Offload style comparison); HPL Normalized Performance vs. Problem Size (N) as % of Total Memory for HPL-Offload, HPL-1ring and HPL-Host; PCG Run-Time (s) vs. Number of Processes]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
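The application gains above come from overlapping a collective with independent computation. The following is a minimal sketch (not the modified P3DFFT/HPL/PCG code) of the MPI-3 non-blocking collective pattern those applications use: post MPI_Ialltoall, compute, then wait; with collective offload the network progresses the exchange while the host computes.

    #include <mpi.h>
    #include <stdlib.h>

    /* Illustrative only: overlap an all-to-all exchange with independent work. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int count = 1024;                      /* elements per peer */
        double *sendbuf = malloc((size_t)nprocs * count * sizeof(double));
        double *recvbuf = malloc((size_t)nprocs * count * sizeof(double));
        for (int i = 0; i < nprocs * count; i++) sendbuf[i] = (double)i;

        MPI_Request req;
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

        /* ... computation that does not depend on recvbuf goes here;
         * with collective offload the exchange progresses in the network ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);           /* recvbuf is now valid */

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }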


Multi-Level Checkpointing with ScalableCR (SCR)

• LLNL's Scalable Checkpoint/Restart library
• Can be used for application-guided and application-transparent checkpointing
• Effective utilization of the storage hierarchy (checkpoint cost and resiliency increase from local storage to stable storage):
  – Local: store checkpoint data on the node's local storage, e.g. local disk, ramdisk
  – Partner: write to local storage and to a partner node
  – XOR: write the file to local storage, and small sets of nodes collectively compute and store parity redundancy data (RAID-5)
  – Stable Storage: write to the parallel file system
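As a concrete illustration of the application-guided mode, the sketch below shows the typical SCR checkpointing loop in an MPI application. It is a minimal sketch assuming the SCR 1.x C API (scr.h: SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize); the checkpoint file name and the write_state() helper are hypothetical.

    #include <mpi.h>
    #include <scr.h>      /* LLNL SCR library */
    #include <stdio.h>

    /* Hypothetical application routine: dump this rank's state to the given file. */
    static int write_state(const char *file)
    {
        FILE *fp = fopen(file, "w");
        if (!fp) return -1;
        fprintf(fp, "state\n");   /* placeholder for the real simulation state */
        fclose(fp);
        return 0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        SCR_Init();                        /* initialize SCR after MPI_Init */

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int step = 0; step < 100; step++) {
            /* ... compute one timestep ... */

            int need = 0;
            SCR_Need_checkpoint(&need);    /* let SCR decide when to checkpoint */
            if (need) {
                SCR_Start_checkpoint();

                char name[256], file[SCR_MAX_FILENAME];
                snprintf(name, sizeof(name), "ckpt.%d.step%d", rank, step);
                SCR_Route_file(name, file); /* SCR picks local/partner/XOR/PFS path */

                int valid = (write_state(file) == 0);
                SCR_Complete_checkpoint(valid);
            }
        }

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }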

Application-guided Multi-Level Checkpointing

• Checkpoint writing phase times of a representative SCR-enabled MPI application
• 512 MPI processes (8 procs/node)
• Approx. 51 GB checkpoints

[Chart: Checkpoint Writing Time (s) for PFS, MVAPICH2+SCR (Local), MVAPICH2+SCR (Partner) and MVAPICH2+SCR (XOR)]

Transparent Multi-Level Checkpointing

• ENZO Cosmology application, Radiation Transport workload
• Using MVAPICH2's CR protocol instead of the application's in-built CR mechanism
• 512 MPI processes (8 procs/node)
• Approx. 12.8 GB checkpoints

[Chart: Checkpointing Time (ms), broken into Suspend N/W, Reactivate N/W and Write Checkpoint, for MVAPICH2-CR (PFS) vs. MVAPICH2+SCR (Multi-Level)]


Scalable OpenSHMEM/UPC and Hybrid (MPI, UPC and OpenSHMEM) Designs

• Based on the OpenSHMEM Reference Implementation (http://openshmem.org/) and UPC version 2.14.2 (http://upc.lbl.gov/)
  – These provide a design over GASNet
  – They do not take advantage of all OFED features
• Design scalable and high-performance OpenSHMEM and UPC over OFED
• Designing a hybrid MPI + OpenSHMEM/UPC model
  – Current model: separate runtimes for OpenSHMEM/UPC and MPI
    • Possible deadlock if both runtimes are not progressed
    • Consumes more network resources
  – Our approach: a single unified runtime for MPI and OpenSHMEM/UPC
• Available in MVAPICH2-X 1.9

[Diagram: a hybrid (UPC/OpenSHMEM+MPI) application issuing UPC/OpenSHMEM calls and MPI calls through their respective interfaces into a Unified Communication Runtime over the InfiniBand network]
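To make the hybrid model concrete, the sketch below mixes OpenSHMEM and MPI calls in one program, which is the usage the unified runtime is designed to support. It is a minimal sketch assuming the OpenSHMEM 1.0 API (start_pes, _my_pe, shmalloc, shmem_long_put, shmem_barrier_all) and assuming that both models may be initialized together as described in the MVAPICH2-X user guide; the initialization ordering shown here is an assumption, not taken from this presentation.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Assumption: MVAPICH2-X's unified runtime allows both models in one program. */
        MPI_Init(&argc, &argv);
        start_pes(0);                              /* OpenSHMEM 1.0 initialization */

        int me = _my_pe();
        int npes = _num_pes();

        /* Symmetric heap allocation, remotely accessible by other PEs. */
        long *val = (long *) shmalloc(sizeof(long));
        *val = me;

        /* One-sided put of my rank into the right neighbor's symmetric buffer. */
        long mine = me;
        shmem_long_put(val, &mine, 1, (me + 1) % npes);
        shmem_barrier_all();

        /* Use MPI for a collective over the same set of processes. */
        long sum = 0;
        MPI_Allreduce(val, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (me == 0) printf("sum of received values = %ld\n", sum);

        shfree(val);
        MPI_Finalize();
        return 0;
    }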

Performance of Hybrid (OpenSHMEM+MPI) Applications

• Improved performance for hybrid applications
  – 34% improvement for 2D Heat Transfer Modeling with 512 processes
  – 45% improvement for Graph500 with 256 processes
• Our approach with a single runtime consumes 27% less network resources

[Charts: Time (s) vs. No. of Processes for 2D Heat Transfer Modeling and Graph-500, and Network Resource Utilization (MB) vs. No. of Processes, comparing Hybrid-GASNet and Hybrid-OSU]

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Hybrid MPI+OpenSHMEM Graph500 Design

• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
• 2,048 processes:
  – 1.9X improvement over MPI-CSR (best performing MPI version)
  – 2.7X improvement over MPI-Simple (same communication characteristics)
• 8,192 processes:
  – 2.4X improvement over MPI-CSR
  – 7.6X improvement over MPI-Simple

[Charts: Execution Time (s) vs. # of Processes (2048, 8192); Traversed Edges Per Second (TEPS) for strong scaling (Scale 26-29) and weak scaling (1,024-8,192 processes), comparing MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM)]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013 (Monday, Hall 5, 5:40 - 6:00 PM)


InfiniBand + GPU Systems (Past)

• Many systems today combine GPUs with high-speed networks such as InfiniBand
• Problem: lack of a common memory registration mechanism
  – Each device has to pin the host memory it will use
  – Many operating systems do not allow multiple devices to register the same memory pages
• Previous solution: use a different buffer for each device and copy data between them

GPU-Direct

• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices register a common host buffer
  – The GPU copies data to this buffer, and the network adapter can directly read from this buffer (or vice versa)
• Note that GPU-Direct does not allow you to bypass host memory

[Diagram: GPU and NIC attached to the CPU over PCIe, with the NIC connected to the switch]

Sample Code - Without MPI Integration

At Sender:
    cudaMemcpy(sbuf, sdev, . . .);
    MPI_Send(sbuf, size, . . .);

At Receiver:
    MPI_Recv(rbuf, size, . . .);
    cudaMemcpy(rdev, rbuf, . . .);

• Naïve implementation with standard MPI and CUDA
• High productivity and poor performance

Sample Code - User-Optimized Code

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
    for (j = 0; j < pipeline_len; j++) {
        while (cudaStreamQuery(. . .) != cudaSuccess) {
            if (j > 0) MPI_Test(. . .);
        }
        MPI_Isend(sbuf + j * blksz, blksz, . . .);
    }
    MPI_Waitall(. . .);

• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• Code at the sender side (and repeated at the receiver side)
• User-level copying may not match the internal MPI design
• High performance and poor productivity

Can this be done within the MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g., enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer which automatically provides optimizations inside the MPI library without user tuning
• A new design was incorporated in MVAPICH2 to support this functionality

Sample Code - MVAPICH2-GPU

At Sender:
    MPI_Send(s_device, size, . . .);

At Receiver:
    MPI_Recv(r_device, size, . . .);

(the buffers are in GPU memory; data movement happens inside MVAPICH2)

• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity
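Putting the fragments above together, here is a minimal sketch of a two-rank GPU-to-GPU exchange with a CUDA-aware MPI such as MVAPICH2 (built with CUDA support and run with GPU support enabled, e.g. MV2_USE_CUDA=1; the exact build and runtime options are an assumption and are documented in the MVAPICH2 user guide).

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)   /* number of ints exchanged */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int *d_buf;
        cudaMalloc((void **)&d_buf, N * sizeof(int));   /* buffer in GPU memory */

        if (rank == 0) {
            cudaMemset(d_buf, 1, N * sizeof(int));      /* fill device buffer (every byte = 1) */
            /* Device pointer passed directly to MPI; MVAPICH2 handles staging/pipelining. */
            MPI_Send(d_buf, N, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d ints into GPU memory\n", N);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }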

MPI Two-sided Communication Performance

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send), for 4MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend), for 4MB messages

[Chart: Time (us) vs. Message Size (bytes), 32K-4M, for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU (lower is better)]

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11

Application-Level Evaluation (LBM and AWP-ODC)

• LBM-CUDA (Courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 1D LBM-CUDA: one process/GPU per node, 16 nodes, 4 groups data grid
• AWP-ODC (Courtesy: Yifeng Cui, SDSC)
  – A seismic modeling code, Gordon Bell Prize finalist at SC 2010
  – 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

[Charts: 1D LBM-CUDA Step Time (s) vs. Domain Size X*Y*Z (256^3 to 512^3), MPI vs. MPI-GPU, with improvements of 13.7%, 12.0%, 11.8% and 9.4%; AWP-ODC Total Execution Time (s) with 1 GPU/proc and 2 GPUs/procs per node, with improvements of 11.1% and 7.9%]

Optimizing Collective Communication

• An MPI_Alltoall from GPU memory decomposes into point-to-point exchanges (N^2 in total), each involving a DMA from device to host, an RDMA transfer to the remote node over the network, and a DMA from host to device
• Pipelined point-to-point communication optimizes each individual exchange
• Need for optimization at the algorithm (collective) level
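For contrast with the collective-level designs evaluated next, the sketch below shows what a user would otherwise hand-code: stage the whole device buffer to the host, run MPI_Alltoall on host memory, and copy the result back. With CUDA-aware MVAPICH2 the device pointers can instead be passed to MPI_Alltoall directly, so the staging and pipelining decisions move inside the library. Buffer names and sizes are illustrative only.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int count = 4096;                        /* ints exchanged with each peer */
        size_t bytes = (size_t)nprocs * count * sizeof(int);

        int *d_send, *d_recv;
        cudaMalloc((void **)&d_send, bytes);
        cudaMalloc((void **)&d_recv, bytes);

        /* Hand-coded staging through the host (what the MPI-level designs avoid). */
        int *h_send = malloc(bytes), *h_recv = malloc(bytes);
        cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);
        MPI_Alltoall(h_send, count, MPI_INT, h_recv, count, MPI_INT, MPI_COMM_WORLD);
        cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);

        /* With CUDA-aware MVAPICH2, the same exchange can simply be:
         *   MPI_Alltoall(d_send, count, MPI_INT, d_recv, count, MPI_INT, MPI_COMM_WORLD);
         * letting the library schedule the DMA and RDMA stages per peer. */

        free(h_send); free(h_recv);
        cudaFree(d_send); cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }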

Alltoall Latency Performance (Large Messages)

• 8-node Westmere cluster with NVIDIA Tesla C2050 GPUs and IB QDR
• Collective-level optimization improves large-message Alltoall latency by up to 46% over no MPI-level optimization

[Chart: Time (us) vs. Message Size, 128K-2M, for No MPI Level Optimization vs. Collective Level Optimization (lower is better)]

A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011


Non-contiguous Data Exchange

• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains]

Datatype Support in MPI

• Native datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable the MPI library to optimize non-contiguous data

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• What will happen if the non-contiguous data is in GPU device memory?
• Enhanced MVAPICH2
  – Uses datatype-specific CUDA kernels to pack data in chunks
  – Pipelines pack/unpack, CUDA copies, and RDMA transfers

H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011
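As a concrete illustration of the combination above, the sketch below sends one non-contiguous column of a 2D array that resides in GPU memory, using MPI_Type_vector with a device pointer; with the enhanced MVAPICH2 design, the pack/unpack kernels and pipelining happen inside the library. The array dimensions and names are illustrative, and CUDA-aware GPU support is assumed to be enabled at runtime.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nx = 512, ny = 512;                 /* 2D grid, row-major, on the GPU */
        double *d_grid;
        cudaMalloc((void **)&d_grid, (size_t)nx * ny * sizeof(double));

        /* One boundary column is nx elements with a stride of ny. */
        MPI_Datatype column;
        MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0) {
            /* Device pointer + derived datatype: the library packs on the GPU and pipelines. */
            MPI_Send(d_grid, 1, column, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_grid, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&column);
        cudaFree(d_grid);
        MPI_Finalize();
        return 0;
    }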

Application-Level Evaluation (LBMGPU-3D)

• LBM-CUDA (Courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

[Chart: 3D LBM-CUDA Total Execution Time (s) vs. Number of GPUs (8-64), MPI vs. MPI-GPU, with improvements of 5.6%, 8.2%, 13.5% and 15.5%]


Multi-GPU Configurations

• Multi-GPU node architectures are becoming common
• Until CUDA 3.2
  – Communication between processes staged through the host
  – Shared memory (pipelined)
  – Network loopback (asynchronous)
• CUDA 4.0
  – Inter-Process Communication (IPC)
  – Host bypass
  – Handled by a DMA engine
  – Low latency and asynchronous
  – Requires creation, exchange and mapping of memory handles: overhead

[Diagram: two processes on one node, each driving a different GPU attached through the I/O hub, sharing host memory and the HCA]

Comparison of Costs

• Comparison of bare copy costs between two processes on one node, each using a different GPU (outside MPI):
  – CUDA IPC copy: 3 usec
  – Copy via host: 49 usec
  – CUDA IPC copy + handle creation & mapping overhead: 228 usec
• MVAPICH2 takes advantage of CUDA IPC while hiding the handle creation and mapping overheads from the user
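The handle-related overhead quantified above comes from the CUDA IPC API itself. The sketch below shows the bare steps a user (or an MPI library) must perform: the exporting process creates a handle for its device allocation, ships the handle to the peer (here over MPI), and the peer maps it before it can copy between the two GPUs. MVAPICH2 caches these mappings so the cost is paid once rather than per message. The two-rank setup and device assignment are illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank);               /* assumes two ranks on one node, one GPU each */

        const size_t bytes = 1 << 20;
        char *d_buf;
        cudaMalloc((void **)&d_buf, bytes);

        if (rank == 0) {
            /* Export: create an IPC handle for the allocation and send it to rank 1. */
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, d_buf);
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Import: map rank 0's buffer, then copy device-to-device without the host. */
            cudaIpcMemHandle_t handle;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            void *d_peer;
            cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
            cudaMemcpy(d_buf, d_peer, bytes, cudaMemcpyDeviceToDevice);
            cudaIpcCloseMemHandle(d_peer);
        }

        MPI_Barrier(MPI_COMM_WORLD);       /* keep rank 0's buffer alive until the copy is done */
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }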

Two-sided Communication Performance

[Charts: Latency (usec) and Bandwidth (MBps) vs. Message Size (Bytes) for SHARED-MEM vs. CUDA IPC, with labeled improvements of 70%, 46% and 78%]

• Already available in MVAPICH2 1.8 and 1.9

One-sided Communication Performance (get + active synchronization vs. send/recv)

• One-sided semantics harness better performance compared to two-sided semantics

[Charts: Latency (usec) and Bandwidth (MBps) vs. Message Size (Bytes) for SHARED-MEM-1SC, CUDA-IPC-1SC and CUDA-IPC-2SC, with labeled improvements of 30% and 27%]

• Support for one-sided communication from GPUs will be available in future releases of MVAPICH2
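To show the pattern being measured (get + active synchronization), the sketch below creates an MPI window over a GPU buffer, opens an access epoch with MPI_Win_fence, and pulls data from the peer with MPI_Get. Since one-sided communication directly on GPU buffers is described above as a future MVAPICH2 feature, this is an illustrative sketch of the intended usage, not something guaranteed to work on the 1.9 release as shipped.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 18;                    /* doubles exposed in the window */
        double *d_buf;
        cudaMalloc((void **)&d_buf, count * sizeof(double));

        /* Expose the GPU buffer through an MPI window. */
        MPI_Win win;
        MPI_Win_create(d_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                        /* open the access epoch */
        if (rank == 1) {
            /* Pull count doubles from rank 0's GPU buffer into our own GPU buffer. */
            MPI_Get(d_buf, count, MPI_DOUBLE, 0, 0, count, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);                        /* active synchronization completes the get */

        MPI_Win_free(&win);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }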


OpenACC

• OpenACC is gaining popularity
• Several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections of code onto accelerators:

    #pragma acc kernels
    {
        for (i = 0; i < size; i++) {
            A[i] = B[i] + C[i];
        }
    }

• Routines to allocate/free memory on accelerators:

    buffer = acc_malloc(MYBUFSIZE);
    acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers: copy, copyout, private, independent, etc.

Using MVAPICH2 with OpenACC 1.0

• acc_malloc to allocate device memory
  – No changes to MPI calls
  – MVAPICH2 detects the device pointer and optimizes data movement
  – Delivers the same performance as with CUDA

    A = acc_malloc(sizeof(int) * N);
    ...
    #pragma acc parallel loop deviceptr(A) . . .
    // compute for loop
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    ...
    acc_free(A);

Using MVAPICH2 with the New OpenACC 2.0

• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler, once it is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Expected to deliver the same performance as with CUDA

    A = malloc(sizeof(int) * N);
    ...
    #pragma acc data copyin(A) . . .
    {
        #pragma acc parallel loop . . .
        // compute for loop
        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    ...
    free(A);


GPU-Direct RDMA with CUDA 5.0

• Fastest possible communication between the GPU and other PCI-E devices
• The network adapter can directly read/write data from/to GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication

[Diagram: InfiniBand adapter accessing GPU memory directly across the chipset, without staging through system memory attached to the CPU]

Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA

• A preliminary driver for GPU-Direct RDMA is under work by NVIDIA and Mellanox
• OSU has done an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver
  – Hybrid design
  – Takes advantage of GPU-Direct-RDMA for short messages
  – Uses the host-based buffered design in current MVAPICH2 for large messages
  – Alleviates the Sandy Bridge chipset bottleneck

MVAPICH2-GDR Alpha will be demonstrated at the ISC '13 exhibition floor (Mellanox Technologies, Booth #326)

Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA: GPU-GPU Internode MPI Latency

• Based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) nodes with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Charts: Small Message Latency and Large Message Latency (us) vs. Message Size (bytes) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; small-message latency reduced from 19.78 us to 6.12 us, a 69% improvement (lower is better)]

Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA: GPU-GPU Internode MPI Uni-directional Bandwidth

• Same platform as above

[Charts: Small Message Bandwidth and Large Message Bandwidth (MB/s) vs. Message Size (bytes) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; up to 3x higher small-message bandwidth and 26% higher large-message bandwidth (higher is better)]

Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA: GPU-GPU Internode MPI Bi-directional Bandwidth

• Same platform as above

[Charts: Small Message Bi-Bandwidth and Large Message Bi-Bandwidth (MB/s) vs. Message Size (bytes) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; up to 3.13x higher small-message bi-bandwidth and 51% higher large-message bi-bandwidth (higher is better)]

Execution Time of HSG Application

• Application run with two GPU nodes; same platform as above
• Benefits small and medium message exchanges – good for strong scaling
• Context-switching overheads in cudaMemcpy with multiple procs/GPU
• GDR avoids these overheads

[Charts: Total Time (s) vs. Problem Size (64, 128, 256) for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; improvements of 38% with 1 MPI process per node, and 68%, 57% and 21% with 2 MPI processes per node (lower is better)]

Execution Time of HSG Application (Cont'd)

• Application run with two GPU nodes; same platform as above

[Chart: Total Time (s) vs. Problem Size (64, 128, 256) with 4 MPI processes per node for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; improvements of 78%, 74% and 38% (lower is better)]

MVAPICH2 Release with GPUDirect RDMA Hybrid

• Further tuning and optimizations (such as collectives) to be done
• The MVAPICH2 release with GPUDirect RDMA support will be timed with the release of the OFED driver with GDR support by Mellanox and NVIDIA

Conclusions

• MVAPICH2/MVAPICH2-X offer solutions to address challenges on large-scale HPC clusters
• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Point-to-point communication, collective communication and datatype processing are addressed
• Takes advantage of CUDA features like IPC and GPUDirect RDMA
• Optimizations are under the hood of MPI calls, hiding all the complexity from the user
• High productivity and high performance

MVAPICH User Group (MUG) Meeting
August 26-27, 2013, Columbus, Ohio, U.S.A.

• The MUG meeting will provide an open forum for all attendees (users, researchers, system administrators, engineers, and students) to share their knowledge about MVAPICH2/MVAPICH2-X on large-scale systems and diverse applications.
• The event includes:
  – Talks from experts in the field
  – Presentations from the MVAPICH team on tuning and optimization strategies
  – Troubleshooting guidelines
  – Contributed presentations
  – Open mic session
  – Interactive one-on-one sessions with the MVAPICH developers
• Call for Presentations: the MVAPICH team is requesting the submission of presentations from MVAPICH2 and MVAPICH2-X users to be included in the event.
  – Presentation submission deadline: July 1, 2013
  – Notification of acceptance: July 8, 2013
  – Advanced registration deadline: July 15, 2013
• The preliminary program has been posted at mug.mvapich.cse.ohio-state.edu/program/