
Page 1: MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Presentation at GTC 2013

Page 2: Current and Next Generation HPC Systems and Applications

• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
  – Growth in accelerators (NVIDIA GPUs)

Page 3: Trends for Commodity Computing Clusters in the Top 500 Supercomputer List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top 500 over time; axes: Timeline vs. Number of Clusters and Percentage of Clusters]

Page 4: Large-scale InfiniBand Installations

• 224 IB clusters (44.8%) in the November 2012 Top500 list (http://www.top500.org)
• Installations in the Top 20 (9 systems); two use NVIDIA GPUs
  – 147,456 cores (SuperMUC) in Germany (6th)
  – 204,900 cores (Stampede) at TACC (7th)
  – 77,184 cores (Curie thin nodes) at France/CEA (11th)
  – 120,640 cores (Nebulae) at China/NSCS (12th)
  – 72,288 cores (Yellowstone) at NCAR (13th)
  – 125,980 cores (Pleiades) at NASA/Ames (14th)
  – 70,560 cores (Helios) at Japan/IFERC (15th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (17th)
  – 138,368 cores (Tera-100) at France/CEA (20th)
• 54 of the InfiniBand clusters in the TOP500 house accelerators/co-processors, and 42 of them have NVIDIA GPUs

Page 5: Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion

Page 6: InfiniBand + GPU Systems (Past)

• Many systems today combine GPUs with high-speed networks such as InfiniBand
• Problem: lack of a common memory registration mechanism
  – Each device has to pin the host memory it will use
  – Many operating systems do not allow multiple devices to register the same memory pages
• Previous solution:
  – Use a different buffer for each device and copy data between them

Page 7: GPU-Direct

• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices register a common host buffer
  – GPU copies data to this buffer, and the network adapter can directly read from this buffer (or vice-versa)
• Note that GPU-Direct does not allow you to bypass host memory

Page 8: Sample Code – Without MPI Integration

[Diagram: data path GPU – PCIe – CPU – NIC – Switch on each node]

At Sender:
    cudaMemcpy(sbuf, sdev, . . .);
    MPI_Send(sbuf, size, . . .);

At Receiver:
    MPI_Recv(rbuf, size, . . .);
    cudaMemcpy(rdev, rbuf, . . .);

• Naïve implementation with standard MPI and CUDA
• High Productivity and Poor Performance

Page 9: Sample Code – User-Optimized Code

[Diagram: data path GPU – PCIe – CPU – NIC – Switch on each node]

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(sbuf + j * blksz, blksz, . . .);
    }
    MPI_Waitall(…);

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code shown for the sender side (and repeated at the receiver side)
• User-level copying may not match the internal MPI design
• High Performance and Poor Productivity

Page 10: Can this be done within the MPI Library?

• Support GPU to GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer which automatically provides optimizations inside the MPI library without user tuning
• A new design was incorporated in MVAPICH2 to support this functionality

Page 11: MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 160,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th ranked 204,900-core cluster (Stampede) at TACC
    • 14th ranked 125,980-core cluster (Pleiades) at NASA
    • 17th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 75th ranked 16,896-core cluster (Keeneland) at GaTech
    • and many others
  – Available with software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu

Page 12: Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion

Page 13: Sample Code – MVAPICH2-GPU

At Sender:
    MPI_Send(s_device, size, …);

At Receiver:
    MPI_Recv(r_device, size, …);

(s_device and r_device are GPU device buffers; the pipelining is handled inside MVAPICH2)

• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers
• High Performance and High Productivity
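For readers who want to try this pattern, a minimal self-contained sketch is shown below. It is not taken from the slides: it assumes a CUDA-aware MVAPICH2 build, exactly two ranks with one GPU each, and an illustrative 4 MB message size.

    /* Minimal sketch of GPU-to-GPU MPI with a CUDA-aware MPI library such as
     * MVAPICH2-GPU. Assumes exactly two ranks, each with one GPU. Compile with
     * the MPI compiler wrapper (e.g. mpicc) and link against libcudart. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t size = 4 * 1024 * 1024;   /* 4 MB message (illustrative) */
        char *dev_buf = NULL;
        cudaMalloc((void **)&dev_buf, size);   /* device buffer passed to MPI */

        if (rank == 0) {
            cudaMemset(dev_buf, 1, size);
            /* Device pointer handed directly to MPI_Send; the library detects it
             * (via Unified Virtual Addressing) and pipelines the transfer. */
            MPI_Send(dev_buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(dev_buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Rank 1 received %zu bytes into GPU memory\n", size);
        }

        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
    }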

Page 14: MPI Two-sided Communication

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send), for 4MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend), for 4MB messages

[Chart: internode latency (us) vs. message size (32K–4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better]

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", ISC '11

Page 15: Application-Level Evaluation (LBM and AWP-ODC)

• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 1D LBM-CUDA: one process/GPU per node, 16 nodes, 4 groups data grid
• AWP-ODC (courtesy: Yifeng Cui, SDSC)
  – A seismic modeling code, Gordon Bell Prize finalist at SC 2010
  – 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node

[Chart, 1D LBM-CUDA: step time (s) vs. domain size (256x256x256 through 512x512x512) for MPI and MPI-GPU; improvements of 13.7%, 12.0%, 11.8% and 9.4% across the four domain sizes]
[Chart, AWP-ODC: total execution time (sec) for 1 GPU/process per node and 2 GPUs/processes per node; improvements of 11.1% and 7.9% for the two configurations]

Page 16: Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion

Page 17: Optimizing Collective Communication

[Diagram: MPI_Alltoall decomposed into N² point-to-point exchanges; each exchange consists of a DMA moving data from device to host, an RDMA transfer to the remote node over the network, and a DMA moving data from host to device]

• Pipelined point-to-point communication optimizes each individual exchange
• Need for optimization at the algorithm level (see the baseline sketch below)
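To make the baseline concrete, the sketch below stages an alltoall over GPU buffers through the host in one shot, with no pipelining; this is the pattern that algorithm-level collective designs improve on. The function name, buffer names and the use of MPI_CHAR are illustrative assumptions, not MVAPICH2 internals.

    /* Sketch: host-staged MPI_Alltoall for GPU device buffers (the baseline).
     * A CUDA-aware library such as MVAPICH2 can instead pipeline the
     * device-to-host DMA, the network transfer and the host-to-device DMA
     * at the algorithm level. Assumes blksz bytes per destination rank. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void alltoall_staged(const char *d_send, char *d_recv, size_t blksz,
                         int nprocs, MPI_Comm comm)
    {
        size_t total = blksz * (size_t)nprocs;
        char *h_send = NULL, *h_recv = NULL;

        /* Pinned host staging buffers so the NIC can read/write them directly */
        cudaMallocHost((void **)&h_send, total);
        cudaMallocHost((void **)&h_recv, total);

        cudaMemcpy(h_send, d_send, total, cudaMemcpyDeviceToHost);
        MPI_Alltoall(h_send, (int)blksz, MPI_CHAR,
                     h_recv, (int)blksz, MPI_CHAR, comm);
        cudaMemcpy(d_recv, h_recv, total, cudaMemcpyHostToDevice);

        cudaFreeHost(h_send);
        cudaFreeHost(h_recv);
    }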

Page 18: Alltoall Latency Performance (Large Messages)

[Chart: MPI_Alltoall latency (us) vs. message size (128K–2M) with no MPI-level optimization vs. collective-level optimization; up to 46% improvement; lower is better]

• 8-node Westmere cluster with NVIDIA Tesla C2050 GPUs and IB QDR

A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, "MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits", Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011

Page 19: Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion

Page 20: Non-contiguous Data Exchange

• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains]

Page 21: Datatype Support in MPI

• Native datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enables the MPI library to optimize non-contiguous data

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• What happens if the non-contiguous data is in GPU device memory?
• Enhanced MVAPICH2
  – Uses datatype-specific CUDA kernels to pack data in chunks
  – Pipelines pack/unpack, CUDA copies and RDMA transfers

H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2", IEEE Cluster '11, Sept. 2011
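As an illustration of this path, the sketch below builds an MPI_Type_vector over a buffer that lives in GPU memory and sends it directly, assuming a CUDA-aware MVAPICH2 build. The grid layout, element type and function name are illustrative assumptions.

    /* Sketch: sending a strided (non-contiguous) slice that lives in GPU memory,
     * assuming a CUDA-aware MVAPICH2 build. The layout (rows, cols, pitch, with
     * cols <= pitch) and the peer rank are illustrative assumptions. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void send_gpu_halo(int rows, int cols, int pitch, int dest, MPI_Comm comm)
    {
        double *d_grid = NULL;
        cudaMalloc((void **)&d_grid, (size_t)rows * pitch * sizeof(double));

        /* One block of `cols` doubles per row, rows separated by `pitch` doubles */
        MPI_Datatype halo_type;
        MPI_Type_vector(rows, cols, pitch, MPI_DOUBLE, &halo_type);
        MPI_Type_commit(&halo_type);

        /* Device pointer plus derived datatype: the library packs the chunks with
         * CUDA kernels and pipelines pack/unpack, copies and RDMA internally. */
        MPI_Send(d_grid, 1, halo_type, dest, 0, comm);

        MPI_Type_free(&halo_type);
        cudaFree(d_grid);
    }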

Page 22: Application-Level Evaluation (LBMGPU-3D)

• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node

[Chart, 3D LBM-CUDA: total execution time (sec) vs. number of GPUs (8, 16, 32, 64) for MPI and MPI-GPU; improvements of 5.6%, 8.2%, 13.5% and 15.5% as the GPU count grows]

Page 23: MVAPICH2 1.8 and 1.9 Series

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multiple GPU adapters per node (GPU-GPU, GPU-Host and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers

Page 24: OSU MPI Micro-Benchmarks (OMB) 3.5 – 3.9 Releases

• A comprehensive suite of benchmarks to compare the performance of different MPI stacks and networks
• Enhancements to measure MPI performance on GPU clusters
  – Latency, Bandwidth, Bi-directional Bandwidth
• Flexible selection of data movement between CPU (H) and GPU (D): D->D, D->H and H->D
• Extensions for OpenACC added in the 3.9 release
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Also available in an integrated manner with the MVAPICH2 stack

D. Bureddy, H. Wang, A. Venkatesh, S. Potluri and D. K. Panda, "OMB-GPU: A Micro-benchmark Suite for Evaluating MPI Libraries on GPU Clusters", EuroMPI 2012, September 2012

Page 25: Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion

Page 26: GPU-Direct RDMA with CUDA 5.0

• Fastest possible communication between the GPU and other PCIe devices
• The network adapter can directly read/write data from/to GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication

[Diagram: InfiniBand adapter accessing GPU memory directly across the chipset, without staging in system memory]

Page 27: Initial Design of MVAPICH2 with GPU-Direct-RDMA

• A preliminary driver for GPU-Direct RDMA is being developed by NVIDIA and Mellanox
• OSU has done an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver

Page 28: Preliminary Performance Evaluation of OSU-MVAPICH2 with GPU-Direct-RDMA

• Performance evaluation has been carried out on four platform configurations:
  – Sandy Bridge, IB FDR, K20c
  – WestmereEP, IB FDR, K20c
  – Sandy Bridge, IB QDR, K20c
  – WestmereEP, IB QDR, K20c

Page 29: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Latency – Sandy Bridge + K20c + IB FDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Chart, small-message latency (1–4K bytes): MV2-GDR-Hybrid reduces latency from 20.92 us (MV2) to 6.54 us, a 69% improvement; lower is better]
[Chart, large-message latency (8K–2M bytes): MV2 vs. MV2-GDR-Hybrid; lower is better]

Page 30: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Uni-directional Bandwidth – Sandy Bridge + K20c + IB FDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Charts: small-message (1–4K bytes) and large-message (8K–2M bytes) bandwidth (MB/s), MV2 vs. MV2-GDR-Hybrid; annotated improvements of about 3x and 26%; higher is better]

Page 31: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Bi-directional Bandwidth – Sandy Bridge + K20c + IB FDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Charts: small-message (1–4K bytes) and large-message (8K–2M bytes) bi-directional bandwidth (MB/s), MV2 vs. MV2-GDR-Hybrid; annotated improvements of about 3.25x and 31%; higher is better]

Page 32: Preliminary Performance Evaluation of OSU-MVAPICH2 with GPU-Direct-RDMA

• Performance evaluation has been carried out on four platform configurations:
  – Sandy Bridge, IB FDR, K20c
  – WestmereEP, IB FDR, K20c
  – Sandy Bridge, IB QDR, K20c
  – WestmereEP, IB QDR, K20c

Page 33: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Latency – WestmereEP + K20c + IB FDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Chart, small-message latency (1–4K bytes): MV2-GDR-Hybrid reduces latency from 24.92 us (MV2) to 9.93 us, a 60% improvement; lower is better]
[Chart, large-message latency (8K–2M bytes): MV2 vs. MV2-GDR-Hybrid; lower is better]

Page 34: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Uni-directional Bandwidth – WestmereEP + K20c + IB FDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Charts: small-message (1–4K bytes) and large-message (8K–2M bytes) bandwidth (MB/s), MV2 vs. MV2-GDR-Hybrid; annotated improvements of about 5.8x and 30%; higher is better]

Page 35: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Bi-directional Bandwidth – WestmereEP + K20c + IB FDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Charts: small-message (1–4K bytes) and large-message (8K–2M bytes) bi-directional bandwidth (MB/s), MV2 vs. MV2-GDR-Hybrid; annotated improvements of about 6x and 90%; higher is better]

Page 36: Preliminary Performance Evaluation of OSU-MVAPICH2 with GPU-Direct-RDMA

• Performance evaluation has been carried out on four platform configurations:
  – Sandy Bridge, IB FDR, K20c
  – WestmereEP, IB FDR, K20c
  – Sandy Bridge, IB QDR, K20c
  – WestmereEP, IB QDR, K20c

Page 37: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Latency – Sandy Bridge + K20c + IB QDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Chart, small-message latency (1–4K bytes): MV2-GDR-Hybrid reduces latency from 21.46 us (MV2) to 6.85 us, a 68% improvement; lower is better]
[Chart, large-message latency (8K–2M bytes): MV2 vs. MV2-GDR-Hybrid; lower is better]

Page 38: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Uni-directional Bandwidth – Sandy Bridge + K20c + IB QDR
Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Charts: small-message (1–4K bytes) and large-message (8K–2M bytes) bandwidth (MB/s), MV2 vs. MV2-GDR-Hybrid; annotated improvements of about 3x and 40%; higher is better]

Page 39: Preliminary Performance Evaluation of OSU-MVAPICH2 with GPU-Direct-RDMA

• Performance evaluation has been carried out on four platform configurations:
  – Sandy Bridge, IB FDR, K20c
  – WestmereEP, IB FDR, K20c
  – Sandy Bridge, IB QDR, K20c
  – WestmereEP, IB QDR, K20c

Page 40: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Latency – WestmereEP + K20c + IB QDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Chart, small-message latency (1–4K bytes): MV2-GDR-Hybrid reduces latency from 25.75 us (MV2) to 10.43 us, a 60% improvement; lower is better]
[Chart, large-message latency (8K–2M bytes): MV2 vs. MV2-GDR-Hybrid; lower is better]

Page 41: Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

GPU-GPU Internode MPI Uni-directional Bandwidth – WestmereEP + K20c + IB QDR
Based on MVAPICH2-1.9b; WestmereEP (E5645) node with 12 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-2 QDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

[Charts: small-message (1–4K bytes) and large-message (8K–2M bytes) bandwidth (MB/s), MV2 vs. MV2-GDR-Hybrid; annotated improvements of about 6x and 30%; higher is better]

Page 42: MVAPICH2 Release with GPUDirect RDMA Hybrid

• Further tuning and optimizations (such as collectives) to be done
• GPUDirect RDMA support in the OpenFabrics Enterprise Distribution (OFED) is expected during Q2 '13 (according to Mellanox)
• The MVAPICH2 release with GPUDirect RDMA support will be timed accordingly

Page 43: Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion

Page 44: Multi-GPU Configurations

[Diagram: one node with CPU, memory, I/O hub, HCA and two GPUs (GPU 0 and GPU 1), each used by a separate process (Process 0 and Process 1)]

• Multi-GPU node architectures are becoming common
• Until CUDA 3.2
  – Communication between processes staged through the host
  – Shared memory (pipelined)
  – Network loopback (asynchronous)
• CUDA 4.0
  – Inter-Process Communication (IPC)
  – Host bypass
  – Handled by a DMA engine
  – Low latency and asynchronous
  – Requires creation, exchange and mapping of memory handles, which is an overhead
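For context, the sketch below shows what the handle creation, exchange and mapping look like when done by hand with the CUDA IPC API; this is the overhead that MVAPICH2 hides behind plain MPI_Send/MPI_Recv. The MPI-based handle exchange, the 1 MB buffer size and the device assignment are illustrative assumptions.

    /* Sketch: manual CUDA IPC between two processes on one node (one GPU each),
     * illustrating the handle creation/exchange/mapping steps. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank);                      /* assume one GPU per process */

        const size_t size = 1 << 20;              /* 1 MB buffer (illustrative) */

        if (rank == 0) {
            void *d_buf = NULL;
            cudaMalloc(&d_buf, size);

            /* Create an IPC handle for the device buffer and ship it to rank 1 */
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, d_buf);
            MPI_Send(&handle, (int)sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);

            MPI_Barrier(MPI_COMM_WORLD);          /* wait until rank 1 is done */
            cudaFree(d_buf);
        } else if (rank == 1) {
            cudaIpcMemHandle_t handle;
            MPI_Recv(&handle, (int)sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

            /* Map rank 0's buffer and copy from it GPU-to-GPU, bypassing the host */
            void *d_remote = NULL, *d_local = NULL;
            cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
            cudaMalloc(&d_local, size);
            cudaMemcpy(d_local, d_remote, size, cudaMemcpyDeviceToDevice);

            cudaIpcCloseMemHandle(d_remote);
            cudaFree(d_local);
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }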

Page 45: Comparison of Costs

[Chart: copy latency (usec) between two GPUs in one node: CUDA IPC copy ≈ 3 usec; copy via host ≈ 49 usec; CUDA IPC copy plus handle creation and mapping overhead ≈ 228 usec]

• Comparison of bare copy costs between two processes on one node, each using a different GPU (outside MPI)
• MVAPICH2 takes advantage of CUDA IPC while hiding the handle creation and mapping overheads from the user

Page 46: Two-sided Communication Performance

[Charts: intra-node GPU-to-GPU small-message latency, large-message latency and bandwidth, SHARED-MEM vs. CUDA IPC; improvements of 70%, 46% and 78%]

• Already available in MVAPICH2 1.8 and 1.9

Page 47: One-sided Communication Performance (get + active synchronization vs. send/recv)

[Charts: intra-node GPU-to-GPU small-message latency, large-message latency and bandwidth for SHARED-MEM-1SC, CUDA-IPC-1SC and CUDA-IPC-2SC; improvements of up to 30% and 27%]

• One-sided semantics harness better performance compared to two-sided semantics
• Support for one-sided communication from GPUs will be available in future releases of MVAPICH2
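As a reminder of what "get + active synchronization" means at the MPI level, the sketch below uses MPI_Win_create, MPI_Get and MPI_Win_fence with host buffers; the window size and ranks are illustrative assumptions, and doing this directly from GPU buffers is the future MVAPICH2 feature noted above.

    /* Sketch: MPI one-sided get with active (fence) synchronization, the pattern
     * compared against send/recv above. Shown with host buffers. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;                /* illustrative window size */
        double *win_buf   = (double *)calloc(count, sizeof(double));
        double *local_buf = (double *)calloc(count, sizeof(double));

        /* Every rank exposes win_buf in a window */
        MPI_Win win;
        MPI_Win_create(win_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                    /* open the access epoch */
        if (rank == 1)
            /* Rank 1 pulls rank 0's exposed buffer; no matching send is needed */
            MPI_Get(local_buf, count, MPI_DOUBLE, 0, 0, count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                    /* complete all transfers */

        MPI_Win_free(&win);
        free(win_buf);
        free(local_buf);
        MPI_Finalize();
        return 0;
    }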

Page 48: Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU
  – Internode Communication
    • Point-to-point Communication
    • Collective Communication
    • MPI Datatype Processing
    • Using GPUDirect RDMA
  – Multi-GPU Configurations
• MPI and OpenACC
• Conclusion

Page 49: OpenACC

• OpenACC is gaining popularity
• Several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections of code onto accelerators

    #pragma acc region
    {
        for (i = 0; i < size; i++) {
            A[i] = B[i] + C[i];
        }
    }

• Routines to allocate/free memory on accelerators

    buffer = acc_malloc(MYBUFSIZE);
    acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers – copy, copyout, private, independent, etc.

Page 50: Using MVAPICH2 with OpenACC 1.0

• acc_malloc to allocate device memory
  – No changes to MPI calls
  – MVAPICH2 detects the device pointer and optimizes data movement
  – Delivers the same performance as with CUDA

    A = acc_malloc(sizeof(int) * N);
    ......
    #pragma acc parallel loop deviceptr(A) . . .
    // compute for loop
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    ......
    acc_free(A);

Page 51: Using MVAPICH2 with the New OpenACC 2.0

• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler when it is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Expected to deliver the same performance as with CUDA

    A = malloc(sizeof(int) * N);
    ......
    #pragma acc data copyin(A) . . .
    {
        #pragma acc parallel loop . . .
        // compute for loop
        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    ......
    free(A);

Page 52: Conclusions

• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Point-to-point, collective communication and datatype processing are addressed
• Takes advantage of CUDA features like IPC and GPUDirect RDMA
• Optimizations under the hood of MPI calls, hiding all the complexity from the user
• High productivity and high performance