
Page 1:

Communication Frameworks for HPC and Big Data

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Spain Conference (2015)


Page 2:

High-End Computing (HEC): PetaFlop to ExaFlop

HPCAC Spain Conference (Sept '15) 2

100-200 PFlops in 2016-2018

1 EFlops in 2020-2024?

Page 3:

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

3HPCAC Spain Conference (Sept '15)

[Chart: number of clusters (0-500) and percentage of clusters (0-100%) in the Top500 over time; commodity clusters now account for 87% of Top500 systems]

Page 4:

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous

• InfiniBand very popular in HPC clusters

• Accelerators/Coprocessors becoming common in high-end systems

• Pushing the envelope for Exascale computing

Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip

High Performance Interconnects (InfiniBand): <1 usec latency, >100 Gbps bandwidth

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

4

Multi-core Processors

HPCAC Spain Conference (Sept '15)

Page 5:

• 259 IB Clusters (51%) in the June 2015 Top500 list

(http://www.top500.org)

• Installations in the Top 50 (24 systems):

Large-scale InfiniBand Installations

519,640 cores (Stampede) at TACC (8th)
185,344 cores (Pleiades) at NASA/Ames (11th)
72,800 cores Cray CS-Storm in US (13th)
72,800 cores Cray CS-Storm in US (14th)
265,440 cores SGI ICE at Tulip Trading Australia (15th)
124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (16th)
72,000 cores (HPC2) in Italy (17th)
115,668 cores (Thunder) at AFRL/USA (19th)
147,456 cores (SuperMUC) in Germany (20th)
86,016 cores (SuperMUC Phase 2) in Germany (21st)
76,032 cores (Tsubame 2.5) at Japan/GSIC (22nd)
194,616 cores (Cascade) at PNNL (25th)
76,032 cores (Makman-2) at Saudi Aramco (28th)
110,400 cores (Pangea) in France (29th)
37,120 cores (Lomonosov-2) at Russia/MSU (31st)
57,600 cores (SwiftLucy) in US (33rd)
50,544 cores (Occigen) at France/GENCI-CINES (36th)
76,896 cores (Salomon) SGI ICE in Czech Republic (40th)
73,584 cores (Spirit) at AFRL/USA (42nd)
and many more!

5HPCAC Spain Conference (Sept '15)

Page 6:

• Scientific Computing

– Message Passing Interface (MPI), including MPI + OpenMP, is the

Dominant Programming Model

– Many discussions towards Partitioned Global Address Space

(PGAS)

• UPC, OpenSHMEM, CAF, etc.

– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)

• Big Data/Enterprise/Commercial Computing

– Focuses on large data and data analysis

– Hadoop (HDFS, HBase, MapReduce)

– Spark is emerging for in-memory computing

– Memcached is also used for Web 2.0

6

Two Major Categories of Applications

HPCAC Spain Conference (Sept '15)

Page 7:

Towards Exascale System (Today and Target)

Systems: 2015 (Tianhe-2) | 2020-2024 | Difference (Today & Exascale)
System peak: 55 PFlop/s | 1 EFlop/s | ~20x
Power: 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1), ~15x
System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) | 32-64 PB | ~50x
Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) | 1.2 or 15 TF | O(1)
Node concurrency: 24-core CPU + 171-core CoP | O(1k) or O(10k) | ~5x - ~50x
Total node interconnect BW: 6.36 GB/s | 200-400 GB/s | ~40x - ~60x
System size (nodes): 16,000 | O(100,000) or O(1M) | ~6x - ~60x
Total concurrency: 3.12M (12.48M threads, 4/core) | O(billion) for latency hiding | ~100x
MTTI: Few/day | Many/day | O(?)

Courtesy: Prof. Jack Dongarra

7HPCAC Spain Conference (Sept '15)

Page 8:

HPCAC Spain Conference (Sept '15) 8

Parallel Programming Models Overview

[Diagram: three abstract machine models, each with processes P1-P3]

– Shared Memory Model (e.g., SHMEM, DSM): P1-P3 operate on a single shared memory
– Distributed Memory Model (e.g., MPI, the Message Passing Interface): each process has its own private memory
– Partitioned Global Address Space (PGAS) Model (e.g., Global Arrays, UPC, Chapel, X10, CAF, …): private memories combined with a logical shared memory

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

Page 9:

9

Partitioned Global Address Space (PGAS) Models

HPCAC Spain Conference (Sept '15)

• Key features

- Simple shared memory abstractions

- Light weight one-sided communication

- Easier to express irregular communication

• Different approaches to PGAS

- Languages

• Unified Parallel C (UPC)

• Co-Array Fortran (CAF)

• X10

• Chapel

- Libraries

• OpenSHMEM

• Global Arrays

Page 10:

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based

on communication characteristics

• Benefits:

– Best of Distributed Computing Model

– Best of Shared Memory Computing Model

• Exascale Roadmap*:

– “Hybrid Programming is a practical way to

program exascale systems”

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: an HPC application composed of Kernels 1..N; each kernel is written in MPI or PGAS according to its communication characteristics (e.g., Kernel 2 and Kernel N re-written in PGAS, the rest in MPI)]

HPCAC Spain Conference (Sept '15) 10
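As a concrete illustration (not from the original slides), a minimal hybrid MPI + OpenSHMEM kernel might look like the sketch below. It assumes a unified runtime such as MVAPICH2-X in which both models can be initialized in the same program; buffer names and sizes are illustrative only.

#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                                   /* OpenSHMEM 1.2-style initialization */

    int rank, npes = shmem_n_pes();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* PGAS-style kernel: one-sided put of my rank into the next PE's symmetric heap */
    long *sym = (long *) shmem_malloc(sizeof(long));
    *sym = -1;
    shmem_barrier_all();
    long val = rank;
    shmem_long_put(sym, &val, 1, (rank + 1) % npes);
    shmem_barrier_all();

    /* MPI-style kernel: reduce the received values to rank 0 */
    long sum = 0;
    MPI_Reduce(sym, &sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %ld\n", sum);

    shmem_free(sym);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}

With a unified runtime, the same communication library services both the MPI_Reduce and the shmem_long_put, which is what avoids duplicating connection and buffer resources across the two models.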

Page 11:

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design stack]
– Application Kernels/Applications
– Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
– Middleware: Communication Library or Runtime for Programming Models, providing Point-to-point Communication (two-sided and one-sided), Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, and Fault Tolerance
– Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC)
– Co-Design Opportunities and Challenges across Various Layers: Performance, Scalability, Fault-Resilience

11

HPCAC Spain Conference (Sept '15)

Page 12:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable Collective communication
– Offload

– Non-blocking

– Topology-aware

• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)

– Multiple end-points per node

• Support for efficient multi-threading

• Integrated Support for GPGPUs and Accelerators

• Fault-tolerance/resiliency

• QoS support for communication and I/O

• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …)

• Virtualization

• Energy-Awareness

Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

12HPCAC Spain Conference (Sept '15)

Page 13:

• Extremely Low Memory Footprint
– Memory per core continues to decrease

• D-L-A Framework

– Discover

• Overall network topology (fat-tree, 3D, …)

• Network topology for processes for a given job

• Node architecture

• Health of network and node

– Learn

• Impact on performance and scalability

• Potential for failure

– Adapt

• Internal protocols and algorithms

• Process mapping

• Fault-tolerance solutions

– Low overhead techniques while delivering performance, scalability and fault-tolerance

Additional Challenges for Designing Exascale Software Libraries

13HPCAC Spain Conference (Sept '15)

Page 14:

• High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over

Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Used by more than 2,450 organizations in 76 countries

– More than 286,000 downloads from the OSU site directly

– Empowering many TOP500 clusters (June ‘15 ranking)

• 8th ranked 519,640-core cluster (Stampede) at TACC

• 11th ranked 185,344-core cluster (Pleiades) at NASA

• 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->

– Stampede at TACC (8th in Jun '15, 519,640 cores, 5.168 PFlops)

MVAPICH2 Software

14HPCAC Spain Conference (Sept '15)

Page 15:

HPCAC Spain Conference (Sept '15) 15

MVAPICH2 Software Family

Requirements: MVAPICH2 Library to use
– MPI with IB, iWARP and RoCE: MVAPICH2
– Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE: MVAPICH2-X
– MPI with IB & GPU: MVAPICH2-GDR
– MPI with IB & MIC: MVAPICH2-MIC
– HPC Cloud with MPI & IB: MVAPICH2-Virt
– Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA

Page 16:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2 Project for Exascale

16HPCAC Spain Conference (Sept '15)

Page 17:

HPCAC Spain Conference (Sept '15) 17

Latency & Bandwidth: MPI over IB with MVAPICH2

[Charts: small-message latency and unidirectional bandwidth of MPI over IB with MVAPICH2]

Platforms: TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR, each on 2.8 GHz deca-core (IvyBridge) Intel nodes with PCI Gen3 (IB switch; ConnectX-4-EDR back-to-back)

Small-message latencies of 1.05-1.26 us (labeled values: 1.26, 1.19, 1.05, 1.15 us) and unidirectional bandwidths up to 12,465 MB/s (labeled values: 3,387, 6,356, 11,497, 12,465 MBytes/sec) across the four adapters
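For reference, latency numbers of this kind come from OSU-micro-benchmark-style ping-pong tests. The following is a minimal sketch of such a measurement (not the actual OSU micro-benchmark code); the message size and iteration count are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8        /* small message, in bytes (illustrative) */
#define ITERS    10000

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a', MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency = half the average round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}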

Page 18:

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.2a, Intel Ivy Bridge

[Charts: intra-node latency (0-1K bytes, intra-socket vs. inter-socket) and bandwidth, with intra-socket and inter-socket curves for the CMA, shared-memory, and LiMIC channels]

Small-message latency: 0.18 us (intra-socket), 0.45 us (inter-socket); peak bandwidths of 14,250 MB/s and 13,749 MB/s

18

HPCAC Spain Conference (Sept '15)

Page 19:

HPCAC Spain Conference (Sept '15) 19

Minimizing Memory Footprint further by DC Transport

[Figure: processes P0-P7 on Nodes 0-3 communicating over an IB network]

• Constant connection cost (One QP for any peer)

• Full Feature Set (RDMA, Atomics etc)

• Separate objects for send (DC Initiator) and receive (DC Target)

– DC Target identified by “DCT Number”

– Messages routed with (DCT Number, LID)

– Requires same “DC Key” to enable communication

• Available with MVAPICH2-X 2.2a

• DCT support available in Mellanox OFED

[Charts: (left) normalized execution time of NAMD (apoa1, large data set) at 160-620 processes for RC, DC-Pool, UD, and XRC transports; (right) connection memory (KB) for Alltoall at 80-640 processes: RC grows with process count (labeled 1,022 and 4,797 KB) while DC-Pool, UD, and XRC stay small (labeled values between 1 and 35 KB)]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected

Transport (DCT) of InfiniBand : Early Experiences. IEEE International Supercomputing Conference (ISC ’14).

Page 20:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

20HPCAC Spain Conference (Sept '15)

Page 21:

HPCAC Spain Conference (Sept '15) 21

Hardware Multicast-aware MPI_Bcast on Stampede

[Charts: MPI_Bcast latency (us) vs. message size for small messages (2-512 bytes) and large messages (2K-128K bytes) at 102,400 cores, plus latency vs. number of nodes for 16-byte and 32-KByte messages, comparing the Default and hardware-Multicast-based designs]

ConnectX-3-FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

Page 22:

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available with MVAPICH2-X 2.2a)

[Charts: P3DFFT application run-time (s) vs. data size (512-800), Pre-Conjugate Gradient (PCG) run-time (s) vs. number of processes (64-512) for PCG-Default vs. Modified-PCG-Offload, and normalized HPL performance for HPL-Offload, HPL-1ring, and HPL-Host vs. HPL problem size (N) as % of total memory]

22

Modified P3DFFT with Offload-Alltoall does up to 17% better than default version (128 Processes)

K. Kandalla, et. al.. High-Performance and Scalable Non-Blocking

All-to-All with Collective Offload on InfiniBand Clusters: A Study

with Parallel 3D FFT, ISC 2011

HPCAC Spain Conference (Sept '15)


Modified HPL with Offload-Bcast does up to 4.5% better than default version (512 Processes)

Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than default version

K. Kandalla, et. al, Designing Non-blocking Broadcast with Collective

Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et. al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12


Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS’ 12
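The application-level pattern behind these results is the MPI-3 non-blocking collective interface: start the collective, overlap it with independent computation (ideally progressed by offload hardware), then complete it. A minimal sketch (not from the original slides), with do_local_work() as a hypothetical compute routine:

#include <mpi.h>
#include <stdio.h>

#define N (1 << 14)

static void do_local_work(double *grid, int n)   /* hypothetical computation independent of the reduction */
{
    for (int i = 0; i < n; i++)
        grid[i] *= 0.5;
}

int main(int argc, char **argv)
{
    static double grid[N];
    double local_dot = 1.0, global_dot = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    for (int i = 0; i < N; i++)
        grid[i] = 1.0;

    /* Start the reduction, then overlap it with work that does not depend on its result */
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    do_local_work(grid, N);
    MPI_Wait(&req, MPI_STATUS_IGNORE);           /* global_dot is valid only after completion */

    printf("global dot = %f\n", global_dot);
    MPI_Finalize();
    return 0;
}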

Page 23:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

23HPCAC Spain Conference (Sept '15)

Page 24:

[Figure: GPU and NIC attached to the CPU over PCIe; NIC connected to the switch]

At Sender:

cudaMemcpy(s_hostbuf, s_devbuf, . . .);

MPI_Send(s_hostbuf, size, . . .);

At Receiver:

MPI_Recv(r_hostbuf, size, . . .);

cudaMemcpy(r_devbuf, r_hostbuf, . . .);

• Data movement in applications with standard MPI and CUDA interfaces

High Productivity and Low Performance

24HPCAC Spain Conference (Sept '15)

MPI + CUDA - Naive

Page 25:

[Figure: GPU and NIC attached to the CPU over PCIe; NIC connected to the switch]

At Sender:

for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);

for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, …);
}
MPI_Waitall(…);

<<Similar at receiver>>

• Pipelining at user level with non-blocking MPI and CUDA interfaces

Low Productivity and High Performance

25HPCAC Spain Conference (Sept '15)

MPI + CUDA - Advanced

Page 26:

At Sender:

MPI_Send(s_devbuf, size, …);

At Receiver:

MPI_Recv(r_devbuf, size, …);

(All staging and pipelining handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement

• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)

• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity

26HPCAC Spain Conference (Sept '15)

GPU-Aware MPI Library: MVAPICH2-GPU
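A minimal sketch of what this looks like from the application side is given below (not from the original slides): device pointers are passed directly to MPI, and the library moves the data internally. It assumes a CUDA-aware build such as MVAPICH2-GPU/GDR, typically run with MV2_USE_CUDA=1; the buffer size and device selection are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                       /* 1M floats (illustrative) */
    int rank;
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(0);                            /* assumes one GPU visible per process */
    cudaMalloc((void **) &devbuf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(devbuf, 0, n * sizeof(float));
        /* Device pointer passed straight to MPI; no explicit host staging */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats directly into GPU memory\n", n);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}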

Page 27:

CUDA-Aware MPI: MVAPICH2 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory

• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)

• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node

• Optimized and tuned collectives for GPU device buffers

• MPI datatype support for point-to-point and collective communication from GPU device buffers

27HPCAC Spain Conference (Sept '15)

Page 28:

• OFED with support for GPUDirect RDMA is

developed by NVIDIA and Mellanox

• OSU has a design of MVAPICH2 using GPUDirect

RDMA

– Hybrid design using GPU-Direct RDMA

• GPUDirect RDMA and Host-based pipelining

• Alleviates P2P bandwidth bottlenecks on

SandyBridge and IvyBridge

– Support for communication using multi-rail

– Support for Mellanox Connect-IB and ConnectX

VPI adapters

– Support for RoCE with Mellanox ConnectX VPI

adapters

HPCAC Spain Conference (Sept '15) 28

GPU-Direct RDMA (GDR) with CUDA

[Figure: GPUDirect RDMA data path between GPU memory and the IB adapter through the chipset, alongside the host (CPU/system memory) path]
– SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
– IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s

Page 29:

29

Performance of MVAPICH2-GDR with GPU-Direct-RDMA

HPCAC Spain Conference (Sept '15)

MVAPICH2-GDR-2.1RC2Intel Ivy Bridge (E5-2680 v2) node - 20 cores

NVIDIA Tesla K40c GPUMellanox Connect-IB Dual-FDR HCA

CUDA 7Mellanox OFED 2.4 with GPU-Direct-RDMA

[Charts: GPU-GPU inter-node small-message latency, uni-directional bandwidth, and bi-directional bandwidth for MV2-GDR 2.1RC2, MV2-GDR 2.0b, and MV2 without GDR; MV2-GDR 2.1RC2 reaches 2.15 usec small-message latency, with annotated gains of 3.5X and 9.3X (latency) and 2X and 11X (both bandwidth charts) over the older release and the non-GDR design]

Page 30:

HPCAC Spain Conference (Sept '15) 30

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)

• HOOMD-blue Version 1.0.5
• GDRCOPY enabled, with runtime parameters: MV2_USE_CUDA=1, MV2_IBA_HCA=mlx5_0, MV2_IBA_EAGER_THRESHOLD=32768, MV2_VBUF_TOTAL_SIZE=32768, MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768, MV2_USE_GPUDIRECT_GDRCOPY=1, MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Page 31:

• MVAPICH2-GDR 2.1RC2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download

• System software requirements

• Mellanox OFED 2.1 or later

• NVIDIA Driver 331.20 or later

• NVIDIA CUDA Toolkit 6.0 or later

• Plugin for GPUDirect RDMA

– http://www.mellanox.com/page/products_dyn?product_family=116

– Strongly Recommended : use the new GDRCOPY module from NVIDIA

• https://github.com/NVIDIA/gdrcopy

• Has optimized designs for point-to-point communication using GDR

• Contact MVAPICH help list with any questions related to the package

[email protected]

Using MVAPICH2-GDR Version

31HPCAC Brazil Conference (Aug ‘15)

Page 32:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

32HPCAC Spain Conference (Sept '15)

Page 33:

MPI Applications on MIC Clusters

[Figure: MPI job placement modes across Xeon (multi-core centric) and Xeon Phi (many-core centric): Host-only, Offload (/reverse Offload) with offloaded computation, Symmetric, and Coprocessor-only]

• Flexibility in launching MPI jobs on clusters with Xeon Phi

33HPCAC Spain Conference (Sept '15)

Page 34:

34

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

• Offload Mode

• Intranode Communication

• Coprocessor-only and Symmetric Mode

• Internode Communication

• Coprocessors-only and Symmetric Mode

• Multi-MIC Node Configurations

• Running on three major systems

• Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

HPCAC Spain Conference (Sept '15)

Page 35:

MIC-Remote-MIC P2P Communication with Proxy-based Communication

35HPCAC Spain Conference (Sept '15)

[Charts: MIC-to-remote-MIC point-to-point latency (large messages, 8K-2M bytes) and bandwidth (1 byte-1M bytes) with proxy-based communication, for intra-socket and inter-socket P2P; peak bandwidths of 5,236 and 5,594 MB/sec]

Page 36:

36

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

HPCAC Spain Conference (Sept '15)

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014

[Charts: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M) for MV2-MIC vs. MV2-MIC-Opt, plus P3DFFT communication and computation time on 32 nodes (8H + 8M, size 2K x 2K x 1K); annotated improvements of 76%, 58%, and 55%]

Page 37:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

37HPCAC Spain Conference (Sept '15)

Page 38:

MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

HPCAC Spain Conference (Sept '15)

[Figure: MPI, OpenSHMEM, UPC, CAF or Hybrid (MPI + PGAS) applications issue OpenSHMEM, MPI, UPC, and CAF calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, or iWARP]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with

MVAPICH2-X 1.9 (2012) onwards!

– http://mvapich.cse.ohio-state.edu

• Feature Highlights

– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM,

MPI(+OpenMP) + UPC

– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard

compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)

– Scalable Inter-node and intra-node communication – point-to-point and collectives


38

Page 39:

Application Level Performance with Graph500 and Sort

Graph500 Execution Time

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,

International Supercomputing Conference (ISC’13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l

Conference on Parallel Processing (ICPP '12), September 2012

[Chart: Graph500 execution time (s) at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs]

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.

Sort Execution Time

[Chart: Sort execution time (s) for input/process combinations 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K, MPI vs. Hybrid; 51% improvement at the largest scale]

• Performance of Hybrid (MPI+OpenSHMEM) Sort Application
– 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min), Hybrid 1172 sec (0.36 TB/min), a 51% improvement over the MPI design

39HPCAC Spain Conference (Sept '15)

Page 40:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

40HPCAC Spain Conference (Sept '15)

Page 41:

• Virtualization has many benefits
– Job migration

– Compaction

• Not very popular in HPC due to overhead associated with Virtualization

• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters

• Enhanced MVAPICH2 support for SR-IOV

• Publicly available as MVAPICH2-Virt library

Can HPC and Virtualization be Combined?

41HPCAC Spain Conference (Sept '15)

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV

based Virtualized InfiniBand Clusters? EuroPar'14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14

J. Zhang, X .Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient

Approach to build HPC Clouds, CCGrid’15

Page 42:

[Charts: Graph500 execution time for problem sizes (scale, edgefactor) from (20,10) to (22,20) and NAS Class C execution time (FT-64-C, EP-64-C, LU-64-C, BT-64-C), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; annotated overheads of 4% and 9% (Graph500) and 1% and 9% (NAS)]

• Compared to Native, 1-9% overhead for NAS

• Compared to Native, 4-9% overhead for Graph500

HPCAC Spain Conference (Sept '15) 65

Application-Level Performance (8 VM * 8 Core/VM)

NAS Graph500

Page 43:

HPCAC Spain Conference (Sept '15)

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument

– Targeting Big Data, Big Compute, Big Instrument research

– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network

• Reconfigurable instrument

– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use

• Connected instrument

– Workload and Trace Archive

– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others

– Partnerships with users

• Complementary instrument

– Complementing GENI, Grid’5000, and other testbeds

• Sustainable instrument

– Industry connections http://www.chameleoncloud.org/

43

Page 44:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

44HPCAC Spain Conference (Sept '15)

Page 45:

• MVAPICH2-EA (Energy-Aware)

• A white-box approach

• New Energy-Efficient communication protocols for pt-pt and

collective operations

• Intelligently apply the appropriate Energy saving techniques

• Application oblivious energy saving

• OEMT

• A library utility to measure energy consumption for MPI applications

• Works with all MPI runtimes

• PRELOAD option for precompiled applications

• Does not require ROOT permission:

• A safe kernel module to read only a subset of MSRs

45

Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

HPCAC Spain Conference (Sept '15)

Page 46:

• An energy efficient runtime that provides energy savings without application knowledge

• Uses automatically and transparently the best energy lever

• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation

• Pessimistic MPI applies energy reduction lever to each MPI call

46

MV2-EA : Application Oblivious Energy-Aware-MPI (EAM)

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

HPCAC Spain Conference (Sept '15)


Page 47:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

47HPCAC Spain Conference (Sept '15)

Page 48:

HPCAC Spain Conference (Sept '15) 48

OSU InfiniBand Network Analysis Monitoring Tool (INAM) – Network Level View

• Show network topology of large clusters

• Visualize traffic pattern on different links

• Quickly identify congested links/links in error state

• See the history unfold – play back historical state of the network

Full Network (152 nodes) Zoomed-in View of the Network

Page 49:

HPCAC Spain Conference (Sept '15) 49

Upcoming OSU INAM Tool – Job and Node Level Views

Visualizing a Job (5 Nodes) Finding Routes Between Nodes

• Job level view
– Show different network metrics (load, error, etc.) for any live job
– Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per process or per node
– CPU utilization for each rank/node
– Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
– Network metrics (e.g. XmitDiscard, RcvError) per rank/node

Page 50:

• Performance and Memory scalability toward 500K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …)
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features
– User Mode Memory Registration (UMR)

– On-demand Paging

• Enhanced Inter-node and Intra-node communication schemes for upcoming OmniPath enabled Knights Landing architectures

• Extended RMA support (as in MPI 3.0)

• Extended topology-aware collectives

• Energy-aware point-to-point (one-sided and two-sided) and collectives

• Extended Support for MPI Tools Interface (as in MPI 3.0)

• Extended Checkpoint-Restart and migration support with SCR

MVAPICH2 – Plans for Exascale

50HPCAC Spain Conference (Sept '15)

Page 51:

• Scientific Computing

– Message Passing Interface (MPI), including MPI + OpenMP, is the

Dominant Programming Model

– Many discussions towards Partitioned Global Address Space

(PGAS)

• UPC, OpenSHMEM, CAF, etc.

– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)

• Big Data/Enterprise/Commercial Computing

– Focuses on large data and data analysis

– Hadoop (HDFS, HBase, MapReduce)

– Spark is emerging for in-memory computing

– Memcached is also used for Web 2.0

51

Two Major Categories of Applications

HPCAC Spain Conference (Sept '15)

Page 52:

Can High-Performance Interconnects Benefit Big Data Computing?

• Most of the current Big Data systems use Ethernet infrastructure with Sockets

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw interest

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?

• Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop and Memcached?

HPCAC Spain Conference (Sept '15) 52

Page 53:

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

[Figure: layered design]
– Applications
– Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
– Programming Models (Sockets); Other Protocols? (RDMA Protocol); Upper-level changes?
– Communication and I/O Library: Point-to-Point Communication, QoS, Threaded Models and Synchronization, Fault-Tolerance, I/O and File Systems, Virtualization, Benchmarks
– Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators); Storage Technologies (HDD and SSD)

HPCAC Spain Conference (Sept '15)

53

Page 54:

Design Overview of HDFS with RDMA

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based HDFS with communication library written in native code

[Figure: applications issue HDFS operations; in the OSU design, writes go through the Java Native Interface (JNI) to a native verbs library over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while other operations use the Java socket interface over the 1/10 GigE or IPoIB network]

• Design Features

– RDMA-based HDFS write

– RDMA-based HDFS replication

– Parallel replication support

– On-demand connection setup

– InfiniBand/RoCE support

HPCAC Spain Conference (Sept '15) 54

Page 55:

Communication Times in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE

• Similar improvements are obtained for SSD DataNodes


HPCAC Spain Conference (Sept '15)

[Chart: HDFS communication time (s) for 2-10 GB files on HDD DataNodes, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR)]

55

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,

High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in

RDMA-Enhanced HDFS, HPDC '14, June 2014

Page 56:

Triple-H

Heterogeneous Storage

• Design Features

– Three modes

• Default (HHH)

• In-Memory (HHH-M)

• Lustre-Integrated (HHH-L)

– Policies to efficiently utilize the

heterogeneous storage devices

• RAM, SSD, HDD, Lustre

– Eviction/Promotion based on

data usage pattern

– Hybrid Replication

– Lustre-Integrated mode:

• Lustre-based fault-tolerance

Enhanced HDFS with In-memory and Heterogeneous Storage

[Figure: Triple-H data placement policies across RAM Disk, SSD, HDD, and Lustre, with hybrid replication and eviction/promotion]

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS

on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015


HPCAC Spain Conference (Sept '15) 56

Page 57:

Performance Improvement on TACC Stampede (HHH)

• For 160GB TestDFSIO in 32 nodes
– Write Throughput: 7x improvement over IPoIB (FDR)
– Read Throughput: 2x improvement over IPoIB (FDR)
• For 120GB RandomWriter in 32 nodes
– 3x improvement over IPoIB (QDR)

[Charts: TestDFSIO total throughput (MBps) for write and read, and RandomWriter execution time (s) for cluster:data sizes 8:30, 16:60, and 32:120, each comparing IPoIB and OSU-IB]

HPCAC Spain Conference (Sept '15)

57

Page 58:

• For 200GB TeraGen on 32 nodes

– Spark-TeraGen: HHH has 2.1x improvement over HDFS-IPoIB (QDR)

– Spark-TeraSort: HHH has 16% improvement over HDFS-IPoIB (QDR)

58

Evaluation with Spark on SDSC Gordon (HHH)

[Charts: TeraGen and TeraSort execution time (s) for cluster:data sizes 8:50, 16:100, and 32:200 (GB), comparing HDFS over IPoIB (QDR) and OSU-IB (QDR); TeraGen reduced by 2.1x, TeraSort by 16%]

HPCAC Spain Conference (Sept '15)

N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-

Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015

Page 59:

Design Overview of MapReduce with RDMA

[Figure: MapReduce (JobTracker, TaskTracker, Map, Reduce) applications over the OSU design: the Java Native Interface (JNI) bridges to a native verbs library on RDMA-capable networks (IB, 10GE/iWARP, RoCE), alongside the Java socket interface over the 1/10 GigE or IPoIB network]

HPCAC Spain Conference (Sept '15)

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based MapReduce with communication library written in native code

• Design Features

– RDMA-based shuffle

– Prefetching and caching map output

– Efficient Shuffle Algorithms

– In-memory merge

– On-demand Shuffle Adjustment

– Advanced overlapping

• map, shuffle, and merge

• shuffle, merge, and reduce

– On-demand connection setup

– InfiniBand/RoCE support

59

Page 60:

Performance Evaluation of Sort and TeraSort

• For 240GB Sort in 64 nodes (512 cores): 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For 320GB TeraSort in 64 nodes (1K cores): 38% improvement over IPoIB (FDR) with HDD used for HDFS

[Charts: Sort job execution time on the OSU cluster (60/120/240 GB on 16/32/64 nodes) and TeraSort on TACC Stampede (80/160/320 GB on 16/32/64 nodes), comparing IPoIB, UDA-IB, and OSU-IB]

HPCAC Spain Conference (Sept '15)

60

Page 61:


Design Overview of Shuffle Strategies for MapReduce over Lustre

HPCAC Spain Conference (Sept '15)

• Design Features

– Two shuffle approaches

• Lustre read based shuffle

• RDMA based shuffle

– Hybrid shuffle algorithm to take benefit from both shuffle approaches

– Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation

– In-memory merge and overlapping of different phases are kept similar to RDMA-enhanced MapReduce design

[Figure: Map tasks 1-3 write intermediate data to Lustre; Reduce tasks 1-2 fetch it via Lustre reads or RDMA-based shuffle, followed by in-memory merge/sort and reduce]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN

MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.


61

Page 62:

• For 500GB Sort in 64 nodes

– 44% improvement over IPoIB (FDR)

Performance Improvement of MapReduce over Lustre on TACC-Stampede

HPCAC Spain Conference (Sept '15)

• For 640GB Sort in 128 nodes

– 48% improvement over IPoIB (FDR)

[Charts: job execution time (sec) for 300-500 GB sorts on 64 nodes and for 20 GB-640 GB sorts on 4-128 nodes, comparing IPoIB (FDR) and OSU-IB (FDR)]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-

based Approach Benefit?, Euro-Par, August 2014.

• Local disk is used as the intermediate data directory


62

Page 63:

Design Overview of Spark with RDMA

• Design Features

– RDMA based shuffle

– SEDA-based plugins

– Dynamic connection management and sharing

– Non-blocking and out-of-order data transfer

– Off-JVM-heap buffer management

– InfiniBand/RoCE support

HPCAC Spain Conference (Sept '15)

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Scala based Spark with communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data

Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014

63

Page 64:

Performance Evaluation on TACC Stampede - SortByTest

• Intel SandyBridge + FDR, 16 Worker Nodes, 256 Cores, (256M 256R)

• RDMA-based design for Spark 1.4.0

• RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and

RamDisk. For SortByKey Test:

– Shuffle time reduced by up to 77% over IPoIB (56Gbps)

– Total time reduced by up to 58% over IPoIB (56Gbps)

[Charts: SortByTest shuffle time and total time (sec) on 16 worker nodes for 16-64 GB data sizes, IPoIB vs. RDMA; shuffle time reduced by up to 77% and total time by up to 58%]

HPCAC Spain Conference (Sept '15) 64

Page 65:

Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

– Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• Memcached Server can serve both socket and verbs clients simultaneously

– All other Memcached data structures are shared among RDMA and Sockets worker threads

• Memcached applications need not be modified; uses verbs interface if available

• High performance design of SSD-Assisted Hybrid Memory
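To make the "no application changes" point concrete, the sketch below shows an ordinary libmemcached client; with the OSU RDMA-Memcached libraries the same application code is intended to run over the verbs transport transparently. This is not OSU code, and the hostname and port are illustrative.

#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    /* "memcached-host" and 11211 are illustrative */
    memcached_server_list_st servers = memcached_server_list_append(NULL, "memcached-host", 11211, &rc);
    memcached_server_push(memc, servers);

    const char *key = "answer", *val = "42";
    rc = memcached_set(memc, key, strlen(key), val, strlen(val), (time_t) 0, (uint32_t) 0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len;
    uint32_t flags;
    char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (out != NULL) {
        printf("%s -> %.*s\n", key, (int) len, out);
        free(out);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}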

[Figure: Memcached-RDMA architecture: sockets and RDMA clients first contact the master thread (step 1) and are then bound to sockets or verbs worker threads (step 2); all worker threads share the common data structures (memory and SSD slabs/items)]

65HPCAC Spain Conference (Sept '15)

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda,

Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for

InfiniBand Clusters using Hybrid Transport, CCGrid’12

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads

with Memcached and MySQL, ISPASS’15

Page 66:

• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks

• Memcached-RDMA can

- improve query latency by up to 70% over IPoIB (32Gbps)

- improve throughput by up to 2X over IPoIB (32Gbps)

- No overhead in using hybrid mode with SSD when all data can fit in memory

Performance Benefits on SDSC-Gordon – OHB Latency & Throughput Micro-Benchmarks

[Charts: average latency (us) vs. message size and throughput (million transactions/sec) vs. number of clients (64-1024), comparing IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); up to 2X throughput over IPoIB]

66

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads, IEEE BigData ‘15.

HPCAC Spain Conference (Sept '15)

Page 67:

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• RDMA for Memcached (RDMA-Memcached)

• OSU HiBD-Benchmarks (OHB)

– HDFS and Memcached Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• Users Base: 125 organizations from 20 countries

• More than 13,000 downloads from the project site

• RDMA for Apache HBase and Spark

The High-Performance Big Data (HiBD) Project

HPCAC Spain Conference (Sept '15) 67

Page 68:

• Upcoming Releases of RDMA-enhanced Packages will support

– Spark

– HBase

• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will

support

– MapReduce

– RPC

• Advanced designs with upper-level changes and optimizations

• E.g. MEM-HDFS

HPCAC Spain Conference (Sept '15)

Future Plans of OSU High Performance Big Data (HiBD) Project

68

Page 69:

• Exascale systems will be constrained by
– Power
– Memory per core
– Data movement cost
– Faults
• Programming Models and Runtimes for HPC and Big Data need to be designed for
– Scalability

– Performance

– Fault-resilience

– Energy-awareness

– Programmability

– Productivity

• Highlighted some of the issues and challenges

• Need continuous innovation on all these fronts

Looking into the Future ….

69HPCAC Spain Conference (Sept '15)

Page 70:

HPCAC Spain Conference (Sept '15)

Funding Acknowledgments

Funding Support by

Equipment Support by

70

Page 71:

Personnel AcknowledgmentsCurrent Students

– A. Augustine (M.S.)

– A. Awan (Ph.D.)

– A. Bhat (M.S.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist– S. Sur

Current Post-Doc

– J. Lin

– D. Shankar

Current Programmer

– J. Perkins

Past Post-Docs– H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– K. Kulkarni (M.S.)

– M. Li (Ph.D.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associates

– K. Hamidouche

– X. Lu

Past Programmers

– D. Bureddy

– H. Subramoni

Current Research Specialist

– M. Arnold

HPCAC Spain Conference (Sept '15) 71

Page 72:

[email protected]

HPCAC Spain Conference (Sept '15)

Thank You!

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/

72

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

Page 73:

73

Call For Participation

International Workshop on Extreme Scale

Programming Models and Middleware

(ESPM2 2015)

In conjunction with Supercomputing Conference

(SC 2015)

Austin, USA, November 15th, 2015

http://web.cse.ohio-state.edu/~hamidouc/ESPM2/

HPCAC Spain Conference (Sept '15)

Page 74:

• Looking for Bright and Enthusiastic Personnel to join as

– Post-Doctoral Researchers

– PhD Students

– Hadoop/Big Data Programmer/Software Engineer

– MPI Programmer/Software Engineer

• If interested, please contact me at this conference and/or send

an e-mail to [email protected]

74

Multiple Positions Available in My Group

HPCAC Spain Conference (Sept '15)