
Page 1:

The Future of Supercomputer Software Libraries

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Israel Supercomputing Conference

Page 2:

High-End Computing (HEC): PetaFlop to ExaFlop

Expected roadmap: 20-30 PFlops in 2013, 100 PFlops in 2016, and an ExaFlop system expected in 2019.

Page 3:

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

Nov. 1996: 0/500 (0%) Jun. 2002: 80/500 (16%) Nov. 2007: 406/500 (81.2%)

Jun. 1997: 1/500 (0.2%) Nov. 2002: 93/500 (18.6%) Jun. 2008: 400/500 (80.0%)

Nov. 1997: 1/500 (0.2%) Jun. 2003: 149/500 (29.8%) Nov. 2008: 410/500 (82.0%)

Jun. 1998: 1/500 (0.2%) Nov. 2003: 208/500 (41.6%) Jun. 2009: 410/500 (82.0%)

Nov. 1998: 2/500 (0.4%) Jun. 2004: 291/500 (58.2%) Nov. 2009: 417/500 (83.4%)

Jun. 1999: 6/500 (1.2%) Nov. 2004: 294/500 (58.8%) Jun. 2010: 424/500 (84.8%)

Nov. 1999: 7/500 (1.4%) Jun. 2005: 304/500 (60.8%) Nov. 2010: 415/500 (83%)

Jun. 2000: 11/500 (2.2%) Nov. 2005: 360/500 (72.0%) Jun. 2011: 411/500 (82.2%)

Nov. 2000: 28/500 (5.6%) Jun. 2006: 364/500 (72.8%) Nov. 2011: 410/500 (82.0%)

Jun. 2001: 33/500 (6.6%) Nov. 2006: 361/500 (72.2%)

Nov. 2001: 43/500 (8.6%) Jun. 2007: 373/500 (74.6%)

3

Page 4:

Large-scale InfiniBand Installations

• 209 IB clusters (41.8%) in the November '11 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):

120,640 cores (Nebulae) in China (4th)
73,278 cores (Tsubame-2.0) in Japan (5th)
111,104 cores (Pleiades) at NASA Ames (7th)
138,368 cores (Tera-100) in France (9th)
122,400 cores (RoadRunner) at LANL (10th)
137,200 cores (Sunway Blue Light) in China (14th)
46,208 cores (Zin) at LLNL (15th)
33,072 cores (Lomonosov) in Russia (18th)
29,440 cores (Mole-8.5) in China (21st)
42,440 cores (Red Sky) at Sandia (24th)
62,976 cores (Ranger) at TACC (25th)
20,480 cores (Bull Benchmarks) in France (27th)
20,480 cores (Helios) in Japan (28th)
More are getting installed!

Page 5:

Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI) is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)

• Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – The Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum
  – Memcached is also used

Page 6:

Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Stack diagram] Applications/libraries sit on top of programming models (Message Passing Interface (MPI), Sockets, PGAS (UPC, Global Arrays), Hadoop and MapReduce), which are implemented by a library or runtime providing point-to-point communication, collective communication, synchronization & locks, I/O & file systems, QoS and fault tolerance. The runtime in turn builds on networking technologies (InfiniBand, 1/10/40GigE, RNICs & intelligent NICs) and commodity computing system architectures (single, dual, quad, ..) with multi/many-core processors and accelerators.

Page 7:

Designing (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, …) - see the sketch after this list
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and accelerators
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
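The "MPI+X" hybrid model in the second bullet can be sketched in a few lines. This is a minimal illustrative example, not part of the original slides: one MPI process per node (or socket) spawns OpenMP threads, and MPI_THREAD_FUNNELED declares that only the master thread makes MPI calls.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Request funneled threading: only the master thread calls MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* computation runs on all OpenMP threads within this MPI rank */
            #pragma omp master
            printf("rank %d runs %d threads\n", rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }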

Page 8:

MVAPICH/MVAPICH2 Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002
  – Used by more than 1,840 organizations (HPC centers, industry and universities) in 65 countries
  – More than 95,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 5th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 7th ranked 111,104-core cluster (Pleiades) at NASA
    • 25th ranked 62,976-core cluster (Ranger) at TACC
    • and many others
  – Available with the software stacks of many InfiniBand, high-speed Ethernet and server vendors, including Open Fabrics Enterprise Distribution (OFED) and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) system

Page 9:

Approaches being used in MVAPICH2 for Exascale

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, topology-aware and power-aware
  – Exploiting collective offload
• Support for GPGPUs and accelerators
• PGAS support (hybrid MPI + UPC)

Page 10:

One-way Latency: MPI over IB

[Charts: small-message and large-message latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR and MVAPICH-ConnectX-QDR; small-message latencies range from 1.56 to 1.82 us]

2.4 GHz Quad-core (Westmere) Intel with IB switch
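Numbers like these come from a ping-pong measurement (the slides use the OSU micro-benchmarks). The following is a minimal, self-contained sketch of such a latency test, not the benchmark itself; the 4-byte message size and iteration count are arbitrary choices for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000, size = 4;
        char buf[4];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {          /* send, then wait for the echo */
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {   /* echo the message back */
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency is half of the round-trip time */
            printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));
        MPI_Finalize();
        return 0;
    }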

Page 11:

Bandwidth: MPI over IB

[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR and MVAPICH-ConnectX-QDR; peak unidirectional bandwidths of 1706-3385 MBytes/sec and peak bidirectional bandwidths of 3341-6521 MBytes/sec]

2.4 GHz Quad-core (Westmere) Intel with IB switch

Page 12:

IB Transport Services

Service Type               | Connection-Oriented | Acknowledged | Transport
Reliable Connection (RC)   | Yes                 | Yes          | IBA
Unreliable Connection (UC) | Yes                 | No           | IBA
Reliable Datagram (RD)     | No                  | Yes          | IBA
Unreliable Datagram (UD)   | No                  | No           | IBA
RAW Datagram               | No                  | No           | Raw

• eXtended Reliable Connection (XRC) was added later for multi-core platforms
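At the verbs level, the transport service is chosen when a queue pair is created. The sketch below (not from the slides) shows that selection with the standard libibverbs API; the protection domain, completion queue and queue depths are assumed to have been set up by the caller.

    #include <infiniband/verbs.h>

    /* Create a queue pair using either the RC or the UD transport service.
     * IBV_QPT_UC would select Unreliable Connection in the same way. */
    struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq, int use_ud)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                         .max_send_sge = 1,  .max_recv_sge = 1 },
            .qp_type = use_ud ? IBV_QPT_UD : IBV_QPT_RC,
        };
        return ibv_create_qp(pd, &attr);
    }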

Page 13:

UD vs. RC: Performance and Scalability (SMG2000 Application)

[Chart: normalized time vs. number of processes (128-4096) for RC and UD]

Memory Usage (MB/process):

Processes | RC (MVAPICH 0.9.8): Conn.  Buffers  Struct.  Total | UD Design: Buffers  Struct.  Total
512       |                     22.9   65.0     0.3      88.2  |            37.0     0.2      37.2
1024      |                     29.5   65.0     0.6      95.1  |            37.0     0.4      37.4
2048      |                     42.4   65.0     1.2      107.4 |            37.0     0.9      37.9
4096      |                     66.7   65.0     2.4      134.1 |            37.0     1.7      38.7

M. Koop, S. Sur, Q. Gao and D. K. Panda, "High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters," ICS '07

Page 14:

Hybrid Transport Design (UD-RC/XRC)

• Both UD and RC/XRC have benefits
• User-transparent; automatically adjusts between UD and RC/XRC
• Runtime options for advanced users to extract the highest performance and scalability
• Available in MVAPICH 1.1 and 1.2 for some time (as a separate interface)
• Available since MVAPICH2 1.7 in an integrated manner with the Gen2 interface

M. Koop, T. Jones and D. K. Panda, "MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand," IPDPS '08

Page 15:

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support)

Latest MVAPICH2 1.8a2 on Intel Westmere:
• Latency: 0.19 microseconds (intra-socket) and 0.45 microseconds (inter-socket) for 4-byte messages
• Peak bandwidth: ~10,000 MB/s; peak bi-directional bandwidth: ~18,500 MB/s

[Charts: latency, bandwidth and bi-directional bandwidth vs. message size for intra-socket and inter-socket communication with the LiMIC and shared-memory (Shmem) channels]

Page 16:

MPI Design Challenges

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, topology-aware and power-aware
  – Exploiting collective offload
• Support for GPGPUs and accelerators
• PGAS support (hybrid MPI + UPC)

Page 17:

Shared-memory Aware Collectives (4K cores on TACC Ranger with MVAPICH2)

[Charts: MPI_Reduce and MPI_Allreduce latency (us) vs. message size (bytes) at 4096 cores, comparing the original and the shared-memory-aware designs]
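The shared-memory-aware idea is to reduce within each node first and only then across nodes. The sketch below expresses that structure with plain MPI calls; it uses the MPI-3 MPI_Comm_split_type routine (standardized after this talk) and assumes world rank 0 is also rank 0 on its node. It illustrates the technique only and is not MVAPICH2's implementation.

    #include <mpi.h>
    #include <stdlib.h>

    void two_level_sum(double *in, double *out, int count, MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int node_rank;
        double *partial = malloc(count * sizeof(double));

        /* Communicator of the ranks sharing this node */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* Step 1: intra-node reduction (over shared memory) to the node leader */
        MPI_Reduce(in, partial, count, MPI_DOUBLE, MPI_SUM, 0, node_comm);

        /* Step 2: inter-node reduction among the node leaders only */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);
        if (node_rank == 0) {
            MPI_Reduce(partial, out, count, MPI_DOUBLE, MPI_SUM, 0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }
        MPI_Comm_free(&node_comm);
        free(partial);
    }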

Page 18:

Non-contiguous Allocation of Jobs

• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures
• Job schedulers are used to allocate compute nodes to various jobs

[Diagram: spine and line-card switches connecting racks of nodes, with busy cores, idle cores and the cores assigned to a new job]

Page 19:

Non-contiguous Allocation of Jobs

• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures
• Job schedulers are used to allocate compute nodes to various jobs
• Individual processes belonging to one job can get scattered
• The primary responsibility of the scheduler is to keep system throughput high

[Diagram: spine and line-card switches connecting racks of nodes, with busy cores, idle cores and the cores assigned to a new job]

Page 20:

Topology-Aware Collectives

Default (Binomial) vs. Topology-Aware Algorithms with 296 Processes

[Charts: Scatter and Gather latency (usec) vs. message size (2K-512K bytes) for the default and topology-aware algorithms; estimated latency of the default and topology-aware algorithms for small messages at varying system sizes (2-32 racks), showing 22%-54% improvement]

K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, "Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather," CAC '10

Page 21:

Impact of Network-Topology-Aware Algorithms on Broadcast Performance

[Charts: broadcast latency (ms) vs. message size (128K-1M bytes) for non-topology-aware, DFT-based and BFT-based topology-aware schemes; normalized latency vs. job size (128-1K processes)]

• The impact of topology-aware schemes is more pronounced as
  • message size increases
  • system size scales
• Up to 14% improvement in performance at scale

H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz and D. K. Panda, "Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters," Cluster '11

Page 22:

Power-Aware Collectives

Performance and Power Comparison: MPI_Alltoall with 64 processes on 8 nodes

[Charts: Alltoall latency (usec) vs. message size (8K-256K bytes) and power (KW) vs. time (s) for the Default, DVFS and Proposed schemes; estimated energy consumption (KJ) during an MPI_Alltoall operation with 128K message size at varying system sizes (8K-256K processes), showing savings of 5%-32%]

K. Kandalla, E. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters," ICPP '10

Page 23:

Collective Offload Support in ConnectX-2 (Recv followed by Multi-Send)

• The sender creates a task-list consisting of only send and wait WQEs
  – One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
  – A wait WQE is added to make the ConnectX-2 HCA wait for the ACK packet from the receiver

[Diagram: the application posts a task list (Send, Wait, Send, Send, Send, Wait) to the HCA's management queue (MCQ/MQ); the InfiniBand HCA processes it through its send/receive queues and completion queues over the physical link]

Page 24:

P3DFFT Application Performance with Non-Blocking Alltoall based on CX-2 Collective Offload

[Charts: P3DFFT application run-time (s) vs. data size (512-800) at 64 and 128 processes, comparing the Blocking, Host-Test and Offload versions]

P3DFFT application run-time comparison: the overlap version with Offload-Alltoall does up to 17% better than the default blocking version.

K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT," Int'l Supercomputing Conference (ISC), June 2011
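The overlap that delivers this benefit follows a simple start/compute/wait pattern. The sketch below uses MPI_Ialltoall, the MPI-3 form of the non-blocking collective (still being standardized at the time of this talk); compute_on_local_block() is a hypothetical placeholder for the FFT work that proceeds while the offloaded exchange is in flight.

    #include <mpi.h>

    void compute_on_local_block(void);   /* hypothetical application work */

    void transpose_with_overlap(double *sendbuf, double *recvbuf, int count,
                                MPI_Comm comm)
    {
        MPI_Request req;

        /* Start the alltoall; with collective offload the HCA can progress
         * it without host involvement. */
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, comm, &req);

        compute_on_local_block();             /* overlapped computation */

        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* complete the exchange */
    }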

Page 25:

Non-Blocking Broadcast with Collective Offload and Impact on HPL Performance

[Chart: normalized HPL performance vs. HPL problem size (N) as a % of total memory for HPL-Offload, HPL-1ring and HPL-Host with 512 processes; up to 4.5% improvement]

K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, "Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL," Hot Interconnect '11, Aug. 2011

Page 26:

Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload

[Chart: run-time (s) vs. number of processes (64-512) for PCG-Default and Modified-PCG-Offload]

Experimental setup: 64 nodes, 8-core Intel Xeon (2.53 GHz) with 12 MB L3 cache and 12 GB memory per node; MT26428 QDR ConnectX-2 HCAs with PCI-Express interfaces; 171-port Mellanox switch.

With 64,000 unknowns per process, the modified PCG with Offload-Allreduce has a run-time about 21.8% lower than the default PCG.

K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, "Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers," accepted for publication at IPDPS '12

Page 27:

MPI Design Challenges

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, topology-aware and power-aware
  – Exploiting collective offload
• Support for GPGPUs and accelerators
• PGAS support (hybrid MPI + UPC)

Page 28:

Data Movement in GPU+IB Clusters

• Many systems today combine GPUs with high-speed networks such as InfiniBand
• Steps in data movement in InfiniBand clusters with GPUs:
  – From GPU device memory to main memory at the source process, using CUDA
  – From the source to the destination process, using MPI
  – From main memory to GPU device memory at the destination process, using CUDA
• Earlier, the GPU device and the InfiniBand device required separate memory registration
• GPU-Direct (a collaboration between NVIDIA and Mellanox) enabled a common registration between these devices
• However, GPU-GPU communication is still costly and programming is harder

[Diagram: GPU and NIC attached to the CPU over PCIe, connected to a switch]

Page 29:

Sample Code - Without MPI Integration

At Sender:
    cudaMemcpy(sbuf, sdev, ...);
    MPI_Send(sbuf, size, ...);

At Receiver:
    MPI_Recv(rbuf, size, ...);
    cudaMemcpy(rdev, rbuf, ...);

• Naive implementation with standard MPI and CUDA
• High productivity but poor performance

[Diagram: GPU and NIC attached to the CPU over PCIe, connected to a switch]
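For reference, a fully spelled-out version of this naive pattern looks as follows. The copy directions are the standard CUDA kinds; treating the payload as raw bytes (MPI_BYTE) and the helper/argument names are choices made for this sketch, not part of the original slide.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Stage a GPU buffer through host memory and move it with plain MPI.
     * sdev/rdev are device buffers; sbuf/rbuf are host staging buffers. */
    void naive_send(void *sbuf, const void *sdev, size_t size, int dst)
    {
        cudaMemcpy(sbuf, sdev, size, cudaMemcpyDeviceToHost);
        MPI_Send(sbuf, (int)size, MPI_BYTE, dst, 0, MPI_COMM_WORLD);
    }

    void naive_recv(void *rbuf, void *rdev, size_t size, int src)
    {
        MPI_Recv(rbuf, (int)size, MPI_BYTE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(rdev, rbuf, size, cudaMemcpyHostToDevice);
    }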

Page 30:

Sample Code - User Optimized Code

At Sender:
    /* stage the device buffer to the host in pipeline_len chunks */
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, ...);

    for (j = 0; j < pipeline_len; j++) {
        /* wait for chunk j to arrive while progressing earlier sends */
        while (result != cudaSuccess) {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        }
        MPI_Isend(sbuf + j * blksz, blksz, ...);
    }
    MPI_Waitall(...);

• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• Code at the sender side (and repeated at the receiver side)
• User-level copying may not match the internal MPI design
• High performance but poor productivity

[Diagram: GPU and NIC attached to the CPU over PCIe, connected to a switch]

Page 31:

Can this be done within the MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g., enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning
• A new design incorporated in MVAPICH2 supports this functionality

Page 32:

Sample Code - MVAPICH2-GPU

At Sender:
    MPI_Send(s_device, size, ...);

At Receiver:
    MPI_Recv(r_device, size, ...);

(The pipelining happens inside MVAPICH2.)

• MVAPICH2-GPU: standard MPI interfaces are used
• High performance and high productivity

Page 33:

Design Considerations

• Memory detection (see the sketch below)
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user
• Overlap of CUDA copies and RDMA transfers
  – Data movement from the GPU and the RDMA transfer are both DMA operations
  – Allow for asynchronous progress
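As a rough illustration of the memory-detection point, the sketch below uses the CUDA runtime call cudaPointerGetAttributes (available since CUDA 4.0; the memoryType field is the CUDA 4.x form) to test whether a buffer lives in device or host memory. It shows the idea only and is not MVAPICH2's internal code.

    #include <cuda_runtime.h>

    /* Returns 1 if buf points to GPU device memory under UVA, 0 otherwise. */
    static int is_device_pointer(const void *buf)
    {
        struct cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
            cudaGetLastError();   /* clear the error raised for plain host pointers */
            return 0;             /* treat unrecognized pointers as host memory */
        }
        return attr.memoryType == cudaMemoryTypeDevice;
    }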

Page 34:

MPI-Level Two-sided Communication

• 45% and 38% improvement compared to Memcpy+Send, with and without GPUDirect respectively, for 4 MB messages
• 24% and 33% improvement compared to MemcpyAsync+Isend, with and without GPUDirect respectively, for 4 MB messages

[Charts: time (us) vs. message size (32K-4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU, with and without GPUDirect]

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters," ISC '11

Page 35:

Other MPI Operations from GPU Buffers

• Similar approaches can be used for
  – One-sided operations
  – Collectives
  – Communication with datatypes
• Designs can also be extended for multiple GPUs per node

H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2," IEEE Cluster '11, Sept. 2011

A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, "MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits," Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011

S. Potluri et al., "Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication," Workshop on Accelerators and Hybrid Exascale Systems (ASHES), to be held in conjunction with IPDPS 2012

Page 36:

MVAPICH2 1.8a2 Release

• Supports point-to-point and collective communication
• Supports communication between GPU devices and between a GPU device and the host
• Supports communication using contiguous and non-contiguous MPI datatypes (see the sketch below)
• Supports GPU-Direct through CUDA support (from CUDA 4.0)
• Takes advantage of CUDA IPC for intra-node (intra-I/OH) communication (from CUDA 4.1)
• Provides flexibility in tuning the performance of both RDMA- and shared-memory-based designs based on the predominant message sizes in applications
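The datatype support above means a device pointer can be handed directly to MPI, even with a derived datatype. The fragment below is a minimal sketch under that model; the column layout, the buffer name d_matrix and the reliance on a GPU-aware MPI path (such as MVAPICH2 1.8a2's) are illustrative assumptions, not code from the slides.

    #include <mpi.h>

    /* Send one column of an N x N row-major matrix that resides in GPU
     * memory; the GPU-aware MPI library stages/pipelines it internally. */
    void send_device_column(double *d_matrix, int N, int col, int dst)
    {
        MPI_Datatype column;

        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);  /* N elements, stride N */
        MPI_Type_commit(&column);
        MPI_Send(d_matrix + col, 1, column, dst, 0, MPI_COMM_WORLD);
        MPI_Type_free(&column);
    }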

Page 37:

OSU MPI Micro-Benchmarks (OMB) 3.5.1 Release

• The OSU MPI Micro-Benchmarks provide a comprehensive suite of benchmarks to compare the performance of different MPI stacks and networks
• Enhancements done for three benchmarks
  – Latency
  – Bandwidth
  – Bi-directional bandwidth
• Flexibility to place buffers in NVIDIA GPU device memory (D) or host memory (H)
• Flexibility to select data movement between D->D, D->H and H->D
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack

Page 38:

MVAPICH2 vs. OpenMPI (Device-Device, Inter-node)

[Charts: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte-1 MB) for MVAPICH2 and OpenMPI; lower latency and higher bandwidth are better]

MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot of Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

Page 39:

MVAPICH2 vs. OpenMPI (Device-Host, Inter-node)

[Charts: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte-1 MB) for MVAPICH2 and OpenMPI; Host-Device performance is similar]

MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot of Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

Page 40:

MVAPICH2 vs. OpenMPI (Device-Device, Intra-node, Multi-GPU)

[Charts: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte-1 MB) for MVAPICH2 and OpenMPI]

MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot of Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

Page 41:

Application-Level Evaluation (Lattice Boltzmann Method (LBM))

[Chart: time per LB step (us) vs. matrix size (128x512x64 to 1024x512x64) for MVAPICH2 1.7 and MVAPICH2 1.8a2; 23.5%-24.2% improvement]

• LBM-CUDA (courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios
• NVIDIA Tesla C2050, Mellanox QDR InfiniBand HCA MT26428, Intel Westmere processor with 12 GB main memory; CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2
• One process per node for one GPU (8-node cluster)

Page 42:

Application-Level Evaluation (AWP-ODC)

[Chart: total execution time (sec) at 4 and 8 processes for MVAPICH2 1.7 and MVAPICH2 1.8a2; 12.5% and 13.0% improvement]

• AWP-ODC simulates the dynamic rupture and wave propagation that occur during an earthquake
• A Gordon Bell Prize finalist at SC 2010
• Originally a Fortran code; a new version is being written in C and CUDA
• NVIDIA Tesla C2050, Mellanox QDR IB, Intel Westmere processor with 12 GB main memory; CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2
• One process per node with one GPU; 128x128x1024 data grid per process/GPU

Page 43:

MPI Design Challenges

• High-performance and scalable inter-node communication with hybrid UD-RC/XRC transport
• Kernel-based zero-copy intra-node communication
• Collective communication
  – Multi-core-aware, topology-aware and power-aware
  – Exploiting collective offload
• Support for GPGPUs and accelerators
• PGAS support (hybrid MPI + UPC)

Page 44:

PGAS Models

• Partitioned Global Address Space (PGAS) models provide a different, complementary interface compared to message passing
  – The idea is to decouple data movement from process synchronization (illustrated below with an MPI one-sided analogue)
  – Processes should have asynchronous access to globally distributed data
  – Well suited for irregular applications and kernels that require dynamic access to different data
• Different libraries and compilers exist that provide this model
  – Global Arrays (library), UPC (compiler), CAF (compiler)
  – HPCS languages: X10, Chapel, Fortress
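PGAS programs are written in UPC, CAF or with Global Arrays rather than in C+MPI. As a rough analogue in MPI (the language of the other examples here), the sketch below uses MPI one-sided operations to show the same separation of data movement from synchronization: MPI_Put moves the data and MPI_Win_fence provides the synchronization. This is an illustrative analogy, not a PGAS implementation.

    #include <mpi.h>

    /* Each process exposes a window and writes its local data into a
     * target process's window without the target issuing a receive. */
    void one_sided_put(double *local, int count, int target, MPI_Comm comm)
    {
        MPI_Win win;
        double *window_buf;

        MPI_Alloc_mem(count * sizeof(double), MPI_INFO_NULL, &window_buf);
        MPI_Win_create(window_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);
        MPI_Put(local, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);          /* data is now visible at the target */

        MPI_Win_free(&win);
        MPI_Free_mem(window_buf);
    }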

Page 45:

Unifying UPC and MPI Runtimes: Experience with MVAPICH2

• Currently UPC and MPI do not share runtimes
  – Duplication of lower-level communication mechanisms
  – GASNet is unable to leverage the advanced buffering mechanisms developed for MVAPICH2
• Our novel approach is to enable a truly unified communication library

[Diagram: today, the UPC compiler's GASNet runtime and the MPI runtime each maintain their own buffers, queue pairs and other resources over the network interface; in the unified design, a single MVAPICH + GASNet runtime serves both the GASNet and MPI interfaces]

Page 46:

UPC Micro-benchmark Performance

• BUPC micro-benchmarks from the latest release, 2.10.2
• UPC performance is identical with both the native IBV layer and the new UCR layer
• Performance of the GASNet-MPI conduit is not very good
  – Mismatch between the MPI specification and Active Messages
• GASNet-UCR is more scalable compared to the native IBV conduit

[Charts: UPC memput latency (us) and bandwidth (MBps) vs. message size (bytes) for GASNet-UCR, GASNet-IBV and GASNet-MPI; UPC memory footprint (MB) vs. number of processes (16-256)]

J. Jose, M. Luo, S. Sur and D. K. Panda, "Unifying UPC and MPI Runtimes: Experience with MVAPICH," International Conference on Partitioned Global Address Space (PGAS), 2010

Page 47:

Evaluation using UPC NAS Benchmarks

• GASNet-UCR performs equal to or better than GASNet-IBV
• 10% improvement for CG (Class B, 128 processes)
• 23% improvement for MG (Class B, 128 processes)

[Charts: time (sec) for MG, FT and CG, Classes B and C, at 64 and 128 processes, for GASNet-UCR, GASNet-IBV and GASNet-MPI]

Page 48:

Evaluation of Hybrid MPI+UPC NAS-FT

[Chart: time (sec) for NAS-FT, Classes B and C, at 64 and 128 processes, for GASNet-UCR, GASNet-IBV, GASNet-MPI and the Hybrid version]

• Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
• Truly hybrid program
• 34% improvement for FT (Class C, 128 processes)

Page 49:

Graph500 Results with New UPC Queue Design

• Workload: Scale 24, edge factor 16 (16 million vertices, 256 million edges)
• 44% improvement over the base version for 512 UPC threads
• 30% improvement over the base version for 1024 UPC threads

J. Jose, S. Potluri, M. Luo, S. Sur and D. K. Panda, "UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters," Fifth Conference on Partitioned Global Address Space Programming Model (PGAS '11), Oct. 2011

Page 50:

MVAPICH2 - Future Plans

• Performance and memory scalability toward 500K-1M cores
• Unified support for PGAS models and languages (UPC, OpenSHMEM, etc.)
• Support for hybrid programming models (being discussed in MPI 3.0)
• Enhanced optimization for GPU support and accelerators
  – Extending the GPGPU support and adding Intel MIC support
• Taking advantage of the collective offload framework in ConnectX-2
  – Including support for non-blocking collectives (MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Enhanced multi-rail designs
• Automatic optimization of collectives
  – LiMIC2, XRC, Hybrid (UD-RC/XRC) and multi-rail
• Checkpoint-restart and migration support with incremental checkpointing
• Fault tolerance with run-through stabilization (being discussed in MPI 3.0)
• QoS-aware I/O and checkpointing

Page 51:

Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI) is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)

• Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – The Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum
  – Memcached is also used

Page 52:

Can High-Performance Interconnects Benefit Enterprise Computing?

• Beginning to draw interest from the enterprise domain
  – Oracle, IBM and Google are working along these directions
• Performance in the enterprise domain remains a concern
  – Where do the bottlenecks lie?
  – Can these bottlenecks be alleviated with new designs?

Page 53:

Common Protocols using Open Fabrics

[Diagram: protocol stacks from the application down to the network hardware. Applications use either the sockets interface or the verbs interface. Sockets paths: kernel-space TCP/IP over an Ethernet driver (1/10/40 GigE adapter and Ethernet switch), TCP/IP with hardware offload (10/40 GigE-TOE), IPoIB (InfiniBand adapter and switch), or user-space SDP with RDMA (InfiniBand adapter and switch). Verbs paths: user-space RDMA over native IB verbs (InfiniBand adapter and switch), over iWARP (iWARP adapter and Ethernet switch), or over RoCE (RoCE adapter and Ethernet switch).]

Page 54:

Can New Data Analysis and Management Systems be Designed with High-Performance Networks and Protocols?

• Sockets were not designed for high performance
  – Stream semantics often mismatch the needs of upper layers (Memcached, HBase, Hadoop)
  – Zero-copy is not available for non-blocking sockets

[Diagram: current design - application over sockets over a 1/10 GigE network; enhanced design - application over accelerated sockets (verbs/hardware offload) over 10 GigE or InfiniBand; our approach - application over the OSU design on the verbs interface over 10 GigE or InfiniBand]

Page 55:

Memcached Architecture

• Distributed caching layer
  – Allows aggregating spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and the results of API calls
• Scalable model, but typical usage is very network intensive
• Native IB-verbs-level design and evaluation with
  – Memcached server: 1.4.9
  – Memcached client: libmemcached 0.52 (see the client sketch below)

[Diagram: web frontend servers (Memcached clients) connected over the Internet and high-performance networks to Memcached servers and database servers, each with CPUs, main memory, SSDs and HDDs]
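For orientation, a minimal client using libmemcached (the client library named above) looks like the following. The server address, key and value are illustrative choices for this sketch.

    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        memcached_return_t rc;
        size_t len;
        uint32_t flags;

        memcached_st *memc = memcached_create(NULL);
        memcached_server_add(memc, "127.0.0.1", 11211);  /* assumed server */

        /* store a key/value pair, then read it back */
        rc = memcached_set(memc, "key", 3, "value", 5, (time_t)0, (uint32_t)0);
        if (rc != MEMCACHED_SUCCESS)
            fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

        char *val = memcached_get(memc, "key", 3, &len, &flags, &rc);
        if (val) {
            printf("got: %.*s\n", (int)len, val);
            free(val);
        }

        memcached_free(memc);
        return 0;
    }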

Page 56:

Memcached Get Latency (Small Messages)

• Memcached Get latency
  – 4 bytes, RC/UD - DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
  – 2K bytes, RC/UD - DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost a factor-of-four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster

[Charts: time (us) vs. message size (1 byte-2K) on the Intel Clovertown cluster (IB DDR) and the Intel Westmere cluster (IB QDR) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, 1GigE and 10GigE]

Page 57:

Memcached Set Latency (Large Messages)

• Memcached Set latency
  – 8K bytes, RC/UD - DDR: 19.8/21.2 us; QDR: 11.7/12.7 us
  – 512K bytes, RC/UD - DDR: 366/413 us; QDR: 181/206 us
• Almost a factor-of-two improvement over 10GigE (TOE) for 512K bytes on the DDR cluster

[Charts: time (us) vs. message size (2K-512K) on the Intel Clovertown cluster (IB DDR) and the Intel Westmere cluster (IB QDR) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, 1GigE and 10GigE]

Page 58:

Memcached Get TPS (4 bytes)

• Memcached Get transactions per second for 4-byte messages
  – On IB QDR: 1.4M TPS (RC) and 1.3M TPS (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB

[Charts: thousands of transactions per second (TPS) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and 1GigE]

Page 59:

Memcached - Memory Scalability

• Steady memory footprint for the UD design: ~200 MB
• The RC memory footprint increases with the number of clients: ~500 MB for 4K clients

[Chart: memory footprint (MB) vs. number of clients (1-4K) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, OSU-Hybrid-IB and 1GigE]

Page 60:

Application-Level Evaluation - Olio Benchmark

• Olio benchmark
  – RC: 1.6 s, UD: 1.9 s, Hybrid: 1.7 s for 1024 clients
• 4X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design

[Charts: time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]

Page 61:

Application-Level Evaluation - Real Application Workloads

• Real application workload
  – RC: 302 ms, UD: 318 ms, Hybrid: 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design

[Charts: time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, "Memcached Design on High Performance RDMA Capable Interconnects," ICPP '11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula and D. K. Panda, "Memcached Design on High Performance RDMA Capable Interconnects," CCGrid '12

Page 62:

Overview of HBase Architecture

• An open-source database project based on the Hadoop framework for hosting very large tables
• Major components: HBaseMaster, HRegionServer and HBaseClient
• HBase and HDFS are deployed in the same cluster to get better data locality

Page 63:

HBase Put/Get - Detailed Analysis

• HBase 1KB Put: communication time of 8.9 us - a factor of 6X improvement over 10GigE in communication time
• HBase 1KB Get: communication time of 8.9 us - a factor of 6X improvement over 10GigE in communication time

[Charts: time (us) for HBase 1KB Put and Get over 1GigE, IPoIB, 10GigE and OSU-IB, broken down into communication, communication preparation, server processing, server serialization, client processing and client serialization]

W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy and D. K. Panda, "Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?," ISPASS '12

Page 64:

HBase Single-Server, Multi-Client Results

• HBase Get latency
  – 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
  – 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GigE

[Charts: latency (us) and throughput (ops/sec) vs. number of clients (1-16) for IPoIB, OSU-IB, 1GigE and 10GigE]

Page 65:

HBase - YCSB Read-Write Workload

• HBase Get (read) latency (Yahoo! Cloud Serving Benchmark)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put (write) latency
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients

[Charts: read and write latency (us) vs. number of clients (8-128) for IPoIB, OSU-IB, 1GigE and 10GigE]

J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy and D. K. Panda, "High-Performance Design of HBase with RDMA over InfiniBand," IPDPS '12

Page 66:

Hadoop Architecture

• Underlying Hadoop Distributed File System (HDFS)
• Fault tolerance by replicating data blocks
• NameNode: stores information on data blocks
• DataNodes: store blocks and host MapReduce computation
• JobTracker: tracks jobs and detects failures
• The model scales, but there is a high amount of communication during the intermediate phases

Page 67:

RDMA-based Design for Native HDFS - Preliminary Results

• HDFS file-write experiment using one DataNode on the IB-DDR cluster
• HDFS file-write time
  – 64 MB: 344 ms, 128 MB: 669 ms, 256 MB: 1.3 s, 512 MB: 2.7 s, 1 GB: 6.7 s
  – 20% improvement over IPoIB and 13% over 10GigE for a 1 GB file

[Chart: write time (ms) vs. file size (64-1024 MB) for 1GigE, IPoIB, 10GigE and OSU-IB-DDR]

Page 68:

Concluding Remarks

• InfiniBand with the RDMA feature is gaining momentum in HPC systems, with the best performance and growing usage
• As the HPC community moves to Exascale, new solutions are needed in the MPI stack for supporting PGAS, GPUs, collectives (topology-aware, power-aware), one-sided communication, QoS, etc.
• Demonstrated how such solutions can be designed with MVAPICH2, and their performance benefits
• New solutions are also needed to re-design software libraries for enterprise environments to take advantage of modern networks
• These solutions allow application scientists and engineers to take advantage of modern supercomputers

Page 69:

HPC Israel (Feb '12)

Funding Acknowledgments

Funding Support by

Equipment Support by

69

Page 70:

HPC Israel (Feb '12)

Personnel Acknowledgments

Current Students

– V. Dhanraj (M.S.)

– N. Islam (Ph.D.)

– J. Jose (Ph.D.)

– K. Kandalla (Ph.D.)

– M. Luo (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekhar (Ph.D.)

– M. Rahman (Ph.D.)

– A. Singh (Ph.D.)

– H. Subramoni (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– P. Lai (Ph. D.)

– J. Liu (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– S. Pai (M.S.)

– G. Santhanaraman (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

70

Past Research Scientist – S. Sur

Current Post-Docs

– J. Vienne

– H. Wang

Current Programmers

– M. Arnold

– D. Bureddy

– J. Perkins

– D. Sharma

Past Post-Docs – X. Besseron

– H.-W. Jin

– E. Mancini

– S. Marcarelli