
Page 1: MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Presentation at HPC Advisory Council European Conference 2012

by

Hari Subramoni

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~subramon

Page 2: MVAPICH2 Project Update and Big Data Acceleration

• MVAPICH Project Update

– 1.8 Release

– Sample Performance Numbers

– Upcoming Features and Performance Numbers

• Big Data Acceleration

– Motivation

– Solutions

• Memcached

• HBase

• HDFS

HPC Advisory - Hamburg 2

Presentation Outline

Page 3: MVAPICH2 Project Update and Big Data Acceleration

• High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002

– Used by more than 1,930 organizations (HPC Centers, Industry and Universities) in 68 countries

– More than 114,000 downloads from OSU site directly

– Empowering many TOP500 clusters

• 5th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology

• 7th ranked 111,104-core cluster (Pleiades) at NASA

• 25th ranked 62,976-core cluster (Ranger) at TACC

• and many others

– Available with software stacks of many InfiniBand, High-speed Ethernet and server vendors including Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) System

MVAPICH2 Software

HPC Advisory - Hamburg 3

Page 4: MVAPICH2 Project Update and Big Data Acceleration

• Released on 04/30/12

• Major features (compared to MVAPICH2 1.7)

– Support for MPI communication from NVIDIA GPU device memory (see the device-buffer sketch after this list)

• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)

• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

• Taking advantage of CUDA IPC (available in CUDA 4.1) in intra-node communication for multiple GPU adapters/node

• Optimized and tuned collectives for GPU device buffers

• MPI datatype support for point-to-point and collective communication from GPU device buffers

– New shared memory design for enhanced intra-node small message performance

– Support for suspend/resume functionality with mpirun_rsh

– Enhanced support for CPU binding with socket and numanode level granularity

– Checkpoint-Restart and run-through stabilization support in the OFA-IB-Nemesis interface

– Enhanced OFA-IB-Nemesis interface to handle IB errors gracefully

– Enhanced integration with SLURM and PBS

– Reduced memory footprint of the library

– Enhanced one-sided communication design with reduced memory requirement

– Enhancements and tuned collectives (Bcast and Alltoallv)

– Support for iWARP interoperability between Intel NE020 and Chelsio T4 adapters

– RoCE enablement environment variable renamed from MV2_USE_RDMAOE to MV2_USE_RoCE

– Update to hwloc v1.4
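To illustrate the GPU device-memory support listed above, here is a minimal, hedged sketch of CUDA-aware MPI usage: the application passes a pointer obtained from cudaMalloc directly to MPI calls and lets the library stage the data internally. The message size, tag and rank layout are illustrative assumptions, not values from the slides; at run time MVAPICH2 additionally expects its CUDA support to be enabled (e.g., via the MV2_USE_CUDA setting).

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int count = 1 << 20;            /* 1M floats, illustrative size */
        float *d_buf = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Allocate the communication buffer directly in GPU device memory. */
        cudaMalloc((void **)&d_buf, count * sizeof(float));

        if (nprocs >= 2) {
            if (rank == 0) {
                /* The device pointer is handed straight to MPI; a CUDA-aware
                   MVAPICH2 build moves the data (pipelined device/host staging
                   plus RDMA) without an explicit cudaMemcpy in the application. */
                MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received %d floats into device memory\n", count);
            }
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }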

Latest MVAPICH2 1.8 – Features

HPC Advisory - Hamburg 4

Page 5: MVAPICH2 Project Update and Big Data Acceleration

One-way Latency: MPI over IB

[Figure: Small Message Latency and Large Message Latency, Latency (us) vs. Message Size (bytes), for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR; small-message latencies annotated on the chart: 1.66, 1.56, 1.64, 1.82 and 0.99 us]

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch

HPC Advisory - Hamburg 5

Page 6: MVAPICH2 Project Update and Big Data Acceleration

Bandwidth: MPI over IB

[Figure: Unidirectional and Bidirectional Bandwidth, Bandwidth (MBytes/sec) vs. Message Size (bytes), for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR; unidirectional bandwidths annotated at 1706, 1917, 3280, 3385 and 6343 MBytes/sec, bidirectional bandwidths at 3341, 3704, 4407, 6521 and 11643 MBytes/sec]

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 with IB switch

HPC Advisory - Hamburg 6
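Bandwidth numbers like those above come from OSU-benchmark-style streaming tests. As a point of reference, the sketch below shows the usual measurement pattern: a window of non-blocking sends drained with MPI_Waitall and timed with MPI_Wtime. The window size, message size and iteration count are illustrative assumptions, not the settings behind the slide.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define WINDOW  64            /* messages in flight per iteration (assumed) */
    #define MSGSIZE (1 << 20)     /* 1 MB messages (assumed) */
    #define ITERS   50

    int main(int argc, char **argv)
    {
        char *sbuf, *rbuf, ack = 0;
        MPI_Request reqs[WINDOW];
        int rank, i, w;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        sbuf = malloc(MSGSIZE);
        rbuf = malloc((size_t)WINDOW * MSGSIZE);   /* one slot per outstanding recv */
        memset(sbuf, 'a', MSGSIZE);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();

        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                /* Keep WINDOW sends outstanding, then drain them with Waitall. */
                for (w = 0; w < WINDOW; w++)
                    MPI_Isend(sbuf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                /* A short acknowledgement bounds each window. */
                MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                for (w = 0; w < WINDOW; w++)
                    MPI_Irecv(rbuf + (size_t)w * MSGSIZE, MSGSIZE, MPI_CHAR, 0, 0,
                              MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }

        t1 = MPI_Wtime();
        if (rank == 0)
            printf("Bandwidth: %.2f MBytes/sec\n",
                   (double)ITERS * WINDOW * MSGSIZE / (t1 - t0) / 1.0e6);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }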

Page 7: MVAPICH2 Project Update and Big Data Acceleration

• Supports all different modes: RC/XRC, UD, UD+RC or UD+XRC

• Available in MVAPICH (since 1.1)

• Available in MVAPICH2 (since 1.7) with more advanced, tuned and adaptive designs

• Provides flexibility to tune performance of critical applications on large-scale clusters with various modes

Hybrid Transport Design (UD-RC/XRC)

M. Koop, T. Jones and D. K. Panda, MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand, IPDPS '08

HPC Advisory - Hamburg 7

Page 8: MVAPICH2 Project Update and Big Data Acceleration

• Experimental Platform: LLNL Hyperion Cluster (Intel Harpertown + QDR IB)

• RC: Default parameters

• UD: MV2_HYBRID_ENABLE_THRESHOLD=1, MV2_HYBRID_MAX_RC_CONNECTIONS=0

• Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1

HPCC Random Ring Latency with MVAPICH2 (UD/RC/Hybrid)

[Figure: HPCC Random Ring Latency, Time (us) vs. No. of Processes (128, 256, 512, 1024) for UD, Hybrid and RC transports; annotated improvements of 26%, 30%, 38% and 40%]

HPC Advisory - Hamburg 8

Page 9: MVAPICH2 Project Update and Big Data Acceleration

Shared-memory Aware Collectives

[Figure: MPI_Reduce and MPI_Allreduce latency (us) vs. message size (bytes) on 4,096 cores, Original vs. Shared-memory designs; MV2_USE_SHMEM_REDUCE=0/1, MV2_USE_SHMEM_ALLREDUCE=0/1]

[Figure: Barrier latency (us) vs. number of processes (128-1024), Pt2Pt Barrier vs. Shared-Mem Barrier; MV2_USE_SHMEM_BARRIER=0/1]

• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB) – see the timing sketch below

• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB

HPC Advisory - Hamburg 9
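The latencies reported above are averaged times for repeated collective calls. As a rough guide to how such numbers are obtained, and how the shared-memory designs are toggled, here is a minimal timing sketch; the message size and iteration count are illustrative assumptions, and the MV2_USE_SHMEM_* settings mentioned in the comment are the run-time knobs named on the slide.

    #include <mpi.h>
    #include <stdio.h>

    /* Times MPI_Allreduce for one message size. Run once with
       MV2_USE_SHMEM_ALLREDUCE=0 and once with =1 (the slide's knob) to compare
       the point-to-point and shared-memory-aware designs. */
    int main(int argc, char **argv)
    {
        const int count = 512;           /* 512 doubles = 4 KB payload (assumed) */
        const int iters = 1000;
        double in[512], out[512], t0, t1;
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < count; i++) in[i] = (double)rank;

        /* Warm-up, then synchronize before timing. */
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);

        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_Allreduce avg latency: %.2f us\n",
                   (t1 - t0) * 1.0e6 / iters);

        MPI_Finalize();
        return 0;
    }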

Page 10: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 10

MVAPICH2 vs. OpenMPI (GPU Device - GPU Device, Inter-node)

[Figure: Latency (us), Bandwidth (MB/s) and Bi-directional Bandwidth (MB/s) vs. message size (1 byte - 1 MB) for MVAPICH2 and OpenMPI; lower latency and higher bandwidth are better]

MVAPICH2 1.8rc1 and OpenMPI (trunk nightly snapshot on Mar 26, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

Page 11: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 11

Application-Level Evaluation (Lattice Boltzmann Method (LBM))

• LBM-CUDA (courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios

• Node specification on the Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

• CUDA 4.1 and MVAPICH2 1.8rc1

• 16 nodes with one MPI process per node/GPU

[Figure: LBM step time (s) vs. domain size (256x256x256 to 512x512x512) for the MPI and MPI-GPU versions; improvements of 13.7%, 12.0%, 11.8% and 9.4%]

Page 12: MVAPICH2 Project Update and Big Data Acceleration

• Performance and Memory scalability toward 500K-1M cores

• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)

• Enhanced Optimization for GPU Support and Accelerators

– Extending the GPGPU support

– Support for Intel MIC

• UD-Multicast-Aware Collectives

• Taking advantage of the Collective Offload framework

– Including support for non-blocking collectives (MPI 3.0)

• Extended topology-aware collectives

• Power-aware collectives

• Enhanced Multi-rail Designs

• Automatic optimization of collectives

– LiMIC2, XRC, Hybrid (UD-RC/XRC) and Multi-rail

• Support for the MPI Tools Interface

• Checkpoint-Restart and migration support with incremental checkpointing

• Fault tolerance with run-through stabilization (being discussed in MPI 3.0)

• QoS-aware I/O and checkpointing

• Automatic tuning and self-adaptation for different systems and applications

MVAPICH2 – Plans for Exascale

HPC Advisory - Hamburg 12

Page 13: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 13

Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) designs

Hybrid MPI+OpenSHMEM

• Based on the OpenSHMEM Reference Implementation (http://openshmem.org/)

• Provides a design over GASNet

• Does not take advantage of all OFED features

• Design scalable and high-performance OpenSHMEM over OFED

• Designing a Hybrid MPI + OpenSHMEM model

• Current model – separate runtimes for OpenSHMEM and MPI

• Possible deadlock if both runtimes are not progressed

• Consumes more network resources

• Our approach – a single runtime for MPI and OpenSHMEM

[Diagram: A hybrid (OpenSHMEM+MPI) application issues OpenSHMEM calls and MPI calls through the OpenSHMEM and MPI interfaces of a single OSU design running over the InfiniBand network]
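To make the hybrid model concrete, here is a minimal sketch of a program that mixes the two models in one run, assuming a unified runtime such as the OSU design described above. The array size, the use of the classic start_pes()/shmem_int_put() API of the 2012-era OpenSHMEM reference implementation, and the final MPI_Allreduce are illustrative choices, not code from the slides.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        static int remote_val = 0;    /* symmetric variable, target of the put */
        int me, npes, local, sum;

        MPI_Init(&argc, &argv);
        start_pes(0);                 /* OpenSHMEM initialization (classic API) */

        me   = _my_pe();
        npes = _num_pes();

        /* One-sided phase: each PE deposits its rank into its right neighbor. */
        local = me;
        shmem_int_put(&remote_val, &local, 1, (me + 1) % npes);
        shmem_barrier_all();          /* complete and order the puts */

        /* Two-sided/collective phase: reduce the received values with MPI,
           assuming MPI ranks and OpenSHMEM PEs coincide under one runtime. */
        MPI_Allreduce(&remote_val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (me == 0)
            printf("sum of deposited ranks = %d (expected %d)\n",
                   sum, npes * (npes - 1) / 2);

        MPI_Finalize();
        return 0;
    }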

Page 14: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 14

Micro-Benchmark Performance (OpenSHMEM)

[Figure: OpenSHMEM-GASNet vs. OpenSHMEM-OSU micro-benchmarks — atomic operation latency (fadd, finc, cswap, swap, add, inc) in us, and Broadcast, Collect and Reduce times (ms) for 256-byte messages on 32-256 processes; the OSU design shows annotated improvements of 41% and 16% (atomics/broadcast), 35% (Collect) and 36% (Reduce)]
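For readers unfamiliar with the operations benchmarked above, the sketch below exercises a few of the OpenSHMEM atomics (fetch-and-add, increment, compare-and-swap) against a counter on PE 0. It uses the classic 2012-era API names and is an illustrative sketch, not the benchmark code behind the figure.

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        static int counter = 0;       /* symmetric counter living on every PE */
        int me, old;

        start_pes(0);
        me = _my_pe();

        /* Every PE atomically fetches-and-adds 2 to the counter on PE 0. */
        old = shmem_int_fadd(&counter, 2, 0);

        /* Every PE also atomically increments the counter on PE 0. */
        shmem_int_inc(&counter, 0);

        shmem_barrier_all();

        if (me == 0) {
            /* Compare-and-swap: reset the counter only if it holds the expected
               total of 3 * npes (2 from fadd + 1 from inc, per PE). */
            int expected = 3 * _num_pes();
            int prev = shmem_int_cswap(&counter, expected, 0, 0);
            printf("counter=%d, last fetched=%d, cswap saw %d\n",
                   counter, old, prev);
        }

        return 0;
    }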

Page 15: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 15

Performance of Hybrid (OpenSHMEM+MPI) Applications

• Improved performance for hybrid applications

– 34% improvement for 2D Heat Transfer Modeling with 512 processes

– 45% improvement for Graph500 with 256 processes

• Our approach with a single runtime consumes 27% less network resources

[Figure: 2D Heat Transfer Modeling and Graph-500 execution time (s) vs. number of processes, and network resource utilization (MB), for Hybrid-GASNet vs. Hybrid-OSU; annotated gains of 34%, 45% and 27%]

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Will be available with MVAPICH2 1.9

Page 16: MVAPICH2 Project Update and Big Data Acceleration

Many Integrated Core (MIC) Architecture

• Intel's Many Integrated Core (MIC) architecture geared for HPC

• Critical part of their solution for exascale computing

• Knights Ferry (KNF) and Knights Corner (KNC)

• Many low-power x86 processor cores with hardware threads and wider vector units

• Applications and libraries can run with minor or no modifications

• Using MVAPICH2 for communication within a KNF coprocessor

– out-of-the-box (Default)

– shared memory designs tuned for KNF (Optimized V1)

– designs enhanced using a lower-level API (Optimized V2)

HPC Advisory - Hamburg 16

Page 17: MVAPICH2 Project Update and Big Data Acceleration

Point-to-Point Communication

Knights Ferry - D0 1.20 GHz card, 32 cores; Alpha 9 MIC software stack with an additional pre-alpha patch

[Figure: Normalized intra-MIC latency and bandwidth vs. message size for Default, Optimized V1 and Optimized V2 designs; annotated gains of 13%/17% (small-message latency), 70%/80% (large-message latency), and 15%, 4X and 9.5X (bandwidth)]

HPC Advisory - Hamburg 17
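The latency curves above are produced by a standard ping-pong measurement between two MPI ranks (here, two ranks running on the same KNF coprocessor). The sketch below shows that pattern; the message size and iteration count are illustrative assumptions, and this is a generic benchmark skeleton rather than the tool used for the slide.

    #include <mpi.h>
    #include <stdio.h>

    #define MSGSIZE 1024          /* 1 KB message (assumed) */
    #define ITERS   10000

    int main(int argc, char **argv)
    {
        static char buf[MSGSIZE];
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();

        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                /* Ping: send to rank 1, then wait for the pong. */
                MPI_Send(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        t1 = MPI_Wtime();
        if (rank == 0)
            printf("One-way latency: %.2f us\n",
                   (t1 - t0) * 1.0e6 / (2.0 * ITERS));

        MPI_Finalize();
        return 0;
    }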

Page 18: MVAPICH2 Project Update and Big Data Acceleration

Collective Communication

[Figure: Normalized intra-MIC Broadcast and Scatter latency vs. message size for Default, Optimized V1 and Optimized V2 designs; annotated gains of 17% and 20%/26% (Broadcast) and 72%/80% (Scatter)]

S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 2012 – Best Student Paper Award

HPC Advisory - Hamburg 18

Page 19: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 19

Hardware Multicast-aware MPI_Bcast

[Figure: MPI_Bcast time (us), Default vs. hardware Multicast — 16-byte and 64 KB broadcasts vs. number of processes (32-1024), and small-/large-message broadcasts vs. message size at 1,024 cores]

Initial support was available in MVAPICH; enhanced support will be available in MVAPICH2 1.9.

Page 20: MVAPICH2 Project Update and Big Data Acceleration

Application Benefits with Non-Blocking Collectives based on CX-2 Collective Offload

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)

• Modified Preconditioned Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version

[Figure: Normalized HPL performance for HPL-Offload, HPL-1ring and HPL-Host vs. HPL problem size (N) as % of total memory]

• K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

• K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

• K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

HPC Advisory - Hamburg 20
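The offload results above rely on overlapping a non-blocking collective with application computation. A minimal sketch of that pattern, using the MPI 3.0 non-blocking alltoall interface these offload designs target, is shown below; compute_on_local_data() is a hypothetical placeholder for the application work overlapped with communication, and the buffer sizes are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical placeholder for application work that does not depend on the
       data being exchanged (e.g., local FFT stages in P3DFFT). */
    static void compute_on_local_data(double *data, int n)
    {
        for (int i = 0; i < n; i++)
            data[i] = data[i] * 0.5 + 1.0;
    }

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int per_peer = 1024;               /* doubles per peer (assumed) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *sbuf  = malloc(per_peer * nprocs * sizeof(double));
        double *rbuf  = malloc(per_peer * nprocs * sizeof(double));
        double *local = malloc(per_peer * sizeof(double));
        for (int i = 0; i < per_peer * nprocs; i++) sbuf[i] = rank;
        for (int i = 0; i < per_peer; i++) local[i] = rank;

        MPI_Request req;

        /* Start the exchange; with a collective-offload capable HCA the network
           can progress it while the host computes. */
        MPI_Ialltoall(sbuf, per_peer, MPI_DOUBLE,
                      rbuf, per_peer, MPI_DOUBLE, MPI_COMM_WORLD, &req);

        /* Overlap independent computation with the ongoing alltoall. */
        compute_on_local_data(local, per_peer);

        /* Block only when the exchanged data is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(sbuf); free(rbuf); free(local);
        MPI_Finalize();
        return 0;
    }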

Page 21: MVAPICH2 Project Update and Big Data Acceleration

Topology-Aware Collectives

Default (Binomial) vs. Topology-Aware Algorithms with 296 Processes

[Figure: Scatter and Gather latency (msec) vs. message size (2K-512K bytes), Default vs. Topology-Aware; annotated improvements of 22% and 54%]

K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather, CAC '10

Impact of Network-Topology-Aware Algorithms on Broadcast Performance: 14% improvement

H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11

HPC Advisory - Hamburg 21

Page 22: MVAPICH2 Project Update and Big Data Acceleration

Power and Energy Savings with Power-Aware Collectives

Performance and Power Comparison: MPI_Alltoall with 64 processes on 8 nodes

CPMD Application Performance and Power Savings: 5% and 32% (as annotated on the original charts)

K. Kandalla, E. Mancini, Sayantan Sur and D. K. Panda, Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters, ICPP '10

HPC Advisory - Hamburg 22

Page 23: MVAPICH2 Project Update and Big Data Acceleration

• MVAPICH Project Update

– 1.8 Release

– Sample Performance Numbers

– Upcoming Features and Performance Numbers

• Big Data Acceleration

– Motivation

– Solutions

• Memcached

• HBase

• HDFS

HPC Advisory - Hamburg 23

Presentation Outline

Page 24: MVAPICH2 Project Update and Big Data Acceleration

Can High-Performance Interconnects Benefit Big Data Acceleration?

• Beginning to draw interest from the enterprise domain

– Oracle, IBM, Google are working along these directions

• Performance in the enterprise domain remains a concern

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs?

HPC Advisory - Hamburg 24

Page 25: MVAPICH2 Project Update and Big Data Acceleration

Common Protocols using Open Fabrics

[Diagram: Common protocol stacks over OpenFabrics — application over kernel-space TCP/IP on 1/10/40 GigE; TCP/IP with hardware offload on 10/40 GigE-TOE; IPoIB over InfiniBand adapters and switches; user-space SDP over InfiniBand; native IB verbs (RDMA) over InfiniBand; RDMA over RoCE adapters and Ethernet switches; and iWARP over iWARP adapters and Ethernet switches]

ISC '12

Page 26: MVAPICH2 Project Update and Big Data Acceleration

Can New Data Analysis and Management Systems be designed with High-Performance Networks and Protocols?

[Diagram: Current design — Application over Sockets over a 1/10 GigE network; Enhanced designs — Application over accelerated sockets using verbs/hardware offload on 10 GigE or InfiniBand; Our approach — Application over the OSU design using the verbs interface on 10 GigE or InfiniBand]

• Sockets not designed for high performance

– Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)

– Zero-copy not available for non-blocking sockets

HPC Advisory - Hamburg 26
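The verbs-level designs referenced above bypass the sockets path and post work requests directly to the HCA. As a rough illustration, the sketch below posts a single RDMA write over a queue pair; it assumes connection setup, memory registration and completion-queue creation have already been done (qp, mr, cq, and the peer's remote_addr/rkey stand in for state exchanged out of band), so it is a fragment showing the data-path call rather than a complete program.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Post one RDMA write and wait for its completion.
       qp/cq/mr and the peer's remote_addr/rkey are assumed to have been set up
       and exchanged during connection establishment (not shown here). */
    int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                        void *local_buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge;
        struct ibv_send_wr wr, *bad_wr = NULL;
        struct ibv_wc wc;
        int ne;

        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)local_buf;            /* registered local buffer */
        sge.length = (uint32_t)len;
        sge.lkey   = mr->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* zero-copy remote write */
        wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;         /* peer's registered memory */
        wr.wr.rdma.rkey        = rkey;

        if (ibv_post_send(qp, &wr, &bad_wr)) {
            fprintf(stderr, "ibv_post_send failed\n");
            return -1;
        }

        /* Poll the completion queue until the write finishes. */
        do {
            ne = ibv_poll_cq(cq, 1, &wc);
        } while (ne == 0);

        return (ne < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }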

Page 27: MVAPICH2 Project Update and Big Data Acceleration

Memcached Architecture

• Distributed Caching Layer

– Allows aggregation of spare memory from multiple nodes

– General purpose

• Typically used to cache database queries and results of API calls

• Scalable model, but typical usage is very network intensive

• Native IB-verbs-level design and evaluation with

– Memcached Server: 1.4.9

– Memcached Client: libmemcached 0.52
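For context on the client side being benchmarked, here is a minimal libmemcached usage sketch (a set followed by a get); the server hostname and key/value strings are illustrative assumptions, and error handling is reduced to a status check.

    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *key = "user:42";          /* illustrative key */
        const char *val = "hello from IB";    /* illustrative value */
        size_t value_len;
        uint32_t flags;
        memcached_return_t rc;

        /* Create a client handle and point it at one server (hostname assumed). */
        memcached_st *memc = memcached_create(NULL);
        memcached_server_add(memc, "memcached-server-0", 11211);

        /* SET: store the value with no expiration. */
        rc = memcached_set(memc, key, strlen(key), val, strlen(val),
                           (time_t)0, (uint32_t)0);
        if (rc != MEMCACHED_SUCCESS)
            fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

        /* GET: fetch it back; the library allocates the returned buffer. */
        char *fetched = memcached_get(memc, key, strlen(key),
                                      &value_len, &flags, &rc);
        if (rc == MEMCACHED_SUCCESS && fetched) {
            printf("got %zu bytes: %.*s\n", value_len, (int)value_len, fetched);
            free(fetched);
        }

        memcached_free(memc);
        return 0;
    }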

[Diagram: Web frontend servers (Memcached clients) connected over high-performance networks to Memcached servers and database servers, each with main memory, CPUs, SSDs and HDDs]

HPC Advisory - Hamburg 27

ISC '12

Page 28: MVAPICH2 Project Update and Big Data Acceleration

Memcached Get Latency (Small Message)

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us

– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost a factor-of-four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster

[Figure: Get latency, Time (us) vs. message size (1 byte - 2K) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR), for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, 1GigE and 10GigE]

HPC Advisory - Hamburg 28

Page 29: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 29

Memcached Get TPS (4 bytes)

• Memcached Get transactions per second for 4 bytes

– On IB QDR: 1.4M TPS (RC), 1.3M TPS (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB

[Figure: Thousands of transactions per second (TPS) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and 1GigE]

Page 30: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 30

Application-Level Evaluation – Real Application Workloads

• Real application workload – RC: 302 ms, UD: 318 ms, Hybrid: 314 ms for 1024 clients

• 12X better than IPoIB for 8 clients

• The hybrid design achieves performance comparable to that of the pure RC design

[Figure: Time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, CCGrid '12

Page 31: MVAPICH2 Project Update and Big Data Acceleration

Overview of HBase Architecture

• An open-source database project based on the Hadoop framework for hosting very large tables

• Major components: HBaseMaster, HRegionServer and HBaseClient

• HBase and HDFS are deployed in the same cluster to get better data locality

HPC Advisory - Hamburg 31

Page 32: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 32

HBase Put/Get – Detailed Analysis

• HBase 1KB Put

– Communication time: 8.9 us

– A factor of 6X improvement over 10GigE for communication time

• HBase 1KB Get

– Communication time: 8.9 us

– A factor of 6X improvement over 10GigE for communication time

[Figure: Breakdown of HBase 1KB Put and Get time (us) for 1GigE, IPoIB, 10GigE and OSU-IB into client serialization, client processing, communication preparation, communication, server serialization and server processing]

W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12

Page 33: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 33

HBase Single Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GigE

[Figure: Get latency (us) and throughput (Ops/sec) vs. number of clients (1-16) for IPoIB, OSU-IB, 1GigE and 10GigE]

Page 34: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 34

HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Serving Benchmark)

– 64 clients: 2.0 ms; 128 clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients

• HBase Put latency

– 64 clients: 1.9 ms; 128 clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients

[Figure: Read and write latency, Time (us) vs. number of clients (8-128) for IPoIB, OSU-IB, 1GigE and 10GigE]

J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

Page 35: MVAPICH2 Project Update and Big Data Acceleration

Hadoop Architecture

• Underlying Hadoop Distributed File System (HDFS)

• Fault tolerance by replicating data blocks

• NameNode: stores information on data blocks

• DataNodes: store blocks and host MapReduce computation

• JobTracker: tracks jobs and detects failures

• The model scales, but there is a high amount of communication during intermediate phases

ISC '12 35

Page 36: MVAPICH2 Project Update and Big Data Acceleration

HPC Advisory - Hamburg 36

RDMA-based Design for Native HDFS – Preliminary Results

• HDFS (Hadoop-0.20.2) file write experiment using four data nodes on an IB-QDR cluster

• HDFS file write time

– 2 GB: 7.78 s

– For a 2 GB file size: 26% improvement over IPoIB

• In communication time, for a 3 GB file size: 36% improvement over IPoIB and 87% improvement over 1GigE

[Figure: Communication time and file write time (SSD), Time (s) vs. file size (1-5 GB), for 1GigE, IPoIB and the OSU design]