
Page 1: Title

New Developments in Message Passing, PGAS and Big Data

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at the HPC Advisory Council China Workshop, October 2013

Page 2: Welcome

Welcome (欢迎)

Page 3: Current and Next Generation HPC Systems and Applications

• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increasing speed and features at reduced cost

Page 4: Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions toward Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  – Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data / Enterprise / Commercial Computing
  – Focuses on large data and data analysis
  – The Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum
  – Memcached is also used for Web 2.0

Page 5: High-End Computing (HEC): PetaFlop to ExaFlop

• Expected to have an ExaFlop system in 2019–2022!
• [Figure: performance projection – 100 PFlops expected in 2015, 1 EFlops expected in 2018]

Page 6: Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

• [Figure: number of clusters and percentage of clusters in the Top500 over time]

Page 7: Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/coprocessors are becoming common in high-end systems
• Pushing the envelope for Exascale computing
• High-performance interconnects – InfiniBand: <1 usec latency, >100 Gbps bandwidth
• Accelerators/coprocessors: high compute density, high performance/watt; >1 TFlop DP on a chip
• Examples: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)

Page 8: Exascale System Targets

  System attribute            2010         2018                                   Difference (2010 → 2018)
  System peak                 2 PFlop/s    1 EFlop/s                              O(1,000)
  Power                       6 MW         ~20 MW (goal)
  System memory               0.3 PB       32–64 PB                               O(100)
  Node performance            125 GF       1.2 or 15 TF                           O(10) – O(100)
  Node memory BW              25 GB/s      2–4 TB/s                               O(100)
  Node concurrency            12           O(1k) or O(10k)                        O(100) – O(1,000)
  Total node interconnect BW  3.5 GB/s     200–400 GB/s (1:4 or 1:8 of memory BW) O(100)
  System size (nodes)         18,700       O(100,000) or O(1M)                    O(10) – O(100)
  Total concurrency           225,000      O(billion) + [O(10)–O(100) for latency hiding]  O(10,000)
  Storage capacity            15 PB        500–1000 PB (>10x system memory is min)  O(10) – O(100)
  I/O rates                   0.2 TB/s     60 TB/s                                O(100)
  MTTI                        days         O(1 day)                               -O(10)

Courtesy: DOE Exascale Study and Prof. Jack Dongarra

Page 9: Basic Design Challenges for Exascale Systems

• DARPA Exascale Report – Peter Kogge, Editor and Lead
• Energy and Power Challenge
  – Hard to meet the power requirements for data movement
• Memory and Storage Challenge
  – Hard to achieve both high capacity and high data rate
• Concurrency and Locality Challenge
  – Management of a very large amount of concurrency (billions of threads)
• Resiliency Challenge
  – Low-voltage devices (for low power) introduce more faults

Page 10: Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges

• Application kernels/applications sit at the top of the stack
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Middleware: a communication library or runtime for the programming models, providing
  – Point-to-point communication (two-sided & one-sided)
  – Collective communication
  – Synchronization & locks
  – I/O & file systems
  – Fault tolerance
• Underlying hardware: networking technologies (InfiniBand, 40/100GigE, Aries, BlueGene), multi/many-core architectures, accelerators (NVIDIA and MIC)
• Co-design opportunities and challenges across the various layers, targeting performance, scalability and fault-resilience

Page 11: New Developments and Challenges

• Message Passing Interface (MPI)
• Partitioned Global Address Space (PGAS)
  – Hybrid Programming: MPI + PGAS
• Big Data
  – Hadoop
  – Memcached (Web 2.0)

Page 12: Parallel Programming Models Overview

• Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 share a single memory
• Distributed Memory Model (MPI – Message Passing Interface): P1, P2, P3 each have their own memory
• Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories combined into a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• In this presentation, we concentrate on MPI first, then on PGAS and hybrid MPI+PGAS

Page 13: MPI Overview and History

• Message passing library standardized by the MPI Forum
  – C and Fortran bindings
• Goal: a portable, efficient and flexible standard for writing parallel applications
• Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
• Evolution of MPI
  – MPI-1: 1994
  – MPI-2: 1996
  – MPI-3.0: 2008–2012, standardized before SC '12
  – Next plans: MPI 3.1, 3.2, …

Page 14: Major MPI Features

• Point-to-point two-sided communication
• Collective communication
• One-sided communication
• Job startup
• Parallel I/O
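The first two features above map onto a handful of calls. As a minimal, hedged C sketch (not from the deck), the following program pairs a two-sided send/receive with a collective reduction; it assumes any MPI library, e.g. MVAPICH2, and at least two ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Two-sided point-to-point: rank 0 sends a value to rank 1 */
        int token = 42;
        if (rank == 0 && size > 1)
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Collective: sum the ranks across all processes */
        int sum = 0;
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks = %d\n", sum);

        MPI_Finalize();
        return 0;
    }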

Page 15: How does MPI Plan to Meet Exascale Challenges?

• Power required for data-movement operations is one of the main challenges
• Non-blocking collectives
  – Overlap computation and communication
• Much-improved one-sided interface
  – Reduce synchronization between sender and receiver
• Manage concurrency
  – Improved interoperability with PGAS (e.g., UPC, Global Arrays, OpenSHMEM)
• Resiliency
  – New interfaces for detecting failures

Page 16: Major New Features in MPI-3

• Major features
  – Non-blocking collectives
  – Improved one-sided (RMA) model
  – MPI Tools Interface
• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

Page 17: Non-blocking Collective Operations

• Enable overlap of computation with communication
• Remove the synchronization effects of collective operations (with the exception of barrier)
• Non-blocking calls do not match blocking collective calls
  – An MPI implementation may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• Users must call collectives in the same order on all ranks
• Progress rules are the same as those for point-to-point
• Example new calls: MPI_Ibarrier, MPI_Iallreduce, …
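As an illustration of the overlap pattern described above (not from the slides), a small C sketch using MPI_Iallreduce: the collective is started, independent computation proceeds, and MPI_Wait completes it. do_independent_work() is a hypothetical placeholder for application compute.

    #include <mpi.h>

    /* hypothetical compute kernel that does not depend on the reduction result */
    static void do_independent_work(void) { /* application-specific computation */ }

    void overlapped_allreduce(double *local, double *global, int n)
    {
        MPI_Request req;

        /* Start the reduction without blocking */
        MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Overlap: compute on data that does not depend on 'global' */
        do_independent_work();

        /* Complete the collective before using the result */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }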

Page 18: Improved One-sided (RMA) Model

• [Figure: four processes each exposing memory through an RMA window; a window has a public and a private copy, with incoming RMA operations, local memory operations, and explicit synchronization between the two]
• The new RMA proposal has major improvements
• Easy to express irregular communication patterns
• Better overlap of communication and computation
• MPI-2: public and private windows
  – Synchronization between the windows is explicit
  – Works for non-cache-coherent systems
• MPI-3: two types of windows
  – Unified and Separate
  – The unified window leverages hardware cache coherence
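A hedged C sketch (not from the talk) of the MPI-3 RMA style discussed above: a window is allocated, passive-target locking is used, and a put plus flush completes the transfer without any action by the target process.

    #include <mpi.h>

    void rma_put_example(int target_rank, double value)
    {
        double  *base;
        MPI_Win  win;

        /* Collectively allocate one double per process and expose it through a window */
        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        *base = 0.0;
        MPI_Barrier(MPI_COMM_WORLD);          /* make sure every window is initialized */

        /* Passive-target synchronization: no matching call needed at the target */
        MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
        MPI_Put(&value, 1, MPI_DOUBLE, target_rank, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush(target_rank, win);      /* MPI-3: complete the put at the target */
        MPI_Win_unlock(target_rank, win);

        MPI_Win_free(&win);
    }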

Page 19: MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface
• Provides a standardized interface (MPI_T) to access MPI-internal information
  – Configuration and control information: eager limit, buffer sizes, …
  – Performance information: time spent blocking, memory usage, …
  – Debugging information: packet counters, thresholds, …
• External tools can build on top of this standard interface
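For illustration only, a small C sketch of how a tool might open the MPI-3 tools interface and count the exposed variables; the concrete set of control and performance variables is implementation-specific (MVAPICH2 exposes its own).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars, num_pvars;

        /* The tools interface may be initialized independently of MPI_Init */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        MPI_T_cvar_get_num(&num_cvars);   /* control variables (e.g., eager limit) */
        MPI_T_pvar_get_num(&num_pvars);   /* performance variables (e.g., counters) */

        printf("MPI_T exposes %d control and %d performance variables\n",
               num_cvars, num_pvars);

        MPI_T_finalize();
        return 0;
    }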

Page 20: Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • Global Arrays

Page 21: Compiler-based: Unified Parallel C

• UPC: a parallel extension to the C standard
• UPC specifications and standards:
  – Introduction to UPC and Language Specification, 1999
  – UPC Language Specifications, v1.0, Feb 2001
  – UPC Language Specifications, v1.1.1, Sep 2004
  – UPC Language Specifications, v1.2, 2005
  – UPC Language Specifications, v1.3, in progress – draft available
• UPC Consortium
  – Academic institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland, …
  – Government institutions: ARSC, IDA, LBNL, SNL, US DOE, …
  – Commercial institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
  – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  – Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for high performance, coding efficiency, irregular applications, …

Page 22: OpenSHMEM

• SHMEM implementations
  – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in the API across versions – for example:

                     SGI SHMEM      Quadrics SHMEM    Cray SHMEM
    Initialization   start_pes(0)   shmem_init        start_pes
    Process ID       _my_pe         my_pe             shmem_my_pe

• These differences made application codes non-portable
• OpenSHMEM is an effort to address this:
  "A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard." – OpenSHMEM Specification v1.0, by University of Houston and Oak Ridge National Lab
• SGI SHMEM is the baseline
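To show the flavor of the library-based PGAS model, a hedged C sketch against the OpenSHMEM 1.0-era API referenced above (start_pes/_my_pe); each PE writes a value into a symmetric variable on its neighbor with a one-sided put.

    #include <shmem.h>
    #include <stdio.h>

    /* Symmetric data object: exists at the same address on every PE */
    static long shared_value = 0;

    int main(void)
    {
        start_pes(0);                       /* OpenSHMEM 1.0-style initialization */

        int me   = _my_pe();
        int npes = _num_pes();
        int next = (me + 1) % npes;

        /* One-sided put: write into 'shared_value' on the next PE */
        shmem_long_p(&shared_value, (long)me, next);

        shmem_barrier_all();                /* ensure all puts have completed */

        printf("PE %d received %ld\n", me, shared_value);
        return 0;
    }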

Page 23: MPI+PGAS for Exascale Architectures and Applications

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared-memory programming models
• Applications can have kernels with different communication patterns, and can benefit from different models
• Re-writing complete applications can be a huge effort; instead, port critical kernels to the desired model
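A hedged sketch (not from the slides) of what such a hybrid program can look like under a unified runtime such as MVAPICH2-X, with MPI collectives for a bulk-synchronous kernel and OpenSHMEM one-sided atomics for an irregular kernel; the initialization order shown is an assumption and may differ by implementation.

    #include <mpi.h>
    #include <shmem.h>

    static long counter = 0;   /* symmetric variable used by the OpenSHMEM kernel */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        start_pes(0);                        /* assumed to attach to the same job */

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Kernel A: bulk-synchronous exchange expressed with an MPI collective */
        long local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel B: irregular, one-sided updates expressed with OpenSHMEM
         * (assumes MPI ranks and OpenSHMEM PEs coincide under the unified runtime) */
        shmem_long_inc(&counter, (rank + 1) % _num_pes());
        shmem_barrier_all();

        MPI_Finalize();
        return 0;
    }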

Page 24: Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics
• Benefits:
  – Best of the distributed-memory computing model
  – Best of the shared-memory computing model
• Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
• [Figure: an HPC application composed of kernels 1…N, with some kernels (e.g., 2 and N) implemented in PGAS and the rest in MPI]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420

Page 25: Challenges in Designing (MPI+X) at Exascale

• Scalability for millions to billions of processors
  – Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128–1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and accelerators
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …)

Page 26: MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,085 organizations (HPC centers, industry and universities) in 71 countries
  – More than 193,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 6th-ranked 462,462-core cluster (Stampede) at TACC
    • 19th-ranked 125,980-core cluster (Pleiades) at NASA
    • 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology, and many others
  – Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system

Page 27: Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for millions to billions of processors
  – Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Collective communication
  – Multicore-aware and hardware-multicast-based
  – Offload and non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with a unified runtime

Page 28: One-way Latency: MPI over IB

• [Figure: small- and large-message one-way latency (us) vs. message size for MVAPICH/MVAPICH2 over Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR and Mellanox ConnectIB-Dual FDR; small-message latencies range from about 0.99 us to 1.82 us across the configurations]
• Platforms: DDR, QDR – 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch

Page 29: Bandwidth: MPI over IB

• [Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size for the same set of adapters; peak unidirectional bandwidths of 1706, 1917, 3280, 3385, 6343 and 12485 MBytes/sec, and peak bidirectional bandwidths of 3341, 3704, 4407, 6521, 11643 and 21025 MBytes/sec]
• Platforms: DDR, QDR – 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch

Page 30: MVAPICH2 Two-sided Intra-Node Performance (Shared-memory and Kernel-based Zero-copy Support – LiMIC and CMA)

• Latest MVAPICH2 2.0a on Intel Sandy Bridge
• [Figure: intra-node small-message latency – about 0.19 us intra-socket and 0.45 us inter-socket]
• [Figure: intra-socket and inter-socket bandwidth for the CMA, shared-memory and LiMIC channels – roughly 12,000 MB/s peak in both cases]

Page 31: eXtended Reliable Connection (XRC) and Hybrid Mode

• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
• Both UD and RC/XRC have benefits; the hybrid transport combines the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD – MV2_USE_ONLY_UD=1; Hybrid – MV2_HYBRID_ENABLE_THRESHOLD=1
• [Figure: memory (MB/process) vs. number of connections for MVAPICH2-RC vs. MVAPICH2-XRC; normalized NAMD time on the apoa1, er-gre, f1atpase and jac datasets at 1024 cores; UD/Hybrid/RC time vs. number of processes, with improvements of 26–40% called out]
• M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08

Page 32: Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda repeated; the following slides cover collective communication)

Page 33: Hardware Multicast-aware MPI_Bcast on Stampede

• [Figure: MPI_Bcast latency, Default vs. Multicast, for small messages (2–512 bytes) and large messages (2K–128K bytes) at 102,400 cores, and for 16-byte and 32-KByte messages as the number of nodes varies]
• Platform: ConnectX-3 FDR (54 Gbps), 2.7 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, Mellanox IB FDR switch

Page 34: Shared-memory Aware Collectives

• MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB)
• MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB
• Runtime parameters: MV2_USE_SHMEM_REDUCE=0/1, MV2_USE_SHMEM_ALLREDUCE=0/1, MV2_USE_SHMEM_BARRIER=0/1
• [Figure: MPI_Reduce and MPI_Allreduce latency at 4096 cores, Original vs. Shared-memory designs; barrier latency for the point-to-point vs. shared-memory barrier at 128–1024 processes]

Page 35: Application Benefits with Non-blocking Collectives Based on CX-2 Collective Offload

• Modified P3DFFT with Offload-Alltoall performs up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast performs up to 4.5% better than the default version (512 processes)
• Modified Pre-Conjugate Gradient solver with Offload-Allreduce performs up to 21.8% better than the default version
• [Figure: application run-time vs. data size and vs. number of processes for PCG-Default vs. Modified-PCG-Offload; normalized HPL performance for HPL-Offload, HPL-1ring and HPL-Host as the HPL problem size (N) grows as a percentage of total memory]
• K. Kandalla et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
• K. Kandalla et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
• K. Kandalla et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
• K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker and D. K. Panda, Can Network-Offload-based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS '12

Page 36: Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network-topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of a network-topology-aware MPI library on the performance of parallel scientific applications?
• Reduced network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time at 2048 cores
• 15% improvement in Hypre execution time at 1024 cores
• [Figure: overall performance and breakdown of physical communication for MILC on Ranger, for varying system sizes, comparing default vs. topology-aware placement for 2048-core runs]
• H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12 (Best Paper and Best Student Paper finalist)

Page 37: Power and Energy Savings with Power-Aware Collectives

• Performance and power comparison: MPI_Alltoall with 64 processes on 8 nodes
• CPMD application performance and power savings (figure callouts: 5% and 32%)
• K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10

Page 38: Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda repeated; the following slides cover support for GPGPUs)

Page 39: GPU-Direct

• Collaboration between Mellanox and NVIDIA to converge on one memory-registration technique
• Both devices register a common host buffer
  – The GPU copies data to this buffer, and the network adapter can read directly from it (or vice versa)
• Note that GPU-Direct in this form does not let you bypass host memory
• The latest GPUDirect RDMA achieves this

Page 40: Sample Code – Without MPI Integration

• Data path: GPU → PCIe → CPU → NIC → switch (and the reverse at the receiver)

    At Sender:
        cudaMemcpy(sbuf, sdev, . . .);   /* stage GPU data into a host buffer */
        MPI_Send(sbuf, size, . . .);

    At Receiver:
        MPI_Recv(rbuf, size, . . .);
        cudaMemcpy(rdev, rbuf, . . .);   /* copy received data back to the GPU */

• Naïve implementation with standard MPI and CUDA
• High productivity but poor performance

Page 41: Sample Code – User-Optimized Code

• Data path: GPU → PCIe → CPU → NIC → switch, with the copy and the send pipelined in chunks

    At Sender:
        /* stage the GPU buffer into host memory chunk by chunk */
        for (j = 0; j < pipeline_len; j++)
            cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);

        for (j = 0; j < pipeline_len; j++) {
            while (result != cudaSuccess) {
                result = cudaStreamQuery(. . .);   /* poll the copy of chunk j */
                if (j > 0) MPI_Test(. . .);        /* progress earlier sends */
            }
            MPI_Isend(sbuf + j * blksz, blksz, . . .);
        }
        MPI_Waitall(. . .);

• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• The code at the sender side is mirrored at the receiver side
• User-level copying may not match the internal MPI design
• High performance but poor productivity

Page 42: Sample Code – MVAPICH2-GPU

    At Sender:
        MPI_Send(s_device, size, . . .);    /* s_device is a GPU device pointer */

    At Receiver:
        MPI_Recv(r_device, size, . . .);    /* r_device is a GPU device pointer */

    (pipelining and staging are handled inside MVAPICH2)

• MVAPICH2-GPU: standard MPI interfaces are used
• Takes advantage of Unified Virtual Addressing (CUDA >= 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity
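For completeness, a self-contained, hedged C sketch (not from the slides) of the same pattern: device buffers are passed directly to MPI_Send/MPI_Recv, assuming a CUDA-aware MPI library such as MVAPICH2-GPU and at least two ranks.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;              /* one million floats */
        float *d_buf;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaMemset(d_buf, 0, n * sizeof(float));

        /* Device pointers are passed straight to MPI; the CUDA-aware
         * library moves the data (staging or GPUDirect) internally. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }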

Page 43: MPI-Level Two-sided Communication

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send) for 4 MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend) for 4 MB messages
• [Figure: GPU-GPU latency (us) vs. message size (32K–4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better]
• H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11

Page 44: MVAPICH2 1.8, 1.9 and 2.0a Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication among multiple GPU adapters per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers

Page 45: Application-Level Evaluation

• LBM-CUDA (courtesy: Carlos Rosale, TACC): Lattice Boltzmann Method for multiphase flows with large density ratios; one process/GPU per node, 16 nodes
• AWP-ODC (courtesy: Yifeng Cui, SDSC): a seismic modeling code, Gordon Bell Prize finalist at SC 2010; 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node
• [Figure: LBM-CUDA step time for domain sizes 256x256x256 to 512x512x512, MPI vs. MPI-GPU, with 9.4–13.7% improvements; AWP-ODC total execution time with 1 and 2 GPUs/processes per node, with 11.1% and 7.9% improvements]

Page 46: GPU-Direct RDMA with CUDA 5

• Fastest possible communication between the GPU and other PCI-E devices
• The network adapter can directly read data from GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication
• [Diagram: InfiniBand adapter reading GPU memory directly, bypassing the CPU chipset and system memory]
• MVAPICH2 with GDR support is under development; MVAPICH2-GDR Alpha is available for users

Page 47: Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA

• Peer-to-peer (P2P) bottlenecks on Sandy Bridge (SNB E5-2670): P2P write ~5.2 GB/s, P2P read < 1.0 GB/s
• Design of MVAPICH2
  – Hybrid design
  – Takes advantage of GPU-Direct RDMA for writes to the GPU
  – Uses the host-based buffered design in current MVAPICH2 for reads
  – Works around the bottlenecks transparently
• [Diagram: IB adapter, system memory, GPU memory, GPU, CPU and chipset, showing the P2P read/write paths]

Page 48: Performance of MVAPICH2 with GPU-Direct-RDMA – GPU-GPU Inter-node MPI Latency

• [Figure: small-message (1 byte – 4K) and large-message (16K – 4M) latency for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; lower is better. GDR reduces small-message latency from about 19.1 us to about 6.2 us, a 67.5% improvement]
• Setup: based on MVAPICH2-1.9; Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.5, OFED 1.5.4.1 with GPU-Direct-RDMA patch

Page 49: Performance of MVAPICH2 with GPU-Direct-RDMA – GPU-GPU Inter-node MPI Uni-directional Bandwidth

• [Figure: small-message (1 byte – 4K) and large-message (8K – 2M) bandwidth (MB/s) for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; higher is better, with improvements of 2.8x and 33% called out]
• Setup: same as the previous slide (MVAPICH2-1.9, Intel Sandy Bridge E5-2670, NVIDIA Tesla K20c, Mellanox ConnectX-3 FDR, CUDA 5.5, OFED 1.5.4.1 with GDR patch)

Page 50: Performance of MVAPICH2 with GPU-Direct-RDMA – GPU-GPU Inter-node MPI Bi-directional Bandwidth

• [Figure: small-message (1 byte – 4K) and large-message (8K – 2M) bi-directional bandwidth (MB/s) for MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR; higher is better, with improvements of 3x and 54% called out]
• Setup: same as the previous slides

Page 51: Getting Started with GDR Experimentation

• Two modules are needed
  – The GPUDirect RDMA (GDR) driver from Mellanox
  – MVAPICH2-GDR from OSU
• Send a note to [email protected]
• You will get alpha versions of the GDR driver and MVAPICH2-GDR (based on the MVAPICH2 2.0a release)
• You can get started with this version
• The MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR
• As the GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community

Page 52: Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda repeated; the following slides cover support for Intel MICs)

Page 53: MPI Applications on MIC Clusters

• Flexibility in launching MPI jobs on clusters with Xeon Phi
• Execution modes, from multi-core-centric to many-core-centric:
  – Host-only: MPI program runs on the Xeon only
  – Offload (or reverse offload): MPI program runs on the Xeon, with computation offloaded to the Xeon Phi
  – Symmetric: MPI programs run on both the Xeon and the Xeon Phi
  – Coprocessor-only: MPI program runs on the Xeon Phi only

Page 54: Data Movement on Intel Xeon Phi Clusters

• MICs are connected as PCIe devices – flexibility, but complexity
• Critical for runtimes to optimize data movement while hiding the complexity
• Communication paths between MPI processes on two nodes (each with two CPU sockets, MICs and an IB adapter):
  1. Intra-socket
  2. Inter-socket
  3. Inter-node
  4. Intra-MIC
  5. Intra-socket MIC-MIC
  6. Inter-socket MIC-MIC
  7. Inter-node MIC-MIC
  8. Intra-socket MIC-Host
  9. Inter-socket MIC-Host
  10. Inter-node MIC-Host
  11. Inter-node MIC-MIC with the IB adapter on the remote socket, and more …

Page 55: MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload mode
• Intra-node communication
  – Coprocessor-only mode
  – Symmetric mode
• Inter-node communication
  – Coprocessor-only mode
  – Symmetric mode
• Multi-MIC node configurations
• Comparison with Intel's MPI library

Page 56: MIC-RemoteHost (Closer to HCA) Point-to-Point Communication

• [Figure: osu_latency (small and large messages), osu_bw and osu_bibw for MV2 vs. MV2-MIC; peak bandwidths of 5681 and 8790 MB/sec are called out]

Page 57: Inter-Node Symmetric Mode Collective Communication

• 8 nodes with 8 processes/host and 8 processes/MIC
• [Figure: osu_scatter and osu_allgather latency for small (1–4K bytes) and large (8K–512K bytes) messages, MV2 vs. MV2-MIC; lower is better]

Page 58: Inter-Node Symmetric Mode – P3DFFT Performance

• Goal: explore different threading levels for the host and the Xeon Phi
• [Figure: P3DFFT (512x512x4K) time per step for 4+4 and 8+8 host+MIC processes on 32 nodes, and 3DStencil communication-kernel time per step for 2+2 and 2+8 host+MIC processes with 8 threads/process on 8 nodes, comparing MV2-INTRA-OPT vs. MV2-MIC; improvements of 50.2% and 16.2% are called out]

Page 59: MVAPICH2 and Intel MPI – Intra-MIC

• [Figure: small- and large-message latency, bandwidth and bi-directional bandwidth for Intel MPI vs. MVAPICH2-MIC within a single MIC; lower latency / higher bandwidth is better]

Page 60: MVAPICH2 and Intel MPI – Intra-node Host-MIC

• [Figure: small- and large-message latency, bandwidth and bi-directional bandwidth for Intel MPI vs. MVAPICH2-MIC between the host and the MIC within a node]

Page 61: MVAPICH2 and Intel MPI – Inter-node MIC-MIC

• [Figure: small- and large-message latency, bandwidth and bi-directional bandwidth for Intel MPI vs. MVAPICH2-MIC between MICs on different nodes]

Page 62: Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda repeated; the following slides cover hybrid MPI+PGAS programming with a unified runtime)

Page 63: MVAPICH2-X for Hybrid MPI + PGAS Applications

• [Diagram: MPI, OpenSHMEM, UPC and hybrid (MPI + PGAS) applications issue MPI calls, OpenSHMEM calls and UPC calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE and iWARP]
• A unified communication runtime for MPI, UPC and OpenSHMEM is available with MVAPICH2-X 1.9 onwards!
  – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant
  – Scalable inter-node and intra-node communication – point-to-point and collectives

Page 64: Micro-Benchmark Performance (OpenSHMEM)

• [Figure: atomic-operation latency (fadd, finc, cswap, swap, add, inc) for OpenSHMEM-GASNet vs. OpenSHMEM-OSU, with 41% and 16% improvements called out; broadcast, collect and reduce latency at 256 processes for OpenSHMEM-OSU 1.9 vs. 2.0a across message sizes]

Page 65: Hybrid MPI+OpenSHMEM Graph500 Design

• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
  – At 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
  – At 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple
• [Figure: execution time vs. number of processes (4K–16K), and traversed edges per second (TEPS) for strong scaling (scale 26–29) and weak scaling (1024–8192 processes), comparing MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM)]
• J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
• J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Page 66: Hybrid MPI+OpenSHMEM Sort Application

• Performance of the hybrid (MPI+OpenSHMEM) sort application
• Execution time
  – 4 TB input at 4,096 cores: MPI – 2408 seconds, Hybrid – 1172 seconds
  – 51% improvement over the MPI-based design
• Strong scalability (constant input size of 500 GB)
  – At 4,096 cores: MPI – 0.16 TB/min, Hybrid – 0.36 TB/min
  – 55% improvement over the MPI-based design
• [Figure: execution time for 500 GB – 4 TB inputs at 512–4K processes, and sort rate (TB/min) vs. number of processes, MPI vs. Hybrid]

Page 67: Hybrid MPI+UPC NAS-FT

• Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
• A truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC-GASNet
  – 30% improvement over UPC-OSU
• [Figure: time (s) for NAS FT problem/system sizes B-64, C-64, B-128 and C-128 with UPC-GASNet, UPC-OSU and Hybrid-OSU]
• J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

Page 68: MVAPICH2/MVAPICH2-X – Plans for Exascale

• Performance and memory scalability toward 500K–1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Enhanced optimization for GPU support and accelerators
  – Extending GPGPU support (GPU-Direct RDMA) with CUDA 6.0 and beyond
  – Support for Intel MIC (Knights Landing)
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and accelerators

Page 69: Two Major Categories of Applications (repeated from Page 4; the remainder of the talk focuses on Big Data/Enterprise/Commercial computing – Hadoop and Memcached)

Page 70: Big Data Applications and Analytics

• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
• 35 Zettabytes of data will be generated and consumed by the end of this decade
• The meaning of Big Data – the 3 V's: Big Volume, Big Velocity, Big Variety
  – Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

Page 71: Can High-Performance Interconnects Benefit Big Data Processing?

• Beginning to draw interest from the enterprise domain
  – Oracle, IBM and Google are working along these directions
• Performance in the enterprise domain remains a concern
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs?

Page 72: New Developments in Message Passing, PGAS and Big Data–Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum –Memcached is also used for Web 2.0 ... How does

[Figure: three software stacks compared]
– Current design: Application → Sockets → 1/10 GigE network
– Enhanced designs: Application → Accelerated Sockets → Verbs / hardware offload → 10 GigE or InfiniBand
– Our approach: Application → OSU design → Verbs interface → 10 GigE or InfiniBand

• Sockets were not designed for high performance
– Stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop)
– Zero-copy is not available for non-blocking sockets

A minimal verbs-level sketch follows.
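The verbs interface named above is the native InfiniBand API that the OSU designs program against. The following is a hedged, minimal device-query sketch using libibverbs (not code from the OSU packages); a full design would go on to allocate a protection domain, register memory, and post RDMA work requests.

/* Hedged sketch: querying an InfiniBand device through libibverbs.
 * Compile with: cc verbs_query.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_device_attr attr;
    if (ctx && ibv_query_device(ctx, &attr) == 0) {
        printf("device: %s, max QPs: %d, max CQ entries: %d\n",
               ibv_get_device_name(devs[0]), attr.max_qp, attr.max_cqe);
        /* Next steps in a real design: ibv_alloc_pd(), ibv_reg_mr(),
         * ibv_create_cq(), ibv_create_qp(), then post send/RDMA work requests. */
    }

    if (ctx)
        ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}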

Page 73: Big Data Processing with Hadoop

• The open-source implementation of the MapReduce programming model for Big Data analytics
• Framework includes MapReduce, HDFS, and HBase
• The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase
• The model scales, but the intermediate phases involve a large amount of communication

Page 74: Hadoop-RDMA Project

• High-performance design of Hadoop over RDMA-enabled interconnects
– High-performance design with native InfiniBand support at the verbs level for the HDFS, MapReduce, and RPC components
– Easily configurable for both native InfiniBand and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.1
– Based on Apache Hadoop 0.20.2
– A version based on Apache Hadoop 1.2.1 will be released soon
– Compliant with Apache Hadoop 0.20.2 APIs and applications
– Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR), various multi-core platforms, and different file systems with disks and SSDs
– http://hadoop-rdma.cse.ohio-state.edu

Page 75: Reducing Communication Time with HDFS-RDMA Design

• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

[Figure: HDFS communication time (s) vs. file size (2-10 GB) for 10GigE, IPoIB (QDR), and OSU-IB (QDR); OSU-IB reduces communication time by 30%]

Page 76: Evaluations using TestDFSIO for HDFS-RDMA Design

• Cluster with 8 HDD DataNodes, single disk per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
– 61% improvement over IPoIB (QDR) for 20GB file size

[Figure: TestDFSIO total throughput (MBps) vs. file size (5-20 GB) for the cluster with 8 HDD nodes (8 maps) and the cluster with 4 SSD nodes (4 maps), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR); throughput increased by 24% (HDD) and 61% (SSD)]

Page 77: Performance of Sort using Map-Reduce-RDMA with HDD

• With 4 HDD DataNodes for 20GB sort
– 26% improvement over IPoIB (QDR)
– 38% improvement over Hadoop-A-IB (QDR)

[Figure: Sort job execution time (sec) vs. sort size with 4 HDD DataNodes (5-20 GB) and 8 HDD DataNodes (25-40 GB), comparing 1GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR)]

• With 8 HDD DataNodes for 40GB sort
– 24% improvement over IPoIB (QDR)
– 32% improvement over Hadoop-A-IB (QDR)

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013

Page 78: TeraSort Performance with Map-Reduce-RDMA

• 100GB TeraSort with 12 DataNodes and 200GB TeraSort with 24 DataNodes
– 41% improvement over IPoIB (QDR)
– 7% improvement over Hadoop-A-IB (QDR)

[Figure: TeraSort job execution time (sec) for 100 GB on a 12-node cluster and 200 GB on a 24-node cluster, comparing 1GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR)]

Page 79: Evaluations using PUMA Workload

• 46% improvement in Adjacency List over IPoIB (32 Gbps) for 30 GB data size
• 33% improvement in Sequence Count over IPoIB (32 Gbps) for 50 GB data size

[Figure: PUMA workload execution time (sec) for Adjacency List (30 GB), Self Join (30 GB), Word Count (50 GB), Sequence Count (50 GB), and Inverted Index (50 GB), comparing IPoIB (QDR) and OSU-IB (QDR)]

Page 80: Memcached Architecture

• Distributed caching layer
– Allows spare memory from multiple nodes to be aggregated
– General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
• Native IB-verbs-level design and evaluation with
– Memcached server: 1.4.9
– Memcached client: libmemcached 0.52 (a client-side sketch follows the figure)

[Figure: Memcached architecture – web frontend servers (Memcached clients) connect over high-performance networks to Memcached servers (CPUs, main memory, SSD/HDD), which in turn front the database servers]
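A minimal client-side sketch with libmemcached is shown below. It is not the OSU benchmark code; the server address, key, and value are placeholders, and error handling is kept to a minimum.

/* Hedged sketch: a libmemcached client performing one set and one get.
 * Build with: cc memc_client.c -lmemcached */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* placeholder server */

    const char *key = "sample-key";
    const char *value = "sample-value";

    /* Store the value (no expiration, no flags). */
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          value, strlen(value), 0, 0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    /* Fetch it back; the returned buffer is malloc'd and must be freed. */
    size_t len = 0;
    uint32_t flags = 0;
    char *fetched = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (fetched) {
        printf("got: %.*s\n", (int) len, fetched);
        free(fetched);
    }

    memcached_free(memc);
    return 0;
}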

Page 81: Memcached Get Latency (Small Message)

• Memcached Get latency
– 4 bytes, RC/UD: DDR 6.82/7.55 us; QDR 4.28/4.86 us
– 2K bytes, RC/UD: DDR 12.31/12.78 us; QDR 8.19/8.46 us
• Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster

A sketch of how such a latency measurement can be structured follows the figure.

[Figure: Memcached Get latency (us) vs. message size (1 byte to 2K) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR), comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, 1GigE, and 10GigE]
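The latencies above are averages over many Get operations on a small key. A minimal sketch of how such a measurement loop might be structured with libmemcached follows; the iteration count, key, and 2K value size are illustrative, not the actual OSU benchmark.

/* Hedged sketch: timing repeated Gets of a 2K value with libmemcached.
 * Build with: cc memc_latency.c -lmemcached */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* placeholder server */

    const char *key = "latency-key";
    char payload[2048];                                /* 2K value, as in the plots */
    memset(payload, 'a', sizeof(payload));
    memcached_set(memc, key, strlen(key), payload, sizeof(payload), 0, 0);

    const int iters = 10000;                           /* illustrative count */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        size_t len;
        uint32_t flags;
        memcached_return_t rc;
        char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        free(val);                                     /* free each returned buffer */
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average Get latency: %.2f us over %d iterations\n", usec / iters, iters);

    memcached_free(memc);
    return 0;
}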

Page 82: Memcached Get TPS (4 bytes)

• Memcached Get transactions per second for 4-byte messages
– On IB QDR: 1.4M TPS (RC) and 1.3M TPS (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB

[Figure: Memcached Get throughput (thousands of transactions per second) vs. number of clients (1-1K in the left panel, 4 and 8 in the right panel), comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, and 1GigE]

Page 83: Application-Level Evaluation – Workload from Yahoo

• Real application workload: RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to that of the pure RC design

[Figure: Yahoo workload response time (ms) vs. number of clients (1-8 and 64-1024), comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, and OSU-Hybrid-IB]

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters using Hybrid Transport, CCGrid '12

Page 84: More Details on Big Data and Benchmarking Will Be Covered ...

• Keynote Talk: Accelerating Big Data Processing with Hadoop-RDMA
– Forum of Big Data Computing (Oct 30th, 2:00-2:30pm)
• Keynote Talk: Challenges in Benchmarking and Evaluating Big Data Processing Middleware on Modern Clusters
– Big Data Benchmarks, Performance Optimization and Emerging Hardware (BPOE) Workshop (Oct 31st, 2:05-2:45pm)

Page 85: Concluding Remarks

• InfiniBand with the RDMA feature is gaining momentum in HPC and Big Data systems, delivering the best performance and seeing growing usage
• As the HPC community moves to Exascale, new solutions are needed in the MPI+PGAS stacks for supporting GPUs and accelerators
• Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X, and their performance benefits
• New solutions are also needed to re-design software libraries for Big Data environments to take advantage of InfiniBand and RDMA
• Such designs will allow application scientists and engineers to take advantage of upcoming Exascale systems

Page 86: Funding Acknowledgments

Funding Support by
Equipment Support by

Page 87: Personnel Acknowledgments


Current Students

– N. Islam (Ph.D.)

– J. Jose (Ph.D.)

– M. Li (Ph.D.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekhar (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)


Past Research Scientist – S. Sur

Current Post-Docs

– X. Lu

– M. Luo

– K. Hamidouche

Current Programmers

– M. Arnold

– J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– M. Rahman (Ph.D.)

– R. Shir (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associate

– H. Subramoni

Past Programmers

– D. Bureddy

Page 89: Multiple Positions Available in My Group

• Looking for bright and enthusiastic personnel to join as
– Post-Doctoral Researchers (博士后)
– PhD Students (博士研究生)
– MPI Programmer/Software Engineer (MPI软件开发工程师)
– Hadoop/Big Data Programmer/Software Engineer (大数据软件开发工程师)
• If interested, please contact me at this conference and/or send an e-mail to [email protected] or [email protected]

Page 90: Thanks, Q&A

Thanks, Q&A?
谢谢,请提问? (Thank you, please ask your questions.)