
Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects

Dhabaleswar K. (DK) Panda, The Ohio State University

E-mail: [email protected] http://www.cse.ohio-state.edu/~panda

Keynote Talk at ADMS 2014

Introduction to Big Data Applications and Analytics

• Big Data has become one of the most important elements of business analytics

• Provides groundbreaking opportunities for enterprise information management and decision making

• The amount of data is exploding; companies are capturing and digitizing more information than ever

• The rate of information growth appears to be exceeding Moore’s Law


4V Characteristics of Big Data


• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety
(Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)

• 4/5V's of Big Data – 3V + Veracity, Value

Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?

The global Internet population grew 6.59% from 2010 to 2011 and now represents 2.1 Billion People.

http://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute


Data Management and Processing in Modern Datacenters

• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
– Front-end data accessing and serving (online): Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (offline): HDFS, MapReduce, Spark

Overview of Web 2.0 Architecture and Memcached

• Three-layer architecture of Web 2.0: Web Servers, Memcached Servers, Database Servers
• Memcached is a core component of the Web 2.0 architecture


Memcached Architecture

• Distributed caching layer
– Allows aggregating spare memory from multiple nodes
– General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
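To make this usage pattern concrete, here is a minimal sketch of caching a result with the standard libmemcached C client (the same client that appears later in the RDMA-Memcached design); the server address, key, value, and expiration time are illustrative assumptions.

```c
/* Minimal sketch of the common Memcached usage pattern (caching a DB/API
 * result) with the libmemcached C client; server, key, and value are
 * illustrative assumptions, not values from the talk. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);      /* assumed server */

    const char *key   = "query:top_customers";            /* hypothetical key */
    const char *value = "serialized-result-set";          /* e.g., a cached DB result */

    /* Store the result with a 60-second expiration. */
    memcached_set(memc, key, strlen(key), value, strlen(value), 60, 0);

    /* Later: read it back instead of hitting the database again. */
    size_t len; uint32_t flags; memcached_return_t rc;
    char *cached = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("cached: %.*s\n", (int)len, cached);
        free(cached);                                     /* buffer is malloc'd */
    }

    memcached_free(memc);
    return 0;
}
```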


HBase Overview

• Apache Hadoop Database (http://hbase.apache.org/)

• Semi-structured database, which is highly scalable

• Integral part of many datacenter applications – eg: Facebook Social Inbox

• Developed in Java for platform-independence and portability

• Uses sockets for communication!

(HBase Architecture)


Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop; highly reliable and fault-tolerant

• Adopted by many reputed organizations – eg: Facebook, Yahoo!

• NameNode: stores the file system namespace

• DataNode: stores data blocks

• Developed in Java for platform-independence and portability

• Uses sockets for communication!

(HDFS Architecture)


Data Movement in Hadoop MapReduce

• Map and Reduce tasks carry out the total job execution
– Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk
– Reduce tasks obtain this data via shuffle from the TaskTrackers, operate on it, and write the output to HDFS
• Communication in the shuffle phase uses HTTP over Java sockets
(Figure: disk operations and bulk data transfer between Map and Reduce tasks.)

Spark Architecture Overview

• An in-memory data-processing framework
– Iterative machine learning jobs
– Interactive data analytics
– Scala-based implementation
– Runs standalone or on YARN and Mesos
• Scalable and communication intensive
– Wide dependencies between Resilient Distributed Datasets (RDDs)
– MapReduce-like shuffle operations to repartition RDDs
– Sockets-based communication

http://spark.apache.org

Presentation Outline

• Overview of Modern Clusters, Interconnects and Protocols
• Challenges for Accelerating Data Management and Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based design for Memcached and HBase
– RDMA-based Memcached
– Case study with OLTP
– SSD-assisted hybrid Memcached
– RDMA-based HBase
• RDMA-based designs for Apache Hadoop and Spark
– Case studies with HDFS, MapReduce, and Spark
– RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A

High-End Computing (HEC): PetaFlop to ExaFlop

• Expected to have an ExaFlop system in 2020-2022!
(Chart: performance trend – 100 PFlops in 2015; 1 EFlops in 2018?)

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

(Chart: number of clusters and percentage of clusters in the Top500 over time.)

High End Computing (HEC)

• High End Computing (HEC) grows dramatically
– High Performance Computing
– Big Data Computing
• Technology advancement
– Multi-core/many-core technologies and accelerators
– Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM)
– Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
(Pictured: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10) in the Top500.)

Overview of High Performance Interconnects

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
– InfiniBand
– 10 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very good performance
– Low latency (a few microseconds)
– High bandwidth (100 Gb/s with dual FDR InfiniBand)
– Low CPU overhead (5-10%)
• The OpenFabrics software stack (www.openfabrics.org), with IB, iWARP, and RoCE interfaces, is driving HPC systems
• Many such systems in the Top500 list

All Interconnects and Protocols in the OpenFabrics Stack
(Diagram: applications/middleware can use the sockets interface over kernel TCP/IP (1/10/40 GigE), hardware-offloaded TCP/IP (10/40 GigE-TOE), IPoIB, RSockets, or SDP, or the verbs interface over iWARP, RoCE, or native InfiniBand RDMA, each running on the corresponding Ethernet, iWARP, RoCE, or InfiniBand adapters and switches.)

Trends of Networking Technologies in TOP500 Systems
• Percentage share of InfiniBand is steadily increasing
(Chart: Interconnect Family – Systems Share.)

Large-scale InfiniBand Installations

• 223 IB clusters (44.3%) in the June 2014 Top500 list (http://www.top500.org)
• Installations in the Top 50 (25 systems):
– 519,640 cores (Stampede) at TACC (7th)
– 62,640 cores (HPC2) in Italy (11th)
– 147,456 cores (Super MUC) in Germany (12th)
– 76,032 cores (Tsubame 2.5) at Japan/GSIC (13th)
– 194,616 cores (Cascade) at PNNL (15th)
– 110,400 cores (Pangea) at France/Total (16th)
– 96,192 cores (Pleiades) at NASA/Ames (21st)
– 73,584 cores (Spirit) at USA/Air Force (24th)
– 77,184 cores (Curie thin nodes) at France/CEA (26th)
– 65,320 cores (iDataPlex DX360M4) at Germany/Max-Planck (27th)
– 120,640 cores (Nebulae) at China/NSCS (28th)
– 72,288 cores (Yellowstone) at NCAR (29th)
– 70,560 cores (Helios) at Japan/IFERC (30th)
– 138,368 cores (Tera-100) at France/CEA (35th)
– 222,072 cores (QUARTETTO) in Japan (37th)
– 53,504 cores (PRIMERGY) in Australia (38th)
– 77,520 cores (Conte) at Purdue University (39th)
– 44,520 cores (Spruce A) at AWE in UK (40th)
– 48,896 cores (MareNostrum) at Spain/BSC (41st)
– and many more!

Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High performance data transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple transport services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple operations
– Send/Recv
– RDMA Read/Write
– Atomic operations (very unique)
• Enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations (see the sketch after this list)
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
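As one illustration of how the atomic operations can serve as a building block for distributed locks or counters, the sketch below posts a fetch-and-add work request with the verbs API; it assumes the queue pair, memory registration, and the remote address/rkey exchange have already been done, and it is not the talk's actual lock implementation.

```c
/* Sketch: post an IB atomic fetch-and-add (e.g., as a building block for a
 * distributed counter or ticket lock). Assumes qp, mr, remote_addr, and rkey
 * were created/exchanged during connection setup. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_fetch_add(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *local_buf,
                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* the old remote value lands here */
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;  /* 8-byte aligned remote counter */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 1;            /* value to add atomically */

    return ibv_post_send(qp, &wr, &bad_wr);  /* completion shows up on the send CQ */
}
```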


Communication in the Channel Semantics (Send/Receive Model)
• The send WQE contains information about the send buffer (multiple non-contiguous segments)
• The receive WQE contains information about the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data
• The processor is involved only to:
1. Post receive WQEs
2. Post send WQEs
3. Pull completed CQEs out of the CQ
(Diagram: each side has a processor, memory segments, a QP (send/recv queues), and a CQ attached to its InfiniBand device; the devices move the data and generate a hardware ACK.)
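A minimal sketch of these three host-side steps with the verbs API is shown below; it assumes the QP, CQ, and registered memory regions have already been created and connected.

```c
/* Sketch of the three host-side steps in the channel semantics: post a
 * receive WQE, post a send WQE, and reap completions from the CQ.
 * Assumes the QP, CQ, and registered memory regions already exist. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int send_recv_once(struct ibv_qp *qp, struct ibv_cq *cq,
                   struct ibv_mr *recv_mr, struct ibv_mr *send_mr, size_t len)
{
    /* 1. Post a receive WQE describing where an incoming message may land. */
    struct ibv_sge rsge = { (uintptr_t)recv_mr->addr, (uint32_t)len, recv_mr->lkey };
    struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &rsge, .num_sge = 1 }, *bad_r;
    if (ibv_post_recv(qp, &rwr, &bad_r)) return -1;

    /* 2. Post a send WQE describing the send buffer. */
    struct ibv_sge ssge = { (uintptr_t)send_mr->addr, (uint32_t)len, send_mr->lkey };
    struct ibv_send_wr swr, *bad_s;
    memset(&swr, 0, sizeof(swr));
    swr.wr_id = 2; swr.sg_list = &ssge; swr.num_sge = 1;
    swr.opcode = IBV_WR_SEND; swr.send_flags = IBV_SEND_SIGNALED;
    if (ibv_post_send(qp, &swr, &bad_s)) return -1;

    /* 3. Pull completed CQEs out of the CQ (busy poll, for brevity). */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                               /* the HCA performs the actual data movement */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```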


Communication in the Memory Semantics (RDMA Model)
• The send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
• The initiator processor is involved only to:
1. Post the send WQE
2. Pull the completed CQE out of the send CQ
• No involvement from the target processor
(Diagram: the initiator's InfiniBand device writes directly into the target's memory segment and returns a hardware ACK; the target's QP, CQ, and processor stay idle.)
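A corresponding sketch of the initiator side with the verbs API is shown below; the remote address and rkey are assumed to have been exchanged out-of-band during connection setup.

```c
/* Sketch of the memory semantics: the initiator posts a single RDMA-write
 * WQE that names both the local source buffer and the remote destination
 * (address + rkey obtained out-of-band); the target CPU is not involved. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_mr *src_mr, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { (uintptr_t)src_mr->addr, (uint32_t)len, src_mr->lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 42;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;       /* IBV_WR_RDMA_READ for the read side */
    wr.send_flags = IBV_SEND_SIGNALED;       /* completion goes to the initiator's CQ */
    wr.wr.rdma.remote_addr = remote_addr;    /* remotely registered target memory */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```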



Wide Adaptation of RDMA Technology

• Message Passing Interface (MPI) for HPC

• Parallel file systems
– Lustre
– GPFS

• Delivering excellent performance (latency, bandwidth and CPU Utilization)

• Delivering excellent scalability


MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC)
– Used by more than 2,200 organizations in 73 countries
– More than 221,000 downloads from the OSU site directly
– Empowering many TOP500 clusters
• 7th ranked 519,640-core cluster (Stampede) at TACC
• 13th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
• 23rd ranked 96,192-core cluster (Pleiades) at NASA
• many others . . .
– Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system

One-way Latency: MPI over IB with MVAPICH2
(Charts: small-message latency of roughly 0.99-1.82 us and large-message latency across Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR.)
Platforms: DDR, QDR – 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, IB switch; FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, IB switch; ConnectIB-Dual FDR – 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, IB switch.

Bandwidth: MPI over IB with MVAPICH2
(Charts: unidirectional bandwidth of roughly 1,700 to 12,810 MBytes/sec and bidirectional bandwidth of up to roughly 24,727 MBytes/sec across Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR.)
Platforms: DDR, QDR – 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, IB switch; FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, IB switch; ConnectIB-Dual FDR – 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, IB switch.

Can High-Performance Interconnects Benefit Data Management and Processing?
• Most current Big Data systems use an Ethernet infrastructure with sockets

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw interest – Oracle, IBM, Google, Intel are working along these directions

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?

• Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop and Memcached?


Designing Communication and I/O Libraries for Big Data Systems: Challenges
(Diagram: Applications run over Big Data middleware (HDFS, MapReduce, HBase, Spark, and Memcached), which today relies on sockets-based programming models – or other protocols? A communication and I/O library must provide point-to-point communication, threaded models and synchronization, QoS, fault tolerance, I/O and file systems, virtualization, and benchmarks over modern networking technologies (InfiniBand, 1/10/40GigE, and intelligent NICs), storage technologies (HDD and SSD), and commodity multi-/many-core architectures and accelerators.)

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Current design: Application → Sockets → 1/10 GigE network
• Enhanced design (accelerated sockets): Application → Accelerated Sockets → Verbs / hardware offload → 10 GigE or InfiniBand
• Our approach: Application → OSU Design → Verbs interface → 10 GigE or InfiniBand
• Sockets were not designed for high performance
– Stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop)
– Zero-copy is not available for non-blocking sockets


The High-Performance Big Data (HiBD) Project

• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
• RDMA for Apache HBase and Spark
• http://hibd.cse.ohio-state.edu

RDMA for Memcached Distribution

• High-performance design of Memcached over RDMA-enabled interconnects
– High-performance design with native InfiniBand and RoCE support at the verbs level for the Memcached and libMemcached components
– Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.1
– Based on Memcached 1.4.20 and libMemcached 1.0.18
– Compliant with Memcached APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu

RDMA for Apache Hadoop 1.x/2.x Distributions

• High-performance design of Hadoop over RDMA-enabled interconnects
– High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current releases: 0.9.9/0.9.1
– Based on Apache Hadoop 1.2.1/2.4.1
– Compliant with Apache Hadoop 1.2.1/2.4.1 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• Different file systems with disks and SSDs
– http://hibd.cse.ohio-state.edu

OSU HiBD Micro-Benchmark (OHB) Suite – Memcached

• Released in OHB 0.7.1 (ohb_memlat)
• Evaluates the performance of stand-alone Memcached
• Three different micro-benchmarks
– SET micro-benchmark: micro-benchmark for Memcached set operations
– GET micro-benchmark: micro-benchmark for Memcached get operations
– MIX micro-benchmark: micro-benchmark for a mix of Memcached set/get operations (read:write ratio is 90:10)
• Calculates the average latency of Memcached operations
• Can measure throughput in transactions per second
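The actual ohb_memlat benchmark ships with the HiBD releases; purely to illustrate what such a GET latency measurement does, a hand-rolled sketch using libmemcached might look like the following (server address, iteration count, and value size are assumptions, not the OHB source).

```c
/* Minimal sketch of a GET-latency micro-benchmark in the spirit of
 * ohb_memlat (not the actual OHB code): time repeated memcached_get()
 * calls on a fixed-size value and report the average latency. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    enum { ITERS = 10000, VAL_SIZE = 4 };            /* assumed parameters */
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);  /* assumed server */

    char value[VAL_SIZE];
    memset(value, 'x', sizeof(value));
    memcached_set(memc, "bench", 5, value, sizeof(value), 0, 0);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {
        size_t len; uint32_t flags; memcached_return_t rc;
        char *v = memcached_get(memc, "bench", 5, &len, &flags, &rc);
        free(v);                                      /* returned buffer is malloc'd */
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average GET latency: %.2f us\n", usec / ITERS);

    memcached_free(memc);
    return 0;
}
```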



Design Challenges of RDMA-based Memcached

• Can Memcached be re-designed from the ground up to utilize RDMA capable networks?

• How efficiently can we utilize RDMA for Memcached operations?

• Can we leverage the best features of RC and UD to deliver both high performance and scalability to Memcached?

• Memcached applications need not be modified; the middleware uses verbs interface if available


Memcached-RDMA Design

• Server and client perform a negotiation protocol
– The master thread assigns clients to an appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly with that thread and is "bound" to it
• All other Memcached data structures are shared among the RDMA and sockets worker threads
• Native IB-verbs-level design and evaluation with
– Server: Memcached (http://memcached.org)
– Client: libmemcached (http://libmemcached.org)
– Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)
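As a side note on the RC/UD distinction above, the transport is fixed when a queue pair is created; the sketch below (with assumed queue depths, not the actual Memcached-RDMA code) shows where that choice is made with the verbs API.

```c
/* Sketch: the RC vs. UD choice boils down to the qp_type selected at
 * queue-pair creation time. Assumes a protection domain and completion
 * queue have already been created; queue depths are illustrative. */
#include <infiniband/verbs.h>
#include <string.h>

struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq, int use_rc)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.cap.max_send_wr  = 128;
    attr.cap.max_recv_wr  = 128;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    /* RC: reliable, one connection per client (better latency and features).
     * UD: unreliable datagram, one QP can reach many peers (better scalability). */
    attr.qp_type = use_rc ? IBV_QPT_RC : IBV_QPT_UD;

    return ibv_create_qp(pd, &attr);
}
```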

(Diagram: the master thread dispatches sockets clients to sockets worker threads and RDMA clients to verbs worker threads; all worker threads share the Memcached data structures (memory slabs and items).)

Memcached Get Latency (Small Message)
• Memcached Get latency
– 4 bytes, RC/UD: DDR 6.82/7.55 us; QDR 4.28/4.86 us
– 2K bytes, RC/UD: DDR 12.31/12.78 us; QDR 8.19/8.46 us
• Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster
(Charts: Get latency vs. message size (1 byte – 2 KB) on the Intel Clovertown cluster (IB DDR) and the Intel Westmere cluster (IB QDR), comparing IPoIB, 10GigE, OSU-RC-IB, and OSU-UD-IB.)

Memcached Get TPS (4 bytes)
• Memcached Get transactions per second for 4 bytes
– On IB QDR: 1.4M TPS (RC) and 1.3M TPS (UD) for 8 clients
• Significant improvement with native IB QDR compared to IPoIB
(Charts: thousands of transactions per second vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), and OSU-UD-IB (QDR).)

Memcached Performance (FDR Interconnect)
• Experiments on TACC Stampede (Intel SandyBridge cluster, IB FDR)
• Memcached Get latency
– 4 bytes: OSU-IB 2.84 us; IPoIB 75.53 us
– 2K bytes: OSU-IB 4.49 us; IPoIB 123.42 us
– Latency reduced by nearly 20X
• Memcached throughput (4 bytes)
– 4080 clients: OSU-IB 556 Kops/sec; IPoIB 233 Kops/sec
– Nearly 2X improvement in throughput
(Charts: Get latency vs. message size and throughput vs. number of clients, OSU-IB (FDR) vs. IPoIB (FDR).)

Application Level Evaluation – Olio Benchmark
• Olio Benchmark: RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients
• 4X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design
(Charts: execution time (ms) vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR).)

Application Level Evaluation – Real Application Workloads
• Real application workload: RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design
(Charts: execution time (ms) vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR).)
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11.
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transport, CCGrid '12.


Memcached-RDMA – Case Studies with OLTP Workloads

• Design of scalable architectures using Memcached and MySQL

• Case study to evaluate benefits of using RDMA-Memcached for traditional OLTP workloads

• Illustration with Read-Cache-Read access pattern using modified mysqlslap load testing tool

• Up to 40 nodes, 10 concurrent threads per node, 20 queries per client

• Memcached-RDMA can
– improve query latency by up to 66% over IPoIB (32Gbps)
– improve throughput by up to 69% over IPoIB (32Gbps)
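As a rough illustration of the Read-Cache-Read access pattern used in this case study, the sketch below consults Memcached before falling back to the database and then populates the cache; run_query() stands in for a real MySQL client call, and all names are hypothetical.

```c
/* Sketch of the Read-Cache-Read access pattern: look in Memcached first,
 * fall back to the database on a miss, then cache the result.
 * run_query() is a placeholder for a real MySQL client call. */
#include <libmemcached/memcached.h>
#include <stdlib.h>
#include <string.h>

extern char *run_query(const char *sql);   /* hypothetical DB helper */

char *cached_read(memcached_st *memc, const char *key, const char *sql)
{
    size_t len; uint32_t flags; memcached_return_t rc;

    char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS)
        return val;                          /* cache hit: no DB round trip */

    val = run_query(sql);                    /* cache miss: go to MySQL */
    memcached_set(memc, key, strlen(key), val, strlen(val), 60, 0);
    return val;                              /* subsequent reads hit the cache */
}
```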


Evaluation with Traditional OLTP Workloads - mysqlslap

(Charts: query latency (sec) and throughput (q/ms) vs. number of clients (48-400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps).)

Traditional Memcached Deployments

• Existing Memcached deployment
– Low hit ratio at the Memcached server due to limited main memory size
– Frequent access to the backend DB
• Mmap() the SSD into virtual memory
– Higher hit ratio due to SSD-mapped virtual memory
– Significant overhead in virtual memory management
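To make the mmap()-based deployment concrete, the sketch below maps a file on an SSD into the process address space so that cached items can spill beyond DRAM; the path and size are assumptions, and every page fault still goes through the kernel's virtual memory management, which is exactly the overhead noted above.

```c
/* Sketch: map a file on an SSD into virtual memory so that cached items can
 * spill beyond DRAM. Path and size are illustrative; every cold-page access
 * goes through the kernel VM layer (the overhead noted above). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t map_size = 30UL * 1024 * 1024 * 1024;   /* 30 GB, as in the study */
    int fd = open("/mnt/ssd/memcached.slab", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, map_size) != 0) { perror("open/ftruncate"); return 1; }

    void *region = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    /* A slab allocator could now place items in 'region'; touching a cold
     * page triggers a fault and an SSD read handled by the kernel. */

    munmap(region, map_size);
    close(fd);
    return 0;
}
```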


SSD-Assisted Hybrid Memory for Memcached

Get latency (us):

                               IB Verbs   IPoIB   10GigE   1GigE
  MySQL                        N/A        10763   10724    11220
  Memcached (in RAM)           10         60      40       150
  Memcached (SSD virtual mem.) 347        387     362      455

SSD basic performance (us, Fusion-io ioDrive): random read 68, random write 70

• SSD-assisted hybrid memory to expand the effective memory size
• A user-level library that bypasses the kernel-based virtual memory management overhead

Evaluation on SSD-Assisted Hybrid Memory for Memcached – Latency
• Memcached with InfiniBand DDR transport (16Gbps)
• 30GB of data on SSD, 256 MB read/write buffer
• Get/Put of a random object
(Charts: Memcached Get latency drops from 347 us to 93 us (about 3.6X) and Put latency from 48 us to 16 us (3X).)
X. Ouyang, N. S. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang, and D. K. Panda, SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, ICPP '12.

Evaluation on SSD-Assisted Hybrid Memory for Memcached – Throughput
• Memcached with InfiniBand DDR transport (16Gbps)
• 30GB of data on SSD, object size = 1KB, 256 MB read/write buffer
• 1, 2, 4, and 8 client processes performing random get()
(Chart: transactions/sec (thousands) vs. number of processes for Native Hmem (upper bound), Memcached+IB+Hmem, and Memcached+IB+VMS; up to 5.3X improvement.)

Motivation – Detailed Analysis on HBase Put/Get

• HBase 1KB Put – communication time: 8.9 us
– A factor of 6X improvement over 10GigE in communication time
• HBase 1KB Get – communication time: 8.9 us
– A factor of 6X improvement over 10GigE in communication time
(Charts: time breakdown – communication, communication preparation, server processing, server serialization, client processing, client serialization – for 1KB HBase Put and Get over IPoIB (DDR), 10GigE, and OSU-IB (DDR).)
M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy, and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, Poster Paper, ISPASS '12.


HBase-RDMA Design Overview

• The JNI layer bridges the Java-based HBase with a communication library written in native code
• Enables high-performance RDMA communication while supporting the traditional socket interface

(Diagram: Applications → HBase → Java socket interface over 1/10 GigE and IPoIB networks, or Java Native Interface (JNI) → OSU-IB design → IB verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).)
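A rough, hypothetical sketch of such a JNI bridge is shown below; the Java class, native method, and helper function names are invented for illustration and are not the OSU-IB sources.

```c
/* Hypothetical sketch of the JNI bridge: a Java-side native method
 * (e.g., declared as `private native int rdmaSend(byte[] buf, int len);`)
 * lands in a C function that hands the bytes to the verbs-based library.
 * Class/method/helper names are illustrative, not the OSU-IB code. */
#include <jni.h>

int osu_ib_send(const char *buf, int len);   /* assumed native communication call */

JNIEXPORT jint JNICALL
Java_org_example_hbase_IBTransport_rdmaSend(JNIEnv *env, jobject self,
                                            jbyteArray buf, jint len)
{
    (void)self;                                             /* unused */
    jbyte *bytes = (*env)->GetByteArrayElements(env, buf, NULL);
    int rc = osu_ib_send((const char *)bytes, (int)len);    /* RDMA path */
    (*env)->ReleaseByteArrayElements(env, buf, bytes, JNI_ABORT);  /* no copy-back */
    return (jint)rc;
}
```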


HBase-RDMA Design - Communication Flow

• RDMA design components: Network Selector, IB Reader, Helper threads, and a JNI adaptive interface
(Diagram: the HBase client application either uses sockets over the 1/10G network or goes through the JNI interface to the OSU-IB design over IB verbs; on the HBase server, Reader threads, the IB Reader, and the Responder feed Handler and Helper threads through the call queue and response queue.)

HBase Micro-benchmark (Single-Server-Multi-Client) Results
• HBase Get latency
– 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GigE
(Charts: latency (us) and throughput (ops/sec) vs. number of clients for OSU-IB (DDR), IPoIB (DDR), and 10GigE.)
J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12.

HBase – YCSB Read-Write Workload
• HBase Get latency (Yahoo! Cloud Serving Benchmark)
– 64 clients: 2.0 ms; 128 clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients
• HBase Put latency
– 64 clients: 1.9 ms; 128 clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients
(Charts: read and write latency (us) vs. number of clients for OSU-IB (QDR), IPoIB (QDR), and 10GigE.)


Acceleration Case Studies and In-Depth Performance Evaluation
• RDMA-based designs and performance evaluation
– HDFS
– MapReduce
– Spark

Design Overview of HDFS with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• The JNI layer bridges the Java-based HDFS with a communication library written in native code

• Design features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
(Diagram: Applications → HDFS → Java socket interface over 1/10 GigE and IPoIB networks, or Java Native Interface (JNI) → OSU design → verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE); the write path goes through the OSU design, other operations use sockets.)

Communication Times in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE

• Similar improvements are obtained for SSD DataNodes

(Chart: communication time (s) vs. file size (2-10 GB) for 10GigE, IPoIB (QDR), and OSU-IB (QDR); communication time reduced by 30%.)
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.

Evaluations using OHB HDFS Micro-benchmark

• Cluster with 4 HDD DataNodes, single disk per node

– 25% improvement in latency over IPoIB (QDR) for 10GB file size

– 50% improvement in throughput over IPoIB (QDR) for 10GB file size

(Charts: write latency (s) and total write throughput (MBps) with 2 clients vs. file size (2-10 GB) for 10GigE, IPoIB (QDR), and OSU-IB (QDR); latency reduced by 25%, throughput increased by 50%.)
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012.

Evaluations using TestDFSIO

• Cluster with 8 HDD DataNodes, single disk per node

– 24% improvement over IPoIB (QDR) for 20GB file size

• Cluster with 4 SSD DataNodes, single SSD per node

– 61% improvement over IPoIB (QDR) for 20GB file size

(Charts: total throughput (MBps) vs. file size (5-20 GB) – cluster with 8 HDD nodes, TestDFSIO with 8 maps (throughput increased by 24%), and cluster with 4 SSD nodes, TestDFSIO with 4 maps (throughput increased by 61%) – comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR).)

Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
• Cluster with 64 DataNodes, single HDD per node
– 64% improvement in aggregated throughput over IPoIB (FDR) for 256GB file size
– 37% improvement in execution time over IPoIB (FDR) for 256GB file size
(Charts: aggregated throughput (MBps) and execution time (s) vs. file size (64-256 GB), IPoIB (FDR) vs. OSU-IB (FDR).)

Evaluations using YCSB (32 Region Servers: 100% Update)
• HBase using TCP/IP, running over HDFS-IB
• HBase Put latency for 480K records
– 201 us for the OSU design; 272 us for IPoIB (32Gbps)
• HBase Put throughput for 480K records
– 4.42 Kops/sec for the OSU design; 3.63 Kops/sec for IPoIB (32Gbps)
• 26% improvement in average latency; 24% improvement in throughput
(Charts: Put latency (us) and throughput (Kops/sec) vs. number of records (120K-480K, record size = 1KB), IPoIB (QDR) vs. OSU-IB (QDR).)

HDFS and HBase Integration over IB (OSU-IB)
• YCSB evaluation with 4 RegionServers (100% update)
• HBase Put latency and throughput for 360K records
– 37% improvement over IPoIB (32Gbps)
– 18% improvement over OSU-IB HDFS only
(Charts: latency (us) and throughput (Kops/sec) vs. number of records (120K-480K, record size = 1KB) for IPoIB (QDR), OSU-IB (HDFS) with IPoIB (HBase), and OSU-IB (HDFS) with OSU-IB (HBase).)


Design Overview of MapReduce with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• The JNI layer bridges the Java-based MapReduce with a communication library written in native code
• Design features
– RDMA-based shuffle
– Prefetching and caching of map output
– Efficient shuffle algorithms
– In-memory merge
– On-demand shuffle adjustment
– Advanced overlapping of map, shuffle, and merge, and of shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
(Diagram: Applications → MapReduce (JobTracker; TaskTrackers running Map and Reduce tasks) → Java socket interface over 1/10 GigE and IPoIB networks, or Java Native Interface (JNI) → OSU design → verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).)

Advanced Overlapping among different phases

• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases compared to other approaches
– Efficient shuffle algorithms
– Dynamic and efficient switching
– On-demand shuffle adjustment
(Diagram: default architecture, enhanced overlapping, and advanced overlapping of the map, shuffle, merge, and reduce phases.)
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.

Evaluations using OHB MapReduce Micro-benchmark

• Stand-alone MapReduce micro-benchmark (MR-AVG)
• 1 KB key/value pair size
• For 8 slave nodes, RDMA achieves up to 30% improvement over IPoIB (56Gbps)
• For 16 slave nodes, RDMA achieves up to 28% improvement over IPoIB (56Gbps)

(Charts: 8 slave nodes and 16 slave nodes, IPoIB (FDR) vs. OSU-IB (FDR); up to 30% and 28% improvement, respectively.)
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014).

Evaluations using Sort (HDD, SSD)

• With 8 HDD DataNodes for 40GB sort
– 43% improvement over IPoIB (QDR)
– 44% improvement over UDA-IB (QDR)
• With 8 SSD DataNodes for 40GB sort
– 52% improvement over IPoIB (QDR)
– 45% improvement over UDA-IB (QDR)
(Charts: job execution time (sec) vs. data size (25-40 GB) on 8 HDD DataNodes and 8 SSD DataNodes, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR).)

Evaluations with TeraSort
• 100 GB TeraSort with 8 DataNodes with 2 HDDs per node
– 49% benefit compared to UDA-IB (QDR)
– 54% benefit compared to IPoIB (QDR)
– 56% benefit compared to 10GigE
(Chart: job execution time (sec) vs. data size (60-100 GB) for 10GigE, IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR).)

Performance Evaluation on Larger Clusters
• Sort in the OSU cluster: for 240GB sort on 64 nodes, 40% improvement over IPoIB (QDR) with HDDs used for HDFS
• TeraSort on TACC Stampede: for 320GB TeraSort on 64 nodes, 38% improvement over IPoIB (FDR) with HDDs used for HDFS
(Charts: job execution time (sec) for Sort with data sizes of 60/120/240 GB on cluster sizes of 16/32/64 nodes (OSU cluster, IPoIB/UDA-IB/OSU-IB QDR), and for TeraSort with data sizes of 80/160/320 GB on cluster sizes of 16/32/64 nodes (TACC Stampede, IPoIB/UDA-IB/OSU-IB FDR).)

Evaluations using PUMA Workload
• 50% improvement in SelfJoin over IPoIB (QDR) for 80 GB data size
• 49% improvement in SequenceCount over IPoIB (QDR) for 30 GB data size
(Chart: normalized execution time for AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR).)

Evaluations using SWIM
• 50 small MapReduce jobs executed on a cluster of size 4
• Maximum performance benefit: 24% over IPoIB (QDR)
• Average performance benefit: 13% over IPoIB (QDR)
(Chart: execution time (sec) for jobs 1-50, IPoIB (QDR) vs. OSU-IB (QDR).)


Design Overview of Spark with RDMA

• Design features
– RDMA-based shuffle
– SEDA-based plugins
– Dynamic connection management and sharing
– Non-blocking and out-of-order data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• The JNI layer bridges the Scala-based Spark with a communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014.

Preliminary Results of Spark-RDMA Design – GroupBy
• Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
– 18% improvement over IPoIB (QDR) for 10GB data size
• Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
– 20% improvement over IPoIB (QDR) for 20GB data size
(Charts: GroupBy time (sec) vs. data size for 4 nodes with 32 cores and 8 nodes with 64 cores, comparing 10GigE, IPoIB, and RDMA.)


Optimize Apache Hadoop over Parallel File Systems

(Diagram: Lustre setup with MetaData Servers, Object Storage Servers, and compute nodes running the TaskTracker (Map/Reduce tasks) and the Lustre client.)
• HPC cluster deployment
– Hybrid topological solution: a Beowulf architecture with separate I/O nodes
– Lean compute nodes with a light OS, more memory space, and small local storage
– Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre
• MapReduce over Lustre
– Local disk is used as the intermediate data directory, or
– Lustre is used as the intermediate data directory

Case Study – Performance Improvement of MapReduce over Lustre on TACC-Stampede
• Local disk is used as the intermediate data directory
• For 500GB sort on 64 nodes: 44% improvement over IPoIB (FDR)
• For 640GB sort on 128 nodes: 48% improvement over IPoIB (FDR)
(Charts: job execution time (sec) vs. data size (300-500 GB) and vs. cluster size (4-128 nodes, 20-640 GB), IPoIB (FDR) vs. OSU-IB (FDR).)
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.

Case Study – Performance Improvement of MapReduce over Lustre on TACC-Stampede
• Lustre is used as the intermediate data directory
• For 160GB sort on 16 nodes: 35% improvement over IPoIB (FDR)
• For 320GB sort on 32 nodes: 33% improvement over IPoIB (FDR)
(Charts: job execution time (sec) vs. data size (80-160 GB) and vs. cluster size (8-32 nodes, 80-320 GB), IPoIB (FDR) vs. OSU-IB (FDR).)
• Can more optimizations be achieved by leveraging more features of Lustre?


Ongoing and Future Activities for Hadoop Accelerations
• Some other existing Hadoop solutions
– Hadoop-A (UDA)
• RDMA-based implementation of the Hadoop MapReduce shuffle engine
• Uses a plug-in based solution: UDA (Unstructured Data Accelerator)
• http://www.mellanox.com/page/products_dyn?product_family=144
– Cloudera Distributions of Hadoop
• CDH: open source distribution
• Cloudera Standard: CDH with automated cluster management
• Cloudera Enterprise: Cloudera Standard + enhanced management capabilities and support
• Ethernet/IPoIB
• http://www.cloudera.com/content/cloudera/en/products.html
– Hortonworks Data Platform
• HDP: open source distribution
• Developed as projects through the Apache Software Foundation (ASF); no proprietary extensions or add-ons
• Functional areas: Data Management, Data Access, Data Governance and Integration, Security, and Operations
• Ethernet/IPoIB
• http://hortonworks.com/hdp

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges
(Diagram: the same stack as before – applications over Big Data middleware, a communication and I/O library with point-to-point communication, threaded models and synchronization, QoS, fault tolerance, I/O and file systems, virtualization, and benchmarks, over modern networking, storage, and compute technologies – now with an RDMA protocol added alongside sockets. Upper-level changes?)

More Challenges
• Multi-threading and synchronization
– Multi-threaded model exploration
– Fine-grained synchronization and lock-free design
– Unified helper threads for different components
– Multi-endpoint design to support multi-threaded communication
• QoS and virtualization
– Network virtualization and locality-aware communication for Big Data middleware
– Hardware-level virtualization support for end-to-end QoS
– I/O scheduling and storage virtualization
– Live migration

More Challenges (Cont'd)
• Support for accelerators
– Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs
– Offload computation-intensive workloads to accelerators
– Explore maximum overlapping between communication and offloaded computation
• Fault tolerance enhancements
– Exploration of light-weight fault tolerance mechanisms for Big Data
• Support for parallel file systems
– Optimize Big Data middleware over parallel file systems (e.g., Lustre) on modern HPC clusters
• Big Data benchmarking

Are the Current Benchmarks Sufficient for Big Data Management and Processing?
• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
– What is happening at the lower layer?
– Where are the benefits coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed, and what will be its impact?
– Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

Challenges in Benchmarking of RDMA-based Designs
(Diagram: the same middleware stack; current benchmarks exercise the applications and Big Data middleware at the top, while no benchmarks exist for the RDMA protocols underneath – can the two be correlated?)

OSU MPI Micro-Benchmarks (OMB) Suite
• A comprehensive suite of benchmarks to
– Compare performance of different MPI libraries on various networks and systems
– Validate low-level functionalities
– Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth, and bi-directional bandwidth
• Extended later to
– MPI-2 one-sided
– Collectives
– GPU-aware data movement
– OpenSHMEM (point-to-point and collectives)
– UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
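For reference, the basic send-recv latency test follows the classic ping-pong pattern; a stripped-down sketch is shown below (fixed message size and iteration count, unlike the real osu_latency, which sweeps message sizes and skips warm-up iterations).

```c
/* Stripped-down ping-pong latency sketch in the spirit of osu_latency
 * (fixed 4-byte message and iteration count; not the actual OMB source). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { ITERS = 10000, SIZE = 4 };
    char buf[SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)   /* one-way latency = round-trip time / 2 */
        printf("avg one-way latency: %.2f us\n", elapsed * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```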


Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications
(Diagram: the same middleware stack, with an iterative loop between applications-level benchmarks at the top and micro-benchmarks at the lower layers.)

Future Plans of OSU High Performance Big Data Project
• Upcoming releases of RDMA-enhanced packages will support
– Hadoop 2.x MapReduce & RPC
– Spark
– HBase
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
– HDFS
– MapReduce
– RPC
• Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
• Advanced designs with upper-level changes and optimizations


Concluding Remarks
• Presented an overview of data management and processing middleware in different tiers
• Provided an overview of modern networking technologies
• Discussed challenges in accelerating Big Data middleware
• Presented initial designs that take advantage of InfiniBand/RDMA for Memcached, HBase, Hadoop, and Spark
• Results are promising
• Many other open issues still need to be solved
• Will enable the Big Data management and processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner

Thank You!
[email protected]
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/