TRANSCRIPT
High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at Mellanox Theatre (SC ’15)
Data Processing and Management on Modern Clusters
• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
– Front-end data accessing and serving (online)
• Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (offline)
• HDFS, MapReduce, Spark
Spark - An Example of Back-end Data Processing Middleware
[Figure: Spark architecture — a Driver hosting the SparkContext coordinates with a Master (and ZooKeeper) to run tasks on Worker nodes backed by HDFS]
• An in-memory data-processing framework
– Iterative machine learning jobs
– Interactive data analytics
– Scala-based implementation
– Runs standalone or on YARN and Mesos
• Scalable, communication- and I/O-intensive
– Wide dependencies between Resilient Distributed Datasets (RDDs)
– MapReduce-like shuffle operations to repartition RDDs (see the sketch below)
– Sockets-based communication
http://spark.apache.org
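To make the shuffle bullet concrete, here is a minimal Spark job using the standard Java API (plain Spark usage, not the OSU design itself): reduceByKey creates a wide dependency, so its input must be repartitioned through a shuffle — exactly the sockets-based data path that the RDMA designs discussed later replace.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ShuffleExample").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "rdma", "spark", "hdfs"));

        // reduceByKey introduces a wide dependency: every partition of the
        // child RDD may need data from every partition of the parent, so the
        // records are repartitioned via a MapReduce-like shuffle. This is the
        // traffic an RDMA-based shuffle layer accelerates.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

        counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
```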
Memcached - An Example of Front-end Data Management Middleware
• Three-layer architecture of Web 2.0
– Web servers, Memcached servers, database servers
• Distributed caching layer
– Allows aggregating spare memory from multiple nodes
– General purpose
• Typically used to cache database queries and the results of API calls (a cache-aside sketch follows below)
• Scalable model, but typical usage is very network intensive
[Figure: web frontend servers (Memcached clients), reached over the Internet, connect through high-performance networks to the Memcached servers and then to the database servers; each server node has CPUs, main memory, SSD, and HDD]
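As an illustration of the caching layer, here is a minimal cache-aside sketch using the open-source spymemcached Java client (the talk's own evaluation uses the C libmemcached client; the host name and the queryDatabase helper are hypothetical):

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAside {
    // Hypothetical helper standing in for a real MySQL lookup.
    static String queryDatabase(String userId) { return "profile-of-" + userId; }

    public static void main(String[] args) throws Exception {
        // One client, many possible servers: the caching layer aggregates
        // spare memory from every node listed here.
        MemcachedClient mc = new MemcachedClient(
                new InetSocketAddress("memcached-host", 11211));

        String key = "user:42";
        Object cached = mc.get(key);        // network round trip to the cache
        if (cached == null) {               // miss: fall back to the database
            String value = queryDatabase("42");
            mc.set(key, 300, value);        // cache the result for 300 seconds
            cached = value;
        }
        System.out.println(cached);
        mc.shutdown();
    }
}
```

Every get/set is a network round trip, which is why this layer is so sensitive to interconnect latency.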
Drivers for Modern HPC Clusters
• High-End Computing (HEC) is growing dramatically
– High-performance computing
– Big Data computing
• Technology advancement
– Multi-core/many-core technologies
– Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
– Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
Interconnects and Protocols in the OpenFabrics Stack

| Option | Application/Middleware Interface | Protocol | Adapter | Switch |
|---|---|---|---|---|
| 1/10/40 GigE | Sockets | TCP/IP (kernel space, Ethernet driver) | Ethernet Adapter | Ethernet Switch |
| 10/40 GigE-TOE | Sockets | TCP/IP (hardware offload) | Ethernet Adapter | Ethernet Switch |
| IPoIB | Sockets | IPoIB (kernel space) | InfiniBand Adapter | InfiniBand Switch |
| RSockets | Sockets | RSockets (user space) | InfiniBand Adapter | InfiniBand Switch |
| SDP | Sockets | SDP (user space) | InfiniBand Adapter | InfiniBand Switch |
| iWARP | Verbs | RDMA over TCP/IP (user space) | iWARP Adapter | Ethernet Switch |
| RoCE | Verbs | RDMA (user space) | RoCE Adapter | Ethernet Switch |
| IB Native | Verbs | RDMA (user space) | InfiniBand Adapter | InfiniBand Switch |
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How should benchmarks be designed for evaluating the performance of Big Data middleware on HPC clusters?
Bring HPC and Big Data processing into a "convergent trajectory"!
Challenges in Designing Communication and I/O Libraries for Big Data Systems
[Figure: layered stack — Applications; Big Data Middleware (HDFS, MapReduce, HBase, Spark, and Memcached); Programming Models (Sockets; other protocols?); a Communication and I/O Library covering point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault tolerance, and benchmarks over an RDMA protocol, with possible upper-level changes; all running on commodity computing system architectures (multi-/many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40GigE, and intelligent NICs), and storage technologies (HDD and SSD)]
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
Design Overview of HDFS with RDMA
[Figure: Applications issue HDFS write and other operations; the default path goes through the Java Socket Interface to 1/10 GigE or IPoIB networks, while the OSU design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...)]
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov. 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
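Because the OSU design replaces the transport underneath HDFS rather than its API, an ordinary HDFS write is unchanged from the application's point of view. A minimal sketch using the standard Hadoop FileSystem API (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // A plain HDFS write: the client streams packets to the first
        // DataNode, which pipelines them to the replicas. In the RDMA-based
        // design the same create()/write() calls are carried by the RDMA
        // write and replication paths underneath, with no application change.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/rdma-demo.txt"))) {
            out.writeBytes("written over whichever transport HDFS is configured with\n");
        }
        fs.close();
    }
}
```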
Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
• Cluster with 64 DataNodes, single HDD per node
– 64% improvement in throughput over IPoIB (FDR) for 256 GB file size
– 37% improvement in latency over IPoIB (FDR) for 256 GB file size
[Charts: aggregated throughput (MBps) and execution time (s) vs. file size (64/128/256 GB), IPoIB (FDR) vs. OSU-IB (FDR); throughput increased by 64% and execution time reduced by 37%]
Design Overview of MapReduce with RDMA
[Figure: Applications run MapReduce (JobTracker, TaskTracker, Map, and Reduce components); the default path goes through the Java Socket Interface to 1/10 GigE or IPoIB networks, while the OSU design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...)]
• Design Features
– RDMA-based shuffle (see the job sketch below)
– Prefetching and caching of map output
– Efficient shuffle algorithms
– In-memory merge
– On-demand shuffle adjustment
– Hybrid overlapping
• map, shuffle, and merge
• shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
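For reference, a standard MapReduce job showing where these features act: everything between the mapper's context.write() and the reducer's input crosses the shuffle, which is the path that the RDMA-based shuffle, in-memory merge, and hybrid overlapping optimize. This is plain Hadoop API code, not the OSU design itself:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Each write here becomes map output that is shuffled to reducers.
            for (String w : value.toString().split("\\s+"))
                ctx.write(new Text(w), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            // Values arrive merged/sorted by key after the shuffle.
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```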
Evaluations using PUMA Workload
• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size
[Chart: normalized execution time for AdjList (30 GB), SelfJoin (80 GB), SeqCount (30 GB), WordCount (30 GB), and InvertIndex (30 GB), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR)]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.
Design Overview of Spark with RDMA
• Design Features
– RDMA-based shuffle
– SEDA-based plugins
– Dynamic connection management and sharing
– Non-blocking and out-of-order data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges Scala-based Spark with the communication library written in native code (see the sketch below)
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
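A hypothetical sketch of what such a JNI bridge can look like on the JVM side; the class, method, and library names below are illustrative, not the actual OSU implementation:

```java
import java.nio.ByteBuffer;

// Hypothetical JNI bridge, sketched to show the structure only.
public final class RdmaShuffleBridge {
    static {
        System.loadLibrary("rdmashuffle"); // native communication library (C code)
    }

    // Registers an off-JVM-heap (direct) buffer with the NIC so the native
    // side can issue zero-copy RDMA transfers that the garbage collector
    // never moves.
    public static native long registerBuffer(ByteBuffer directBuffer);

    // Posts a non-blocking send of one shuffle block; completions may arrive
    // out of order and are matched up by the caller.
    public static native int postSend(long connectionId, long bufferHandle,
                                      int offset, int length);

    // Polls the completion queue; returns the id of a finished transfer or -1.
    public static native long pollCompletion();
}
```

The split mirrors the slide: Scala/Java code stays on one side of the native boundary, while connection management and verbs calls live in the C library loaded above.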
Performance Evaluation on TACC Stampede - SortByTest
• Intel SandyBridge + FDR, 16 worker nodes, 256 cores (256M 256R)
• RDMA-based design for Spark 1.4.0
• RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and RAMDisk; for the SortByKey test:
– Shuffle time reduced by up to 77% over IPoIB (56 Gbps)
– Total time reduced by up to 58% over IPoIB (56 Gbps)
[Charts: SortByTest shuffle time and total time (sec) vs. data size (16/32/64 GB) on 16 worker nodes, IPoIB vs. RDMA, showing the 77% and 58% reductions]
Memcached-RDMA Design
• Server and client perform a negotiation protocol
– The master thread assigns clients to the appropriate worker thread
• Once a client is assigned to a verbs worker thread, it communicates directly with that thread and is "bound" to it (a dispatch sketch follows below)
• All other Memcached data structures are shared among the RDMA and sockets worker threads
• Native IB-verbs-level design and evaluation with
– Server: Memcached (http://memcached.org)
– Client: libmemcached (http://libmemcached.org)
– Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)
[Figure: sockets clients and RDMA clients first contact the master thread, which binds each to a sockets or verbs worker thread; all workers share the same memory slabs and items]
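A hypothetical Java sketch of the dispatch scheme described above (the real server is C code inside Memcached, so every name here is illustrative): the master hands each client to a transport-matched worker, and all workers share the item store.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class DispatchSketch {
    enum Transport { SOCKETS, VERBS }

    // Shared data structures (slabs/items in the real server) are visible
    // to sockets and verbs workers alike.
    static final ConcurrentHashMap<String, String> sharedItems = new ConcurrentHashMap<>();

    static class Worker extends Thread {
        final BlockingQueue<Integer> boundClients = new LinkedBlockingQueue<>();
        final Transport transport;
        Worker(Transport t) { this.transport = t; }
        @Override public void run() {
            try {
                while (true) {
                    int clientId = boundClients.take();
                    // Once bound, this worker talks to the client directly
                    // (verbs queue pair or socket) for its whole lifetime.
                    System.out.println(transport + " worker serving client " + clientId);
                }
            } catch (InterruptedException e) { /* shut down */ }
        }
    }

    public static void main(String[] args) {
        Worker sockets = new Worker(Transport.SOCKETS);
        Worker verbs = new Worker(Transport.VERBS);
        sockets.start(); verbs.start();

        // Master thread's role: run the negotiation, then hand each client
        // to the worker that matches its transport.
        verbs.boundClients.add(1);   // an RDMA-capable client
        sockets.boundClients.add(2); // a legacy sockets client
    }
}
```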
Memcached Performance (TACC Stampede)
• Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)
• Memcached GET latency
– 4 bytes - OSU-IB: 2.84 us; IPoIB: 75.53 us
– 2K bytes - OSU-IB: 4.49 us; IPoIB: 123.42 us
– Latency reduced by nearly 20x
• Memcached throughput (4 bytes)
– 4080 clients - OSU-IB: 556 Kops/sec; IPoIB: 233 Kops/sec
– Nearly 2x improvement in throughput
[Charts: GET latency (us) vs. message size (1 B-4 KB) and throughput (thousands of transactions per second, TPS) vs. number of clients (16-4080), OSU-IB (FDR) vs. IPoIB (FDR)]
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
Optimize I/O Performance for Big Data Applications with High-Performance Storage on HPC Clusters
[Figure: compute nodes (App Master, Map/Reduce tasks, RAM, local disk, Lustre client) connected to a Lustre setup of metadata servers and object storage servers]
• HPC cluster deployment
– Hybrid topological solution of the Beowulf architecture with separate I/O nodes
– Lean compute nodes with a light OS, more memory space, and small local storage
– Sub-cluster of dedicated I/O nodes with parallel file systems such as Lustre
• HDFS over heterogeneous storage
– RAMDisk, SSD, HDD, and Lustre as data directories
• MapReduce over Lustre
– Local disk used as the intermediate data directory
– Lustre used as the intermediate data directory
• Hybrid Memcached with RAM + SSD
Enhanced HDFS with In-memory and Heterogeneous Storage (Triple-H)
[Figure: applications run Map/Reduce over Triple-H, which applies data placement policies, hybrid replication, and eviction/promotion across RAM Disk, SSD, HDD, and Lustre]
• Design Features
– Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
– Eviction/promotion based on data usage pattern (a sketch follows after this slide)
– Hybrid replication
– Lustre-Integrated mode: Lustre-based fault tolerance
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
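A toy sketch of usage-driven eviction/promotion across tiers, under the stated RAM > SSD > HDD > Lustre hierarchy; the real Triple-H policies are more elaborate, and every name and threshold here is made up:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class TieredPlacement {
    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

    static final int RAM_CAPACITY = 2;               // blocks; tiny for the demo
    final Map<String, Tier> location = new HashMap<>();
    final Deque<String> ramLru = new ArrayDeque<>(); // most recent at the head

    // An access promotes a hot block to the RAM tier; when RAM is full,
    // the coldest block is demoted one tier down.
    void access(String block) {
        ramLru.remove(block);
        ramLru.addFirst(block);
        location.put(block, Tier.RAM_DISK);
        if (ramLru.size() > RAM_CAPACITY) {
            String cold = ramLru.removeLast();       // least recently used
            location.put(cold, Tier.SSD);            // demote one tier
            System.out.println("evicted " + cold + " to SSD");
        }
    }

    public static void main(String[] args) {
        TieredPlacement p = new TieredPlacement();
        p.access("blk_1"); p.access("blk_2"); p.access("blk_3"); // evicts blk_1
        System.out.println(p.location);
    }
}
```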
Performance Improvement on TACC Stampede (HHH)
• For 160 GB TestDFSIO on 32 nodes
– Write throughput: 7x improvement over IPoIB (FDR)
– Read throughput: 2x improvement over IPoIB (FDR)
• For 120 GB RandomWriter on 32 nodes
– 3x improvement over IPoIB (FDR)
[Charts: TestDFSIO total throughput (MBps) for write and read, and RandomWriter execution time (s) vs. cluster size:data size (8:30, 16:60, 32:120), IPoIB (FDR) vs. OSU-IB (FDR); write throughput increased 7x, read throughput 2x, RandomWriter time reduced 3x]
Performance Improvement on SDSC Gordon (HHH vs. Tachyon)
• RandomWriter: 200 GB on 32 SSD nodes (QDR)
– HHH reduces the execution time by 47% over Tachyon-WTC (IPoIB) and 56% over HDFS (IPoIB)
• Sort: 200 GB on 32 SSD nodes (QDR)
– HHH reduces the execution time by 19% over Tachyon-WTC (IPoIB) and 31% over HDFS (IPoIB)
[Charts: RandomWriter and Sort execution time (s) vs. cluster size:data size (8:50, 16:100, 32:200 GB) for HDFS, Tachyon-WTC, Tachyon-C, and HHH]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015
Design Overview of Shuffle Strategies for MapReduce over Lustre
[Figure: map tasks write to Lustre as the intermediate data directory; reduce tasks fetch map output either via Lustre read or via RDMA, then perform in-memory merge/sort and reduce]
• Design Features
– Two shuffle approaches: Lustre-read-based shuffle and RDMA-based shuffle
– A hybrid shuffle algorithm takes the benefit of both approaches (see the sketch below)
– Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation
– In-memory merge and the overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.
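A hypothetical sketch of the per-request decision between the two shuffle paths; the profiling-based cost model below (an exponentially weighted bandwidth estimate) is illustrative, not the published algorithm:

```java
public class HybridShuffleChooser {
    enum Path { LUSTRE_READ, RDMA }

    // Exponentially weighted estimate of observed Lustre read bandwidth,
    // updated from the profiling values of each completed Lustre read.
    private double lustreBytesPerSec = 0;

    void recordLustreRead(long bytes, double seconds) {
        double observed = bytes / seconds;
        lustreBytesPerSec = (lustreBytesPerSec == 0)
                ? observed : 0.8 * lustreBytesPerSec + 0.2 * observed;
    }

    // Pick whichever path is expected to deliver this shuffle request sooner.
    Path choose(long requestBytes, double rdmaBytesPerSec) {
        double lustreCost = requestBytes / Math.max(lustreBytesPerSec, 1);
        double rdmaCost = requestBytes / rdmaBytesPerSec;
        return lustreCost < rdmaCost ? Path.LUSTRE_READ : Path.RDMA;
    }

    public static void main(String[] args) {
        HybridShuffleChooser c = new HybridShuffleChooser();
        c.recordLustreRead(64L << 20, 0.05);       // a 64 MB read at ~1.3 GB/s
        System.out.println(c.choose(32L << 20, 3.0e9)); // RDMA wins here
    }
}
```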
Performance Improvement of MapReduce over Lustre on TACC-Stampede
• Local disk is used as the intermediate data directory
• For 500 GB Sort on 64 nodes: 44% improvement over IPoIB (FDR)
• For 640 GB Sort on 128 nodes: 48% improvement over IPoIB (FDR)
[Charts: job execution time (sec) vs. data size (300/400/500 GB) on 64 nodes, and scaled runs from 20 GB on 4 nodes up to 640 GB on 128 nodes, IPoIB (FDR) vs. OSU-IB (FDR)]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.
Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon
• Lustre is used as the intermediate data directory
• For 80 GB Sort on 8 nodes: 34% improvement over IPoIB (QDR)
• For 120 GB TeraSort on 16 nodes: 25% improvement over IPoIB (QDR)
[Charts: job execution time (sec) vs. data size for Sort (40/60/80 GB) and TeraSort (40/80/120 GB), comparing IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR)]
Overview of SSD-Assisted Hybrid RDMA-Memcached Design
• Design Features
– Hybrid slab allocation and management for higher data retention
– Log-structured sequence of blocks flushed to the SSD
– Fast SSD random reads to achieve low-latency object access
– Uses LRU to evict data to the SSD (a toy sketch follows below)
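A toy Java sketch of the RAM+SSD idea: LRU-evicted values are appended to a log-structured file on the SSD, and a later get() needs only one fast random read to fetch them back. All names and sizes are illustrative:

```java
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class HybridStore {
    private final RandomAccessFile ssdLog;                  // log-structured: append only
    private final Map<String, long[]> onSsd = new HashMap<>(); // key -> {offset, length}

    // Access-ordered map acts as the in-RAM LRU; the eldest entry is
    // spilled to the SSD log whenever the tiny RAM quota is exceeded.
    private final LinkedHashMap<String, byte[]> ram =
            new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                    if (size() <= 4) return false;          // RAM quota for the demo
                    spill(e.getKey(), e.getValue());        // LRU victim goes to SSD
                    return true;
                }
            };

    HybridStore(String path) throws Exception { ssdLog = new RandomAccessFile(path, "rw"); }

    void spill(String key, byte[] value) {
        try {
            long off = ssdLog.length();
            ssdLog.seek(off);
            ssdLog.write(value);                            // sequential append
            onSsd.put(key, new long[]{off, value.length});
        } catch (Exception ex) { throw new RuntimeException(ex); }
    }

    void set(String key, byte[] value) { ram.put(key, value); }

    byte[] get(String key) throws Exception {
        byte[] v = ram.get(key);
        if (v != null) return v;                            // RAM hit
        long[] loc = onSsd.get(key);
        if (loc == null) return null;                       // miss
        byte[] buf = new byte[(int) loc[1]];
        ssdLog.seek(loc[0]);                                // one random read
        ssdLog.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws Exception {
        HybridStore store = new HybridStore("/tmp/hybrid.log");
        for (int i = 0; i < 6; i++) store.set("k" + i, ("v" + i).getBytes());
        System.out.println(new String(store.get("k0")));    // served from the SSD log
    }
}
```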
Performance Evaluation on SDSC Comet (IB FDR + SATA/NVMe SSDs)
• Memcached latency test with Zipf distribution; server with 1 GB of memory; 32 KB key-value pair size; total data accessed is 1 GB (when the data fits in memory) and 1.5 GB (when it does not)
• When the data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem
• When the data does not fit in memory: RDMA-Hybrid gives a 2x-2.5x improvement over IPoIB/RDMA-Mem
[Chart: Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, in both the data-fits and data-does-not-fit cases; latency is broken down into slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty]
Presentation Outline
• Challenges for Accelerating Big Data Processing
• Accelerating Big Data Processing on RDMA-enabled High-Performance Interconnects
– RDMA-enhanced Designs for HDFS, MapReduce, Spark, and Memcached
• Accelerating Big Data Processing on High-Performance Storage
– Enhanced Designs for HDFS and MapReduce over Lustre
– SSD-assisted Hybrid Memory for RDMA-based Memcached
• The High-Performance Big Data (HiBD) Project
• Conclusion and Q&A
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache and HDP Hadoop distributions
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
– HDFS and Memcached micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 135 organizations from 20 countries
• More than 13,300 downloads from the project site
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance. This mode is enabled by default in the package (a hypothetical mode-selection sketch follows after this list).
• HHH-M: A high-performance in-memory setup, introduced in this package, can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre installation available in the cluster.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Two different modes are introduced here: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
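For illustration only, mode selection would be a configuration choice along these lines; the property keys below are invented placeholders, not the real parameter names, which are documented in the RDMA for Apache Hadoop 2.x user guide at http://hibd.cse.ohio-state.edu:

```java
import org.apache.hadoop.conf.Configuration;

public class ModeSelection {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // HHH is the default; an in-memory or Lustre-integrated cluster
        // would flip a switch along these lines before starting HDFS.
        conf.set("hdfs.placement.mode", "HHH-L");          // placeholder key
        conf.set("lustre.mount.point", "/scratch/lustre"); // placeholder key
        System.out.println("mode = " + conf.get("hdfs.placement.mode"));
    }
}
```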
Future Plans of the OSU High-Performance Big Data Project
• Upcoming releases of RDMA-enhanced packages will support
– CDH plugin
– Spark
– HBase
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
– MapReduce
– RPC
• Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
Concluding Remarks
• Discussed communication and I/O challenges in accelerating Big Data middleware
• Presented initial designs that take advantage of InfiniBand/RDMA and high-performance storage architectures for Hadoop, Spark, and Memcached
• Presented challenges in designing benchmarks
• Results are promising
• Many other open issues still need to be solved
• These designs will enable the Big Data processing community to take advantage of modern HPC technologies to carry out analytics in a fast and scalable manner
An Additional Talk
• Tomorrow, Thursday (11:00-11:30 am): Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Thank You!
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/