big data: hadoop and memcached - hpc advisory …...big data processing with hadoop •the...

Big Data: Hadoop and Memcached

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Switzerland Conference (2014)

by




Introduction to Big Data Applications and Analytics

• Big Data has become the one of the most

important elements of business analytics

• Provides groundbreaking opportunities

for enterprise information management

and decision making

• The amount of data is exploding;

companies are capturing and digitizing

more information than ever

• The rate of information growth appears

to be exceeding Moore’s Law

2 HPC Advisory Council Switzerland Conference, Apr '14

• Commonly accepted 3V’s of Big Data • Volume, Velocity, Variety Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

• 5V’s of Big Data – 3V + Value, Veracity

http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf



• Scientific Data Management, Analysis, and Visualization

• Applications examples

– Climate modeling

– Combustion

– Fusion

– Astrophysics

– Bioinformatics

• Data Intensive Tasks

– Runs large-scale simulations on supercomputers

– Dump data on parallel storage systems

– Collect experimental / observational data

– Move experimental / observational data to analysis sites

– Visual analytics – help understand data visually

HPC Advisory Council Switzerland Conference, Apr '14 3

Not Only in Internet Services - Big Data in Scientific Domains

• Hadoop: http://hadoop.apache.org

– The most popular framework for Big Data Analytics

– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.

• Spark: http://spark-project.org

– Provides primitives for in-memory cluster computing; Jobs can load data into memory and

query it repeatedly

• Shark: http://shark.cs.berkeley.edu

– A fully Apache Hive-compatible data warehousing system

• Storm: http://storm-project.net

– A distributed real-time computation system for real-time analytics, online machine learning,

continuous computation, etc.

• S4: http://incubator.apache.org/s4

– A distributed system for processing continuous unbounded streams of data

• Web 2.0: RDBMS + Memcached (http://memcached.org)

– Memcached: A high-performance, distributed memory object caching systems


Typical Solutions or Architectures for Big Data Applications and Analytics

http://hadoop.apache.org

http://hadoop.apache.org

http://spark-project.org



http://shark.cs.berkeley.edu

http://shark.cs.berkeley.edu

http://storm-project.net



http://incubator.apache.org/s4

http://incubator.apache.org/s4

http://memcached.org

http://memcached.org

• Focuses on large data and data analysis

• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment

is gaining a lot of momentum

• http://wiki.apache.org/hadoop/PoweredBy


Who Are Using Hadoop?

http://wiki.apache.org/hadoop/PoweredBy

http://wiki.apache.org/hadoop/PoweredBy

• Scale: 42,000 nodes, 365PB HDFS, 10M daily slot hours • http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html

• A new Jim Gray’ Sort Record: 1.42 TB/min over 2,100 nodes • Hadoop 0.23.7, an early branch of the Hadoop 2.X

• http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html


Hadoop at Yahoo!

http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline


Big Data Processing with Hadoop

• The open-source implementation of MapReduce programming model for Big Data Analytics

• Major components

– HDFS

– MapReduce

• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase

• Model scales but high amount of communication during intermediate phases can be further optimized

8

HDFS

MapReduce

Hadoop Framework

User Applications

HBase

Hadoop Common (RPC)

HPC Advisory Council Stanford Conference, Feb '14

Hadoop MapReduce Architecture & Operations

Disk Operations

• Map and Reduce Tasks carry out the total job execution

• Communication in shuffle phase uses HTTP over Java Sockets


Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop;

highly reliable and fault-tolerant

• Adopted by many reputed

organizations

– eg: Facebook, Yahoo!

• NameNode: stores the file system namespace

• DataNode: stores data blocks

• Developed in Java for platform-

independence and portability

• Uses sockets for communication!

(HDFS Architecture)

10

RPC

RPC

RPC

RPC

HPC Advisory Council Switzerland Conference, Apr '14

Client


System-Level Interaction Between Clients and Data Nodes in HDFS

High

Performance

Networks

(HDD/SSD)

(HDD/SSD)

(HDD/SSD)

...

...

(HDFS Data Nodes)(HDFS Clients)

...

...


HBase

• Apache Hadoop Database

(http://hbase.apache.org/)

• Semi-structured database,

which is highly scalable

• Integral part of many

datacenter applications

– eg:- Facebook Social Inbox

• Developed in Java for platform-

independence and portability

• Uses sockets for

communication!

ZooKeeper

HBase

Client

HMaster

HRegion

Servers

HDFS

(HBase Architecture)


System-Level Interaction Among HBase Clients, Region Servers, and Data Nodes

(HBase Clients) (HRegion Servers) (Data Nodes)

(HDD/SSD)

(HDD/SSD)

(HDD/SSD)

...

......

... ...

...

High

Performance

Networks

High

Performance

Networks


Memcached Architecture

• Integral part of Web 2.0 architecture

• Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes

– General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive

!"#$%"$#&

Web Frontend Servers (Memcached Clients)

(Memcached Servers)

(Database Servers)

High

Performance

Networks

High

Performance

Networks

Main

memoryCPUs

SSD HDD

High Performance Networks

... ...

...

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD





• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges





Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

0

10

20

30

40

50

60

70

80

90

100

0

50

100

150

200

250

300

350

400

450

500

Pe

rce

nta

ge o

f C

lust

ers

Nu

mb

er

of

Clu

ste

rs

Timeline

Percentage of Clusters

Number of Clusters

HPC and Advanced Interconnects

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols

– InfiniBand

– 10 Gigabit Ethernet/iWARP

– RDMA over Converged Enhanced Ethernet (RoCE)

• Very Good Performance

– Low latency (few micro seconds)

– High Bandwidth (100 Gb/s with dual FDR InfiniBand)

– Low CPU overhead (5-10%)

• OpenFabrics software stack with IB, iWARP and RoCE interfaces are driving HPC systems

• Many such systems in Top500 list


18

Trends of Networking Technologies in TOP500 Systems

Percentage share of InfiniBand is steadily increasing

Interconnect Family – Systems Share


• Message Passing Interface (MPI)

– Most commonly used programming model for HPC applications

• Parallel File Systems

• Storage Systems

• Almost 13 years of Research and Development since

InfiniBand was introduced in October 2000

• Other Programming Models are emerging to take

advantage of High-Performance Networks

– UPC

– OpenSHMEM

– Hybrid MPI+PGAS (OpenSHMEM and UPC)

19

Use of High-Performance Networks for Scientific Computing



One-way Latency: MPI over IB with MVAPICH2

0.00

1.00

2.00

3.00

4.00

5.00

6.00Small Message Latency

Message Size (bytes)

Late

ncy

(u

s)

1.66

1.56

1.64

1.82

0.99 1.09

1.12

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

0.00

50.00

100.00

150.00

200.00

250.00Qlogic-DDRQlogic-QDRConnectX-DDRConnectX2-PCIe2-QDRConnectX3-PCIe3-FDRSandy-ConnectIB-DualFDRIvy-ConnectIB-DualFDR

Large Message Latency


Late

ncy

(u

s)


Bandwidth: MPI over IB with MVAPICH2

0

2000

4000

6000

8000

10000

12000

14000Unidirectional Bandwidth

Ban

dw

idth

(M

Byt

es/

sec)


3280

3385

1917 1706

6343

12485

12810

0

5000

10000

15000

20000

25000

30000Qlogic-DDRQlogic-QDRConnectX-DDRConnectX2-PCIe2-QDRConnectX3-PCIe3-FDRSandy-ConnectIB-DualFDRIvy-ConnectIB-DualFDR

Bidirectional Bandwidth

Ban

dw

idth

(M

Byt

es/

sec)


3341 3704

4407

11643

6521

21025

24727

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch





• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges




Can High-Performance Interconnects Benefit Hadoop?

• Most of the current Hadoop environments use Ethernet infrastructure with Sockets

• Concerns for performance and scalability

• Usage of high-performance networks is beginning to draw interest from many companies

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?

• Can HPC Clusters with high-performance networks be used for Big Data applications using Hadoop?


Designing Communication and I/O Libraries for Big Data Systems: Challenges


Big Data Middleware (HDFS, MapReduce, HBase and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks


Common Protocols using Open Fabrics

Application/ Middleware

Verbs Sockets

Application /Middleware Interface

SDP

RDMA

SDP

InfiniBand Adapter

InfiniBand Switch

RDMA

IB Verbs

InfiniBand Adapter

InfiniBand Switch

User

space

RDMA

RoCE

RoCE Adapter

User

space

Ethernet Switch

TCP/IP

Ethernet Driver

Kernel Space

Protocol

InfiniBand Adapter

InfiniBand Switch

IPoIB

Ethernet Adapter

Ethernet Switch

Adapter

Switch

1/10/40 GigE

iWARP

Ethernet Switch

iWARP

iWARP Adapter

User

space IPoIB

TCP/IP

Ethernet Adapter

Ethernet Switch

10/40 GigE-TOE

Hardware Offload

RSockets

InfiniBand Adapter

InfiniBand Switch

User

space

RSockets

Can Hadoop be Accelerated with High-Performance Networks and Protocols?

26

Enhanced Designs

Application

Accelerated Sockets

10 GigE or InfiniBand

Verbs / Hardware Offload

Current Design

Application

Sockets

1/10 GigE Network

• Sockets not designed for high-performance

– Stream semantics often mismatch for upper layers (Hadoop and HBase)

– Zero-copy not available for non-blocking sockets

Our Approach

Application

OSU Design

10 GigE or InfiniBand

Verbs Interface


Two Broad Approaches

• All components of Hadoop (HDFS, MapReduce, RPC, HBase, etc.) working on native IB verbs

• Hadoop Map-Reduce over verbs for existing HPC clusters with Lustre






• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges




• High-Performance Design of Hadoop over RDMA-enabled Interconnects

– High performance design with native InfiniBand and RoCE support at the verbs-

level for HDFS, MapReduce, and RPC components

– Easily configurable for native InfiniBand, RoCE and the traditional sockets-

based support (Ethernet and InfiniBand with IPoIB)

• Current release: 0.9.9 (03/31/14)

– Based on Apache Hadoop 1.2.1

– Compliant with Apache Hadoop 1.2.1 APIs and applications

– Tested with

• Mellanox InfiniBand adapters (DDR, QDR and FDR)

• RoCE

• Various multi-core platforms

• Different file systems with disks and SSDs

– http://hadoop-rdma.cse.ohio-state.edu

RDMA for Apache Hadoop Project


http://hadoop-rdma.cse.ohio-state.edu/







Design Overview of HDFS with RDMA

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based HDFS with communication library written in native code

30

HDFS

Verbs

RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..)

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Write Others

OSU Design

• Design Features

– RDMA-based HDFS write

– RDMA-based HDFS replication

– Parallel replication support

– On-demand connection setup

– InfiniBand/RoCE support


Communication Time in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE

• Similar improvements are obtained for SSD DataNodes

Reduced by 30%

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,

High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

31

0

5

10

15

20

25

2GB 4GB 6GB 8GB 10GB

Co

mm

un

icat

ion

Tim

e (s

)

File Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)


Evaluations using TestDFSIO

• Cluster with 8 HDD DataNodes, single disk per node

– 24% improvement over IPoIB (QDR) for 20GB file size

• Cluster with 4 SSD DataNodes, single SSD per node


Cluster with 8 HDD Nodes, TestDFSIO with 8 maps Cluster with 4 SSD Nodes, TestDFSIO with 4 maps

Increased by 24%

Increased by 61%

32

0

200

400

600

800

1000

1200

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

File Size (GB)

IPoIB (QDR) OSU-IB (QDR)

0

500

1000

1500

2000

2500

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

File Size (GB)



Evaluations using TestDFSIO

• Cluster with 4 DataNodes, 1 HDD per node


• Cluster with 4 DataNodes, 2 HDD per node


• 2 HDD vs 1 HDD

– 76% improvement for OSU-IB (QDR)

– 66% improvement for IPoIB (QDR)

Increased by 31%

33

0

500

1000

1500

2000

2500

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

File Size (GB)

10GigE-1HDD 10GigE-2HDD

IPoIB -1HDD (QDR) IPoIB -2HDD (QDR)

OSU-IB -1HDD (QDR) OSU-IB -2HDD (QDR)


34

Evaluations using Enhanced DFSIO of Intel HiBench on SDSC-Gordon

• Cluster with 32 DataNodes, single SSD per node

– 47% improvement in throughput over IPoIB (QDR) for 128GB file size

– 16% improvement in latency over IPoIB (QDR) for 128GB file size


Increased by 47% Reduced by 16%

0

500

1000

1500

2000

2500

32 64 128

Agg

rega

ted

Th

rou

ghp

ut

(MB

ps)

File Size (GB)

IPoIB (QDR)

OSU-IB (QDR)

0

20

40

60

80

100

120

140

160

32 64 128

Exec

uti

on

Tim

e (s

)

File Size (GB)

IPoIB (QDR)

OSU-IB (QDR)


Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede

• Cluster with 64 DataNodes, single HDD per node

– 64% improvement in throughput over IPoIB (FDR) for 256GB file size

– 37% improvement in latency over IPoIB (FDR) for 256GB file size

0

200

400

600

800

1000

1200

1400

64 128 256

Agg

rega

ted

Th

rou

ghp

ut

(MB

ps)

File Size (GB)

IPoIB (FDR)

OSU-IB (FDR) Increased by 64% Reduced by 37%

0

100

200

300

400

500

600

64 128 256

Exec

uti

on

Tim

e (s

)

File Size (GB)

IPoIB (FDR)

OSU-IB (FDR)


• Trends in High Performance Networking and Protocols



• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges




Design Overview of MapReduce with RDMA

MapReduce

Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

OSU Design

Applications




Job

Tracker

Task

Tracker

Map

Reduce



• JNI Layer bridges Java based MapReduce with communication library written in native code

• Design Features

– RDMA-based shuffle

– Prefetching and caching of map outputs

– In-memory merge

– Advanced optimization in overlapping

• map, shuffle, and merge

• shuffle, merge, and reduce



Evaluations using Sort

8 HDD DataNodes

• With 8 HDD DataNodes for 40GB sort

– 41% improvement over IPoIB (QDR)

– 39% improvement over UDA-IB (QDR)

38

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-

based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May

2013.


8 SSD DataNodes

• With 8 SSD DataNodes for 40GB sort


– 44% improvement over UDA-IB

(QDR)

0

50

100

150

200

250

300

350

400

450

500

25 30 35 40

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

10GigE IPoIB (QDR)

UDA-IB (QDR) OSU-IB (QDR)

0

50

100

150

200

250

300

350

400

450

500

25 30 35 40

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

10GigE IPoIB (QDR)

UDA-IB (QDR) OSU-IB (QDR)

Evaluations using TeraSort

• 100 GB TeraSort with 8 DataNodes with 2 HDD per node


– 41% improvement over UDA-IB (QDR)


0

200

400

600

800

1000

1200

1400

1600

1800

60 80 100

Job

Exe

cuti

on

Tim

e (

sec)

Data Size (GB)

10GigE IPoIB (QDR)UDA-IB (QDR) OSU-IB (QDR)

• For 240GB Sort in 64 nodes


with HDD used for HDFS

40

Performance Evaluation on Larger Clusters


0

200

400

600

800

1000

1200

Data Size: 60 GB Data Size: 120 GB Data Size: 240 GB

Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job

Exe

cuti

on

Tim

e (s

ec)

IPoIB (QDR) UDA-IB (QDR) OSU-IB (QDR)

Sort in OSU Cluster

0

100

200

300

400

500

600

700

Data Size: 80 GB Data Size: 160 GBData Size: 320 GB

Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job

Exe

cuti

on

Tim

e (s

ec)

IPoIB (FDR) UDA-IB (FDR) OSU-IB (FDR)

TeraSort in TACC Stampede

• For 320GB TeraSort in 64 nodes

– 38% improvement over IPoIB

(FDR) with HDD used for HDFS

• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size

• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

Evaluations using PUMA Workload


0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AdjList (30GB) SelfJoin (80GB) SeqCount (30GB) WordCount (30GB) InvertIndex (30GB)

No

rmal

ized

Exe

cuti

on

Tim

e

Benchmarks


• 50 small MapReduce jobs executed in a cluster size of 4

• Maximum performance benefit 24% over IPoIB (QDR)

• Average performance benefit 13% over IPoIB (QDR)

42

Evaluations using SWIM

15

20

25

30

35

40

1 6 11 16 21 26 31 36 41 46

Exe

cuti

on

Tim

e (

sec)

Job Number







• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges





Motivation – Detailed Analysis on HBase Put/Get

• HBase 1KB Put – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time

• HBase 1KB Get – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda,

Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12

0

20

40

60

80

100

120

140

160

180

200

IPoIB (DDR) 10GigE OSU-IB (DDR)

Tim

e (

us)

HBase Put 1KB

0

20

40

60

80

100

120

140

160

IPoIB (DDR) 10GigE OSU-IB (DDR)

Tim

e (

us)

HBase Get 1KB

Communication

Communication Preparation

Server Processing

Server Serialization

Client Processing

Client Serialization


HBase Single-Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE

0

50

100

150

200

250

300

350

400

450

1 2 4 8 16

Tim

e (

us)

No. of Clients

0

10000

20000

30000

40000

50000

60000

1 2 4 8 16

Op

s/se

c

No. of Clients

10GigE

Latency Throughput

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

OSU-IB (DDR)

IPoIB (DDR)


HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark)

– 64 clients: 2.0 ms; 128 Clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients

• HBase Get latency

– 64 clients: 1.9 ms; 128 Clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients

0

1000

2000

3000

4000

5000

6000

7000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

10GigE

Read Latency Write Latency

OSU-IB (QDR) IPoIB (QDR)





• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges




Design Overview of Hadoop RPC with RDMA

Hadoop RPC

Verbs

RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..)

Applications




Our Design

Default

OSU Design

48

• Design Features

– JVM-bypassed buffer management

– RDMA or send/recv based adaptive communication

– Intelligent buffer allocation and adjustment for serialization





• JNI Layer bridges Java based RPC with communication library written in native code

Hadoop RPC with RDMA - Gain in Latency and Throughput

• Hadoop RPC with RDMA PingPong Latency

– 1 byte: 30 us; 4 KB: 38 us

– 59%-62% and 60%-63% improvements compared with the performance on 10 GigE

and IPoIB (32Gbps), respectively

• Hadoop RPC with RDMA Throughput

– 512 bytes & 56 clients: 170.63 Kops/sec

– 129% and 107% improvements compared with the peak performance on 10 GigE

and IPoIB (32Gbps), respectively


0

20

40

60

80

100

120

1 2 4 8 16 32 64 128 256 512 1024 2048 4096

Late

ncy

(u

s)

Payload Size (Byte)


0

20

40

60

80

100

120

140

160

180

8 16 24 32 40 48 56 64

Thro

ugh

pu

t (K

op

s/Se

c)

Number of Clients





• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges




• Testbeds

– OSU RI Cluster (HDD/SSD, QDR, 10GigE)

– TACC-Stampede (HDD, FDR)

– SDSC-Gordon (SSD, QDR)

• Benchmarks

– TestDFSIO

– RandomWriter & Sort

– TeraGen & TeraSort

• Benefits of New Features

– Parallel Replication vs. Pipeline Replication


Performance Benefits of RDMA for Apache Hadoop 0.9.9

0

500

1000

1500

2000

2500

3000

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

Size (GB)


• For Latency,

– 13% improvement over IPoIB, 18%

improvement over 10GigE for 5 GB file size



52

Performance Benefits – TestDFSIO (SSD)

Cluster with 4 Nodes, TestDFSIO with a total of 16 maps


0

5

10

15

20

25

30

35

40

45

5 10 15 20

Job

Exe

cuti

on

Tim

e (s

ec)

Size (GB)


• For Throughput,





Reduced by 27% Increased by 43%

• For Latency,

– 36-38% improvement over

IPoIB for 80-120 GB file size

53

Performance Benefits – TestDFSIO in TACC Stampede

Cluster with 32 Nodes with HDD, TestDFSIO with a total 128 maps


0

50

100

150

200

250

300

350

400

80 100 120

Job

Exe

cuti

on

Tim

e (s

ec)

Size (GB)

IPoIB (FDR) OSU-IB (FDR)

0

100

200

300

400

500

600

700

80 100 120

Tota

l Th

rou

ghp

ut

(MB

ps)

Size (GB)


• For Throughput,

– 60% improvement over

IPoIB for 80-120 GB file size

Reduced by 36%

Increased by 60%

0

50

100

150

200

250

300

350

400

50 GB 100 GB 200 GB 300 GB

Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job

Exe

cuti

on

Tim

e (

sec)


0

20

40

60

80

100

120

140

160

50 GB 100 GB 200 GB 300 GB

Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job

Exe

cuti

on

Tim

e (

sec)


Performance Benefits – RandomWriter & Sort in SDSC-Gordon

• RandomWriter in a cluster of single SSD per

node

– 16% improvement over IPoIB for 50GB in

a cluster of 16

– 20% improvement over IPoIB for 300GB in

a cluster of 64

Reduced by 20% Reduced by 36%


• Sort in a cluster of single SSD per node

– 20% improvement over IPoIB for

50GB in a cluster of 16

– 36% improvement over IPoIB for

300GB in a cluster of 64

0

200

400

600

800

1000

1200

80 100 120

Job

Exe

cuti

on

Tim

e (s

ec)

Sort Size (GB)


0

100

200

300

400

500

600

80 100 120

Job

Exe

cuti

on

Tim

e (

sec)

Size (GB)


55

Performance Benefits – TeraGen & TeraSort in TACC-Stampede

Cluster with 16 Nodes with a total of 64 maps and 32 reduces


• TeraGen with single HDD per

node


for 120GB data

• TeraSort with single HDD per

node


for 120GB data

Reduced by 17% Reduced by 38%

0

200

400

600

800

1000

1200

20 40 60

Tota

l Th

rou

ghp

ut

(MB

ps)

Size (GB)

10GigE IPoIB (QDR)OSU-IB-Pipeline (QDR) OSU-IB-Parallel (QDR)

0

50

100

150

200

250

300

350

20 40 60

Job

Exe

cuti

on

Tim

e (s

ec)

Size (GB)

10GigE

IPoIB (QDR)

OSU-IB-Pipeline (QDR)

OSU-IB-Parallel (QDR)

• For Latency,

– 11% improvement over OSU-IB-Pipeline for

20 GB file size

– 8% improvement over OSU-IB-Pipeline for 60

GB file size

56

Benefits of Parallel Replication – TestDFSIO

Cluster with 8 Nodes, TestDFSIO with a total of 32 maps


• For Throughput,


20 GB file size


60 GB file size

Reduced by 8%

Increased by 10%





• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges





– 44% improvement over IPoIB (FDR)

58

Performance Improvement Potential of MapReduce over Lustre on HPC Clusters – An Initial Study


Sort in TACC Stampede Sort in SDSC Gordon



0

200

400

600

800

1000

1200

300 400 500

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

0

50

100

150

200

250

60 80 100

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (QDR)

OSU-IB (QDR)

• Can more optimizations be achieved by leveraging more features of Lustre?





• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges




Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA and Sockets worker threads

Sockets Client

RDMA Client

Master Thread

Sockets Worker Thread

Verbs Worker Thread

Sockets Worker Thread

Verbs Worker Thread

Shared

Data

Memory Slabs Items

…

1

1

2

2


61

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us

– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost factor of four improvement over 10GE (TOE) for 2K bytes on

the DDR cluster

Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)

0

10

20

30

40

50

60

70

80

90

100

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (

us)

Message Size

0

10

20

30

40

50

60

70

80

90

100

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (

us)

Message Size

IPoIB (QDR)

10GigE (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

Memcached Get Latency (Small Message)


Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Time%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Time%(us)%

Message%Size%

IPoIB (DDR)

10GigE (DDR)

OSU-RC-IB (DDR)

OSU-UD-IB (DDR)


Memcached Get TPS (4byte)

• Memcached Get transactions per second for 4 bytes

– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB

0

200

400

600

800

1000

1200

1400

1600

1 2 4 8 16 32 64 128 256 512 800 1K

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r se

con

d (

TPS)

No. of Clients

0

200

400

600

800

1000

1200

1400

1600

4 8

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r se

con

d (

TPS)

No. of Clients

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Time%(us)%

Message%Size%

IPoIB (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

1

10

100

10001 2 4 8

16

32

64

12

8

25

6

51

2

1K

2K

4K

Tim

e (

us)

Message Size

OSU-IB (FDR)

IPoIB (FDR)

0

100

200

300

400

500

600

700

16 32 64 128 256 512 1024 2048 4080

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r Se

con

d (

TPS)

No. of Clients

63

• Memcached Get latency

– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us

– 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us

• Memcached Throughput (4bytes)

– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s

– Nearly 2X improvement in throughput

Memcached GET Latency Memcached Throughput

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

Latency Reduced by nearly 20X

2X



Application Level Evaluation – Olio Benchmark

• Olio Benchmark

– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients

• 4X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design

0

500

1000

1500

2000

2500

64 128 256 512 1024

Tim

e (

ms)

No. of Clients

0

10

20

30

40

50

60

1 4 8

Tim

e (

ms)

No. of Clients

IPoIB (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

OSU-Hybrid-IB (QDR)

Time%(ms)%

No.%of%Clients%


Application Level Evaluation – Real Application Workloads

• Real Application Workload – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients

• 12X times better than IPoIB for 8 clients • Hybrid design achieves comparable performance to that of pure RC design

0

10

20

30

40

50

60

70

80

90

1 4 8

Tim

e (

ms)

No. of Clients

0

50

100

150

200

250

300

350

64 128 256 512 1024

Tim

e (

ms)

No. of Clients

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.

Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached

design for InfiniBand Clusters using Hybrid Transport, CCGrid’12

IPoIB (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

OSU-Hybrid-IB (QDR)

Time%(ms)%

No.%of%Clients%





• HDFS

• MapReduce

• HBase

• RPC



– Memcached

• Open Challenges








(HDD and SSD)


Applications



Other Protocols?



QoS



Virtualization

Benchmarks

RDMA Protocol

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges


Upper level

Changes?

67

• Multi-threading and Synchronization

– Multi-threaded model exploration

– Fine-grained synchronization and lock-free design

– Unified helper threads for different components

– Multi-endpoint design to support multi-threading communications

• QoS and Virtualization

– Network virtualization and locality-aware communication for Big

Data middleware

– Hardware-level virtualization support for End-to-End QoS

– I/O scheduling and storage virtualization

– Live migration


More Challenges

• Support of Accelerators

– Efficient designs for Big Data middleware to take advantage of

NVIDA GPGPUs and Intel MICs

– Offload computation-intensive workload to accelerators

– Explore maximum overlapping between communication and

offloaded computation

• Fault Tolerance Enhancements

– Novel data replication schemes for high-efficiency fault tolerance

– Exploration of light-weight fault tolerance mechanisms for Big Data

processing

• Support of Parallel File Systems

– Optimize Big Data middleware over parallel file systems (e.g.

Lustre) on modern HPC clusters


More Challenges (Cont’d)

• The current benchmarks provide some performance

behavior

• However, do not provide any information to the

designer/developer on:

– What is happening at the lower-layer?

– Where the benefits are coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed and what will

be its impact?

– Can performance gain/loss at the lower-layer be correlated to the

performance gain/loss observed at the upper layer?


Are the Current Benchmarks Sufficient for Big Data?





(HDD and SSD)


Applications



Other Protocols?



QoS



Virtualization

Benchmarks

RDMA Protocols

Challenges in Benchmarking of RDMA-based Designs

71

Current

Benchmarks

No Benchmarks

Correlation?


OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to – Compare performance of different MPI libraries on various networks and systems

– Validate low-level functionalities

– Provide insights to the underlying MPI-level designs

• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth

• Extended later to – MPI-2 one-sided

– Collectives

– GPU-aware data movement

– OpenSHMEM (point-to-point and collectives)

– UPC

• Has become an industry standard

• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems

• Available from http://mvapich.cse.ohio-state.edu/benchmarks

• Available in an integrated manner with MVAPICH2 stack


http://mvapich.cse.ohio-state.edu/benchmarks







(HDD and SSD)


Applications



Other Protocols?



QoS



Virtualization

Benchmarks

RDMA Protocols

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications


Applications-

Level

Benchmarks

Micro-

Benchmarks

• Upcoming RDMA for Apache Hadoop Releases will support

– HBase

– Hadoop 2.0.x

• Memcached-RDMA

• Exploration of other components (Threading models, QoS,

Virtualization, Accelerators, etc.)

• Advanced designs with upper-level changes and optimizations

• Comprehensive set of Benchmarks

• More details at http://hadoop-rdma.cse.ohio-state.edu


Future Plans of OSU Big Data Project

74

• Presented an overview of Big Data, Hadoop and Memcached

• Provided an overview of Networking Technologies

• Discussed challenges in accelerating Hadoop

• Presented initial designs to take advantage of InfiniBand/RDMA

for HDFS, MapReduce, RPC, HBase and Memcached

• Results are promising

• Many other open issues need to be solved

• Will enable Big Data community to take advantage of modern

HPC technologies to carry out their analytics in a fast and scalable

manner


Concluding Remarks

75


Personnel Acknowledgments

Current Students

– S. Chakraborthy (Ph.D.)

– N. Islam (Ph.D.)

– J. Jose (Ph.D.)

– M. Li (Ph.D.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekhar (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

76

Past Research Scientist – S. Sur

Current Post-Docs

– X. Lu

– K. Hamidouche

Current Programmers

– M. Arnold

– J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– R. Shir (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associate

– H. Subramoni

Past Programmers

– D. Bureddy


Pointers

http://nowlab.cse.ohio-state.edu

[email protected]


77

http://mvapich.cse.ohio-state.edu

http://hadoop-rdma.cse.ohio-state.edu

http://nowlab.cse.ohio-state.edu/




mailto:[email protected]






http://mvapich.cse.ohio-state.edu/









big data: hadoop and memcached - hpc advisory …...big data processing with hadoop •the...

Documents