big data: hadoop and memcached - hpc advisory …...big data processing with hadoop •the...

77
Big Data: Hadoop and Memcached Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Talk at HPC Advisory Council Switzerland Conference (2014) by

Upload: others

Post on 28-May-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Big Data: Hadoop and Memcached

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Switzerland Conference (2014)

by

Page 2: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Introduction to Big Data Applications and Analytics

• Big Data has become the one of the most

important elements of business analytics

• Provides groundbreaking opportunities

for enterprise information management

and decision making

• The amount of data is exploding;

companies are capturing and digitizing

more information than ever

• The rate of information growth appears

to be exceeding Moore’s Law

2 HPC Advisory Council Switzerland Conference, Apr '14

• Commonly accepted 3V’s of Big Data • Volume, Velocity, Variety Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

• 5V’s of Big Data – 3V + Value, Veracity

Page 3: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Scientific Data Management, Analysis, and Visualization

• Applications examples

– Climate modeling

– Combustion

– Fusion

– Astrophysics

– Bioinformatics

• Data Intensive Tasks

– Runs large-scale simulations on supercomputers

– Dump data on parallel storage systems

– Collect experimental / observational data

– Move experimental / observational data to analysis sites

– Visual analytics – help understand data visually

HPC Advisory Council Switzerland Conference, Apr '14 3

Not Only in Internet Services - Big Data in Scientific Domains

Page 4: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Hadoop: http://hadoop.apache.org

– The most popular framework for Big Data Analytics

– HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.

• Spark: http://spark-project.org

– Provides primitives for in-memory cluster computing; Jobs can load data into memory and

query it repeatedly

• Shark: http://shark.cs.berkeley.edu

– A fully Apache Hive-compatible data warehousing system

• Storm: http://storm-project.net

– A distributed real-time computation system for real-time analytics, online machine learning,

continuous computation, etc.

• S4: http://incubator.apache.org/s4

– A distributed system for processing continuous unbounded streams of data

• Web 2.0: RDBMS + Memcached (http://memcached.org)

– Memcached: A high-performance, distributed memory object caching systems

HPC Advisory Council Switzerland Conference, Apr '14 4

Typical Solutions or Architectures for Big Data Applications and Analytics

Page 5: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Focuses on large data and data analysis

• Hadoop (e.g. HDFS, MapReduce, HBase, RPC) environment

is gaining a lot of momentum

• http://wiki.apache.org/hadoop/PoweredBy

HPC Advisory Council Switzerland Conference, Apr '14 5

Who Are Using Hadoop?

Page 6: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Scale: 42,000 nodes, 365PB HDFS, 10M daily slot hours • http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html

• A new Jim Gray’ Sort Record: 1.42 TB/min over 2,100 nodes • Hadoop 0.23.7, an early branch of the Hadoop 2.X

• http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html

HPC Advisory Council Switzerland Conference, Apr '14 6

Hadoop at Yahoo!

http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html

Page 7: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

7 HPC Advisory Council Switzerland Conference, Apr '14

Page 8: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Big Data Processing with Hadoop

• The open-source implementation of MapReduce programming model for Big Data Analytics

• Major components

– HDFS

– MapReduce

• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase

• Model scales but high amount of communication during intermediate phases can be further optimized

8

HDFS

MapReduce

Hadoop Framework

User Applications

HBase

Hadoop Common (RPC)

HPC Advisory Council Stanford Conference, Feb '14

Page 9: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Hadoop MapReduce Architecture & Operations

Disk Operations

• Map and Reduce Tasks carry out the total job execution

• Communication in shuffle phase uses HTTP over Java Sockets

9 HPC Advisory Council Switzerland Conference, Apr '14

Page 10: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop;

highly reliable and fault-tolerant

• Adopted by many reputed

organizations

– eg: Facebook, Yahoo!

• NameNode: stores the file system namespace

• DataNode: stores data blocks

• Developed in Java for platform-

independence and portability

• Uses sockets for communication!

(HDFS Architecture)

10

RPC

RPC

RPC

RPC

HPC Advisory Council Switzerland Conference, Apr '14

Client

Page 11: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 11

System-Level Interaction Between Clients and Data Nodes in HDFS

High

Performance

Networks

(HDD/SSD)

(HDD/SSD)

(HDD/SSD)

...

...

(HDFS Data Nodes)(HDFS Clients)

...

...

Page 12: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 12

HBase

• Apache Hadoop Database

(http://hbase.apache.org/)

• Semi-structured database,

which is highly scalable

• Integral part of many

datacenter applications

– eg:- Facebook Social Inbox

• Developed in Java for platform-

independence and portability

• Uses sockets for

communication!

ZooKeeper

HBase

Client

HMaster

HRegion

Servers

HDFS

(HBase Architecture)

Page 13: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 13

System-Level Interaction Among HBase Clients, Region Servers, and Data Nodes

(HBase Clients) (HRegion Servers) (Data Nodes)

(HDD/SSD)

(HDD/SSD)

(HDD/SSD)

...

......

... ...

...

High

Performance

Networks

High

Performance

Networks

Page 14: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 14

Memcached Architecture

• Integral part of Web 2.0 architecture

• Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes

– General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive

!"#$%"$#&

Web Frontend Servers (Memcached Clients)

(Memcached Servers)

(Database Servers)

High

Performance

Networks

High

Performance

Networks

Main

memoryCPUs

SSD HDD

High Performance Networks

... ...

...

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Page 15: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

15 HPC Advisory Council Switzerland Conference, Apr '14

Page 16: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 16

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

0

10

20

30

40

50

60

70

80

90

100

0

50

100

150

200

250

300

350

400

450

500

Pe

rce

nta

ge o

f C

lust

ers

Nu

mb

er

of

Clu

ste

rs

Timeline

Percentage of Clusters

Number of Clusters

Page 17: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC and Advanced Interconnects

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols

– InfiniBand

– 10 Gigabit Ethernet/iWARP

– RDMA over Converged Enhanced Ethernet (RoCE)

• Very Good Performance

– Low latency (few micro seconds)

– High Bandwidth (100 Gb/s with dual FDR InfiniBand)

– Low CPU overhead (5-10%)

• OpenFabrics software stack with IB, iWARP and RoCE interfaces are driving HPC systems

• Many such systems in Top500 list

17 HPC Advisory Council Switzerland Conference, Apr '14

Page 18: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

18

Trends of Networking Technologies in TOP500 Systems

Percentage share of InfiniBand is steadily increasing

Interconnect Family – Systems Share

HPC Advisory Council Switzerland Conference, Apr '14

Page 19: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Message Passing Interface (MPI)

– Most commonly used programming model for HPC applications

• Parallel File Systems

• Storage Systems

• Almost 13 years of Research and Development since

InfiniBand was introduced in October 2000

• Other Programming Models are emerging to take

advantage of High-Performance Networks

– UPC

– OpenSHMEM

– Hybrid MPI+PGAS (OpenSHMEM and UPC)

19

Use of High-Performance Networks for Scientific Computing

HPC Advisory Council Switzerland Conference, Apr '14

Page 20: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 20

One-way Latency: MPI over IB with MVAPICH2

0.00

1.00

2.00

3.00

4.00

5.00

6.00Small Message Latency

Message Size (bytes)

Late

ncy

(u

s)

1.66

1.56

1.64

1.82

0.99 1.09

1.12

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

0.00

50.00

100.00

150.00

200.00

250.00Qlogic-DDRQlogic-QDRConnectX-DDRConnectX2-PCIe2-QDRConnectX3-PCIe3-FDRSandy-ConnectIB-DualFDRIvy-ConnectIB-DualFDR

Large Message Latency

Message Size (bytes)

Late

ncy

(u

s)

Page 21: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 21

Bandwidth: MPI over IB with MVAPICH2

0

2000

4000

6000

8000

10000

12000

14000Unidirectional Bandwidth

Ban

dw

idth

(M

Byt

es/

sec)

Message Size (bytes)

3280

3385

1917 1706

6343

12485

12810

0

5000

10000

15000

20000

25000

30000Qlogic-DDRQlogic-QDRConnectX-DDRConnectX2-PCIe2-QDRConnectX3-PCIe3-FDRSandy-ConnectIB-DualFDRIvy-ConnectIB-DualFDR

Bidirectional Bandwidth

Ban

dw

idth

(M

Byt

es/

sec)

Message Size (bytes)

3341 3704

4407

11643

6521

21025

24727

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch

ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

Page 22: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

22 HPC Advisory Council Switzerland Conference, Apr '14

Page 23: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Can High-Performance Interconnects Benefit Hadoop?

• Most of the current Hadoop environments use Ethernet infrastructure with Sockets

• Concerns for performance and scalability

• Usage of high-performance networks is beginning to draw interest from many companies

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?

• Can HPC Clusters with high-performance networks be used for Big Data applications using Hadoop?

23 HPC Advisory Council Switzerland Conference, Apr '14

Page 24: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Designing Communication and I/O Libraries for Big Data Systems: Challenges

HPC Advisory Council Switzerland Conference, Apr '14 24

Big Data Middleware (HDFS, MapReduce, HBase and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

Page 25: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 25

Common Protocols using Open Fabrics

Application/ Middleware

Verbs Sockets

Application /Middleware Interface

SDP

RDMA

SDP

InfiniBand Adapter

InfiniBand Switch

RDMA

IB Verbs

InfiniBand Adapter

InfiniBand Switch

User

space

RDMA

RoCE

RoCE Adapter

User

space

Ethernet Switch

TCP/IP

Ethernet Driver

Kernel Space

Protocol

InfiniBand Adapter

InfiniBand Switch

IPoIB

Ethernet Adapter

Ethernet Switch

Adapter

Switch

1/10/40 GigE

iWARP

Ethernet Switch

iWARP

iWARP Adapter

User

space IPoIB

TCP/IP

Ethernet Adapter

Ethernet Switch

10/40 GigE-TOE

Hardware Offload

RSockets

InfiniBand Adapter

InfiniBand Switch

User

space

RSockets

Page 26: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Can Hadoop be Accelerated with High-Performance Networks and Protocols?

26

Enhanced Designs

Application

Accelerated Sockets

10 GigE or InfiniBand

Verbs / Hardware Offload

Current Design

Application

Sockets

1/10 GigE Network

• Sockets not designed for high-performance

– Stream semantics often mismatch for upper layers (Hadoop and HBase)

– Zero-copy not available for non-blocking sockets

Our Approach

Application

OSU Design

10 GigE or InfiniBand

Verbs Interface

HPC Advisory Council Switzerland Conference, Apr '14

Page 27: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Two Broad Approaches

• All components of Hadoop (HDFS, MapReduce, RPC, HBase, etc.) working on native IB verbs

• Hadoop Map-Reduce over verbs for existing HPC clusters with Lustre

27 HPC Advisory Council Switzerland Conference, Apr '14

Page 28: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in Networking Technologies and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

28 HPC Advisory Council Switzerland Conference, Apr '14

Page 29: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• High-Performance Design of Hadoop over RDMA-enabled Interconnects

– High performance design with native InfiniBand and RoCE support at the verbs-

level for HDFS, MapReduce, and RPC components

– Easily configurable for native InfiniBand, RoCE and the traditional sockets-

based support (Ethernet and InfiniBand with IPoIB)

• Current release: 0.9.9 (03/31/14)

– Based on Apache Hadoop 1.2.1

– Compliant with Apache Hadoop 1.2.1 APIs and applications

– Tested with

• Mellanox InfiniBand adapters (DDR, QDR and FDR)

• RoCE

• Various multi-core platforms

• Different file systems with disks and SSDs

– http://hadoop-rdma.cse.ohio-state.edu

RDMA for Apache Hadoop Project

29 HPC Advisory Council Switzerland Conference, Apr '14

Page 30: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Design Overview of HDFS with RDMA

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based HDFS with communication library written in native code

30

HDFS

Verbs

RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..)

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Write Others

OSU Design

• Design Features

– RDMA-based HDFS write

– RDMA-based HDFS replication

– Parallel replication support

– On-demand connection setup

– InfiniBand/RoCE support

HPC Advisory Council Switzerland Conference, Apr '14

Page 31: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Communication Time in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE

• Similar improvements are obtained for SSD DataNodes

Reduced by 30%

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,

High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

31

0

5

10

15

20

25

2GB 4GB 6GB 8GB 10GB

Co

mm

un

icat

ion

Tim

e (s

)

File Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

HPC Advisory Council Switzerland Conference, Apr '14

Page 32: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Evaluations using TestDFSIO

• Cluster with 8 HDD DataNodes, single disk per node

– 24% improvement over IPoIB (QDR) for 20GB file size

• Cluster with 4 SSD DataNodes, single SSD per node

– 61% improvement over IPoIB (QDR) for 20GB file size

Cluster with 8 HDD Nodes, TestDFSIO with 8 maps Cluster with 4 SSD Nodes, TestDFSIO with 4 maps

Increased by 24%

Increased by 61%

32

0

200

400

600

800

1000

1200

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

File Size (GB)

IPoIB (QDR) OSU-IB (QDR)

0

500

1000

1500

2000

2500

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

File Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

HPC Advisory Council Switzerland Conference, Apr '14

Page 33: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Evaluations using TestDFSIO

• Cluster with 4 DataNodes, 1 HDD per node

– 24% improvement over IPoIB (QDR) for 20GB file size

• Cluster with 4 DataNodes, 2 HDD per node

– 31% improvement over IPoIB (QDR) for 20GB file size

• 2 HDD vs 1 HDD

– 76% improvement for OSU-IB (QDR)

– 66% improvement for IPoIB (QDR)

Increased by 31%

33

0

500

1000

1500

2000

2500

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

File Size (GB)

10GigE-1HDD 10GigE-2HDD

IPoIB -1HDD (QDR) IPoIB -2HDD (QDR)

OSU-IB -1HDD (QDR) OSU-IB -2HDD (QDR)

HPC Advisory Council Switzerland Conference, Apr '14

Page 34: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

34

Evaluations using Enhanced DFSIO of Intel HiBench on SDSC-Gordon

• Cluster with 32 DataNodes, single SSD per node

– 47% improvement in throughput over IPoIB (QDR) for 128GB file size

– 16% improvement in latency over IPoIB (QDR) for 128GB file size

HPC Advisory Council Switzerland Conference, Apr '14

Increased by 47% Reduced by 16%

0

500

1000

1500

2000

2500

32 64 128

Agg

rega

ted

Th

rou

ghp

ut

(MB

ps)

File Size (GB)

IPoIB (QDR)

OSU-IB (QDR)

0

20

40

60

80

100

120

140

160

32 64 128

Exec

uti

on

Tim

e (s

)

File Size (GB)

IPoIB (QDR)

OSU-IB (QDR)

Page 35: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 35

Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede

• Cluster with 64 DataNodes, single HDD per node

– 64% improvement in throughput over IPoIB (FDR) for 256GB file size

– 37% improvement in latency over IPoIB (FDR) for 256GB file size

0

200

400

600

800

1000

1200

1400

64 128 256

Agg

rega

ted

Th

rou

ghp

ut

(MB

ps)

File Size (GB)

IPoIB (FDR)

OSU-IB (FDR) Increased by 64% Reduced by 37%

0

100

200

300

400

500

600

64 128 256

Exec

uti

on

Tim

e (s

)

File Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

Page 36: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

36 HPC Advisory Council Switzerland Conference, Apr '14

Page 37: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Design Overview of MapReduce with RDMA

MapReduce

Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

OSU Design

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Job

Tracker

Task

Tracker

Map

Reduce

37 HPC Advisory Council Switzerland Conference, Apr '14

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based MapReduce with communication library written in native code

• Design Features

– RDMA-based shuffle

– Prefetching and caching of map outputs

– In-memory merge

– Advanced optimization in overlapping

• map, shuffle, and merge

• shuffle, merge, and reduce

– On-demand connection setup

– InfiniBand/RoCE support

Page 38: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Evaluations using Sort

8 HDD DataNodes

• With 8 HDD DataNodes for 40GB sort

– 41% improvement over IPoIB (QDR)

– 39% improvement over UDA-IB (QDR)

38

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-

based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May

2013.

HPC Advisory Council Switzerland Conference, Apr '14

8 SSD DataNodes

• With 8 SSD DataNodes for 40GB sort

– 52% improvement over IPoIB (QDR)

– 44% improvement over UDA-IB

(QDR)

0

50

100

150

200

250

300

350

400

450

500

25 30 35 40

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

10GigE IPoIB (QDR)

UDA-IB (QDR) OSU-IB (QDR)

0

50

100

150

200

250

300

350

400

450

500

25 30 35 40

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

10GigE IPoIB (QDR)

UDA-IB (QDR) OSU-IB (QDR)

Page 39: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Evaluations using TeraSort

• 100 GB TeraSort with 8 DataNodes with 2 HDD per node

– 51% improvement over IPoIB (QDR)

– 41% improvement over UDA-IB (QDR)

39 HPC Advisory Council Switzerland Conference, Apr '14

0

200

400

600

800

1000

1200

1400

1600

1800

60 80 100

Job

Exe

cuti

on

Tim

e (

sec)

Data Size (GB)

10GigE IPoIB (QDR)UDA-IB (QDR) OSU-IB (QDR)

Page 40: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• For 240GB Sort in 64 nodes

– 40% improvement over IPoIB (QDR)

with HDD used for HDFS

40

Performance Evaluation on Larger Clusters

HPC Advisory Council Switzerland Conference, Apr '14

0

200

400

600

800

1000

1200

Data Size: 60 GB Data Size: 120 GB Data Size: 240 GB

Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job

Exe

cuti

on

Tim

e (s

ec)

IPoIB (QDR) UDA-IB (QDR) OSU-IB (QDR)

Sort in OSU Cluster

0

100

200

300

400

500

600

700

Data Size: 80 GB Data Size: 160 GBData Size: 320 GB

Cluster Size: 16 Cluster Size: 32 Cluster Size: 64

Job

Exe

cuti

on

Tim

e (s

ec)

IPoIB (FDR) UDA-IB (FDR) OSU-IB (FDR)

TeraSort in TACC Stampede

• For 320GB TeraSort in 64 nodes

– 38% improvement over IPoIB

(FDR) with HDD used for HDFS

Page 41: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size

• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

Evaluations using PUMA Workload

41 HPC Advisory Council Switzerland Conference, Apr '14

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AdjList (30GB) SelfJoin (80GB) SeqCount (30GB) WordCount (30GB) InvertIndex (30GB)

No

rmal

ized

Exe

cuti

on

Tim

e

Benchmarks

10GigE IPoIB (QDR) OSU-IB (QDR)

Page 42: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• 50 small MapReduce jobs executed in a cluster size of 4

• Maximum performance benefit 24% over IPoIB (QDR)

• Average performance benefit 13% over IPoIB (QDR)

42

Evaluations using SWIM

15

20

25

30

35

40

1 6 11 16 21 26 31 36 41 46

Exe

cuti

on

Tim

e (

sec)

Job Number

IPoIB (QDR) OSU-IB (QDR)

HPC Advisory Council Switzerland Conference, Apr '14

Page 43: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

43 HPC Advisory Council Switzerland Conference, Apr '14

Page 44: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 44

Motivation – Detailed Analysis on HBase Put/Get

• HBase 1KB Put – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time

• HBase 1KB Get – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda,

Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12

0

20

40

60

80

100

120

140

160

180

200

IPoIB (DDR) 10GigE OSU-IB (DDR)

Tim

e (

us)

HBase Put 1KB

0

20

40

60

80

100

120

140

160

IPoIB (DDR) 10GigE OSU-IB (DDR)

Tim

e (

us)

HBase Get 1KB

Communication

Communication Preparation

Server Processing

Server Serialization

Client Processing

Client Serialization

Page 45: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 45

HBase Single-Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE

0

50

100

150

200

250

300

350

400

450

1 2 4 8 16

Tim

e (

us)

No. of Clients

0

10000

20000

30000

40000

50000

60000

1 2 4 8 16

Op

s/se

c

No. of Clients

10GigE

Latency Throughput

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

OSU-IB (DDR)

IPoIB (DDR)

Page 46: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 46

HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark)

– 64 clients: 2.0 ms; 128 Clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients

• HBase Get latency

– 64 clients: 1.9 ms; 128 Clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients

0

1000

2000

3000

4000

5000

6000

7000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

10GigE

Read Latency Write Latency

OSU-IB (QDR) IPoIB (QDR)

Page 47: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

47 HPC Advisory Council Switzerland Conference, Apr '14

Page 48: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Design Overview of Hadoop RPC with RDMA

Hadoop RPC

Verbs

RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..)

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Our Design

Default

OSU Design

48

• Design Features

– JVM-bypassed buffer management

– RDMA or send/recv based adaptive communication

– Intelligent buffer allocation and adjustment for serialization

– On-demand connection setup

– InfiniBand/RoCE support

HPC Advisory Council Switzerland Conference, Apr '14

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based RPC with communication library written in native code

Page 49: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Hadoop RPC with RDMA - Gain in Latency and Throughput

• Hadoop RPC with RDMA PingPong Latency

– 1 byte: 30 us; 4 KB: 38 us

– 59%-62% and 60%-63% improvements compared with the performance on 10 GigE

and IPoIB (32Gbps), respectively

• Hadoop RPC with RDMA Throughput

– 512 bytes & 56 clients: 170.63 Kops/sec

– 129% and 107% improvements compared with the peak performance on 10 GigE

and IPoIB (32Gbps), respectively

49 HPC Advisory Council Switzerland Conference, Apr '14

0

20

40

60

80

100

120

1 2 4 8 16 32 64 128 256 512 1024 2048 4096

Late

ncy

(u

s)

Payload Size (Byte)

10GigE IPoIB (QDR) OSU-IB (QDR)

0

20

40

60

80

100

120

140

160

180

8 16 24 32 40 48 56 64

Thro

ugh

pu

t (K

op

s/Se

c)

Number of Clients

Page 50: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

50 HPC Advisory Council Switzerland Conference, Apr '14

Page 51: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Testbeds

– OSU RI Cluster (HDD/SSD, QDR, 10GigE)

– TACC-Stampede (HDD, FDR)

– SDSC-Gordon (SSD, QDR)

• Benchmarks

– TestDFSIO

– RandomWriter & Sort

– TeraGen & TeraSort

• Benefits of New Features

– Parallel Replication vs. Pipeline Replication

HPC Advisory Council Switzerland Conference, Apr '14 51

Performance Benefits of RDMA for Apache Hadoop 0.9.9

Page 52: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

0

500

1000

1500

2000

2500

3000

5 10 15 20

Tota

l Th

rou

ghp

ut

(MB

ps)

Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

• For Latency,

– 13% improvement over IPoIB, 18%

improvement over 10GigE for 5 GB file size

– 27% improvement over IPoIB, 30%

improvement over 10GigE for 20 GB file size

52

Performance Benefits – TestDFSIO (SSD)

Cluster with 4 Nodes, TestDFSIO with a total of 16 maps

HPC Advisory Council Switzerland Conference, Apr '14

0

5

10

15

20

25

30

35

40

45

5 10 15 20

Job

Exe

cuti

on

Tim

e (s

ec)

Size (GB)

10GigE IPoIB (QDR) OSU-IB (QDR)

• For Throughput,

– 47% improvement over IPoIB, 64%

improvement over 10GigE for 5 GB file size

– 43% improvement over IPoIB, 58%

improvement over 10GigE for 20 GB file size

Reduced by 27% Increased by 43%

Page 53: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• For Latency,

– 36-38% improvement over

IPoIB for 80-120 GB file size

53

Performance Benefits – TestDFSIO in TACC Stampede

Cluster with 32 Nodes with HDD, TestDFSIO with a total 128 maps

HPC Advisory Council Switzerland Conference, Apr '14

0

50

100

150

200

250

300

350

400

80 100 120

Job

Exe

cuti

on

Tim

e (s

ec)

Size (GB)

IPoIB (FDR) OSU-IB (FDR)

0

100

200

300

400

500

600

700

80 100 120

Tota

l Th

rou

ghp

ut

(MB

ps)

Size (GB)

IPoIB (FDR) OSU-IB (FDR)

• For Throughput,

– 60% improvement over

IPoIB for 80-120 GB file size

Reduced by 36%

Increased by 60%

Page 54: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

0

50

100

150

200

250

300

350

400

50 GB 100 GB 200 GB 300 GB

Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job

Exe

cuti

on

Tim

e (

sec)

IPoIB (QDR) OSU-IB (QDR)

0

20

40

60

80

100

120

140

160

50 GB 100 GB 200 GB 300 GB

Cluster: 16 Cluster: 32 Cluster: 48 Cluster: 64

Job

Exe

cuti

on

Tim

e (

sec)

IPoIB (QDR) OSU-IB (QDR)

Performance Benefits – RandomWriter & Sort in SDSC-Gordon

• RandomWriter in a cluster of single SSD per

node

– 16% improvement over IPoIB for 50GB in

a cluster of 16

– 20% improvement over IPoIB for 300GB in

a cluster of 64

Reduced by 20% Reduced by 36%

54 HPC Advisory Council Switzerland Conference, Apr '14

• Sort in a cluster of single SSD per node

– 20% improvement over IPoIB for

50GB in a cluster of 16

– 36% improvement over IPoIB for

300GB in a cluster of 64

Page 55: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

0

200

400

600

800

1000

1200

80 100 120

Job

Exe

cuti

on

Tim

e (s

ec)

Sort Size (GB)

IPoIB (FDR) OSU-IB (FDR)

0

100

200

300

400

500

600

80 100 120

Job

Exe

cuti

on

Tim

e (

sec)

Size (GB)

IPoIB (FDR) OSU-IB (FDR)

55

Performance Benefits – TeraGen & TeraSort in TACC-Stampede

Cluster with 16 Nodes with a total of 64 maps and 32 reduces

HPC Advisory Council Switzerland Conference, Apr '14

• TeraGen with single HDD per

node

– 17% improvement over IPoIB

for 120GB data

• TeraSort with single HDD per

node

– 38% improvement over IPoIB

for 120GB data

Reduced by 17% Reduced by 38%

Page 56: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

0

200

400

600

800

1000

1200

20 40 60

Tota

l Th

rou

ghp

ut

(MB

ps)

Size (GB)

10GigE IPoIB (QDR)OSU-IB-Pipeline (QDR) OSU-IB-Parallel (QDR)

0

50

100

150

200

250

300

350

20 40 60

Job

Exe

cuti

on

Tim

e (s

ec)

Size (GB)

10GigE

IPoIB (QDR)

OSU-IB-Pipeline (QDR)

OSU-IB-Parallel (QDR)

• For Latency,

– 11% improvement over OSU-IB-Pipeline for

20 GB file size

– 8% improvement over OSU-IB-Pipeline for 60

GB file size

56

Benefits of Parallel Replication – TestDFSIO

Cluster with 8 Nodes, TestDFSIO with a total of 32 maps

HPC Advisory Council Switzerland Conference, Apr '14

• For Throughput,

– 12% improvement over OSU-IB-Pipeline for

20 GB file size

– 10% improvement over OSU-IB-Pipeline for

60 GB file size

Reduced by 8%

Increased by 10%

Page 57: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

57 HPC Advisory Council Switzerland Conference, Apr '14

Page 58: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• For 500GB Sort in 64 nodes

– 44% improvement over IPoIB (FDR)

58

Performance Improvement Potential of MapReduce over Lustre on HPC Clusters – An Initial Study

HPC Advisory Council Switzerland Conference, Apr '14

Sort in TACC Stampede Sort in SDSC Gordon

• For 100GB Sort in 16 nodes

– 33% improvement over IPoIB (QDR)

0

200

400

600

800

1000

1200

300 400 500

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (FDR)

OSU-IB (FDR)

0

50

100

150

200

250

60 80 100

Job

Exe

cuti

on

Tim

e (s

ec)

Data Size (GB)

IPoIB (QDR)

OSU-IB (QDR)

• Can more optimizations be achieved by leveraging more features of Lustre?

Page 59: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

59 HPC Advisory Council Switzerland Conference, Apr '14

Page 60: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• All other Memcached data structures are shared among RDMA and Sockets worker threads

Sockets Client

RDMA Client

Master Thread

Sockets Worker Thread

Verbs Worker Thread

Sockets Worker Thread

Verbs Worker Thread

Shared

Data

Memory Slabs Items

1

1

2

2

60 HPC Advisory Council Switzerland Conference, Apr '14

Page 61: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

61

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us

– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost factor of four improvement over 10GE (TOE) for 2K bytes on

the DDR cluster

Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)

0

10

20

30

40

50

60

70

80

90

100

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (

us)

Message Size

0

10

20

30

40

50

60

70

80

90

100

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (

us)

Message Size

IPoIB (QDR)

10GigE (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

Memcached Get Latency (Small Message)

HPC Advisory Council Switzerland Conference, Apr '14

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Time%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Time%(us)%

Message%Size%

IPoIB (DDR)

10GigE (DDR)

OSU-RC-IB (DDR)

OSU-UD-IB (DDR)

Page 62: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 62

Memcached Get TPS (4byte)

• Memcached Get transactions per second for 4 bytes

– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB

0

200

400

600

800

1000

1200

1400

1600

1 2 4 8 16 32 64 128 256 512 800 1K

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r se

con

d (

TPS)

No. of Clients

0

200

400

600

800

1000

1200

1400

1600

4 8

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r se

con

d (

TPS)

No. of Clients

Tim

e%(us)%

Message%Size%

Tim

e%(us)%

Message%Size%

Time%(us)%

Message%Size%

IPoIB (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

Page 63: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

1

10

100

10001 2 4 8

16

32

64

12

8

25

6

51

2

1K

2K

4K

Tim

e (

us)

Message Size

OSU-IB (FDR)

IPoIB (FDR)

0

100

200

300

400

500

600

700

16 32 64 128 256 512 1024 2048 4080

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r Se

con

d (

TPS)

No. of Clients

63

• Memcached Get latency

– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us

– 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us

• Memcached Throughput (4bytes)

– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s

– Nearly 2X improvement in throughput

Memcached GET Latency Memcached Throughput

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

Latency Reduced by nearly 20X

2X

HPC Advisory Council Switzerland Conference, Apr '14

Page 64: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 64

Application Level Evaluation – Olio Benchmark

• Olio Benchmark

– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients

• 4X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design

0

500

1000

1500

2000

2500

64 128 256 512 1024

Tim

e (

ms)

No. of Clients

0

10

20

30

40

50

60

1 4 8

Tim

e (

ms)

No. of Clients

IPoIB (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

OSU-Hybrid-IB (QDR)

Time%(ms)%

No.%of%Clients%

Page 65: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14 65

Application Level Evaluation – Real Application Workloads

• Real Application Workload – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients

• 12X times better than IPoIB for 8 clients • Hybrid design achieves comparable performance to that of pure RC design

0

10

20

30

40

50

60

70

80

90

1 4 8

Tim

e (

ms)

No. of Clients

0

50

100

150

200

250

300

350

64 128 256 512 1024

Tim

e (

ms)

No. of Clients

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.

Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached

design for InfiniBand Clusters using Hybrid Transport, CCGrid’12

IPoIB (QDR)

OSU-RC-IB (QDR)

OSU-UD-IB (QDR)

OSU-Hybrid-IB (QDR)

Time%(ms)%

No.%of%Clients%

Page 66: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Overview of Hadoop and Memcached

• Trends in High Performance Networking and Protocols

• Challenges in Accelerating Hadoop and Memcached

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• RPC

• Combination of components

– Hadoop Map-Reduce over Lustre

– Memcached

• Open Challenges

• Conclusion and Q&A

Presentation Outline

66 HPC Advisory Council Switzerland Conference, Apr '14

Page 67: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Big Data Middleware (HDFS, MapReduce, HBase and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

RDMA Protocol

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

HPC Advisory Council Switzerland Conference, Apr '14

Upper level

Changes?

67

Page 68: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Multi-threading and Synchronization

– Multi-threaded model exploration

– Fine-grained synchronization and lock-free design

– Unified helper threads for different components

– Multi-endpoint design to support multi-threading communications

• QoS and Virtualization

– Network virtualization and locality-aware communication for Big

Data middleware

– Hardware-level virtualization support for End-to-End QoS

– I/O scheduling and storage virtualization

– Live migration

HPC Advisory Council Switzerland Conference, Apr '14 68

More Challenges

Page 69: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Support of Accelerators

– Efficient designs for Big Data middleware to take advantage of

NVIDA GPGPUs and Intel MICs

– Offload computation-intensive workload to accelerators

– Explore maximum overlapping between communication and

offloaded computation

• Fault Tolerance Enhancements

– Novel data replication schemes for high-efficiency fault tolerance

– Exploration of light-weight fault tolerance mechanisms for Big Data

processing

• Support of Parallel File Systems

– Optimize Big Data middleware over parallel file systems (e.g.

Lustre) on modern HPC clusters

HPC Advisory Council Switzerland Conference, Apr '14 69

More Challenges (Cont’d)

Page 70: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• The current benchmarks provide some performance

behavior

• However, do not provide any information to the

designer/developer on:

– What is happening at the lower-layer?

– Where the benefits are coming from?

– Which design is leading to benefits or bottlenecks?

– Which component in the design needs to be changed and what will

be its impact?

– Can performance gain/loss at the lower-layer be correlated to the

performance gain/loss observed at the upper layer?

HPC Advisory Council Switzerland Conference, Apr '14 70

Are the Current Benchmarks Sufficient for Big Data?

Page 71: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Big Data Middleware (HDFS, MapReduce, HBase and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

RDMA Protocols

Challenges in Benchmarking of RDMA-based Designs

71

Current

Benchmarks

No Benchmarks

Correlation?

HPC Advisory Council Switzerland Conference, Apr '14

Page 72: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to – Compare performance of different MPI libraries on various networks and systems

– Validate low-level functionalities

– Provide insights to the underlying MPI-level designs

• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth

• Extended later to – MPI-2 one-sided

– Collectives

– GPU-aware data movement

– OpenSHMEM (point-to-point and collectives)

– UPC

• Has become an industry standard

• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems

• Available from http://mvapich.cse.ohio-state.edu/benchmarks

• Available in an integrated manner with MVAPICH2 stack

72 HPC Advisory Council Switzerland Conference, Apr '14

Page 73: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

Big Data Middleware (HDFS, MapReduce, HBase and Memcached)

Networking Technologies

(InfiniBand, 1/10/40GigE and Intelligent NICs)

Storage Technologies

(HDD and SSD)

Programming Models (Sockets)

Applications

Commodity Computing System Architectures

(Multi- and Many-core architectures and accelerators)

Other Protocols?

Communication and I/O Library

Point-to-Point Communication

QoS

Threaded Models and Synchronization

Fault-Tolerance I/O and File Systems

Virtualization

Benchmarks

RDMA Protocols

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

HPC Advisory Council Switzerland Conference, Apr '14 73

Applications-

Level

Benchmarks

Micro-

Benchmarks

Page 74: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Upcoming RDMA for Apache Hadoop Releases will support

– HBase

– Hadoop 2.0.x

• Memcached-RDMA

• Exploration of other components (Threading models, QoS,

Virtualization, Accelerators, etc.)

• Advanced designs with upper-level changes and optimizations

• Comprehensive set of Benchmarks

• More details at http://hadoop-rdma.cse.ohio-state.edu

HPC Advisory Council Switzerland Conference, Apr '14

Future Plans of OSU Big Data Project

74

Page 75: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

• Presented an overview of Big Data, Hadoop and Memcached

• Provided an overview of Networking Technologies

• Discussed challenges in accelerating Hadoop

• Presented initial designs to take advantage of InfiniBand/RDMA

for HDFS, MapReduce, RPC, HBase and Memcached

• Results are promising

• Many other open issues need to be solved

• Will enable Big Data community to take advantage of modern

HPC technologies to carry out their analytics in a fast and scalable

manner

HPC Advisory Council Switzerland Conference, Apr '14

Concluding Remarks

75

Page 76: Big Data: Hadoop and Memcached - HPC Advisory …...Big Data Processing with Hadoop •The open-source implementation of MapReduce programming model for Big Data Analytics •Major

HPC Advisory Council Switzerland Conference, Apr '14

Personnel Acknowledgments

Current Students

– S. Chakraborthy (Ph.D.)

– N. Islam (Ph.D.)

– J. Jose (Ph.D.)

– M. Li (Ph.D.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekhar (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

76

Past Research Scientist – S. Sur

Current Post-Docs

– X. Lu

– K. Hamidouche

Current Programmers

– M. Arnold

– J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– R. Shir (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associate

– H. Subramoni

Past Programmers

– D. Bureddy