high performance rdma based design for big data and · pdf filehigh performance rdma based...

High Performance RDMA based Design for Big Data and Web 2.0 Memcached

Dhabaleswar K. (DK) Panda The Ohio State University

E-mail: [email protected] http://www.cse.ohio-state.edu/~panda

Talk at OFA Developer Workshop (2013)

by

http://www.cse.ohio-state.edu/%7Epanda

Big Data

• Provides groundbreaking opportunities for enterprise information management and decision making

• The rate of information growth appears to be exceeding Moore’s Law

• The amount of data is exploding; companies are capturing and digitizing more information that ever

• 35 zettabytes of data will be generated and consumed by the end of this decade Courtesy: John Gantz and David Reinsel, "The Digital Universe Decade

– Are You Ready?" May 2010. http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf

2 OFA Developer Workshop, April '13

http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf

http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf

• Message Passing Interface (MPI)

• Parallel File Systems

• Storage Systems

• Almost 12.5 years of Research and Development since InfiniBand was introduced in October 2000

• Other Programming Models are emerging to take advantage of High-Performance Networks – UPC

– OpenSHMEM

– Hybrid MPI+PGAS (OpenSHMEM and UPC)

3

Use of High-Performance Networks for Scientific Computing

OFA Developer Workshop, April '13

• Focuses on large data and data analysis

• Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum

• Memcached is also used for Web 2.0

4

Big Data/Enterprise/Commercial Computing


Can High-Performance Interconnects Benefit Enterprise Computing?

• Most of the current enterprise systems use 1GE

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw interest – Oracle, IBM, Google are working along these directions

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?


• Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached

• Challenges in Accelerating Enterprise Middleware • Designs and Case Studies

– Hadoop • HDFS • MapReduce • HBase • Combination (HDFS + HBase)

– Memcached

• Conclusion and Q&A

Presentation Outline


Big Data Processing with Hadoop

• The open-source implementation of MapReduce programming model for Big Data Analytics

• http://hadoop.apache.org/ • Framework includes:

MapReduce, HDFS, and HBase • Underlying Hadoop Distributed

File System (HDFS) used by MapReduce and HBase

• Model scales but high amount of communication during intermediate phases


Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop; highly reliable and fault-tolerant

• Adopted by many reputed organizations – eg:- Facebook, Yahoo!

• NameNode: stores the file system namespace

• DataNode: stores data blocks

• Developed in Java for platform-independence and portability

• Uses sockets for communication!

(HDFS Architecture)

OFA Developer Workshop, April '13 8


HBase

• Apache Hadoop Database (http://hbase.apache.org/)

• Semi-structured database, which is highly scalable

• Integral part of many datacenter applications – eg:- Facebook Social Inbox

• Developed in Java for platform-independence and portability

• Uses sockets for communication!

(HBase Architecture)

Memcached Architecture

• Integral part of Web 2.0 architecture • Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes – General purpose

• Typically used to cache database queries, results of API calls • Scalable model, but typical usage very network intensive




– Hadoop • HDFS • MapReduce • HBase • Combination (HDFS and HBase)

– Memcached




Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

12

Enhanced Designs

Application

Accelerated Sockets

10 GigE or InfiniBand

Verbs / Hardware Offload

Current Design

Application

Sockets

1/10 GigE Network

• Sockets not designed for high-performance – Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop) – Zero-copy not available for non-blocking sockets

Our Approach

Application

OSU Design

10 GigE or InfiniBand

Verbs Interface


Interplay between Storage and Interconnect/Protocols

• Most of the current generation enterprise systems use the traditional hard disks

• Since hard disks are slower, high performance communication protocols may not have impact

• SSDs and other storage technologies are emerging

• Does it change the landscape?





– Memcached




HDFS-RDMA Design Overview

• JNI Layer bridges Java based HDFS with communication library written in native code • Only the communication part of HDFS Write is modified; No change in HDFS architecture

15

HDFS

IB Verbs

InfiniBand

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Write Others

OSU Design

HDFS Write involves replication; more network intensive HDFS Read is mostly node-local

Enables high performance RDMA communication, while supporting traditional socket interface


Communication Times in HDFS

• Cluster B with 32 HDD DataNodes

– 30% improvement in communication time over IPoIB (32Gbps)

– 87% improvement in communication time over 1GigE

• Similar improvements are obtained for SSD DataNodes

0

10

20

30

40

50

60

70

80

90

100

2GB 4GB 6GB 8GB 10GB

Com

mun

icat

ion

Tim

e (s

)

File Size (GB)

1GigE

IPoIB (QDR)

OSU-IB (QDR)

Reduced by 30%

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012


Evaluations using TestDFSIO

• Cluster with 32 HDD DataNodes

– 13.5% improvement over IPoIB (32Gbps) for 8GB file size

• Cluster with 4 SSD DataNodes

– 15% improvement over IPoIB (32Gbps) for 8GB file size

0

50

100

150

200

250

300

350

2 4 6 8 10

Thro

ughp

ut (M

egaB

ytes

/s)

File Size (GB)

1GigE IPoIB (QDR) OSU-IB (QDR)

0

20

40

60

80

100

120

140

2 4 6 8 10

Thro

ughp

ut (M

egaB

ytes

/s)

File Size (GB)


Cluster with 32 HDD Nodes Cluster with 4 SSD Nodes

Increased by 13.5%

Increased by 15%


Evaluations using TestDFSIO

• Cluster with 4 DataNodes, 1 HDD per node

– 10% improvement over IPoIB (32Gbps) for 10GB file size

• Cluster with 4 DataNodes, 2 HDD per node

– 16.2% improvement over IPoIB (32Gbps) for 10GB file size

• 2 HDD vs 1 HDD

– 2.01x improvement for OSU-IB (32Gbps)

– 1.8x improvement for IPoIB (32Gbps)

Increased by 16.2%

0

50

100

150

200

250

300

350

2 4 6 8 10

Aver

age

Thro

ughp

ut (M

egaB

ytes

/s)

File Size (GB)

1GigE (1HDD) IPoIB(1HDD) OSU-IB (1HDD)1GigE (2HDD) IPoIB (2HDD) OSU-IB (2HDD)


Evaluations using YCSB (32 Region Servers: 100% Update)

Put Latency Put Throughput

• HBase using TCP/IP, running over HDFS-IB

• HBase Put latency for 480K records

– 201 us for OSU Design; 272 us for IPoIB (32Gbps)

• HBase Put throughput for 480K records

– 4.42 Kops/sec for OSU Design; 3.63 Kops/sec for IPoIB (32Gbps)

• 26% improvement in average latency; 24% improvement in throughput OFA Developer Workshop, April '13 19

0

50

100

150

200

250

300

350

400

120K 240K 360K 480K

Aver

age

Late

ncy

(us)

Number of Records (Record Size = 1KB)


00.5

11.5

22.5

33.5

44.5

5

120K 240K 360K 480K

Thro

ughp

ut (K

ops/

sec)






– Memcached





MapReduce-RDMA Design Overview

• Enables high performance RDMA communication, while supporting traditional socket interface

RDMA-enabled MapReduce

IB Verbs

RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..)

OSU-IB Design

Applications

1/10 GigE Network



Job Tracker

Task Tracker

Map

Reduce

M.-W Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, Int'l Workshop on High Performance Data Intensive Computing (HPDIC), held in conjunction with Int'l Parallel and Distributed Processing Symposium (IPDPS '13), May 2013.

• For 5-20 GB Sort, 4-node cluster with a single SSD per node

• For 25-40 GB Sort, 8-node cluster with a single HDD per node

• 46% improvement over IPoIB (32 Gbps) for 15 GB Sort

• 27% improvement over IPoIB (32 Gbps) for 40 GB Sort

Evaluations using Sort

0

50

100

150

200

250

300

350

400

450

500

5 10 15 20

Exec

utio

n Ti

me

(sec

)

Sort Size (GB)


0

200

400

600

800

1000

1200

1400

25 30 35 40

Exec

utio

n Ti

me

(sec

)

Sort Size (GB)



• Cluster Size 4 and 8 have 24 GB RAM in each node, Cluster Size 12 and 24 have 12 GB RAM in each node, all the nodes have single HDD

• 41% improvement over IPoIB (32Gbps) for 100 GB Terasort

Evaluations using TeraSort

0

500

1000

1500

2000

2500

3000

30 GB 60 GB 100 GB 200 GB

Cluster Size = 4 Cluster Size = 8 Cluster Size = 12 Cluster Size = 24

Exec

utio

n Ti

me

(sec

) 1GigE IPoIB (QDR) OSU-IB (QDR)


• The DataSet for these workloads are taken from PUMA (Purdue MapReduce Benchmark Suite)

• 46% improvement in Adjacency List over IPoIB (32Gbps) for 30 GB data size

• 33% improvement in Sequence Count over IPoIB (32 Gbps) for 50 GB data size

Evaluations using PUMA Workload

0

500

1000

1500

2000

2500

3000

Adjacency List (30GB) Self Join (30 GB) Word Count (50 GB) Sequence Count (50 GB) Inverted Index (50 GB)

Exec

utio

n Ti

me

(sec

)

Workloads

IPoIB (QDR) OSU-IB (QDR)





– Memcached





HBase Put/Get – Detailed Analysis

• HBase 1KB Put – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time

• HBase 1KB Get – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time

0

50

100

150

200

250

300

1GigE IPoIB 10GigE OSU-IB

Tim

e (u

s)

HBase Put 1KB

0

50

100

150

200

250

1GigE IPoIB 10GigE OSU-IB

Tim

e (u

s)

HBase Get 1KB

Communication

Communication Preparation

Server Processing

Server Serialization

Client Processing

Client Serialization

W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12


HBase-RDMA Design Overview

• JNI Layer bridges Java based HBase with communication library written in native code

• Enables high performance RDMA communication, while supporting traditional socket interface

HBase

IB Verbs

RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..)

OSU-IB Design

Applications

1/10 GigE Network




HBase Single Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE

0

100

200

300

400

500

600

1 2 4 8 16

Tim

e (u

s)

No. of Clients

0

10000

20000

30000

40000

50000

60000

1 2 4 8 16

Ops

/sec

No. of Clients

IPoIB

OSU-IB

1GigE

10GigE

Latency Throughput


HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark) – 64 clients: 2.0 ms; 128 Clients: 3.5 ms – 42% improvement over IPoIB for 128 clients

• HBase Get latency – 64 clients: 1.9 ms; 128 Clients: 3.5 ms – 40% improvement over IPoIB for 128 clients

0

1000

2000

3000

4000

5000

6000

7000

8 16 32 64 96 128

Tim

e (u

s)

No. of Clients

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

8 16 32 64 96 128

Tim

e (u

s)

No. of Clients

IPoIB OSU-IB

1GigE 10GigE

Read Latency Write Latency

J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12




– Memcached




• YCSB Evaluation with 4 RegionServers (100% update)

• HBase Put Latency and Throughput for 360K records

- 37% improvement over IPoIB (32Gbps)

- 18% improvement over OSU-IB HDFS only

HDFS and HBase Integration over IB (OSU-IB)

0

50

100

150

200

250

300

350

400

450

120K 240K 360K 480K

Late

ncy

(us)


1GigE IPoIB (QDR)

OSU-IB (HDFS) - IPoIB (HBase) (QDR) OSU-IB (HDFS) - OSU-IB (HBase) (QDR)

0

1

2

3

4

5

6

120K 240K 360K 480K

Thro

ughp

ut (K

ops/

sec)


1GigE IPoIB (QDR)

OSU-IB (HDFS) - IPoIB (HBase) (QDR) OSU-IB (HDFS) - OSU-IB (HBase) (QDR)





– Memcached




Memcached-RDMA Design

33

• Memcached applications need not be modified; uses verbs interface if available

• Memcached Server can serve both socket and verbs clients simultaneously

• Native IB-verbs-level Design and evaluation with – Memcached Server: 1.4.9

– Memcached Client: (libmemcached) 0.52

Sockets Client

RDMA Client

Master Thread

Sockets Worker Thread

Verbs Worker Thread

Sockets Worker Thread

Verbs Worker Thread

Shared

Data

Memory Slabs Items

…

1

1

2

2


34

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us

– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost factor of four improvement over 10GE (TOE) for 2K bytes on the DDR cluster

Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)

0

20

40

60

80

100

120

140

160

180

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (u

s)

Message Size

0

20

40

60

80

100

120

140

160

180

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (u

s)

Message Size

SDP IPoIB

OSU-RC-IB 1GigE

10GigE OSU-UD-IB

Memcached Get Latency (Small Message)



Memcached Get TPS (4byte)

• Memcached Get transactions per second for 4 bytes

– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB

0

200

400

600

800

1000

1200

1400

1600

1 2 4 8 16 32 64 128 256 512 800 1K

Thou

sand

s of T

rans

actio

ns p

er se

cond

(TPS

)

No. of Clients

SDP IPoIB

OSU-RC-IB 1GigE

OSU-UD-IB

0

200

400

600

800

1000

1200

1400

1600

4 8

Thou

sand

s of T

rans

actio

ns p

er se

cond

(TPS

)

No. of Clients


Application Level Evaluation – Olio Benchmark

• Olio Benchmark – RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients

• 4X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design

0

500

1000

1500

2000

2500

64 128 256 512 1024

Tim

e (m

s)

No. of Clients

0

20

40

60

80

100

120

1 4 8

Tim

e (m

s)

No. of Clients

SDP

IPoIB

OSU-RC-IB

OSU-UD-IB

OSU-Hybrid-IB


Application Level Evaluation – Real Application Workloads

• Real Application Workload – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients

• 12X times better than IPoIB for 8 clients • Hybrid design achieves comparable performance to that of pure RC design

0

20

40

60

80

100

120

1 4 8

Tim

e (m

s)

No. of Clients

SDP

IPoIB

OSU-RC-IB

OSU-UD-IB

OSU-Hybrid-IB

0

50

100

150

200

250

300

350

64 128 256 512 1024

Tim

e (m

s)

No. of Clients

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid’12

• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, HBase and Memcached

• Results are promising

• Working on Integrated designs of all components

• Many other open issues need to be solved including design changes at the upper layers

• Will enable BigData community to take advantage of modern clusters and Hadoop middleware to carry out their analytics in a fast and scalable manner


Concluding Remarks

38


Web Pointers

http://www.cse.ohio-state.edu/~panda http://nowlab.cse.ohio-state.edu

MVAPICH Web Page http://mvapich.cse.ohio-state.edu

[email protected]

39





http://nowlab.cse.ohio-state.edu/




http://mvapich.cse.ohio-state.edu/



mailto:[email protected]

high performance rdma based design for big data and · pdf filehigh performance rdma based...

Documents