
Page 1: Accelerating Spark with RDMA for Big Data Processing: Early

Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda

Network-Based Computing Laboratory

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Page 2: Accelerating Spark with RDMA for Big Data Processing: Early

Outline

•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work

Page 3: Accelerating Spark with RDMA for Big Data Processing: Early

Digital Data Explosion in Society

Figure source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode

Page 4: Accelerating Spark with RDMA for Big Data Processing: Early

Big Data Technology - Hadoop

•  Apache Hadoop is one of the most popular Big Data technologies
–  Provides frameworks for large-scale, distributed data storage and processing
–  MapReduce, HDFS, YARN, RPC, etc.

[Diagram: Hadoop 1.x stack — MapReduce (cluster resource management & data processing) / HDFS / Hadoop Common/Core (RPC, ..); Hadoop 2.x stack — MapReduce and other models (data processing) / YARN (cluster resource management & job scheduling) / HDFS / Hadoop Common/Core (RPC, ..)]

Page 5: Accelerating Spark with RDMA for Big Data Processing: Early

Big Data Technology - Spark

•  An emerging in-memory processing framework
–  Iterative machine learning
–  Interactive data analytics
–  Scala-based implementation
–  Master-Slave architecture; uses HDFS and Zookeeper
•  Scalable and communication intensive
–  Wide dependencies between Resilient Distributed Datasets (RDDs)
–  MapReduce-like shuffle operations to repartition RDDs
–  Sockets-based communication, the same as Hadoop

[Diagram: a Driver's SparkContext connects to the Master (with Zookeeper) and to Workers, which access HDFS]

[Diagram: narrow vs. wide dependencies between RDDs, partitions, and data blocks]

Page 6: Accelerating Spark with RDMA for Big Data Processing: Early

Common Protocols using OpenFabrics

[Diagram: protocol options beneath the application interface (Sockets or Verbs) — kernel-space TCP/IP over 1/10/40 GigE Ethernet adapters and switches; TCP/IP with hardware offload over 10/40 GigE-TOE; IPoIB over InfiniBand adapters and switches; user-space RSockets over InfiniBand; user-space SDP over InfiniBand; user-space iWARP over iWARP adapters and Ethernet switches; user-space RDMA over RoCE adapters and Ethernet switches; native RDMA with IB Verbs over InfiniBand adapters and switches]

Page 7: Accelerating Spark with RDMA for Big Data Processing: Early

Previous Studies

•  Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
–  Hadoop acceleration with RDMA
•  N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC'14
•  N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC'12
•  M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS'14
•  M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC'13
•  X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP'13
–  HBase acceleration with RDMA
•  J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
–  Memcached acceleration with RDMA
•  J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11

Page 8: Accelerating Spark with RDMA for Big Data Processing: Early

The High-Performance Big Data (HiBD) Project

•  RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
•  RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
•  RDMA for Memcached (RDMA-Memcached)
•  OSU HiBD-Benchmarks (OHB)
•  http://hibd.cse.ohio-state.edu

Page 9: Accelerating Spark with RDMA for Big Data Processing: Early

RDMA for Apache Hadoop

•  High-performance design of Hadoop over RDMA-enabled interconnects
–  High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
–  Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
•  Current release: 0.9.9 (03/31/14)
–  Based on Apache Hadoop 1.2.1
–  Compliant with Apache Hadoop 1.2.1 APIs and applications
–  Tested with
•  Mellanox InfiniBand adapters (DDR, QDR, and FDR)
•  RoCE support with Mellanox adapters
•  Various multi-core platforms
•  Different file systems with disks and SSDs
•  RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!

http://hibd.cse.ohio-state.edu

Page 10: Accelerating Spark with RDMA for Big Data Processing: Early

Performance Benefits – RandomWriter & Sort on SDSC Gordon

[Charts: job execution time (sec) for RandomWriter and Sort with 50 GB–300 GB data on clusters of 16–64 nodes, IPoIB (QDR) vs. OSU-IB (QDR); annotated reductions of 20% and 36%]

•  RandomWriter in a cluster with a single SSD per node
–  16% improvement over IPoIB for 50 GB in a cluster of 16 nodes
–  20% improvement over IPoIB for 300 GB in a cluster of 64 nodes
•  Sort in a cluster with a single SSD per node
–  20% improvement over IPoIB for 50 GB in a cluster of 16 nodes
–  36% improvement over IPoIB for 300 GB in a cluster of 64 nodes

Can RDMA benefit Apache Spark on high-performance networks as well?

Page 11: Accelerating Spark with RDMA for Big Data Processing: Early

Outline

•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work

Page 12: Accelerating Spark with RDMA for Big Data Processing: Early

Problem Statement

•  Is it worth it?
–  Is the performance improvement potential high enough if we can successfully adapt RDMA to Spark?
–  A few percentage points, or orders of magnitude?
•  How difficult is it to adapt RDMA to Spark?
–  Can RDMA be adapted to suit the communication needs of Spark?
–  Is it viable to rewrite portions of the Spark code with RDMA?
•  Can Spark applications benefit from an RDMA-enhanced design?
–  What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
–  Can an RDMA-based design benefit applications transparently?

Page 13: Accelerating Spark with RDMA for Big Data Processing: Early

Assessment of Performance Improvement Potential

•  How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
•  Assessment Methodology
–  Evaluation with primitive-level micro-benchmarks
•  Latency, bandwidth
•  1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
•  Java/Scala-based environment
–  Evaluation with typical workloads (a GroupBy-style sketch follows this list)
•  GroupByTest
•  1GigE, 10GigE, IPoIB (32Gbps)
•  Spark 0.9.1
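For reference, a GroupByTest-style workload boils down to a shuffle-heavy groupByKey over randomly generated key-value pairs. A minimal sketch in Scala against the Spark 0.9.x API is shown below; the master URL, partition counts, and record sizes are illustrative, not the exact OSU configuration:

```scala
import java.util.Random

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 0.9.x

// GroupByTest-style driver: every record is shuffled during groupByKey,
// so the interconnect dominates the stage time.
object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "GroupBySketch")
    val numMappers = 64        // illustrative: one partition per core
    val pairsPerMapper = 10000
    val valSize = 1000         // bytes per value

    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      (0 until pairsPerMapper).map { _ =>
        (rand.nextInt(numMappers * pairsPerMapper), new Array[Byte](valSize))
      }
    }.cache()
    pairs.count()              // materialize the input before timing

    val t0 = System.nanoTime()
    pairs.groupByKey().count() // shuffle-intensive stage
    println(s"GroupBy time: ${(System.nanoTime() - t0) / 1e9} sec")
    sc.stop()
  }
}
```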


Page 14: Accelerating Spark with RDMA for Big Data Processing: Early

Evaluation on Primitive-level Micro-benchmarks

[Charts: latency (us) for 1 byte–2K message sizes and latency (ms) for 4K–4M message sizes, plus peak bandwidth (MBps) for small and large messages, comparing 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]

•  As expected, IPoIB and 10GigE are much better than 1GigE.
•  Compared to the other interconnects/protocols, RDMA can significantly
–  reduce the latency for all message sizes (see the micro-benchmark sketch below)
–  improve the peak bandwidth
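For context, the sockets-side latency numbers can be gathered with a simple ping-pong micro-benchmark. A minimal sketch over java.nio is given below (the verbs-level RDMA figures come from native benchmarks; object and parameter names here are illustrative):

```scala
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{ServerSocketChannel, SocketChannel}

// Round-trip "ping-pong" over TCP sockets; RTT/2 approximates the
// one-way latency reported for the sockets-based paths.
object PingPongSketch {
  val Iters = 10000

  // Echo server: writes back whatever bytes arrive.
  def server(port: Int): Unit = {
    val ch = ServerSocketChannel.open().bind(new InetSocketAddress(port)).accept()
    val buf = ByteBuffer.allocateDirect(4 * 1024 * 1024)
    while (true) {
      buf.clear()
      if (ch.read(buf) < 0) return
      buf.flip()
      while (buf.hasRemaining) ch.write(buf)
    }
  }

  // Returns the average one-way latency in microseconds for msgSize bytes.
  def client(host: String, port: Int, msgSize: Int): Double = {
    val ch = SocketChannel.open(new InetSocketAddress(host, port))
    ch.socket().setTcpNoDelay(true)
    val buf = ByteBuffer.allocateDirect(msgSize)
    val t0 = System.nanoTime()
    for (_ <- 1 to Iters) {
      buf.clear()
      while (buf.hasRemaining) ch.write(buf) // send the message
      buf.clear()
      while (buf.hasRemaining) ch.read(buf)  // wait for the echo
    }
    (System.nanoTime() - t0) / 1e3 / Iters / 2
  }
}
```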

Can these benefits of high-performance networks be achieved in Apache Spark?

Page 15: Accelerating Spark with RDMA for Big Data Processing: Early

Evaluation on Typical Workloads

•  With high-performance networks,
–  the execution time of GroupBy is significantly improved
–  the network throughput is much higher for both send and receive
•  Spark can benefit from high-performance interconnects to a great extent!

[Charts: GroupBy time (sec) vs. data size (4–10 GB) for 1GigE, 10GigE, and IPoIB (32Gbps), with annotated improvements of 65% and 22%; network throughput (MB/s) over sample times (s) in receive and in send, comparing 1GigE, 10GigE, and IPoIB (32Gbps)]

Can RDMA further improve Spark performance compared with IPoIB and 10GigE?

Page 16: Accelerating Spark with RDMA for Big Data Processing: Early

Outline

•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work

Page 17: Accelerating Spark with RDMA for Big Data Processing: Early

Architecture Overview

•  Design Goals
–  High performance
–  Keeping the existing Spark architecture and interface intact
–  Minimal code changes

[Diagram: Spark applications (Scala/Java/Python) run on Spark (Scala/Java); tasks fetch blocks through BlockManager/BlockFetcherIterator, behind which sits either the Java NIO ShuffleServer/ShuffleFetcher (default), the Netty ShuffleServer/ShuffleFetcher (optional), or the RDMA ShuffleServer/ShuffleFetcher (plug-in); the NIO/Netty paths use Java sockets over the 1/10 GigE / IPoIB (QDR/FDR) network, while the RDMA plug-in uses the RDMA-based Shuffle Engine (Java/JNI) over native InfiniBand (QDR/FDR)]

•  Approaches
–  Plug-in based approach to extend the shuffle framework in Spark (see the sketch below)
•  RDMA Shuffle Server + RDMA Shuffle Fetcher
•  100 lines of code changes inside the original Spark files
–  RDMA-based Shuffle Engine
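To illustrate the plug-in idea, the sketch below models how a shuffle transport could be chosen from configuration, analogous to Spark 0.9.x's spark.shuffle.use.netty flag. The trait, object, and property names are hypothetical, not the actual Spark 0.9.1 internals:

```scala
// Hypothetical transport switch for the shuffle layer: the interface
// seen by BlockManager stays the same; only the transport behind it
// changes, which is what keeps the required code changes minimal.
object ShuffleTransportSelector {
  sealed trait ShuffleTransport
  case object NioTransport   extends ShuffleTransport // Java NIO (default)
  case object NettyTransport extends ShuffleTransport // Netty (optional)
  case object RdmaTransport  extends ShuffleTransport // RDMA plug-in (JNI engine)

  // Property name "spark.shuffle.transport" is illustrative only.
  def fromConf(conf: Map[String, String]): ShuffleTransport =
    conf.getOrElse("spark.shuffle.transport", "nio") match {
      case "rdma"  => RdmaTransport
      case "netty" => NettyTransport
      case _       => NioTransport
    }
}
```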

Page 18: Accelerating Spark with RDMA for Big Data Processing: Early

SEDA-based Data Shuffle Plug-ins

•  SEDA: Staged Event-Driven Architecture
•  A set of stages connected by queues
•  A dedicated thread pool is in charge of processing the events on the corresponding queue
–  Listener, Receiver, Handlers, Responders
•  Admission control is performed on these event queues
•  High throughput by maximally overlapping the different processing stages while maintaining Spark's default task-level parallelism (see the stage sketch below)

[Diagram: in the RDMA-based Shuffle Engine, a Listener accepts connections into a connection pool; a Receiver receives requests and enqueues them on a request queue; Handlers dequeue requests and read block data; Responders enqueue/dequeue responses and send them back]
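A minimal sketch of one way to wire such stages together, using bounded queues and per-stage thread pools; all names and sizes are illustrative, not the engine's actual implementation:

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}

// Each stage owns a bounded event queue (admission control) and a
// dedicated thread pool, so request handling, block reads, and
// response sending overlap instead of serializing on one thread.
class Stage[In](threads: Int)(handle: In => Unit) {
  private val queue = new LinkedBlockingQueue[In](1024)
  private val pool = Executors.newFixedThreadPool(threads)
  for (_ <- 1 to threads) pool.execute(new Runnable {
    def run(): Unit = while (true) handle(queue.take())
  })
  def enqueue(event: In): Unit = queue.put(event) // blocks when saturated
}

object SedaSketch {
  final case class Request(blockId: String)
  final case class Response(blockId: String, data: Array[Byte])

  def readBlock(id: String): Array[Byte] = new Array[Byte](1024) // stand-in I/O
  def send(r: Response): Unit = ()                               // stand-in send

  val responder = new Stage[Response](threads = 4)(send)
  val handler = new Stage[Request](threads = 8)(req =>
    responder.enqueue(Response(req.blockId, readBlock(req.blockId)))
  )
  // A Listener/Receiver stage would parse requests and call handler.enqueue(...)
}
```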


Page 19: Accelerating Spark with RDMA for Big Data Processing: Early

RDMA-based Shuffle Engine

•  Connection Management
–  Alternative designs
•  Pre-connection
–  Hides the overhead in the initialization stage
–  Processes pre-connect to each other before the actual communication
–  Sacrifices more resources to keep all of these connections alive
•  Dynamic connection establishment and sharing
–  A connection is established if and only if an actual data exchange is going to take place
–  A naive dynamic connection design is not optimal, because resources must be allocated for every data block transfer
–  An advanced dynamic connection scheme reduces the number of connection establishments
–  Spark uses a multi-threading approach to run multiple tasks in a single JVM, which is a good chance to share connections!
–  How long should a connection be kept alive for possible reuse? A timeout mechanism destroys idle connections (see the pool sketch below)
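A minimal sketch of connection sharing with an idle timeout; Connection here is a stand-in for the engine's JNI/verbs handle, and all names are illustrative:

```scala
import java.util.concurrent.ConcurrentHashMap

// Shared-connection cache: all tasks in one executor JVM reuse a single
// connection per peer; idle connections are torn down after a timeout.
class Connection(val peer: String) {
  @volatile var lastUsed: Long = System.currentTimeMillis()
  def close(): Unit = () // release the underlying verbs resources (elided)
}

class ConnectionPool(idleTimeoutMs: Long) {
  private val conns = new ConcurrentHashMap[String, Connection]()

  // Established if and only if a data exchange is about to take place.
  def get(peer: String): Connection = {
    val c = conns.computeIfAbsent(peer, p => new Connection(p))
    c.lastUsed = System.currentTimeMillis()
    c
  }

  // Run periodically from a timer thread to destroy idle connections.
  def reapIdle(): Unit = {
    val now = System.currentTimeMillis()
    conns.forEach { (peer, c) =>
      if (now - c.lastUsed > idleTimeoutMs && conns.remove(peer, c)) c.close()
    }
  }
}
```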


Page 20: Accelerating Spark with RDMA for Big Data Processing: Early

RDMA-based Shuffle Engine

•  Data Transfer
–  Each connection is used by multiple tasks (threads) to transfer data concurrently
–  Packets over the same communication lane go to different entities on both the server and fetcher sides
–  Alternative designs
•  Sequential transfers of complete blocks over a communication lane, which keeps the order
–  Causes long wait times for tasks that are ready to transfer data over the same connection
•  Non-blocking and out-of-order data transfer
–  Chunking of data blocks
–  Non-blocking sends over shared connections
–  Out-of-order packet communication
–  Guarantees both performance and ordering (see the reassembly sketch below)
–  Works efficiently with the dynamic connection management and sharing mechanism
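A minimal sketch of how out-of-order chunks on a shared connection could be reassembled per request, assuming each chunk carries a request ID, a sequence number, and a last-chunk flag; this is illustrative, not the engine's actual wire format:

```scala
import scala.collection.mutable

// Chunked, out-of-order reassembly: chunks from many blocks interleave
// on one shared connection; (requestId, seqNo) restores per-block order.
final case class Chunk(requestId: Long, seqNo: Int, last: Boolean, data: Array[Byte])

class Reassembler {
  private val pending = mutable.Map[Long, mutable.Map[Int, Chunk]]()

  /** Returns the complete block once its final chunk has arrived. */
  def offer(c: Chunk): Option[Array[Byte]] = {
    val chunks = pending.getOrElseUpdate(c.requestId, mutable.Map())
    chunks(c.seqNo) = c
    val done = chunks.values.exists(_.last) &&
      (0 until chunks.size).forall(chunks.contains) // no gaps in the sequence
    if (done) {
      pending.remove(c.requestId)
      Some((0 until chunks.size).flatMap(i => chunks(i).data).toArray)
    } else None
  }
}
```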

Page 21: Accelerating Spark with RDMA for Big Data Processing: Early

RDMA-based Shuffle Engine

•  Buffer Management
–  On-JVM-heap vs. off-JVM-heap buffer management
–  Off-JVM-heap
•  High performance through native I/O
•  Shadow buffers in Java/Scala
•  Registered for RDMA communication (see the pool sketch below)
–  Flexibility for upper-layer design choices
•  Supports the connection sharing mechanism via a request ID
•  Supports in-order packet processing via a sequence number
•  Supports non-blocking sends via a buffer flag + callback
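A minimal sketch of an off-heap buffer pool, assuming the registration with the HCA happens inside the native engine; the pool below only manages the direct buffers, and all names are illustrative:

```scala
import java.nio.ByteBuffer
import java.util.concurrent.ArrayBlockingQueue

// Off-JVM-heap buffer pool: direct buffers live outside the GC heap, so
// the native RDMA engine can (in the real design) register them with the
// HCA and post sends/receives without an extra copy.
class DirectBufferPool(bufSize: Int, count: Int) {
  private val free = new ArrayBlockingQueue[ByteBuffer](count)
  for (_ <- 1 to count) free.put(ByteBuffer.allocateDirect(bufSize))
  // In the real engine each buffer would also be registered for RDMA here (JNI).

  def acquire(): ByteBuffer = { val b = free.take(); b.clear(); b }
  def release(b: ByteBuffer): Unit = free.put(b) // callback after non-blocking send
}
```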


Page 22: Accelerating Spark with RDMA for Big Data Processing: Early

Outline

•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work

Page 23: Accelerating Spark with RDMA for Big Data Processing: Early

Experimental Setup

•  Hardware
–  Intel Westmere Cluster (A)
•  Up to 9 nodes
•  Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs and 24 GB main memory
•  Mellanox QDR HCAs (32Gbps) + 10GigE
–  TACC Stampede Cluster (B)
•  Up to 17 nodes
•  Intel Sandy Bridge (E5-2680) dual octa-core processors running at 2.70 GHz, 32 GB main memory
•  Mellanox FDR HCAs (56Gbps)
•  Software
–  Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
–  GroupBy Test

Page 24: Accelerating Spark with RDMA for Big Data Processing: Early

Performance Evaluation on Cluster A

•  For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
•  For 64 cores, up to 20% over IPoIB (32Gbps) and 34% over 10GigE

[Charts: GroupBy time (sec) vs. data size — 4 worker nodes, 32 cores (32M 32R) with 4–10 GB, and 8 worker nodes, 64 cores (64M 64R) with 8–20 GB — comparing 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]

Page 25: Accelerating Spark with RDMA for Big Data Processing: Early

Performance Evaluation on Cluster B

•  For 128 cores, up to 83% improvement over IPoIB (56Gbps)
•  For 256 cores, up to 79% improvement over IPoIB (56Gbps)

[Charts: GroupBy time (sec) vs. data size — 8 worker nodes, 128 cores (128M 128R) with 8–20 GB, and 16 worker nodes, 256 cores (256M 256R) with 16–40 GB — comparing IPoIB (56Gbps) and RDMA (56Gbps)]

Page 26: Accelerating Spark with RDMA for Big Data Processing: Early

Performance Analysis on Cluster B

•  Java-based micro-benchmark comparison

                    IPoIB (56Gbps)    RDMA (56Gbps)
    Peak Bandwidth  1741.46 MBps      5612.55 MBps

•  Benefit of the RDMA connection sharing design

[Chart: GroupBy time (sec) vs. data size (16–40 GB), RDMA without vs. with connection sharing (56Gbps); annotated 55% improvement at 40 GB]

Enabling connection sharing achieves a 55% performance benefit for the 40 GB data size on 16 nodes.

Page 27: Accelerating Spark with RDMA for Big Data Processing: Early

Outline

•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work

Page 28: Accelerating Spark with RDMA for Big Data Processing: Early

Conclusion and Future Work

•  Three major conclusions
–  RDMA and high-performance interconnects can benefit Spark.
–  The plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.
–  Spark applications can benefit from an RDMA-enhanced design.
•  Future Work
–  Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
–  Make this design publicly available through the HiBD project

Page 29: Accelerating Spark with RDMA for Big Data Processing: Early

Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/