Transcript
Page 1: Accelerating Spark with RDMA for Big Data Processing: Early Experiences (HotI 2014)

Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda

Network-Based Computing Laboratory

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Page 2

Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 3

Figure Source (http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode)

Digital Data Explosion in the Society


Page 4

Big Data Technology - Hadoop

Figure: Hadoop software stacks. Hadoop 1.x stack: MapReduce (cluster resource management & data processing), HDFS, Hadoop Common/Core (RPC, ...). Hadoop 2.x stack: MapReduce and other models (data processing), YARN (cluster resource management & job scheduling), HDFS, Hadoop Common/Core (RPC, ...).

• Apache Hadoop is one of the most popular Big Data technologies
  – Provides frameworks for large-scale, distributed data storage and processing
  – MapReduce, HDFS, YARN, RPC, etc.


Page 5

Big Data Technology - Spark

Figure: Spark architecture. A Driver with a SparkContext coordinates a Master and multiple Workers; HDFS provides storage and Zookeeper provides coordination.

• An emerging in-memory processing framework
  – Iterative machine learning
  – Interactive data analytics
  – Scala-based implementation
  – Master-slave; HDFS, Zookeeper
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs (see the sketch below)
  – Like Hadoop, uses sockets-based communication

Figure: narrow vs. wide dependencies between RDDs; each RDD is a set of partitions holding data blocks.
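To make the shuffle concrete, here is a minimal sketch (not the authors' code) of a shuffle-heavy groupByKey job in the spirit of Spark's bundled GroupByTest workload; the key and value counts are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 0.9.x)

// Minimal sketch of a shuffle-heavy job: the map is a narrow dependency,
// groupByKey repartitions the RDD by key and triggers a network shuffle.
object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupBySketch"))
    val numKeys   = 1000          // illustrative parameters
    val numValues = 1000000
    val pairs = sc.parallelize(0 until numValues, 64)   // narrow dependency: map
                  .map(i => (i % numKeys, i))
    val grouped = pairs.groupByKey()                     // wide dependency: shuffle
    println(s"Groups after shuffle: ${grouped.count()}")
    sc.stop()
  }
}
```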

Page 6

Common Protocols using Open Fabrics

Figure: protocol stacks over OpenFabrics networks. An application talks through either the Sockets or the Verbs interface; beneath that sit the protocol, adapter, and switch layers:
– 1/10/40 GigE: kernel-space TCP/IP over an Ethernet driver, Ethernet adapter, Ethernet switch
– 10/40 GigE-TOE: TCP/IP with hardware offload, Ethernet adapter, Ethernet switch
– IPoIB: kernel-space TCP/IP, InfiniBand adapter, InfiniBand switch
– RSockets: user-space RSockets, InfiniBand adapter, InfiniBand switch
– SDP: user-space SDP (RDMA), InfiniBand adapter, InfiniBand switch
– iWARP: user-space iWARP (RDMA), iWARP adapter, Ethernet switch
– RoCE: user-space RDMA over verbs, RoCE adapter, Ethernet switch
– Native InfiniBand: user-space RDMA over IB verbs, InfiniBand adapter, InfiniBand switch


Page 7

Previous Studies

• Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
  – Hadoop acceleration with RDMA
    • N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14
    • N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC '12
    • M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS '14
    • M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC '13
    • X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP '13
  – HBase acceleration with RDMA
    • J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
  – Memcached acceleration with RDMA
    • J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11

Page 8

The High-Performance Big Data (HiBD) Project

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu


Page 9

RDMA for Apache Hadoop

• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
  – Based on Apache Hadoop 1.2.1
  – Compliant with Apache Hadoop 1.2.1 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs
• RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!

http://hibd.cse.ohio-state.edu

Page 10

Performance Benefits – RandomWriter & Sort in SDSC-Gordon

Figure: job execution time (sec) of RandomWriter and Sort for 50–300 GB data on clusters of 16, 32, 48, and 64 nodes, IPoIB (QDR) vs. OSU-IB (QDR).

• RandomWriter in a cluster with a single SSD per node
  – 16% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 20% improvement over IPoIB for 300 GB in a cluster of 64 nodes
• Sort in a cluster with a single SSD per node
  – 20% improvement over IPoIB for 50 GB in a cluster of 16 nodes
  – 36% improvement over IPoIB for 300 GB in a cluster of 64 nodes

Can RDMA benefit Apache Spark on High-Performance Networks also?

Page 11

Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 12

Problem Statement

• Is it worth it?
  – Is the performance improvement potential high enough, if we can successfully adapt RDMA to Spark?
  – A few percentage points, or orders of magnitude?
• How difficult is it to adapt RDMA to Spark?
  – Can RDMA be adapted to suit the communication needs of Spark?
  – Is it viable to rewrite portions of the Spark code with RDMA?
• Can Spark applications benefit from an RDMA-enhanced design?
  – What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
  – Can an RDMA-based design benefit applications transparently?


Page 13

Assessment of Performance Improvement Potential

• How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
• Assessment methodology (a micro-benchmark sketch follows this list)
  – Evaluation on primitive-level micro-benchmarks
    • Latency, bandwidth
    • 1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
    • Java/Scala-based environment
  – Evaluation on typical workloads
    • GroupByTest
    • 1GigE, 10GigE, IPoIB (32Gbps)
    • Spark 0.9.1
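As a rough illustration of what a sockets-side, Java/Scala-level latency micro-benchmark can look like (an assumed harness, not the authors' benchmark code; the RDMA data points would instead need a verbs-level binding, e.g. via JNI), here is a minimal blocking ping-pong sketch:

```scala
import java.net.{ServerSocket, Socket}

// Blocking socket ping-pong: measures average one-way latency for one message size.
object PingPong {
  val Iterations = 10000

  def server(port: Int, msgSize: Int): Unit = {
    val srv = new ServerSocket(port)
    val s   = srv.accept()
    val in  = s.getInputStream; val out = s.getOutputStream
    val buf = new Array[Byte](msgSize)
    for (_ <- 0 until Iterations) {
      var read = 0
      while (read < msgSize) read += in.read(buf, read, msgSize - read)
      out.write(buf); out.flush()            // echo the message back
    }
    s.close(); srv.close()
  }

  def client(host: String, port: Int, msgSize: Int): Double = {
    val s   = new Socket(host, port)
    val in  = s.getInputStream; val out = s.getOutputStream
    val buf = new Array[Byte](msgSize)
    val start = System.nanoTime()
    for (_ <- 0 until Iterations) {
      out.write(buf); out.flush()
      var read = 0
      while (read < msgSize) read += in.read(buf, read, msgSize - read)
    }
    s.close()
    (System.nanoTime() - start) / 1000.0 / Iterations / 2   // one-way latency in us
  }
}
```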

Page 14

Evaluation on Primitive-level Micro-benchmarks

Figure: latency for 1 B–2 KB messages (us), latency for 4 KB–4 MB messages (ms), and peak bandwidth (MBps) for 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps).

• Obviously, IPoIB and 10GigE are much better than 1GigE.
• Compared to the other interconnects/protocols, RDMA can significantly
  – reduce the latency for all message sizes
  – improve the peak bandwidth

Can these benefits of High-Performance Networks be achieved in Apache Spark?

Page 15

Evaluation on Typical Workloads

• For high-performance networks,
  – The execution time of GroupBy is significantly improved
  – The network throughput is much higher for both send and receive
• Spark can benefit from high-performance interconnects to a greater extent!

Figure: GroupBy time (sec) for 4–10 GB data (improvements of 65% and 22% annotated) and network throughput (MB/s) in receive and send sampled over time, for 1GigE, 10GigE, and IPoIB (32Gbps).

Can RDMA further benefit Spark performance compared with IPoIB and 10GigE?

Page 16

Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 17

Architecture Overview

• Design Goals
  – High performance
  – Keep the existing Spark architecture and interface intact
  – Minimal code changes

Figure: Spark applications (Scala/Java/Python) run on Spark (Scala/Java), where tasks use the BlockManager and BlockFetcherIterator. The shuffle layer offers a Java NIO ShuffleServer/ShuffleFetcher (default), a Netty ShuffleServer/ShuffleFetcher (optional), and an RDMA ShuffleServer/ShuffleFetcher (plug-in). The first two use Java sockets over the 1/10 GigE or IPoIB (QDR/FDR) network; the plug-in uses the RDMA-based Shuffle Engine (Java/JNI) over native InfiniBand (QDR/FDR).

• Approaches (see the sketch below)
  – Plug-in based approach to extend the shuffle framework in Spark
    • RDMA Shuffle Server + RDMA Shuffle Fetcher
    • About 100 lines of code changes inside the original Spark files
  – RDMA-based Shuffle Engine
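A hedged sketch of what such a plug-in layer could look like; the trait and property names below are hypothetical illustrations, not Spark's actual internal API:

```scala
import java.nio.ByteBuffer

// Hypothetical plug-in interfaces (illustrative names only): the real design adds an
// RDMA ShuffleServer/ShuffleFetcher pair beside the existing Java NIO and Netty paths,
// behind Spark's BlockManager/BlockFetcherIterator, with ~100 lines of changes in Spark.
trait ShuffleServerPlugin {
  def start(port: Int): Unit      // serve local shuffle block data to remote fetchers
  def stop(): Unit
}

trait ShuffleFetcherPlugin {
  def fetch(host: String, port: Int,
            blockIds: Seq[String]): Iterator[(String, ByteBuffer)]
}

// Selection could mirror the existing spark.shuffle.use.netty switch of that era,
// e.g. via a (hypothetical) spark.shuffle.use.rdma property:
//   val fetcher: ShuffleFetcherPlugin =
//     if (conf.getBoolean("spark.shuffle.use.rdma", false)) new RdmaShuffleFetcher(...)
//     else new NioShuffleFetcher(...)
```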

Page 18

SEDA-based Data Shuffle Plug-ins

• SEDA: Staged Event-Driven Architecture
• A set of stages connected by queues
• A dedicated thread pool is in charge of processing the events on the corresponding queue
  – Listener, Receiver, Handlers, Responders
• Admission control is performed on these event queues
• High throughput by maximally overlapping the different processing stages while maintaining Spark's default task-level parallelism (see the sketch below)

Figure: in the RDMA-based Shuffle Engine, a Listener accepts connections (backed by a connection pool), a Receiver receives requests and enqueues them on the request queue, Handlers dequeue requests, read the block data, and enqueue responses on the response queue, and Responders dequeue and send the responses.
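A minimal SEDA-style sketch, assuming bounded queues and fixed thread pools (stage and pool sizes are illustrative, not the released engine's values):

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}

// SEDA-style sketch: each stage drains a bounded queue with its own thread pool,
// so receiving requests, reading block data, and sending responses overlap.
case class ShuffleRequest(connId: Long, blockId: String)
case class ShuffleResponse(connId: Long, blockId: String, data: Array[Byte])

class SedaShuffleServer(readBlock: String => Array[Byte],
                        send: ShuffleResponse => Unit) {
  // Bounded queues give a simple form of admission control between stages.
  private val requestQueue  = new LinkedBlockingQueue[ShuffleRequest](1024)
  private val responseQueue = new LinkedBlockingQueue[ShuffleResponse](1024)
  private val handlers   = Executors.newFixedThreadPool(8)   // "Handler" stage
  private val responders = Executors.newFixedThreadPool(4)   // "Responder" stage

  // The Receiver stage calls this after pulling a request off the wire.
  def enqueue(req: ShuffleRequest): Unit = requestQueue.put(req)

  def start(): Unit = {
    for (_ <- 0 until 8) handlers.submit(new Runnable {
      def run(): Unit = while (true) {
        val req = requestQueue.take()                        // request dequeue
        responseQueue.put(                                   // response enqueue
          ShuffleResponse(req.connId, req.blockId, readBlock(req.blockId)))
      }
    })
    for (_ <- 0 until 4) responders.submit(new Runnable {
      def run(): Unit = while (true) send(responseQueue.take())  // response dequeue + send
    })
  }
}
```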

Page 19

RDMA-based Shuffle Engine

• Connection Management
  – Alternative designs
    • Pre-connection
      – Hides the overhead in the initialization stage
      – Before the actual communication, processes pre-connect to each other
      – Sacrifices more resources to keep all of these connections alive
    • Dynamic connection establishment and sharing (see the sketch below)
      – A connection is established if and only if an actual data exchange is going to take place
      – A naive dynamic connection design is not optimal, because resources must be allocated for every data block transfer
      – An advanced dynamic connection scheme reduces the number of connection establishments
      – Spark uses a multi-threading approach to support multi-task execution in a single JVM → a good chance to share connections!
      – How long should a connection be kept alive for possible re-use?
      – Timeout mechanism for connection destruction
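A minimal sketch of dynamic connection establishment with sharing and an idle timeout, assuming one connection per remote shuffle server shared by all tasks in the JVM (class and method names are illustrative):

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// A connection is created only on first use, shared by all tasks (threads) in the
// JVM that fetch from the same node, and destroyed after an idle timeout.
class Connection(val target: String) {
  val lastUsed = new AtomicLong(System.currentTimeMillis())
  def touch(): Unit = lastUsed.set(System.currentTimeMillis())
  def close(): Unit = { /* release the underlying verbs/QP resources here */ }
}

class ConnectionPool(idleTimeoutMs: Long = 30000) {
  private val pool = new ConcurrentHashMap[String, Connection]()

  // Establish on demand; concurrent callers for the same target share one connection.
  def get(target: String): Connection = {
    var conn = pool.get(target)
    if (conn == null) {
      val fresh = new Connection(target)
      val prev  = pool.putIfAbsent(target, fresh)
      conn = if (prev == null) fresh else prev
    }
    conn.touch()
    conn
  }

  // Invoked periodically by a janitor thread: destroy connections idle too long.
  def reapIdle(): Unit = {
    val now = System.currentTimeMillis()
    val it  = pool.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      if (now - e.getValue.lastUsed.get() > idleTimeoutMs) {
        e.getValue.close()
        it.remove()
      }
    }
  }
}
```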


Page 20

RDMA-based Shuffle Engine

• Data Transfer
  – Each connection is used by multiple tasks (threads) to transfer data concurrently
  – Packets over the same communication lane go to different entities on both the server and fetcher sides
  – Alternative designs
    • Sequential transfers of complete blocks over a communication lane → keeps the order
      – Causes long wait times for tasks that are ready to transfer data over the same connection
    • Non-blocking and out-of-order data transfer (see the sketch below)
      – Chunking of data blocks
      – Non-blocking sending over shared connections
      – Out-of-order packet communication
      – Guarantees both performance and ordering
      – Works efficiently with the dynamic connection management and sharing mechanism
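A minimal sketch of the chunking and reassembly idea, assuming chunks are tagged with a request ID and sequence number so blocks from different tasks can interleave on a shared connection (chunk size and names are illustrative):

```scala
import scala.collection.mutable

// Each chunk carries enough metadata to be routed and reassembled out of order.
case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

object Chunker {
  val ChunkSize = 64 * 1024

  // Split one shuffle block into fixed-size chunks tagged for reassembly.
  def split(requestId: Long, block: Array[Byte]): Seq[Chunk] = {
    val total = math.max(1, (block.length + ChunkSize - 1) / ChunkSize)
    (0 until total).map { i =>
      val from = i * ChunkSize
      val to   = math.min(from + ChunkSize, block.length)
      Chunk(requestId, i, total, java.util.Arrays.copyOfRange(block, from, to))
    }
  }
}

// Fetcher-side reassembly: chunks of different requests may interleave and
// arrive out of order over the shared connection.
class Reassembler {
  private val pending = mutable.Map[Long, mutable.Map[Int, Array[Byte]]]()

  // Returns the complete block once every chunk of a request has arrived.
  def offer(c: Chunk): Option[Array[Byte]] = {
    val parts = pending.getOrElseUpdate(c.requestId, mutable.Map[Int, Array[Byte]]())
    parts(c.seqNo) = c.payload
    if (parts.size == c.totalChunks) {
      pending.remove(c.requestId)
      Some((0 until c.totalChunks).flatMap(i => parts(i).toSeq).toArray)
    } else None
  }
}
```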

Page 21

RDMA-based Shuffle Engine

• Buffer Management (see the sketch below)
  – On-JVM-heap vs. off-JVM-heap buffer management
  – Off-JVM-heap
    • High performance through native I/O
    • Shadow buffers in Java/Scala
    • Registered for RDMA communication
  – Flexibility for upper-layer design choices
    • Supports the connection sharing mechanism → request ID
    • Supports in-order packet processing → sequence number
    • Supports non-blocking send → buffer flag + callback
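A hedged sketch of the off-JVM-heap buffer idea: pooled direct ByteBuffers serve as shadow buffers for natively registered memory, with a small header carrying the request ID, sequence number, and a flag so a callback can complete a non-blocking send (the layout and the JNI call are assumptions, not the released engine):

```scala
import java.nio.ByteBuffer
import java.util.concurrent.ConcurrentLinkedQueue

// Off-JVM-heap buffer pool: direct ByteBuffers act as shadow buffers for memory
// that the native (JNI) side would register for RDMA communication.
object RdmaBuffers {
  val HeaderBytes = 8 + 4 + 1            // requestId (Long) + seqNo (Int) + flag (Byte)
  val BufferBytes = 64 * 1024

  private val freeList = new ConcurrentLinkedQueue[ByteBuffer]()

  def acquire(): ByteBuffer = {
    val b = freeList.poll()
    if (b != null) { b.clear(); b }
    else ByteBuffer.allocateDirect(BufferBytes)   // would be registered once on the native side
  }

  def release(b: ByteBuffer): Unit = freeList.offer(b)

  // Header layout assumed by this sketch's (hypothetical) native shuffle engine.
  def writeHeader(b: ByteBuffer, requestId: Long, seqNo: Int, flag: Byte): Unit = {
    b.putLong(requestId).putInt(seqNo).put(flag)
  }

  // Non-blocking send: the callback would fire when the native layer reports completion.
  def sendAsync(buf: ByteBuffer, onComplete: () => Unit): Unit = {
    // nativeEngine.postSend(buf, onComplete)     // hypothetical JNI call
    onComplete()                                  // placeholder so the sketch is runnable
  }
}
```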

Page 22

Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 23

Experimental Setup

• Hardware
  – Intel Westmere Cluster (A)
    • Up to 9 nodes
    • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 24 GB main memory
    • Mellanox QDR HCAs (32Gbps) + 10GigE
  – TACC Stampede Cluster (B)
    • Up to 17 nodes
    • Intel Sandy Bridge (E5-2680) dual octa-core processors, running at 2.70 GHz, 32 GB main memory
    • Mellanox FDR HCAs (56Gbps)
• Software
  – Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
  – GroupBy Test

Page 24

Performance Evaluation on Cluster A

• For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
• For 64 cores, up to 20% improvement over IPoIB (32Gbps) and up to 34% over 10GigE

Figure: GroupBy time (sec) vs. data size for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps); left: 4 worker nodes, 32 cores (32M 32R), 4–10 GB; right: 8 worker nodes, 64 cores (64M 64R), 8–20 GB.

Page 25

Performance Evaluation on Cluster B

• For 128 cores, up to 83% improvement over IPoIB (56Gbps)
• For 256 cores, up to 79% improvement over IPoIB (56Gbps)

Figure: GroupBy time (sec) vs. data size for IPoIB (56Gbps) and RDMA (56Gbps); left: 8 worker nodes, 128 cores (128M 128R), 8–20 GB; right: 16 worker nodes, 256 cores (256M 256R), 16–40 GB.

Page 26

Performance Analysis on Cluster B

• Java-based micro-benchmark comparison
  – Peak bandwidth: IPoIB (56Gbps) = 1741.46 MBps; RDMA (56Gbps) = 5612.55 MBps
• Benefit of the RDMA connection sharing design
  – Figure: GroupBy time (sec) for 16–40 GB data, RDMA without connection sharing vs. RDMA with connection sharing (56Gbps).
  – By enabling connection sharing, a 55% performance benefit is achieved for the 40 GB data size on 16 nodes.

Page 27

Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work

Page 28

Conclusion and Future Work

•  Three major conclusions

–  RDMA and high-performance interconnects can benefit Spark.

–  Plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.

–  Spark applications can benefit from an RDMA-enhanced design.

• Future Work
  – Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
  – Will make this design publicly available through the HiBD project

Page 29

Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/

