Dynamic Resource Allocation for Spark on YARN

Tsuyoshi Ozawa

Upload: tsuyoshi-ozawa

Post on 15-Aug-2015

Page 1: Dynamic Resource Allocation Spark on YARN

Dynamic Resource Allocation for Spark on YARN

Tsuyoshi Ozawa

Page 2: Dynamic Resource Allocation Spark on YARN

What’s YARN?

• A resource manager implementation for computer clusters

Page 3: Dynamic Resource Allocation Spark on YARN

Hadoop Stack

HDFS

YARN

MapReduce, Spark, Tez (all running on YARN)

Page 4: Dynamic Resource Allocation Spark on YARN

YARN overview

• All resources are managed by the ResourceManager

• All tasks are launched on NodeManagers

• Clients submit jobs via the ResourceManager

[Diagram: a client connected to the ResourceManager, which manages two NodeManagers]

Page 5: Dynamic Resource Allocation Spark on YARN

Spark on YARN

• 2 modes

• yarn-cluster

• yarn-client
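
As a rough illustration of how the two modes are selected, here is a minimal sketch that prints the matching spark-submit commands instead of running them, so it works without a Spark installation. The jar name and main class are hypothetical. The `yarn-cluster` and `yarn-client` master values belong to the Spark 1.x releases these slides cover; later releases use `--master yarn --deploy-mode cluster` or `--deploy-mode client` instead.

```shell
# Hypothetical application jar and main class, used only for illustration.
APP_JAR="my-app.jar"
APP_CLASS="com.example.MyApp"

# Print (rather than run) the spark-submit command for a given mode,
# so the sketch also works on a machine without Spark installed.
submit_cmd() {
  echo "spark-submit --master $1 --class $APP_CLASS $APP_JAR"
}

submit_cmd yarn-cluster   # driver runs in a YARN container
submit_cmd yarn-client    # driver runs on the submitting machine
```

Dropping the `echo` inside `submit_cmd` would turn the dry run into a real submission.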

Page 6: Dynamic Resource Allocation Spark on YARN

yarn-cluster mode

• Launches the Spark driver in a YARN container

• Works well with spark-submit

[Diagram: (1) the client submits the job to the ResourceManager; (2) the ResourceManager launches the Spark AppMaster, which hosts the Spark driver, in a container on a NodeManager; (3) the AppMaster launches executors in containers on the NodeManagers]

Page 7: Dynamic Resource Allocation Spark on YARN

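yarn-client mode

• Launches the Spark driver on the client side

• Works well with spark-shell

[Diagram: (1) the client submits the job to the ResourceManager; (2) the ResourceManager launches the Spark AppMaster in a container on a NodeManager; (3) the AppMaster launches executors in containers on the NodeManagers; (4) the Spark driver on the client sends commands to the executors]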

Page 8: Dynamic Resource Allocation Spark on YARN

Spark on YARN

• yarn-cluster mode

[Diagram: three nodes (Node1, Node2, Node3) running the AppMaster and the executor containers]

Page 9: Dynamic Resource Allocation Spark on YARN

Problem

• Inefficient resource management

• Containers cannot exit until the job exits

Utilization of the four containers on Node1 and Node2:

          container  container  container  container
stage1    100%       100%       100%       100%
stage2    100%       0%         0%         0%
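
To put a number on the waste, here is a small sketch that averages the per-container utilization figures from this slide. The helper name `avg_util` is made up for illustration.

```shell
# Per-container utilization (%) for each stage, copied from the slide.
stage1="100 100 100 100"
stage2="100 0 0 0"

# Average of a whitespace-separated list of percentages (integer division).
avg_util() {
  local sum=0 n=0 u
  for u in $1; do
    sum=$((sum + u))
    n=$((n + 1))
  done
  echo $((sum / n))
}

echo "stage1: $(avg_util "$stage1")%"
echo "stage2: $(avg_util "$stage2")%"
```

In stage2 only one of the four containers is doing work, yet all four stay allocated until the job ends.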

Page 10: Dynamic Resource Allocation Spark on YARN

Dynamic resource allocation (since v1.2)

• Allocates containers more dynamically

• The number of executors is decided by the workload

[Diagram: (1) the client submits the job to the ResourceManager; (2) the ResourceManager launches the Spark AppMaster in a container on a NodeManager; (3) executors are launched or killed as the workload changes; the Spark driver decides when]
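
One way to picture "decided by the workload" is a sizing rule that asks for just enough executors to run all pending tasks at once, clamped to a configured range. The sketch below is a simplification for illustration, not Spark's exact policy. The function name and arguments are hypothetical, and the bounds loosely correspond to `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`.

```shell
# Simplified sizing rule (an assumption for illustration, not Spark's exact policy).
# Arguments: pending tasks, task slots per executor, min executors, max executors.
target_executors() {
  local pending=$1 slots=$2 min=$3 max=$4
  # Ceiling division: executors needed to run every pending task at once.
  local want=$(( (pending + slots - 1) / slots ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

target_executors 10 4 1 8    # moderate backlog
target_executors 0 4 1 8     # idle: falls back to the minimum
target_executors 100 4 1 8   # large backlog: capped at the maximum
```

With these inputs the three calls ask for 3, 1, and 8 executors respectively.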

Page 11: Dynamic Resource Allocation Spark on YARN

Yak shaving

• Where should we hold the state of Spark RDDs?

• If executors are killed, that state is lost…

[Diagram: a NodeManager hosting two executors, each holding RDD data]

Page 12: Dynamic Resource Allocation Spark on YARN

External shuffle

• Saves Spark RDD intermediate files to the NodeManager

• The NodeManager has an interface for this: the external shuffle plugin

• Now executors are stateless!

[Diagram: a NodeManager running two executors and the external shuffle plugin, which holds each executor's RDD (intermediate file)]

Page 13: Dynamic Resource Allocation Spark on YARN

How to install (with Apache Hadoop)

• Copy the shuffle plugin to the NodeManager’s classpath

• Edit yarn-site.xml

• Edit spark-defaults.conf

Page 14: Dynamic Resource Allocation Spark on YARN

Copy the shuffle jar to the NodeManager’s classpath

$ cp lib/spark-*-yarn-shuffle.jar /home/ubuntu/hadoop/share/hadoop/yarn/

Page 15: Dynamic Resource Allocation Spark on YARN

Edit yarn-site.xml

• Add the shuffle plugin

• Note that the documentation for 1.2 includes a typo (I sent a PR :-))

• See the documentation for 1.4
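
The settings themselves are not in this transcript. A minimal sketch, assuming the property names given in the Spark documentation for the YARN shuffle service: it writes the two properties to a scratch file for review. Keeping `mapreduce_shuffle` in the list assumes MapReduce jobs also run on the cluster.

```shell
# Write the two properties to a scratch file for review. Merge them by hand
# into the <configuration> element of yarn-site.xml on every NodeManager.
FRAGMENT="$(mktemp)"
cat > "$FRAGMENT" <<'EOF'
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
EOF
cat "$FRAGMENT"
```

The NodeManagers need a restart after the change so that they load the plugin.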

Page 16: Dynamic Resource Allocation Spark on YARN

Edit spark-defaults.conf
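
The settings themselves are not in this transcript. A minimal sketch, assuming the standard property names from the Spark configuration documentation: the first two lines switch on dynamic allocation and the external shuffle service. The executor bounds of 1 and 8 are illustrative values only, and the scratch file stands in for the real spark-defaults.conf.

```shell
# Scratch file standing in for conf/spark-defaults.conf (path is an assumption).
CONF="$(mktemp)"
cat >> "$CONF" <<'EOF'
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   8
EOF
cat "$CONF"
```

The bounds keep the application from shrinking to nothing or taking over the whole cluster; suitable values depend on the workload.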

Page 17: Dynamic Resource Allocation Spark on YARN

We’re ready!!

• The number of executors (num-executors) is now determined automatically

Page 18: Dynamic Resource Allocation Spark on YARN

Demo

Page 19: Dynamic Resource Allocation Spark on YARN

Summary

• Spark on YARN

• yarn-client mode

• yarn-cluster mode

• Spark can launch jobs efficiently on YARN with dynamic allocation