Transcript
Page 1: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie: Towards a Scalable Workflow Management System for Hadoop

Mohammad Islam

And

Virag Kothari

Page 2: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Accepted Paper

• Workshop in ACM/SIGMOD, May 2012.• It is a team effort!

Mohammad Islam Angelo Huang

Mohamed Battisha Michelle Chiang Santhosh Srinivasan Craig Peters

Andreas Neumann Alejandro Abdelnur

Page 3: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Presentation Workflow

Oozie

Tutorial

Design

Decisions

Results

ENDAddress

Question

Questions?

Page 4: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Installing Oozie

Step 2: Unpack the tarballtar –xzvf <PATH_TO_OOZIE_TAR>

Step 3: Run the setup scriptbin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip

Step 4: Start oozie bin/oozie-start.sh

Step 5: Check status of oozie bin/oozie admin -oozie http://localhost:11000/oozie -status

Step 1: Download the Oozie tarballcurl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz

Page 5: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Running an Example

<workflow –app name =..><start..><action> <map-reduce> …… ……</workflow>

Workflow.xml

$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir

• Standalone Map-Reduce job

StartMapReducewordcount

EndOK

Kill

ERROR

Example DAG

• Using Oozie

Page 6: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

<action name=’wordcount'>

<map-reduce>

<configuration>

<property>

<name>mapred.mapper.class</name> <value>org.myorg.WordCount.Map</value>

</property><property>

<name>mapred.reducer.class</name> <value>org.myorg.WordCount.Reduce</value>

</property>

<property>

<name>mapred.input.dir</name>

<value>usr/joe/inputDir </value>

</property>

<property>

<name>mapred.output.dir</name>

<value>/usr/joe/outputDir</value>

</property>

</configuration>

</map-reduce>

</action>

Example Workflow

mapred.mapper.class = org.myorg.WordCount.Map

mapred.reducer.class = org.myorg.WordCount.Reduce

mapred.input.dir = inputDir

mapred.output.dir = outputDir

Page 7: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

A Workflow Application

1) Workflow.xml:Contains job definition

2) Libraries: optional ‘lib/’ directory contains .jar/.so files

3) Properties file:• Parameterization of Workflow xml• Mandatory property is oozie.wf.application.path

Three components required for a Workflow:

Page 8: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Workflow Submission

Run Workflow Job

$ oozie job –run -config job.properties -oozie http://localhost:11000/oozie/

Workflow ID: 00123-123456-oozie-wrkf-W

Check Workflow Job Status

$ oozie job –info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/

-----------------------------------------------------------------------Workflow Name: test-wfApp Path: hdfs://localhost:11000/user/your_id/oozie/Workflow job status [RUNNING] ... ------------------------------------------------------------------------

Page 9: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Key Features and Design Decisions

• Multi-tenant• Security

– Authenticate every request– Pass appropriate token to Hadoop job

• Scalability– Vertical: Add extra memory/disk– Horizontal: Add machines

Page 10: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie Job Processing

Oozie Server

Sec

ure

Acc

ess

Had

oop

Kerberos

Oozie Security

End user

Job

Page 11: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie-Hadoop Security

Oozie Server

Sec

ure

A

cces

s

End user

Oozie Security

c

Had

oop

Job Kerberos

Page 12: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie-Hadoop Security

• Oozie is a multi-tenant system• Job can be scheduled to run later• Oozie submits/maintains the hadoop jobs• Hadoop needs security token for each

request

Question: Who should provide the security token to hadoop and how?

Page 13: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie-Hadoop Security Contd.

• Answer: Oozie• How?

– Hadoop considers Oozie as a super-user– Hadoop does not check end-user credential– Hadoop only checks the credential of Oozie

process

• BUT hadoop job is executed as end-user.• Oozie utilizes doAs() functionality of Hadoop.

Page 14: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

User-Oozie Security

Oozie Server

Sec

ure

A

cces

s

End user

Oozie Security

c

Had

oop

KerberosJob

Page 15: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Why Oozie Security?

• One user should not modify another user’s job

• Hadoop doesn’t authenticate end–user• Oozie has to verify its user before passing

the job to Hadoop

Page 16: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

How does Oozie Support Security?

• Built-in authentication– Kerberos– Non-secured (default)

• Design Decision– Pluggable authentication– Easy to include new type of authentication– Yahoo supports 3 types of authentication.

Page 17: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Job Submission to Hadoop

• Oozie is designed to handle thousands of jobs at the same time

• Question : Should Oozie server – Submit the hadoop job directly? – Wait for it to finish?

• Answer: No

Page 18: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Job Submission Contd.

• Reason – Resource constraints: A single Oozie process

can’t simultaneously create thousands of thread for each hadoop job. (Scaling limitation)

– Isolation: Running user code on Oozie server might de-stabilize Oozie

• Design Decision– Create a launcher hadoop job– Execute the actual user job from the launcher.– Wait asynchronously for the job to finish.

Page 19: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Job Submission to Hadoop

Oozie Server

Hadoop Cluster

Job Tracker

Launcher Mapper

12

4

Actual M/R Job3

5

Page 20: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Job Submission Contd.

• Advantages – Horizontal scalability: If load increases, add

machines into Hadoop cluster– Stability: Isolation of user code and system

process• Disadvantages

– Extra map-slot is occupied by each job.

Page 21: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Production Setup

• Total number of nodes: 42K+• Total number of Clusters: 25+• Total number of processed jobs ≈ 750K/month• Data presented from two clusters• Each of them have nearly 4K nodes• Total number of users /cluster = 50

Page 22: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie Usage Pattern @ Y!

fs java map-reduce pig 0

5

10

15

20

25

30

35

40

45

50

Distribution of Job Types On Production Clusters

#1 Cluster#2 Cluster

Job type

Pe

rce

nta

ge

• Pig and Java are the most popular• Number of pure Map-Reduce jobs are fewer

Page 23: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Experimental Setup

• Number of nodes: 7• Number of map-slots: 28• 4 Core, RAM: 16 GB• 64 bit RHEL• Oozie Server

– 3 GB RAM– Internal Queue size = 10 K– # Worker Thread = 300

Page 24: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Job Acceptance

2 6 10 14 20 40 52 100 120 200 320 6400

200

400

600

800

1000

1200

1400

Workflow Acceptance Rate

Number of Submission Threads

wo

rkfl

ow

s A

ccep

ted

/Min

Observation: Oozie can accept a large number of jobs

Page 25: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Time Line of a Oozie Job

Time

User submits

Job

Oozie submits to

Hadoop

Job completes at Hadoop

Job completes at Oozie

Preparation Overhead

Completion Overhead

Total Oozie Overhead = Preparation + Completion

Page 26: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie Overhead

1 Action 5 Actions 10 Actions 50 Actions0

200

400

600

800

1000

1200

1400

1600

1800

Per Action Overhead

Number of Actions/Workflow

Ove

rhea

d i

n m

illi

secs

Observation: Oozie overhead is less when multiple actions are in the same workflow.

Page 27: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Oozie Futures

• Scalability– Hot-Hot/Load balancing service– Replace SQL DB with Zookeeper

• Improved Usability• Extend the benchmarking scope• Monitoring WS API

Page 28: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Take Away ..

• Oozie is – Easier to use– Scalable– Secure and multi-tenant

Page 29: May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

Q & A

Virag Kothari

[email protected]

Mohammad K Islam [email protected]

http://incubator.apache.org/oozie/


Top Related