
Workflow Management

CMSC 491: Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

APACHE OOZIE

Problem!

• "Okay, Hadoop is great, but how do people actually do this?“ – A Real Person– Package jobs?– Chaining actions together?– Run these on a schedule?– Pre and post processing?– Retry failures?

Apache Oozie: Workflow Scheduler for Hadoop

• Scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs

• Workflow jobs are DAGs of actions
• Coordinator jobs are recurrent Oozie Workflow jobs triggered by time and data availability
• Supports several types of jobs:
  – Java MapReduce
  – Streaming MapReduce
  – Pig
  – Hive
  – Sqoop
  – DistCp
  – Java programs
  – Shell scripts

Why should I care?

• Retry jobs in the event of a failure
• Execute jobs at a specific time or when data is available
• Correctly order job execution based on dependencies
• Provide a common framework for communication
• Use the workflow to couple resources instead of some home-grown code base

Layers of Oozie

• Bundles
• Coordinators
• Workflows
• Actions

Actions

• Have a type, and each type has a defined set of configuration variables

• Each action must specify what to do based on success or failure
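As a rough sketch (the node names and the cleanup.sh script are invented for illustration), every action node in workflow.xml follows the same shape: a typed body whose valid configuration elements depend on the type, plus explicit transitions for success and failure:

  <action name="cleanup-node">
    <!-- the action type (here, shell) determines which configuration elements are valid -->
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>foo:9001</job-tracker>
      <name-node>hdfs://bar:9000</name-node>
      <exec>cleanup.sh</exec>
      <file>cleanup.sh</file>
    </shell>
    <!-- every action declares where control flows on success and on failure -->
    <ok to="next-node"/>
    <error to="fail"/>
  </action>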

Workflow DAGs

[Diagram: an example workflow DAG with nodes start, Java Main, M/R streaming job, decision (branches MORE and ENOUGH), fork, Pig job, M/R job, join, Java Main, FS job, and end, connected by OK transitions]

Workflow Language

Flow-control nodes:
• Decision: expresses "switch-case" logic
• Fork: splits one path of execution into multiple concurrent paths
• Join: waits until every concurrent execution path of a previous fork node arrives at it
• Kill: forces a workflow job to abort execution

Action nodes:
• java: invokes the main() method of the specified Java class
• fs: manipulates files and directories in HDFS; supports the move, delete, and mkdir commands
• map-reduce: starts a Hadoop MapReduce job; that could be a Java MR job, a streaming job, or a pipes job
• pig: runs a Pig job
• sub-workflow: runs a child workflow job
• hive: runs a Hive job
• shell: runs a shell command
• ssh: starts a shell command on a remote machine as a remote secure shell
• sqoop: runs a Sqoop job
• email: sends email from an Oozie workflow application
• distcp: runs a Hadoop DistCp MapReduce job
• custom: does what you program it to do
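To make the flow-control nodes concrete, here is a hedged sketch (the node names, the inputDir parameter, and the _more marker path are invented; fs:exists and concat are standard Oozie EL functions) combining a decision with a fork/join pair:

  <decision name="more-data">
    <switch>
      <!-- switch-case logic: take the first case whose predicate is true -->
      <case to="java-node">${fs:exists(concat(inputDir, '/_more'))}</case>
      <default to="fork-node"/>
    </switch>
  </decision>
  <fork name="fork-node">
    <!-- both paths start concurrently -->
    <path start="pig-node"/>
    <path start="mr-node"/>
  </fork>
  <!-- the join fires only after every forked path reaches it -->
  <join name="join-node" to="end"/>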

Oozie Workflow Application

• An HDFS directory containing:
  – Definition file: workflow.xml
  – Configuration file: config-default.xml
  – App files: a lib/ directory with JARs and other dependencies
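A minimal sketch of what that directory might look like and how it could be copied into HDFS (the application name and paths are assumptions):

  wordcount-wf/
    workflow.xml
    config-default.xml
    lib/
      wordcount.jar

  $ hadoop fs -put wordcount-wf /user/hadoop/oozie/app/wordcount-wf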

WordCount Workflow

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>

[Diagram: the WordCount workflow as a graph: Start node → M-R wordcount action; OK transition → End, Error transition → Kill]
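To run the workflow above, you would typically point a properties file at the application directory and submit it with the Oozie CLI. A sketch, where the paths and parameter values are assumptions:

  # job.properties
  nameNode=hdfs://bar.com:9000
  jobTracker=foo.com:9001
  oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app/wordcount-wf
  inputDir=/user/hadoop/wordcount/input
  outputDir=/user/hadoop/wordcount/output

  $ oozie job -oozie http://localhost:11000/oozie -config job.properties -run

Note that ${inputDir} and ${outputDir} in workflow.xml are resolved from these submitted properties, so the same workflow definition can be reused against different data.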

Coordinators

• Oozie executes workflows based on:
  – Time dependency
  – Data dependency
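A hedged sketch of the data-dependency side (the dataset name, frequency, and URIs are invented): this coordinator only runs the workflow once the current day's instance of the dataset is available.

  <coordinator-app name="data-driven-coord" frequency="${coord:days(1)}"
                   start="${COORD_START}" end="${COORD_END}" timezone="UTC"
                   xmlns="uri:oozie:coordinator:0.1">
    <datasets>
      <dataset name="logs" frequency="${coord:days(1)}"
               initial-instance="${COORD_START}" timezone="UTC">
        <uri-template>hdfs://bar:9000/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
      </dataset>
    </datasets>
    <input-events>
      <!-- wait for today's instance of the dataset before materializing the workflow -->
      <data-in name="input" dataset="logs">
        <instance>${coord:current(0)}</instance>
      </data-in>
    </input-events>
    <action>
      <workflow>
        <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
      </workflow>
    </action>
  </coordinator-app>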

[Diagram: Oozie architecture. The Oozie Client submits jobs over the WS API to the Oozie server (running in Tomcat); the Oozie Coordinator checks data availability and triggers Oozie Workflows, which execute on Hadoop.]

Bundle

• Bundles are higher-level abstractions that batch a set of coordinators together
• There are no explicit dependencies between the coordinators in a bundle, but they can be used to define a data pipeline, as sketched below
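A minimal bundle sketch under assumed names and paths: two coordinators submitted, started, and managed as one unit.

  <bundle-app name="daily-pipeline" xmlns="uri:oozie:bundle:0.2">
    <coordinator name="ingest-coord">
      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/ingest_coord</app-path>
    </coordinator>
    <coordinator name="report-coord">
      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/report_coord</app-path>
    </coordinator>
  </bundle-app>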

Interacting with Oozie

• Read-only web console
• Command-line interface (CLI)
• Java client
• Web service endpoints
• Directly with the Oozie DB using SQL
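For example, the CLI covers the common lifecycle operations (the server URL and job ID here are placeholders):

  $ oozie job -oozie http://localhost:11000/oozie -config job.properties -run    # submit and start
  $ oozie job -oozie http://localhost:11000/oozie -info 0000001-150101000000000-oozie-oozi-W    # check status
  $ oozie job -oozie http://localhost:11000/oozie -kill 0000001-150101000000000-oozie-oozi-W    # abort
  $ oozie jobs -oozie http://localhost:11000/oozie    # list recent jobs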


<workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1"> <start to="java-node"/> <action name="java-node"/> <java> <job-tracker>foo:9001</job-tracker> <name-node>bar:9000</name-node> <main-class>org.foo.bar.PullFileStream</main-class> </java> <ok to="mr-node"/> <error to="fail"/> </action> <action name="mr-node"> <map-reduce> <job-tracker>foo:9001</job-tracker> <name-node>bar:9000</name-node> <configuration> ... </configuration> </map-reduce> <ok to="email-node">

<error to="fail"/> </action> ...

...<action name="email-node">

<email xmlns="uri:oozie:email-action:0.1"><to>[email protected]</to><cc>[email protected]</cc><subject>Email notification</subject><body>The wf completed</body>

</email><ok to="myotherjob"/><error to="errorcleanup"/>

</action><end name="end"/><kill name="fail"/>

</workflow-app>


<?xml version="1.0"?>
<coordinator-app name="daily_job_coord" frequency="${coord:days(1)}"
                 start="${COORD_START}" end="${COORD_END}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1" xmlns:sla="uri:oozie:sla:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
    </workflow>
    <sla:info>
      <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
      <sla:should-start>${X * MINUTES}</sla:should-start>
      <sla:should-end>${Y * MINUTES}</sla:should-end>
      <sla:alert-contact>[email protected]</sla:alert-contact>
    </sla:info>
  </action>
</coordinator-app>
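Submitting a coordinator looks just like submitting a workflow, except the properties file points at the coordinator application and supplies the start/end parameters (paths and dates here are assumptions):

  # coord.properties
  oozie.coord.application.path=hdfs://bar:9000/user/hadoop/oozie/app/daily_job_coord
  COORD_START=2015-01-01T00:00Z
  COORD_END=2015-12-31T00:00Z

  $ oozie job -oozie http://localhost:11000/oozie -config coord.properties -run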


Review

• Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff
• Advanced control flow and action extensibility let Oozie do whatever you need at any point in the workflow
• XML is gross