ooziehugoct2013-131112012637-phpapp01
TRANSCRIPT
-
8/12/2019 ooziehugoct2013-131112012637-phpapp01
Oozie: Now and Beyond
Presented by Mona Chitnis
Hadoop User Group, Yahoo Sunnyvale, October 16, 2013
-
Team In Action
2 Yahoo Confidential & Proprietary
* Alejandro Abdelnur
* Mohammad Islam
* Rohini Palaniswamy
* Robert Kanter
* Virag Kothari
* Mona Chitnis
* Ryota Egashira
* Michelle Chiang
* Bowen Zhang
-
OVERVIEW
-
Why Oozie?
The Problem
* Doing something on the grid often required multiple steps
  * MapReduce job
  * Pig job
  * Streaming job
  * HDFS operation (mkdir, chmod, etc.)
* Multiple ad-hoc solutions existed
  * custom job control
  * shell scripts
  * cron
* Cost of building and running apps was high
  * development and applications engineering
  * support, operations, and hardware
The Need
* Workflow scheduler with better support for grid jobs (native integration with Hadoop)
  * orchestrate dependency between jobs
  * execute at a specific time or on data availability
  * retry jobs in the event of failures (reliable)
* Common framework for communication and execution of production process
  * sync (clocked dataset) awareness
  * async (unspecified freq) data awareness
* Horizontally scalable and extensible system
* Open-source
* Workflows to couple resources instead of having a monolithic code base
A server-based workflow scheduling system to manage Hadoop jobs
Overview
-
Oozie: A Workflow Engine
* Oozie executes a workflow defined as a DAG of jobs
* Job types include MapReduce, Pig, Hive, shell script, custom Java code, etc.
* Introduced in Oozie 1.x
[Figure: example workflow DAG with nodes start, M/R job, decision (branches MORE and ENOUGH), fork, Pig job, M/R job, join, Java, FS job, end, and kill; transitions labeled OK and ERROR]
Control-flow nodes: start, kill, end, fork, join, decision
Action nodes: map reduce, pig, hive, distcp, java, fs, sub-workflow, shell, ssh, email
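A minimal Python sketch of how OK/ERROR transitions route a workflow through its nodes. The graph, the node names, and the `run` helper are hypothetical illustrations, not Oozie code; decision, fork, and join nodes are left out to keep it short.

```python
# Hypothetical workflow graph: node names are illustrative, not from the slides.
WORKFLOW = {
    "start": {"type": "start", "to": "mr-job"},
    "mr-job": {"type": "action", "ok": "pig-job", "error": "kill"},
    "pig-job": {"type": "action", "ok": "end", "error": "kill"},
    "end": {"type": "end"},
    "kill": {"type": "kill"},
}


def run(workflow, failed_actions=()):
    """Follow OK/ERROR transitions from 'start' and return the visited nodes."""
    path = []
    node = "start"
    while True:
        path.append(node)
        spec = workflow[node]
        if spec["type"] in ("end", "kill"):
            return path
        if spec["type"] == "start":
            node = spec["to"]
        else:
            # An action node moves to its 'error' target when it fails.
            node = spec["error"] if node in failed_actions else spec["ok"]


print(run(WORKFLOW))
print(run(WORKFLOW, failed_actions={"mr-job"}))
```

Every action names both an OK and an ERROR target, so a failure always has somewhere to go (here, the kill node).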
-
Example M/R Action
* JobTracker (JT) and NameNode (NN)
* Mapper
* Reducer
* Queue Name
* Input Directory
* Output Directory
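A hedged sketch of what such an action could look like, generated with Python's standard `xml.etree.ElementTree`. The element names follow the Oozie workflow schema as commonly documented; the `mapred.*` property names are the classic Hadoop 1 keys and are an assumption here, as are the class names, the node names, and the `${...}` parameters.

```python
import xml.etree.ElementTree as ET


def mapreduce_action(name, job_tracker, name_node, properties, ok_to, error_to):
    """Build an <action> element containing a <map-reduce> definition."""
    action = ET.Element("action", {"name": name})
    mr = ET.SubElement(action, "map-reduce")
    ET.SubElement(mr, "job-tracker").text = job_tracker
    ET.SubElement(mr, "name-node").text = name_node
    configuration = ET.SubElement(mr, "configuration")
    for key, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = key
        ET.SubElement(prop, "value").text = value
    ET.SubElement(action, "ok", {"to": ok_to})
    ET.SubElement(action, "error", {"to": error_to})
    return action


action = mapreduce_action(
    name="mr-node",                      # hypothetical node name
    job_tracker="${jobTracker}",         # JT
    name_node="${nameNode}",             # NN
    properties={
        "mapred.mapper.class": "org.example.MyMapper",    # hypothetical class
        "mapred.reducer.class": "org.example.MyReducer",  # hypothetical class
        "mapred.job.queue.name": "${queueName}",
        "mapred.input.dir": "${inputDir}",
        "mapred.output.dir": "${outputDir}",
    },
    ok_to="end",
    error_to="kill",
)
xml_text = ET.tostring(action, encoding="unicode")
print(xml_text)
```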
-
Workflow State Transitions
Source: Chicago HUG, Dec 2012
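A minimal sketch of a workflow job state machine, assuming the states and transitions described in the Apache Oozie workflow specification (PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED). The `transition` helper is hypothetical and only checks whether a move is allowed.

```python
# Allowed workflow job transitions (assumed from the Oozie workflow specification).
VALID_TRANSITIONS = {
    "PREP": {"RUNNING", "KILLED"},
    "RUNNING": {"SUSPENDED", "SUCCEEDED", "KILLED", "FAILED"},
    "SUSPENDED": {"RUNNING", "KILLED"},
    "SUCCEEDED": set(),  # terminal
    "KILLED": set(),     # terminal
    "FAILED": set(),     # terminal
}


def transition(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"invalid transition {current} -> {target}")
    return target


state = "PREP"
for step in ("RUNNING", "SUSPENDED", "RUNNING", "SUCCEEDED"):
    state = transition(state, step)
print(state)
```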
-
Oozie (Coordinator): A Scheduler
* Oozie executes workflows based on
  * time dependency (frequency)
  * data dependency
* Introduced in 2.x
[Figure: components Oozie Client, WS API, Oozie Server, Oozie Coordinator, Oozie Workflow, and HDFS/HCat, with a "Check Data Availability" step]
-
Oozie (Bundle): A Pipeline Framework
* Users can define and execute a bundle of coordinator apps
  * large-scale data processing (inter-related coordinators)
  * operability and manageability of pipelines
* Users can start/stop/suspend/resume/rerun at the bundle level
* Introduced in 3.x; bundles are optional
[Figure: components Oozie Client, WS API, Oozie Server, Bundle, Oozie Coordinator, Oozie Workflow, and HDFS/HCat, with a "Check Data Availability" step]
-
Layers of Abstraction in Oozie
[Figure: diagram of the three layers]
1. Bundle
2. Coordinator
3. Workflow
-
Architectural Overview
[Figure: architecture diagram of the Oozie server]
* Oozie (Java Web-App)
* Web Services (JSON/REST API): WS API, WS Callback
* Security
* Commands: submit, start, rerun, suspend, resume, kill, info, start action, end action, check action, callback, signal, job notification
  * executed asynchronously via the Command Queue
* Command Queue, Command Executor Thread Pool, Recovery Daemon Thread
* DAG Engine
* Action Executors: M/R, Pig, fs, sub-wf
  * pluggable, to support additional action types
* WF store, WF lib
* Oracle DB
* Instrumentation
-
Oozie Security, Multi-tenancy and Scalability
[Figure: request flow between the End User, the Oozie Server, and the Hadoop Cluster (YARN RM, Launcher Mapper, actual M/R job)]
1. Auth.: End User (Kerberos, Y! specific)
2. Create Launcher Job (super-user)
3. Execute User Job (doAs)
4. Response
5. Async Callback
-
USE CASES
-
Use Case 1: Time Triggers
Execute your workflow every 15 minutes
00:15 00:30 00:45 01:00
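A small Python sketch of the time trigger: it lists the nominal run times for a fixed frequency. The `nominal_times` helper and the dates are hypothetical, and the end time is treated as exclusive, which is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency_minutes):
    """Return the run times in [start, end) spaced by the given frequency."""
    step = timedelta(minutes=frequency_minutes)
    times = []
    current = start
    while current < end:
        times.append(current)
        current += step
    return times


runs = nominal_times(
    datetime(2013, 10, 16, 0, 15),  # hypothetical start
    datetime(2013, 10, 16, 1, 15),  # hypothetical end (exclusive)
    15,
)
print([t.strftime("%H:%M") for t in runs])
```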
Use Cases and Common Patterns
-
Use Case 2: Time and Data Triggers
Materialize your workflow every hour, but only run it when the input data is ready (the input data is loaded to the grid every hour)
[Figure: timeline 01:00, 02:00, 03:00, 04:00; at each time a check against Hadoop: "Input Data Exists?"]
-
Use Case 2: Time and Data Triggers
Dataset Definition:
  hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}
Input Events Definition, with the time at which the coordinator action is materialized (created):
  ${current(0)}
Action Definition:
  hdfs://bar:9000/usr/abc/logsprocessor-wf
  inputData = ${dataIn(inputLogs)}
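A rough sketch of how a dataset URI template could resolve for a given instance, using Python's `string.Template`. The `resolve_instance` helper is hypothetical, and it simplifies `current(n)` to "nominal time plus n dataset periods", assuming the nominal time is aligned with the hourly dataset.

```python
from datetime import datetime, timedelta
from string import Template

URI_TEMPLATE = "hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}"


def resolve_instance(uri_template, nominal_time, offset, frequency_hours=1):
    """Resolve the template for the instance `offset` periods from nominal_time."""
    instance_time = nominal_time + timedelta(hours=offset * frequency_hours)
    return Template(uri_template).substitute(
        YEAR=f"{instance_time:%Y}",
        MONTH=f"{instance_time:%m}",
        DAY=f"{instance_time:%d}",
        HOUR=f"{instance_time:%H}",
    )


# Offset 0 plays the role of current(0); -1 would be the previous hour.
nominal = datetime(2013, 10, 16, 5, 0)  # hypothetical action time
print(resolve_instance(URI_TEMPLATE, nominal, 0))
print(resolve_instance(URI_TEMPLATE, nominal, -1))
```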
-
Use Case 3: Rolling Window
Access 15-minute datasets and roll them up into hourly datasets
[Figure: datasets 00:15, 00:30, 00:45, 01:00 roll up into 01:00; datasets 01:15, 01:30, 01:45, 02:00 roll up into 02:00]
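A sketch of the rolling window in Python: each hourly run picks the four 15-minute instances that end at its own time, so consecutive windows do not overlap. The `rolling_window` helper and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def rolling_window(nominal_time, dataset_minutes=15, instances=4):
    """Return the dataset instances ending at nominal_time, oldest first."""
    step = timedelta(minutes=dataset_minutes)
    return [nominal_time - step * i for i in range(instances - 1, -1, -1)]


run_0100 = rolling_window(datetime(2013, 10, 16, 1, 0))
run_0200 = rolling_window(datetime(2013, 10, 16, 2, 0))
print([t.strftime("%H:%M") for t in run_0100])
print([t.strftime("%H:%M") for t in run_0200])
```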
-
Use Case 4: Sliding Window
Access the last 24 hours of data, and roll them up every hour
[Figure: hourly datasets 01:00 through 24:00 roll up into 24:00; 02:00 through 01:00 (+1 day) roll up into 01:00 (+1 day); 03:00 through 02:00 (+1 day) roll up into 02:00 (+1 day)]
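A sketch of the sliding window in Python: every hourly run reads the 24 hourly instances ending at its own time, so two consecutive runs share 23 of them. The `sliding_window` helper and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def sliding_window(nominal_time, hours=24):
    """Return the hourly instances covering the `hours` ending at nominal_time."""
    return [nominal_time - timedelta(hours=i) for i in range(hours - 1, -1, -1)]


first = sliding_window(datetime(2013, 10, 17, 0, 0))   # the "24:00" run
second = sliding_window(datetime(2013, 10, 17, 1, 0))  # the "01:00 +1 day" run
shared = set(first) & set(second)
print(first[0], first[-1], len(shared))
```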
-
Proven Scale and Multi-tenancy
* 17 clusters
* 13,000 jobs/server/day
* 2.8 M jobs/month (16% of all Hadoop jobs)
* 75 products
* 2,000+ projects
* 255 monthly users
* 5.4 M compute hrs/month
* 770,000 workflows, each with 1 to 8 actions (avg. 4 actions/workflow)
* 250 coordinator jobs/day; 67% of Oozie jobs kicked off through a coordinator
Where are We Today
-
Mix Of Job Types For Workflows
[Figure: stacked bar chart of workflow jobs by type: Pig 39%, MapReduce 29%, Java 28%, Other 4%]
Sample use of job types
* Pig: data processing/filtering, aggregation
* MapReduce: publishing data (HDFS/HCat)
* Java: legacy code and logic
* Others (DistCp and shell): data copy/transfer
-
FEATURE DEEP-DIVE
-
Existing Features (Oozie 3.x)
* HBase access through Oozie, via credentials
* HCatalog access through Oozie, via credentials
* Email action
* DistCp action (intra- as well as inter-cluster copy)
* Shell action (run any script, e.g. perl, python, hadoop CLI)
* Workflow dry-run and fork-join validation
* Bulk monitoring (REST API)
* Coordinator EL functions for parameterized workflows
* Job DAG
Whats New in Oozie
-
HBase Credentials
* In workflow.xml:
  * Add a "credentials" section. The type is "hbase".
  * Specify that the java action uses the credentials.
* Put hbase-site.xml in the Oozie application path, and use <file> in workflow.xml to put hbase-site.xml in the distributed cache. A copy of hbase-site.xml can be found at gateway:/home/gs/conf/hbase/hbase-site.xml.
* Put the JARs guava-*.jar, zookeeper-*.jar, hbase-*.jar, protobuf-java-*.jar in the workflow lib dir
* Make sure you are using Oozie XSD version 0.3 or above for the tag.
<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.3">
  ...
  <file>hbase-site.xml#hbase-site.xml</file>
  ...
</java>
* Refer to http://twiki.corp.yahoo.com/view/CCDI/UseHbaseCred
-
Oozie 4.0
1. HCatalog Integration
2. Job Notifications
3. SLA Monitoring
-
HCatalog Integration
* Oozie now supports HCatalog datasets, in addition to HDFS
  * Query the HCat server directly, or
  * Receive partition-created notifications
* With HDFS datasets, Oozie polls the NameNode to check data availability
  * Delay
  * Single source
[Figure: Oozie repeatedly asks the NameNode "data exists?" for the HDFS paths /data/click/2013/03/10, /data/click/2013/03/11, /data/click/2013/03/12]
-
Latest Oozie 4.0 Features: HCatalog Integration
* HCat: the metastore has info about HDFS datasets, locations and file formats.
* Using the HCat loader and storer, a dataset can be consumed uniformly using Pig, Hive and Map/Reduce in Oozie, using the database, table, partition abstraction.
* Oozie is notified of partition availability via JMS messages, to trigger workflows immediately
* Use the JARs hcatalog-core.jar, webhcat-java-client.jar, hive-common.jar, hive-exec.jar, hive-metastore.jar, hive-serde.jar and libfb303.jar in the workflow lib
* Docs: http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html
Pig example (fragment):
  ... B generate foo, bar;
  store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING
  org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
-
With HCatalog + Notifications: High-level Diagram
[Figure: HCatalog, Data Producer, HDFS]
* Produce data (distcp, pig, M/R..) at /data/click/2013/03/12
* Update metadata (ALTER TABLE click ADD PARTITION(data=2013/03/12) location hdfs://data/click/2013/03/12)
-
With HCatalog + Notifications: High-level Diagram
[Figure: Oozie, Message Bus (e.g., ActiveMQ), HCatalog, Data Producer, HDFS]
1. Query/Poll Partition
2. Register Topic
-
With HCatalog + Notifications: High-level Diagram
[Figure: Oozie, Message Bus (e.g., ActiveMQ), HCatalog, Data Producer, HDFS]
1. Query/Poll Partition
2. Register Topic
3. Push notification
4. Notify New Partition
* Produce data (distcp, pig, M/R..) at /data/click/2013/03/12
* Update metadata (ALTER TABLE click ADD PARTITION(data=2013/03/12) location hdfs://data/click/2013/03/12)
* Start workflow
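The push side of this flow can be sketched with a toy in-memory publish/subscribe bus in Python. This is not a JMS client: `PartitionBus`, the topic name, and the callback are hypothetical stand-ins that only show a listener starting work once the awaited partition is announced.

```python
from collections import defaultdict


class PartitionBus:
    """Toy stand-in for a message bus; not a real JMS client."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def register(self, topic, callback):
        """Register a callback for a topic (step 2, Register Topic)."""
        self._listeners[topic].append(callback)

    def publish(self, topic, partition):
        """Deliver a new-partition message to every listener (steps 3 and 4)."""
        for callback in self._listeners[topic]:
            callback(partition)


waiting_for = {"data=2013/03/12"}  # partition a coordinator action depends on
started = []


def on_new_partition(partition):
    # Start the workflow only when an awaited partition arrives.
    if partition in waiting_for:
        waiting_for.discard(partition)
        started.append(f"workflow for {partition}")


bus = PartitionBus()
bus.register("click", on_new_partition)   # hypothetical topic name
bus.publish("click", "data=2013/03/11")   # not awaited: ignored
bus.publish("click", "data=2013/03/12")   # awaited: triggers the workflow
print(started)
```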
-
* A notification event is sent on job status change
* Messages are sent to the configured JMS-compliant message broker
* Users should write message listeners to listen on select topics (e.g. username)
* To filter further, apply JMS selectors on messages
  * E.g. user, jobid, app-type, status, msg-type (JOB or SLA)
* Docs: http://oozie.apache.org/docs/4.0.0/DG_JMSNotifications.html
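A toy Python sketch of selector-style filtering on message properties. It mimics the idea of a JMS selector with plain dictionaries rather than using a JMS library; the property names are taken from the list above, while the values, the `select` helper, and the sample messages are hypothetical.

```python
def select(messages, criteria):
    """Keep the messages whose properties match every key/value in criteria."""
    return [
        message
        for message in messages
        if all(message["properties"].get(key) == value for key, value in criteria.items())
    ]


# Hypothetical messages; property values are illustrative only.
messages = [
    {"properties": {"user": "joe", "app-type": "WORKFLOW_JOB", "msg-type": "JOB"}, "body": "job 1"},
    {"properties": {"user": "joe", "app-type": "COORDINATOR_ACTION", "msg-type": "SLA"}, "body": "sla 1"},
    {"properties": {"user": "ann", "app-type": "WORKFLOW_JOB", "msg-type": "JOB"}, "body": "job 2"},
]

joe_jobs = select(messages, {"user": "joe", "msg-type": "JOB"})
print([message["body"] for message in joe_jobs])
```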
Filter desired app-types for notification:
<property>
<name>...
-
* Oozie can actively track SLAs on jobs
  * Start-time, End-time, Duration
* Event Status
  * START_MET, START_MISS
  * END_MET, END_MISS
  * DURATION_MET, DURATION_MISS
* At any time, the SLA processing stage will reflect:
  * Not_Started
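A simplified Python sketch of how the three event statuses could be derived once a job has finished. The `sla_event_statuses` helper and its inputs are hypothetical; it compares expected and actual times in one pass and does not model how Oozie evaluates SLAs while a job is still running.

```python
from datetime import datetime, timedelta


def sla_event_statuses(expected_start, expected_end, expected_duration,
                       actual_start, actual_end):
    """Return the start, end, and duration statuses for a finished job."""
    return [
        "START_MET" if actual_start <= expected_start else "START_MISS",
        "END_MET" if actual_end <= expected_end else "END_MISS",
        "DURATION_MET" if actual_end - actual_start <= expected_duration else "DURATION_MISS",
    ]


statuses = sla_event_statuses(
    expected_start=datetime(2013, 10, 16, 1, 0),
    expected_end=datetime(2013, 10, 16, 2, 0),
    expected_duration=timedelta(minutes=30),
    actual_start=datetime(2013, 10, 16, 1, 5),   # five minutes late
    actual_end=datetime(2013, 10, 16, 1, 30),
)
print(statuses)
```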
-
SLA Monitoring Dashboard
Demo
-
Checking Oozie Job
1. CLI (yoozie_client)
$ oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-joe
--------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : hdfs://localhost:8020/user/joe/workflows/map-reduce
Status        : SUCCEEDED
Run           : 0
User          : joe
Group         : users
Created       : 2009-05-26 05:01
Started       : 2009-05-26 05:01
Ended         : 2009-05-26 05:01
Actions
--------------------------------------------------------------------------------
Action Name  Type        Status  Transition  External Id            External Status  Error Code  Start             End
--------------------------------------------------------------------------------
hadoop1      map-reduce  OK      end         job_200904281535_0254  SUCCEEDED        -           2009-05-26 05:01  2009-05-26 05:01
--------------------------------------------------------------------------------
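The same job information is also exposed over the Web Services API. A small Python sketch, assuming the v1 job-information endpoint (`/v1/job/<job-id>?show=info`) of the Oozie REST API; the `job_info_url` helper is hypothetical and only builds the URL, leaving the HTTP call to the reader.

```python
from urllib.parse import quote, urlencode


def job_info_url(oozie_url, job_id):
    """Build the URL that requests job information for the given job id."""
    base = oozie_url.rstrip("/")
    return f"{base}/v1/job/{quote(job_id)}?{urlencode({'show': 'info'})}"


url = job_info_url("http://localhost:11000/oozie", "14-20090525161321-oozie-joe")
print(url)
# Fetching it (e.g. with urllib.request.urlopen) would need a reachable Oozie server.
```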
-
Checking / Debugging Oozie Jobs
2. Web-Console
e.g. http://my-oozie-server:4080/oozie
Docs - https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook
-
What else is out there?
Oozie at ASF
-
Oozie vs. Other Workflow Systems
Champion:            Yahoo! (now ASF) | LinkedIn | Spotify
Apache Affiliation:  TLP | License only | License only
Language:            Java | Java | Python
Adoption:            High, part of all standard Hadoop distributions | Low | Low
Code Complexity:     High (>100K lines) | Medium (<50K lines) | Low (
-
The Next Release
* Scalability and performance improvements to handle higher loads
  * More 1- and 5-minute frequency jobs
* High Availability with Load Balancing
* Flexible Cron-Based Scheduling
* Handling cluster rolling upgrades for Hadoop 2.0
Roadmap
-
Q & A