Oozie at Yahoo! Jun 3rd 2014

Oozie at Yahoo. Purshotam Shah, Ryota Egashira. Oozie Meetup, 06/03/2014

Upload: mona-chitnis

Post on 26-Jan-2015


DESCRIPTION

by Ryota Egashira and Purshotam Shah (Yahoo)

TRANSCRIPT

Page 1: Oozie at Yahoo! Jun 3rd 2014

Oozie at Yahoo

Purshotam Shah, Ryota Egashira

Oozie Meetup 06/03/2014

Page 2

Table of Contents


▪Scale at Yahoo!

▪Scale and Performance

▪Features - Usability

▪High availability and Load Balancing

▪Customer Asks

Page 3

Scale at Yahoo!

▪Busiest cluster

› 1 million+ workflows per month

› 45 - 55K workflows per day

› 40 - 50K coord actions per day

› 800 - 900 coordinators (5m, 15m, 30m, hourly, daily and weekly)

› 30 - 40 bundles

▪Most complex bundle - 230 coordinators

▪Most complex workflow - 85 forks

▪Video Transcoding - 100-300 workflows per min
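
The daily and monthly figures above can be cross-checked with a little arithmetic. This is only a back-of-the-envelope sketch: the 30-day month is an assumption, and the slide does not say how the per-minute video transcoding rate relates to the daily totals, so it is converted to a per-second rate on its own.

```python
# Back-of-the-envelope check of the figures quoted on the slide.
DAYS_PER_MONTH = 30  # assumption: a 30-day month

# Workflows per day on the busiest cluster (from the slide).
daily_low, daily_high = 45_000, 55_000
monthly_low = daily_low * DAYS_PER_MONTH
monthly_high = daily_high * DAYS_PER_MONTH

# Video transcoding workflows per minute (from the slide).
per_min_low, per_min_high = 100, 300
per_sec_low = per_min_low / 60
per_sec_high = per_min_high / 60

print(f"{monthly_low:,} to {monthly_high:,} workflows per month")
print(f"{per_sec_low:.1f} to {per_sec_high:.1f} workflows per second")
```

Both ends of the daily range land above one million per month, which is consistent with the "1 million+" bullet.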

Page 4

Scale and Performance

▪Database

› Convert CLOB columns to BLOB so data can be compressed and stored inline

› Remove unnecessary Hadoop configuration stored in protoActionConf

› Select only needed columns instead of loading whole row

› Partition tables by created time (in Oracle)
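
Two of the database ideas above, compressing large text into a BLOB and selecting only the columns a query needs, can be illustrated with a small sketch. It uses an in-memory SQLite database and zlib purely for demonstration. The table name `demo_actions`, its columns, and the sample configuration are hypothetical and are not Oozie's schema.

```python
import sqlite3
import zlib

# In-memory database with a hypothetical table; not Oozie's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_actions (id TEXT PRIMARY KEY, status TEXT, conf BLOB)"
)

# A repetitive XML configuration, the kind of text that compresses well.
conf_xml = (
    "<configuration>"
    + "<property><name>k</name><value>v</value></property>" * 200
    + "</configuration>"
)
compressed = zlib.compress(conf_xml.encode("utf-8"))
conn.execute(
    "INSERT INTO demo_actions VALUES (?, ?, ?)",
    ("action-1", "RUNNING", compressed),
)

# A status check reads only the small column, not the whole row.
(status,) = conn.execute(
    "SELECT status FROM demo_actions WHERE id = ?", ("action-1",)
).fetchone()

# The configuration is fetched and decompressed only when it is needed.
(blob,) = conn.execute(
    "SELECT conf FROM demo_actions WHERE id = ?", ("action-1",)
).fetchone()
restored = zlib.decompress(blob).decode("utf-8")

print(status, len(conf_xml.encode("utf-8")), len(compressed))
```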

▪Other

› Huge improvements to materialization of coordinator actions

› Reduce Launcher overhead

• Merge the small files created per action into one sequence file

• Launcher libraries shipped only once to HDFS

• Uber mode launcher with Hadoop 2.x

› Synchronously execute commands without queueing to speed up action transition

› Automatically kill abandoned coordinator jobs

› gzip compression for the REST API
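
To see why gzip helps a REST API that returns long job listings, here is a minimal sketch that compresses a repetitive JSON payload with Python's standard library. The payload shape (`workflows`, `id`, `status`, `user`) is made up for illustration and is not the actual Oozie response format.

```python
import gzip
import json

# Hypothetical job listing; the field names are illustrative only.
jobs = [
    {"id": f"job-{i:07d}", "status": "SUCCEEDED", "user": "demo"}
    for i in range(1000)
]
raw = json.dumps({"workflows": jobs}).encode("utf-8")

# Compress the response body as a server would before sending it.
packed = gzip.compress(raw)

# The client recovers the identical bytes after decompression.
unpacked = gzip.decompress(packed)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
```

Listings like this repeat the same keys and values in every entry, which is the kind of input gzip compresses well.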

Page 5

Features - Usability

▪UI improvements

› Active Jobs, Custom Global Filters, Child Jobs for Pig/Hive actions

▪Faster log streaming with more filters

▪Updating coordinator definition on the fly

▪Rerun workflows without having to specify all properties again

▪Mark coordinator and actions as ignored

▪Sharelib Enhancements

› Update on the fly without failing jobs

› Command to list the available sharelibs

› Specify sharelib directories using a metafile instead of a single sharelib directory
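
The metafile idea above can be pictured as a mapping from a sharelib name to one or more directories. The sketch below parses such a mapping. The `name=dir1,dir2` format, the `parse_metafile` helper, and the sample paths are all assumptions made for illustration. The slide does not describe the real metafile format.

```python
def parse_metafile(text):
    """Parse hypothetical 'name=dir1,dir2' lines into a dict of lists."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        mapping[key.strip()] = [p.strip() for p in value.split(",") if p.strip()]
    return mapping


# Hypothetical entries; names and paths are illustrative only.
SAMPLE = """
# sharelib name = comma-separated directories
pig=/share/lib/pig-a,/share/lib/common
hive=/share/lib/hive-a
"""

sharelibs = parse_metafile(SAMPLE)
for name in sorted(sharelibs):
    print(name, "->", sharelibs[name])
```

With a mapping like this, each sharelib can point at its own set of directories rather than everything living under one root.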

Page 6

High Availability and Load Balancing

▪HCat integration

▪SLA

▪Sharelib

▪Server-server authentication

▪Distributed sequence
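
The "Distributed sequence" item concerns several servers drawing identifiers from one shared sequence so that no two servers hand out the same number. The sketch below simulates that in a single process: the `SharedSequence` class and its in-process lock are stand-ins for whatever shared coordination service a real deployment uses, and the ID format is invented for illustration.

```python
import threading


class SharedSequence:
    """Stand-in for a sequence held in a shared coordination service."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        # The lock makes increment-and-read atomic across "servers".
        with self._lock:
            self._value += 1
            return self._value


def server(name, seq, count, out):
    """Simulated server that creates `count` job IDs from the shared sequence."""
    for _ in range(count):
        out.append(f"{seq.next():07d}-{name}")


seq = SharedSequence()
ids = []
threads = [
    threading.Thread(target=server, args=(name, seq, 500, ids))
    for name in ("server-a", "server-b")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The numeric prefix of every ID is unique even though two servers issued them.
numbers = [job_id.split("-", 1)[0] for job_id in ids]
print(len(ids), len(set(numbers)))
```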

Page 7

Customer Asks

▪Coordinator dependency management

› Ability to view dependencies and rerun part of a pipeline

▪Better error handling and automatic retries

▪Ability to suspend or turn off SLA alerting

▪One-click launcher log viewing

▪Zero downtime

Page 8

Rohini Palaniswamy, Mona Chitnis, Michelle Chiang, Purshotam Shah, Ryota Egashira, Olga L. Natkovich

Yahoo Oozie Team