Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with Azkaban

Building a Self-Service Hadoop Platform at LinkedIn with Azkaban

Hadoop Summit 2014

David Z. Chen

Hadoop at LinkedIn

[Figures: Profile Page, Home Page]

Hadoop at LinkedIn

Evolution of Workflows

[Timeline: 2009, 2010, 2011, 2012, 2013]

Azkaban 1.0

– Run workflows
– Schedule jobs
– Job history
– Failure notification
– Easy-to-use web UI and visualizations

Azkaban 2.0

– Major re-architecting
– Separate executor and web servers
– User authentication
– Pluggable database drivers
  – H2
  – MySQL
– Brand new UI

Azkaban 2.0

– Jobtype plugins
  – Built-in type: command
  – Pluggable jobtypes: Java, Pig, Hive
  – Non-Hadoop jobtypes: Teradata, Voldemort
– Viewer plugins: extending the Azkaban UI for other tools
  – HDFS browser
  – Reportal
– LinkedIn-specific code as plugins

Azkaban 2.5

– UI overhauled using Bootstrap
– Embedded flows
– New self-service tools
  – Job Summary
  – Flow Summary
  – Pig Visualizer
– Jobtype-specific plugins
– HDFS viewer improvements
  – Display file schema in addition to content
  – Parquet file viewer
– And more

Who’s using Azkaban?

– Software engineers
– Data scientists
– Analysts
– Product managers

Azkaban Today

– Workflow manager and scheduler
– Integrated runtime environment
– Unified front-end for Hadoop tools

Good News! Success!

– 1000+ users
– Several clusters
– 2,500 flows executing per day
– 30,000 jobs executing per day

Bad News! Success

Creating and Running Workflows

Creating Workflows

– Add job “type” plugins
  – hadoopJava
  – Command
  – Pig
  – Hive
– Dependencies
  – Determine the dependency graph
– Parameter passing
  – Parameters can be passed to jobs

peanutbutter.job:

    type=pig
    creamy.level=4
    chunky.level=4
    ...

jelly.job:

    type=hadoopJava
    jelly.type=grape
    sugar=HFCS
    ...

bread.job:

    type=command
    bread.type=wheat
    dependencies=peanutbutter,jelly
    ...
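As a rough illustration of how a dependency graph can be determined from job files like these, the sketch below parses the `key=value` properties and orders the jobs so that each one runs after its dependencies. This is not Azkaban's implementation. The helper names (`parse_job`, `dependency_graph`, `execution_order`) are hypothetical, and the elided `...` lines from the slides are left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The three job files from the slides, keyed by file name without ".job".
# The elided "..." lines are omitted here.
JOB_FILES = {
    "peanutbutter": "type=pig\ncreamy.level=4\nchunky.level=4\n",
    "jelly": "type=hadoopJava\njelly.type=grape\nsugar=HFCS\n",
    "bread": "type=command\nbread.type=wheat\ndependencies=peanutbutter,jelly\n",
}


def parse_job(text):
    """Parse key=value lines into a dict, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def dependency_graph(job_files):
    """Map each job name to the set of job names it depends on."""
    graph = {}
    for name, text in job_files.items():
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return graph


def execution_order(job_files):
    """Return one valid run order in which dependencies come first."""
    sorter = TopologicalSorter(dependency_graph(job_files))
    return list(sorter.static_order())


order = execution_order(JOB_FILES)
print(order)  # "bread" is always last; the other two may come in either order
```

Because `peanutbutter` and `jelly` do not depend on each other, a scheduler is free to run them in parallel before `bread`.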

Embedded Flows

Embed a flow as a node in another flow.

– “flow” job type
– Set flow.name to the name of the embedded flow
– Parameters can be passed to the flow

[Diagram labels: peanutbutter, jelly]

sandwich.job:

    type=flow
    flow.name=bread
    dependencies=coffee,fruit

coffee.job:

    type=hive
    coffee.decaf=false
    coffee.cream=true
    ...

fruit.job:

    type=hadoopJava
    fruit.type=apple
    ...
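The embedding above can be pictured with a small sketch that expands a flow into the full list of jobs it would run. It makes two assumptions that the slides do not spell out: a flow is identified by the name of its final job (so `flow.name=bread` refers to the flow ending at `bread`), and jobs inside an embedded flow are labeled with a hypothetical `node:` prefix for display. The function names are illustrative only.

```python
# Job properties from the slides, reduced to the keys this sketch needs.
JOBS = {
    "peanutbutter": {"type": "pig"},
    "jelly": {"type": "hadoopJava"},
    "bread": {"type": "command", "dependencies": "peanutbutter,jelly"},
    "coffee": {"type": "hive"},
    "fruit": {"type": "hadoopJava"},
    "sandwich": {"type": "flow", "flow.name": "bread",
                 "dependencies": "coffee,fruit"},
}


def deps_of(job):
    """Return the list of direct dependencies declared by a job."""
    return [d for d in JOBS[job].get("dependencies", "").split(",") if d]


def expand(end_job, prefix=""):
    """List every job in the flow ending at end_job, dependencies first.

    A node of type "flow" is expanded in place: the jobs of the embedded
    flow are listed just before the node itself, prefixed with the node's
    label (a naming choice made for this sketch). Assumes no cycles.
    """
    ordered = []

    def visit(job):
        label = prefix + job
        if label in ordered:
            return
        for dep in deps_of(job):
            visit(dep)
        if JOBS[job]["type"] == "flow":
            ordered.extend(expand(JOBS[job]["flow.name"], prefix=label + ":"))
        ordered.append(label)

    visit(end_job)
    return ordered


print(expand("sandwich"))
# ['coffee', 'fruit', 'sandwich:peanutbutter', 'sandwich:jelly',
#  'sandwich:bread', 'sandwich']
```

The embedded `bread` flow stays reusable: `expand("bread")` still describes it as a standalone flow.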

Project Management: Project Page

Running Workflows: Flow Execution Panel

Running Workflows: Notification Options

Running Workflows: Failure Options

– Finish Current: finishes the currently running jobs, then stops
– Cancel All: kills all running jobs and finishes immediately
– Finish Possible: finishes every job whose dependencies have been met, then fails the flow
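One way to read the three failure options is as a rule for which jobs may still complete once a job has failed. The sketch below encodes that reading for a small hypothetical flow (`a` to `e`). It is an interpretation of the slide text, not Azkaban's scheduler logic, and the policy names and function names are invented for the example.

```python
# Hypothetical flow: c depends on a, d depends on b, e depends on c and d.
GRAPH = {
    "a": set(),
    "b": set(),
    "c": {"a"},
    "d": {"b"},
    "e": {"c", "d"},
}


def ancestors(graph, job):
    """Return all direct and transitive dependencies of a job."""
    found = set()
    stack = list(graph[job])
    while stack:
        dep = stack.pop()
        if dep not in found:
            found.add(dep)
            stack.extend(graph[dep])
    return found


def may_still_complete(graph, failed, running, finished, policy):
    """Return the jobs allowed to complete after `failed` has failed."""
    if policy == "cancel_all":
        # Running jobs are killed, so nothing else completes.
        return set()
    if policy == "finish_current":
        # Running jobs are left to finish; no new jobs are started.
        return set(running)
    if policy == "finish_possible":
        # Any unfinished job that does not depend on the failed job can run.
        remaining = set(graph) - set(finished) - {failed}
        return {j for j in remaining if failed not in ancestors(graph, j)}
    raise ValueError(f"unknown policy: {policy}")


# Job "a" fails while "b" is still running and nothing has finished yet.
for policy in ("finish_current", "cancel_all", "finish_possible"):
    result = may_still_complete(GRAPH, "a", {"b"}, set(), policy)
    print(policy, sorted(result))
```

In this example only Finish Possible lets `d` run, because `d` depends on `b` alone, while `c` and `e` are blocked by the failure of `a`.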


Running Workflows: Flow Parameters

Running Workflows: Concurrent Execution Options

– Skip Execution: prevents concurrent executions
– Run Concurrently: runs the flow concurrently
– Pipeline:
  – Distance 1: jobA waits until the concurrent execution's jobA finishes
  – Distance 2: jobA waits until the children of the concurrent execution's jobA finish
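The concurrency options can be sketched as two small decisions: what happens to a newly triggered execution, and, for pipelining, which jobs of the earlier execution a job has to wait for. The option strings and function names below are hypothetical, and the slides do not say what Distance 2 means for a job with no children, so the sketch leaves that case as an empty set.

```python
# The bread flow from the earlier slides: job name -> direct dependencies.
GRAPH = {
    "peanutbutter": set(),
    "jelly": set(),
    "bread": {"peanutbutter", "jelly"},
}


def on_trigger(option, flow_already_running):
    """Decide what happens to a newly triggered execution of a flow."""
    if not flow_already_running:
        return "run"
    if option == "skip":
        return "skip"
    if option == "concurrent":
        return "run"
    if option == "pipeline":
        return "run, gated on the earlier execution"
    raise ValueError(f"unknown option: {option}")


def pipeline_blockers(graph, job, distance):
    """Jobs of the earlier execution that must finish before `job` starts.

    Distance 1 waits for the same job; distance 2 waits for that job's
    children (the jobs that list it as a direct dependency).
    """
    if distance == 1:
        return {job}
    if distance == 2:
        return {child for child, deps in graph.items() if job in deps}
    raise ValueError("distance must be 1 or 2")


print(on_trigger("skip", flow_already_running=True))
print(pipeline_blockers(GRAPH, "peanutbutter", 1))
print(pipeline_blockers(GRAPH, "peanutbutter", 2))
```

With Distance 2, the new execution's `peanutbutter` does not start until the earlier execution's `bread` has finished, which keeps the two runs from overlapping on adjacent stages.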


Running Workflows: Executing Flow Page

Running Workflows: Flow Job List

Scheduling Workflows

Scheduling Workflows: Schedule Flow Panel

Scheduling Workflows: Scheduled Flows

Scheduling Flows: Setting SLAs

Debugging and Tuning

Hadoop at LinkedIn

Job Execution History

Flow Execution History

Running Workflows: Job Logs

Job Summary

Pig Visualizer

Flow Summary

Browsing HDFS

HDFS Viewer: Browsing Files

HDFS Viewer: Viewing Files

HDFS Viewer: File Schema

HDFS Viewer: Supported File Types

– Avro
– Parquet
– Binary JSON
– Sequence File
– Image
– Text

Reportal

Reportal: Dashboard

Reportal: New Report

Reportal: Viewing Results

Reportal: Supported Query Types

– Pig
– Hive
– Teradata

Upcoming Features

Azkaban Gradle Plugin and DSL

– Describe Azkaban flows and deploy with Gradle
– Single file (more if you want) to describe all your workflows
  – Compiles to .job files
– Static checker
– Valid Groovy code
  – Add conditionals for deployment to different clusters

    azkaban {
      jobConfDir = './jobs'
      workflow('workflow2') {
        pigJob('job2') {
          script = 'src/main/pig/count-by-country.job'
          parameter 'inputFile', '/user/foo/sample'
          reads '/data/databases/foo', [as: 'input']
          writes '/data/databases/bar', [as: 'output']
        }

        hiveJob('job3') {
          query = 'show tables'
        }

        workflowDepends 'job2', 'job3'
      }
    }

Future Roadmap

– New visualizers (Hive, Tez, etc.)
– Support DSL from other tools
– Operationalization tooling
– Scalability improvements
– Improved plugin interfaces

Future Discussions

– Conditional branching
– Hive Metastore browser
– Pluggable executors (e.g. YARN)
– Persistence storage server
– Launching and monitoring long-running YARN applications (Samza, Storm,

Main Contributors

– David Chen (LinkedIn)
– Hien Luu (LinkedIn)
– Anthony Hsu (LinkedIn)
– Alex Bain (LinkedIn)
– Richard Park (RelateIQ)
– Chenjie Yu (Tango)
– Shida Li (University of Waterloo)

How to Contribute

Website: azkaban.github.io

GitHub: github.com/azkaban

LinkedIn’s Data Website: data.linkedin.com
