Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with Azkaban

Post on 27-Jan-2015

DESCRIPTION

Hadoop forms the core of LinkedIn’s data analytics infrastructure and runs a vast array of our data products, including People You May Know, Endorsements, and Recommendations. To schedule and run the Hadoop workflows that drive our data products, we rely on Azkaban, an open-source workflow manager developed and used at LinkedIn since 2009. Azkaban is designed to be scalable, reliable, and extensible, and features a beautiful and intuitive UI. Over the years, we have seen tremendous growth in both the scale of our data and our Hadoop user base, which includes over a thousand developers, data scientists, and analysts. We evolved Azkaban not only to meet the demands of this scale, but also to support query platforms including Pig and Hive while remaining an easy-to-use, self-service platform. In this talk, we discuss how Azkaban’s monitoring and visualization features allow our users to quickly and easily develop, profile, and tune their Hadoop workflows.

TRANSCRIPT

Data Analytics Infrastructure©2013 LinkedIn Corporation. All Rights Reserved.

Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
Hadoop Summit 2014

David Z. Chen

Hadoop at LinkedIn

Profile Page, Home Page

Evolution of Workflows

Timeline: 2009, 2010, 2011, 2012, 2013

Azkaban 1.0

Run workflows
Schedule jobs
Job History
Failure notification
Easy to use web UI and visualizations

Azkaban 2.0

Major re-architecting
Separate executor and web servers
User authentication
Pluggable database drivers
– H2
– MySQL
Brand new UI

Jobtype plugins
– Built-in type: command
– Pluggable jobtypes: Java, Pig, Hive
– Non-Hadoop jobtypes: Teradata, Voldemort
Viewer plugins – extending the Azkaban UI for other tools
– HDFS browser
– Reportal
LinkedIn-specific code as plugins

Azkaban 2.5

UI overhauled using Bootstrap
Embedded flows
New self-service tools
– Job Summary
– Flow Summary
– Pig Visualizer
Jobtype-specific plugins
HDFS viewer improvements
– Display file schema in addition to content
– Parquet file viewer
And more

Who’s using Azkaban?

Software Engineers
Data Scientists
Analysts
Product Managers

Azkaban Today

Workflow manager and scheduler
Integrated runtime environment
Unified front-end for Hadoop tools

Good News! Success!

1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day

Bad News! Success

1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day

Creating and Running Workflows

Creating Workflows

Add job “type” plugins
– hadoopJava
– Command
– Pig
– Hive
Dependencies
– Determine the dependency graph
Parameter passing
– Parameters can be passed to job

peanutbutter.job
  type=pig
  creamy.level=4
  chunky.level=4
  ...

jelly.job
  type=hadoopJava
  jelly.type=grape
  sugar=HFCS
  ...

bread.job
  type=command
  bread.type=wheat
  dependencies=peanutbutter,jelly
  ...
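The three job files above can be read as a dependency graph. The following is a minimal Python sketch of that idea, not Azkaban's own implementation. It assumes each .job file is a flat key=value properties file and that `dependencies` is a comma-separated list of job names, as bread.job suggests. The helper names (`parse_job`, `dependency_graph`, `execution_order`) are hypothetical, and the elided "..." lines are left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Job definitions copied from the slide, minus the elided "..." lines.
JOB_FILES = {
    "peanutbutter": "type=pig\ncreamy.level=4\nchunky.level=4",
    "jelly": "type=hadoopJava\njelly.type=grape\nsugar=HFCS",
    "bread": "type=command\nbread.type=wheat\ndependencies=peanutbutter,jelly",
}


def parse_job(text):
    """Parse key=value lines into a dict (hypothetical helper, not Azkaban code)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def dependency_graph(job_files):
    """Map each job name to the set of jobs it depends on."""
    graph = {}
    for name, text in job_files.items():
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return graph


def execution_order(job_files):
    """Return one valid order in which the jobs could run."""
    return list(TopologicalSorter(dependency_graph(job_files)).static_order())


order = execution_order(JOB_FILES)
print(order)
```

In any valid order bread comes last, because it depends on both peanutbutter and jelly; the relative order of those two is not constrained.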

Embedded Flows

Embed a flow as a node in another flow.
– “flow” job type
– Set flow.name to name of the embedded flow
– Parameters can be passed to flow

Embedded flow “bread”: peanutbutter, jelly, bread

sandwich.job
  type=flow
  flow.name=bread
  dependencies=coffee,fruit

coffee.job
  type=hive
  coffee.decaf=false
  coffee.cream=true
  ...

fruit.job
  type=hadoopJava
  fruit.type=apple
  ...
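To make the nesting concrete, here is a small Python sketch that expands the sandwich flow into its jobs and recurses into any node of type flow. It rests on two assumptions that the slides do not state: a flow is identified by the name of its final job, and it contains every job that final job transitively depends on. The job dictionaries mirror the slides with the elided "..." lines omitted, and the function names are hypothetical.

```python
# Job properties from the two slides, as dicts (elided "..." lines omitted).
JOBS = {
    "peanutbutter": {"type": "pig"},
    "jelly": {"type": "hadoopJava"},
    "bread": {"type": "command", "dependencies": "peanutbutter,jelly"},
    "coffee": {"type": "hive"},
    "fruit": {"type": "hadoopJava"},
    "sandwich": {"type": "flow", "flow.name": "bread", "dependencies": "coffee,fruit"},
}


def deps_of(job):
    """Split the comma-separated dependencies property into a list of names."""
    return [d for d in job.get("dependencies", "").split(",") if d]


def flow_members(flow_name, jobs):
    """Collect the end job of a flow plus everything it transitively depends on."""
    members, stack = [], [flow_name]
    while stack:
        name = stack.pop()
        if name not in members:
            members.append(name)
            stack.extend(deps_of(jobs[name]))
    return sorted(members)


def expand(flow_name, jobs):
    """Describe a flow as {job: embedded flow or None}, recursing into type=flow nodes."""
    tree = {}
    for name in flow_members(flow_name, jobs):
        job = jobs[name]
        if job["type"] == "flow":
            tree[name] = expand(job["flow.name"], jobs)
        else:
            tree[name] = None
    return tree


tree = expand("sandwich", JOBS)
print(tree)
```

The sandwich node carries the whole bread flow inside it, while coffee and fruit remain ordinary jobs of the outer flow.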

Project Management: Project Page

Running Workflows: Flow Execution Panel

Running Workflows: Notification Options

Running Workflows: Failure Options

Finish Current
– Finishes the currently running jobs, then stops

Cancel All
– Kills all running jobs and ends the flow immediately

Finish Possible
– Keeps running every job whose dependencies have been met, then fails the flow
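The difference between the three failure options is easiest to see on a small graph. The sketch below is an illustrative Python model, not Azkaban code. The option identifiers (`cancel_all`, `finish_current`, `finish_possible`), the function names, and the example jobs a through e are all assumptions made for this example. It answers one question: after a job fails, which running jobs may continue, and which pending jobs may still start?

```python
def downstream(failed, deps):
    """All jobs that directly or transitively depend on the failed job."""
    hit = set()
    changed = True
    while changed:
        changed = False
        for job, parents in deps.items():
            if job not in hit and (failed in parents or hit & parents):
                hit.add(job)
                changed = True
    return hit


def after_failure(option, failed, running, pending, deps):
    """Return (jobs allowed to keep running, pending jobs that may still start)."""
    if option == "cancel_all":
        # Everything is killed; nothing new starts.
        return set(), set()
    if option == "finish_current":
        # Running jobs finish; nothing new starts.
        return set(running), set()
    if option == "finish_possible":
        # Running jobs finish; pending jobs start unless the failure blocks them.
        return set(running), set(pending) - downstream(failed, deps)
    raise ValueError(f"unknown option: {option}")


# Hypothetical flow: c needs a, d needs b, e needs both c and d.
DEPS = {"a": set(), "b": set(), "c": {"a"}, "d": {"b"}, "e": {"c", "d"}}

# Job a fails while b is still running; c, d and e have not started yet.
for option in ("cancel_all", "finish_current", "finish_possible"):
    print(option, after_failure(option, "a", {"b"}, {"c", "d", "e"}, DEPS))
```

Under this model only the third option lets d start, since d does not depend on the failed job; c and e stay blocked under every option.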

Running Workflows: Flow Parameters

Running Workflows: Concurrent Execution Options

Skip Execution
– Prevent concurrent executions

Run Concurrently
– Run the flow concurrently with the execution already in progress

Pipeline
– Distance 1: jobA waits until jobA in the concurrent execution finishes
– Distance 2: jobA waits until the children of jobA in the concurrent execution finish
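The options above can be modelled as two small decisions: whether a new execution may begin at all, and, for pipelining, which jobs of the earlier execution a given job must wait for. This Python sketch is an interpretation of the slide, not Azkaban's implementation. The option strings and function names are hypothetical, and distance 2 is assumed to mean "jobA and its direct children".

```python
def may_start_execution(option, flow_already_running):
    """Decide whether a new execution of a flow may begin at all."""
    if option == "skip":
        return not flow_already_running
    if option in ("concurrent", "pipeline"):
        return True
    raise ValueError(f"unknown option: {option}")


def pipeline_blockers(job, distance, children):
    """Jobs in the earlier execution that must finish before `job` may start."""
    if distance == 1:
        return {job}
    if distance == 2:
        # Assumption: waiting for the children implies waiting for the job itself.
        return {job} | set(children.get(job, ()))
    raise ValueError(f"unsupported distance: {distance}")


def job_may_start(job, distance, children, finished_in_previous):
    """True once every blocking job of the earlier execution has finished."""
    return pipeline_blockers(job, distance, children) <= set(finished_in_previous)


# Hypothetical flow in which jobA has two children.
CHILDREN = {"jobA": ["jobB", "jobC"]}

print(may_start_execution("skip", flow_already_running=True))
print(job_may_start("jobA", 1, CHILDREN, {"jobA"}))
print(job_may_start("jobA", 2, CHILDREN, {"jobA"}))
print(job_may_start("jobA", 2, CHILDREN, {"jobA", "jobB", "jobC"}))
```

A larger distance keeps the two executions further apart, at the cost of the second one starting later.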

Running Workflows: Executing Flow Page

Running Workflows: Flow Job List

Scheduling Workflows

Scheduling Workflows: Schedule Flow Panel

Scheduling Workflows: Scheduled Flows

Scheduling Flows: Setting SLAs

Debugging and Tuning

Hadoop at LinkedIn

1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day

Job Execution History

Flow Execution History

Running Workflows: Job Logs

Job Summary

Pig Visualizer

Flow Summary

Browsing HDFS

HDFS Viewer: Browsing Files

HDFS Viewer: Viewing Files

HDFS Viewer: File Schema

HDFS Viewer: Supported File Types

Avro Parquet Binary JSON Sequence File Image Text

Reportal

Reportal: Dashboard

Reportal: New Report

Reportal: Viewing Results

Reportal: Supported Query Types

Pig
Hive
Teradata

Upcoming Features

Azkaban Gradle Plugin and DSL

Describe Azkaban flow and deploy with Gradle
Single file (more if you want) to describe all your workflows
– Compiles to .job files
Static checker
Valid Groovy code
– Add conditionals for deployment to different clusters

azkaban {
  jobConfDir = './jobs'
  workflow('workflow2') {
    pigJob('job2') {
      script = 'src/main/pig/count-by-country.job'
      parameter 'inputFile', '/user/foo/sample'
      reads '/data/databases/foo', [as: 'input']
      writes '/data/databases/bar', [as: 'output']
    }

    hiveJob('job3') {
      query = 'show tables'
    }

    workflowDepends 'job2', 'job3'
  }
}

Future Roadmap

New visualizers (Hive, Tez, etc.)
Support DSL from other tools
Operationalization tooling
Scalability improvements
Improved plugin interfaces

Future Discussions

Conditional branching
Hive Metastore browser
Pluggable executors (e.g. YARN)
Persistence storage server
Launching and monitoring long-running YARN applications (Samza, Storm, etc.)

Main Contributors

David Chen (LinkedIn)
Hien Luu (LinkedIn)
Anthony Hsu (LinkedIn)
Alex Bain (LinkedIn)
Richard Park (RelateIQ)
Chenjie Yu (Tango)
Shida Li (University of Waterloo)

How to Contribute

Website: azkaban.github.io

GitHub: github.com/azkaban

LinkedIn’s Data Website: data.linkedin.com
