Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with Azkaban

Data Analytics Infrastructure ©2013 LinkedIn Corporation. All Rights Reserved. Building a Self-Service Hadoop Platform at LinkedIn with Azkaban Hadoop Summit 2014 David Z. Chen

Upload: david-chen

Post on 27-Jan-2015


DESCRIPTION

Hadoop forms the core of LinkedIn’s data analytics infrastructure and runs a vast array of our data products, including People You May Know, Endorsements, and Recommendations. To schedule and run the Hadoop workflows that drive our data products, we rely on Azkaban, an open-source workflow manager developed and used at LinkedIn since 2009. Azkaban is designed to be scalable, reliable, and extensible, and features a beautiful and intuitive UI. Over the years, we have seen tremendous growth in both the scale of our data and our Hadoop user base, which includes over a thousand developers, data scientists, and analysts. We evolved Azkaban not only to meet the demands of this scale, but also to support query platforms including Pig and Hive while remaining an easy-to-use, self-service platform. In this talk, we discuss how Azkaban’s monitoring and visualization features allow our users to quickly and easily develop, profile, and tune their Hadoop workflows.

TRANSCRIPT

Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
Hadoop Summit 2014

David Z. Chen


Hadoop at LinkedIn

[Figure labels: Profile Page, Home Page]

Evolution of Workflows

[Timeline: 2009, 2010, 2011, 2012, 2013]

Azkaban 1.0

Run workflows
Schedule jobs
Job history
Failure notification
Easy-to-use web UI and visualizations

Azkaban 2.0

Major re-architecting
Separate executor and web servers
User authentication
Pluggable database drivers
  – H2
  – MySQL
Brand new UI

Azkaban 2.0

Jobtype plugins
  – Built-in type: command
  – Pluggable jobtypes: Java, Pig, Hive
  – Non-Hadoop jobtypes: Teradata, Voldemort
Viewer plugins: extending the Azkaban UI for other tools
  – HDFS browser
  – Reportal
LinkedIn-specific code as plugins

Azkaban 2.5

UI overhauled using Bootstrap
Embedded flows
New self-service tools
  – Job Summary
  – Flow Summary
  – Pig Visualizer
Jobtype-specific plugins
HDFS viewer improvements
  – Display file schema in addition to content
  – Parquet file viewer
And more

Who’s using Azkaban?

Software Engineers
Data Scientists
Analysts
Product Managers

Azkaban Today

Workflow manager and scheduler
Integrated runtime environment
Unified front-end for Hadoop tools

Good News! Success!

1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day

Bad News! Success

1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day

Creating and Running Workflows

Creating Workflows

Add job “type” plugins
  – hadoopJava
  – Command
  – Pig
  – Hive
Dependencies
  – Determine the dependency graph
Parameter passing
  – Parameters can be passed to job

peanutbutter.job
    type=pig
    creamy.level=4
    chunky.level=4
    ...

jelly.job
    type=hadoopJava
    jelly.type=grape
    sugar=HFCS
    ...

bread.job
    type=command
    bread.type=wheat
    dependencies=peanutbutter,jelly
    ...
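The dependency graph for the peanutbutter, jelly, and bread jobs can be derived from the `dependencies` property alone. The following is a minimal Python sketch of that idea, not Azkaban's actual implementation: the helper names `parse_job` and `dependency_graph` are hypothetical, the elided "..." lines are left out, and the ordering relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Contents of the three .job files from the slide (elided "..." lines omitted).
JOB_FILES = {
    "peanutbutter": "type=pig\ncreamy.level=4\nchunky.level=4\n",
    "jelly": "type=hadoopJava\njelly.type=grape\nsugar=HFCS\n",
    "bread": "type=command\nbread.type=wheat\ndependencies=peanutbutter,jelly\n",
}


def parse_job(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def dependency_graph(job_files):
    """Map each job name to the set of job names it depends on."""
    graph = {}
    for name, text in job_files.items():
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return graph


graph = dependency_graph(JOB_FILES)
# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

In this sketch bread always comes last, because it is the only job that declares dependencies; the relative order of peanutbutter and jelly is not significant.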

Embedded Flows

Embed a flow as a node in another flow.
  – “flow” job type
  – Set flow.name to name of the embedded flow
  – Parameters can be passed to flow

[Diagram: embedded flow with nodes peanutbutter, jelly, and bread]

coffee.job
    type=hive
    coffee.decaf=false
    coffee.cream=true
    ...

fruit.job
    type=hadoopJava
    fruit.type=apple
    ...

sandwich.job
    type=flow
    flow.name=bread
    dependencies=coffee,fruit
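One way to picture an embedded flow is to flatten it into a single dependency graph. The Python sketch below does that for the sandwich example under stated assumptions: a flow is identified by its final job (so `flow.name=bread` means bread plus everything it depends on), inner jobs are given a hypothetical `outer:inner` name, and inner jobs with no dependencies of their own start once the embedding node's dependencies finish. It illustrates the concept and is not how Azkaban itself represents flows.

```python
# Job properties from the slides, reduced to the keys that matter here.
JOBS = {
    "peanutbutter": {"type": "pig"},
    "jelly": {"type": "hadoopJava"},
    "bread": {"type": "command", "dependencies": "peanutbutter,jelly"},
    "coffee": {"type": "hive"},
    "fruit": {"type": "hadoopJava"},
    "sandwich": {"type": "flow", "flow.name": "bread",
                 "dependencies": "coffee,fruit"},
}


def deps_of(job):
    """Return the list of direct dependencies declared by a job."""
    return [d for d in job.get("dependencies", "").split(",") if d]


def flow_members(jobs, end_job):
    """Return the end job plus everything it transitively depends on."""
    seen, stack = set(), [end_job]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(deps_of(jobs[name]))
    return seen


def expand(jobs, end_job, prefix=""):
    """Flatten a flow into {qualified job name: set of qualified dependencies}."""
    flat = {}
    for name in flow_members(jobs, end_job):
        job = jobs[name]
        node = prefix + name
        outer_deps = {prefix + d for d in deps_of(job)}
        if job["type"] == "flow":
            inner_prefix = node + ":"
            inner = expand(jobs, job["flow.name"], inner_prefix)
            # Inner jobs without dependencies wait for the embedding node's dependencies.
            for inner_name, inner_deps in inner.items():
                flat[inner_name] = inner_deps or set(outer_deps)
            # The embedding node completes when the inner flow's end job completes.
            flat[node] = {inner_prefix + job["flow.name"]}
        else:
            flat[node] = outer_deps
    return flat


flat = expand(JOBS, "sandwich")
for name in sorted(flat):
    print(name, "<-", sorted(flat[name]))
```

Here `sandwich:peanutbutter` and `sandwich:jelly` both wait on coffee and fruit, and `sandwich` is done once `sandwich:bread` is done.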

Project Management: Project Page

Running Workflows: Flow Execution Panel

Running Workflows: Notification Options

Running Workflows: Failure Options

Finish Current
  – Finishes currently running jobs, then stops
Cancel All
  – Kills all running jobs and finishes immediately
Finish Possible
  – Finishes every job whose dependencies have been met, then fails the flow
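The difference between the three failure options is easiest to see on a small graph. The Python sketch below is a simplified model, not Azkaban's scheduler: the function name, the enum, and the example jobs a to f are all hypothetical, and it assumes no further failures occur after the first one.

```python
from enum import Enum


class FailureAction(Enum):
    FINISH_CURRENT = "finish current"
    CANCEL_ALL = "cancel all"
    FINISH_POSSIBLE = "finish possible"


def jobs_allowed_to_finish(graph, succeeded, running, failed, action):
    """Return the jobs that may still complete after `failed` has failed.

    graph maps each job to the set of jobs it depends on.
    """
    if action is FailureAction.CANCEL_ALL:
        return set()             # running jobs are killed
    if action is FailureAction.FINISH_CURRENT:
        return set(running)      # running jobs finish, nothing new starts
    # FINISH_POSSIBLE: keep starting jobs whose dependencies have all completed.
    done = set(succeeded) | set(running)
    allowed = set(running)
    changed = True
    while changed:
        changed = False
        for job, deps in graph.items():
            if job not in done and job != failed and deps <= done:
                done.add(job)
                allowed.add(job)
                changed = True
    return allowed


# Hypothetical flow: b and c depend on a; d on b; e on c; f on d and e.
GRAPH = {"a": set(), "b": {"a"}, "c": {"a"},
         "d": {"b"}, "e": {"c"}, "f": {"d", "e"}}

for action in FailureAction:
    result = jobs_allowed_to_finish(GRAPH, {"a"}, {"c"}, "b", action)
    print(action.value, sorted(result))
```

With a succeeded, c running, and b failed, only Finish Possible lets e run, because e depends solely on c; d and f are blocked by the failed job under every option.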

Running Workflows: Flow Parameters

Running Workflows: Concurrent Execution Options

Skip Executions
  – Prevent concurrent executions
Run Concurrently
  – Concurrently run the flow
Pipeline
  – Distance 1: jobA waits until concurrent jobA finishes
  – Distance 2: jobA waits until concurrent jobA’s children finish
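The two pipeline distances can be expressed as a rule for which jobs of the earlier, still-running execution a job has to wait for. The Python sketch below models only that rule; the function names and the example graph are hypothetical, and the fallback for a job with no children (wait on the job itself) is an assumption that the slide does not cover.

```python
def children_of(graph, job):
    """Jobs that list `job` as a direct dependency."""
    return {name for name, deps in graph.items() if job in deps}


def pipeline_blockers(graph, job, distance):
    """Jobs of the earlier execution that `job` must wait for."""
    if distance == 1:
        return {job}
    if distance == 2:
        # Assumption: a job without children falls back to waiting on itself.
        return children_of(graph, job) or {job}
    raise ValueError("only pipeline distances 1 and 2 are described")


def can_start(graph, job, distance, previous_finished):
    """True once every blocking job of the earlier execution has finished."""
    return pipeline_blockers(graph, job, distance) <= set(previous_finished)


# Hypothetical flow: jobB and jobC both depend on jobA.
GRAPH = {"jobA": set(), "jobB": {"jobA"}, "jobC": {"jobA"}}

print(pipeline_blockers(GRAPH, "jobA", 1))
print(sorted(pipeline_blockers(GRAPH, "jobA", 2)))
print(can_start(GRAPH, "jobA", 2, {"jobA", "jobB"}))
```

At distance 1 a new jobA can start as soon as the earlier jobA is done; at distance 2 it also has to wait for the earlier jobB and jobC.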

Running Workflows: Executing Flow Page

Running Workflows: Flow Job List

Scheduling Workflows

Scheduling Workflows: Schedule Flow Panel

Scheduling Workflows: Scheduled Flows

Scheduling Flows: Setting SLAs

Debugging and Tuning

Hadoop at LinkedIn

1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day

Job Execution History

Flow Execution History

Running Workflows: Job Logs

Job Summary

Pig Visualizer

Flow Summary

Browsing HDFS

HDFS Viewer: Browsing Files

HDFS Viewer: Viewing Files

HDFS Viewer: File Schema

HDFS Viewer: Supported File Types

Avro
Parquet
Binary JSON
Sequence File
Image
Text

Reportal

Reportal: Dashboard

Reportal: New Report

Reportal: Viewing Results

Reportal: Supported Query Types

Pig
Hive
Teradata

Upcoming Features

Azkaban Gradle Plugin and DSL

Describe Azkaban flow and deploy with Gradle
Single file (more if you want) to describe all your workflows
  – Compiles to .job files
Static checker
Valid Groovy code
  – Add conditionals for deployment to different clusters

    azkaban {
      jobConfDir = './jobs'
      workflow('workflow2') {
        pigJob('job2') {
          script = 'src/main/pig/count-by-country.job'
          parameter 'inputFile', '/user/foo/sample'
          reads '/data/databases/foo', [as: 'input']
          writes '/data/databases/bar', [as: 'output']
        }

        hiveJob('job3') {
          query = 'show tables'
        }

        workflowDepends 'job2', 'job3'
      }
    }

Future Roadmap

New visualizers (Hive, Tez, etc.)
Support DSL from other tools
Operationalization tooling
Scalability improvements
Improved plugin interfaces

Future Discussions

Conditional branching
Hive Metastore browser
Pluggable executors (e.g. YARN)
Persistence storage server
Launching and monitoring long-running YARN applications (Samza, Storm, etc.)

Main Contributors

David Chen (LinkedIn)
Hien Luu (LinkedIn)
Anthony Hsu (LinkedIn)
Alex Bain (LinkedIn)
Richard Park (RelateIQ)
Chenjie Yu (Tango)
Shida Li (University of Waterloo)

How to Contribute

Website: azkaban.github.io
GitHub: github.com/azkaban
LinkedIn’s Data Website: data.linkedin.com
