luigi presentation

Post on 01-Dec-2015

DESCRIPTION

OANYC Summit

TRANSCRIPT

Data Workflows at Foursquare using Luigi

Foursquare

• 35 million users

• Nearly 4 billion check-ins

• More than 5 million check-ins per day

• 50 million point-of-interest database

• Hundreds of GB of log data per day

Tools We Use

• Hive
  o Ad hoc analytics, data dumping ground

• Raw MapReduce
  o Hundreds of MapReduce jobs in our codebase

• Pig
  o Fits between structured Hive and free-form MapReduce

• Vertica
  o Low latency analytics

Cron

E.g.

0 0 * * * ./hadoop-script-1.sh

# Wait two hours for that job to finish...

0 2 * * * ./hadoop-script-2.sh

# And on and on and on

Cron - Problems

• Brittle

• Hard to reason about / visualize

• Spend a lot of time waiting

• Difficult to tell what succeeded or failed

• No one likes writing Bash scripts

Oozie

XML-based Workflow Engine, with support for Hadoop, Hive, and Pig

Workflows specify computations as a DAG, e.g., "Run this Hive query, then run these two MapReduce jobs in parallel"

Coordinators launch recurring workflows at a given frequency, when dependent data is available
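To make the DAG idea concrete, here is a minimal sketch in plain Python of the quoted example (one Hive query followed by two MapReduce jobs that can run in parallel). It uses the standard-library graphlib module, not Oozie, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it must wait for.
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Group steps into waves; steps in the same wave may run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished, unlocking successors
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(workflow), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

The Hive query lands in the first wave by itself, and both MapReduce jobs land together in the second, which is the "then run these two in parallel" shape.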

Oozie - Example

Oozie - Problems

• Workflows are all-or-nothing
  o Cannot just run the step that failed
  o Very little code reuse

• Little to no extensibility

• Limited control flow

• Extremely verbose

• Difficult to test

• No one likes writing XML

Luigi

• Python framework for batch processing jobs

• Created by Spotify, open-sourced Sept. 2012

• Tasks are units of work that produce Targets

• Tasks can depend on one or more other Tasks

• A Task is only run once all of the Tasks it depends on are done

• Tasks are idempotent
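The Task and Target ideas above can be sketched as a toy model in plain Python. This is not Luigi's implementation. The method names requires, output, run, and exists echo Luigi's vocabulary, while FileTarget, build, RawLogs, and the file names are hypothetical. The sketch assumes a Task counts as done when its output file exists.

```python
import os
import tempfile
from collections import Counter


class FileTarget:
    """A Target backed by a local file: it exists once the file is written."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Toy base class: a unit of work that produces one Target."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # Tasks that must be done before this one runs

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run `task` after its dependencies, skipping work that is already done."""
    if task.output().exists():
        return  # output is already there, so re-running is a no-op
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


class RawLogs(Task):
    def output(self):
        return FileTarget(os.path.join(self.workdir, "raw.txt"))

    def run(self):
        with open(self.output().path, "w") as handle:
            handle.write("checkin tip checkin\n")


class WordCount(Task):
    def requires(self):
        return [RawLogs(self.workdir)]

    def output(self):
        return FileTarget(os.path.join(self.workdir, "counts.txt"))

    def run(self):
        with open(self.requires()[0].output().path) as handle:
            counts = Counter(handle.read().split())
        with open(self.output().path, "w") as handle:
            for word, count in sorted(counts.items()):
                handle.write(f"{word}\t{count}\n")


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    log = []
    build(WordCount(workdir), log)  # runs RawLogs, then WordCount
    build(WordCount(workdir), log)  # does nothing: counts.txt already exists
    print(log)
```

Because completion is decided by whether the Target exists, a second build of the same Task does no work, and a failed run can be resumed from the first missing output.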

Luigi - Example Task

Luigi - Running the Task

$ python word-count.py WordCount --date 2013-06-01

Luigi - Scheduler

Central scheduler ensures each Task is only run by a single worker.

A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)

Will retry failed Tasks after a configured timeout

Emails someone when a Task fails
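The identity rule above (class name plus Parameters) is what lets a scheduler hand each Task to only one worker. A minimal sketch of that idea in plain Python follows. ToyScheduler, claim, and the worker names are hypothetical and do not reflect Luigi's actual scheduler code.

```python
def task_id(class_name, params):
    """Render an identifier such as WordCount(date=2013-06-01)."""
    rendered = ", ".join(f"{key}={params[key]}" for key in sorted(params))
    return f"{class_name}({rendered})"


class ToyScheduler:
    """Gives each distinct task to at most one worker (illustration only)."""

    def __init__(self):
        self._owner = {}  # task id -> name of the worker that claimed it

    def claim(self, worker, class_name, params):
        """Return True if `worker` owns the task, False if another worker does."""
        key = task_id(class_name, params)
        # setdefault records the first claimant and returns the current owner.
        return self._owner.setdefault(key, worker) == worker


if __name__ == "__main__":
    scheduler = ToyScheduler()
    june_first = {"date": "2013-06-01"}
    print(scheduler.claim("worker-1", "WordCount", june_first))
    print(scheduler.claim("worker-2", "WordCount", june_first))
    print(scheduler.claim("worker-2", "WordCount", {"date": "2013-06-02"}))
```

Two workers asking for WordCount with the same date map to the same identifier, so only the first claim succeeds. A different date yields a different identifier and is treated as a separate Task.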

Luigi - Visualizer

Luigi - Advantages over Cron

• Explicit dependencies

• No wasted time waiting

• Easy to tell what has failed

• Avoid duplicate work / partial failures

Luigi - Advantages over Oozie

• Explicit dependencies between workflows

• Easier to write

• Vastly more extensible

• Code reuse

• Can easily re-run individual steps

Thank you!

Check out Luigi:

https://github.com/spotify/luigi

Drop me a line:

Joe Ennever

jennever@foursquare.com
