luigi presentation oa summit

18
Data Workflows at Foursquare using Luigi

Upload: open-analytics

Post on 12-May-2015

5.965 views

Category:

Documents


0 download

DESCRIPTION

OA NYC Summit

TRANSCRIPT

Page 1: Luigi presentation OA Summit

Data Workflows at Foursquare using Luigi

Page 2: Luigi presentation OA Summit

Foursquare

•  35 million users

•  Nearly 4 billion check-ins

•  More than 5 million check-ins per day

•  50 million point-of-interest database

•  100's of GB of log data per day

Page 3: Luigi presentation OA Summit

Tools We Use

•  Hive o  Ad hoc analytics, data dumping ground

•  Raw MapReduce o  100's of MapReduce jobs in our codebase

•  Pig o  Fits between structure Hive and free-form

MapReduce

•  Vertica o  Low latency analytics

Page 4: Luigi presentation OA Summit

Cron

E.g. 0 0 * * * ./hadoop-script-1.sh # Wait two hours for that job to finish...

0 2 * * * ./hadoop-script-2.sh

# And on and on and on

Page 5: Luigi presentation OA Summit

Cron - Problems

•  Brittle

•  Hard to reason about / visualize

•  Spend a lot of time waiting

•  Difficult to tell what succeeded or failed

•  No one likes writing Bash scripts

Page 6: Luigi presentation OA Summit

Oozie

XML-based Workflow Engine, with support for Hadoop, Hive, and Pig

Workflows specify computations in a DAG, e.g "Run this Hive query, then run these two MapReduce jobs in parallel"

Coordinators launch recurring workflows at a given frequency, when dependent data is available

Page 7: Luigi presentation OA Summit

Oozie - Example

Page 8: Luigi presentation OA Summit

Oozie - Problems

•  Workflows are all-or-nothing o  Cannot just run step that failed o  Very little code reuse

•  Little to no extensibility •  Limited control flow •  Extremely verbose •  Difficult to test •  No one likes writing XML

Page 9: Luigi presentation OA Summit

Luigi •  Python framework for batch processing jobs

•  Created by Spotify, open-sourced Sept. 2012

•  Tasks are units of work that produce Targets

•  Tasks can depend on one or more other Tasks

•  A Task is only run if all of its dependent Tasks are done

•  Tasks are idempotent

Page 10: Luigi presentation OA Summit

Luigi - Example Task

Page 11: Luigi presentation OA Summit

Luigi - Running the Task $ python word-count.py WordCount --date 2013-06-01

Page 12: Luigi presentation OA Summit

Luigi - Scheduler

Central scheduler ensures each Task is only run by a single worker.

A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)

Will retry failed Tasks after a configured timeout

Emails someone when a Task fails

Page 13: Luigi presentation OA Summit

Luigi - Visualizer

Page 14: Luigi presentation OA Summit

Luigi - Visualizer

Page 15: Luigi presentation OA Summit

Luigi - Visualizer

Page 16: Luigi presentation OA Summit

Luigi - Advantages over Cron

•  Explicit dependencies

•  No wasted time waiting

•  Easy to tell what has failed

•  Avoid duplicate work / partial failures

Page 17: Luigi presentation OA Summit

Luigi - Advantages over Oozie

•  Explicit dependencies between workflows

•  Easier to write

•  Vastly more extensible

•  Code reuse

•  Can easily re-run individual steps

Page 18: Luigi presentation OA Summit

Thank you!

Check out Luigi: https://github.com/spotify/luigi

Drop me a line: Joe Ennever [email protected]