Building Data Pipelines in Python using Apache Airflow STL Python Meetup Aug 2nd 2016 @conornash

Slides: files.meetup.com/6240052/STLPython.Airflow.pdf · 2016-08-05

What is Apache Airflow?

• Airflow is a platform to programmatically author, schedule and monitor workflows

• Designed for batch jobs, not for real-time data streams

• Originally developed at Airbnb by Maxime Beauchemin, now incubating as an Apache project

Why would you want to use it?

• Companies grow to have a complex network of processes and data that have intricate dependencies

• Analytics & batch processing are becoming increasingly important

• Goal: scale up analytics/batch processing while keeping time spent writing, monitoring and troubleshooting jobs to a minimum

• Useful even for small workflows/batch jobs

Airflow Features

• Dependency management (DAGs)

• Status visibility

• Scheduling

• Log storage/retrieval

• Parameterized retries

• Distributed DAGs (RabbitMQ)

• Queues

• Pools

• Branching/Partial Success

• SLA monitoring

• Jinja templating

• Plugin system and more…
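
To make "parameterized retries" concrete, here is a minimal standard-library sketch of the idea. It is not Airflow's implementation or API: `run_with_retries` and `flaky_extract` are hypothetical names, and the `retries` and `retry_delay` parameters are only named after the task arguments Airflow exposes for this purpose.

```python
import time


def run_with_retries(task, retries=3, retry_delay=0.0):
    """Call task() until it succeeds or the retry budget is spent.

    retries counts the extra attempts after the first one,
    so the task runs at most retries + 1 times.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return task()
        except Exception:
            if attempt > retries:
                raise  # budget exhausted: surface the last error
            time.sleep(retry_delay)  # wait before the next attempt


# Hypothetical flaky task: fails twice, then succeeds.
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("source not ready yet")
    return "extracted"


result = run_with_retries(flaky_extract, retries=3, retry_delay=0.0)
```

In Airflow the scheduler does this bookkeeping per task, so the retry count and delay become settings rather than hand-written loops.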

Airflow: Dashboard

Airflow: DAG

Quick start requirements

• Python 2 or 3

• Make new project (virtualenv, pyenv, …)

• $ cd <project folder path> && export AIRFLOW_HOME=<project folder path>

• $ pip install airflow

• $ airflow initdb

• $ airflow webserver -p 8080

Airflow: First DAG

• Existing Python/Bash/Java/etc. script that is difficult to monitor

• Probably already set up as a cron (Unix) or scheduled task (Windows)

• Want to integrate it into an Airflow DAG

Airflow: First DAG

Airflow: Complex DAG

Airflow: Complex DAG
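
The diagrams for these slides are not in the transcript. As a stand-in, this sketch shows how a fan-out/fan-in dependency graph resolves into an execution order, using only the standard library (`graphlib`, Python 3.9+) rather than Airflow itself. The task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it waits on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "clean_orders": {"extract_orders"},
    "clean_users": {"extract_users"},
    "join_tables": {"clean_orders", "clean_users"},
    "train_model": {"join_tables"},
    "build_report": {"join_tables"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

# Group tasks into "waves": everything in a wave could run in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstreams are all done
    waves.append(ready)
    sorter.done(*ready)  # mark the wave finished, unlocking downstream tasks

for number, wave in enumerate(waves, start=1):
    print(number, wave)
# 1 ['extract_orders', 'extract_users']
# 2 ['clean_orders', 'clean_users']
# 3 ['join_tables']
# 4 ['build_report', 'train_model']
```

The "acyclic" in DAG matters here: a cycle would mean no task in the loop could ever become ready.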

Why would you want to use it?

• Data Warehousing

• Anomaly Detection

• Search Ranking

• Model Training

• Text Analysis

• Experimentation (e.g. A/B tests)

• Data Cleaning

• 3rd Party Data Integration

Q&A

Twitter: @conornash

Email: [email protected]