![Page 1: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/1.jpg)
Building (Better) Data Pipelines using Apache Airflow
Sid Anand (@r39132) QCon.AI 2018
![Page 2: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/2.jpg)
About Me
Work [ed | s] @
Maintainer of
Spare time
Co-Chair for
![Page 3: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/3.jpg)
Apache Airflow
What is it?
![Page 4: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/4.jpg)
Apache Airflow : What is it?
In a nutshell:
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs or Directed Acyclic Graphs)
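
To make the "Directed Acyclic Graph" idea concrete, here is a minimal sketch in plain Python. It uses only the standard library (`graphlib`, Python 3.9+) and is not Airflow's API; the task names and the `run_order` helper are hypothetical. Each task lists the tasks it depends on, and a topological sort yields one order in which a scheduler could run them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "score": {"transform"},
    "report": {"transform"},
    "notify": {"score", "report"},
}


def run_order(dag):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(workflow)
print(order)
```

Because the graph is acyclic, every task appears after all of its dependencies; a cycle would make such an order impossible, which is why workflows are modelled as DAGs.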
![Page 5: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/5.jpg)
Apache Airflow
UI Walk-Through
![Page 6: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/6.jpg)
Apache Airflow : UI Walk-through
![Page 7: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/7.jpg)
Airflow - Authoring DAGs
Airflow: Visualizing a DAG
![Page 8: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/8.jpg)
Airflow: Author DAGs in Python! No need to bundle many XML files!
Airflow - Authoring DAGs
![Page 9: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/9.jpg)
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Authoring DAGs
![Page 10: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/10.jpg)
Airflow - Performance Insights
Airflow: Gantt charts reveal the slowest tasks for a run!
![Page 11: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/11.jpg)
Airflow: …And we can easily see performance trends over time
Airflow - Performance Insights
![Page 12: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/12.jpg)
Apache Airflow
Why use it?
![Page 13: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/13.jpg)
Apache Airflow : Why use it?
When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…
• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
![Page 14: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/14.jpg)
Apache Airflow : Why use it?
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met
• Easily scale for growing load
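
The "handle task failures" and "report / alert on failures" requirements can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Airflow's implementation: `run_with_retries`, `flaky`, and the `alert` callback are hypothetical names. Airflow exposes comparable settings on its operators (for example `retries` and `retry_delay`).

```python
from typing import Callable, List


def run_with_retries(task: Callable[[], str], retries: int,
                     alert: Callable[[str], None]) -> str:
    """Run `task`, retrying up to `retries` extra times; alert once if every attempt fails."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:  # a real scheduler would be more selective
            last_error = exc
    alert(f"task failed after {retries + 1} attempts: {last_error}")
    raise last_error


# Hypothetical task that fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"


alerts: List[str] = []
result = run_with_retries(flaky, retries=2, alert=alerts.append)
print(result, attempts["count"], alerts)
```

The alert fires only after the retry budget is exhausted, so transient failures do not page anyone.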
![Page 15: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/15.jpg)
Apache Airflow : Why use it?
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
![Page 16: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/16.jpg)
Use-Case : Message Scoring
Batch Pipeline Architecture
![Page 17: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/17.jpg)
Use-Case : Message Scoring
S3 uploads every 15 minutes
[Diagram: enterprise A, enterprise B, enterprise C, S3]
![Page 18: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/18.jpg)
Use-Case : Message Scoring
Airflow kicks off a Spark message scoring job every hour
[Diagram: enterprise A, enterprise B, enterprise C, S3]
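
With uploads every 15 minutes and a scoring job every hour, each run has four upload batches to read. The sketch below works through that arithmetic. The key layout (`uploads/YYYY/MM/DD/HHMM/`) and the `upload_prefixes` helper are assumptions for illustration; the slides do not describe how the S3 keys are organized.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)  # from the slides: uploads every 15 minutes
JOB_INTERVAL = timedelta(hours=1)        # from the slides: scoring job every hour


def upload_prefixes(hour_start: datetime, bucket_prefix: str = "uploads"):
    """List the (hypothetical) S3 key prefixes that one hourly run would read."""
    prefixes = []
    t = hour_start
    while t < hour_start + JOB_INTERVAL:
        prefixes.append(f"{bucket_prefix}/{t:%Y/%m/%d/%H%M}/")
        t += UPLOAD_INTERVAL
    return prefixes


batch = upload_prefixes(datetime(2018, 4, 10, 9, 0))
print(batch)
```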
![Page 19: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/19.jpg)
Use-Case : Message Scoring
Spark job writes scored messages and stats to another S3 bucket
[Diagram: enterprise A, enterprise B, enterprise C, two S3 buckets]
![Page 20: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/20.jpg)
Use-Case : Message Scoring
This triggers SNS/SQS message events
[Diagram: enterprise A, enterprise B, enterprise C, two S3 buckets, SNS, SQS]
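
A consumer of these events has to unwrap two layers: the SQS body carries an SNS notification, whose `Message` field is the S3 event encoded as a JSON string. The sketch below assumes raw message delivery is disabled on the SNS subscription (so the envelope is present). The function name, bucket name, and object key are hypothetical. S3 URL-encodes object keys in event notifications, hence the `unquote_plus` call.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str):
    """Return (bucket, key) pairs from an SQS body holding an SNS-wrapped S3 event."""
    envelope = json.loads(body)              # SNS notification envelope
    event = json.loads(envelope["Message"])  # S3 event, JSON-encoded as a string
    return [
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


# Hypothetical event for one newly written object.
s3_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "scored-messages"},
                "object": {"key": "2018/04/10/part+0001.json"},
            },
        }
    ]
}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})

objects = s3_objects_from_sqs_body(sqs_body)
print(objects)
```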
![Page 21: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/21.jpg)
Use-Case : Message Scoring
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
[Diagram: enterprise A, enterprise B, enterprise C, two S3 buckets, SNS, SQS, Importers (ASG)]
![Page 22: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/22.jpg)
Use-Case : Message Scoring
The importers rapidly ingest scored messages and aggregate statistics into the DB
[Diagram: enterprise A, enterprise B, enterprise C, two S3 buckets, SNS, SQS, Importers (ASG), DB]
![Page 23: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/23.jpg)
Use-Case : Message Scoring
Users receive alerts of untrusted emails & can review them in the web app
[Diagram: enterprise A, enterprise B, enterprise C, two S3 buckets, SNS, SQS, Importers (ASG), DB]
![Page 24: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/24.jpg)
Use-Case : Message Scoring
Airflow manages the entire process
[Diagram: enterprise A, enterprise B, enterprise C, two S3 buckets, SNS, SQS, Importers (ASG), DB]
![Page 25: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/25.jpg)
Airflow DAG
![Page 26: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/26.jpg)
Apache Airflow
Incubating
![Page 27: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/27.jpg)
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow -> Apache Incubator
Today
• 2400+ Forks
• 7600+ GitHub Stars
• 430+ Contributors
• 150+ companies officially using it!
• 14 Committers/Maintainers <- We’re growing here
![Page 28: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/28.jpg)
Thank You!
![Page 29: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/29.jpg)
Apache Airflow
Behind the Scenes
![Page 30: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/30.jpg)
Apache Airflow : Behind the Scenes
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
It ships with a
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!
![Page 31: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/31.jpg)
Apache Airflow : Behind the Scenes
[Diagram: Webserver, Scheduler, three Workers, Meta DB, Celery / RabbitMQ]
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery
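
Steps 3 and 4 describe a producer/consumer hand-off: the scheduler enqueues tasks and the workers pull them off a shared queue. The sketch below imitates that shape with the standard library only. An in-process `queue.Queue` stands in for the Celery / RabbitMQ hop, and `run_workers` and the task ids are hypothetical, so this shows the pattern rather than how Airflow or Celery are implemented.

```python
import queue
import threading


def run_workers(task_ids, num_workers=3):
    """Push task ids onto a queue and let worker threads drain it."""
    broker = queue.Queue()   # stands in for the Celery / RabbitMQ hop
    done = []                # (worker name, task id) pairs
    lock = threading.Lock()

    def worker(name):
        while True:
            task_id = broker.get()
            if task_id is None:      # sentinel: no more work for this worker
                return
            with lock:               # protect the shared result list
                done.append((name, task_id))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for task_id in task_ids:         # the "scheduler" distributes work
        broker.put(task_id)
    for _ in threads:                # one sentinel per worker
        broker.put(None)
    for t in threads:
        t.join()
    return done


completed = run_workers(["extract", "score", "import"])
print(sorted(task for _, task in completed))
```

Which worker handles which task is not deterministic, which mirrors the real setup: the scheduler does not choose a worker, it only publishes work.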
![Page 32: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/32.jpg)
![Page 33: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/33.jpg)
![Page 34: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/34.jpg)
Apache Airflow : Behind the Scenes
[Diagram: Webserver, Scheduler, three Workers, Meta DB, Celery / RabbitMQ]
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks from RabbitMQ
![Page 35: Building (Better) Data Pipelines...Building (Better) Data Pipelines using Apache Airflow Sid Anand (@r39132) QCon.AI 2018 1. About Me 2 ... Airflow: Author DAGs in Python! No need](https://reader033.vdocuments.us/reader033/viewer/2022051822/5febee0546585a0bba119953/html5/thumbnails/35.jpg)
Thank You!