airflow - a data flow engine
TRANSCRIPT
![Page 1: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/1.jpg)
AIRFLOW - DATA FLOW ENGINE FROM AIRBNB
By Walter Liu 2016/01/28
![Page 2: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/2.jpg)
Who am I?
- Archer
- Architect of TrendMicro Coretech backend
- A new user of Airflow
![Page 3: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/3.jpg)
Why Data Flow Engine?
![Page 4: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/4.jpg)
Cron Job
![Page 5: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/5.jpg)
Directed Acyclic Graph (DAG)
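A minimal sketch of why the DAG shape matters for job scheduling: if the dependencies form a directed acyclic graph, the tasks can always be put in an order where every task runs after the tasks it depends on. The task names and the `topological_order` helper below are hypothetical and only for illustration; this is not how Airflow itself is implemented.

```python
from collections import deque


def topological_order(upstream):
    """Return task names so that every task comes after its upstream tasks.

    `upstream` maps each task name to the list of tasks it depends on.
    Every dependency must itself be a key of `upstream`.
    Raises ValueError if the graph contains a cycle (i.e. it is not a DAG).
    """
    # Count unmet dependencies and build the reverse (downstream) mapping.
    pending = {task: len(deps) for task, deps in upstream.items()}
    downstream = {task: [] for task in upstream}
    for task, deps in upstream.items():
        for dep in deps:
            downstream[dep].append(task)

    # Start with the tasks that depend on nothing.
    ready = deque(sorted(task for task, n in pending.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Finishing a task may unblock the tasks that depend on it.
        for child in downstream[task]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(upstream):
        raise ValueError("cycle detected: not a DAG")
    return order


# Hypothetical pipeline: two extracts feed a join, which feeds a report.
example = {
    "extract_users": [],
    "extract_orders": [],
    "join": ["extract_users", "extract_orders"],
    "report": ["join"],
}
print(topological_order(example))
```

A plain cron job has no notion of this ordering; each entry only knows its own start time.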
![Page 6: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/6.jpg)
![Page 7: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/7.jpg)
![Page 8: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/8.jpg)
Data relationships
- Data availability: if the data is not there, trigger the process that generates it.
- Data dependency: some data relies on other data to be generated.
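The two data relationships can be sketched together in a few lines. This is a toy model under stated assumptions: a dict stands in for real storage, and `ensure`, `producers`, `requires` and the dataset names are all hypothetical names made up for this example.

```python
def ensure(dataset, producers, requires, store, log):
    """Make sure `dataset` exists in `store`, generating it if it is missing.

    producers: dataset name -> function(store) that writes the dataset
    requires:  dataset name -> list of dataset names it is built from
    store:     dict standing in for real storage (files, tables, S3 keys)
    log:       list that records which datasets had to be generated
    """
    if dataset in store:
        return  # data availability: the data is already there
    for dep in requires.get(dataset, []):
        # data dependency: the inputs must exist before this dataset is built
        ensure(dep, producers, requires, store, log)
    producers[dataset](store)  # trigger the process that generates the data
    log.append(dataset)


def make_raw(store):
    store["raw_logs"] = ["b", "a", "b"]


def make_clean(store):
    store["clean_logs"] = sorted(set(store["raw_logs"]))


def make_report(store):
    store["daily_report"] = len(store["clean_logs"])


requires = {"clean_logs": ["raw_logs"], "daily_report": ["clean_logs"]}
producers = {
    "raw_logs": make_raw,
    "clean_logs": make_clean,
    "daily_report": make_report,
}

store = {"raw_logs": ["x", "y", "x"]}  # raw data is already available
generated = []
ensure("daily_report", producers, requires, store, generated)
print(generated)              # only the missing datasets were generated
print(store["daily_report"])
```

Calling `ensure` a second time on the same store generates nothing, which is the behaviour one wants when a job is resumed.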
![Page 9: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/9.jpg)
Operability
- Job failure and resume
- Job monitoring
- Backfill
![Page 10: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/10.jpg)
Airflow
![Page 11: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/11.jpg)
DAG structure as code
![Page 12: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/12.jpg)
![Page 13: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/13.jpg)
![Page 14: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/14.jpg)
![Page 15: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/15.jpg)
![Page 16: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/16.jpg)
Demo
- python tutorial.py
- airflow list_dags
- airflow list_tasks tutorial
- airflow list_tasks tutorial --tree
- airflow test tutorial print_date 2015-06-01
- airflow test tutorial sleep 2015-06-01
- airflow run tutorial templated 2015-06-01
- Backfill: airflow backfill tutorial -s 2015-06-07 -e 2015-06-10
  - Run again
  - Run another date range
- Site: http://localhost:8080/admin/
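What a backfill over a date range amounts to can be sketched as below. The sketch assumes a daily schedule, that both the start and end dates are included, and that a rerun skips dates already recorded as successful; the slides do not state how Airflow handles these details, and `backfill_dates` is a hypothetical helper, not an Airflow function.

```python
from datetime import date, timedelta


def backfill_dates(start, end, already_done=()):
    """List the daily execution dates in [start, end] that still need a run."""
    done = set(already_done)
    total_days = (end - start).days
    candidates = [start + timedelta(days=i) for i in range(total_days + 1)]
    return [day for day in candidates if day not in done]


# First run: the range used in the demo command.
first = backfill_dates(date(2015, 6, 7), date(2015, 6, 10))
print([d.isoformat() for d in first])

# "Run again": nothing is left to do once every date has succeeded.
print(backfill_dates(date(2015, 6, 7), date(2015, 6, 10), already_done=first))

# "Run another date range": only the dates not yet covered are scheduled.
more = backfill_dates(date(2015, 6, 9), date(2015, 6, 12), already_done=first)
print([d.isoformat() for d in more])
```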
![Page 17: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/17.jpg)
UI – DAG view
![Page 18: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/18.jpg)
UI – Tree View
![Page 19: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/19.jpg)
UI – Graph View
![Page 20: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/20.jpg)
UI – Gantt
![Page 21: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/21.jpg)
UI – Task duration
![Page 22: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/22.jpg)
UI – Code
![Page 23: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/23.jpg)
Airflow - Pros
- Dynamic path generation
- Has both a time scheduler and a command-line trigger
- Has a master/worker model (automatically distributes tasks)
  - Scales if you have many tasks in a chain.
  - But it may not be so useful for most of our tasks. Maybe useful for a full dump.
- Fancy UI
  - Dependencies of tasks
  - Task success/failure
  - Scheduled task status
- Has a utility lib to wait_data for S3, Hadoop …
![Page 24: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/24.jpg)
Airflow - Cons
- Additional DB/Redis or RabbitMQ for Celery
  - HA design: use RDBMS/redis-cache in AWS
- Requires Python 2.7 and many other libraries.
- Not dependent on data, just task dependency. (Not a big con)
  - Write your own code to check for the data file.
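A rough idea of what "check data file code" could look like: a task that polls until a marker file shows up, so downstream tasks only start once the data is there. This is a generic sketch, not Airflow's API; `wait_for_file`, its parameters, and the `_SUCCESS` marker name are assumptions made for this example.

```python
import os
import tempfile
import time


def wait_for_file(path, timeout=60.0, poke_interval=1.0):
    """Poll until `path` exists; return True if found, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Usage with a temporary directory standing in for the real data location.
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "_SUCCESS")  # hypothetical marker file name
    # The data is not there yet, so the check gives up after the timeout.
    print(wait_for_file(marker, timeout=0.2, poke_interval=0.05))
    open(marker, "w").close()  # the upstream job "finishes"
    # Now the check succeeds immediately.
    print(wait_for_file(marker, timeout=0.2, poke_interval=0.05))
```

Returning False instead of raising leaves it to the calling task to decide whether a timeout counts as a failure.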
![Page 25: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/25.jpg)
A snippet of the task of Luigi
![Page 26: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/26.jpg)
Recap
- Solving the DAG job management problem
- Features to improve daily operation
  - Monitoring/visualization
  - Job failure and resume
  - Backfill
![Page 27: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/27.jpg)
Backup slides
![Page 28: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/28.jpg)
Other Selections?
- Oozie (on Hadoop)
- Luigi (by Spotify)
  - Mario (Luigi in Scala)
  - Ruigi (Luigi in R)
- Airflow (by Airbnb)
- Mistral (by OpenStack)
- Pinball (by Pinterest)
- …
![Page 29: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/29.jpg)
GitHub statistics

| Project | Start Date | Stars | Commits in recent 2 months |
|---|---|---|---|
| Airflow | 2014/10/05 | 1516 | 362 |
| Luigi | 2011/11/13 | 3777 | 105 |
| Pinball | 2015/05/01 | 476 | 0 |
| Oozie | 2011/08/28 | 167 | 14 |
![Page 30: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/30.jpg)
Oozie
- Pros
  - Mature
  - Native support of the Hadoop ecosystem
  - Python is not required.
- Cons
  - Big and complex XML to define the flow and config.
  - Control flow is somewhat restrictive.
  - Hadoop ecosystem is required.
    - Unix box, Java, Hadoop, Pig, ExtJS library
![Page 31: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/31.jpg)
Oozie XML Example
![Page 32: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/32.jpg)
Luigi - Pros
- Makefile-like, dependencies on data (upward instead of downward)
- Command line trigger
- Not only Hadoop
- Supported targets: HDFS/S3/RDBMS/… (not limited)
  - Write your own code to support Cassandra, etc.
- Dependencies are decentralized, no big XML.
- UI
- Dynamic dependencies/tasks are supported. (Don't know whether it is better than Airflow or not)
- [util] Date algebra
![Page 33: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/33.jpg)
Luigi - Cons
- No built-in triggering
  - No crontab-like things (could borrow Chronos)
  - Use cron to run tasks on specific nodes. (normally)
![Page 34: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/34.jpg)
Luigi - Notes
- Local execution
  - Pros: Easy to debug
  - Cons:
    - Needs access to other systems. (Well, other systems didn't achieve this either.)
    - No scalability if you have many tasks in a chain.
- Centralized scheduler servers (optional, recommended for production)
  - Make sure two instances of the same task are not running simultaneously.
  - Provide visualization of everything that's going on.
- HA option
  - Run 2+ tasks on 2+ machines at the same time
![Page 35: Airflow - a data flow engine](https://reader035.vdocuments.us/reader035/viewer/2022062503/586f77bd1a28ab10258b68f5/html5/thumbnails/35.jpg)
References
- http://erikbern.com/2015/07/02/more-luigi-alternatives/