introduction to apache airflow - data day seattle 2016
TRANSCRIPT
4
Apache Airflow : What is it?Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
5
Apache Airflow : What is it?Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
It ships with • DAG Scheduler • Web application (UI) • Powerful CLI
12
Apache Airflow : What is it?When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines • Fraud Detection, Scoring/Ranking, Classification,
Recommender System, etc…
• General Job Scheduling (e.g. Cron) • DB Back-ups, Scheduled code/config deployment
13
Apache Airflow : What is it?
What should a Workflow Scheduler do well? • Schedule a graph of dependencies
• where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs • E.g. Alerting if time or correctness SLAs are not met
• Scale
14
Apache Airflow : What is it?What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
16
Apache Airflow : Incubating
Timeline • Airflow was created @ Airbnb in 2015 by Maxime
Beauchemin • Max launched it @ Hadoop Summit in Summer 2015 • On 3/31/2016, Airflow —> Apache Incubator
Today • 166+ Contributors • 300+ Users • 40+ companies officially using it! • 9 Committers/Maintainers <— We’re growing here
23
Enterprise Customers
email metadata
apply trust
models
email md + trust score
Agari’s Current Product
Agari : What We Do
24
email metadata
apply trust
models
email md+ trust score
Agari’s Future ProductEnterprise Customers
Agari : What We Do
Classes of Orchestration
26
apply trust models
(message scoring)
build trust models
cron++ (general
job scheduler)
New Product (Enterprise Protect)
Operational Automation BI / ETL
N / A
Classes of Orchestration
27
apply trust models
(message scoring)
build trust models
cron++ (general
job scheduler)
New Product (Enterprise Protect)
Operational Automation BI / ETL
N / A
This Talk
Use-Case : Message Scoring
30
enterprise Aenterprise Benterprise C
S3
Airflow kicks of a Spark message scoring job
every hour
Use-Case : Message Scoring
31
enterprise Aenterprise Benterprise C
S3
Spark job writes scored messages and stats to
another S3 bucket
S3
Use-Case : Message Scoring
32
enterprise Aenterprise Benterprise C
S3
This triggers SNS/SQS messages events
S3
SNS
SQS
Use-Case : Message Scoring
33
enterprise Aenterprise Benterprise C
S3
An Autoscale Group (ASG) of Importers spins up when it detects SQS
messages
S3
SNS
SQS
Importers
ASG
34
enterprise Aenterprise Benterprise C
S3
The importers rapidly ingest scored messages and aggregate statistics into
the DB
S3
SNS
SQS
Importers
ASGDB
Use-Case : Message Scoring
35
enterprise Aenterprise Benterprise C
S3
Users receive alerts of untrusted emails & can review them in
the web app
S3
SNS
SQS
Importers
ASGDB
Use-Case : Message Scoring
36
enterprise Aenterprise Benterprise C
S3 S3
SNS
SQS
Importers
ASGDB
Airflow manages the entire process
Use-Case : Message Scoring
46
• Compute discrepancies
• Send email report
• Update monitoring graphs
• Raise SLA (correctness) alerts
Airflow DAG
48
Desirable Qualities of a Resilient Data Pipeline
OperabilityCorrectness
Timeliness Scalable/Available
• Data Integrity (no loss, etc…) • Expected data distributions
• All output within time-bound SLAs (e.g. 1 hour)
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
• ASGs, SQS, SNS, S3
49
Desirable Qualities of a Resilient Data Pipeline
OperabilityCorrectness
Timeliness
• Data Integrity (no loss, etc…) • Expected data distributions
• All output within time-bound SLAs (e.g. 1 hour)
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
SLA
SLA• ASGs, SQS, SNS, S3 Scalable/Available
51
Correctness : Email Reporting
For each org, we check for duplicate or missing data as a count & percentage
orgs
52
Correctness : Email Reporting
These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!
orgs
55
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA miss
56
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA miss
dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)
57
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA miss
Correctness SLA miss
58
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps
66
Apache Airflow Next Steps
Improvement Areas
• Security
• API (though we do have a CLI)
• Deployment / Versioning
• Execution Scale Out
• On-demand Execution
Acknowledgments
67
• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones
• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle
None of this work would be possible without the contributions of the strong team below