![Page 1: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/1.jpg)
Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi
Open Source Summit 2017
![Page 2: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/2.jpg)
Sadayuki Furuhashi
A founder of Treasure Data, Inc. located in Silicon Valley.
OSS projects I founded:
An open-source hacker.
Github: @frsyuki
![Page 3: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/3.jpg)
What's Workflow Engine?• Automates your manual operations.
• Load data → Clean up → Analyze → Build reports • Get customer list → Generate HTML → Send email • Monitor server status → Restart on abnormal • Backup database → Alert on failure • Run test → Package it → Deploy
(Continuous Delivery)
![Page 4: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/4.jpg)
Challenge: Multiple Cloud & Regions
On-Premises
Different API, Different tools, Many scripts.
![Page 5: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/5.jpg)
Challenge: Multiple DB technologies
Amazon S3
Amazon Redshift
Amazon EMR
![Page 6: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/6.jpg)
Challenge: Multiple DB technologies
Amazon S3
Amazon Redshift
Amazon EMR
> Hi! > I'm a new technology!
![Page 7: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/7.jpg)
Challenge: Modern complex data analytics
Ingest Application logs User attribute data Ad impressions 3rd-party cookie data
Enrich Removing bot access Geo location from IP address Parsing User-Agent JOIN user attributes to event logs
Model A/B Testing Funnel analysis Segmentation analysis Machine learning
Load Creating indexes Data partitioning Data compression Statistics collection
Utilize Recommendation API Realtime ad bidding Visualize using BI applications
Ingest UtilizeEnrich Model Load
![Page 8: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/8.jpg)
Traditional "false" solution
#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.shfor query in emr/*.sql; do ./run_emr_hive $querydone./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh
> Poor error handling > Write once, Nobody reads > No alerts on failure > No alerts on too long run > No retrying on errors > No resuming > No parallel execution > No distributed execution > No log collection > No visualized monitoring > No modularization > No parameterization
![Page 9: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/9.jpg)
Solution: Multi-Cloud Workflow Engine
Solves
> Poor error handling > Write once, Nobody reads > No alerts on failure > No alerts on too long run > No retrying on errors > No resuming > No parallel execution > No distributed execution > No log collection > No visualized monitoring > No modularization > No parameterization
![Page 10: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/10.jpg)
Example in our case
1. Dump data to BigQuery
2. load all tables to Treasure Data
3. Run queries
5. Notify on slack
4. Create reports on Tableau Server
(on-premises)
![Page 11: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/11.jpg)
Workflow constructs
![Page 12: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/12.jpg)
Unite Engineering & Analytic Teams
+wait_for_arrival:
s3_wait>: |
bucket/www_${session_date}.csv
+load_table:
redshift>: scripts/copy.sql
Powerful for Engineers > Comfortable for advanced users
Friendly for Analysts > Still straight forward for analysts to
understand & leverage workflows
![Page 13: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/13.jpg)
Unite Engineering & Analytic Teams
Powerful for Engineers > Comfortable for advanced users
Friendly for Analysts > Still straight forward for analysts to
understand & leverage workflows
+wait_for_arrival:
s3_wait>: |
bucket/www_${session_date}.csv
+load_table:
redshift>: scripts/copy.sql
+ is a task
> is an operator
${...} is a variable
![Page 14: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/14.jpg)
Operator library_export:
td:
database: workflow_temp
+task1:
td>: queries/open.sql
create_table: daily_open
+task2:
td>: queries/close.sql
create_table: daily_close
Standard libraries redshift>: runs Amazon Redshift queries emr>: create/shutdowns a cluster & runs steps s3_wait>: waits until a file is put on S3 pg>: runs PostgreSQL queries td>: runs Treasure Data queries td_for_each>: repeats task for result rows mail>: sends an email
Open-source libraries You can release & use open-source operator libraries.
![Page 15: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/15.jpg)
Parallel execution
+load_data:
_parallel: true
+load_users:
redshift>: copy/users.sql
+load_items:
redshift>: copy/items.sql
Parallel execution Tasks under a same group run in parallel if _parallel option is set to true.
![Page 16: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/16.jpg)
Loops & Parameters
+send_email_to_active_users:
td_for_each>: list_active.sql
_do:
+send:
email>: tempalte.txt
to: ${td.for_each.addr}
Parameter A task can propagate parameters to following tasks
Loop Generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
![Page 17: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/17.jpg)
Grouping workflows...
Ingest UtilizeEnrich Model Load
+task
+task
+task+task +task
+task +task
+task
+task
+task +task +task
![Page 18: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/18.jpg)
Grouping workflows
Ingest UtilizeEnrich Model Load
+ingest +enrich
+task +task
+model
+basket_analysis
+task +task
+learn +load
+task +task+tasks
+task
![Page 19: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/19.jpg)
Pushing workflows to a server with Docker image
schedule:
daily>: 01:30:00
timezone: Asia/Tokyo
_export:
docker:
image: my_image:latest
+task:
sh>: ./run_in_docker
Digdag server > Develop on laptop, push it to a server. > Workflows run periodically on a server. > Backfill > Web editor & monitor
Docker > Install scripts & dependences in a
Docker image, not on a server. > Workflows can run anywhere including
developer's laptop.
![Page 20: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/20.jpg)
Demo
![Page 21: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/21.jpg)
Digdag is production-ready
Digdag server PostgreSQL
It's just like a web application.
Digdag client
All task state
API & scheduler & executor
Visual UI
![Page 22: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/22.jpg)
Digdag is production-ready
PostgreSQL
Stateless servers + Replicated DB
Digdag client
API & scheduler & executor
PostgreSQL
All task state
Digdag server
Digdag server
HTTP Load Balancer
Visual UI
HA
![Page 23: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/23.jpg)
Digdag is production-ready
Digdag server PostgreSQL
Isolating API and execution for reliability
Digdag client
API
PostgreSQL
HA
Digdag server
Digdag server
Digdag server
scheduler &executor
HTTP Load Balancer
All task state
![Page 24: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/24.jpg)
Digdag at Treasure Data
3,600 workflows run every day 28,000 tasks run every day
850 active workflows
400,000 workflow executions in total
![Page 25: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/25.jpg)
Digdag & Open Source
![Page 26: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/26.jpg)
Learning from my OSS projects• Make it pluggable!
700+ plugins in 6 years
200+ plugins in 3 yearsinput/output, parser/formatter,decoder/encoder, filter, and executor
input/output, and filter
70+ implementations in 8 years
![Page 27: Automating Workflows for Analytics Pipelinesevents17.linuxfoundation.jp/sites/events/files/slides/oss-summit...Amazon S3 Amazon Redshift Amazon EMR > Hi! > I'm a new technology! Challenge:](https://reader030.vdocuments.us/reader030/viewer/2022041212/5dd11b27d6be591ccb644013/html5/thumbnails/27.jpg)
Digdag also has plugin architecture
32 operators 7 schedulers 2 command executors 1 error notification module