Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
DESCRIPTION
As troves of data grow exponentially, the number of analytical jobs that process the data also grows rapidly. When you have large teams running hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate custom scripting and help you efficiently run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Change.org, Kickstarter, and UNSILO on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
SVC201 - Automate Your Big Data Workflows
Jinesh Varia, Technology Evangelist
@jinman
November 14, 2013
[Diagram: Automating Big Data Workflows. Automating compute with Amazon SWF (decider, activity workers); automating data with AWS Data Pipeline (activities, data nodes, workers)]
[Diagram: workflow starters, a decider, and activity workers coordinated by Amazon SWF, with workflow history available in the AWS Management Console]
Amazon SWF – Your Distributed State Machine in the Cloud
SWF helps you scale your business logic
Tim James, Data/Science Architect and Manager
Vijay Ramesh, Data/Science Engineer
the world's largest petition platform
At Change.org in the last year
• 120M+ signatures, 15% on victories
• 4,000 declared victories
This works.
How?
60-90% of signatures at Change.org are driven by email.
This works.
This works.*
* up to a point!
Manual targeting doesn't scale.
Manual targeting doesn't scale cognitively.
Manual targeting doesn't scale in personnel.
Manual targeting doesn't scale into mass customization.
Manual targeting doesn't scale culturally or internationally.
Manual targeting doesn't scale with data size and load.
So what did we do?
We used big-compute machine learning to automatically target our mass emails across each week’s set of campaigns.
We started from here...
And finished here...
First: Incrementally extract (and verify) MySQL data to Amazon S3
Best Practice:
Incrementally extract with high watermarking.
(not wall-clock intervals)
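A minimal Python sketch of watermark-based extraction, assuming a hypothetical signatures table and a stored last-extracted id; it is illustrative, not Change.org's actual code:

import MySQLdb  # any Python DB-API driver works the same way

def extract_increment(conn, last_watermark):
    cur = conn.cursor()
    # Filter on a monotonically increasing column (id here) so delayed or
    # re-run jobs never miss or double-count rows, unlike wall-clock windows.
    cur.execute(
        "SELECT id, user_id, created_at FROM signatures "
        "WHERE id > %s ORDER BY id",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # The largest id seen becomes the new high watermark.
    new_watermark = rows[-1][0] if rows else last_watermark
    return rows, new_watermark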
Best Practice:
Verify data continuity after extract.
We used Cascading/Amazon EMR + Amazon SNS.
Transform extracted data on S3 into a "Feature Matrix" using Cascading/Hadoop on a 100-instance Amazon Elastic MapReduce cluster.
A Feature Matrix is just a text file.
Sparse vector file line format, one line per user.
<user_id>[ <feature_id>:<feature_value>]...
Example:
123 12:0.237 18:1 101:0.578
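A few lines of Python (hypothetical, for illustration) show how compact this format is to consume:

def parse_line(line):
    # "123 12:0.237 18:1 101:0.578" -> (123, {12: 0.237, 18: 1.0, 101: 0.578})
    parts = line.split()
    features = {}
    for pair in parts[1:]:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = float(value)
    return int(parts[0]), features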
So how do we do big-compute machine learning?
Enter Amazon Simple Workflow Service (SWF) and Amazon Elastic Compute Cloud (EC2).
SWF and EC2 allowed us to decouple:
• Control (and error) flow
• Task business logic
• Compute resource provisioning
SWF provides a distributed application model
Decider processes make discrete workflow decisions
Independent task lists (queues) are processed by task list-affined worker processes (thus coupling task types to provisioned resource types)
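The talk's workers were written in Ruby; as a rough Python/boto3 sketch (domain, task list, and do_work are hypothetical), a task-list-affined worker is just a long-poll loop:

import boto3

swf = boto3.client("swf", region_name="us-east-1")

def run_worker(domain, task_list):
    # Polling one task list couples this worker (and the instance type it
    # runs on) to one kind of task, as described above.
    while True:
        task = swf.poll_for_activity_task(
            domain=domain, taskList={"name": task_list})
        if not task.get("taskToken"):
            continue  # long poll timed out with no work
        try:
            result = do_work(task.get("input", ""))  # hypothetical task logic
            swf.respond_activity_task_completed(
                taskToken=task["taskToken"], result=result)
        except Exception as exc:
            swf.respond_activity_task_failed(
                taskToken=task["taskToken"], reason=str(exc)[:256])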
SWF provides a distributed application model
Allows deciders and workers to be implemented in any language.
We used Ruby, with ML calculations done in Python, R, or C.
SWF provides a distributed application model
Rich web interface via the AWS Management Console.
Flexible API for control and monitoring.
SWF provides a distributed application model
Resource Provisioning with EC2
Our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.
Simplifying Assumption:
The full feature matrix file fits on the disk of an m1.medium EC2 instance (although we compute it with a 100-instance EMR cluster).
Best Practice:
Treat compute resources as hotel rooms, not mansions.
Worker EC2 Instance bootstrap from base Amazon Machine Image (AMI)
EC2 instance tags provide highly visible, searchable configuration.
Update local git repo to configured software version.
Best Practice:
Log bootstrap steps to S3, mapping essential config tags to EC2 instance names and log files.
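A small boto3 sketch of that idea (the tag names and S3 key layout are assumptions, not the actual Change.org scheme):

import boto3
from urllib.request import urlopen

def log_bootstrap(bucket):
    # Ask the instance metadata service who we are.
    instance_id = urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id").read().decode()
    tags = boto3.client("ec2").describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}])["Tags"]
    config = {t["Key"]: t["Value"] for t in tags}
    # One log object per instance, named by the essential config tags.
    key = "bootstrap/{}/{}.log".format(config.get("role", "unknown"), instance_id)
    body = "\n".join("{}={}".format(k, v) for k, v in sorted(config.items()))
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode())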
Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.
Provisioning in R&D for Training
• Used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm to repeatedly brute-force search a 1000-combination parameter space
• Used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python
Provisioning in Production
• Train with a single SWF worker using multiple cores (multithreaded Python Random Forest)
• Predict with 8 SWF workers — 1 per core, 4 cores per instance
Start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group
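In today's boto3 terms, starting that per-campaign worker group might look like the following sketch (the AMI id and tag names are placeholders):

import boto3

ec2 = boto3.client("ec2")

def start_predict_workers(n, ami_id, campaign):
    # Launch n on-demand m3.2xlarge workers for one campaign's sample group.
    resp = ec2.run_instances(
        ImageId=ami_id, InstanceType="m3.2xlarge",
        MinCount=n, MaxCount=n,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "campaign", "Value": campaign}],
        }],
    )
    return [i["InstanceId"] for i in resp["Instances"]]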
Provisioning in Production
Best Practice:
Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you’re getting results.
Forward scale
So from here, how can we expect this system to scale?
Forward scale
• Run more EMR instances to build Feature Matrix
• Run more SWF predict workers per campaign
for 10x users
Forward scale
• We already automatically start an SWF worker group per campaign
• "User generated campaigns" require no campaigner time and are targeted automatically
for 10x campaigns
Forward scale
• The system eliminates mass email targeting contention, so the team can scale
for 2x+ campaigners
Win for our Campaigners... and Users.
Our user base can now be automatically segmented across a wide pool of campaigns, even internationally.
30%+ conversion boost over manual targeting.
Do you build systems like these? Do you want to?
We'd love to talk. (And yes, we're hiring.)
UNSILO
Dr. Francisco Roque, Co-Founder and CTO
A collaborative search platform that helps you see patterns across Science & Innovation
Mission
UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific terminologies.
Describe, Discover, Analyze & Share: a new way of searching
Big Data Challenges
4.5 million USPTO granted patents
12 million scientific articles
Heterogeneous processing pipeline (multiple steps, variable times)
A small test
1,000 documents, 20 minutes/doc average
A bigger test
100k documents: 3.8 years?
A bigger test
100k documents, 8x8 cores: ~21 days
4.5 million patents?12 million articles?
Focus on the goal
Amazon SWF to the rescue
• Scaling
• Concurrency
• Reliability
• Flexibility to experiment
• Easily adaptable
SWF makes it very easy to separate algorithmic logic and workflow logic
Easy to get started: First document batch running in just 2 weeks
AWS services
Adding content
Job Loading
• Content loaded by traversing S3 buckets
• Reprocessing by traversing tables on DynamoDB
DynamoDB
Decision Workers
• Crawl the workflow history for decision tasks
• Schedule new activity tasks (sketched below)
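A rough boto3 sketch of such a decider loop (the domain, activity names, and single-step decision logic are hypothetical simplifications of UNSILO's pipeline):

import boto3

swf = boto3.client("swf")

def run_decider(domain, task_list):
    while True:
        task = swf.poll_for_decision_task(
            domain=domain, taskList={"name": task_list})
        if not task.get("taskToken"):
            continue  # long poll timed out with no work
        # Inspect the history: if no activity has completed yet, schedule
        # the (single) processing step; otherwise finish the workflow.
        done = any(e["eventType"] == "ActivityTaskCompleted"
                   for e in task["events"])
        if not done:
            decisions = [{
                "decisionType": "ScheduleActivityTask",
                "scheduleActivityTaskDecisionAttributes": {
                    "activityType": {"name": "ProcessDocument", "version": "1"},
                    "activityId": "process-1",
                    "taskList": {"name": "doc-workers"},
                },
            }]
        else:
            decisions = [{"decisionType": "CompleteWorkflowExecution"}]
        swf.respond_decision_task_completed(
            taskToken=task["taskToken"], decisions=decisions)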
Activity Workers
• Read/write to S3
• Status in DynamoDB
• SWF task inputs passed between workflow steps
• Specialized workers
Best practice
Use DynamoDB for content status
Index on different columns (local indexes)
More efficient content status queries: "Give me all the items that completed step X"
Elastic service!
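In boto3 terms, that status query could look like this sketch (the table, index, and attribute names are made up for illustration):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("documents")

# "Give me all the items that completed step X" becomes an index query
# instead of a full table scan, assuming an index on the step-status column.
resp = table.query(
    IndexName="status-index",
    KeyConditionExpression=Key("batch_id").eq("uspto-2013")
                         & Key("step_x_status").eq("completed"),
)
items = resp["Items"]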
Key to scalability
File organization on S3 is key to scalability:
– 50 req/s with the naïve approach
– >1,500 req/s with reversed key prefixes
Naïve (sequential prefixes):
logs/2013-11-14T23:01:34/...
logs/2013-11-14T23:01:23/...
logs/2013-11-14T23:01:15/...
Reversed (distributed prefixes):
43:10:32T41-11-3102/logs/...
32:10:32T41-11-3102/logs/...
51:10:32T41-11-3102/logs/...
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
http://goo.gl/JnaQZV
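The reversal trick from the examples above is one line of Python; the key layout is the slide's, the function itself is an illustrative sketch:

def spread_key(timestamp, suffix):
    # Naive keys like "logs/2013-11-14T23:01:34/..." put every object behind
    # the same hot, sequential prefix. Leading with the reversed timestamp
    # makes the first characters effectively random, spreading load across
    # S3's internal index partitions.
    return "{}/logs/{}".format(timestamp[::-1], suffix)

# spread_key("2013-11-14T23:01:34", "worker7.log")
# -> "43:10:32T41-11-3102/logs/worker7.log"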
Gearing ratio?
Monitoring
Give me all the workers/instances that have not responded in the past hour
[Diagram: monitoring built from Amazon SWF components and DynamoDB]
Throttling and eventual consistency
Failed? Try again.
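boto3 already retries throttled calls for you, but the "failed? try again" principle looks roughly like this sketch (the table name is hypothetical):

import random
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def put_with_retry(item, retries=5):
    # Retry throttled writes with jittered exponential backoff.
    for attempt in range(retries):
        try:
            return dynamodb.put_item(TableName="documents", Item=item)
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code != "ProvisionedThroughputExceededException":
                raise
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError("still throttled after {} attempts".format(retries))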
Development environment
Huge benefits
100k documents: 21 days down to under 1 hour
4.5 million USPTO patents: ~30 hours
Huge benefits
Focus on our goal, faster time to market
Using Spot instances, 1/10 cost
Key SWF Takeaways
Flexibility: room for experimentation
Transparency: easy to adapt
Growing with the system: not constrained by the framework
[Diagram: Amazon SWF decider and workers]
UNSILO | www.unsilo.com
Sign up to be invited for the Public Beta
[Diagram recap: Automating Big Data Workflows, automating compute with Amazon SWF and automating data with AWS Data Pipeline]
[Diagram: AWS Data Pipeline moves data between data stores via compute resources]
AWS Data Pipeline: Your ETL in the Cloud
[Diagram: AWS Data Pipeline ETL patterns, inter-region, intra-region, and cloud-to-on-premises, moving data between S3, EMR (Hive/Pig), DynamoDB, RDS, EC2, and Redshift]
AWS Data Pipeline Patterns (ActivityWorkers)
Fred Benenson, Data Engineer
A new way to fund creative projects:
All-or-nothing fundraising.
5.1 million people have backed a project
51,000+ successful projects
44% of projects hit their goal
$872 million pledged
78% of projects raise under $10,000
51 projects raised more than $1 million
Project case study: Oculus Rift
Data @ Kickstarter
• We have many different data sources
• Some relational data, like MySQL on Amazon RDS
• Other unstructured data, like JSON stored in a third-party service like Mixpanel
• What if we want to JOIN between them in Amazon Redshift?
Case study: Find the users that have Page View A but not User Action B
• Page View A is instrumented in Mixpanel, a third-party service whose API we have access to:
{ "Page View A", { user_uid: 1231567, ... } }
• But User Action B is just the existence of a timestamp in a MySQL row:
6975, User Action B, 1231567, 2012-08-31 21:55:46
6976, User Action B, 9123811, NULL
6977, User Action B, 2913811, NULL
Redshift to the Rescue!

SELECT
  users.id,
  COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                      THEN user_actions.id
                      ELSE NULL END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
  AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id
How do we automate the data flow to keep it fresh daily?
But how do we get the data to Redshift?
This is where AWSData Pipeline comes in.
Pipeline 1: RDS to Redshift - Step 1
First, we run sqoop on Elastic MapReduce to extract MySQL tables into CSVs.
Pipeline 1: RDS to Redshift - Step 2
Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.
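A Hadoop Streaming mapper for that step can be a few lines of Python; this sketch assumes comma-separated rows and the literal token NULL (Kickstarter's actual delimiters and null markers may differ):

#!/usr/bin/env python
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    # Redshift's COPY can treat empty strings as NULLs (with EMPTYASNULL),
    # so rewrite the extracted NULL markers as empty fields.
    cleaned = ["" if f == "NULL" else f for f in fields]
    sys.stdout.write(",".join(cleaned) + "\n")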
• 150-200 gigabytes
• New DB every day, drop old tables
• Using AWS Data Pipeline's 1-day 'now' schedule
Pipeline 1: RDS to Redshift - Transfer to S3
Pipeline 1: RDS to Redshift Again
Run a similar pipeline job in parallel for our other database.
Pipeline 2: Mixpanel to Redshift - Step 1
Spin up an EC2 instance to download the day’s data from Mixpanel.
Use Elastic MapReduce to transform Mixpanel’s unstructured JSON into CSVs.
Pipeline 2: Mixpanel to Redshift - Step 2
• 9-10 GB per day
• Incremental data
• 2.2+ billion events
• Backfilled a year in 7 days
Pipeline 2: Mixpanel to Redshift - Transfer to S3
• JSON / CLI tools are crucial
• Build scripts to generate JSON (see the sketch below)
• ShellCommandActivity is powerful
• Really invest time to understand scheduling
• Use S3 as the "transport" layer
AWS Data Pipeline Best Practices
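"Build scripts to generate JSON" might look like this minimal sketch: a Python script that emits a daily ShellCommandActivity pipeline definition. The ids, command, and paths are placeholders, and a real definition also needs a runsOn compute resource:

import json

definition = {
    "objects": [
        {"id": "Default", "name": "Default",
         "scheduleType": "cron", "schedule": {"ref": "Daily"}},
        {"id": "Daily", "name": "Daily", "type": "Schedule",
         "period": "1 day", "startDateTime": "2013-11-14T00:00:00"},
        {"id": "ExtractMixpanel", "name": "ExtractMixpanel",
         "type": "ShellCommandActivity",
         "command": "python download_mixpanel.py"},
    ]
}

with open("pipeline.json", "w") as f:
    json.dump(definition, f, indent=2)

# Upload with the CLI:
#   aws datapipeline put-pipeline-definition \
#     --pipeline-id <id> --pipeline-definition file://pipeline.json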
AWS Data Pipeline Takeaways for Kickstarter
15 years ago: $1 million or more
5 years ago: Open source + staff & infrastructure
Now: ~$80 a month on AWS
“It just works”
[Diagram recap: Automating Big Data Workflows, automating compute with Amazon SWF (decider, workers) and automating data with AWS Data Pipeline (activities, data nodes)]
Big Thank You to Customer Speakers!
Jinesh Varia
@jinman
More Sessions on SWF and AWS Data Pipeline
SVC101 - 7 Use Cases in 7 Minutes Each: The Power of Workflows and Automation (Next Up in this room)
BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (Next Up in Sao Paulo 3406)
Please give us your feedback on this presentation
As a thank you, we will select prize winners daily for completed surveys!
SVC201