Field notes from expeditions in the cloud (Matt Wood, Amazon Web Services)
TRANSCRIPT
SPARK WITH AMAZON EMR
Production workloads on AWS
Security event streaming
Machine learning & ad targeting
Web analytics
Revenue forecasting
Ad targeting & recommendations
Personalization
Predictive marketing
App search
S3 MLlib on Spark
Clicks, videos, interactives
S3
Customer segments
Forecasting
Ad sales
Map user to ads
Recommendation Spark cluster
Clicks, videos, interactives
Kafka
Every click
Node.js app Kinesis
Click stream
Elastic Beanstalk
Spark Streaming
EMR cluster
5-minute windows
100 MB
S3
JSON and CSV
EMR
ElasticSearch
Redshift
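The streaming slide describes a click stream being cut into 5-minute windows and written to S3 as JSON and CSV. As a rough illustration of that windowing step only, here is a minimal pure-Python sketch. It does not use the Spark Streaming or Kinesis APIs, and the event fields (`ts`, `user`, `page`) and helper names are assumptions made up for the example.

```python
import csv
import io
import json
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # the 5-minute window named on the slide


def window_start(epoch_seconds, size=WINDOW_SECONDS):
    """Return the start time of the tumbling window containing the timestamp."""
    return epoch_seconds - (epoch_seconds % size)


def group_into_windows(events, size=WINDOW_SECONDS):
    """Group click events (dicts with a 'ts' field) by window start time."""
    windows = defaultdict(list)
    for event in events:
        windows[window_start(event["ts"], size)].append(event)
    return dict(windows)


def to_json_lines(events):
    """Serialize one window of events as newline-delimited JSON."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)


def to_csv(events, fields):
    """Serialize one window of events as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buf.getvalue()


# Hypothetical click events; the field names are illustrative only.
clicks = [
    {"ts": 0, "user": "u1", "page": "/home"},
    {"ts": 299, "user": "u2", "page": "/video"},
    {"ts": 300, "user": "u1", "page": "/ads"},
]
batches = group_into_windows(clicks)
```

Each key in `batches` is a window start time, so one window maps naturally to one JSON object and one CSV object in S3. In a real deployment the grouping would be done by the streaming engine rather than by hand.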
S3
Ad impressions & clicks
24/7 Spark cluster
ImpressionRDD
Interactive dashboard
Revenue forecast
Click stream logs
Batch Spark clusters
Redshift
Data exploration and testing
RAPID PROVISIONING OF ELASTIC CLUSTERS
Provision new clusters in minutes
High memory, high CPU, high IO instances
Access cluster instances directly
Clusters run within a VPC
Add or remove capacity on running clusters
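The provisioning points above can be sketched as request payloads for the EMR API. The example below only builds plain dictionaries and calls no AWS service. The key names follow the parameters of `run_job_flow` and `modify_instance_groups` in boto3's EMR client. The release label, role names, instance type, subnet ID, and instance group ID are assumptions for illustration, not values from the talk.

```python
def build_cluster_request(name, subnet_id, instance_type, core_count):
    """Build keyword arguments for a RunJobFlow call that starts a Spark cluster."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                    "Market": "ON_DEMAND",
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                    "Market": "ON_DEMAND",
                },
            ],
            "Ec2SubnetId": subnet_id,  # places the cluster inside a VPC subnet
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def build_resize_request(instance_group_id, new_count):
    """Build keyword arguments for resizing one instance group on a running cluster."""
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ]
    }


# Hypothetical identifiers, for illustration only.
request = build_cluster_request("spark-demo", "subnet-0example", "r3.xlarge", 4)
resize = build_resize_request("ig-EXAMPLE", 8)
```

With AWS credentials configured, these could be passed as `boto3.client("emr").run_job_flow(**request)` and `boto3.client("emr").modify_instance_groups(**resize)`.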
DIRECT ACCESS TO DATA ON S3
Access objects directly on Amazon S3
Server-side and client-side encryption with customer-controlled keys
Multiple clusters can access canonical data in S3
No need to copy or manage the data on the cluster
Mix S3 and HDFS on a cluster
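One way to picture mixing S3 and HDFS on a cluster is a job whose input and output live at canonical S3 locations while scratch data stays on cluster-local HDFS. The sketch below only parses location URIs with the standard library. The bucket name, paths, and the `describe_location` helper are hypothetical, and no S3 or HDFS calls are made.

```python
from urllib.parse import urlsplit


def describe_location(uri):
    """Classify a storage URI as S3 or HDFS and split out its parts."""
    parts = urlsplit(uri)
    if parts.scheme in ("s3", "s3a", "s3n"):
        return {
            "store": "s3",
            "bucket": parts.netloc,
            "key": parts.path.lstrip("/"),
        }
    if parts.scheme == "hdfs":
        return {"store": "hdfs", "path": parts.path}
    raise ValueError(f"unsupported storage scheme: {parts.scheme!r}")


# Hypothetical job layout: shared data on S3, scratch space on HDFS.
job = {
    "input": "s3://example-bucket/clicks/2015/",
    "scratch": "hdfs:///tmp/clicks-scratch",
    "output": "s3://example-bucket/segments/",
}
plan = {role: describe_location(uri) for role, uri in job.items()}
```

Because the input and output stay in S3, a second cluster could point at the same `s3://` locations without any copy step, and only the scratch path is tied to one cluster.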
INTEGRATION WITH THE SPOT MARKET
Bid on underutilized capacity on EC2
“Name your price” clusters
Very low cost at high scale
Lowest cost for time-insensitive workloads
On-demand and reserved capacity pricing also available
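A "name your price" cluster can be sketched as an instance group whose market is set to Spot with a bid price. The example below builds the dictionary shape that boto3's EMR `add_instance_groups` accepts in its `InstanceGroups` list. The instance type, counts, and bid value are made-up illustrations rather than real prices, and the `task_group` helper is hypothetical.

```python
def task_group(instance_type, count, bid_price=None):
    """Describe a task instance group, using Spot when a bid price is given."""
    group = {
        "Name": "task",
        "InstanceRole": "TASK",
        "InstanceType": instance_type,
        "InstanceCount": count,
    }
    if bid_price is None:
        group["Market"] = "ON_DEMAND"
    else:
        group["Market"] = "SPOT"
        # The API expects the bid as a string in USD per instance-hour.
        group["BidPrice"] = f"{bid_price:.2f}"
    return group


# Hypothetical values, for illustration only.
spot_tasks = task_group("m3.xlarge", 20, bid_price=0.1)
on_demand_tasks = task_group("m3.xlarge", 2)
```

A time-insensitive batch job might run most of its task nodes from `spot_tasks` and keep a small on-demand group so the cluster still makes progress if the Spot capacity is reclaimed.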