Capturing & Processing Real-Time Data on AWS
© 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Agenda
• Real-Time Analytics
  ◦ Data Ingestion
  ◦ Data Processing
    ▪ Architecture
    ▪ AWS Lambda
• Customer Implementations
Real-Time Analytics
Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads
Continuous Processing FX
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Elastic
• Enable multiple apps to process in parallel
Continuous data flow
Low end-to-end latency
Continuous, real-time workloads
Data Ingestion
Global top-10
foo-analysis.com
Starting simple...
Global top-10 Elastic Beanstalk foo-analysis.com
Distributing the workload…
Global top-10
Elastic Beanstalk foo-analysis.com
Local top-10
Local top-10
Local top-10
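The local/global split above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function names and the sample page-view events are hypothetical, and the merge is only exact under the assumption that every key is routed to a single worker (otherwise a key truncated from one local list loses part of its count).

```python
from collections import Counter

def local_top_n(events, n=10):
    """Count the events seen by one worker and keep its n most frequent keys."""
    return Counter(events).most_common(n)

def global_top_n(local_results, n=10):
    """Merge per-worker (key, count) lists and keep the overall top n."""
    merged = Counter()
    for result in local_results:
        for key, count in result:
            merged[key] += count
    return merged.most_common(n)

# Hypothetical page-view events, already routed so each key stays on one worker.
worker_a = ["/home", "/home", "/pricing"]
worker_b = ["/docs", "/docs", "/docs"]
worker_c = ["/blog"]

local_results = [local_top_n(w) for w in (worker_a, worker_b, worker_c)]
top = global_top_n(local_results, n=2)
print(top)
```

Routing by key is what the partition key provides once a data broker sits in front of the workers.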
Or using an Elastic Data Broker…
Global top-10
Elastic Beanstalk foo-analysis.com
KINESIS
Data Record
Stream Shard
Partition Key
Worker
My top-10
Data Record Sequence Number
14 17 18 21 23
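How a partition key selects a shard can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The even split of the hash space and the helper names below are assumptions made for illustration; a real stream reports each shard's actual range.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key

def hash_key(partition_key: str) -> int:
    """Map a partition key to its 128-bit integer hash key."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)

def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the shard owning the key, assuming evenly split hash ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE

# Records with the same partition key always land on the same shard,
# which is what keeps their sequence numbers ordered for one worker.
for key in ("user-1", "user-2", "user-1"):
    print(key, "-> shard", shard_for(key, shard_count=4))
```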
Amazon Kinesis – Managed Stream
[Diagram: Data Sources send records to the AWS Endpoint. The stream is split into Shard 1, Shard 2, … Shard N and spans three Availability Zones. App. 1 [Data Archive], App. 2 [Metric Extraction], App. 3 [Sliding Window Analysis] and App. 4 [Machine Learning] read from the stream. Outputs shown: S3, DynamoDB, Redshift, EMR.]
Amazon Kinesis – Common Data Broker
Amazon Kinesis – Distributed Streams
• From batch to continuous processing
• Scale shards elastically UP or DOWN without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
  ◦ Records stored across multiple availability zones
• Multiple parallel Kinesis Apps output to anything…
  ◦ RDBMS, S3, In-house Data Warehouse, Messaging, another stream, JavaSDK, PythonSDK, etc.
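Replaying a shard from its oldest retained record might look like the sketch below. It assumes a client object exposing the Kinesis `get_shard_iterator` and `get_records` calls (for example one created with `boto3.client("kinesis")`); the function name, the `max_calls` bound and the page size are illustrative choices, not part of the talk.

```python
def replay_shard(client, stream_name, shard_id, max_calls=10):
    """Read a shard from its oldest retained record and return the raw data blobs."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest record still retained
    )["ShardIterator"]
    blobs = []
    for _ in range(max_calls):
        response = client.get_records(ShardIterator=iterator, Limit=100)
        blobs.extend(record["Data"] for record in response["Records"])
        iterator = response.get("NextShardIterator")
        # Stop when the shard is closed or the reader has caught up with the tip.
        if iterator is None or response.get("MillisBehindLatest", 0) == 0:
            break
    return blobs
```

A worker that stores the last sequence number it processed (a checkpoint) could resume with an `AFTER_SEQUENCE_NUMBER` iterator instead of starting from `TRIM_HORIZON`.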
Data Processing
Batch
Micro Batch
Real Time
Emerging Architecture…
Data Streams
Streaming Analytics: Spark, Storm, KCL
Batch Analysis: DW, Hadoop
Data Archive
Notifications & Alerts
Dashboards / visualizations
APIs
Deep Learning
Real-time: Event-based processing
Kinesis Storm Spout
Producer Amazon Kinesis
Apache Storm
ElastiCache (Redis) Node.js Client (D3)
http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
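The core of a sliding-window count can be sketched without Storm or Redis. The class below is a hypothetical, single-process illustration that assumes events arrive in timestamp order; it is not the code from the linked post.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events whose timestamps fall within the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        self.timestamps.append(timestamp)

    def count(self, now):
        # Evict events that have slid out of the window (oldest are at the left).
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps)

window = SlidingWindowCounter(window_seconds=10)
for ts in (0, 5, 12):
    window.add(ts)
print(window.count(now=12))
```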
Micro-Batches: Drip feeding the data
http://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
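Drip feeding usually means buffering records and flushing them as a batch once the buffer is large enough or old enough. A minimal sketch of that idea follows; the class name, the thresholds and the `flush` callback are hypothetical, and in a Redshift setup the callback might write a file to S3 and issue a COPY, which is not shown here.

```python
class MicroBatcher:
    """Buffers records and flushes them in batches by size or age."""

    def __init__(self, flush, max_records=1000, max_age_seconds=60.0):
        self.flush = flush                    # callback that receives a list of records
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.buffer = []
        self.first_arrival = None

    def add(self, record, now):
        if not self.buffer:
            self.first_arrival = now          # age is measured from the oldest buffered record
        self.buffer.append(record)
        too_big = len(self.buffer) >= self.max_records
        too_old = now - self.first_arrival >= self.max_age_seconds
        if too_big or too_old:
            self.drain()

    def drain(self):
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []

batches = []
batcher = MicroBatcher(batches.append, max_records=10, max_age_seconds=60)
for record, now in (("a", 0), ("b", 30), ("c", 60), ("d", 61)):
    batcher.add(record, now)
batcher.drain()  # flush whatever is left at shutdown
print(batches)
```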
Offline Analysis
Ad-hoc Analysis
Offline Batch: Hadoop for discovery
EMR S3 Kinesis Application Producer Amazon Kinesis
EMR
Hive Pig
Cascading MapReduce
http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
Amazon Kinesis
Batch
Micro Batch
Real Time
Putting it together…
Producer Amazon Kinesis App Client
EMR S3
KCL
Apache Storm DynamoDB
Redshift
BI Tools
KCL
AWS Lambda
An event-driven computing service for dynamic applications
“AWS Lambda functions can be triggered by data stream updates from Amazon Kinesis and Amazon DynamoDB. For instance, you can watch for a pattern, such as an address, and trigger an alert.”
A focus on functions, data and events
Cloud functions
S3 event notifications
DynamoDB Streams
Kinesis events
Custom events
Stream processing
Data triggers Server-free back-end
IoT Indexing & synchronization
Putting AWS Lambda to work
Photo bucket S3
Metadata DynamoDB
Trending DynamoDB
Extract Metadata Cloud Function
Trending Cloud Function
Notify Cloud Function
SNS Push notification
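The first step of this flow, a function triggered by an S3 event notification, might start like the sketch below. It only reads the bucket and key from the event; the metadata extraction and the DynamoDB write are deliberately left out, and the function names and sample bucket are hypothetical.

```python
from urllib.parse import unquote_plus

def photo_locations(event):
    """Return (bucket, key) pairs for the objects named in an S3 event notification."""
    locations = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in the notification payload.
        locations.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return locations

def handler(event, context):
    # Placeholder for the Extract Metadata step: this sketch only logs the targets.
    for bucket, key in photo_locations(event):
        print("would extract metadata for s3://{}/{}".format(bucket, key))
    return len(event.get("Records", []))

# Hand-built sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "photo-bucket"}, "object": {"key": "cats/my+cat.jpg"}}}
    ]
}
handled = handler(sample_event, None)
```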
AWS Lambda for reactive computing
Processing Events from Kinesis
http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser.html
Write millions of events from Kinesis into Elasticsearch with only 60 lines of code! https://gist.github.com/tylr/e8baf45c07ced23ef013
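A function that processes Kinesis events receives the records base64-encoded under `Records[i].kinesis.data`. The sketch below decodes them and applies a simple rule; it assumes the producers send JSON payloads, and the `"type": "alert"` rule is a hypothetical stand-in for whatever pattern is being watched for.

```python
import base64
import json

def decode_records(event):
    """Decode the base64 payload of each Kinesis record in a Lambda event."""
    return [
        base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for record in event.get("Records", [])
    ]

def handler(event, context):
    payloads = decode_records(event)
    # Hypothetical rule: flag JSON events whose "type" field equals "alert".
    alerts = [p for p in payloads if json.loads(p).get("type") == "alert"]
    return {"received": len(payloads), "alerts": len(alerts)}

def make_record(obj):
    """Build one record the way it appears in a Kinesis event (test helper)."""
    data = base64.b64encode(json.dumps(obj).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}

sample_event = {"Records": [make_record({"type": "alert"}), make_record({"type": "view"})]}
result = handler(sample_event, None)
```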
Customer deployments on AWS
GREE International – re:Invent 2014
• GAM301 - Real-Time Game Analytics with Amazon Kinesis, Redshift, and DynamoDB
• Session: https://www.youtube.com/watch?v=ElpWlj6yi44
• Slides: http://www.slideshare.net/AmazonWebServices/gam301-realtime-game-analytics-with-amazon-kinesis-amazon-redshift-and-amazon-dynamodb-aws-reinvent-2014
Key Requirements for Analytics
Initial Requirements
• Data collection & streaming to database
• Zero data loss
• Zero data corruption
• Guaranteed data delivery
New Requirements
• Near real-time data latency
• Real-time ad-hoc analysis
• Ease of adding consumers
• Managed Service
Data Collection
Source of Data
• Mobile Devices
• Game Servers
• Ad Networks
Data Sizes
• Size of event ~ 1 KB
• 500M+ events/day
• 500 GB+/day & growing
• JSON format
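These figures allow a rough capacity estimate. The sketch below assumes the per-shard ingest limits of 1 MB/s and 1,000 records/s, treats 1 KB as 1,000 bytes, and uses the daily average only; real traffic peaks would need extra headroom. The function name is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
SHARD_WRITE_BYTES_PER_SEC = 1_000_000  # 1 MB/s ingest per shard (decimal MB assumed)
SHARD_WRITE_RECORDS_PER_SEC = 1_000    # 1,000 records/s ingest per shard

def min_shards(events_per_day, bytes_per_event):
    """Smallest shard count that fits the average ingest rate."""
    events_per_sec = events_per_day / SECONDS_PER_DAY
    by_records = events_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    by_bytes = events_per_sec * bytes_per_event / SHARD_WRITE_BYTES_PER_SEC
    return math.ceil(max(by_records, by_bytes))

# 500M events/day at ~1 KB each is roughly 5,800 events/s and 5.8 MB/s on average.
shards = min_shards(events_per_day=500_000_000, bytes_per_event=1_000)
print(shards)
```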
Architecture
SocialMetrix – re:Invent 2014
• ARC202: Real-World Real-Time Analytics
• Session: https://www.youtube.com/watch?v=NIa33ZwFa8E
• Slides: http://www.slideshare.net/zer0/arc202-arc202-real-world-real-time-analytics20141109mhfinaledit
Drivers for architecture evolution
• More customers, bigger customers
• Add new features
• Keep costs under control
Requirements at 4th iteration
• Monitor millions of social media profiles
• Make data accessible (exploration, PoC)
• Improve UI response times
• Testing our data pipelines
• Reprocessing (faster)
Architecture
[Chart: Costs and Active Customers for architecture iterations #1 to #4]
Cost over Architecture…
THANK YOU !!! http://aws.amazon.com/big-data