Capturing & Processing Real-Time Data on AWS

© 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Upload: vonhu

Post on 09-Dec-2016


TRANSCRIPT

Page 1: CAPTURING & PROCESSING REAL-TIME DATA ON AWS


CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Page 2: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Agenda

•  Real-Time Analytics
    •  Data Ingestion
    •  Data Processing
        •  Architecture
        •  AWS Lambda
•  Customer Implementations

Page 3: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Real-Time Analytics

Real-time Ingest
•  Highly Scalable
•  Durable
•  Elastic
•  Replay-able Reads

+

Continuous Processing FX
•  Load-balancing incoming streams
•  Fault-tolerance, Checkpoint / Replay
•  Elastic
•  Enable multiple apps to process in parallel

Continuous data flow
Low end-to-end latency
Continuous, real-time workloads

Page 4: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Data Ingestion

Page 5: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Global top-10

foo-analysis.com

Starting simple...

Page 6: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Global top-10 Elastic Beanstalk foo-analysis.com

Distributing the workload…

Page 7: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Global top-10

Elastic Beanstalk foo-analysis.com

Local top-10

Local top-10

Local top-10

Or using an Elastic Data Broker…
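The local top-10 / global top-10 split on this slide can be sketched roughly as below. This is only an illustration in plain Python, not code from the deck: the function names and the page-view events are hypothetical, and each worker is assumed to count its own share of the events before a single merge step.

```python
from collections import Counter

def local_top_n(events, n=10):
    """Count the events seen by one worker and keep its n most frequent keys."""
    return Counter(events).most_common(n)

def global_top_n(local_results, n=10):
    """Merge per-worker (key, count) lists into one global ranking."""
    merged = Counter()
    for result in local_results:
        for key, count in result:
            merged[key] += count
    return merged.most_common(n)

# Hypothetical page-view events, already split across three workers.
worker_a = ["/home", "/home", "/pricing"]
worker_b = ["/home", "/docs", "/docs"]
worker_c = ["/docs", "/pricing", "/docs"]

local_results = [local_top_n(w) for w in (worker_a, worker_b, worker_c)]
top = global_top_n(local_results, n=2)
print(top)
```

Merging truncated local lists is exact only when no worker drops a key that matters globally, for example when every key is routed to a single worker. Otherwise the global ranking is an approximation.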

Page 8: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Amazon Kinesis – Managed Stream

[Diagram: foo-analysis.com on Elastic Beanstalk feeding Kinesis. Labels: Data Record, Stream, Shard, Partition Key, Worker, My top-10, Global top-10, Data Record Sequence Number (14, 17, 18, 21, 23).]
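To make the partition key / shard relationship concrete, here is a minimal sketch of how a key can be mapped to a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The helper names and the evenly split ranges are assumptions for illustration; a real stream reports its own ranges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer

def even_shard_ranges(shard_count):
    """Split the 128-bit hash space into contiguous, roughly equal ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges

def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash ranges do not cover the hash space")

ranges = even_shard_ranges(4)
for key in ("user-1", "user-2", "user-1"):
    print(key, "-> shard", shard_for_key(key, ranges))
```

The same key always lands on the same shard, which is what preserves per-key ordering by sequence number.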

Page 9: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Amazon Kinesis – Common Data Broker

[Diagram labels: Data Sources; AWS Endpoint; three Availability Zones; Shard 1, Shard 2, … Shard N; App. 1, App. 2, App. 3, App. 4; [Data Archive], [Metric Extraction], [Sliding Window Analysis], [Machine Learning]; S3, DynamoDB, Redshift, EMR.]

Page 10: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Amazon Kinesis – Distributed Streams

•  From batch to continuous processing
•  Scale shards elastically UP or DOWN without losing sequencing
•  Workers can replay records for up to 24 hours
•  Scale up to GB/sec without losing durability
    •  Records stored across multiple Availability Zones
•  Multiple parallel Kinesis apps output to anything…
    •  RDBMS, S3, in-house data warehouse, messaging, another stream, Java SDK, Python SDK, etc.
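The checkpoint and replay behaviour can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the Kinesis Client Library: the `Worker` class and `read_after` helper are hypothetical, the sequence numbers reuse those from the managed-stream slide, and the payloads are made up.

```python
# Records in one shard, as (sequence_number, payload).
shard = [(14, "a"), (17, "b"), (18, "c"), (21, "d"), (23, "e")]

def read_after(shard, checkpoint):
    """Return records with a sequence number greater than the checkpoint."""
    return [rec for rec in shard if checkpoint is None or rec[0] > checkpoint]

class Worker:
    def __init__(self):
        self.checkpoint = None   # last sequence number fully processed
        self.processed = []

    def run(self, shard, fail_at=None):
        for seq, payload in read_after(shard, self.checkpoint):
            if seq == fail_at:
                raise RuntimeError(f"simulated crash at {seq}")
            self.processed.append(payload)
            self.checkpoint = seq  # checkpoint after each record

worker = Worker()
try:
    worker.run(shard, fail_at=18)   # crashes before record 18 is processed
except RuntimeError:
    pass
worker.run(shard)                   # resumes after checkpoint 17
print(worker.processed)
```

Because the stream keeps the records, the restarted worker only needs its last checkpoint to pick up where it stopped.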

Page 11: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Data Processing

Page 12: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Emerging Architecture…

[Diagram labels: Batch, Micro Batch, Real Time; Data Streams; Data Archive; DW, Hadoop; Spark, Storm, KCL; Batch Analysis; Deep Learning; Streaming Analytics; APIs; Notifications & Alerts; Dashboards/visualizations.]

Page 13: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Real-time: Event-based processing

Producer → Amazon Kinesis → Kinesis Storm Spout → Apache Storm → ElastiCache (Redis) → Node.js → Client (D3)

http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
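As a rough idea of what a sliding-window computation does with each incoming event, here is a minimal plain-Python sketch. It is not the Storm topology from the linked post: the class name, the 60-second window, and the sample events are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Counts keys seen within the last `window_seconds` of event time."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()      # (timestamp, key), oldest first
        self.counts = Counter()

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that fell out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n=10):
        return self.counts.most_common(n)

window = SlidingWindowCounter(window_seconds=60)
for ts, key in [(0, "a"), (10, "b"), (30, "a"), (65, "b"), (95, "c")]:
    window.add(ts, key)
print(window.top())
```

After the last event only the entries from timestamps 65 and 95 remain in the window, so older counts no longer influence the ranking.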

Page 14: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Micro-Batches: Drip feeding the data

http://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
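"Drip feeding" usually means buffering records and flushing them as small batches. The sketch below shows one possible size-or-age flush rule. The `MicroBatcher` class and its thresholds are hypothetical; in a real pipeline the `flush` callable might write a file to S3 and trigger a Redshift load, which is not shown here.

```python
class MicroBatcher:
    """Buffers records and flushes when the batch is big enough or old enough."""

    def __init__(self, flush, max_records=500, max_age_seconds=60.0):
        self.flush = flush                  # callable that receives a list of records
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.buffer = []
        self.opened_at = None               # time the first buffered record arrived

    def add(self, record, now):
        if not self.buffer:
            self.opened_at = now
        self.buffer.append(record)
        if (len(self.buffer) >= self.max_records
                or now - self.opened_at >= self.max_age_seconds):
            self.flush_now()

    def flush_now(self):
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []

batches = []
batcher = MicroBatcher(batches.append, max_records=3, max_age_seconds=10)
for t, rec in [(0, "r1"), (1, "r2"), (2, "r3"), (5, "r4"), (20, "r5")]:
    batcher.add(rec, now=t)
batcher.flush_now()   # flush whatever is left at shutdown
print(batches)
```

The first batch closes on size (three records) and the second on age, which is the trade-off between load efficiency and latency that micro-batching tunes.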

Page 15: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

 

Offline Batch: Hadoop for discovery

[Diagram labels: Offline Analysis; Ad-hoc Analysis; Producer; Amazon Kinesis; Kinesis Application; S3; EMR (Hive, Pig, Cascading, MapReduce).]

http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache

Page 16: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Putting it together…

[Diagram labels: Batch, Micro Batch, Real Time; Producer; Amazon Kinesis; App; Client; KCL; EMR; S3; Apache Storm; DynamoDB; Redshift; BI Tools.]

Page 17: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

AWS Lambda

An event-driven computing service for dynamic applications

“AWS Lambda functions can be triggered by data stream updates from Amazon Kinesis and Amazon DynamoDB. For instance, you can watch for a pattern, such as an address, and trigger an alert.”
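A pattern-watching function of the kind described in the quote might look roughly like this in Python. Kinesis events carry each record's payload base64-encoded under `Records[].kinesis.data`. The email-style pattern, the `fake_record` helper, and the returned alert shape are assumptions for illustration.

```python
import base64
import re

# Hypothetical pattern: anything that looks like an email address.
PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def handler(event, context):
    """Decode each Kinesis record in the batch and collect pattern matches."""
    alerts = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        if PATTERN.search(payload):
            alerts.append({
                "sequenceNumber": record["kinesis"]["sequenceNumber"],
                "payload": payload,
            })
    # A real function might publish the alerts (for example to SNS) here.
    return alerts

def fake_record(seq, text):
    """Build a minimal stand-in for one record of a Kinesis event."""
    data = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {"kinesis": {"sequenceNumber": seq, "data": data}}

event = {"Records": [fake_record("14", "ship to bob@example.com"),
                     fake_record("17", "no match here")]}
result = handler(event, None)
print(result)
```

The handler can be exercised locally with a synthetic event like the one above before it is attached to a stream.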

Page 18: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

A focus on functions, data and events

Cloud functions

S3 event notifications

DynamoDB Streams

Kinesis events

Custom events

Page 19: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Putting AWS Lambda to work

•  Stream processing
•  Data triggers
•  Server-free back-end
•  IoT
•  Indexing & synchronization

Page 20: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

AWS Lambda for reactive computing

[Diagram labels: Photo bucket (S3); Extract Metadata (Cloud Function); Metadata (DynamoDB); Trending (Cloud Function); Trending (DynamoDB); Notify (Cloud Function); SNS Push notification.]
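The first step in this flow, reacting to a new photo in S3, could be sketched as follows. S3 event notifications list the bucket name and the URL-encoded object key for each record. The function name, bucket name, and the shape of the metadata item are hypothetical, and the write to the metadata table is left out.

```python
from urllib.parse import unquote_plus

def extract_metadata(event):
    """Turn an S3 event notification into one metadata item per object."""
    items = []
    for record in event["Records"]:
        s3 = record["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in the notification.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size"),
        })
    # A real function would write these items to the metadata table.
    return items

event = {"Records": [{"s3": {
    "bucket": {"name": "photo-bucket"},          # hypothetical bucket name
    "object": {"key": "2015/beach+day.jpg", "size": 1024},
}}]}
items = extract_metadata(event)
print(items)
```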

Page 21: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Processing Events from Kinesis

http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser.html

Write millions of events from Kinesis into Elasticsearch with only 60 lines of code!
https://gist.github.com/tylr/e8baf45c07ced23ef013
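The core of such a function is turning a batch of Kinesis records into one Elasticsearch bulk request. The sketch below is not the linked gist: it is a Python illustration that assumes JSON payloads and a hypothetical index name, and it builds only the request body. Sending the request is left out, and some Elasticsearch versions also expect a document type.

```python
import base64
import json

def bulk_body(event, index="events"):
    """Build an Elasticsearch bulk request body from a Kinesis event batch."""
    lines = []
    for record in event["Records"]:
        doc = json.loads(base64.b64decode(record["kinesis"]["data"]))
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    # The bulk API expects newline-delimited JSON, each line newline-terminated.
    return "".join(line + "\n" for line in lines)

payload = base64.b64encode(json.dumps({"user": "a"}).encode("utf-8")).decode("ascii")
event = {"Records": [{"kinesis": {"data": payload}}]}
body = bulk_body(event)
print(body)
```

Batching many records into one bulk call keeps the number of HTTP requests per invocation low.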

Page 22: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Customer deployments on AWS

Page 23: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

GREE International – re:Invent 2014

•  GAM301 - Real-Time Game Analytics with Amazon Kinesis, Redshift, and DynamoDB
•  Session: https://www.youtube.com/watch?v=ElpWlj6yi44
•  Slides: http://www.slideshare.net/AmazonWebServices/gam301-realtime-game-analytics-with-amazon-kinesis-amazon-redshift-and-amazon-dynamodb-aws-reinvent-2014

Page 24: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Key Requirements for Analytics

Initial Requirements

•  Data collection & streaming to database
•  Zero data loss
•  Zero data corruption
•  Guaranteed data delivery

New Requirements

•  Near real-time data latency
•  Real-time ad-hoc analysis
•  Ease of adding consumers
•  Managed Service

Page 25: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Data Collection

Source of Data

•  Mobile Devices
•  Game Servers
•  Ad Networks

Data Sizes

•  Size of event ~ 1 KB
•  500M+ events/day
•  500 GB+/day & growing
•  JSON format
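These figures allow a back-of-the-envelope estimate of the average ingest rate and a lower bound on shard count. The per-shard write limits below (1,000 records/s and 1 MB/s) are assumed from the published Kinesis limits, and the calculation uses daily averages only, so peak traffic would need more headroom.

```python
import math

EVENTS_PER_DAY = 500_000_000      # "500M+ events/day" from the slide
EVENT_SIZE_BYTES = 1024           # "~1 KB" per event, taken as 1,024 bytes

# Per-shard write limits, assumed from the published Kinesis limits.
SHARD_RECORDS_PER_SEC = 1000
SHARD_BYTES_PER_SEC = 1024 * 1024

events_per_sec = EVENTS_PER_DAY / 86_400
bytes_per_sec = events_per_sec * EVENT_SIZE_BYTES

shards_by_records = math.ceil(events_per_sec / SHARD_RECORDS_PER_SEC)
shards_by_bytes = math.ceil(bytes_per_sec / SHARD_BYTES_PER_SEC)
min_shards = max(shards_by_records, shards_by_bytes)

print(round(events_per_sec), min_shards)
```

On average this works out to roughly 5,800 events per second, which under the assumed limits needs at least 6 shards.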

Page 26: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Architecture

Page 27: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

SocialMetrix – re:Invent 2014

•  ARC202: Real-World Real-Time Analytics
•  Session: https://www.youtube.com/watch?v=NIa33ZwFa8E
•  Slides: http://www.slideshare.net/zer0/arc202-arc202-real-world-real-time-analytics20141109mhfinaledit

Page 28: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Drivers for architecture evolution

•  More customers, bigger customers

•  Add new features

•  Keep costs under control

Page 29: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Requirements at 4th iteration

•  Monitor millions of social media profiles
•  Make data accessible (exploration, PoC)
•  Improve UI response times
•  Testing our data pipelines
•  Reprocessing (faster)

Page 30: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Architecture

Page 31: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Cost over Architecture…

[Chart: Costs and Active Customers across architecture iterations #1 to #4. Axis ticks run from 0 to 160 and from 0 to 120.]

Page 32: CAPTURING & PROCESSING REAL-TIME DATA ON AWS

THANK YOU !!! http://aws.amazon.com/big-data