analyzing big data on the fly - qcon new york · 2014-08-27 · core concepts recapped • data...

57
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Analyzing Big Data on the Fly Shawn Gandhi, Solutions Architect @shawnagram

Upload: others

Post on 20-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Analyzing Big Data on the Fly

Shawn Gandhi, Solutions Architect @shawnagram

Page 2: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter
Page 3: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

{! "payerId": "Joe",! "productCode": "AmazonS3",! "clientProductCode": "AmazonS3",! "usageType": "Bandwidth",! "operation": "PUT",! "value": "22490",! "timestamp": "1216674828"!}!

Metering Record

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326!

Common Log Entry

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]!

Syslog Entry

“SeattlePublicWater/Kinesis/123/Realtime” – 412309129140!

MQTT Record <R,AMZN ,T,G,R1>!

NASDAQ OMX Record

Page 4: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011 IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Generated data

Available for analysis

Data volume

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011 IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 5: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Abraham Wald (1902-1950)

Page 6: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter
Page 7: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter
Page 8: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Big Data: Best Served Fresh Big Data

•  Hourly server logs: were your systems misbehaving 1hr ago

•  Weekly / Monthly Bill: what you spent this billing cycle

•  Daily customer-preferences report from your web site’s click stream: what deal or ad to try next time

•  Daily fraud reports: was there fraud yesterday

Real-time Big Data

•  Amazon CloudWatch metrics: what went wrong now

•  Real-time spending alerts/caps: prevent overspending now

•  Real-time analysis: what to offer the current customer now

•  Real-time detection: block fraudulent use now

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.

Page 9: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Talk outline •  A tour of Kinesis concepts in the context of a Twitter

Trends service

•  Implementing the Twitter Trends service

•  Kinesis in the broader context of a Big Data eco-system

Page 10: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Why did we make this?

Page 11: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Sending & Reading Data from Kinesis Streams

HTTP Post

AWS SDK

LOG4J

Flume

Fluentd

Get* APIs

Kinesis Client Library + Connector Library

Apache Storm

Amazon Elastic MapReduce

Sending Reading

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.

Page 12: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

http://bit.ly/1prQaWb

Page 13: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

A tour of Kinesis concepts in the context of a Twitter Trends service

Page 14: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

twitter-trends.com

Elastic Beanstalk

twitter-trends.com

twitter-trends.com website

Page 15: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

twitter-trends.com

Too big to handle on one box

Page 16: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

twitter-trends.com

The solution: streaming map/reduce

My top-10

My top-10

My top-10

Global top-10

Page 17: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

twitter-trends.com

Core concepts

My top-10

My top-10

My top-10

Global top-10

Data record Stream

Partition key

Shard Worker

Shard: 14 17 18 21 23

Data record

Sequence number

Page 18: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

twitter-trends.com

How this relates to Kinesis

Kinesis Kinesis application

Page 19: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Core Concepts recapped •  Data record ~ tweet

•  Stream ~ all tweets (the Twitter Firehose)

•  Partition key ~ Twitter topic (every tweet record belongs to exactly one of these)

•  Shard ~ all the data records belonging to a set of Twitter topics that will get grouped together

•  Sequence number ~ each data record gets one assigned when first ingested.

•  Worker ~ processes the records of a shard in sequence number order

Page 20: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

twitter-trends.com

Using the Kinesis API directly

K I N E S I S

iterator = getShardIterator(shardId, LATEST); while (true) { [records, iterator] = getNextRecords(iterator, maxRecsToReturn); process(records); }

process(records): { for (record in records) { updateLocalTop10(record); } if (timeToDoOutput()) { writeLocalTop10ToDDB(); } }

while (true) { localTop10Lists = scanDDBTable(); updateGlobalTop10List( localTop10Lists); sleep(10); }

Page 21: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

K I N E S I S

twitter-trends.com

Challenges with using the Kinesis API directly

Kinesis application

Manual creation of workers and assignment to shards

How many workers per EC2 instance? How many EC2 instances?

Page 22: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

K I N E S I S

twitter-trends.com

Using the Kinesis Client Library

Kinesis application

Shard mgmt table

Page 23: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

K I N E S I S

twitter-trends.com

Elasticity and load balancing using Auto-Scaling Groups

Shard mgmt table

Auto scaling Group

Page 24: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

K I N E S I S

twitter-trends.com

Dynamic growth and hot spot management through stream resharding

Shard mgmt table

Auto scaling Group

Page 25: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

K I N E S I S

twitter-trends.com

The challenges of fault tolerance

X X X Shard mgmt

table

Page 26: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

K I N E S I S

twitter-trends.com

Fault tolerance support in KCL

Shard mgmt table

X Availability Zone

1

Availability Zone 3

Page 27: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Checkpoint, replay design pattern Kinesis

14 17 18 21 23

Shard-i 2 3 5 8 10

Shard ID

Lock Seq num

Shard-i

Host A

Host B

Shard ID Local top-10

Shard-i

0

10

18 X 2

3

5

8

10

14

17 18

21 23

0

3 10

Host A Host B

{#Obama: 10235, #Seahawks: 9835, …} {#Obama: 10235, #Congress: 9910, …}

10 23

14 17

18 21

23

Page 28: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Catching up from delayed processing Kinesis

14 17 18 21 23

Shard-i 2 3 5 8 10

Host A

Host B X 2

3

5

2

3

5

8

10

14 17

18 21

23

How long until we’ve caught up?

Shard throughput SLA: -  1 MBps in -  2 MBps out => Catch up time ~ down time

Unless your application can’t process 2 MBps on a single host

=> Provision more shards

Page 29: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Concurrent processing applications Kinesis

14 17 18 21 23

Shard-i 2 3 5 8 10

twitter-trends.com

spot-the-revolution.org

0 18

2

3

5

8

10

14

17 18

21 23

3 10

0 23

14 17

18 21

23

23

2

3

5

8

10

10

Page 30: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Why not combine applications? Kinesis

twitter-trends.com

twitter-stats.com

Output state & checkpoint every 10 secs

Output state & checkpoint every hour

Page 31: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Implementing twitter-trends.com

Page 32: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Creating a Kinesis Stream

Page 33: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

How many shards? Additional considerations:

-  How much processing is needed per data record/data byte?

-  How quickly do you need to catch up if one or more of your workers stalls or fails?

You can always dynamically reshard to adjust to changing workloads

Page 34: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Twitter Trends Shard Processing Code Class TwitterTrendsShardProcessor implements IRecordProcessor {

public TwitterTrendsShardProcessor() { … }

@Override

public void initialize(String shardId) { … }

@Override

public void processRecords(List<Record> records,

IRecordProcessorCheckpointer checkpointer) { … }

@Override

public void shutdown(IRecordProcessorCheckpointer checkpointer,

ShutdownReason reason) { … }

}

Page 35: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Twitter Trends Shard Processing Code Class TwitterTrendsShardProcessor implements IRecordProcessor { private Map<String, AtomicInteger> hashCount = new HashMap<>(); private long tweetsProcessed = 0; @Override public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) { computeLocalTop10(records); if ((tweetsProcessed++) >= 2000) { emitToDynamoDB(checkpointer); } } }

Page 36: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Twitter Trends Shard Processing Code private void computeLocalTop10(List<Record> records) { for (Record r : records) { String tweet = new String(r.getData().array()); String[] words = tweet.split(" \t"); for (String word : words) { if (word.startsWith("#")) { if (!hashCount.containsKey(word)) hashCount.put(word, new AtomicInteger(0)); hashCount.get(word).incrementAndGet(); } } } private void emitToDynamoDB(IRecordProcessorCheckpointer checkpointer) { persistMapToDynamoDB(hashCount); try { checkpointer.checkpoint(); } catch (IOException | KinesisClientLibDependencyException | InvalidStateException | ThrottlingException | ShutdownException e) { // Error handling } hashCount = new HashMap<>(); tweetsProcessed = 0; }

Page 37: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Kinesis as a gateway into the Big Data eco-system

Page 38: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Big Data has a Life Cycle

twitter-trends.com Twitter trends processing application

Twitter trends statistics

Twitter trends anonymized archive

Page 39: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Connector framework public interface IKinesisConnectorPipeline<T> {

IEmitter<T> getEmitter(KinesisConnectorConfiguration configuration);

IBuffer<T> getBuffer(KinesisConnectorConfiguration configuration);

ITransformer<T> getTransformer(

KinesisConnectorConfiguration configuration);

IFilter<T> getFilter(KinesisConnectorConfiguration configuration);

}

Page 40: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Transformer & Filter interfaces public interface ITransformer<T> {

public T toClass(Record record) throws IOException;

public byte[] fromClass(T record) throws IOException;

}

public interface IFilter<T> {

public boolean keepRecord(T record);

}

Page 41: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Tweet transformer for S3 archival

public class TweetTransformer implements ITransformer<TwitterRecord> {

private final JsonFactory fact =

JsonActivityFeedProcessor.getObjectMapper().getJsonFactory();

@Override

public TwitterRecord toClass(Record record) {

try {

return fact.createJsonParser(

record.getData().array()).readValueAs(TwitterRecord.class);

} catch (IOException e) {

throw new RuntimeException(e);

}

}

Page 42: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Tweet transformer (cont.) @Override public byte[] fromClass(TwitterRecord r) { SimpleDateFormat df = new SimpleDateFormat("YYYY-MM-dd HH:MM:SS"); StringBuilder b = new StringBuilder();

b.append(r.getActor().getId()).append("|")

.append(sanitize(r.getActor().getDisplayName())).append("|") .append(r.getActor().getFollowersCount()).append("|") .append(r.getActor().getFriendsCount()).append("|”)

... .append('\n');

return b.toString().replaceAll("\ufffd\ufffd\ufffd", "").getBytes(); }

}

Page 43: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Buffer interface public interface IBuffer<T> { public long getBytesToBuffer(); public long getNumRecordsToBuffer(); public boolean shouldFlush(); public void consumeRecord(T record, int recordBytes, String sequenceNumber); public void clear(); public String getFirstSequenceNumber(); public String getLastSequenceNumber();

public List<T> getRecords(); }

Page 44: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Tweet aggregation buffer for Redshift ... @Override consumeRecord(TwitterAggRecord record, int recordSize, String sequenceNumber) {

if (buffer.isEmpty()) { firstSequenceNumber = sequenceNumber; } lastSequenceNumber = sequenceNumber; TwitterAggRecord agg = null; if (!buffer.containsKey(record.getHashTag())) { agg = new TwitterAggRecord(); buffer.put(agg); byteCount.addAndGet(agg.getRecordSize()); } agg = buffer.get(record.getHashTag()); agg.aggregate(record);

}

Page 45: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Tweet Redshift aggregation transformer public class TweetTransformer implements ITransformer<TwitterAggRecord> {

private final JsonFactory fact = JsonActivityFeedProcessor.getObjectMapper().getJsonFactory();

@Override public TwitterAggRecord toClass(Record record) { try { return new TwitterAggRecord( fact.createJsonParser(

record.getData().array()).readValueAs(TwitterRecord.class)); } catch (IOException e) { throw new RuntimeException(e); } }

Page 46: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Tweet aggregation transformer (cont.) @Override public byte[] fromClass(TwitterAggRecord r) { SimpleDateFormat df = new SimpleDateFormat("YYYY-MM-dd HH:MM:SS"); StringBuilder b = new StringBuilder();

b.append(r.getActor().getId()).append("|")

.append(sanitize(r.getActor().getDisplayName())).append("|") .append(r.getActor().getFollowersCount()).append("|") .append(r.getActor().getFriendsCount()).append("|”)

... .append('\n');

return b.toString().replaceAll("\ufffd\ufffd\ufffd", "").getBytes(); }

}

Page 47: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Emitter interface public interface IEmitter<T> {

List<T> emit(UnmodifiableBuffer<T> buffer) throws IOException;

void fail(List<T> records);

void shutdown();

}

Page 48: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Example Redshift connector

K I N E S I S

Amazon Redshift

Amazon S3

Page 49: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Example Redshift connector – v2

K I N E S I S

Amazon Redshift

Amazon S3

K I N E S I S

Page 50: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

50

Easy  Administra-on      Managed  service  for  real-­‐1me  streaming  data  collec1on,  processing  and  analysis.  Simply  create  a  new  stream,  set  the  desired  level  of  capacity,  and  let  the  service  handle  the  rest.        

Real-­‐-me  Performance        Perform  con1nual  processing  on  streaming  big  data.  Processing  latencies  fall  to  a  few  seconds,  compared  with  the  minutes  or  hours  associated  with  batch  processing.            

Elas-c        Seamlessly  scale  to  match  your  data  throughput  rate  and  volume.  You  can  easily  scale  up  to  gigabytes  per  second.  The  service  will  scale  up  or  down  based  on  your  opera1onal  or  business  needs.      

S3,  Redshi:,  &  DynamoDB  Integra-on      Reliably  collect,  process,  and  transform  all  of  your  data  in  real-­‐1me  &  deliver  to  AWS  data  stores  of  choice,  with  Connectors  for  S3,  RedshiF,  and  DynamoDB.          

Build  Real-­‐-me  Applica-ons      Client  libraries  that  enable  developers  to  design  and  operate  real-­‐1me  streaming  data  processing  applica1ons.                  

Low  Cost      Cost-­‐efficient  for  workloads  of  any  scale.  You  can  get  started  by  provisioning  a  small  stream,  and  pay  low  hourly  rates  only  for  what  you  use.              

Amazon Kinesis: Key Developer Benefits

Page 51: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Thank YouYou can follow us at

http://aws.amazon.com/kinesis @shawnagram #kinesis

Page 52: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Implementing spot-the-revolution.org

Page 53: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

spot-the-revolution.org – version 1

Twitter Firehose

spot-the-revolution Storm

application Kafka cluster

ZooKeeper

X X . . . . . $$

Page 54: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

spot-the-revolution.org – Kinesis version

Storm connector

1-minute statistics

Anonymized archive

spot-the-revolution Storm

application on EC2

Page 55: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Generalizing the design pattern

Page 56: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Clickstream analytics example

Clickstream processing application

Aggregate clickstream statistics

Clickstream archive

Clickstream trend analysis

Page 57: Analyzing Big Data on the Fly - QCon New York · 2014-08-27 · Core Concepts recapped • Data record ~ tweet • Stream ~ all tweets (the Twitter Firehose) • Partition key ~ Twitter

Simple metering & billing example

Billing auditors

Incremental bill computation

Metering record archive

Billing mgmt service