Why your company needs a Unified Log

AWS User Group UK, 28th January 2015

Upload: alexander-dean

Post on 16-Jul-2015



Introducing myself

• Alex Dean

• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]

• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]

[1] https://github.com/snowplow/snowplow

[2] http://manning.com/dean

So what is a Unified Log?

A quick history lesson: the three eras of business data processing [1]

1. The classic era, 1996+

2. The hybrid era, 2005+

3. The unified era, 2013+

[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/

The classic era of business data processing, 1996+

[Diagram, OWN DATA CENTER: narrow data silos (CMS, CRM, E-comm, ERP), each with its own low-latency local loop, joined by point-to-point connections. A nightly batch ETL process loads a data warehouse used for management reporting: high latency, but wide data coverage and full data history.]

The hybrid era, 2005+

[Diagram: narrow data silos (Search, E-comm, CRM, Email marketing, ERP, CMS, Web analytics), each with its own low-latency local loop, spread across CLOUD VENDOR / OWN DATA CENTER and SAAS VENDORS #1, #2 and #3. Data leaves the silos via APIs and bulk exports. Low-latency consumers: stream processing (product recs) and micro-batch processing (systems monitoring). High-latency consumers: batch processing into the data warehouse (management reporting) and into Hadoop (ad hoc analytics).]

The hybrid era: a surfeit of software vendors


The hybrid era: company-wide reporting and analytics ends up like Rashomon

The bandit’s story vs. the wife’s story vs. the samurai’s story vs. the woodcutter’s story

The hybrid era: the number of data integrations is unsustainable

So how do we unravel the hairball?

The unified era, 2013+

[Diagram: narrow data silos (Search, E-comm, CRM, Email marketing, ERP, CMS), spread across CLOUD VENDOR / OWN DATA CENTER and SAAS VENDORS #1 and #2, keep some low-latency local loops and feed the unified log through streaming APIs / web hooks. The unified log offers low latency and wide data coverage, but only a few days’ data history. Archiving into Hadoop adds full data history with wide data coverage, at high latency. The event stream is the low-latency side. Consumers shown: systems monitoring, product recs, ad hoc analytics, management reporting, fraud detection and churn prevention. APIs are also labelled.]


The unified log is Amazon Kinesis, or Apache Kafka

• Amazon Kinesis, a hosted AWS service with extremely similar semantics to Kafka

• Apache Kafka, an append-only, distributed, ordered commit log, developed at LinkedIn to serve as their organization’s unified log

“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]

[1] http://kafka.apache.org/
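To make "append-only, ordered commit log" concrete, here is a minimal single-process sketch. It is not how Kinesis or Kafka are implemented: there is no distribution, durability or retention limit, and the class name `UnifiedLog` and its methods are hypothetical names chosen for illustration. What it does show is the core idea: events are only ever appended, each gets an ordered offset, and every consumer tracks its own position independently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class UnifiedLog:
    """Toy stand-in for a single shard/partition of a unified log (hypothetical)."""
    _records: List[Any] = field(default_factory=list)
    _offsets: Dict[str, int] = field(default_factory=dict)

    def append(self, event: Any) -> int:
        """Append an event to the end of the log and return its offset."""
        self._records.append(event)
        return len(self._records) - 1

    def read(self, consumer: str, max_records: int = 10) -> List[Any]:
        """Return the next batch for this consumer and advance only its cursor."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer: str, offset: int = 0) -> None:
        """Move a consumer's cursor back, e.g. to re-process from the start."""
        self._offsets[consumer] = offset


log = UnifiedLog()
for event in ["page_view", "add_to_basket", "checkout"]:
    log.append(event)

# Two consumers read the same events at their own pace.
monitoring = log.read("monitoring")
reporting_first = log.read("reporting", max_records=2)
reporting_rest = log.read("reporting")

# Rewinding lets one consumer replay without affecting the other.
log.rewind("reporting")
reporting_replay = log.read("reporting")
```

Because reading never removes anything, adding a new consumer or replaying an existing one does not disturb the others.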

So what does a unified log give us?

1. A single version of the truth

2. Our truth is now upstream from the data warehouse

3. The hairball of point-to-point connections has been unravelled

4. Local loops have been unbundled
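A rough back-of-the-envelope illustration of the hairball point, under a worst-case assumption that is not from the slides: every system feeds every other system directly, versus every system having at most one write connection and one read connection to the log. The function names are hypothetical.

```python
def point_to_point_feeds(systems: int) -> int:
    """Worst case: each system sends data directly to every other system."""
    return systems * (systems - 1)


def unified_log_feeds(systems: int) -> int:
    """Each system needs at most one write and one read connection to the log."""
    return 2 * systems


# Compare how the two approaches grow as systems are added.
for n in (5, 10, 20):
    print(n, point_to_point_feeds(n), unified_log_feeds(n))
```

Point-to-point integrations grow quadratically with the number of systems, while connections to a shared log grow linearly.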

What does a unified log let us do that we couldn’t do before?

Populating a unified log with your company’s event streams, to enable…

• Real-time management reporting

• Holistic systems monitoring

• Re-running models from Day 0

• A/B testing end-to-end pipelines

• Shipping offline models to real time (RT)

… anything requiring a low-latency response or a holistic view of our company’s data!

How are we embracing the unified log at Snowplow?

Some background: early on, we decided that Snowplow should be composed of a set of loosely coupled subsystems

1. Trackers: generate event data from any environment

2. Collectors: log raw events from trackers

3. Enrich: validate and enrich raw events

4. Storage: store enriched events ready for analysis

5. Analytics: analyze enriched events

A, B, C and D mark the standardised data protocols between the subsystems

These turned out to be critical to allowing us to evolve the above stack

Today most users are running a batch-based Snowplow configuration

[Diagram components: Snowplow event tracking SDK, HTTP-based event collector, Amazon S3, Hadoop-based enrichment, Amazon Redshift]

• Batch-based

• Normally run overnight; sometimes every 4-6 hours

The Snowplow batch-based flow uses Amazon S3 as a “poor man’s” unified log


Can we implement Snowplow on top of Kinesis/Kafka?

We are working on Amazon Kinesis support first; Apache Kafka + Samza will come later this year

[Diagram: Snowplow Trackers → scala-stream-collector → raw event stream → scala-kinesis-enrich → enriched event stream, with a bad raw event stream alongside. Kinesis apps and their targets: S3 sink Kinesis app (S3), Redshift sink Kinesis app (Amazon Redshift), kinesis-elasticsearch-sink (Elasticsearch), kinesis-bigquery-sink (Google BigQuery), event aggregator Kinesis app (DynamoDB). Some components are marked as not yet released.]

• Analytics on Read, for agile exploration of events, machine learning, auditing, re-processing…

• Analytics on Write, for dashboarding, audience segmentation, RTB, etc.
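The split between an enriched event stream and a bad raw event stream can be sketched as below. This is an illustration only, not Snowplow's actual enrichment code or event format: the JSON payloads, the required field names and the `is_purchase` enrichment are all hypothetical, and plain Python lists stand in for the Kinesis streams.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical minimal schema; real Snowplow events carry many more fields.
REQUIRED_FIELDS = ("event_id", "event_type", "timestamp")


def enrich(raw: str) -> Dict[str, Any]:
    """Validate one raw event and add a derived field, or raise ValueError."""
    event = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(event, dict):
        raise ValueError("event is not a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    event["is_purchase"] = event["event_type"] == "checkout"  # hypothetical enrichment
    return event


def process(raw_stream: Iterable[str]) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Route each raw event to the enriched stream or, on failure, the bad stream."""
    enriched: List[Dict[str, Any]] = []
    bad: List[Dict[str, str]] = []
    for raw in raw_stream:
        try:
            enriched.append(enrich(raw))
        except ValueError as err:
            # Keep the original payload so it can be inspected and re-processed.
            bad.append({"raw": raw, "error": str(err)})
    return enriched, bad


raw_events = [
    '{"event_id": "e1", "event_type": "checkout", "timestamp": "2015-01-28T19:05:42Z"}',
    '{"event_id": "e2", "event_type": "page_view"}',
    'not json at all',
]
enriched_stream, bad_stream = process(raw_events)
```

Writing failures to their own stream, rather than dropping them, keeps the raw payload available for auditing and later re-processing.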

Snowplow users can already write stream processing applications which leverage the Snowplow enriched event stream

[Diagram: Snowplow Trackers → scala-stream-collector → raw event stream → scala-kinesis-enrich → enriched event stream, with a bad raw event stream alongside. Options shown for consuming the enriched event stream: AWS Lambda, Apache Storm, Apache Samza, Apache Spark Streaming, Kinesis Client Library.]
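As a rough idea of what such a stream processing application might compute, here is a sketch of an Analytics on Write style aggregator that keeps per-minute counts per event type. It is framework-agnostic: in practice the batches would be handed to it by whichever consumer framework is used, and the aggregates would be written to a store rather than held in memory. The event dictionaries and their field names are hypothetical, not the real Snowplow enriched event format.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def minute_bucket(timestamp: str) -> str:
    """Truncate an ISO 8601 timestamp such as 2015-01-28T19:05:42Z to the minute."""
    return timestamp[:16]


def aggregate(events: Iterable[Dict[str, str]], counts: Optional[Counter] = None) -> Counter:
    """Fold a batch of enriched events into running (minute, event_type) counts."""
    counts = Counter() if counts is None else counts
    for event in events:
        key = (minute_bucket(event["timestamp"]), event["event_type"])
        counts[key] += 1
    return counts


# Two batches, as they might arrive from the enriched event stream.
batch_1 = [
    {"event_type": "page_view", "timestamp": "2015-01-28T19:05:42Z"},
    {"event_type": "page_view", "timestamp": "2015-01-28T19:05:59Z"},
]
batch_2 = [
    {"event_type": "checkout", "timestamp": "2015-01-28T19:06:03Z"},
    {"event_type": "page_view", "timestamp": "2015-01-28T19:06:10Z"},
]

running = aggregate(batch_1)
running = aggregate(batch_2, running)
```

The aggregates are updated as events arrive, so a dashboard reading them never has to scan the raw events.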

Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To meet up or chat, @alexcrdean on Twitter or [email protected]

Discount code: ulogprugcf (43% off Unified Log Processing eBook)