a real-time architecture using hadoop & storm - nathan bijnens & geert van landeghem -...

75
A real-time architecture using Hadoop and Storm.

Upload: jaxlondonconference

Post on 26-Jan-2015

108 views

Category:

Technology


4 download

DESCRIPTION

Presented at JAX London 2013 With the proliferation of data sources and growing user bases, the amount of data generated requires new ways for storage and processing. Hadoop opened new possibilities, yet it falls short of instant delivery. Adding stream processing using Nathan Marz’s Storm, can overcome this delay and bridge the gap to real-time aggregation and reporting. On the Batch layer all master data is kept and is immutable. Once the base data is stored a recurring process will index the data. This process reads all master data, parses it and will create new views out of it.

TRANSCRIPT

Page 1: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop and Storm.

Page 2: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 2

Speaker

Nathan Bijnens@nathan_gs

Page 3: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 3

Our Vision

Big Data

test

Volume

Page 4: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 4

Big Data

test

Velocity

Page 5: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 5

Our Vision

Volume

test

Variety

Page 6: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 6

Computing Trends

Source: Immutability Changes Everything - Pat Helland, RICON2012

Computation (CPUs) Expensive

Disk Storage Expensive

Coordination Easy(Latches Don’t Often

Hit)

DRAM Expensive

Computation Cheap (Many Core Computers)

Disk Storage Cheap(Cheap Commodity Disks)

Coordination Hard(Latches Stall a Lot, etc)

DRAM / SSD Getting Cheap

Past Current

Page 7: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 7

Credits

Nathan MarzEx-Backtype & TwitterStartup in StealthmodeStormCascalogElephantDB

manning.com/marz

Page 8: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 8

A Data System

Page 9: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 9

Not all information is equal.

Some information is derived from other pieces of information.

Data is more than Information

Page 10: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 10

Eventually you will reach the most ‘raw’ form of

information.This is the information you hold true, simple

because it exists.

Let’s call this ‘data’, very similar to ‘event’.

Data is more than Information

Page 11: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 11

Events used to manipulate the master

data.

Events - Before

Page 12: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 12

Today, events are the master data.

Events - After

Page 13: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 13

Let’s store everything.

Data System

Page 14: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 14

Data is Immutable

Events

Page 15: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 15

Data is Time Based

Events

Page 16: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 16

Capturing change traditionally

Person Location

Nathan Antwerp

Geert Dendermonde

John Ghent

Person Location

Nathan Ghent

Geert Dendermonde

John Ghent

Page 17: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 17

Capturing change

Person Location Time

Nathan Antwerp 2005-01-01

Geert Dendermonde

2011-10-08

John Ghent 2010-05-02

Nathan Ghent 2013-02-03

Person Location Timestamp

Nathan Antwerp 2005-01-01

Geert Dendermonde

2011-10-08

John Ghent 2010-05-02

Page 18: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 18

The data you query is often transformed, aggregated, ...

Rarely used in it’s original form.

Query

Page 19: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 19

Query

Query = function ( all data )

Page 20: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 20

Number of people living in each city.

Person Location Time

Nathan Antwerp 2005-01-01

Geert Dendermonde

2011-10-08

John Ghent 2010-05-02

Nathan Ghent 2013-02-03

Location Count

Ghent 2

Dendermonde 1

Page 21: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 22

Query

All Data Query

Page 22: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 23

Query: Precompute

All Data Query

Precomputed View

Page 23: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 24

Layered Architecture

Speed Layer

Batch Layer

Serving Layer

Page 24: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 25

Layered Architecture

HadoopElephantDB

Query

Incoming Data

Cassandra

Page 25: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 26

Batch Layer

Page 26: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 27

Batch Layer

HadoopElephantDB

Incoming Data

Page 27: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 28

Unrestrained computation.

Batch Layer

Page 28: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 29

No need to De-Normalize.

Batch Layer

Page 29: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 30

Horizontal scalable.

Batch Layer

Page 30: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 31

High Latency.Let’s pretend temporarily that update

latency doesn’t matter.

Batch Layer

Page 31: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 32

Functional computation, based on immutable inputs, is idempotent.

Batch Layer

Page 32: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 33

Stores master copy of data set...

Batch Layer

append only.

Page 33: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 34

Batch Layer

Page 34: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 35

Batch: View generation

Master Dataset

View #1

View #3

View #2

MapRedu

ce

MapReduce

MapReduce

Page 35: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 36

1. Take a large data set and divide it into subsets

2. Perform the same function on all subsets

3. Combine the output from all subsets

Output

MA

PR

ED

UC

E

MapReduce

DoWork() DoWork() DoWork()…

Page 36: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 37

Serialization & Schema

Catch errors as quickly as they happen. Validation on write vs on read.

Page 37: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 38

Serialization & Schema

CSV is actually a serialization language that is just poorly defined.

Page 38: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 39

Serialization & SchemaUse a format with a schema.- Thrift

- Avro

- Protobuffers

Added bonus: it’s faster & uses less space.

Page 39: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 40

Read only database.No random writes required.

Batch View Database

Page 40: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 41

Every iteration produces the Views from scratch.

Batch View Database

Page 41: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 42

Batch View DatabaseElephantDBSploutVoldemort…

Page 42: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 44

Batch Layer

Not yet absorbe

d.Data absorbed into Batch Views

Time No

w

We are not done yet…Just a few hours of data.

Page 43: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 45

Speed Layer

Page 44: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 46

Overview

HadoopElephantDB

Incoming Data

Cassandra

Page 45: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 47

Stream processing.

Speed Layer

Page 46: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 48

Continuous computation.

Speed Layer

Page 47: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 49

Transactional.

Speed Layer

Page 48: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 50

Storing a limited window of data.

Compensating for the last few hours of data.

Speed Layer

Page 49: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 51

All the complexity is isolated in the Speed layer.

If anything goes wrong, it’s auto-corrected.

Speed Layer

Page 50: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 52

CAPYou have a choice between:

Availability- Queries are eventual consistent.

Consistency- Queries are consistent.

Page 51: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 53

Some algorithms are hard to implement in real time. For those

cases we could estimate the results.

Eventual accuracy

Page 52: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 54

Speed Layer

Incoming Data

Real Time

View 1

Real Time

View 2

Page 53: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 55

StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.

Data in motion.

Page 54: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 56

Storm

Nimbus Zookeeper

Worker Node

SupervisorE

xecu

ter

Exe

cut

er

Exe

cut

er

Worker Node

SupervisorE

xecu

ter

Exe

cut

er

Exe

cut

er

Worker Node

SupervisorE

xecu

ter

Exe

cut

er

Exe

cut

er

Page 55: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 57

StormTuple

Stream

Page 56: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 58

StormSpout

Bolt

Page 57: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 59

StormGrouping

Page 58: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 60

Data IngestionKafkaFlumeScribe*MQKestrel

Page 59: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 61

Speed Layer ViewsThe views are stored in Read & Write database.- Cassandra

- Hbase

- Redis

- MySQL

- ElasticSearch

- …

Much more complex than a read only view.

Page 60: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 62

Serving Layer

Page 61: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 63

Overview

HadoopElephantDB

Query

Incoming Data

Cassandra

Page 62: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 64

Random reads

Serving Layer

Page 63: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 65

This layer queries the Batch & Real Time views and merges it.

Serving Layer

Page 64: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 66

Serving Layer

Real Time Views

Merge

Batch Views

Page 65: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 67

How to query an Average?

Serving Layer

Page 66: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 68

Overview

Page 67: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 69

Overview

HadoopElephantDB

Query

Incoming Data

Cassandra

Page 68: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 70

Lambda Architecture

Page 69: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 71

Lambda Architecture

Can discard any view, batch and real time, and just recreate everything

from the master data.

Page 70: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 72

Lambda Architecture

Mistakes are corrected via recomputation.

Write bad data? Remove the data & recompute.

Bug in view generation? Just recompute the view.

Page 71: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 73

Lambda Architecture

Data storage is highly optimized.

Page 72: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 74

Lambda Architecture

Immutability changes everything.

Page 73: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 75

Questions?

Questions?@nathan_gs & #BigDataCon13

Page 74: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 76

DataCrunchers

We enable companies in envisioning, defining and implementing a data strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.

Page 75: A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

A real-time architecture using Hadoop & Storm. #JaxLondon 77

Thank you

Thank you@nathan_gs