a real-time architecture using hadoop & storm - nathan bijnens & geert van landeghem -...
DESCRIPTION
Presented at JAX London 2013 With the proliferation of data sources and growing user bases, the amount of data generated requires new ways for storage and processing. Hadoop opened new possibilities, yet it falls short of instant delivery. Adding stream processing using Nathan Marz’s Storm, can overcome this delay and bridge the gap to real-time aggregation and reporting. On the Batch layer all master data is kept and is immutable. Once the base data is stored a recurring process will index the data. This process reads all master data, parses it and will create new views out of it.TRANSCRIPT
A real-time architecture using Hadoop and Storm.
A real-time architecture using Hadoop & Storm. #JaxLondon 2
Speaker
Nathan Bijnens@nathan_gs
A real-time architecture using Hadoop & Storm. #JaxLondon 3
Our Vision
Big Data
test
Volume
A real-time architecture using Hadoop & Storm. #JaxLondon 4
Big Data
test
Velocity
A real-time architecture using Hadoop & Storm. #JaxLondon 5
Our Vision
Volume
test
Variety
A real-time architecture using Hadoop & Storm. #JaxLondon 6
Computing Trends
Source: Immutability Changes Everything - Pat Helland, RICON2012
Computation (CPUs) Expensive
Disk Storage Expensive
Coordination Easy(Latches Don’t Often
Hit)
DRAM Expensive
Computation Cheap (Many Core Computers)
Disk Storage Cheap(Cheap Commodity Disks)
Coordination Hard(Latches Stall a Lot, etc)
DRAM / SSD Getting Cheap
Past Current
A real-time architecture using Hadoop & Storm. #JaxLondon 7
Credits
Nathan MarzEx-Backtype & TwitterStartup in StealthmodeStormCascalogElephantDB
manning.com/marz
A real-time architecture using Hadoop & Storm. #JaxLondon 8
A Data System
A real-time architecture using Hadoop & Storm. #JaxLondon 9
Not all information is equal.
Some information is derived from other pieces of information.
Data is more than Information
A real-time architecture using Hadoop & Storm. #JaxLondon 10
Eventually you will reach the most ‘raw’ form of
information.This is the information you hold true, simple
because it exists.
Let’s call this ‘data’, very similar to ‘event’.
Data is more than Information
A real-time architecture using Hadoop & Storm. #JaxLondon 11
Events used to manipulate the master
data.
Events - Before
A real-time architecture using Hadoop & Storm. #JaxLondon 12
Today, events are the master data.
Events - After
A real-time architecture using Hadoop & Storm. #JaxLondon 13
Let’s store everything.
Data System
A real-time architecture using Hadoop & Storm. #JaxLondon 14
Data is Immutable
Events
A real-time architecture using Hadoop & Storm. #JaxLondon 15
Data is Time Based
Events
A real-time architecture using Hadoop & Storm. #JaxLondon 16
Capturing change traditionally
Person Location
Nathan Antwerp
Geert Dendermonde
John Ghent
Person Location
Nathan Ghent
Geert Dendermonde
John Ghent
A real-time architecture using Hadoop & Storm. #JaxLondon 17
Capturing change
Person Location Time
Nathan Antwerp 2005-01-01
Geert Dendermonde
2011-10-08
John Ghent 2010-05-02
Nathan Ghent 2013-02-03
Person Location Timestamp
Nathan Antwerp 2005-01-01
Geert Dendermonde
2011-10-08
John Ghent 2010-05-02
A real-time architecture using Hadoop & Storm. #JaxLondon 18
The data you query is often transformed, aggregated, ...
Rarely used in it’s original form.
Query
A real-time architecture using Hadoop & Storm. #JaxLondon 19
Query
Query = function ( all data )
A real-time architecture using Hadoop & Storm. #JaxLondon 20
Number of people living in each city.
Person Location Time
Nathan Antwerp 2005-01-01
Geert Dendermonde
2011-10-08
John Ghent 2010-05-02
Nathan Ghent 2013-02-03
Location Count
Ghent 2
Dendermonde 1
A real-time architecture using Hadoop & Storm. #JaxLondon 22
Query
All Data Query
A real-time architecture using Hadoop & Storm. #JaxLondon 23
Query: Precompute
All Data Query
Precomputed View
A real-time architecture using Hadoop & Storm. #JaxLondon 24
Layered Architecture
Speed Layer
Batch Layer
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 25
Layered Architecture
HadoopElephantDB
Query
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 26
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 27
Batch Layer
HadoopElephantDB
Incoming Data
A real-time architecture using Hadoop & Storm. #JaxLondon 28
Unrestrained computation.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 29
No need to De-Normalize.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 30
Horizontal scalable.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 31
High Latency.Let’s pretend temporarily that update
latency doesn’t matter.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 32
Functional computation, based on immutable inputs, is idempotent.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 33
Stores master copy of data set...
Batch Layer
append only.
A real-time architecture using Hadoop & Storm. #JaxLondon 34
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 35
Batch: View generation
Master Dataset
View #1
View #3
View #2
MapRedu
ce
MapReduce
MapReduce
A real-time architecture using Hadoop & Storm. #JaxLondon 36
1. Take a large data set and divide it into subsets
2. Perform the same function on all subsets
3. Combine the output from all subsets
…
…
Output
MA
PR
ED
UC
E
MapReduce
DoWork() DoWork() DoWork()…
A real-time architecture using Hadoop & Storm. #JaxLondon 37
Serialization & Schema
Catch errors as quickly as they happen. Validation on write vs on read.
A real-time architecture using Hadoop & Storm. #JaxLondon 38
Serialization & Schema
CSV is actually a serialization language that is just poorly defined.
A real-time architecture using Hadoop & Storm. #JaxLondon 39
Serialization & SchemaUse a format with a schema.- Thrift
- Avro
- Protobuffers
Added bonus: it’s faster & uses less space.
A real-time architecture using Hadoop & Storm. #JaxLondon 40
Read only database.No random writes required.
Batch View Database
A real-time architecture using Hadoop & Storm. #JaxLondon 41
Every iteration produces the Views from scratch.
Batch View Database
A real-time architecture using Hadoop & Storm. #JaxLondon 42
Batch View DatabaseElephantDBSploutVoldemort…
A real-time architecture using Hadoop & Storm. #JaxLondon 44
Batch Layer
Not yet absorbe
d.Data absorbed into Batch Views
Time No
w
We are not done yet…Just a few hours of data.
A real-time architecture using Hadoop & Storm. #JaxLondon 45
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 46
Overview
HadoopElephantDB
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 47
Stream processing.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 48
Continuous computation.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 49
Transactional.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 50
Storing a limited window of data.
Compensating for the last few hours of data.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 51
All the complexity is isolated in the Speed layer.
If anything goes wrong, it’s auto-corrected.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 52
CAPYou have a choice between:
Availability- Queries are eventual consistent.
Consistency- Queries are consistent.
A real-time architecture using Hadoop & Storm. #JaxLondon 53
Some algorithms are hard to implement in real time. For those
cases we could estimate the results.
Eventual accuracy
A real-time architecture using Hadoop & Storm. #JaxLondon 54
Speed Layer
Incoming Data
Real Time
View 1
Real Time
View 2
A real-time architecture using Hadoop & Storm. #JaxLondon 55
StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.
Data in motion.
A real-time architecture using Hadoop & Storm. #JaxLondon 56
Storm
Nimbus Zookeeper
Worker Node
SupervisorE
xecu
ter
Exe
cut
er
Exe
cut
er
Worker Node
SupervisorE
xecu
ter
Exe
cut
er
Exe
cut
er
Worker Node
SupervisorE
xecu
ter
Exe
cut
er
Exe
cut
er
A real-time architecture using Hadoop & Storm. #JaxLondon 57
StormTuple
Stream
A real-time architecture using Hadoop & Storm. #JaxLondon 58
StormSpout
Bolt
A real-time architecture using Hadoop & Storm. #JaxLondon 59
StormGrouping
A real-time architecture using Hadoop & Storm. #JaxLondon 60
Data IngestionKafkaFlumeScribe*MQKestrel
A real-time architecture using Hadoop & Storm. #JaxLondon 61
Speed Layer ViewsThe views are stored in Read & Write database.- Cassandra
- Hbase
- Redis
- MySQL
- ElasticSearch
- …
Much more complex than a read only view.
A real-time architecture using Hadoop & Storm. #JaxLondon 62
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 63
Overview
HadoopElephantDB
Query
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 64
Random reads
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 65
This layer queries the Batch & Real Time views and merges it.
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 66
Serving Layer
Real Time Views
Merge
Batch Views
A real-time architecture using Hadoop & Storm. #JaxLondon 67
How to query an Average?
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 68
Overview
A real-time architecture using Hadoop & Storm. #JaxLondon 69
Overview
HadoopElephantDB
Query
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 70
Lambda Architecture
A real-time architecture using Hadoop & Storm. #JaxLondon 71
Lambda Architecture
Can discard any view, batch and real time, and just recreate everything
from the master data.
A real-time architecture using Hadoop & Storm. #JaxLondon 72
Lambda Architecture
Mistakes are corrected via recomputation.
Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.
A real-time architecture using Hadoop & Storm. #JaxLondon 73
Lambda Architecture
Data storage is highly optimized.
A real-time architecture using Hadoop & Storm. #JaxLondon 74
Lambda Architecture
Immutability changes everything.
A real-time architecture using Hadoop & Storm. #JaxLondon 75
Questions?
Questions?@nathan_gs & #BigDataCon13
A real-time architecture using Hadoop & Storm. #JaxLondon 76
DataCrunchers
We enable companies in envisioning, defining and implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.
A real-time architecture using Hadoop & Storm. #JaxLondon 77
Thank you
Thank you@nathan_gs