TRANSCRIPT
Using Riak for Events storage and analysis at Booking.com
Damien Krotkine
• Software Engineer at Booking.com
• github.com/dams
• @damsieboy
• dkrotkine
KEY FIGURES
• 600,000 hotels
• 212 countries
• 800,000 room nights every 24 hours
• 43 million+ guest reviews
• 155+ offices worldwide
• 8,600 people
• not a small website…
INTRODUCTION
[Diagram: frontends (www, api, mobi) and backends (web, mobi, api, databases, caches, load balancers, availability cluster, etc.) all send events to the events storage. Events: info about subsystems status.]
WHAT IS AN EVENT?
EVENT STRUCTURE
• Provides info about subsystems
• Data
• Deep HashMap
• Timestamp
• Type + Subtype
• The rest: specific data
• Schema-less
{
    timestamp => 12345,
    type      => 'WEB',
    subtype   => 'app',
    dc        => 1,
    action    => {
        is_normal_user => 1,
        pageview_id    => '188a362744c301c2',
        # ...
    },
    tuning    => {
        the_request => 'GET /display/...',
        bytes_body  => 35,
        wallclock   => 111,
        nr_warnings => 0,
        # ...
    },
    # ...
}
{
    type      => 'FAV',
    subtype   => 'fav',
    timestamp => 1401262979,
    dc        => 1,
    tuning    => {
        flatav => {
            cluster       => '205',
            sum_latencies => 21,
            role          => 'fav',
            num_queries   => 7,
        },
    },
}
EVENTS FLOW PROPERTIES
• Read-only
• Schema-less
• Continuous, sequential, timed
• 15K events per second
• 1.25 billion events per day
• Peak at 70 MB/s, min at 25 MB/s
• 100 GB per hour
USAGE
ASSESS THE NEEDS
• Before thinking about storage
• Think about the usage
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
GRAPHS
• Graph in real time (a few seconds of lag)
• Graph as many systems as possible
• General platform health check
GRAPHS
DASHBOARDS
META GRAPHS
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
DECISION MAKING
• Strategic decisions (use facts)
• Long term or short term
• Technical / non-technical reporting
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
SHORT TERM ANALYSIS
• From 10 sec ago -> 8 days ago
• Code deployment checks and rollback
• Anomaly Detector
USAGE
1. GRAPHS
2. DECISION MAKING
3. SHORT TERM ANALYSIS
4. A/B TESTING
A/B TESTING
• Our core philosophy: use facts
• It means: do A/B testing
• Concept of Experiments
• Events provide data to compare
EVENT AGGREGATION
EVENT AGGREGATION
• Group events
• Granularity we need: second
SERIALIZATION
• JSON didn’t work for us (slow, big, lack features)
• Created Sereal in 2012
• "Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization"
• https://github.com/Sereal/Sereal
[Diagram: many individual events streaming into the events storage.]
LOGGER
[Diagram: a LOGGER collects the streams of events emitted by web, api, and dbs; every second it gathers a 1-second batch, reserializes and compresses it, and sends it to the events storage. Many loggers (LOGGER … LOGGER) run in parallel.]
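The logger's per-second batching can be sketched as follows (a Python sketch, with `pickle` + `zlib` standing in for Sereal; the function and field names are illustrative, not Booking.com's code):

```python
import pickle
import zlib
from collections import defaultdict

def aggregate_second(events):
    """Group one second's events by (type, subtype), then
    serialize and compress each group into a single blob."""
    groups = defaultdict(list)
    for ev in events:
        groups[(ev["type"], ev["subtype"])].append(ev)
    # One compressed blob per type/subtype, ready to push to storage.
    return {k: zlib.compress(pickle.dumps(v)) for k, v in groups.items()}

events = [
    {"type": "WEB", "subtype": "app", "timestamp": 12345, "wallclock": 111},
    {"type": "WEB", "subtype": "app", "timestamp": 12345, "wallclock": 95},
    {"type": "FAV", "subtype": "fav", "timestamp": 12345, "num_queries": 7},
]
blobs = aggregate_second(events)
print(sorted(blobs.keys()))  # [('FAV', 'fav'), ('WEB', 'app')]
```

Each compressed blob then becomes one value pushed to the events storage.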
STORAGE
WHAT WE WANT
• Storage security
• Mass write performance
• Mass read performance
• Easy administration
• Very scalable
WE CHOSE RIAK
• Security: cluster, distributed, very robust
• Good and predictable read / write performance
• The easiest to setup and administrate
• Advanced features (MapReduce, triggers, 2i, CRDTs …)
• Riak Search
• Multi Datacenter Replication
CLUSTER
• Commodity hardware
• All nodes serve data
• Data replication
• Gossip between nodes
• No master
• Distributed system
[Diagram: a ring of servers; hash(key) determines where on the ring a key lives.]
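A minimal consistent-hashing sketch of that mapping (simplified: a 160-bit SHA-1 ring split into equal partitions, with owners assigned round-robin; real Riak learns the ring state from gossip, and the node names here are hypothetical):

```python
import hashlib

RING_SIZE = 2 ** 160  # SHA-1 gives a 160-bit ring, as in Riak

def ring_position(bucket, key):
    """Hash bucket + key onto the ring as a big integer."""
    digest = hashlib.sha1((bucket + key).encode()).digest()
    return int.from_bytes(digest, "big")

def primary_nodes(bucket, key, nodes, n_val=3, num_partitions=64):
    """Map the hash to a partition, then take the owners of
    n_val successive partitions (round-robin ownership)."""
    part = ring_position(bucket, key) * num_partitions // RING_SIZE
    return [nodes[(part + i) % num_partitions % len(nodes)]
            for i in range(n_val)]

nodes = ["riak1", "riak2", "riak3", "riak4"]
print(primary_nodes("metadata", "1428415043-1", nodes))
```

The same key always hashes to the same partition, so every client agrees on which nodes hold a given value.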
KEY VALUE STORE
• Namespaces: bucket
• Values: opaque or CRDTs
RIAK: ADVANCED FEATURES
• MapReduce
• Secondary indexes (2i)
• Riak Search
• Multi DataCenter Replication
MULTI-BACKEND FOR STORAGE
• Bitcask
• Eleveldb
• Memory
BACKEND: BITCASK
• Log-based storage backend
• Append-only files (AOF files)
• Advanced expiration
• Predictable performance (1 disk-seek max)
• Perfect for sequential data
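Bitcask's model can be illustrated with a toy append-only store: writes append to a log file and update an in-memory "keydir" mapping key → (offset, length), so any read costs at most one seek. (A sketch only, not Bitcask itself: no hint files, CRCs, or compaction.)

```python
import os
import tempfile

class TinyBitcask:
    """Toy append-only store in the spirit of Bitcask."""
    def __init__(self, path):
        self.keydir = {}            # key -> (offset, length); latest write wins
        self.f = open(path, "ab+")

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)         # append-only: never overwrite in place
        self.f.flush()
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.keydir[key]
        self.f.seek(offset)         # at most one seek per read
        return self.f.read(length)

fd, path = tempfile.mkstemp()
os.close(fd)
db = TinyBitcask(path)
db.put(b"k1", b"hello")
db.put(b"k1", b"world")  # supersedes the old record; compaction reclaims it
print(db.get(b"k1"))     # b'world'
```

Superseded and expired records are reclaimed later by merging the log files, which is why disk space is released in batches rather than continuously.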
CLUSTER CONFIGURATION
DISK SPACE NEEDED
• 8 days of retention
• 100 GB per hour
• Replication factor 3
• 100 GB × 24 h × 8 days × 3 = 57,600 GB
• Need ~60 TB
HARDWARE
• 12, then 16 nodes (soon 24)
• 12 CPU cores (Xeon 2.5 GHz)
• 192 GB RAM
• Network: 1 Gbit/s
• 8 TB per node (RAID 6)
• Cluster total space: 128 TB
DATA DESIGN
[Diagram: loggers aggregate 1-second batches of events from web, api, and dbs into the events storage.]
• 1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE
• 500 KB max chunks
DATA
• Bucket name: "data"
• Key: "12345:1:cell0:WEB:app:chunk0"
• Value: list of events (hashmaps), serialized & compressed
• 200 keys per second
METADATA
• Bucket name: "metadata"
• Key: <epoch>-<dc>, e.g. "1428415043-2"
• Value: list of data keys:
  [ "1428415043:1:cell0:WEB:app:chunk0",
    "1428415043:1:cell0:WEB:app:chunk1",
    …
    "1428415043:4:cell0:EMK::chunk3" ]
• Stored as pipe-separated values (PSV)
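The key scheme and the PSV metadata value can be sketched like this (a Python sketch; the values are the slide's examples):

```python
def data_key(epoch, dc, cell, type_, subtype, chunk):
    """Build a data-bucket key: epoch:dc:cell:type:subtype:chunkN."""
    return f"{epoch}:{dc}:{cell}:{type_}:{subtype}:chunk{chunk}"

def metadata_entry(epoch, dc, data_keys):
    """Key '<epoch>-<dc>', value = data keys joined as PSV."""
    return f"{epoch}-{dc}", "|".join(data_keys)

keys = [data_key(1428415043, 1, "cell0", "WEB", "app", i) for i in range(2)]
mkey, mval = metadata_entry(1428415043, 1, keys)
print(mkey)  # 1428415043-1
print(mval)  # 1428415043:1:cell0:WEB:app:chunk0|1428415043:1:cell0:WEB:app:chunk1
```

One metadata read is then enough to discover every data key written for a given second in a given DC.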
WRITE DATA
PUSH DATA IN
• In each DC, in each cell, loggers push to Riak
• Use ProtoBuf
• Every second:
  • Push data values to Riak, async
  • Wait for success
  • Push metadata
JAVA
Bucket DataBucket = riakClient.fetchBucket("data").execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();

Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
MetaDataBucket.store("12345-1", metaData).execute();
riakClient.shutdown();
Perl
my $client = Riak::Client->new(…);
$client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
$client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
$client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
$client->put(metadata => '12345-1', $metadata, 'text/plain');
READ DATA
READ ONE SECOND
• For one second (a given epoch):
• Request metadata for "<epoch>-<dc>"
• Parse the value
• Filter out unwanted types / subtypes
• Fetch the keys from the "data" bucket
Perl
my $client = Riak::Client->new(…);
my @array = split '\|', $client->get(metadata => '1428415043-1');
my @filtered_array = grep { /WEB/ } @array;
$client->get(data => $_) foreach @filtered_array;
READ AN INTERVAL
• For an interval epoch1 -> epoch2
• Generate the list of epochs
• Fetch in parallel
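A sketch of the interval read, with a hypothetical `fetch_second` standing in for the metadata + data reads shown above:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_second(epoch, dc=1):
    """Stand-in for: read metadata '<epoch>-<dc>', filter its keys,
    and fetch the matching data blobs. Here it just returns the epoch."""
    return epoch

def fetch_interval(epoch1, epoch2, workers=8):
    """Generate every epoch in [epoch1, epoch2] and fetch in parallel."""
    epochs = range(epoch1, epoch2 + 1)  # one request set per second
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_second, epochs))

print(fetch_interval(1428415043, 1428415047))
# [1428415043, 1428415044, 1428415045, 1428415046, 1428415047]
```

`pool.map` preserves order, so results come back in epoch order even though the fetches overlap.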
RIAK CLUSTER
[Graphs over one day: CPU, disk IOPS, disk I/O %, disk space reclaimed.]
REAL TIME PROCESSING OUTSIDE OF RIAK
STREAMING
• Fetch 1 second every second
• Or a range ( ex: last 10 min )
• Fetch all epochs from Riak
• Use data on the client side
EXAMPLES
• Streaming => Graphite ( every sec )
• Streaming => Anomaly Detector ( last 2 min )
• Streaming => Experiment analysis ( last day )
• Every minute => Hadoop
• Manual request => test, debug, investigate
• Batch fetch => ad hoc analysis
• => Huge number of read requests
[Diagram: the events storage feeds the graphite cluster, the anomaly detector, the experiment cluster, the hadoop cluster, mysql analysis, and manual requests, at about 50 MB/s each.]
THIS IS REALTIME
• 1 second of data
• Stored in < 1 sec
• Available after < 1 sec
• Issue: network saturation
REAL TIME PROCESSING INSIDE RIAK
THE IDEA
• Instead of
• Fetch data,
• Crunch data (ex: average),
• Produce a small result
• Do
• Bring code to data
• Crunch data on Riak
• Fetch the result
WHAT TAKES TIME
• Takes a lot of time
• Fetching data out: network issue
• Decompressing: CPU time issue
• Takes almost no time
• Crunching data
MAPREDUCE
• Input: epoch-dc
• Map1: metadata keys => data keys
• Map2: data crunching
• Reduce: aggregate
• Realtime: OK
• network usage: OK
• CPU time: NOT OK
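The shape of that MapReduce pipeline can be sketched in plain Python over mocked stores (real Riak MapReduce phases are Erlang or JavaScript functions running on the nodes; the latency values here are made up):

```python
# Mock stores: metadata key -> PSV of data keys; data key -> latency list
metadata = {"12345-1": "12345:1:cell0:WEB:app:chunk0|12345:1:cell0:WEB:app:chunk1"}
data = {
    "12345:1:cell0:WEB:app:chunk0": [10, 20, 30],
    "12345:1:cell0:WEB:app:chunk1": [40, 50],
}

def map1(meta_key):
    """Map1: expand a metadata key into its data keys."""
    return metadata[meta_key].split("|")

def map2(data_key):
    """Map2: crunch one blob into a partial (sum, count)."""
    values = data[data_key]
    return (sum(values), len(values))

def reduce_(partials):
    """Reduce: aggregate partials into an average."""
    total, count = map(sum, zip(*partials))
    return total / count

partials = [map2(k) for k in map1("12345-1")]
print(reduce_(partials))  # 30.0
```

Only the small partials and the final average leave the nodes, which is what saves network; the CPU cost of decompressing every blob stays, which is why CPU time was the blocker.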
HOOKS
• Every time metadata is written
• Post-Commit hook triggered
• Crunch data on the nodes
[Diagram: on each node host, the RIAK service's post-commit hook sends the metadata key's data keys over a socket to a local REST service, which decompresses the data, processes all tasks, and sends the result back for storage.]
HOOK CODE
metadata_stored_hook(RiakObject) ->
    Key = riak_object:key(RiakObject),
    _Bucket = riak_object:bucket(RiakObject),
    [ Epoch, _DC ] = binary:split(Key, <<"-">>),
    MetaData = riak_object:get_value(RiakObject),
    DataKeys = binary:split(MetaData, <<"|">>, [ global ]),
    %% Target the REST companion on this very node, so data stays local
    {ok, HostnameStr} = inet:gethostname(),
    send_to_REST(Epoch, list_to_binary(HostnameStr), DataKeys),
    ok.

send_to_REST(Epoch, Hostname, DataKeys) ->
    Method = post,
    URL = "http://" ++ binary_to_list(Hostname)
          ++ ":5000?epoch=" ++ binary_to_list(Epoch),
    HTTPOptions = [ { timeout, 4000 } ],
    Options = [ { body_format, string },
                { sync, false },
                { receiver, fun(_ReplyInfo) -> ok end } ],
    Body = iolist_to_binary(mochijson2:encode(DataKeys)),
    httpc:request(Method, {URL, [], "application/json", Body},
                  HTTPOptions, Options),
    ok.
REST SERVICE
• In Perl, using PSGI (WSGI-like), Starman, preforks
• Allows writing data crunchers in Perl
• Also supports loading code on demand
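A minimal sketch of such a companion service, using Python's stdlib `http.server` rather than Perl/PSGI; the `?epoch=` query parameter and the JSON body of data keys match what the hook above sends, but the handler logic is illustrative:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class CruncherHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The hook POSTs to ?epoch=<epoch> with a JSON array of data keys
        epoch = parse_qs(urlparse(self.path).query).get("epoch", [""])[0]
        length = int(self.headers.get("Content-Length", 0))
        data_keys = json.loads(self.rfile.read(length) or b"[]")
        # A real service would fetch each key locally, decompress once,
        # and run every registered task on the decoded events.
        reply = json.dumps({"epoch": epoch, "keys": len(data_keys)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # keep the sketch quiet
        pass

# To serve on the port the hook targets:
# HTTPServer(("0.0.0.0", 5000), CruncherHandler).serve_forever()
```

Because the hook fires the request asynchronously with a short timeout, a slow or crashed companion never blocks the Riak write path.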
ADVANTAGES
• CPU usage and execution time can be capped
• Data is local to processing
• Two systems are decoupled
• REST service written in any language
• Data processing done all at once
• Data is decompressed only once
DISADVANTAGES
• Only for incoming data (streaming), not old data
• Can’t easily use cross-second data
• What if the companion service goes down?
FUTURE
• Use this companion to generate optional small values
• Use Riak Search to index and search those
THE BANDWIDTH PROBLEM
• PUT - bad case
• n_val = 3
• inside usage = 3 x outside usage
• PUT - good case
• n_val = 3
• inside usage = 2 x outside usage
• GET - bad case
• inside usage = 3 x outside usage
• GET - good case
• inside usage = 2 x outside usage
• Network usage (PUT and GET), with 16 nodes:
  • bad case 13/16 of the time, good case 3/16
  • 3 × 13/16 + 2 × 3/16 = 2.81
  • plus gossip
  • => inside network > 3 × outside network
• Usually it’s not a problem
• But in our case: big values, constant PUTs, lots of GETs
• Sadly, only 1 Gbit/s
• => network bandwidth issue
THE BANDWIDTH SOLUTIONS
THE BANDWIDTH SOLUTIONS
1. Optimize GET for network usage, not speed
2. Don’t choose a node at random
• GET - bad case
• n_val = 1
• inside usage = 1 x outside
• GET - good case
• n_val = 1
• inside usage = 0 x outside
WARNING
• Possible only because data is read-only
• Data has internal checksum
• No conflict possible
• Corruption detected
RESULT
• Practical network usage halved!
THE BANDWIDTH SOLUTIONS
1. Optimize GET for network usage, not speed
2. Don’t choose a node at random
• bucket = "metadata"
• key = "12345"

Hash = hashFunction(bucket + key)
RingStatus = getRingStatus()
PrimaryNodes = Fun(Hash, RingStatus)
THE IDEA
• Do the hashing on the client
• By default: "chash_keyfun": {"mod": "riak_core_util", "fun": "chash_std_keyfun"}
• Look at the source on GitHub: riak_core/src/chash.erl
THE HASH FUNCTION
• Easy to re-implement client-side
• SHA-1 as a BigInt
• Example: sha1(bucket + key) = 2450
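Reproducing the hash client-side is a few lines of Python (a sketch of the idea; a faithful port must match the exact byte encoding used in riak_core/src/chash.erl):

```python
import hashlib

def chash(bucket, key):
    """SHA-1 of bucket + key, read as a big-endian integer:
    a BigInt position on Riak's 2^160 ring."""
    digest = hashlib.sha1((bucket + key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

h = chash("metadata", "12345")
print(h < 2 ** 160)  # True: SHA-1 digests are 160 bits
```

With the same position computed on the client as on the cluster, the client can pick the node that actually holds the data.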
RING STATUS
• If hash = 2450 => nodes 16, 17, 18

$ curl -s -XGET http://$host:8098/riak_preflists/myring
{
  "0": "[email protected]",
  "5708990770823839524233143877797980545530986496": "[email protected]",
  "11417981541647679048466287755595961091061972992": "[email protected]",
  "17126972312471518572699431633393941636592959488": "[email protected]",
  …etc…
}
WARNING
• Possible only if:
  • The nodes list is monitored
  • In case of a failed node, default to random
  • Data is requested in a uniform way
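Putting the pieces together, a sketch of the client-side node choice with the fallback to random (the ring boundaries and node names below are hypothetical; the hash 2450 is the slide's example):

```python
import bisect
import random

# Hypothetical ring status: partition boundary (BigInt) -> owning node
ring = {0: "node-16", 1000: "node-17", 2000: "node-18", 3000: "node-19"}
boundaries = sorted(ring)

def node_for(hash_value, healthy):
    """Pick the owner of the partition the hash falls into;
    if that node is down, default to a random healthy node."""
    i = bisect.bisect_right(boundaries, hash_value) - 1
    node = ring[boundaries[i]]
    return node if node in healthy else random.choice(sorted(healthy))

print(node_for(2450, healthy={"node-16", "node-17", "node-18"}))  # node-18
```

Asking the replica that owns the data directly avoids the extra internal hop a randomly chosen coordinator would add.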
RESULT
• Network usage reduced even more!
• Especially for GETs
CONCLUSION
CONCLUSION
• We used only Riak Open Source
• No training, self-taught, small team
• Riak is a great solution
• Robust, fast, scalable, easy
• Very flexible and hackable
• Helps us continue scaling
Q&A
@damsieboy