from source to solution - building a system for event-oriented data
TRANSCRIPT
“…try to fill the spaces
in between.”[1]
[1] obligatory htda reference
what we do
we build a system for the operation of modern data
centers
triage and diagnostics, exploration, trends,
advanced analytics of complex systems
our data: logs, metrics, human activity, anything
that occurs in the data center
“enterprise software” (i.e. we build for others.)
our typical customer
>1TB/hr (100s of millions of events and up), sub-second end-to-end
latency, full-fidelity retention, critical use cases
quality of service - “are credit card transactions happening fast
enough?”
fraud detection - “detect, investigate, prosecute, and learn from
fraud.”
forensic diagnostics - “what really caused the outage last friday?”
security - “who’s doing what, where, when, why, and how, and is
that ok?”
overview
guarantees
no single point of failure exists
all components scale horizontally[1]
data retention and latency are a function of cost, not tech[1]
every event is delivered if r - 1 kafka brokers are, or can
be made, available (unless you do silly stuff)
all operations, including upgrade, are online[2]
every event is delivered exactly once[2]
[1] we’re positive there’s a limit, but thus far it has been cost.
[2] from the user’s perspective, at a system level.
modeling our world
everything is an event
each event contains a timestamp, type, location,
host, service, body, and type-specific attributes
(k/v pairs)
build aggregates as necessary - just optimized
views of the data
event schema
{
  ts: long,
  event_type_id: int,
  location: string,
  host: string,
  service: string,
  body: [ null, bytes ],
  attributes: map<string>
}
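roughly, as a scala case class (a sketch; the Event name and the Option/Map choices are ours, not part of the schema):

// illustrative scala rendering of the event schema above; names other
// than the schema's own fields are assumptions
case class Event(
  ts: Long,                          // event timestamp
  eventTypeId: Int,                  // event_type_id
  location: String,
  host: String,
  service: String,
  body: Option[Array[Byte]],         // the [ null, bytes ] union
  attributes: Map[String, String]    // type-specific k/v pairs
)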
event types
some event types are standard
syslog, http, log4j, generic text record, …
users define custom event types
producer populates event type
event type metadata tells downstream systems how
to interpret body and attributes
ex: generic syslog event
event_type_id: 100, // rfc3164, rfc5424 (syslog)
body: … // raw syslog message bytes
attributes: {
  syslog_message: "DHCPACK from 10.10.0.1 (xid=0x45b63bdc)",
  syslog_severity: "6", // info severity
  syslog_facility: "3", // daemon facility
  syslog_process: "dhclient",
  syslog_pid: "668",
  …
}
Generic fields omitted
ex: generic http event
event_type_id: 102, // generic http event
body: … // raw http log message bytes
attributes: {
  http_req_method: "GET",
  http_req_vhost: "w2a-demo-02",
  http_req_url: "/api/v1/search?q=service%3Asshd&p=1&s=200",
  http_req_query: "q=service%3Asshd&p=1&s=200",
  http_resp_code: "200",
  …
}
Generic fields omitted
consumers
…do most of the work
parallelism
kafka offset management
message de-duplication
transformation (embedded library)
downstream system knowledge
inside a consumer
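a minimal sketch of that loop, assuming the java kafka client (2.x api); the de-dup check, transformation call, and downstream writers here are stand-ins, not our actual library:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object EventConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker-1:9092")
    props.put("group.id", "event-consumers")
    props.put("enable.auto.commit", "false") // the consumer manages offsets itself
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(Collections.singletonList("events"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (r <- records.asScala) {
        // (topic, partition, offset) are the "kafka coordinates" carried with the data
        val coordinates = (r.topic, r.partition, r.offset)
        if (!alreadyWritten(coordinates)) {   // de-duplicate against the downstream store
          val event = transform(r.value())    // embedded transformation library
          writeDownstream(coordinates, event) // hdfs / solr / aggregates, etc.
        }
      }
      consumer.commitSync() // commit offsets only after downstream writes succeed
    }
  }

  // stand-ins for the real de-dup check, transformation library, and writers
  def alreadyWritten(c: (String, Int, Long)): Boolean = false
  def transform(bytes: Array[Byte]): Array[Byte] = bytes
  def writeDownstream(c: (String, Int, Long), e: Array[Byte]): Unit = ()
}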
aggregation
mostly for metrics, just a specialized consumer
reduce(a: A, b: A): B
all functions are commutative and associative
spark streaming :(
(dimensions) => (aggregates)
ex: service volume by min
dimensions: yyyy, mm, dd, hh, mi, host, service
aggregates: count(1)
2014, 01, 01, 00, 00, w2a-demo-1, sshd => 17
SELECT year, month, day, hour, minute,
host, service, count(1) FROM events
GROUP BY year, month, day, hour,
minute, host, service
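roughly, as code (a sketch reusing the Event case class above and assuming ts is epoch millis; the Key type and helpers are ours):

// service volume by minute as a commutative/associative reduce over counts
case class Key(yyyy: Int, mm: Int, dd: Int, hh: Int, mi: Int, host: String, service: String)

def dimensions(e: Event): Key = {
  val t = java.time.Instant.ofEpochMilli(e.ts).atZone(java.time.ZoneOffset.UTC)
  Key(t.getYear, t.getMonthValue, t.getDayOfMonth, t.getHour, t.getMinute, e.host, e.service)
}

// count(1): map every event to 1, then reduce with +, which commutes and associates,
// so partial aggregates from different consumers can be merged in any order
def serviceVolumeByMinute(events: Seq[Event]): Map[Key, Long] =
  events.groupBy(dimensions)
        .map { case (k, es) => k -> es.map(_ => 1L).reduce(_ + _) }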
fancy math stuff
anomalous event detection
event clustering
building the models: micro-batches in spark
classification/scoring: streaming process
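a toy sketch of that split, swapping the real clustering/detection models for a simple z-score threshold:

// "building the models": fit periodically over a micro-batch of per-minute counts
case class VolumeModel(mean: Double, stddev: Double)

def fit(counts: Seq[Long]): VolumeModel = {
  val mean = counts.sum.toDouble / counts.size
  val variance = counts.map(c => (c - mean) * (c - mean)).sum / counts.size
  VolumeModel(mean, math.sqrt(variance))
}

// "classification/scoring": applied to each new observation in the streaming path
def isAnomalous(model: VolumeModel, count: Long, threshold: Double = 3.0): Boolean =
  model.stddev > 0 && math.abs(count - model.mean) / model.stddev > threshold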
application
picks the query engine based on operation
graphs and metrics = sql (impala)
event search = solr
soon: fancier realtime stuff (web sockets -> kafka)
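the routing itself is a simple dispatch; a sketch (the Operation and Engine names are illustrative):

sealed trait Operation
case object MetricsQuery extends Operation // graphs, trends, aggregates
case object EventSearch  extends Operation // needle-in-haystack event lookups

sealed trait Engine
case object Impala extends Engine // sql over the hdfs/parquet datasets
case object Solr   extends Engine // the searchable event index

def engineFor(op: Operation): Engine = op match {
  case MetricsQuery => Impala
  case EventSearch  => Solr
}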
reprocessing data
sometimes you get it wrong
include “kafka coordinates” in data
topic, partition, offset
to rebuild data:
1. pick a point in time
2. find coordinates at t - 1
3. delete and replay from there
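a sketch of step 3 with the java kafka client's seek(); Coordinates and finding them at t - 1 stand in for whatever store holds the recorded topic/partition/offset:

import java.util.Collections
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

case class Coordinates(topic: String, partition: Int, offset: Long)

def replayFrom(consumer: KafkaConsumer[Array[Byte], Array[Byte]], c: Coordinates): Unit = {
  val tp = new TopicPartition(c.topic, c.partition)
  consumer.assign(Collections.singletonList(tp)) // manual partition assignment for the replay
  consumer.seek(tp, c.offset)                    // rewind to the coordinates found at t - 1
  // the normal poll/transform/write loop then re-produces everything from that point on
}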
reprocessing data
“you can’t (always) go home” - kafka retention is
limited. not a persistent store.
solution: retain a raw+ dataset from which you can
rebuild
raw+: raw event data. enrichments are allowed, but
no destructive operations (masking, filtering, etc.)
that would preclude reprocessing.
oh the inefficiency!
kafka coordinates in data
raw+ dataset
materialized aggregates
searchable version in solr
oh the… reality?
solr =~ multiple rdbms indexes
aggregates =~ materialized views
the cost of storing kafka coordinates is largely erased by smart
hdfs storage formats (e.g. parquet)
pax-structured, dictionary encoding, RLE, bit
packing, generic compression, inline metadata
extension
custom producers
custom consumers
event types
parsers / transformations
cube definitions and aggregate functions
custom processing jobs on landed data
pain
lots of tradeoffs when picking a stream processing solution
samza: right features, but low-level programming model, not
supported by vendors. missing security features.
storm: too rigid, too slow. not supported by all vendors.
spark streaming: where do i even start? tons of issues, but has
all the community energy. our current bet, for better or worse.
stack complexity
relative (im)maturity
if you’re going to do this
read all the literature on stream processing[1]
understand, make, and provide guarantees
find the right abstractions
never trust the hand-waving or “hello worlds”
fully evaluate the projects in this space
[1] wait, like all of it? yea, like all of it.