from source to solution - building a system for event-oriented data


Page 1: from source to solution - building a system for event-oriented data

building a system for event-oriented data

[email protected] / @esammer

cto @ scalingdata

Page 2: from source to solution - building a system for event-oriented data

“…try to fill the spaces in between.”[1]

[1] obligatory htda reference

Page 3: from source to solution - building a system for event-oriented data

what we do

we build a system for the operation of modern data centers

triage and diagnostics, exploration, trends, advanced analytics of complex systems

our data: logs, metrics, human activity, anything that occurs in the data center

“enterprise software” (i.e. we build for others.)

Page 4: from source to solution - building a system for event-oriented data

our typical customer

>1TB/hr (100s of millions of events and up), sub-second end-to-end latency, full-fidelity retention, critical use cases

quality of service - “are credit card transactions happening fast enough?”

fraud detection - “detect, investigate, prosecute, and learn from fraud.”

forensic diagnostics - “what really caused the outage last friday?”

security - “who’s doing what, where, when, why, and how, and is that ok?”

Page 5: from source to solution - building a system for event-oriented data

overview

Page 6: from source to solution - building a system for event-oriented data

guarantees

no single point of failure exists

all components scale horizontally[1]

data retention and latency are a function of cost, not tech[1]

every event is delivered if r - 1 kafka brokers are, or can be made, available (unless you do silly stuff)

all operations, including upgrades, are online[2]

every event is delivered exactly once[2]

[1] we’re positive there’s a limit, but thus far it has been cost.
[2] from the user’s perspective, at a system level.

Page 7: from source to solution - building a system for event-oriented data

modeling our world

everything is an event

each event contains a timestamp, type, location, host, service, body, and type-specific attributes (k/v pairs)

build aggregates as necessary - just optimized views of the data

Page 8: from source to solution - building a system for event-oriented data

event schema

{
  ts: long,
  event_type_id: int,
  location: string,
  host: string,
  service: string,
  body: [ null, bytes ],
  attributes: map<string>
}
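
The nullable union for body and the string-valued map suggest an Avro-style record, though the deck never names the serialization format. A sketch of what this could look like as an Avro record definition, written as a Python dict purely for illustration (the format choice is an assumption):

EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "ts", "type": "long"},                 # event timestamp (epoch millis assumed)
        {"name": "event_type_id", "type": "int"},       # tells consumers how to read body/attributes
        {"name": "location", "type": "string"},
        {"name": "host", "type": "string"},
        {"name": "service", "type": "string"},
        {"name": "body", "type": ["null", "bytes"]},    # optional raw payload
        {"name": "attributes", "type": {"type": "map", "values": "string"}},
    ],
}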

Page 9: from source to solution - building a system for event-oriented data

event types

some event types are standard

syslog, http, log4j, generic text record, …

users define custom event types

producer populates event type

event type metadata tells downstream systems how to interpret body and attributes

Page 10: from source to solution - building a system for event-oriented data

ex: generic syslog event

event_type_id: 100, // rfc3164, rfc5424 (syslog)
body: … // raw syslog message bytes
attributes: {
  syslog_message: “DHCPACK from 10.10.0.1 (xid=0x45b63bdc)”,
  syslog_severity: “6”, // info severity
  syslog_facility: “3”, // daemon facility
  syslog_process: “dhclient”,
  syslog_pid: “668”,
}

Generic fields omitted
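
A minimal sketch of how a producer could populate these attributes from a raw RFC 3164 line. The function and regex are hypothetical (not from the deck) and real syslog parsing handles many more variants; severity and facility are derived from the PRI value as severity = pri % 8, facility = pri // 8:

import re

SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d+)>"                        # PRI = facility * 8 + severity
    r"(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s"        # e.g. "Apr  1 12:03:55"
    r"(?P<host>\S+)\s"
    r"(?P<process>[^\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)

def parse_syslog(line: str) -> dict:
    m = SYSLOG_RE.match(line)
    if m is None:
        return {}  # fall back to generic text record handling
    pri = int(m.group("pri"))
    return {
        "syslog_message": m.group("message"),
        "syslog_severity": str(pri % 8),
        "syslog_facility": str(pri // 8),
        "syslog_process": m.group("process"),
        "syslog_pid": m.group("pid") or "",
    }

# severity 6 (info), facility 3 (daemon), matching the slide above
print(parse_syslog("<30>Apr  1 12:03:55 w2a-demo-02 dhclient[668]: DHCPACK from 10.10.0.1 (xid=0x45b63bdc)"))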

Page 11: from source to solution - building a system for event-oriented data

ex: generic http event

event_type_id: 102, // generic http event
body: … // raw http log message bytes
attributes: {
  http_req_method: “GET”,
  http_req_vhost: “w2a-demo-02”,
  http_req_url: “/api/v1/search?q=service%3Asshd&p=1&s=200”,
  http_req_query: “q=service%3Asshd&p=1&s=200”,
  http_resp_code: “200”,
}

Generic fields omitted

Page 12: from source to solution - building a system for event-oriented data

consumers

…do most of the work

parallelism

kafka offset management

message de-duplication

transformation (embedded library)

downstream system knowledge
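
A rough sketch of those responsibilities, using kafka-python as a stand-in client (the deck doesn’t name one). Offsets are committed only after the event is handed to the downstream system, and the kafka coordinates carried on each event give the sink an idempotence / de-duplication key; the topic name and sink are hypothetical:

from kafka import KafkaConsumer   # kafka-python; an assumption, not named in the deck

def transform(raw: bytes) -> dict:
    """Stand-in for the embedded transformation library."""
    return {"body": raw}

def write_downstream(event: dict) -> None:
    """Hypothetical sink (hdfs/solr writer); must be idempotent on kafka coordinates."""
    print(event)

consumer = KafkaConsumer(
    "events",                              # hypothetical topic name
    bootstrap_servers="kafka:9092",
    group_id="event-consumers",
    enable_auto_commit=False,              # offsets are managed explicitly
)

for msg in consumer:
    event = transform(msg.value)
    # kafka coordinates travel with the event: de-duplication key + reprocessing handle
    event.update(kafka_topic=msg.topic, kafka_partition=msg.partition, kafka_offset=msg.offset)
    write_downstream(event)
    consumer.commit()                      # commit only after the downstream write succeeds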


Page 14: from source to solution - building a system for event-oriented data

inside a consumer

Page 15: from source to solution - building a system for event-oriented data

aggregation

mostly for metrics, just a specialized consumer

reduce(a: A, b: A): B

all functions are commutative and associative

spark streaming :(

(dimensions) => (aggregates)

Page 16: from source to solution - building a system for event-oriented data

ex: service volume by min

dimensions: yyyy, mm, dd, hh, mi, host, service

aggregates: count(1)

2014, 01, 01, 00, 00, w2a-demo-1, sshd => 17

SELECT year, month, day, hour, minute, host, service, count(1)
FROM events
GROUP BY year, month, day, hour, minute, host, service
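
A plain-Python stand-in for the (dimensions) => (aggregates) shape of this job (the real pipeline runs on Spark Streaming): counts are keyed by the dimension tuple and merged with an operation that is commutative and associative, so partial aggregates from any partition or micro-batch combine in any order. Field names follow the event schema on page 8; the sample events are made up:

from collections import Counter
from datetime import datetime, timezone

def dimensions(event: dict) -> tuple:
    t = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)   # ts assumed to be epoch millis
    return (t.year, t.month, t.day, t.hour, t.minute, event["host"], event["service"])

def aggregate(events) -> Counter:
    return Counter(dimensions(e) for e in events)        # count(1) per dimension tuple

def reduce_aggregates(a: Counter, b: Counter) -> Counter:
    return a + b                                          # commutative and associative merge

batch1 = aggregate([{"ts": 1388534400000, "host": "w2a-demo-1", "service": "sshd"}])
batch2 = aggregate([{"ts": 1388534400000, "host": "w2a-demo-1", "service": "sshd"}] * 16)
print(reduce_aggregates(batch1, batch2))
# Counter({(2014, 1, 1, 0, 0, 'w2a-demo-1', 'sshd'): 17})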

Page 17: from source to solution - building a system for event-oriented data

fancy math stuff

anomalous event detection

event clustering

building the models: micro-batches in spark

classification/scoring: streaming process
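
The deck builds the models in Spark micro-batches and scores in a streaming process; as a toy stand-in for that split, here is the same shape using scikit-learn’s MiniBatchKMeans for event clustering, with a made-up featurization (everything in this sketch is an assumption, for shape only):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=8, random_state=42)

def features(event: dict) -> np.ndarray:
    # hypothetical featurization: hour of day and raw body length
    return np.array([(event["ts"] // 3600000) % 24, len(event.get("body", b""))])

def train_on_micro_batch(events: list) -> None:
    # model building: incremental fit on each micro-batch of events
    model.partial_fit(np.vstack([features(e) for e in events]))

def score(event: dict) -> float:
    # streaming scoring: distance to the nearest cluster centre as a crude anomaly score
    x = features(event).reshape(1, -1)
    return float(np.min(np.linalg.norm(model.cluster_centers_ - x, axis=1)))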

Page 18: from source to solution - building a system for event-oriented data

application

picks the query engine based on operation

graphs and metrics = sql (impala)

event search = solr

soon: fancier realtime stuff (web sockets -> kafka)
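
A minimal sketch of “picks the query engine based on operation”. The routing rule comes from the slide; the operation names and client functions are hypothetical stubs, not the product’s actual API:

def run_query(operation: str, query: str) -> list:
    if operation in ("graph", "metric"):
        return impala_sql(query)          # graphs and metrics -> sql (impala)
    if operation == "event_search":
        return solr_search(query)         # event search -> solr
    raise ValueError(f"unknown operation: {operation}")

def impala_sql(sql: str) -> list:
    return []    # stub for an impala/sql client

def solr_search(q: str) -> list:
    return []    # stub for a solr client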

Page 19: from source to solution - building a system for event-oriented data

reprocessing data

sometimes you get it wrong

include “kafka coordinates” in data

topic, partition, offset

to rebuild data: 1. pick a point in time, 2. find coordinates at t - 1, 3. delete and replay from there.
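
A sketch of the replay step using kafka-python’s manual assignment and seek (the client library is an assumption; the deck doesn’t specify one). The coordinates are assumed to have been looked up from the stored data at t - 1:

from kafka import KafkaConsumer, TopicPartition

def replay_from(topic: str, partition: int, offset: int):
    consumer = KafkaConsumer(bootstrap_servers="kafka:9092",
                             enable_auto_commit=False)
    tp = TopicPartition(topic, partition)
    consumer.assign([tp])          # manual assignment, outside any consumer group rebalance
    consumer.seek(tp, offset)      # rewind to the coordinates found at t - 1
    for msg in consumer:
        yield msg                  # feed each replayed message back through the normal consumer pipeline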

Page 20: from source to solution - building a system for event-oriented data

reprocessing data

“you can’t (always) go home” - kafka retention is limited. not a persistent store.

solution: retain a raw+ dataset from which you can rebuild

raw+: raw event data. enrichments are allowed, but no destructive operations (masking, filtering, etc.) that would preclude reprocessing.

Page 21: from source to solution - building a system for event-oriented data

oh the inefficiency!

kafka coordinates in data

raw+ dataset

materialized aggregates

searchable version in solr

Page 22: from source to solution - building a system for event-oriented data

oh the… reality?

solr =~ multiple rdbms indexes

aggregates =~ materialized views

kafka coordinate storage is largely erased by smart hdfs storage formats (e.g. parquet)

pax structured, dictionary encoding, RLE, bit packing, generic compression, inline metadata
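
A small illustration of why per-row kafka coordinates are cheap in a columnar format, written with pyarrow (an assumption; the deck doesn’t name a writer): a near-constant topic column dictionary-encodes to a handful of values, and monotonically increasing offsets compress well under RLE/bit packing plus generic compression:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "kafka_topic":     ["events"] * 4,            # one distinct value -> tiny dictionary
    "kafka_partition": [0, 0, 1, 1],
    "kafka_offset":    [1001, 1002, 5000, 5001],  # monotonic per partition -> compresses well
    "host":            ["w2a-demo-1", "w2a-demo-1", "w2a-demo-2", "w2a-demo-2"],
    "service":         ["sshd", "sshd", "httpd", "httpd"],
})

pq.write_table(table, "events.parquet",
               use_dictionary=True,               # dictionary encoding
               compression="snappy")              # generic compression on top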

Page 23: from source to solution - building a system for event-oriented data

extension

custom producers

custom consumers

event types

parsers / transformations

cube definitions and aggregate functions

custom processing jobs on landed data
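
The extension points above are listed but not specified in the deck. One hypothetical shape for the event type and parser/transformation pieces (all names here are invented for illustration):

from abc import ABC, abstractmethod

GENERIC_TEXT = 199      # hypothetical custom event_type_id

class EventParser(ABC):
    """Turns a raw event body into the attribute map for one event type."""

    @abstractmethod
    def parse(self, body: bytes) -> dict: ...

class GenericTextParser(EventParser):
    def parse(self, body: bytes) -> dict:
        return {"text_message": body.decode("utf-8", errors="replace")}

PARSERS = {}    # event_type_id -> parser, consulted by the consumer's transformation step

def register_parser(event_type_id: int, parser: EventParser) -> None:
    PARSERS[event_type_id] = parser

register_parser(GENERIC_TEXT, GenericTextParser())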

Page 24: from source to solution - building a system for event-oriented data

pain

lots of tradeoffs when picking a stream processing solution

samza: right features, but low-level programming model, not supported by vendors. missing security features.

storm: too rigid, too slow. not supported by all vendors.

spark streaming: where do i even start? tons of issues, but has all the community energy. our current bet, for better or worse.

stack complexity

relative (im)maturity

Page 25: from source to solution - building a system for event-oriented data

if you’re going to do this

read all the literature on stream processing[1]

understand, make, and provide guarantees

find the right abstractions

never trust the hand waving or “hello worlds”

fully evaluate the projects in this space

[1] wait, like all of it? yea, like all of it.

Page 26: from source to solution - building a system for event-oriented data

thanks <3

questions?

[email protected] / @esammer

http://www.scalingdata.com

(hiring)