from source to solution - building a system for event-oriented data
TRANSCRIPT
“…try to fill the spaces
in between.”[1]
[1] obligatory htda reference
what we do
we build a system for the operation of modern data
centers
triage and diagnostics, exploration, trends,
advanced analytics of complex systems
our data: logs, metrics, human activity, anything
that occurs in the data center
“enterprise software” (i.e. we build for others.)
our typical customer
>1TB/hr (100s of millions of events and up), sub-second end-to-end
latency, full-fidelity retention, critical use cases
quality of service - “are credit card transactions happening fast
enough?”
fraud detection - “detect, investigate, prosecute, and learn from
fraud.”
forensic diagnostics - “what really caused the outage last friday?”
security - “who’s doing what, where, when, why, and how, and is
that ok?”
overview
guarantees
no single point of failure exists
all components scale horizontally[1]
data retention and latency are a function of cost, not tech[1]
every event is delivered if r - 1 kafka brokers are, or can
be made, available (unless you do silly stuff)
all operations, including upgrade, are online[2]
every event is delivered exactly once[2]
[1] we’re positive there’s a limit, but thus far it has been cost.
[2] from the user’s perspective, at a system level.
modeling our world
everything is an event
each event contains a timestamp, type, location,
host, service, body, and type-specific attributes
(k/v pairs)
build aggregates as necessary - just optimized
views of the data
event schema
{
  ts: long,
  event_type_id: int,
  location: string,
  host: string,
  service: string,
  body: [ null, bytes ],
  attributes: map<string>
}
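roughly, as a scala case class (a sketch; the Event name and the Option/Map choices are ours, not part of the schema):

// illustrative scala rendering of the event schema above; names other
// than the schema's own fields are assumptions
case class Event(
  ts: Long,                          // event timestamp
  eventTypeId: Int,                  // event_type_id
  location: String,
  host: String,
  service: String,
  body: Option[Array[Byte]],         // the [ null, bytes ] union
  attributes: Map[String, String]    // type-specific k/v pairs
)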
event types
some event types are standard
syslog, http, log4j, generic text record, …
users define custom event types
producer populates event type
event type metadata tells downstream systems how
to interpret body and attributes
ex: generic syslog event
event_type_id: 100, // rfc3164, rfc5424 (syslog)
body: … // raw syslog message bytes
attributes: {
  syslog_message: "DHCPACK from 10.10.0.1 (xid=0x45b63bdc)",
  syslog_severity: "6", // info severity
  syslog_facility: "3", // daemon facility
  syslog_process: "dhclient",
  syslog_pid: "668",
  …
}
Generic fields omitted
ex: generic http event
event_type_id: 102, // generic http event
body: … // raw http log message bytes
attributes: {
  http_req_method: "GET",
  http_req_vhost: "w2a-demo-02",
  http_req_url: "/api/v1/search?q=service%3Asshd&p=1&s=200",
  http_req_query: "q=service%3Asshd&p=1&s=200",
  http_resp_code: "200",
  …
}
Generic fields omitted
consumers
…do most of the work
parallelism
kafka offset management
message de-duplication
transformation (embedded library)
downstream system knowledge
inside a consumer
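a minimal sketch of that loop, assuming the java kafka client (2.x api); the de-dup check, transformation call, and downstream writers here are stand-ins, not our actual library:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object EventConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker-1:9092")
    props.put("group.id", "event-consumers")
    props.put("enable.auto.commit", "false") // the consumer manages offsets itself
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(Collections.singletonList("events"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (r <- records.asScala) {
        // (topic, partition, offset) are the "kafka coordinates" carried with the data
        val coordinates = (r.topic, r.partition, r.offset)
        if (!alreadyWritten(coordinates)) {   // de-duplicate against the downstream store
          val event = transform(r.value())    // embedded transformation library
          writeDownstream(coordinates, event) // hdfs / solr / aggregates, etc.
        }
      }
      consumer.commitSync() // commit offsets only after downstream writes succeed
    }
  }

  // stand-ins for the real de-dup check, transformation library, and writers
  def alreadyWritten(c: (String, Int, Long)): Boolean = false
  def transform(bytes: Array[Byte]): Array[Byte] = bytes
  def writeDownstream(c: (String, Int, Long), e: Array[Byte]): Unit = ()
}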
aggregation
mostly for metrics, just a specialized consumer
reduce(a: A, b: A): B
all functions are commutative and associative
spark streaming :(
(dimensions) => (aggregates)
ex: service volume by min
dimensions: yyyy, mm, dd, hh, mi, host, service
aggregates: count(1)
2014, 01, 01, 00, 00, w2a-demo-1, sshd => 17
SELECT year, month, day, hour, minute,
host, service, count(1) FROM events
GROUP BY year, month, day, hour,
minute, host, service
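roughly, as code (a sketch reusing the Event case class above and assuming ts is epoch millis; the Key type and helpers are ours):

// service volume by minute as a commutative/associative reduce over counts
case class Key(yyyy: Int, mm: Int, dd: Int, hh: Int, mi: Int, host: String, service: String)

def dimensions(e: Event): Key = {
  val t = java.time.Instant.ofEpochMilli(e.ts).atZone(java.time.ZoneOffset.UTC)
  Key(t.getYear, t.getMonthValue, t.getDayOfMonth, t.getHour, t.getMinute, e.host, e.service)
}

// count(1): map every event to 1, then reduce with +, which commutes and associates,
// so partial aggregates from different consumers can be merged in any order
def serviceVolumeByMinute(events: Seq[Event]): Map[Key, Long] =
  events.groupBy(dimensions)
        .map { case (k, es) => k -> es.map(_ => 1L).reduce(_ + _) }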
fancy math stuff
anomalous event detection
event clustering
building the models: micro-batches in spark
classification/scoring: streaming process
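a toy sketch of that split, swapping the real clustering/detection models for a simple z-score threshold:

// "building the models": fit periodically over a micro-batch of per-minute counts
case class VolumeModel(mean: Double, stddev: Double)

def fit(counts: Seq[Long]): VolumeModel = {
  val mean = counts.sum.toDouble / counts.size
  val variance = counts.map(c => (c - mean) * (c - mean)).sum / counts.size
  VolumeModel(mean, math.sqrt(variance))
}

// "classification/scoring": applied to each new observation in the streaming path
def isAnomalous(model: VolumeModel, count: Long, threshold: Double = 3.0): Boolean =
  model.stddev > 0 && math.abs(count - model.mean) / model.stddev > threshold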
application
picks the query engine based on operation
graphs and metrics = sql (impala)
event search = solr
soon: fancier realtime stuff (web sockets -> kafka)
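the routing itself is a simple dispatch; a sketch (the Operation and Engine names are illustrative):

sealed trait Operation
case object MetricsQuery extends Operation // graphs, trends, aggregates
case object EventSearch  extends Operation // needle-in-haystack event lookups

sealed trait Engine
case object Impala extends Engine // sql over the hdfs/parquet datasets
case object Solr   extends Engine // the searchable event index

def engineFor(op: Operation): Engine = op match {
  case MetricsQuery => Impala
  case EventSearch  => Solr
}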
reprocessing data
sometimes you get it wrong
include “kafka coordinates” in data
topic, partition, offset
to rebuild data:
1. pick a point in time
2. find coordinates at t - 1
3. delete and replay from there
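a sketch of step 3 with the java kafka client's seek(); Coordinates and finding them at t - 1 stand in for whatever store holds the recorded topic/partition/offset:

import java.util.Collections
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

case class Coordinates(topic: String, partition: Int, offset: Long)

def replayFrom(consumer: KafkaConsumer[Array[Byte], Array[Byte]], c: Coordinates): Unit = {
  val tp = new TopicPartition(c.topic, c.partition)
  consumer.assign(Collections.singletonList(tp)) // manual partition assignment for the replay
  consumer.seek(tp, c.offset)                    // rewind to the coordinates found at t - 1
  // the normal poll/transform/write loop then re-produces everything from that point on
}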
reprocessing data
“you can’t (always) go home” - kafka retention is
limited. not a persistent store.
solution: retain a raw+ dataset from which you can
rebuild
raw+: raw event data. enrichments are allowed, but
no destructive operations (masking, filtering, etc.)
that would preclude reprocessing.
oh the inefficiency!
kafka coordinates in data
raw+ dataset
materialized aggregates
searchable version in solr
oh the… reality?
solr =~ multiple rdbms indexes
aggregates =~ materialized views
the cost of storing kafka coordinates is largely erased by smart
hdfs storage formats (e.g. parquet)
pax-structured, dictionary encoding, RLE, bit
packing, generic compression, inline metadata
extension
custom producers
custom consumers
event types
parsers / transformations
cube definitions and aggregate functions
custom processing jobs on landed data
pain
lots of tradeoffs when picking a stream processing solution
samza: right features, but low-level programming model, not
supported by vendors. missing security features.
storm: too rigid, too slow. not supported by all vendors.
spark streaming: where do i even start? tons of issues, but has
all the community energy. our current bet, for better or worse.
stack complexity
relative (im)maturity
if you’re going to do this
read all the literature on stream processing[1]
understand, make, and provide guarantees
find the right abstractions
never trust the hand-waving or “hello worlds”
fully evaluate the projects in this space
[1] wait, like all of it? yea, like all of it.