sampling is hard
TRANSCRIPT
SAMPLING A STREAM OF EVENTS WITH A SKETCH
PREETAM JINKA • BARON SCHWARTZ • VIVIDCORTEX
MONITORAMA • JUNE 2015
INTRODUCTIONS
VividCortex is the best way to see what your databases are doing in production
Preetam Jinka, Software Engineer
@PreetamJinka
Baron Schwartz, CEO/Founder
@xaprb
A STREAM OF EVENTS IN TIME
[Figure: a stream of events plotted along a time axis]
COMPUTE METRICS ABOUT THE EVENTS
METRICS ALONE ARE NOT ENOUGH
WE WANT SAMPLES OF THESE EVENTS
REPRESENTATIVE SAMPLING IS HARD
It’s Hard To Pick Individual Samples
REPRESENTATIVE SAMPLING MATTERS
EVENTS ARE DIVERSE AND COMPLEX
GOALS
Sample enough events but not too many
Select representative events
Bias “important” events
Avoid “private” events
Balance sampling between rare and frequent events
Achieve desired overall sampling rate
CONFLICTING GOALS
Bias towards “important” events, versus rate limiting
Rare versus frequent versus overall sampling rate
Correctness versus efficiency
POSSIBLE APPROACHES
Select every Nth event
Select worst event per time period
Select random event per time period
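As a rough illustration only (not the approach the talk ends up with), the first option above could look like this in Go; the counter and the value of N are assumptions for the example:

package main

import "fmt"

func main() {
	const n = 100 // assumed: keep 1 of every 100 events
	for i := 1; i <= 1000; i++ {
		if i%n == 0 {
			fmt.Println("sampled event", i)
		}
	}
}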
STATISTICS TO THE RESCUE?
If events are generated by a Poisson process, then:
Constant average rate, exponential inter-arrival times
If we choose samples using an exponential probability, then samples would be Poisson too
[Figure: selection probability rises with the time since the last event and approaches 1.0]
Probability of selecting an event, given time t since the last event: 1 - e^(-λt)
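A minimal Go sketch of that idea (an illustration, not the production code): compute p = 1 - e^(-λt) from the time since the last event in the category and draw a random number. The λ value here is an assumed tuning constant that sets the target sampling rate.

package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// selectWithExponentialProb selects an event with probability 1 - e^(-λt),
// where t is the time since the last event of this category.
func selectWithExponentialProb(sinceLastEvent time.Duration, lambda float64) bool {
	p := 1 - math.Exp(-lambda*sinceLastEvent.Seconds())
	return rand.Float64() < p
}

func main() {
	// With λ = 0.1/s, an event seen 10s after the previous one is selected
	// with probability 1 - e^-1 ≈ 0.63.
	fmt.Println(selectWithExponentialProb(10*time.Second, 0.1))
}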
USING EXPONENTIAL PROBABILITIES
[Figure: exponential selection probability over time, with events marked on the time axis]
WE CHOSE A SIMPLER APPROACH
[Figure: linearly increasing selection probability over time, with events marked on the time axis]
We use a linearly increasing probability that will produce a uniform distribution of samples
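A comparable sketch of the simpler approach, assuming a fixed window parameter that sets the target rate: the probability ramps up linearly with the time since the last event and is capped at 1.0.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// selectWithLinearProb selects an event with probability elapsed/window,
// capped at 1.0, so categories unseen for a long time are almost always selected.
func selectWithLinearProb(sinceLastEvent, window time.Duration) bool {
	p := float64(sinceLastEvent) / float64(window)
	if p > 1 {
		p = 1
	}
	return rand.Float64() < p
}

func main() {
	// With a 10s window, an event 2.5s after the previous one has p = 0.25.
	fmt.Println(selectWithLinearProb(2500*time.Millisecond, 10*time.Second))
}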
ANALOGY: WAITING FOR A BUS
Buses arrive at a stop every 5 minutes on average
You arrive 2 minutes after the last bus left
How long should you expect to wait?
WHY DO WE DO THAT?
Easy to understand
Less computationally expensive
Reliable in low-frequency scenarios
BEWARE: PROBABILITY GOTCHA
[Figure: selection probability over time with events marked; one earlier event is annotated “We Sampled This Event”]
Should we use time since last event, or time since last sample, to compute the probability of selecting an event?
Which strategy is correct?
USE THE TIME SINCE THE LAST EVENT
[Figure: selection probability over time, computed from the time since the last event]
Probabilities are additive, and probability since the last sample grows a lot faster than probability since last event
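A small simulation under assumed parameters (events every second, a 10-second window, the linear ramp above) showing why: with time since the last event every event keeps p = 0.1, while time since the last sample lets the probability keep climbing and yields roughly 2-3x more samples than intended.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const (
		window = 10 * time.Second
		step   = time.Second // assumed inter-arrival time between events
		events = 10000
	)

	samplesA, samplesB := 0, 0
	lastSample := time.Duration(0)
	for i := 1; i <= events; i++ {
		now := time.Duration(i) * step

		// Strategy A: probability from time since the last event (always one step here).
		if rand.Float64() < float64(step)/float64(window) {
			samplesA++
		}

		// Strategy B: probability from time since the last sample.
		if rand.Float64() < float64(now-lastSample)/float64(window) {
			samplesB++
			lastSample = now
		}
	}
	fmt.Println("time since last event: ", samplesA) // ≈ 1000, the intended 10%
	fmt.Println("time since last sample:", samplesB) // noticeably more
}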
EFFICIENCY CHALLENGES
“Remembering” millions of categories creates memory and CPU load
We use an LRU to “forget” stale categories for efficiency
Lots of edge cases can result (oversampling, undersampling)
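For context, a hedged sketch of what that could look like (the names, capacity, and details are assumptions): an exact category-to-last-seen map bounded by LRU eviction. Evicting a category makes it look brand new the next time it appears, which is one source of the edge cases mentioned above.

package main

import (
	"container/list"
	"fmt"
	"time"
)

// lastSeenLRU remembers the last time each category was seen, forgetting the
// least recently seen categories once capacity is exceeded.
type lastSeenLRU struct {
	cap   int
	order *list.List               // most recently seen category at the front
	items map[string]*list.Element // category -> element in order
}

type entry struct {
	category string
	lastSeen time.Time
}

func newLastSeenLRU(capacity int) *lastSeenLRU {
	return &lastSeenLRU{cap: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Touch records that category was seen at time now and returns the previous
// last-seen time, if the category was still remembered.
func (l *lastSeenLRU) Touch(category string, now time.Time) (prev time.Time, known bool) {
	if el, ok := l.items[category]; ok {
		e := el.Value.(*entry)
		prev, known = e.lastSeen, true
		e.lastSeen = now
		l.order.MoveToFront(el)
		return prev, known
	}
	l.items[category] = l.order.PushFront(&entry{category, now})
	if l.order.Len() > l.cap {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).category)
	}
	return time.Time{}, false // never seen, or already forgotten
}

func main() {
	lru := newLastSeenLRU(100000) // assumed capacity
	lru.Touch("SELECT c FROM t WHERE id = ?", time.Now())
	prev, known := lru.Touch("SELECT c FROM t WHERE id = ?", time.Now())
	fmt.Println(prev, known)
}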
EFFICIENCY SOLUTION?
We’d like a cheap way to “remember” the last time we’ve seen a category of query, even if it’s approximate
A SKETCH TO THE RESCUE!
A sketch is a compact, probabilistic data structure
Trades off accuracy for resources (CPU, memory)
Similar in nature to a bloom filter
WE “INVENTED” A SKETCH
We were inspired by the Count-Min Sketch
Instead of frequency, we needed last-seen timestamp
We call it the “Last-Seen Sketch”
It is compact and efficient (memory + CPU)
It errs on the side of undersampling
THE LAST-SEEN SKETCH
The sketch is several arrays of timestamps
Categories of events map to cells by hash and modulus.
Each event will hash & modulus to one cell in each array.
With 4 arrays, it’s stored in 4 places.
TS 0 TS 1 TS 2 TS 3 TS 4 TS 5
ARRAY 0 1 4 7 9 8 3
ARRAY 1 7 3 2 9 1
ARRAY 2 1 3 9 7
ARRAY 3 3 9 7
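In Go, the structure might be declared like this (a hypothetical reconstruction, not VividCortex’s implementation); the array sizes of 6, 5, 4, 3 mirror the example table above, and storing and looking up values follow on the next slides.

package main

// LastSeenSketch holds several arrays of Unix timestamps. The arrays have
// different lengths so that hash-modulo-length lands in a different cell of
// each array; a zero cell means the category has never been stored.
type LastSeenSketch struct {
	arrays [][]int64
}

func NewLastSeenSketch(sizes ...int) *LastSeenSketch {
	s := &LastSeenSketch{arrays: make([][]int64, len(sizes))}
	for i, n := range sizes {
		s.arrays[i] = make([]int64, n)
	}
	return s
}

// NewLastSeenSketch(6, 5, 4, 3) matches the 4-array example above.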
STORING AN EVENT’S TIMESTAMP
Store ts=8 for event that hashes to 20. Where are its values stored?
20 % 6 => index 2
20 % 5 => index 0
20 % 4 => index 0
20 % 3 => index 2
TS 0 TS 1 TS 2 TS 3 TS 4 TS 5
ARRAY 0 1 4 8 9 8 3
ARRAY 1 8 3 2 9 1
ARRAY 2 8 3 9 7
ARRAY 3 3 9 8
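Continuing the hypothetical type above, storing an event’s timestamp writes it into one cell per array, chosen by hash modulo the array length, exactly as in the worked example (hash 20, ts 8 gives indices 2, 0, 0, 2):

// Update records that the category with this hash was last seen at ts.
// Colliding categories overwrite each other's cells, which is why a lookup
// takes the minimum across arrays.
func (s *LastSeenSketch) Update(hash uint64, ts int64) {
	for _, arr := range s.arrays {
		arr[hash%uint64(len(arr))] = ts
	}
}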
LOOKING UP A VALUE
Example: find stored timestamp for event that hashes to 13.
Indices are 1, 3, 1, 1.
Choose the lowest value.
Result: value is 3.
TS 0 TS 1 TS 2 TS 3 TS 4 TS 5
ARRAY 0 1 4 8 9 8 3
ARRAY 1 8 3 2 9 1
ARRAY 2 8 3 9 7
ARRAY 3 3 9 8
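And the lookup, continuing the same example file: read the one cell per array that the hash maps to and keep the smallest timestamp, the cell least likely to have been overwritten by a collision. A collision can only make a category look more recently seen, which is why the sketch errs on the side of undersampling. The usage below reproduces the slides’ numbers under this assumed setup.

// Lookup returns the smallest stored timestamp among the cells this hash
// maps to; zero means this category has never been stored.
func (s *LastSeenSketch) Lookup(hash uint64) int64 {
	min := s.arrays[0][hash%uint64(len(s.arrays[0]))]
	for _, arr := range s.arrays[1:] {
		if v := arr[hash%uint64(len(arr))]; v < min {
			min = v
		}
	}
	return min
}

func main() {
	s := NewLastSeenSketch(6, 5, 4, 3)
	s.Update(13, 3)       // some category last seen at ts=3
	s.Update(20, 8)       // store ts=8 for the event that hashes to 20
	println(s.Lookup(13)) // prints 3, matching the lookup example
}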
PUTTING IT ALL TOGETHER
Events are categorized and flagged in various ways
Important events: long-running, has an error, etc.
Ineligible: has blacklisted text that’s sensitive/private, etc.
Events are then eligible for selecting as a sample
PUTTING IT ALL TOGETHER
Probability of selecting the event is determined with the Last-Seen Sketch
On collision, we err on the side of undersampling (very small probability)
Events are selected and transmitted to our APIs
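Sketching how those pieces might combine, as a continuation of the example file above (the hash function, window parameter, and names are assumptions, not the production code):

// Additional imports assumed for this part: hash/fnv, math/rand, time.

// categoryHash maps an event's category (e.g. a digested query) to a hash.
func categoryHash(category string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(category))
	return h.Sum64()
}

// shouldSample looks up the category's approximate last-seen time, turns the
// elapsed time into a linearly increasing probability, records the new
// timestamp, and draws a random number. A never-stored category (Lookup == 0)
// gets an enormous elapsed time and is effectively always sampled; a sketch
// collision shrinks the elapsed time and so errs toward undersampling.
func shouldSample(s *LastSeenSketch, category string, now time.Time, window time.Duration) bool {
	h := categoryHash(category)
	elapsed := now.Sub(time.Unix(s.Lookup(h), 0))
	s.Update(h, now.Unix())

	p := float64(elapsed) / float64(window)
	if p > 1 {
		p = 1
	}
	return rand.Float64() < p
}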
NOW WE HAVE METRICS + SAMPLES
RATE LIMITS
Important to prevent DoS’ing ourselves
Current: implemented with a global sample quota per interval of time
Future: likely will use an EWMA to influence overall sampling probability
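A minimal sketch of the current approach described above, with an assumed quota and interval (the future EWMA-based variant is not shown):

package main

import (
	"fmt"
	"sync"
	"time"
)

// sampleQuota enforces a global cap on samples per interval of time.
type sampleQuota struct {
	mu       sync.Mutex
	limit    int
	used     int
	interval time.Duration
	start    time.Time
}

func newSampleQuota(limit int, interval time.Duration) *sampleQuota {
	return &sampleQuota{limit: limit, interval: interval, start: time.Now()}
}

// Allow reports whether another sample may be transmitted right now.
func (q *sampleQuota) Allow(now time.Time) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if now.Sub(q.start) >= q.interval {
		q.start, q.used = now, 0 // new interval: reset the quota
	}
	if q.used >= q.limit {
		return false
	}
	q.used++
	return true
}

func main() {
	q := newSampleQuota(100, time.Minute) // assumed: at most 100 samples per minute
	fmt.Println(q.Allow(time.Now()))
}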
IMPORTANT EVENTS
Not all events are created equal
Bias sampling towards important events
Extremely helpful for one-in-a-million problems in production
Challenging to balance with rate limits
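One hedged way to express such a bias (the boost factor is an assumption for illustration only): events flagged as important get their selection probability boosted before the random draw.

package main

import (
	"fmt"
	"math/rand"
)

// biasedSelect boosts the selection probability of important events
// (long-running, errors, warnings) before the random draw.
func biasedSelect(baseProb float64, important bool) bool {
	p := baseProb
	if important {
		p *= 10 // assumed boost factor
		if p > 1 {
			p = 1
		}
	}
	return rand.Float64() < p
}

func main() {
	fmt.Println(biasedSelect(0.01, true)) // 10% chance instead of 1%
}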
EXAMPLE IN VIVIDCORTEX
Suppose we are sniffing queries off the wire, and occasional queries produce warnings or errors: say, 0.001% of queries
If we aren’t sampling this query category enough, we won’t have the warning-producing SQL to examine!
PRESTO!
MONGODB QUERY WITH ERROR
@PreetamJinka
linkedin.com/in/preetamjinka
@xaprb
linkedin.com/in/xaprb
Thanks to John Berryman, who helped implement this and peer reviewed it.
PHOTO CREDITS
Chocolates: skrb - https://www.flickr.com/photos/skrb/5984342555
Dew: taufuuu - https://www.flickr.com/photos/ghailon/11565221176
Silhouette: https://www.flickr.com/photos/28481088@N00/2925783507
Bus stop: Robert Couse-Baker - https://www.flickr.com/photos/29233640@N07/14033204315
calla edge: mclcbooks - https://www.flickr.com/photos/39877441@N05/5455416496/
Windmills: omarparada - https://www.flickr.com/photos/omarparada/9776594294
Airplanes: presidioofmonterey - https://www.flickr.com/photos/presidioofmonterey/10710648865
Droplet collision: https://www.flickr.com/photos/69294818@N07/8682467843
1000 layers: doug88888 - https://www.flickr.com/photos/doug88888/3139395660
Balancing Rocks: light_seeker - https://www.flickr.com/photos/light_seeker/7780857224
Capilano Dam: barabanov - https://www.flickr.com/photos/barabanov/4733415724
Survival Bias: hjl - https://www.flickr.com/photos/hjl/15942299782