sampling is hard
TRANSCRIPT
SAMPLING A STREAM OF EVENTS WITH A SKETCH
PREETAM JINKA • BARON SCHWARTZ • VIVIDCORTEX
MONITORAMA • JUNE 2015
INTRODUCTIONS
VividCortex is the best way to see what your databases are doing in production
Preetam Jinka, Software Engineer
@PreetamJinka
Baron Schwartz, CEO/Founder
@xaprb
A STREAM OF EVENTS IN TIME
[Figure: a stream of events plotted along a time axis]
COMPUTE METRICS ABOUT THE EVENTS
METRICS ALONE ARE NOT ENOUGH
WE WANT SAMPLES OF THESE EVENTS
REPRESENTATIVE SAMPLING IS HARD
It’s Hard To Pick Individual Samples
REPRESENTATIVE SAMPLING MATTERS
EVENTS ARE DIVERSE AND COMPLEX
GOALS
Sample enough events but not too many
Select representative events
Bias “important” events
Avoid “private” events
Balance sampling between rare and frequent events
Achieve desired overall sampling rate
CONFLICTING GOALS
Bias towards “important” events, versus rate limiting
Rare versus frequent versus overall sampling rate
Correctness versus efficiency
POSSIBLE APPROACHES
Select every Nth event
Select worst event per time period
Select random event per time period
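As a rough illustration only (not the approach the talk ends up with), the first option above could look like this in Go; the counter and the value of N are assumptions for the example:

package main

import "fmt"

func main() {
	const n = 100 // assumed: keep 1 of every 100 events
	for i := 1; i <= 1000; i++ {
		if i%n == 0 {
			fmt.Println("sampled event", i)
		}
	}
}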
STATISTICS TO THE RESCUE?
If events are generated by a Poisson process, then:
Constant average rate, exponential inter-arrival times
If we choose samples using an exponential probability, then samples would be Poisson too
[Figure: selection probability rises with the time since the last event and approaches 1.0]
Probability of selecting an event, given time t since the last event: 1 - e^(-λt)
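A minimal Go sketch of that idea (an illustration, not the production code): compute p = 1 - e^(-λt) from the time since the last event in the category and draw a random number. The λ value here is an assumed tuning constant that sets the target sampling rate.

package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// selectWithExponentialProb selects an event with probability 1 - e^(-λt),
// where t is the time since the last event of this category.
func selectWithExponentialProb(sinceLastEvent time.Duration, lambda float64) bool {
	p := 1 - math.Exp(-lambda*sinceLastEvent.Seconds())
	return rand.Float64() < p
}

func main() {
	// With λ = 0.1/s, an event seen 10s after the previous one is selected
	// with probability 1 - e^-1 ≈ 0.63.
	fmt.Println(selectWithExponentialProb(10*time.Second, 0.1))
}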
USING EXPONENTIAL PROBABILITIES
[Figure: exponential selection probability over time, with events marked on the time axis]
WE CHOSE A SIMPLER APPROACH
[Figure: linearly increasing selection probability over time, with events marked on the time axis]
We use a linearly increasing probability that will produce a uniform distribution of samples
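A comparable sketch of the simpler approach, assuming a fixed window parameter that sets the target rate: the probability ramps up linearly with the time since the last event and is capped at 1.0.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// selectWithLinearProb selects an event with probability elapsed/window,
// capped at 1.0, so categories unseen for a long time are almost always selected.
func selectWithLinearProb(sinceLastEvent, window time.Duration) bool {
	p := float64(sinceLastEvent) / float64(window)
	if p > 1 {
		p = 1
	}
	return rand.Float64() < p
}

func main() {
	// With a 10s window, an event 2.5s after the previous one has p = 0.25.
	fmt.Println(selectWithLinearProb(2500*time.Millisecond, 10*time.Second))
}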
ANALOGY: WAITING FOR A BUS
Buses arrive at a stop every 5 minutes on average
You arrive 2 minutes after the last bus left
How long should you expect to wait?
WHY DO WE DO THAT?
Easy to understand
Less computationally expensive
Reliable in low-frequency scenarios
BEWARE: PROBABILITY GOTCHA
[Figure: selection probability over time with events marked; one earlier event is annotated “We Sampled This Event”]
Should we use time since last event, or time since last sample, to compute the probability of selecting an event?
Which strategy is correct?
USE THE TIME SINCE THE LAST EVENT
[Figure: selection probability over time, computed from the time since the last event]
Probabilities are additive, and probability since the last sample grows a lot faster than probability since last event
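A small simulation under assumed parameters (events every second, a 10-second window, the linear ramp above) showing why: with time since the last event every event keeps p = 0.1, while time since the last sample lets the probability keep climbing and yields roughly 2-3x more samples than intended.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const (
		window = 10 * time.Second
		step   = time.Second // assumed inter-arrival time between events
		events = 10000
	)

	samplesA, samplesB := 0, 0
	lastSample := time.Duration(0)
	for i := 1; i <= events; i++ {
		now := time.Duration(i) * step

		// Strategy A: probability from time since the last event (always one step here).
		if rand.Float64() < float64(step)/float64(window) {
			samplesA++
		}

		// Strategy B: probability from time since the last sample.
		if rand.Float64() < float64(now-lastSample)/float64(window) {
			samplesB++
			lastSample = now
		}
	}
	fmt.Println("time since last event: ", samplesA) // ≈ 1000, the intended 10%
	fmt.Println("time since last sample:", samplesB) // noticeably more
}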
EFFICIENCY CHALLENGES
“Remembering” millions of categories creates memory and CPU load
We use an LRU to “forget” stale categories for efficiency
Lots of edge cases can result (oversampling, undersampling)
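For context, a hedged sketch of what that could look like (the names, capacity, and details are assumptions): an exact category-to-last-seen map bounded by LRU eviction. Evicting a category makes it look brand new the next time it appears, which is one source of the edge cases mentioned above.

package main

import (
	"container/list"
	"fmt"
	"time"
)

// lastSeenLRU remembers the last time each category was seen, forgetting the
// least recently seen categories once capacity is exceeded.
type lastSeenLRU struct {
	cap   int
	order *list.List               // most recently seen category at the front
	items map[string]*list.Element // category -> element in order
}

type entry struct {
	category string
	lastSeen time.Time
}

func newLastSeenLRU(capacity int) *lastSeenLRU {
	return &lastSeenLRU{cap: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Touch records that category was seen at time now and returns the previous
// last-seen time, if the category was still remembered.
func (l *lastSeenLRU) Touch(category string, now time.Time) (prev time.Time, known bool) {
	if el, ok := l.items[category]; ok {
		e := el.Value.(*entry)
		prev, known = e.lastSeen, true
		e.lastSeen = now
		l.order.MoveToFront(el)
		return prev, known
	}
	l.items[category] = l.order.PushFront(&entry{category, now})
	if l.order.Len() > l.cap {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).category)
	}
	return time.Time{}, false // never seen, or already forgotten
}

func main() {
	lru := newLastSeenLRU(100000) // assumed capacity
	lru.Touch("SELECT c FROM t WHERE id = ?", time.Now())
	prev, known := lru.Touch("SELECT c FROM t WHERE id = ?", time.Now())
	fmt.Println(prev, known)
}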
EFFICIENCY SOLUTION?
We’d like a cheap way to “remember” the last time we’ve seen a category of query, even if it’s approximate
A SKETCH TO THE RESCUE!
A sketch is a compact, probabilistic data structure
Trades off accuracy for resources (CPU, memory)
Similar in nature to a bloom filter
WE “INVENTED” A SKETCH
We were inspired by the Count-Min Sketch
Instead of frequency, we needed last-seen timestamp
We call it the “Last-Seen Sketch”
It is compact and efficient (memory + CPU)
It errs on the side of undersampling
THE LAST-SEEN SKETCH
The sketch is several arrays of timestamps
Categories of events map to cells by hash and modulus.
Each event will hash & modulus to one cell in each array.
With 4 arrays, it’s stored in 4 places.
TS 0 TS 1 TS 2 TS 3 TS 4 TS 5
ARRAY 0 1 4 7 9 8 3
ARRAY 1 7 3 2 9 1
ARRAY 2 1 3 9 7
ARRAY 3 3 9 7
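In Go, the structure might be declared like this (a hypothetical reconstruction, not VividCortex’s implementation); the array sizes of 6, 5, 4, 3 mirror the example table above, and storing and looking up values follow on the next slides.

package main

// LastSeenSketch holds several arrays of Unix timestamps. The arrays have
// different lengths so that hash-modulo-length lands in a different cell of
// each array; a zero cell means the category has never been stored.
type LastSeenSketch struct {
	arrays [][]int64
}

func NewLastSeenSketch(sizes ...int) *LastSeenSketch {
	s := &LastSeenSketch{arrays: make([][]int64, len(sizes))}
	for i, n := range sizes {
		s.arrays[i] = make([]int64, n)
	}
	return s
}

// NewLastSeenSketch(6, 5, 4, 3) matches the 4-array example above.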
STORING AN EVENT’S TIMESTAMP
Store ts=8 for event that hashes to 20. Where are its values stored?
20 % 6 => index 2
20 % 5 => index 0
20 % 4 => index 0
20 % 3 => index 2
TS 0 TS 1 TS 2 TS 3 TS 4 TS 5
ARRAY 0 1 4 8 9 8 3
ARRAY 1 8 3 2 9 1
ARRAY 2 8 3 9 7
ARRAY 3 3 9 8
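Continuing the hypothetical type above, storing an event’s timestamp writes it into one cell per array, chosen by hash modulo the array length, exactly as in the worked example (hash 20, ts 8 gives indices 2, 0, 0, 2):

// Update records that the category with this hash was last seen at ts.
// Colliding categories overwrite each other's cells, which is why a lookup
// takes the minimum across arrays.
func (s *LastSeenSketch) Update(hash uint64, ts int64) {
	for _, arr := range s.arrays {
		arr[hash%uint64(len(arr))] = ts
	}
}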
LOOKING UP A VALUE
Example: find stored timestamp for event that hashes to 13.
Indices are 1, 3, 1, 1.
Choose the lowest value.
Result: value is 3.
TS 0 TS 1 TS 2 TS 3 TS 4 TS 5
ARRAY 0 1 4 8 9 8 3
ARRAY 1 8 3 2 9 1
ARRAY 2 8 3 9 7
ARRAY 3 3 9 8
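And the lookup, continuing the same example file: read the one cell per array that the hash maps to and keep the smallest timestamp, the cell least likely to have been overwritten by a collision. A collision can only make a category look more recently seen, which is why the sketch errs on the side of undersampling. The usage below reproduces the slides’ numbers under this assumed setup.

// Lookup returns the smallest stored timestamp among the cells this hash
// maps to; zero means this category has never been stored.
func (s *LastSeenSketch) Lookup(hash uint64) int64 {
	min := s.arrays[0][hash%uint64(len(s.arrays[0]))]
	for _, arr := range s.arrays[1:] {
		if v := arr[hash%uint64(len(arr))]; v < min {
			min = v
		}
	}
	return min
}

func main() {
	s := NewLastSeenSketch(6, 5, 4, 3)
	s.Update(13, 3)       // some category last seen at ts=3
	s.Update(20, 8)       // store ts=8 for the event that hashes to 20
	println(s.Lookup(13)) // prints 3, matching the lookup example
}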
PUTTING IT ALL TOGETHER
Events are categorized and flagged in various ways
Important events: long-running, has an error, etc.
Ineligible: has blacklisted text that’s sensitive/private, etc.
Events are then eligible for selecting as a sample
PUTTING IT ALL TOGETHER
Probability of selecting the event is determined with the Last-Seen Sketch
On collision, we err on the side of undersampling (very small probability)
Events are selected and transmitted to our APIs
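Sketching how those pieces might combine, as a continuation of the example file above (the hash function, window parameter, and names are assumptions, not the production code):

// Additional imports assumed for this part: hash/fnv, math/rand, time.

// categoryHash maps an event's category (e.g. a digested query) to a hash.
func categoryHash(category string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(category))
	return h.Sum64()
}

// shouldSample looks up the category's approximate last-seen time, turns the
// elapsed time into a linearly increasing probability, records the new
// timestamp, and draws a random number. A never-stored category (Lookup == 0)
// gets an enormous elapsed time and is effectively always sampled; a sketch
// collision shrinks the elapsed time and so errs toward undersampling.
func shouldSample(s *LastSeenSketch, category string, now time.Time, window time.Duration) bool {
	h := categoryHash(category)
	elapsed := now.Sub(time.Unix(s.Lookup(h), 0))
	s.Update(h, now.Unix())

	p := float64(elapsed) / float64(window)
	if p > 1 {
		p = 1
	}
	return rand.Float64() < p
}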
NOW WE HAVE METRICS + SAMPLES
RATE LIMITS
Important to prevent DoS’ing ourselves
Current: implemented with a global sample quota per interval of time
Future: likely will use an EWMA to influence overall sampling probability
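A minimal sketch of the current approach described above, with an assumed quota and interval (the future EWMA-based variant is not shown):

package main

import (
	"fmt"
	"sync"
	"time"
)

// sampleQuota enforces a global cap on samples per interval of time.
type sampleQuota struct {
	mu       sync.Mutex
	limit    int
	used     int
	interval time.Duration
	start    time.Time
}

func newSampleQuota(limit int, interval time.Duration) *sampleQuota {
	return &sampleQuota{limit: limit, interval: interval, start: time.Now()}
}

// Allow reports whether another sample may be transmitted right now.
func (q *sampleQuota) Allow(now time.Time) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if now.Sub(q.start) >= q.interval {
		q.start, q.used = now, 0 // new interval: reset the quota
	}
	if q.used >= q.limit {
		return false
	}
	q.used++
	return true
}

func main() {
	q := newSampleQuota(100, time.Minute) // assumed: at most 100 samples per minute
	fmt.Println(q.Allow(time.Now()))
}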
IMPORTANT EVENTS
Not all events are created equal
Bias sampling towards important events
Extremely helpful for one-in-a-million problems in production
Challenging to balance with rate limits
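One hedged way to express such a bias (the boost factor is an assumption for illustration only): events flagged as important get their selection probability boosted before the random draw.

package main

import (
	"fmt"
	"math/rand"
)

// biasedSelect boosts the selection probability of important events
// (long-running, errors, warnings) before the random draw.
func biasedSelect(baseProb float64, important bool) bool {
	p := baseProb
	if important {
		p *= 10 // assumed boost factor
		if p > 1 {
			p = 1
		}
	}
	return rand.Float64() < p
}

func main() {
	fmt.Println(biasedSelect(0.01, true)) // 10% chance instead of 1%
}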
EXAMPLE IN VIVIDCORTEX
Suppose we are sniffing queries off the wire, and occasional queries produce warnings or errors: say, 0.001% of queries
If we aren’t sampling this query category enough, we won’t have the warning-producing SQL to examine!
PRESTO!
MONGODB QUERY WITH ERROR
@PreetamJinka
linkedin.com/in/preetamjinka
@xaprb
linkedin.com/in/xaprb
Thanks to John Berryman, who helped implement this and peer reviewed it.
PHOTO CREDITS
Chocolates: skrb - https://www.flickr.com/photos/skrb/5984342555
Dew: taufuuu - https://www.flickr.com/photos/ghailon/11565221176
Silhouette: https://www.flickr.com/photos/28481088@N00/2925783507
Bus stop: Robert Couse-Baker - https://www.flickr.com/photos/29233640@N07/14033204315
calla edge: mclcbooks - https://www.flickr.com/photos/39877441@N05/5455416496/
Windmills: omarparada - https://www.flickr.com/photos/omarparada/9776594294
Airplanes: presidioofmonterey - https://www.flickr.com/photos/presidioofmonterey/10710648865
Droplet collision: https://www.flickr.com/photos/69294818@N07/8682467843
1000 layers: doug88888 - https://www.flickr.com/photos/doug88888/3139395660
Balancing Rocks: light_seeker - https://www.flickr.com/photos/light_seeker/7780857224
Capilano Dam: barabanov - https://www.flickr.com/photos/barabanov/4733415724
Survival Bias: hjl - https://www.flickr.com/photos/hjl/15942299782