Sampling Is Hard
![Page 1: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/1.jpg)
SAMPLING A STREAM OF EVENTS WITH A SKETCH
PREETAM JINKA • BARON SCHWARTZ • VIVIDCORTEX
MONITORAMA • JUNE 2015
![Page 2: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/2.jpg)
INTRODUCTIONS
VividCortex is the best way to see what your databases are doing in production
Preetam Jinka, Software Engineer
@PreetamJinka
Baron Schwartz, CEO/Founder
@xaprb
![Page 3: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/3.jpg)
A STREAM OF EVENTS IN TIME
[Figure: events along a Time axis]
![Page 4: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/4.jpg)
COMPUTE METRICS ABOUT THE EVENTS
![Page 5: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/5.jpg)
METRICS ALONE ARE NOT ENOUGH
![Page 6: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/6.jpg)
WE WANT SAMPLES OF THESE EVENTS
![Page 7: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/7.jpg)
REPRESENTATIVE SAMPLING IS HARD
It’s Hard To Pick Individual Samples
![Page 8: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/8.jpg)
REPRESENTATIVE SAMPLING MATTERS
![Page 9: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/9.jpg)
EVENTS ARE DIVERSE AND COMPLEX
![Page 10: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/10.jpg)
GOALS
Sample enough events but not too many
Select representative events
Bias “important” events
Avoid “private” events
Balance sampling between rare and frequent events
Achieve desired overall sampling rate
![Page 11: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/11.jpg)
CONFLICTING GOALS
Bias towards “important” events, versus rate limiting
Rare versus frequent versus overall sampling rate
Correctness versus efficiency
![Page 12: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/12.jpg)
POSSIBLE APPROACHES
Select every Nth event
Select worst event per time period
Select random event per time period
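The three approaches above could be sketched as follows. This is only an illustration: the `(timestamp, latency)` event shape, the function names, and reading "worst" as "highest latency" are assumptions, not details from the talk.

```python
import random

# Hypothetical event shape: (timestamp_seconds, latency_seconds).

def every_nth(events, n):
    """Select every Nth event: the Nth, the 2Nth, and so on."""
    return events[n - 1 :: n]

def worst_per_period(events, period):
    """Select the highest-latency event in each time period."""
    worst = {}
    for event in events:
        bucket = int(event[0] // period)
        if bucket not in worst or event[1] > worst[bucket][1]:
            worst[bucket] = event
    return [worst[b] for b in sorted(worst)]

def random_per_period(events, period, rng=random):
    """Select one event chosen uniformly at random from each time period."""
    buckets = {}
    for event in events:
        buckets.setdefault(int(event[0] // period), []).append(event)
    return [rng.choice(buckets[b]) for b in sorted(buckets)]

events = [(0, 5), (1, 9), (2, 1), (10, 3), (11, 2)]
print(every_nth(events, 2))
print(worst_per_period(events, 10))
print(random_per_period(events, 10))
```

None of these keeps track of event categories, so a rare category can go unsampled for a long time while a frequent one dominates.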
![Page 13: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/13.jpg)
STATISTICS TO THE RESCUE?
If events are generated by a Poisson process, then:
Constant average rate, exponential inter-arrival times
If we choose samples using an exponential probability, then samples would be Poisson too
[Figure: Probability versus Time; the probability approaches 1.0]
![Page 14: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/14.jpg)
USING EXPONENTIAL PROBABILITIES
Probability of selecting an event, given the time $t$ since the last event, is $1 - e^{-\lambda t}$
[Figure: Probability versus Time, with Events marked]
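As a minimal sketch of the exponential rule, assume `rate` stands for λ (events per second) and `elapsed` for the time in seconds since the last event. Both names and the example rate are illustrative.

```python
import math

def exponential_probability(elapsed, rate):
    """Return 1 - e^(-rate * elapsed); negative elapsed times are treated as 0."""
    return 1.0 - math.exp(-rate * max(0.0, elapsed))

# The probability starts at 0 and approaches 1.0 as more time passes.
for elapsed in (0.0, 1.0, 5.0, 30.0):
    print(elapsed, round(exponential_probability(elapsed, rate=0.2), 3))
```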
![Page 15: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/15.jpg)
WE CHOSE A SIMPLER APPROACH
[Figure: Probability versus Time, with Events marked]
We use a linearly increasing probability that will produce a uniform distribution of samples
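A linear rule might look like the sketch below. The `ramp` parameter (the elapsed time at which the probability reaches 1.0) is an assumption made for illustration; the slides do not say how the slope is chosen.

```python
def linear_probability(elapsed, ramp):
    """Probability grows linearly with elapsed time and is capped at 1.0.

    `ramp` is a hypothetical tuning parameter: the elapsed time at which
    an event is certain to be selected.
    """
    if ramp <= 0:
        raise ValueError("ramp must be positive")
    return min(1.0, max(0.0, elapsed / ramp))

for elapsed in (0.0, 2.5, 5.0, 20.0):
    print(elapsed, linear_probability(elapsed, ramp=10.0))
```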
![Page 16: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/16.jpg)
ANALOGY: WAITING FOR A BUS
Buses arrive at a stop every 5 minutes on average
You arrive 2 minutes after the last bus left
How long should you expect to wait?
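One way to work the analogy is to assume bus arrivals follow a Poisson process, so the gap $T$ between buses is exponential with a mean of 5 minutes. The exponential distribution is memoryless, so the 2 minutes already elapsed do not shorten the expected wait:

$$
P(T > 2 + w \mid T > 2) = \frac{e^{-(2+w)/5}}{e^{-2/5}} = e^{-w/5}, \qquad E[\,T - 2 \mid T > 2\,] = 5 \text{ minutes}.
$$

If the buses instead ran on a fixed 5-minute schedule, the wait would be exactly $5 - 2 = 3$ minutes. Which model applies is an assumption in either case.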
![Page 17: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/17.jpg)
WHY DO WE DO THAT?
Easy to understand
Less computationally expensive
Reliable in low-frequency scenarios
![Page 18: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/18.jpg)
BEWARE: PROBABILITY GOTCHA
[Figure: Probability versus Time, with Events marked; one event is labeled “We Sampled This Event”]
Should we use time since last event, or time since last sample, to compute the probability of selecting an event?
Which strategy is correct?
![Page 19: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/19.jpg)
USE THE TIME SINCE THE LAST EVENT
[Figure: Probability versus Time, with Events marked]
Probabilities are additive, and probability since the last sample grows a lot faster than probability since last event
![Page 20: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/20.jpg)
EFFICIENCY CHALLENGES
“Remembering” millions of categories creates memory and CPU load
We use an LRU to “forget” stale categories for efficiency
Lots of edge cases can result (oversampling, undersampling)
![Page 21: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/21.jpg)
EFFICIENCY SOLUTION?
We’d like a cheap way to “remember” the last time we’ve seen a category of query, even if it’s approximate
![Page 22: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/22.jpg)
A SKETCH TO THE RESCUE!
A sketch is a compact, probabilistic data structure
Trades off accuracy for resources (CPU, memory)
Similar in nature to a bloom filter
![Page 23: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/23.jpg)
WE “INVENTED” A SKETCH
We were inspired by the Count-Min Sketch
Instead of frequency, we needed last-seen timestamp
We call it the “Last-Seen Sketch”
It is compact and efficient (memory + CPU)
It errs on the side of undersampling
![Page 24: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/24.jpg)
THE LAST-SEEN SKETCH
The sketch is several arrays of timestamps
Categories of events map to cells by hash and modulus.
Each event will hash & modulus to one cell in each array.
With 4 arrays, it’s stored in 4 places.
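| | TS 0 | TS 1 | TS 2 | TS 3 | TS 4 | TS 5 |
|---|---|---|---|---|---|---|
| ARRAY 0 | 1 | 4 | 7 | 9 | 8 | 3 |
| ARRAY 1 | 7 | 3 | 2 | 9 | 1 | |
| ARRAY 2 | 1 | 3 | 9 | 7 | | |
| ARRAY 3 | 3 | 9 | 7 | | | |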
![Page 25: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/25.jpg)
STORING AN EVENT’S TIMESTAMP
Store ts=8 for event that hashes to 20. Where are its values stored?
ARRAY 0: 20 % 6 => index 2
ARRAY 1: 20 % 5 => index 0
ARRAY 2: 20 % 4 => index 0
ARRAY 3: 20 % 3 => index 2
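| | TS 0 | TS 1 | TS 2 | TS 3 | TS 4 | TS 5 |
|---|---|---|---|---|---|---|
| ARRAY 0 | 1 | 4 | 8 | 9 | 8 | 3 |
| ARRAY 1 | 8 | 3 | 2 | 9 | 1 | |
| ARRAY 2 | 8 | 3 | 9 | 7 | | |
| ARRAY 3 | 3 | 9 | 8 | | | |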
![Page 26: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/26.jpg)
LOOKING UP A VALUE
Example: find stored timestamp for event that hashes to 13.
Indices are 1, 3, 1, 1.
Choose the lowest value.
Result: value is 3.
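| | TS 0 | TS 1 | TS 2 | TS 3 | TS 4 | TS 5 |
|---|---|---|---|---|---|---|
| ARRAY 0 | 1 | 4 | 8 | 9 | 8 | 3 |
| ARRAY 1 | 8 | 3 | 2 | 9 | 1 | |
| ARRAY 2 | 8 | 3 | 9 | 7 | | |
| ARRAY 3 | 3 | 9 | 8 | | | |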
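The store and lookup steps from the last two slides can be reproduced with a small sketch. The class and method names are illustrative, the hash is taken as an already-computed integer, and cells are assumed to be overwritten unconditionally on store; none of this is the authors' actual implementation.

```python
class LastSeenSketch:
    """Several arrays of timestamps; a key maps to one cell in each array."""

    def __init__(self, row_sizes):
        # Assumption: unused cells start at 0 (never seen).
        self.rows = [[0] * size for size in row_sizes]

    def store(self, key_hash, timestamp):
        # Write the timestamp into one cell per array, at hash % array length.
        for row in self.rows:
            row[key_hash % len(row)] = timestamp

    def lookup(self, key_hash):
        # Read one cell per array and choose the lowest value.
        return min(row[key_hash % len(row)] for row in self.rows)

# Start from the array contents shown on the slides.
sketch = LastSeenSketch([6, 5, 4, 3])
sketch.rows = [[1, 4, 7, 9, 8, 3], [7, 3, 2, 9, 1], [1, 3, 9, 7], [3, 9, 7]]

sketch.store(20, 8)        # writes to indices 2, 0, 0, 2
print(sketch.rows)
print(sketch.lookup(13))   # reads indices 1, 3, 1, 1 and prints 3
```

If timestamps only ever increase, a collision can only make a cell newer than the key's true last-seen time. The lowest value is then the closest estimate, and any remaining error makes the category look more recently seen than it was, which is consistent with erring toward undersampling.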
![Page 27: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/27.jpg)
PUTTING IT ALL TOGETHER
Events are categorized and flagged in various ways
Important events: long-running, has an error, etc.
Ineligible: has blacklisted text that’s sensitive/private, etc.
Events are then eligible for selecting as a sample
![Page 28: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/28.jpg)
PUTTING IT ALL TOGETHER
Probability of selecting the event is determined with the Last-Seen Sketch
On collision, we err on the side of undersampling (very small probability)
Events are selected and transmitted to our APIs
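The selection step could be sketched roughly as below. To keep the example short, an exact `dict` stands in for the Last-Seen Sketch, and the `Event` fields, the `ramp` parameter, and the choice to always select a category's first event are all assumptions for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Event:
    category_hash: int        # hash of the event's category
    timestamp: float          # seconds
    ineligible: bool = False  # e.g. contains sensitive or private text

def should_sample(event, last_seen, ramp, rng=random.random):
    """Decide whether to select an event, using time since the last event
    in the same category and a linearly increasing probability."""
    previous = last_seen.get(event.category_hash)
    # Every event updates the last-seen time, whether or not it is selected.
    last_seen[event.category_hash] = event.timestamp
    if event.ineligible:
        return False
    if previous is None:
        return True  # assumption: a never-seen category is always selected
    elapsed = event.timestamp - previous
    probability = min(1.0, max(0.0, elapsed / ramp))
    return rng() < probability

last_seen = {}
for ts in (100.0, 100.5, 101.0, 130.0):
    print(ts, should_sample(Event(category_hash=42, timestamp=ts), last_seen, ramp=10.0))
```

With a real sketch in place of the `dict`, a hash collision would report a more recent last-seen time, giving a smaller elapsed time and therefore a smaller selection probability.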
![Page 29: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/29.jpg)
NOW WE HAVE METRICS + SAMPLES
![Page 30: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/30.jpg)
RATE LIMITS
Important to prevent DoS’ing ourselves
Current: implemented with a global sample quota per interval of time
Future: likely will use an EWMA to influence overall sampling probability
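A global quota per interval can be sketched with a simple fixed-window counter. The class name, parameters, and fixed-window behavior are assumptions; the slides only state that a global sample quota per interval of time is used.

```python
class SampleQuota:
    """Allow at most `max_samples` selections in each fixed time interval."""

    def __init__(self, max_samples, interval):
        self.max_samples = max_samples
        self.interval = interval
        self.window = None  # index of the interval currently being counted
        self.used = 0

    def allow(self, now):
        window = int(now // self.interval)
        if window != self.window:
            # A new interval has started, so the quota resets.
            self.window = window
            self.used = 0
        if self.used >= self.max_samples:
            return False
        self.used += 1
        return True

quota = SampleQuota(max_samples=2, interval=60.0)
print([quota.allow(t) for t in (0.0, 1.0, 2.0, 61.0)])
```

A hard quota like this is simple, but it can reject an important event that arrives late in a busy interval, which is the tension with rate limits that the next slide describes.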
![Page 31: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/31.jpg)
IMPORTANT EVENTS
Not all events are created equal
Bias sampling towards important events
Extremely helpful for one-in-a-million problems in production
Challenging to balance with rate limits
![Page 32: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/32.jpg)
EXAMPLE IN VIVIDCORTEX
Suppose we are sniffing queries off the wire that occasionally produce warnings or errors, say 0.001% of queries
If we aren’t sampling this query category enough, we won’t have the warning-producing SQL to examine!
![Page 33: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/33.jpg)
PRESTO!
![Page 34: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/34.jpg)
MONGODB QUERY WITH ERROR
![Page 35: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/35.jpg)
@PreetamJinka
linkedin.com/in/preetamjinka
@xaprb
linkedin.com/in/xaprb
Thanks to John Berryman, who helped implement and peer reviewed .
![Page 36: Sampling Is Hard](https://reader038.vdocuments.us/reader038/viewer/2022110312/55bec10cbb61eb1f7b8b4727/html5/thumbnails/36.jpg)
PHOTO CREDITS
Chocolates: skrb - https://www.flickr.com/photos/skrb/5984342555
Dew: taufuuu - https://www.flickr.com/photos/ghailon/11565221176
Silhouette: https://www.flickr.com/photos/28481088@N00/2925783507
Bus stop: Robert Couse-Baker - https://www.flickr.com/photos/29233640@N07/14033204315
calla edge: mclcbooks - https://www.flickr.com/photos/39877441@N05/5455416496/
Windmills: omarparada - https://www.flickr.com/photos/omarparada/9776594294
Airplanes: presidioofmonterey - https://www.flickr.com/photos/presidioofmonterey/10710648865
Droplet collision: https://www.flickr.com/photos/69294818@N07/8682467843
1000 layers: doug88888 - https://www.flickr.com/photos/doug88888/3139395660
Balancing Rocks: light_seeker - https://www.flickr.com/photos/light_seeker/7780857224
Capilano Dam: barabanov - https://www.flickr.com/photos/barabanov/4733415724
Survival Bias: hjl - https://www.flickr.com/photos/hjl/15942299782