no bs data salon #3: probabilistic sketching
DESCRIPTION
Timon Karnezos' presentation on probabilistic sketching and distinct counting with HyperLogLog from the third No BS Data Salon on May 19th, 2012.
TRANSCRIPT
Analytics + Attribution = Actionable Insights
No BS Data Salon #3: Probabilistic Sketching
May 2012
2
Outline
What we do at AK
What’s sketching?
Our motivation for sketching
Why should you sketch?
Our case: unique counting
How it works
How well it works
How we use them
3
Here’s what we do at AK.
Online ad analytics
Compare performance of different campaigns, inventory, providers, creatives, etc.
Bottom Line: Give the advertisers insight into the performance of their ads.
4
Motivation
High throughput: 10s of K/s => 100s of K/s
High dimensionality: 100M+ reporting keys
Easy aggregates: counters, scalars
Hard aggregates: unique user counting, set operations
No cheap or effective “online” solutions
Streaming DBs (Truviso, Coral8, StreamBase) insufficient
Warehouse appliances (Aster, custom PG) same
Our data is immutable. Paying for unneeded ACID is silly.
Offline solutions slow, operationally finicky.
Not a bank. We don’t need to be perfect, just useful.
5
Why should you bother?
SELECT COUNT(DISTINCT user_id)
FROM access_logs
GROUP BY campaign_id
6
What is probabilistic sketching?
One-pass
“Small” memory
Probabilistic error
7
Our Case Study: unique counting
Non-unique stream of ints
Want to keep unique count, up to about a billion
Want to do set operations (union, intersection, set difference)
Straw Man #1: “Put them in a HashSet, and go away.”
(Maybe) Straw Man #2: “Fine, keep a sample.”
How we did it: HyperLogLog
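For contrast, a minimal Python sketch of Straw Man #1: exact unique counts per key, the in-memory equivalent of a COUNT(DISTINCT ...) query. The function name and the (campaign_id, user_id) event shape are illustrative assumptions, not AK's code. Memory grows with the number of distinct users held per key, which is why it does not scale to 100M+ keys.

```python
from collections import defaultdict

def exact_unique_counts(events):
    """Exact distinct users per campaign; events are (campaign_id, user_id) pairs."""
    seen = defaultdict(set)  # one set per campaign: memory is O(distinct users)
    for campaign_id, user_id in events:
        seen[campaign_id].add(user_id)
    return {campaign_id: len(users) for campaign_id, users in seen.items()}

# Hypothetical events: user 10 hits campaign 1 twice and campaign 2 once.
events = [(1, 10), (1, 11), (1, 10), (2, 10)]
counts = exact_unique_counts(events)
```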
8
How it works
The Papers: LogLog Counting of Large Cardinalities
Marianne Durand and Philippe Flajolet (RIP 2010), 2003
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm Flajolet, Fusy, Gandouet, Meunier, 2007
The (rudimentary, unrigorous) Intuition:
Flip fair coins
Longest streak of heads is length k, seen once
Probability of streak ≈ (½)^k
E[x] = n·p = 1, p = (½)^k => n ≈ 2^k
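The coin-flip intuition can be simulated directly. The sketch below is an illustration only (the function name and seed are assumptions): run n experiments, record the longest opening streak of heads k, and report 2^k as the guess for n. A single guess is always a power of two and is very noisy, which is what the corrections on the next slide address.

```python
import random

def longest_heads_streak_estimate(n, seed=0):
    """Run n experiments; each flips a fair coin until the first tail.

    Returns (k, 2**k), where k is the longest streak of heads seen
    and 2**k is the rough estimate of n.
    """
    rng = random.Random(seed)
    longest = 0
    for _ in range(n):
        streak = 0
        while rng.random() < 0.5:  # heads with probability 1/2
            streak += 1
        longest = max(longest, streak)
    return longest, 2 ** longest

k, estimate = longest_heads_streak_estimate(1000)
```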
9
How it works cont’d
1. Stream of int_64 => “good” hash => random {0,1}^64
2. Keep track of longest run of leading zeroes
3. Longest run of length k => cardinality ≈ 2^k
Crazy math business
Correct systematic bias with a derived constant
Stochastic averaging
Balls and bins correction
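The steps above can be sketched in Python as follows. This is a minimal illustration, not AK's implementation: the class name, the use of SHA-1 as the "good" hash, and the default of 2^12 registers are all assumptions. "Balls and bins correction" is assumed here to mean the small-range linear-counting correction from the HyperLogLog paper.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch with m = 2**p registers (p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(value):
        # Assumption: first 8 bytes of SHA-1 stand in for a "good" 64-bit hash.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        width = 64 - self.p
        index = h >> width               # stochastic averaging: top p bits pick a register
        rest = h & ((1 << width) - 1)    # remaining bits feed the leading-zero count
        rank = width - rest.bit_length() + 1  # leading zeroes in rest, plus one
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: linear counting over the empty registers.
            return m * math.log(m / zeros)
        return raw

sketch = HyperLogLog()
for user_id in range(100_000):
    sketch.add(user_id)
estimate = sketch.cardinality()
```

With 2^12 registers the expected standard error is about 1.04/sqrt(4096), or roughly 1.6%.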
10
Here’s what you get
Native: union, cardinality
Implies: intersection (!!!), set difference (!!!)
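A sketch of why union is native and the other operations are implied, assuming each HLL is a list of registers of equal length (function names are illustrative). Union is the element-wise maximum of the registers; intersection and set difference then follow from the cardinality estimates by inclusion-exclusion.

```python
def union_registers(a, b):
    """Union of two HLLs with the same register count: element-wise max."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]

def intersection_size(card_a, card_b, card_union):
    """|A ∩ B| = |A| + |B| - |A ∪ B|, clamped at zero since inputs are estimates."""
    return max(0.0, card_a + card_b - card_union)

def difference_size(card_union, card_b):
    """|A \\ B| = |A ∪ B| - |B|, clamped at zero since inputs are estimates."""
    return max(0.0, card_union - card_b)
```

The cardinalities passed to the last two functions are estimates, so their errors carry over into the result.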
11
Show me the money!
Used in production at AK for a year
Accurate: count to a billion with 1-3% error
Small: a few KB each so we can keep 100s of M in memory
Fast: benched at 2M inserts/s, used in production at 100s of K/s
12
Lies, damn lies, and boxplots!
13
But wait, there’s more!
14
Implementation caveats
If you store an HLL for each key, you’ll likely be wasting space when all the registers aren’t set. Use map-based HLL or use compression.
Pick a good hash function!
Test on your data!
Tune parameters to suit your business needs!
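One way to read "map-based HLL": store only the registers that have been set, and fall back to a dense array once the map stops paying for itself. The class below is a hypothetical sketch of that idea, not a description of AK's storage format.

```python
class SparseRegisters:
    """Registers kept in a dict; unset registers are implicitly zero."""

    def __init__(self, m):
        self.m = m            # total number of registers in the full HLL
        self._nonzero = {}    # register index -> rank, only for set registers

    def update(self, index, rank):
        if not 0 <= index < self.m:
            raise IndexError("register index out of range")
        if rank > self._nonzero.get(index, 0):
            self._nonzero[index] = rank

    def get(self, index):
        return self._nonzero.get(index, 0)

    def stored(self):
        """Number of registers actually held in memory."""
        return len(self._nonzero)

    def to_dense(self):
        """Expand to the full register list, e.g. once most registers are set."""
        dense = [0] * self.m
        for index, rank in self._nonzero.items():
            dense[index] = rank
        return dense
```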
15
How we use them, in production
Original problem: fast, on-the-fly overlaps and unique counts
Solution: streaming, in-memory aggregations shipped to Postgres
Postgres module to do set operations on binary representations in the DB
Freebie: PG analytics support like GROUP BY, sliding windows, etc…
16
UI example
To the browser, Robin!
17
How we use them, Ad Hoc
Outside of production: amazing ad-hoc analysis tool
Example: gathering more than a year’s worth of data for an RFP, at 20B impressions/month
painless and quick when we had the data as sketches
much more effort to put it through Hadoop
Iterating on product and research is cheaper and faster. Waiting minutes instead of seconds between iterations is painful.
18
“Soft” Caveats
Fixed N% error is deceiving
Additive error for set operations can balloon
Unbounded error sneaks in now and again
19
Parting Advice
Test these on your data rigorously
Choose good hash functions
Tuning parameters are particularly sensitive
You’ll find all kinds of unexpected uses for them, so get building!
Bibliography blog post will be up in a bit!
21
Credits
All the adorable cartoons you saw in this presentation were taken from http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong to him/her.