No BS Data Salon #3: Probabilistic Sketching

Analytics + Attribution = Actionable Insights

No BS Data Salon #3: Probabilistic Sketching

May 2012


DESCRIPTION

Timon Karnezos' presentation on probabilistic sketching and distinct counting with HyperLogLog from the third No BS Data Salon on May 19th, 2012.

TRANSCRIPT

Page 1: No BS Data Salon #3: Probabilistic Sketching

Analytics + Attribution = Actionable Insights

No BS Data Salon #3: Probabilistic Sketching

May 2012

Page 2: No BS Data Salon #3: Probabilistic Sketching

Outline

What we do at AK

What’s sketching?

Our motivation for sketching

Why should you sketch?

Our case: unique counting

How it works

How well it works

How we use them

Page 3: No BS Data Salon #3: Probabilistic Sketching

Here’s what we do at AK.

Online ad analytics

Compare performance of different: campaigns, inventory, providers, creatives, etc…

Bottom line: give the advertisers insight into the performance of their ads.

Page 4: No BS Data Salon #3: Probabilistic Sketching

Motivation

High throughput: 10s of K/s => 100s of K/s

High dimensionality: 100M+ reporting keys

Easy aggregates: counters, scalars

Hard aggregates: unique user counting, set operations

No cheap or effective “online” solutions:

Streaming DBs (Truviso, Coral8, StreamBase) insufficient

Warehouse appliances (Aster, custom PG) same

Our data is immutable. Paying for unneeded ACID is silly.

Offline solutions slow, operationally finicky.

Not a bank. We don’t need to be perfect, just useful.

Page 5: No BS Data Salon #3: Probabilistic Sketching

Why should you bother?

SELECT campaign_id, COUNT(DISTINCT user_id)

FROM access_logs

GROUP BY campaign_id
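What makes this query expensive is memory: an exact COUNT(DISTINCT …) has to remember every user id it has seen per campaign. A minimal Python sketch of the work the database does (toy, hypothetical data):

```python
from collections import defaultdict

# Toy stand-in for access_logs: (campaign_id, user_id) pairs.
access_logs = [(1, 'a'), (1, 'b'), (1, 'a'), (2, 'a'), (2, 'c'), (2, 'c')]

# One full hash set of user ids per reporting key: memory grows
# linearly with the number of uniques under each key.
uniques = defaultdict(set)
for campaign_id, user_id in access_logs:
    uniques[campaign_id].add(user_id)

counts = {k: len(v) for k, v in uniques.items()}
print(counts)  # {1: 2, 2: 2}
```

At 100M+ reporting keys and billions of users, holding those sets (or sorting the ids) is what makes the exact query slow and big; the sketch replaces each set with a few KB of fixed-size state.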

Page 6: No BS Data Salon #3: Probabilistic Sketching

What is probabilistic sketching?

One-pass

“Small” memory

Probabilistic error

Page 7: No BS Data Salon #3: Probabilistic Sketching

Our Case Study: unique counting

Non-unique stream of ints

Want to keep unique count, up to about a billion

Want to do set operations (union, intersection, set difference)

Straw Man #1: “Put them in a HashSet, and go away.”

(Maybe) Straw Man #2: “Fine, keep a sample.”

How we did it: HyperLogLog

Page 8: No BS Data Salon #3: Probabilistic Sketching

How it works

The Papers:

LogLog Counting of Large Cardinalities
Marianne Durand and Philippe Flajolet (RIP 2011), 2003

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
Flajolet, Fusy, Gandouet, Meunier, 2007

The (rudimentary, unrigorous) intuition:

Have n people flip fair coins. The longest streak of heads has length k and is seen once.

Probability of such a streak ≈ (½)^k

E[X] = n · p = 1 with p = (½)^k => n ≈ 2^k
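The intuition above can be checked with a quick simulation (a toy sketch with hypothetical trial counts): have n "flippers" each flip a fair coin until tails, then compare 2^k, for the longest heads streak k, against n.

```python
import random

random.seed(42)

def longest_heads_streak(n_flippers):
    # Each flipper flips a fair coin until tails; track the longest heads run.
    best = 0
    for _ in range(n_flippers):
        streak = 0
        while random.random() < 0.5:
            streak += 1
        best = max(best, streak)
    return best

for n in (1_000, 10_000, 100_000):
    k = longest_heads_streak(n)
    print(f"n={n}: longest streak {k}, estimate 2^{k} = {2 ** k}")
```

A single such estimator is only right to within a factor of a few, which is why HyperLogLog keeps many of them and averages (the "stochastic averaging" on the next slide).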

Page 9: No BS Data Salon #3: Probabilistic Sketching

How it works cont’d

1. Stream of int64 => “good” hash => random {0,1}^64

2. Keep track of longest run of leading zeroes

3. Longest run of length k => cardinality ≈ 2^k

Crazy math business:

Correct systematic bias with a derived constant

Stochastic averaging

Balls and bins correction
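The three steps, plus the corrections, fit in a few dozen lines of Python. This is a toy illustration under assumed parameters (2^12 registers, SHA-1 as the "good" hash), not AK's production code:

```python
import hashlib
import math

P = 12                               # 2^12 = 4096 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)     # derived bias-correction constant

def hash64(x):
    # Step 1: a "good" hash turns each input into 64 random-looking bits.
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], 'big')

def add(registers, x):
    h = hash64(x)
    j = h >> (64 - P)                        # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64-P bits
    # Step 2: rank = position of the first 1-bit (leading zeroes + 1).
    rank = (64 - P) - rest.bit_length() + 1
    registers[j] = max(registers[j], rank)   # stochastic averaging: per-register max

def cardinality(registers):
    # Step 3: harmonic mean of 2^register, scaled by the derived constant.
    est = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if est <= 2.5 * M and zeros:
        est = M * math.log(M / zeros)        # balls-and-bins small-range correction
    return est

regs = [0] * M
for i in range(100_000):
    add(regs, i)
print(round(cardinality(regs)))              # within a few percent of 100,000
```

With 2^P registers the standard error is about 1.04 / sqrt(2^P), so P trades memory for accuracy directly.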

Page 10: No BS Data Salon #3: Probabilistic Sketching

Here’s what you get

Native: union, cardinality

Implies: intersection (!!!), set difference (!!!)
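Union of two HLLs is computed register-wise (take the max of each register pair) and cardinality is native, so intersection and set difference fall out of inclusion-exclusion. The identity, illustrated here with exact sets (the HLL versions substitute an estimate for each term):

```python
# |A ∩ B| = |A| + |B| - |A ∪ B|: only unions and cardinalities needed.
a = set(range(0, 70_000))
b = set(range(50_000, 120_000))

union = len(a | b)                        # with HLLs: register-wise max, then estimate
intersection = len(a) + len(b) - union    # derived, never stored
difference = len(a) - intersection        # |A \ B| falls out the same way

print(intersection, difference)           # 20000 50000
```

Because each term on the right-hand side is an estimate, the derived intersection inherits (and can amplify) their errors; see the "soft" caveats later.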

Page 11: No BS Data Salon #3: Probabilistic Sketching

Show me the money!

Used in production at AK for a year

Accurate: count to a billion with 1-3% error

Small: a few KB each so we can keep 100s of M in memory

Fast: benched at 2M inserts/s, used in production at 100s of K/s

Page 12: No BS Data Salon #3: Probabilistic Sketching

Lies, damn lies, and boxplots!

Page 13: No BS Data Salon #3: Probabilistic Sketching

But wait, there’s more!

Page 14: No BS Data Salon #3: Probabilistic Sketching

Implementation caveats

If you store an HLL for each key, you’ll likely be wasting space when all the registers aren’t set. Use map-based HLL or use compression.

Pick a good hash function!

Test on your data!

Tune parameters to suit your business needs!

Page 15: No BS Data Salon #3: Probabilistic Sketching

How we use them, in production

Original problem: fast, on-the-fly overlaps and unique counts

Solution: streaming, in-memory aggregations shipped to Postgres

Postgres module to do set operations on binary representations in the DB

Freebie: PG analytics support like GROUP BY, sliding windows, etc…

Page 16: No BS Data Salon #3: Probabilistic Sketching

UI example

To the browser, Robin!

Page 17: No BS Data Salon #3: Probabilistic Sketching

How we use them, Ad Hoc

Outside of production: amazing ad-hoc analysis tool

Example: gathering more than a year’s worth of data for an RFP, at 20B impressions/month. Painless and quick when we had the data as sketches; much more effort to put it through Hadoop.

Iterating on product and research is cheaper and faster. Waiting minutes instead of seconds between iterations is painful.

Page 18: No BS Data Salon #3: Probabilistic Sketching

“Soft” Caveats

Fixed N% error is deceiving

Additive error for set operations can balloon

Unbounded error sneaks in now and again
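A hypothetical worked example of the ballooning (the numbers are assumed for illustration, using the ~2% per-sketch error quoted earlier): intersect two 100M-unique audiences whose true overlap is 1M.

```python
rel_err = 0.02            # ~2% per-sketch relative error (from the 1-3% range)
a_est, b_est = 100e6, 100e6
true_intersection = 1e6

union = a_est + b_est - true_intersection       # ≈ 199M
# Inclusion-exclusion sums three estimates; in the worst case their
# absolute errors add instead of cancelling:
worst_abs_error = rel_err * (a_est + b_est + union)
print(worst_abs_error / true_intersection)      # ≈ 8x the true answer
```

The per-sketch error is relative to the big sets, but the result is small, so a "fixed N%" guarantee says almost nothing about the derived intersection.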

Page 19: No BS Data Salon #3: Probabilistic Sketching

Parting Advice

Test these on your data rigorously

Choose good hash functions

Tuning parameters are particularly sensitive

You’ll find all kinds of unexpected uses for them, so get building!

Bibliography blog post will be up in a bit!

Page 20: No BS Data Salon #3: Probabilistic Sketching

Questions?

@timonk

[email protected]

blog.aggregateknowledge.com

Page 21: No BS Data Salon #3: Probabilistic Sketching

Credits

All the adorable cartoons in this presentation were taken from http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong to their creator.