No BS Data Salon #3: Probabilistic Sketching

Analytics + Attribution = Actionable Insights

No BS Data Salon #3: Probabilistic Sketching

May 2012


DESCRIPTION

Timon Karnezos' presentation on probabilistic sketching and distinct counting with HyperLogLog from the third No BS Data Salon on May 19th, 2012.

TRANSCRIPT

Page 1: No BS Data Salon #3: Probabilistic Sketching

Analytics + Attribution = Actionable Insights

No BS Data Salon #3: Probabilistic Sketching

May 2012

Page 2: No BS Data Salon #3: Probabilistic Sketching

Outline

What we do at AK

What’s sketching?

Our motivation for sketching

Why should you sketch?

Our case: unique counting

How it works

How well it works

How we use them

Page 3: No BS Data Salon #3: Probabilistic Sketching

Here’s what we do at AK.

Online ad analytics

Compare performance of different: campaigns, inventory, providers, creatives, etc…

Bottom line: give the advertisers insight into the performance of their ads.

Page 4: No BS Data Salon #3: Probabilistic Sketching

Motivation

High throughput: 10s of K/s => 100s of K/s

High dimensionality: 100M+ reporting keys

Easy aggregates: counters, scalars

Hard aggregates: unique user counting, set operations

No cheap or effective “online” solutions:

Streaming DBs (Truviso, Coral8, StreamBase) insufficient

Warehouse appliances (Aster, custom PG) same

Our data is immutable. Paying for unneeded ACID is silly.

Offline solutions slow, operationally finicky.

Not a bank. We don’t need to be perfect, just useful.

Page 5: No BS Data Salon #3: Probabilistic Sketching

Why should you bother?

SELECT campaign_id, COUNT(DISTINCT user_id)

FROM access_logs

GROUP BY campaign_id
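What makes this query expensive is memory: an exact COUNT(DISTINCT …) has to remember every user id it has seen per campaign. A minimal Python sketch of the work the database does (toy, hypothetical data):

```python
from collections import defaultdict

# Toy stand-in for access_logs: (campaign_id, user_id) pairs.
access_logs = [(1, 'a'), (1, 'b'), (1, 'a'), (2, 'a'), (2, 'c'), (2, 'c')]

# One full hash set of user ids per reporting key: memory grows
# linearly with the number of uniques under each key.
uniques = defaultdict(set)
for campaign_id, user_id in access_logs:
    uniques[campaign_id].add(user_id)

counts = {k: len(v) for k, v in uniques.items()}
print(counts)  # {1: 2, 2: 2}
```

At 100M+ reporting keys and billions of users, holding those sets (or sorting the ids) is what makes the exact query slow and big; the sketch replaces each set with a few KB of fixed-size state.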

Page 6: No BS Data Salon #3: Probabilistic Sketching

What is probabilistic sketching?

One-pass

“Small” memory

Probabilistic error

Page 7: No BS Data Salon #3: Probabilistic Sketching

Our Case Study: unique counting

Non-unique stream of ints

Want to keep unique count, up to about a billion

Want to do set operations (union, intersection, set difference)

Straw Man #1: “Put them in a HashSet, and go away.”

(Maybe) Straw Man #2: “Fine, keep a sample.”

How we did it: HyperLogLog

Page 8: No BS Data Salon #3: Probabilistic Sketching

How it works

The Papers:

LogLog Counting of Large Cardinalities
Marianne Durand and Philippe Flajolet (RIP 2011), 2003

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
Flajolet, Fusy, Gandouet, Meunier, 2007

The (rudimentary, unrigorous) intuition:

Have n people flip fair coins. The longest streak of heads has length k and is seen once.

Probability of such a streak ≈ (½)^k

E[X] = n · p = 1 with p = (½)^k => n ≈ 2^k
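The intuition above can be checked with a quick simulation (a toy sketch with hypothetical trial counts): have n "flippers" each flip a fair coin until tails, then compare 2^k, for the longest heads streak k, against n.

```python
import random

random.seed(42)

def longest_heads_streak(n_flippers):
    # Each flipper flips a fair coin until tails; track the longest heads run.
    best = 0
    for _ in range(n_flippers):
        streak = 0
        while random.random() < 0.5:
            streak += 1
        best = max(best, streak)
    return best

for n in (1_000, 10_000, 100_000):
    k = longest_heads_streak(n)
    print(f"n={n}: longest streak {k}, estimate 2^{k} = {2 ** k}")
```

A single such estimator is only right to within a factor of a few, which is why HyperLogLog keeps many of them and averages (the "stochastic averaging" on the next slide).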

Page 9: No BS Data Salon #3: Probabilistic Sketching

How it works cont’d

1. Stream of int64 => “good” hash => random {0,1}^64

2. Keep track of longest run of leading zeroes

3. Longest run of length k => cardinality ≈ 2^k

Crazy math business:

Correct systematic bias with a derived constant

Stochastic averaging

Balls and bins correction
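The three steps, plus the corrections, fit in a few dozen lines of Python. This is a toy illustration under assumed parameters (2^12 registers, SHA-1 as the "good" hash), not AK's production code:

```python
import hashlib
import math

P = 12                               # 2^12 = 4096 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)     # derived bias-correction constant

def hash64(x):
    # Step 1: a "good" hash turns each input into 64 random-looking bits.
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], 'big')

def add(registers, x):
    h = hash64(x)
    j = h >> (64 - P)                        # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64-P bits
    # Step 2: rank = position of the first 1-bit (leading zeroes + 1).
    rank = (64 - P) - rest.bit_length() + 1
    registers[j] = max(registers[j], rank)   # stochastic averaging: per-register max

def cardinality(registers):
    # Step 3: harmonic mean of 2^register, scaled by the derived constant.
    est = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if est <= 2.5 * M and zeros:
        est = M * math.log(M / zeros)        # balls-and-bins small-range correction
    return est

regs = [0] * M
for i in range(100_000):
    add(regs, i)
print(round(cardinality(regs)))              # within a few percent of 100,000
```

With 2^P registers the standard error is about 1.04 / sqrt(2^P), so P trades memory for accuracy directly.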

Page 10: No BS Data Salon #3: Probabilistic Sketching

Here’s what you get

Native: union, cardinality

Implies: intersection (!!!), set difference (!!!)
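Union of two HLLs is computed register-wise (take the max of each register pair) and cardinality is native, so intersection and set difference fall out of inclusion-exclusion. The identity, illustrated here with exact sets (the HLL versions substitute an estimate for each term):

```python
# |A ∩ B| = |A| + |B| - |A ∪ B|: only unions and cardinalities needed.
a = set(range(0, 70_000))
b = set(range(50_000, 120_000))

union = len(a | b)                        # with HLLs: register-wise max, then estimate
intersection = len(a) + len(b) - union    # derived, never stored
difference = len(a) - intersection        # |A \ B| falls out the same way

print(intersection, difference)           # 20000 50000
```

Because each term on the right-hand side is an estimate, the derived intersection inherits (and can amplify) their errors; see the "soft" caveats later.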

Page 11: No BS Data Salon #3: Probabilistic Sketching

Show me the money!

Used in production at AK for a year

Accurate: count to a billion with 1-3% error

Small: a few KB each so we can keep 100s of M in memory

Fast: benched at 2M inserts/s, used in production at 100s of K/s

Page 12: No BS Data Salon #3: Probabilistic Sketching

Lies, damn lies, and boxplots!

Page 13: No BS Data Salon #3: Probabilistic Sketching

But wait, there’s more!

Page 14: No BS Data Salon #3: Probabilistic Sketching

Implementation caveats

If you store an HLL for each key, you’ll likely be wasting space when all the registers aren’t set. Use map-based HLL or use compression.

Pick a good hash function!

Test on your data!

Tune parameters to suit your business needs!

Page 15: No BS Data Salon #3: Probabilistic Sketching

How we use them, in production

Original problem: fast, on-the-fly overlaps and unique counts

Solution: streaming, in-memory aggregations shipped to Postgres

Postgres module to do set operations on binary representations in the DB

Freebie: PG analytics support like GROUP BY, sliding windows, etc…

Page 16: No BS Data Salon #3: Probabilistic Sketching

UI example

To the browser, Robin!

Page 17: No BS Data Salon #3: Probabilistic Sketching

How we use them, Ad Hoc

Outside of production: amazing ad-hoc analysis tool

Example: gathering more than a year’s worth of data for an RFP, at 20B impressions/month. Painless and quick when we had the data as sketches; much more effort to put it through Hadoop.

Iterating on product and research is cheaper and faster. Waiting minutes instead of seconds between iterations is painful.

Page 18: No BS Data Salon #3: Probabilistic Sketching

“Soft” Caveats

Fixed N% error is deceiving

Additive error for set operations can balloon

Unbounded error sneaks in now and again
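A hypothetical worked example of the ballooning (the numbers are assumed for illustration, using the ~2% per-sketch error quoted earlier): intersect two 100M-unique audiences whose true overlap is 1M.

```python
rel_err = 0.02            # ~2% per-sketch relative error (from the 1-3% range)
a_est, b_est = 100e6, 100e6
true_intersection = 1e6

union = a_est + b_est - true_intersection       # ≈ 199M
# Inclusion-exclusion sums three estimates; in the worst case their
# absolute errors add instead of cancelling:
worst_abs_error = rel_err * (a_est + b_est + union)
print(worst_abs_error / true_intersection)      # ≈ 8x the true answer
```

The per-sketch error is relative to the big sets, but the result is small, so a "fixed N%" guarantee says almost nothing about the derived intersection.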

Page 19: No BS Data Salon #3: Probabilistic Sketching

Parting Advice

Test these on your data rigorously

Choose good hash functions

Tuning parameters are particularly sensitive

You’ll find all kinds of unexpected uses for them, so get building!

Bibliography blog post will be up in a bit!

Page 20: No BS Data Salon #3: Probabilistic Sketching

Questions?

@timonk

[email protected]

blog.aggregateknowledge.com

Page 21: No BS Data Salon #3: Probabilistic Sketching

Credits

All the adorable cartoons in this presentation were taken from http://sureilldrawthat.com/ and http://sureilldrawthat.tumblr.com/ and belong to their creator.