CS 361A (Advanced Data Structures and Algorithms)
Lectures 16 & 17 (Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani


Page 1:

CS 361A (Advanced Data Structures and Algorithms)

Lectures 16 & 17 (Nov 16 and 28, 2005)

Synopses, Samples, and Sketches

Rajeev Motwani

Page 2:

Game Plan for Week

Last Class

Models for Streaming/Massive Data Sets

Negative results for Exact Distinct Values

Hashing for Approximate Distinct Values

Today

Synopsis Data Structures

Sampling Techniques

Frequency Moments Problem

Sketching Techniques

Finding High-Frequency Items

Page 3:

Synopsis Data Structures

Synopses

Webster – a condensed statement or outline (as of a narrative or treatise)

CS 361A – succinct data structure that lets us answer queries efficiently

Synopsis Data Structures – “Lossy” Summary (of a data stream)

Advantages – fits in memory + easy to communicate

Disadvantage – lossiness implies approximation error

Negative Results – the best we can do

Key Techniques – randomization and hashing

Page 4:

Numerical Examples

Approximate Query Processing [AQUA/Bell Labs]

Database Size – 420 MB

Synopsis Size – 420 KB (0.1%)

Approximation Error – within 10%

Running Time – 0.3% of time for exact query

Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]

Data Size – 10^9 items

Synopsis Size – 1249 items

Approximation Error – within 1%

Page 5:

Synopses

Desiderata

Small Memory Footprint

Quick Update and Query

Provable, low-error guarantees

Composable – for distributed scenario

Applicability?
General-purpose – e.g. random samples

Specific-purpose – e.g. distinct values estimator

Granularity?
Per database – e.g. sample of entire table

Per distinct value – e.g. customer profiles

Structural – e.g. GROUP-BY or JOIN result samples

Page 6:

Examples of Synopses

Synopses need not be fancy!

Simple Aggregates – e.g. mean/median/max/min

Variance?

Random Samples

Aggregates on small samples represent entire data

Leverage extensive work on confidence intervals

Random Sketches

structured samples

Tracking High-Frequency Items

Page 7:

Random Samples

Page 8:

Types of Samples

Oblivious sampling – at item level

o Limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]

Value-based sampling – e.g. distinct-value samples

Structured samples – e.g. join sampling
Naïve approach – keep samples of each relation

Problem – sample-of-join ≠ join-of-samples

Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]

what if A sampled from L and B from R?

[Figure: foreign-key join of relations L and R on join attributes A and B]

Page 9:

Basic Scenario

Goal – maintain uniform sample of item-stream

Sampling Semantics?

Coin flip
o select each item with probability p
o easy to maintain
o undesirable – sample size is unbounded

Fixed-size sample without replacement
o Our focus today

Fixed-size sample with replacement
o Show – can generate from previous sample

Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]

Page 10:

Reservoir Sampling [Vitter]

Input – stream of items X1, X2, X3, …

Goal – maintain uniform random sample S of size n (without replacement) of stream so far

Reservoir Sampling

Initialize – include first n elements in S

Upon seeing item Xt

o Add Xt to S with probability n/t
o If added, evict a random previous item
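The two update steps above can be sketched in Python (a minimal illustrative sketch; the function name and use of Python's `random` module are my own choices, not from the slides):

```python
import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample S of size n, without replacement.

    Keep the first n items; thereafter item X_t (1-indexed) is added
    with probability n/t, evicting a uniformly random current member.
    """
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            S.append(x)
        elif random.random() < n / t:
            S[random.randrange(n)] = x
    return S
```

At any point, the current contents of `S` are a uniform sample of the stream seen so far.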

Page 11:

Analysis

Correctness?

Fact: At each instant, |S| = n

Theorem: At time t, any Xi ∈ S with probability n/t

Exercise – prove via induction on t

Efficiency?
Let N be stream size

Naïve implementation – N coin flips ⇒ time O(N)

E[# updates to S] = Σ_{t=n+1}^{N} n/t = n(H_N − H_n) = O(n log(N/n))

Remark: Verify this is optimal.

Page 12:

Improving Efficiency

Random variable Jt – number jumped over after time t

Idea – generate Jt and skip that many items

Cumulative Distribution Function – F(s) = P[Jt ≤ s], for t>n & s≥0

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14

items inserted into sample S (where n=3); e.g. J3=2, J9=4

F(s) = 1 − ∏_{T=t+1}^{t+s+1} (1 − n/T) = 1 − (t+1−n)^(s+1) / (t+1)^(s+1)

where a^(b) = a(a+1)(a+2)···(a+b−1)

Page 13:

Analysis

Number of calls to RANDOM()?

one per insertion into sample

this is optimal!

Generating Jt?
Pick random number U ∈ [0,1]

Find smallest j such that U ≤ F(j)

How?
o Linear scan ⇒ O(N) time
o Binary search with Newton’s interpolation ⇒ O(n^2 (1 + polylog(N/n))) time

Remark – see paper for optimal algorithm
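The linear-scan inversion can be sketched as follows (an illustrative sketch using the CDF above; the function name is mine). It draws U and walks s upward until U ≤ F(s), doing work proportional to the skip length:

```python
import random

def skip_length(t, n):
    """Generate J_t by inverting F(s) = 1 - prod_{T=t+1}^{t+s+1} (1 - n/T):
    draw U uniform in [0,1) and return the smallest s with U <= F(s)."""
    u = random.random()
    s, tail = 0, 1.0              # tail accumulates prod (1 - n/T)
    while True:
        tail *= 1 - n / (t + s + 1)
        if u <= 1 - tail:         # F(s) = 1 - tail
            return s
        s += 1
```

The reservoir then jumps directly to item t + J_t + 1, which becomes the next insertion.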

Page 14:

Sampling over Sliding Windows [Babcock-Datar-Motwani]

Sliding Window W – last w items in stream

Model – item Xt expires at time t+w

Why?

Applications may require ignoring stale data

Type of approximation

Only way to define JOIN over streams

Goal – Maintain uniform sample of size n of sliding window

Page 15:

Reservoir Sampling?

Observe
any item in sample S will expire eventually

must replace with random item of current window

Problem

no access to items in W-S

storing entire window requires O(w) memory

Oversampling

Backing sample B – select each item with probability θ((n log w)/w)

sample S – select n items from B at random

upon expiry in S, replenish from B

Claim: n < |B| < n log w with high probability

Page 16:

Index-Set Approach

Pick random index set I = {i1, …, in} ⊆ {0, 1, …, w-1}

Sample S – items Xt with (t mod w) ∈ I, within the current window

Example
Suppose – w=2, n=1, and I={1}

Then – sample is always Xi with odd i

Memory – only O(n)

Observe
S is uniform random sample of each window
But sample is periodic (union of arithmetic progressions)
Correlation across successive windows

Problems
Correlation may hurt in some applications
Some data (e.g. time-series) may be periodic
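A minimal sketch of the index-set scheme (function names are mine; membership in the current window follows because we always keep the most recent item at each chosen residue):

```python
import random

def index_set_sample(stream, w, n):
    """Keep X_t whenever t mod w lies in a fixed random index set I of
    size n; the retained items are the most recent with those residues,
    so together they form a sample of the current window."""
    I = set(random.sample(range(w), n))
    latest = {}                      # residue mod w -> most recent item
    for t, x in enumerate(stream):
        if t % w in I:
            latest[t % w] = x
    return list(latest.values())
```

The periodicity problem is visible here: the same residues are sampled in every window.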

Page 17:

Chain-Sample Algorithm

Idea

Fix expiry problem in Reservoir Sampling

Advance planning for expiry of sampled items

Focus on sample size 1 – keep n independent such samples

Chain-Sampling
Add Xt to S with probability 1/min{t,w} – evict earlier sample

Initially – standard Reservoir Sampling up to time w

Pre-select Xt’s replacement Xr ∈ Wt+w = {Xt+1, …, Xt+w}

o Xt expires ⇒ must replace from Wt+w

o At time r, save Xr and pre-select its own replacement ⇒ building “chain” of potential replacements

Note – if evicting earlier sample, discard its “chain” as well
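The chain bookkeeping for a size-1 sample might look like this (an illustrative sketch, not the authors' code; `chain` holds the current sample followed by its pre-selected replacements):

```python
import random

def chain_sample(stream, w):
    """Size-1 uniform sample of the sliding window (last w items).
    chain[0] is the current sample (index, value, replacement_index);
    later entries are its pre-selected replacement chain."""
    chain = []
    for t, x in enumerate(stream, start=1):
        # the pending replacement index has arrived: save it on the chain
        if chain and chain[-1][2] == t:
            chain.append((t, x, random.randint(t + 1, t + w)))
        # sampling rule with probability 1/min(t, w); old chain discarded
        if random.random() < 1 / min(t, w):
            chain = [(t, x, random.randint(t + 1, t + w))]
        # expiry: the sample left the window, promote its replacement
        if t - chain[0][0] >= w:
            chain.pop(0)
    return chain[0][1]
```

For t ≤ w this reduces to reservoir sampling with n = 1; keeping n independent copies gives a size-n sample, as the slide notes.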

Page 18:

Example

Stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3

[Figure: four snapshots of this stream, highlighting the sampled item and its chain of pre-selected replacements as the window slides]

Page 19:

Expectation for Chain-Sample

T(x) = E[chain length for Xt at time t+x]

E[chain length] = T(w) ≤ e ≈ 2.718

E[memory required for sample size n] = O(n)

T(x) = 1, for x ≤ 1

T(x) = 1 + (1/w) Σ_{i<x} T(i), for x > 1

Page 20:

Tail Bound for Chain-Sample

Chain = “hops” of total length at most w

Chain of h hops ⇒ ordered (h+1)-partition of w:
h hops of total length less than w, plus the remainder

Each partition has probability w^(-h)

Number of partitions: C(w, h) ≤ (we/h)^h

h = O(log w) ⇒ probability of a partition is O(w^(-c))

Thus – memory O(n log w) with high probability

Page 21:

Comparison of Algorithms

Chain-Sample beats Oversample:

Expected memory – O(n) vs O(n log w)

High-probability memory bound – both O(n log w)

Oversample may have sample size shrink below n!

Algorithm     | Expected   | High-Probability
Periodic      | O(n)       | O(n)
Oversample    | O(n log w) | O(n log w)
Chain-Sample  | O(n)       | O(n log w)

Page 22:

Sketches and Frequency Moments

Page 23:

Generalized Stream Model

Input Element (i,a)

a copies of domain-value i

increment to ith dimension of m by a

a need not be an integer

Negative value – captures deletions

Data stream: 2, 0, 1, 3, 1, 2, 4, . . .

frequency vector so far: (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)

Page 24:

Example

Start: (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)

On seeing element (i,a) = (2,2): m = (1, 2, 4, 1, 1)

On seeing element (i,a) = (1,-1): m = (1, 1, 4, 1, 1)

Page 25:

Frequency Moments

Input Stream

values from U = {0,1,…,N-1}

frequency vector m = (m0,m1,…,mN-1)

Kth Frequency Moment – Fk(m) = Σi mi^k

F0: number of distinct values (Lecture 15)

F1: stream size

F2: Gini index, self-join size, Euclidean norm

Fk: for k>2, measures skew, sometimes useful

F∞: maximum frequency

Problem – estimation in small space

Sketches – randomized estimators
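For reference, the moments are trivial to compute exactly when memory is no concern, which gives ground truth for checking the estimators that follow (a sketch with my own names; the example stream is the one used for the Fk estimator later in these slides):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum over values of m_i^k for the stream's frequencies."""
    return sum(c ** k for c in Counter(stream).values())

example = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]
```

Note F0 counts distinct values, F1 is the stream size, and F2 is the self-join size.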

Page 26:

Naive Approaches

Space N – counter mi for each distinct value i

Space O(1)

if input sorted by i

single counter recycled when new i value appears

Goal

Allow arbitrary input

Use small (logarithmic) space

Settle for randomization/approximation

Page 27:

Sketching F2

Random Hash h(i): {0,1,…,N-1} → {-1,1}

Define Zi = h(i)

Maintain X = Σi miZi

Easy for update streams (i,a) – just add aZi to X

Claim: X² is unbiased estimator for F2

Proof: E[X²] = E[(Σi miZi)²]
= E[Σi mi²Zi²] + E[Σi≠j mimjZiZj]
= Σi mi²E[Zi²] + Σi≠j mimjE[Zi]E[Zj]    (from independence)
= Σi mi² + 0 = F2

Last line? – Zi² = 1 and E[Zi] = 0, as Zi uniform on {-1,1}
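A minimal sketch of this estimator over update streams (names are mine; for brevity I draw a fully independent random sign per value, which is stronger than the 4-wise independence the lecture later shows suffices):

```python
import random

def f2_sketch(stream, seed):
    """One AMS estimate of F2: maintain X = sum_i m_i Z_i over an update
    stream of (value, count) pairs, Z_i = h(i) in {-1,+1}; return X^2."""
    rng = random.Random(seed)
    sign = {}                         # value i -> Z_i
    X = 0
    for i, a in stream:
        if i not in sign:
            sign[i] = rng.choice((-1, 1))
        X += a * sign[i]              # just add a * Z_i
    return X * X

def f2_estimate(stream, copies):
    """Average independent copies to cut the variance by `copies`."""
    return sum(f2_sketch(stream, s) for s in range(copies)) / copies
```

Each copy is unbiased; averaging is exactly the variance-reduction step of the next slides.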

Page 28:

Estimation Error?

Chebyshev bound: P[ |Y − E[Y]| ≥ λ·E[Y] ] ≤ Var[Y] / (λ²·E[Y]²)

Define Y = X² ⇒ E[Y] = E[X²] = Σi mi² = F2

Observe E[X⁴] = E[(Σi miZi)⁴]
= E[Σ mi⁴Zi⁴] + 4E[Σ mimj³ZiZj³] + 6E[Σ mi²mj²Zi²Zj²] + 12E[Σ mimjmk²ZiZjZk²] + 24E[Σ mimjmkml ZiZjZkZl]
= Σ mi⁴ + 6Σ mi²mj²    (Why? – every term with an unpaired Zi has zero expectation)

By definition Var[Y] = E[Y²] − E[Y]² = E[X⁴] − E[X²]²
= [Σ mi⁴ + 6Σ mi²mj²] − [Σ mi⁴ + 2Σ mi²mj²]
= 4Σ mi²mj² ≤ 2·E[X²]² = 2F2²

Page 29:

Estimation Error?

Chebyshev bound: P[ |Y − E[Y]| ≥ λ·E[Y] ] ≤ Var[Y] / (λ²·E[Y]²)

P[relative estimation error > λ] ≤ 2F2² / (λ²F2²) = 2/λ²

Problem – What if we want λ really small?

Solution
Compute s = 8/λ² independent copies of X
Estimator Y = mean(Xi²)
Variance reduces by factor s

P[relative estimation error > λ] ≤ 2F2² / (sλ²F2²) = 1/4

Page 30:

Boosting Technique

Algorithm A: Randomized λ-approximate estimator f

P[(1- λ)f* ≤ f ≤ (1+ λ)f*] = 3/4

Heavy Tail Problem: P[f*–z, f*, f*+z] = [1/16, 3/4, 3/16]

Boosting Idea
O(log 1/ε) independent estimates from A(X)

Return median of estimates

Claim: P[median is λ-approximate] > 1 − ε

Proof:
P[specific estimate is λ-approximate] = 3/4

Bad event only if >50% estimates not λ-approximate

Binomial tail – probability less than ε
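The median trick in isolation can be sketched as follows (the noisy estimator in the test is a stand-in of my own, not from the lecture):

```python
def median_boost(draw, trials):
    """Take `trials` independent estimates from draw() and return their
    median: if each estimate is good with probability 3/4, the median
    is bad only when more than half the trials fail (binomial tail)."""
    estimates = sorted(draw() for _ in range(trials))
    return estimates[len(estimates) // 2]
```

Unlike the mean, the median is immune to the heavy-tail outliers described above.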

Page 31:

Overall Space Requirement

Observe

Let m = Σmi

Each counter needs O(log m) bits

s = 8/λ² hash functions for each estimator

O(log 1/ε) such estimators

Total O(λ⁻² (log 1/ε)(log m)) bits

Question – Space for storing hash function?

Page 32:

Sketching Paradigm

Random Sketch: inner product ⟨Z, f⟩ = Σi Zi f(i)

frequency vector m = (m0,m1,…,mN-1)

random vector Z (currently, uniform {-1,1})

Observe – Linearity: Sketch(m1) ± Sketch(m2) = Sketch(m1 ± m2)

Ideal for distributed computing

Observe
Suppose: Given i, can efficiently generate Zi

Then: can maintain sketch for update streams

Problem
o Must generate Zi = h(i) on first appearance of i
o Need Ω(N) memory to store h explicitly
o Need Ω(N) random bits

Page 33:

Two birds, One stone

Pairwise Independent Z1, Z2, …, Zn

for all Zi and Zk: P[Zi=x, Zk=y] = P[Zi=x]·P[Zk=y]

property ⇒ E[ZiZk] = E[Zi]·E[Zk]

Example – linear hash function
Seed S = <a,b> from [0..p-1], where p is prime

Zi = h(i) = ai+b (mod p)

Claim: Z1,Z2, …, Zn are pairwise independent

Zi=x and Zk=y ⇒ x = ai+b (mod p) and y = ak+b (mod p)

fixing i, k, x, y ⇒ unique solution for a, b

P[Zi=x, Zk=y] = 1/p² = P[Zi=x]·P[Zk=y]

Memory/Randomness: n log p → 2 log p
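The family is tiny to implement, and for a small prime the pairwise-independence claim can be checked exhaustively over all p² seeds (a sketch with my own function names):

```python
def linear_hash(a, b, p):
    """h(i) = a*i + b (mod p); drawing the seed (a, b) uniformly from
    [0, p-1]^2 with p prime gives a pairwise-independent family."""
    return lambda i: (a * i + b) % p

def pair_distribution(p, i, k):
    """Count, over all p^2 seeds, how often each (h(i), h(k)) pair occurs."""
    counts = {}
    for a in range(p):
        for b in range(p):
            h = linear_hash(a, b, p)
            key = (h(i), h(k))
            counts[key] = counts.get(key, 0) + 1
    return counts

counts = pair_distribution(5, 1, 2)
```

Every pair (x, y) occurs for exactly one seed, i.e. with probability 1/p², matching the claim.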

Page 34:

Wait a minute!

Doesn’t pairwise independence screw up proofs?

No – E[X²] calculation only has degree-2 terms

But – what about Var[X²]?

Need 4-wise independence

Page 35:

Application – Join-Size Estimation

Given
Join attribute frequencies f1 and f2
Join size = f1·f2

Define – X1 = f1·Z and X2 = f2·Z

Choose – Z as 4-wise independent & uniform {-1,1}

Exercise: Show, as before,

E[X1X2] = f1·f2

Var[X1X2] ≤ 2·|f1|²·|f2|²

Hint: a·b ≤ |a|·|b|

Page 36:

Bounding Error Probability

Using s copies of X’s & taking their mean Y

Pr[ |Y − f1·f2| ≥ λ·f1·f2 ] ≤ Var(Y) / (λ²(f1·f2)²)
≤ 2·|f1|²·|f2|² / (sλ²(f1·f2)²)
= 2 / (sλ² cos²θ)

Bounding error probability?
Need – s > 2/(λ² cos²θ)

Memory? – O(λ⁻² cos⁻²θ (log 1/ε)(log N + log m))

Problem
To choose s – need a-priori lower bound on cos θ = (f1·f2)/(|f1||f2|)

What if cos θ really small?

Page 37:

Sketch Partitioning

[Figure: frequency histograms over dom(R1.A) and dom(R2.B), each split into a heavy region and a light region]

Without partitioning: self-join(R1.A) × self-join(R2.B) = 205 × 205 ≈ 42K

With partitioning: summing self-join(R1.A) × self-join(R2.B) over the two regions gives 200×5 + 200×5 = 2K

Idea for dealing with the |f1|²|f2|²/(f1·f2)² issue – partition the domain into regions where self-join size is smaller, to compensate for small join-size (cos θ)

Page 38:

Sketch Partitioning

Idea

intelligently partition join-attribute space

need coarse statistics on stream

build independent sketches for each partition

Estimate = Σ partition sketches

Variance = Σ partition variances

Page 39:

Sketch Partitioning

Partition Space Allocation?
Can solve optimally, given domain partition

Optimal Partition: Find K-partition to minimize Σ_{i=1}^{K} Var[Xi], where each Var[Xi] is governed by the partition’s self-join sizes

Results

Dynamic Programming – optimal solution for single join

NP-hard – for queries with multiple joins

Page 40:

Fk for k > 2

Assume – stream length m is known
(Exercise: Show can fix with log m space overhead by repeated-doubling estimate of m.)

Choose – random stream item ap, with p uniform from {1,2,…,m}

Suppose – ap = v ∈ {0,1,…,N-1}

Count subsequent frequency of v:
r = | {q : q ≥ p, aq = v} |

Define X = m(r^k − (r−1)^k)

Page 41:

Example

Stream: 7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8

m = 20, p = 9, ap = 5, r = 3
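The estimator can be sketched directly (names are mine; `fk_sample` takes the position p as an argument so that unbiasedness can be checked exhaustively over all m choices):

```python
import random

def fk_sample(stream, k, p):
    """X = m (r^k - (r-1)^k), where r counts occurrences of a_p at
    positions >= p (p is 0-indexed here)."""
    m = len(stream)
    v = stream[p]
    r = sum(1 for q in range(p, m) if stream[q] == v)
    return m * (r ** k - (r - 1) ** k)

def fk_estimate_once(stream, k, rng):
    """One draw of the estimator with p uniform over positions."""
    return fk_sample(stream, k, rng.randrange(len(stream)))

example = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]
```

Averaging `fk_sample` over all positions recovers F_k exactly, since the terms for each value telescope.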

Page 42:

Fk for k > 2

E[X] = (m/m) · Σv [1^k + (2^k − 1^k) + … + (mv^k − (mv−1)^k)]
= Σv mv^k = Fk
(summing over the m choices of stream element; each value v’s terms telescope)

Var(X) ≤ k·N^(1−1/k)·Fk²

Bounded Error Probability ⇒ s = O(k·N^(1−1/k) / λ²)

Boosting ⇒ memory bound O(k·N^(1−1/k)·λ⁻²·(log 1/ε)(log N + log m))

Page 43:

Frequency Moments

F0 – distinct values problem (Lecture 15)

F1 – sequence length
for case with deletions, use Cauchy distribution

F2 – self-join size/Gini index (Today)

Fk for k > 2
omitting grungy details
can achieve space bound O(k·N^(1−1/k)·λ⁻²·(log 1/ε)(log N + log m))

F∞ – maximum frequency

Page 44:

Communication Complexity

Alice (input A) and Bob (input B) cooperatively compute function f(A,B)
Minimize bits communicated

Unbounded computational power

Communication Complexity C(f) – bits exchanged by optimal protocol Π

Protocols?
1-way versus 2-way
deterministic versus randomized

Cδ(f) – randomized complexity for error probability δ

Page 45:

Streaming & Communication Complexity

Stream Algorithm ⇒ 1-way communication protocol

Simulation Argument
Given – algorithm S computing f over streams

Alice – initiates S, providing A as input stream prefix

Communicates to Bob – S’s state after seeing A

Bob – resumes S, providing B as input stream suffix

Theorem – Stream algorithm’s space requirement is at least the communication complexity C(f)

Page 46:

Example: Set DisjointnessExample: Set DisjointnessSet Disjointness (DIS)

A, B subsets of {1,2,…,N}

Output

Theorem: Cδ(DIS) = Ω(N), for any δ<1/2

φBA0

φBA1

Page 47:

Lower Bound for F∞

Theorem: Fix ε < 1/3, δ < 1/2. Any stream algorithm S with
P[ (1−ε)F∞ < S < (1+ε)F∞ ] > 1−δ
needs Ω(N) space

Proof
Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)

Alice streams set A to S

Communicates S’s state to Bob

Bob streams set B to S

Observe
F∞ = 1 if A ∩ B = φ, and F∞ ≥ 2 if A ∩ B ≠ φ

Relative error ε < 1/3 ⇒ DIS solved exactly!

P[error] < δ < ½ ⇒ Ω(N) space

Page 48:

Extensions

Observe
Used only 1-way communication in proof
Cδ(DIS) bound was for arbitrary communication
Exercise – extend lower bound to multi-pass algorithms

Lower Bound for Fk, k > 2
Need to increase gap beyond 2
Multiparty Set Disjointness – t players

Theorem: Fix ε, δ < ½ and k > 5. Any stream algorithm S with
P[ (1−ε)Fk < S < (1+ε)Fk ] > 1−δ
needs Ω(N^(1−(2+δ)/k)) space

Implies Ω(N^(1/2)) even for multi-pass algorithms

Page 49:

Tracking High-Frequency Items

Page 50:

Problem 1 – Top-K List [Charikar-Chen-Farach-Colton]

The Google Problem

Return list of k most frequent items in stream

Motivation

search engine queries, network traffic, …

Remember

Saw lower bound recently!

Solution

Data structure Count-Sketch maintaining count-estimates of high-frequency elements

Page 51:

Definitions

Notation
Assume {1, 2, …, N} in order of frequency
mi is frequency of ith most frequent element
m = Σmi is number of elements in stream

FindCandidateTop
Input: stream S, int k, int p
Output: list of p elements containing top k
Naive sampling gives solution with p = O(m log k / mk)

FindApproxTop
Input: stream S, int k, real ε
Output: list of k elements, each of frequency mi > (1−ε) mk
Naive sampling gives no solution

Page 52:

Main Idea

Consider
single counter X
hash function h(i): {1, 2,…,N} → {-1,+1}

Input element i ⇒ update counter X += Zi = h(i)

For each r, use X·Zr as estimator of mr

Theorem: E[XZr] = mr

Proof
X = Σi miZi
E[XZr] = E[Σi miZiZr] = Σi miE[ZiZr] = mrE[Zr²] = mr
Cross-terms cancel

Page 53:

Finding Max Frequency Element

Problem – Var[X] = F2 = Σi mi²

Idea – t counters with independent 4-wise hashes h1, …, ht : i → {+1, –1}

Use t = O((Σi mi²)·log m / (m1)²)

Claim: New Variance ≤ (Σi mi²)/t = (m1)² / log m

Overall Estimator
repeat + median of averages
with high probability, approximate m1

Page 54:

Problem with “Array of Counters”

Variance – dominated by highest frequency

Estimates for less-frequent elements (like k) corrupted by higher frequencies ⇒ variance >> mk

Avoiding Collisions?

spread out high frequency elements

replace each counter with hashtable of b counters

Page 55:

Count Sketch

Hash Functions
4-wise independent hashes h1, ..., ht and s1, …, st
hashes independent of each other
sr : i → {1, ..., b} (bucket), hr : i → {+1, -1} (sign)

Data structure: t hashtables of b counters X(r,c), r ∈ {1, ..., t}, c ∈ {1, ..., b}

Page 56:

Overall Algorithm

sr(i) – one of b counters in rth hashtable

Input i ⇒ for each r, update X(r, sr(i)) += hr(i)

Estimator(mi) = medianr { X(r, sr(i)) · hr(i) }

Maintain heap of k top elements seen so far

Observe
Not completely eliminated collision with high frequency items

Few of estimates X(r,sr(i)) • hr(i) could have high variance

Median not sensitive to these poor estimates
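Putting the pieces together (an illustrative sketch: the hashes here are lazily memoized random maps rather than true 4-wise independent families, which keeps the demo short but forfeits the stated space bound):

```python
import random

class CountSketch:
    """t hashtables of b counters; estimate(i) is the median over rows
    of the signed counter X(r, s_r(i)) * h_r(i)."""
    def __init__(self, t, b, seed=0):
        self.t, self.b = t, b
        self.rng = random.Random(seed)
        self.X = [[0] * b for _ in range(t)]
        self.bucket = [{} for _ in range(t)]   # memoized s_r
        self.sign = [{} for _ in range(t)]     # memoized h_r

    def _hash(self, r, i):
        if i not in self.bucket[r]:
            self.bucket[r][i] = self.rng.randrange(self.b)
            self.sign[r][i] = self.rng.choice((-1, 1))
        return self.bucket[r][i], self.sign[r][i]

    def add(self, i):
        for r in range(self.t):
            c, s = self._hash(r, i)
            self.X[r][c] += s

    def estimate(self, i):
        ests = sorted(self.X[r][self._hash(r, i)[0]] * self._hash(r, i)[1]
                      for r in range(self.t))
        return ests[self.t // 2]
```

A top-k tracker would pair this with a small heap, updating an item's heap entry whenever its fresh estimate changes.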

Page 57:

Avoiding Large Items

b = O(k) ⇒ with probability Ω(1), no collision with top-k elements

t hashtables represent independent trials

Need O(log (m/δ)) trials to estimate with probability 1−δ

Also need – small variance for colliding small elements

Claim:
P[variance due to small items in each estimate < (Σi>k mi²)/b] = Ω(1)

Final bound: b = O(k + (Σi>k mi²) / (mk)²)

Page 58:

Final Results

Zipfian Distribution: mi ∝ 1/i [Power Law]

FindApproxTop
Space: O([k + (Σi>k mi²)/(mk)²]·log(m/δ))
Roughly: sampling bound with frequencies squared
Zipfian – gives improved results

FindCandidateTop
Zipf parameter 0.5
O(k log N log m)
Compare: sampling bound O((kN)^0.5 log k)

Page 59:

Problem 2 – Elephants-and-Ants [Manku-Motwani]

Identify items whose current frequency exceeds support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]

Page 60:

Algorithm 1: Lossy Counting

Step 1: Divide the stream into ‘windows’

Window-size w is a function of support s – specified later…

Window 1 Window 2 Window 3

Page 61:

Lossy Counting in Action ...

Frequency counts start empty; the first window’s items are added to the counts

At window boundary, decrement all counters by 1

Page 62:

Lossy Counting (continued)

The next window’s items are added to the frequency counts

At window boundary, decrement all counters by 1

Page 63:

Error Analysis

If current size of stream = N and window-size w = 1/ε,
then # windows = εN

frequency error ≤ # windows = εN

Rule of thumb: Set ε = 10% of support s
Example: Given support frequency s = 1%, set error frequency ε = 0.1%

How much do we undercount?

Page 64:

Putting it all together…

Output: Elements with counter values exceeding (s−ε)N

Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s−ε)N

How many counters do we need?
Worst case bound: (1/ε)·log εN counters

Implementation details…
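The whole algorithm fits in a few lines (a sketch with my own names; the decrement here happens eagerly at every window boundary rather than in the batched form mentioned later):

```python
def lossy_counting(stream, epsilon):
    """Window size w = 1/epsilon; count within a window, then at each
    boundary decrement every counter and drop those that hit zero."""
    w = int(1 / epsilon)
    counts = {}
    for t, x in enumerate(stream, start=1):
        counts[x] = counts.get(x, 0) + 1
        if t % w == 0:                       # window boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts

def frequent_items(stream, s, epsilon):
    """Report items whose counter exceeds (s - epsilon) * N."""
    N = len(stream)
    return {x for x, c in lossy_counting(stream, epsilon).items()
            if c > (s - epsilon) * N}
```

Counters are decremented at most once per window, so a surviving item's count is low by at most εN.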

Page 65:

Number of Counters?

Window size w = 1/ε
Number of windows m = εN
ni – # counters alive over last i windows

Fact: Σ_{i=1}^{j} i·ni ≤ jw, for j = 1, 2, …, m
(a counter must average 1 increment/window to survive)

Claim: # active counters = Σ_{i=1}^{m} ni ≤ Σ_{i=1}^{m} w/i ≤ w log m = (1/ε) log εN

Page 66:

Enhancements

Frequency errors: for counter (X, c), true frequency lies in [c, c+εN]

Trick: track the number of windows t the counter has been active; for counter (X, c, t), true frequency lies in [c, c+t–1]. If t = 1, no error!

Batch processing: perform decrements only after every k windows
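One way to realize the (X, c, t) trick in code, a sketch under our reading of the slide: t starts at 1 when a counter is created and is incremented at each window boundary the counter survives, so for a continuously tracked element the true frequency is recovered exactly as c + t − 1. The function name `lossy_count_tracked` is our own.

```python
import math

def lossy_count_tracked(stream, eps):
    """Lossy Counting sketch storing (c, t) per element: count and
    number of windows the counter has been active."""
    w = math.ceil(1 / eps)
    counters = {}                      # x -> [count c, windows-active t]
    for n, x in enumerate(stream, start=1):
        if x in counters:
            counters[x][0] += 1
        else:
            counters[x] = [1, 1]       # new counter: c = 1, t = 1
        if n % w == 0:                 # window boundary
            for key in list(counters):
                counters[key][0] -= 1  # decrement every counter
                if counters[key][0] <= 0:
                    del counters[key]
                else:
                    counters[key][1] += 1  # survived one more window
    return counters
```

With t = 1 no decrement has touched the counter yet, so the count is exact, matching the slide's remark.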

Page 67

Algorithm 2: Sticky Sampling

(figure: elements of the stream being sampled into counters)

Create counters by sampling; maintain exact counts thereafter

What is the sampling rate?

Page 68

Sticky Sampling (continued)

For a finite stream of length N, sampling rate = 2/(εN) · log(1/(sδ)), where δ = probability of failure

Same rule of thumb: set ε to 10% of the support s
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%

Output: elements with counter values exceeding (s–ε)N

Same error guarantees as Lossy Counting, but probabilistic:
- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least (s–ε)N

Page 69

Number of Counters?

Finite stream of length N: sampling rate 2/(εN) · log(1/(sδ))

Infinite stream with unknown N: gradually adjust the sampling rate

In either case, expected number of counters = (2/ε) · log(1/(sδ)), independent of N
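For a finite stream of known length N, Sticky Sampling can be sketched as below. This is illustrative, not the lecture's code: the function name and interface are ours, δ is the failure probability, and we fix the sampling rate up front rather than adjusting it gradually as the infinite-stream version would.

```python
import math
import random

def sticky_sample(stream, N, s, eps, delta, rng=None):
    """Sticky Sampling sketch for a finite stream of known length N."""
    rng = rng if rng is not None else random.Random(0)
    # per-element sampling probability r = 2/(eps*N) * log(1/(s*delta))
    r = (2 / (eps * N)) * math.log(1 / (s * delta))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1            # exact counting once tracked
        elif rng.random() < r:
            counts[x] = 1             # element gets "stuck": start a counter
    # report elements whose counter reaches (s - eps) * N
    return [x for x, c in counts.items() if c >= (s - eps) * N]
```

Since a counter only ever misses occurrences before the element is first sampled, counts never overestimate, and the undercount exceeds εN only with probability at most δ.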

Page 70

References – Synopses

- Synopsis Data Structures for Massive Data Sets. Gibbons and Matias. DIMACS 1999.
- Tracking Join and Self-Join Sizes in Limited Storage. Alon, Gibbons, Matias, and Szegedy. PODS 1999.
- Join Synopses for Approximate Query Answering. Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
- Random Sampling for Histogram Construction: How Much Is Enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
- Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Manku, Rajagopalan, and Lindsay. SIGMOD 1999.
- Space-Efficient Online Computation of Quantile Summaries. Greenwald and Khanna. SIGMOD 2001.

Page 71

References – Sampling

- Random Sampling with a Reservoir. Vitter. ACM Transactions on Mathematical Software 11(1):37–57, 1985.
- On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering, 1999.
- On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD 1999.
- Congressional Samples for Approximate Answering of Group-By Queries. Acharya, Gibbons, and Poosala. SIGMOD 2000.
- Overcoming Limitations of Sampling for Aggregation Queries. Chaudhuri, Das, Datar, Motwani, and Narasayya. ICDE 2001.
- A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. Chaudhuri, Das, and Narasayya. SIGMOD 2001.
- Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
- Sampling Algorithms: Lower Bounds and Applications. Bar-Yossef, Kumar, and Sivakumar. STOC 2001.

Page 72

References – Sketches

- Probabilistic Counting Algorithms for Data Base Applications. Flajolet and Martin. JCSS, 1985.
- The Space Complexity of Approximating the Frequency Moments. Alon, Matias, and Szegedy. STOC 1996.
- Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB 2002.
- Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP 2002.
- An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS 1999.
- Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS 2000.