CS 361A (Advanced Data Structures and Algorithms)
Lectures 16 & 17 (Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani
Game Plan for Week
Last Class
Models for Streaming/Massive Data Sets
Negative results for Exact Distinct Values
Hashing for Approximate Distinct Values
Today
Synopsis Data Structures
Sampling Techniques
Frequency Moments Problem
Sketching Techniques
Finding High-Frequency Items
Synopsis Data Structures
Synopses
Webster – a condensed statement or outline (as of a narrative or treatise)
CS 361A – succinct data structure that lets us answer queries efficiently
Synopsis Data Structure = "lossy" summary (of a data stream)
Advantages – fits in memory + easy to communicate
Disadvantage – lossiness implies approximation error
Negative Results – bound the best we can do
Key Techniques – randomization and hashing
Numerical Examples
Approximate Query Processing [AQUA/Bell Labs]
Database Size – 420 MB
Synopsis Size – 420 KB (0.1%)
Approximation Error – within 10%
Running Time – 0.3% of time for exact query
Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]
Data Size – 10^9 items
Synopsis Size – 1249 items
Approximation Error – within 1%
Synopses: Desiderata
Small Memory Footprint
Quick Update and Query
Provable, low-error guarantees
Composable – for distributed scenario
Applicability?
General-purpose – e.g. random samples
Specific-purpose – e.g. distinct values estimator
Granularity?
Per database – e.g. sample of entire table
Per distinct value – e.g. customer profiles
Structural – e.g. GROUP-BY or JOIN result samples
Examples of Synopses
Synopses need not be fancy!
Simple Aggregates – e.g. mean/median/max/min
Variance?
Random Samples
Aggregates on small samples represent entire data
Leverage extensive work on confidence intervals
Random Sketches
structured samples
Tracking High-Frequency Items
Random Samples
Types of Samples
Oblivious sampling – at item level
o Limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]
Value-based sampling – e.g. distinct-value samples
Structured samples – e.g. join sampling
Naïve approach – keep samples of each relation
Problem – sample-of-join ≠ join-of-samples
Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]
What if value A is sampled from L and value B from R?
[Figure: relations L and R joining on values A and B]
Basic Scenario
Goal – maintain uniform sample of item-stream
Sampling Semantics?
Coin flip
o select each item with probability p
o easy to maintain
o undesirable – sample size is unbounded
Fixed-size sample without replacement
o Our focus today
Fixed-size sample with replacement
o Show – can generate from previous sample
Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]
Reservoir Sampling [Vitter]
Input – stream of items X1, X2, X3, …
Goal – maintain uniform random sample S of size n (without replacement) of stream so far
Reservoir Sampling
Initialize – include first n elements in S
Upon seeing item Xt
o Add Xt to S with probability n/t
o If added, evict a random previous item
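To make the update rule concrete, here is a minimal Python sketch of reservoir sampling (function and variable names are my own, not from the slides):

import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample S of size n, without replacement."""
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            S.append(x)                    # initialize with the first n items
        elif random.random() < n / t:      # include X_t with probability n/t
            S[random.randrange(n)] = x     # evict a uniformly random prior item
    return S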
Analysis
Correctness?
Fact: At each instant, |S| = n
Theorem: At time t, each Xi is in S with probability n/t
Exercise – prove via induction on t
Efficiency? Let N be stream size
Naïve implementation – N coin flips ⇒ time O(N)
E[# updates to S] = n + Σ_{t=n+1}^{N} n/t = n(1 + H_N − H_n) = O(n(1 + ln(N/n)))
Remark: Verify this is optimal.
Improving Efficiency
Random variable Jt – number jumped over after time t
Idea – generate Jt and skip that many items
Cumulative Distribution Function – F(s) = P[Jt ≤ s], for t>n & s≥0
[Figure: items X1 … X14, with the items inserted into sample S marked (n = 3); skips J3 = 2 and J9 = 4]
F(s) = 1 − ∏_{T=t+1}^{t+s+1} (1 − n/T) = 1 − (t+1−n)^{(s+1)} / (t+1)^{(s+1)}
where a^{(b)} = a(a+1)(a+2)···(a+b−1)
Analysis
Number of calls to RANDOM()?
one per insertion into sample
this is optimal!
Generating Jt?
Pick random number U ∈ [0,1]
Find smallest j such that U ≤ F(j)
How?
o Linear scan ⇒ O(N) time
o Binary search with Newton's interpolation ⇒ O(n²(1 + polylog N/n)) time
Remark – see paper for optimal algorithm
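As an illustration of the linear-scan option, the skip Jt can be drawn by accumulating the product form of F(s) until it first reaches a uniform draw U; a hedged sketch (names mine, and Vitter's optimal algorithm is more involved):

import random

def next_skip(t, n):
    """Sample J_t, the number of items jumped over after time t,
    using F(s) = 1 - prod_{T=t+1}^{t+s+1} (1 - n/T) by linear scan."""
    U = random.random()
    s = 0
    tail = 1.0                  # running product of (1 - n/T)
    T = t
    while True:
        T += 1
        tail *= 1.0 - n / T     # extend the product to T = t+s+1
        if U <= 1.0 - tail:     # smallest s with F(s) >= U
            return s
        s += 1

The sampling loop then jumps over s items and inserts X_{t+s+1} into the sample.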
Sampling over Sliding Windows [Babcock-Datar-Motwani]
Sliding Window W – last w items in stream
Model – item Xt expires at time t+w
Why?
Applications may require ignoring stale data
Type of approximation
Only way to define JOIN over streams
Goal – Maintain uniform sample of size n of sliding window
Reservoir Sampling?
Observe
any item in sample S will expire eventually
must replace with random item of current window
Problem
no access to items in W-S
storing entire window requires O(w) memory
Oversampling
Backing sample B – select each item with probability Θ((n log w)/w)
sample S – select n items from B at random
upon expiry in S, replenish from B
Claim: n < |B| < n log w with high probability
Index-Set Approach
Pick random index set I = {i1, …, in} ⊆ {0, 1, …, w−1}
Sample S – items Xi with i mod w ∈ {i1, …, in} in the current window
Example
Suppose – w=2, n=1, and I={1}
Then – sample is always Xi with odd i
Memory – only O(n)
Observe
S is a uniform random sample of each window
But the sample is periodic (union of arithmetic progressions)
Correlation across successive windows
Problems
Correlation may hurt in some applications
Some data (e.g. time-series) may be periodic
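A small sketch of the index-set scheme, assuming items arrive at positions 0, 1, 2, … (class and method names are my own):

import random

class IndexSetSample:
    """Keep the items whose position mod w lies in a fixed random index set I;
    this yields a uniform sample of size n from every window of size w."""
    def __init__(self, w, n):
        self.w = w
        self.I = set(random.sample(range(w), n))  # n random residues mod w
        self.S = {}                               # residue -> latest matching item
    def insert(self, pos, x):
        r = pos % self.w
        if r in self.I:
            self.S[r] = x      # overwrites the expired item with that residue
    def sample(self):
        return list(self.S.values())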
Chain-Sample Algorithm
Idea
Fix expiry problem in Reservoir Sampling
Advance planning for expiry of sampled items
Focus on sample size 1 – keep n independent such samples
Chain-Sampling
Initially – standard Reservoir Sampling up to time w
Add Xt to S with probability 1/min{t,w} – evict earlier sample
Pre-select Xt's replacement Xr ∈ Wt+w = {Xt+1, …, Xt+w}
o Xt expires ⇒ must replace from Wt+w
o At time r, save Xr and pre-select its own replacement ⇒ building "chain" of potential replacements
Note – if evicting earlier sample, discard its “chain” as well
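A sketch of chain-sampling for sample size 1, under my reading of the slide (expiry is driven by item positions; keep n independent copies for a size-n sample):

import random

class ChainSample:
    """Single-element chain sample over a sliding window of the last w items."""
    def __init__(self, w):
        self.w = w
        self.chain = []   # list of (index, value); the head is the current sample
        self.t = 0        # number of items seen so far

    def _preselect(self):
        # pre-select a replacement index in (tail, tail + w] for the chain's tail
        tail_idx = self.chain[-1][0]
        self.r = random.randint(tail_idx + 1, tail_idx + self.w)

    def insert(self, x):
        self.t += 1
        if random.random() < 1.0 / min(self.t, self.w):
            self.chain = [(self.t, x)]      # new sample; discard the old chain
            self._preselect()
        elif self.chain and self.t == self.r:
            self.chain.append((self.t, x))  # save the pre-selected replacement
            self._preselect()
        if self.chain and self.chain[0][0] <= self.t - self.w:
            self.chain.pop(0)               # head expired; replacement takes over

    def sample(self):
        return self.chain[0][1] if self.chain else None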
Example
[Figure: chain-sampling on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, shown at four successive time steps]
Expectation for Chain-Sample
T(x) = E[chain length for Xt at time t+x]
E[chain length] = T(w) ≤ e ≈ 2.718
E[memory required for sample size n] = O(n)
where T(x) = 1 for x ≤ 1, and T(x) = 1 + (1/w)·Σ_{i<x} T(i) for x > 1
Tail Bound for Chain-Sample
Chain = "hops" of total length at most w
Chain of h hops ⇔ ordered (h+1)-partition of w:
h hops of total length less than w, plus the remainder
Each partition has probability w^{−h}
Number of partitions ≤ (w choose h) ≤ (ew/h)^h
h = O(log w) ⇒ probability of a partition is O(w^{−c})
Thus – memory O(n log w) with high probability
Comparison of Algorithms
Chain-Sample beats Oversample:
Expected memory – O(n) vs O(n log w)
High-probability memory bound – both O(n log w)
Oversample may have sample size shrink below n!
Algorithm      Expected      High-Probability
Periodic       O(n)          O(n)
Oversample     O(n log w)    O(n log w)
Chain-Sample   O(n)          O(n log w)
Sketches and Frequency Moments
Generalized Stream Model
Input Element (i,a)
a copies of domain-value i
increment to ith dimension of m by a
a need not be an integer
Negative value – captures deletions
Data stream: 2, 0, 1, 3, 1, 2, 4, …
[Figure: frequency histogram m = (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)]
Example
Initially: m = (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)
On seeing element (i,a) = (2,2): m2 increases from 2 to 4 ⇒ m = (1, 2, 4, 1, 1)
On seeing element (i,a) = (1,−1): m1 decreases from 2 to 1 ⇒ m = (1, 1, 4, 1, 1)
Frequency Moments
Input Stream
values from U = {0,1,…,N-1}
frequency vector m = (m0,m1,…,mN-1)
Kth Frequency Moment: Fk(m) = Σi mi^k
F0: number of distinct values (Lecture 15)
F1: stream size
F2: Gini index, self-join size, Euclidean norm
Fk: for k>2, measures skew, sometimes useful
F∞: maximum frequency
Problem – estimation in small space
Sketches – randomized estimators
Naive Approaches
Space N – counter mi for each distinct value i
Space O(1)
if input sorted by i
single counter recycled when new i value appears
Goal
Allow arbitrary input
Use small (logarithmic) space
Settle for randomization/approximation
Sketching F2
Random Hash h(i): {0,1,…,N−1} → {−1,+1}
Define Zi = h(i)
Maintain X = Σi miZi
Easy for update streams (i,a) – just add aZi to X
Claim: X² is an unbiased estimator for F2
Proof: E[X²] = E[(Σi miZi)²]
= E[Σi mi²Zi²] + E[Σ_{i≠j} mimjZiZj]
= Σi mi²·E[Zi²] + Σ_{i≠j} mimj·E[Zi]·E[Zj]
= Σi mi² + 0 = F2
Last line? – Zi² = 1 and E[Zi] = 0 as Zi is uniform on {−1,1}; E[ZiZj] = E[Zi]·E[Zj] from independence
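A compact tug-of-war sketch for F2 in Python; the degree-3 polynomial hash modulo a prime gives 4-wise independent values, and taking their parity as the sign is a common trick (the prime, seeding, and class layout are my choices, not the slides'):

import random
import statistics

P = 2_147_483_647  # prime 2^31 - 1

class F2Sketch:
    def __init__(self, copies):
        # one 4-wise independent hash (a random cubic mod P) per copy
        self.coeffs = [[random.randrange(P) for _ in range(4)]
                       for _ in range(copies)]
        self.X = [0] * copies

    def _sign(self, c, i):
        a3, a2, a1, a0 = c
        v = (((a3 * i + a2) * i + a1) * i + a0) % P
        return 1 if v % 2 == 0 else -1          # Z_i in {+1, -1}

    def update(self, i, a=1):                   # stream element (i, a)
        for k, c in enumerate(self.coeffs):
            self.X[k] += a * self._sign(c, i)

    def estimate(self):
        return statistics.median(x * x for x in self.X)

For the bounds on the next slides one would average s = 8/λ² copies and take a median across O(log 1/ε) groups; here a single median stands in for both steps.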
Estimation Error?
Chebyshev bound: P[|Y − E[Y]| ≥ λ·E[Y]] ≤ Var[Y]/(λ²·E[Y]²)
Define Y = X² ⇒ E[Y] = E[X²] = Σi mi² = F2
Observe E[X⁴] = E[(Σi miZi)⁴]
= E[Σ mi⁴Zi⁴] + 4·E[Σ mimj³ZiZj³] + 6·E[Σ mi²mj²Zi²Zj²] + 12·E[Σ mimjmk²ZiZjZk²] + 24·E[Σ mimjmkml·ZiZjZkZl]
= Σ mi⁴ + 6·Σ mi²mj²
(Why? – every term containing some Zi to an odd power has expectation 0, since E[Zi] = E[Zi³] = 0)
By definition Var[Y] = E[Y²] − E[Y]² = E[X⁴] − E[X²]²
= [Σ mi⁴ + 6·Σ mi²mj²] − [Σ mi⁴ + 2·Σ mi²mj²]
= 4·Σ mi²mj² ≤ 2·E[X²]² = 2F2²
Estimation Error?
Chebyshev bound:
P[relative estimation error > λ] = P[|Y − E[Y]| ≥ λ·E[Y]] ≤ Var[Y]/(λ²·E[Y]²) ≤ 2F2²/(λ²F2²) = 2/λ²
Problem – What if we want λ really small?
Solution
Compute s = 8/λ² independent copies of X
Estimator Y = mean(Xi²)
Variance reduces by factor s:
P[relative estimation error > λ] ≤ 2F2²/(s·λ²·F2²) = 2/(s·λ²) = 1/4
Boosting Technique
Algorithm A: randomized λ-approximate estimator f with
P[(1−λ)f* ≤ f ≤ (1+λ)f*] = 3/4
Heavy Tail Problem: the estimate may equal f*−z, f*, f*+z with probabilities 1/16, 3/4, 3/16 – the tails are asymmetric, so the mean is biased
Boosting Idea
O(log 1/ε) independent estimates from A(X)
Return median of estimates
Claim: P[median is λ-approximate] > 1 − ε
Proof:
P[specific estimate is λ-approximate] = ¾
Bad event only if >50% estimates not λ-approximate
Binomial tail – probability less than ε
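The boosting step is mechanical – average within groups to cut variance, then take a median across groups; a sketch (names mine):

import statistics

def boosted_estimate(copies, group_size):
    """copies: independent estimates (e.g. the X^2 values).
    Mean within groups reduces variance; median across groups boosts
    the success probability via the binomial tail argument above."""
    groups = [copies[j:j + group_size]
              for j in range(0, len(copies), group_size)]
    return statistics.median(sum(g) / len(g) for g in groups)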
Overall Space Requirement
Observe
Let m = Σmi
Each hash needs O(log m)-bit counter
s = 8/λ2 hash functions for each estimator
O(log 1/ε) such estimators
Total O(λ-2 log 1/ε log m) bits
Question – Space for storing hash function?
Sketching Paradigm
Random Sketch: inner product ⟨m, Z⟩ = Σi miZi
frequency vector m = (m0,m1,…,mN-1)
random vector Z (currently, uniform {-1,1})
Observe
Linearity: Sketch(m1) ± Sketch(m2) = Sketch(m1 ± m2)
Ideal for distributed computing
Observe
Suppose: given i, can efficiently generate Zi
Then: can maintain sketch for update streams
Problem
o Must generate Zi = h(i) on first appearance of i
o Need Ω(N) memory to store h explicitly
o Need Ω(N) random bits
Two Birds, One Stone
Pairwise Independent Z1, Z2, …, Zn
for all Zi and Zk: P[Zi = x, Zk = y] = P[Zi = x]·P[Zk = y]
property E[ZiZk] = E[Zi].E[Zk]
Example – linear hash function
Seed S = ⟨a, b⟩ from [0..p−1], where p is prime
Zi = h(i) = ai+b (mod p)
Claim: Z1,Z2, …, Zn are pairwise independent
Zi = x and Zk = y ⇔ x = ai + b (mod p) and y = ak + b (mod p)
fixing i, k, x, y ⇒ unique solution for a, b
P[Zi = x, Zk = y] = 1/p² = P[Zi = x]·P[Zk = y]
Memory/Randomness: n log p → 2 log p
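The linear hash family in code, as a small sketch (the prime p is a parameter; it should exceed the domain size):

import random

def make_pairwise_hash(p):
    """Return h(i) = a*i + b (mod p) for a random seed (a, b);
    the values h(0), h(1), ... are pairwise independent."""
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda i: (a * i + b) % p

Only the seed (a, b) is stored – 2 log p bits – instead of an explicit table of n hash values.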
Wait a Minute!
Doesn't pairwise independence screw up the proofs?
No – E[X2] calculation only has degree-2 terms
But – what about Var[X2]?
Need 4-wise independence
Application – Join-Size Estimation
Given
join attribute frequency vectors f1 and f2
Join size = f1.f2
Define – X1 = f1.Z and X2 = f2.Z
Choose – Z as 4-wise independent & uniform {-1,1}
Exercise: Show, as before,
E[X1 X2] = f1.f2
Var[X1X2] ≤ 2(|f1|·|f2|)²
Hint: a.b ≤ |a|.|b|
Bounding Error Probability
Using s copies of the X's and taking their mean Y:
Pr[|Y − f1.f2| ≥ λ·f1.f2] ≤ Var(Y)/(λ²(f1.f2)²)
≤ 2|f1|²|f2|² / (s·λ²·(f1.f2)²)
= 2/(s·λ²·cos²θ)
Bounding error probability? Need – s > 2/(λ²·cos²θ)
Memory? – O(log 1/ε · cos⁻²θ · λ⁻² · (log N + log m))
Problem
To choose s – need an a priori lower bound on cos θ = f1.f2/(|f1|·|f2|)
What if cos θ is really small?
Sketch Partitioning
[Figure: example frequency distributions over dom(R1.A) and dom(R2.B), with the heavy values of the two relations falling in different regions of the domain]
Without partitioning: self-join(R1.A)·self-join(R2.B) = 205·205 ≈ 42K
With a two-region domain partition: self-join(R1.A)·self-join(R2.B), summed over the regions, = 200·5 + 200·5 = 2K
Idea for dealing with the f1²f2²/(f1.f2)² issue – partition the domain into regions where the self-join sizes are smaller, to compensate for small join-size (cos θ)
Sketch Partitioning
Idea
intelligently partition join-attribute space
need coarse statistics on stream
build independent sketches for each partition
Estimate = Σ partition sketches
Variance = Σ partition variances
Sketch Partitioning
Partition Space Allocation?
Can solve optimally, given domain partition
Optimal Partition: find the K-partition minimizing Σ_{i=1}^{K} Var[Xi], where each Var[Xi] is governed by the self-join sizes within partition i
Results
Dynamic Programming – optimal solution for single join
NP-hard – for queries with multiple joins
Fk for k > 2
Assume – stream length m is known (Exercise: show this can be fixed, with log m space overhead, by a repeated-doubling estimate of m)
Choose – random stream item ap, with p uniform over {1, 2, …, m}
Suppose – ap = v ∈ {0, 1, …, N−1}
Count subsequent frequency of v
r = | {q | q≥p, aq=v} |
Define X = m·(r^k − (r−1)^k)
Example
Stream:
7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8
m = 20
p = 9
ap = 5
r = 3
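Here is one copy of this estimator as a single-pass Python sketch (names mine; the stream length m is assumed known, per the previous slide):

import random

def fk_estimate_once(stream, m, k):
    """One copy of the estimator X = m * (r^k - (r-1)^k)."""
    p = random.randint(1, m)   # uniform position in {1, ..., m}
    v = None
    r = 0
    for q, a in enumerate(stream, start=1):
        if q == p:
            v = a              # the sampled value a_p
        if q >= p and a == v:
            r += 1             # frequency of v from position p onward
    return m * (r ** k - (r - 1) ** k)

On the slide's stream, p = 9 picks v = 5 and r = 3, giving X = 20·(3^k − 2^k).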
Fk for k > 2
Claim: E[X] = Fk
Summing over the m equally likely choices of the sampled stream position:
E[X] = (m/m)·{ [1^k + (2^k − 1^k) + … + (m0^k − (m0−1)^k)] + … + [1^k + (2^k − 1^k) + … + (m_{N−1}^k − (m_{N−1}−1)^k)] }
= Σi mi^k = Fk   (each bracket telescopes to mi^k)
Var(X) ≤ k·N^{1−1/k}·Fk²
Bounded error probability ⇒ s = O(k·N^{1−1/k}/λ²) copies
Boosting ⇒ memory bound O(k·N^{1−1/k}·λ⁻²·(log 1/ε)(log N + log m))
Frequency Moments
F0 – distinct values problem (Lecture 15)
F1 – sequence length; for the case with deletions, use Cauchy distribution
F2 – self-join size/Gini index (Today)
Fk for k > 2 – omitting grungy details,
can achieve space bound O(k·N^{1−1/k}·λ⁻²·(log 1/ε)(log N + log m))
F∞ – maximum frequency
Communication Complexity
Alice and Bob cooperatively compute function f(A,B)
Minimize bits communicated
Unbounded computational power
Communication Complexity C(f) – bits exchanged by optimal protocol Π
Protocols?
1-way versus 2-way
deterministic versus randomized
Cδ(f) – randomized complexity for error probability δ
[Figure: Alice holds input A; Bob holds input B]
Streaming & Communication Complexity
Stream algorithm ⇒ 1-way communication protocol
Simulation Argument
Given – algorithm S computing f over streams
Alice – initiates S, providing A as input stream prefix
Communicates to Bob – S’s state after seeing A
Bob – resumes S, providing B as input stream suffix
Theorem – Stream algorithm’s space requirement is at least the communication complexity C(f)
Example: Set Disjointness
Set Disjointness (DIS)
A, B subsets of {1, 2, …, N}
Output: 1 if A ∩ B ≠ ∅, 0 if A ∩ B = ∅
Theorem: Cδ(DIS) = Ω(N), for any δ < 1/2
Lower Bound for F∞
Theorem: Fix ε < 1/3, δ < 1/2. Any stream algorithm S with
P[(1−ε)F∞ < S < (1+ε)F∞] > 1−δ
needs Ω(N) space
Proof
Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
Alice streams set A to S
Communicates S's state to Bob
Bob streams set B to S
Observe: F∞ = 1 if A ∩ B = ∅, and F∞ = 2 if A ∩ B ≠ ∅
Relative error ε < 1/3 ⇒ DIS solved exactly!
P[error] ≤ δ < ½ ⇒ Ω(N) space
Extensions
Observe
Used only 1-way communication in proof
Cδ(DIS) bound was for arbitrary communication
Exercise – extend lower bound to multi-pass algorithms
Lower Bound for Fk, k>2
Need to increase gap beyond 2
Multiparty Set Disjointness – t players
Theorem: Fix ε,δ<½ and k > 5. Any stream algorithm S with
P[ (1-ε)Fk < S < (1+ε)Fk ] > 1-δ
needs Ω(N^{1−(2+δ)/k}) space
Implies Ω(N^{1/2}) even for multi-pass algorithms
Tracking High-Frequency Items
Problem 1 – Top-K List [Charikar-Chen-Farach-Colton]
The Google Problem
Return list of k most frequent items in stream
Motivation
search engine queries, network traffic, …
Remember
Saw lower bound recently!
Solution
Data structure Count-Sketch maintaining count-estimates of high-frequency elements
Definitions
Notation
Assume {1, 2, …, N} in order of frequency
mi is frequency of ith most frequent element
m = Σmi is number of elements in stream
FindCandidateTop
Input: stream S, int k, int p
Output: list of p elements containing the top k
Naive sampling gives a solution with p = Θ((m log k)/mk)
FindApproxTop
Input: stream S, int k, real ε
Output: list of k elements, each of frequency mi > (1−ε)·mk
Naive sampling gives no solution
Main Idea
Consider
single counter X
hash function h(i): {1, 2, …, N} → {−1, +1}
Input element i ⇒ update counter X += Zi = h(i)
For each r, use XZr as estimator of mr
Theorem: E[XZr] = mr
Proof
X = Σi miZi
E[XZr] = E[Σi miZiZr] = Σi miE[Zi Zr] = mrE[Zr2] = mr
Cross-terms cancel
Finding Max Frequency Element
Problem – Var[X] = F2 = Σi mi²
Idea – t counters, with independent 4-wise hashes h1, …, ht, each hr: i → {+1, −1}
Use t = O(log m · Σi mi² / (m1)²)
Claim: new variance < (Σi mi²)/t = (m1)²/log m
Overall Estimator
repeat + median of averages
with high probability, approximate m1
Problem with "Array of Counters"
Variance – dominated by highest frequency
Estimates for less-frequent elements (e.g. the kth most frequent)
corrupted by higher frequencies
variance >> mk
Avoiding Collisions?
spread out high frequency elements
replace each counter with hashtable of b counters
Count Sketch
Hash Functions
4-wise independent hashes h1, …, ht and s1, …, st
hashes independent of each other
sr: i → {1, …, b} (bucket) and hr: i → {+1, −1} (sign)
Data structure: t hashtables of b counters each, X(r, c) for r = 1, …, t and c = 1, …, b
Overall Algorithm
sr(i) – one of b counters in the rth hashtable
Input i ⇒ for each r, update X(r, sr(i)) += hr(i)
Estimator(mi) = medianr { X(r,sr(i)) • hr(i) }
Maintain heap of k top elements seen so far
Observe
Collisions with high-frequency items are not completely eliminated
A few of the estimates X(r, sr(i))·hr(i) could have high variance
The median is not sensitive to these poor estimates
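A hedged Count-Sketch implementation, with Python's built-in hashing standing in for the 4-wise independent families of the previous slide (class and parameter names are mine):

import random
import statistics

class CountSketch:
    def __init__(self, t, b, seed=0):
        self.t, self.b = t, b
        self.X = [[0] * b for _ in range(t)]   # t hashtables of b counters
        rng = random.Random(seed)
        self.salts = [rng.random() for _ in range(t)]

    def _bucket(self, r, i):                   # stand-in for s_r(i)
        return hash((self.salts[r], "bucket", i)) % self.b

    def _sign(self, r, i):                     # stand-in for h_r(i)
        return 1 if hash((self.salts[r], "sign", i)) % 2 == 0 else -1

    def update(self, i, a=1):
        for r in range(self.t):
            self.X[r][self._bucket(r, i)] += a * self._sign(r, i)

    def estimate(self, i):
        return statistics.median(
            self.X[r][self._bucket(r, i)] * self._sign(r, i)
            for r in range(self.t))

The top-k heap of the algorithm sits on top of this: on each update, re-estimate the arriving item's count and adjust the heap.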
Avoiding Large Items
b > O(k) ⇒ with probability Ω(1), no collision with top-k elements
t hashtables represent independent trials
Need O(log(m/δ)) trials to estimate with probability 1 − δ
Also need – small variance for colliding small elements
Claim: P[variance due to small items in each estimate < (Σ_{i>k} mi²)/b] = Ω(1)
Final bound: b = O(k + Σ_{i>k} mi² / (ε·mk)²)
Final Results
Zipfian Distribution: mi ∝ 1/i^z [Power Law]
FindApproxTop space: O([k + (Σ_{i>k} mi²)/(ε·mk)²] · log(m/δ))
Roughly: the sampling bound with frequencies squared
Zipfian – gives improved results
FindCandidateTop
Zipf parameter z = 0.5 ⇒ space O(k log N log m)
Compare: sampling bound O((kN)^{0.5} log k)
Problem 2 – Elephants-and-Ants [Manku-Motwani]
Identify items whose current frequency exceeds support threshold s = 0.1% [Jacobson 2000, Estan-Verghese 2001]
[Figure: a stream of items]
Algorithm 1: Lossy Counting
Step 1: Divide the stream into 'windows'
Window-size w is a function of support s – specified later…
[Figure: stream divided into Window 1, Window 2, Window 3, …]
Lossy Counting in Action…
[Figure: frequency counts start empty; counters are created/incremented as the first window's items arrive, and at the window boundary all counters are decremented by 1]
Lossy Counting (continued)
[Figure: the next window's items are added to the frequency counts; again, at the window boundary all counters are decremented by 1]
Error Analysis
If current size of stream = N and window-size w = 1/ε,
then # windows = εN
Rule of thumb: Set ε = 10% of support s
Example: Given support frequency s = 1%, set error frequency ε = 0.1%
How much do we undercount? A counter is decremented once per window boundary, so the frequency error is at most # windows = εN
Putting it all together…
Output: elements with counter values exceeding (s − ε)N
Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s − ε)N
How many counters do we need?
Worst-case bound: (1/ε) log εN counters
Implementation details…
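A minimal sketch of the windowed Lossy Counting loop described above (names mine):

def lossy_counting(stream, s, epsilon):
    """Report items whose frequency appears to exceed s*N,
    with undercounting error at most epsilon*N."""
    w = int(1 / epsilon)            # window size
    counts = {}
    N = 0
    for x in stream:
        N += 1
        counts[x] = counts.get(x, 0) + 1
        if N % w == 0:              # window boundary: decrement all counters
            for y in list(counts):
                counts[y] -= 1
                if counts[y] == 0:
                    del counts[y]   # drop counters that reach zero
    return [x for x, c in counts.items() if c > (s - epsilon) * N]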
Number of Counters?
Window size w = 1/ε
Number of windows m = εN
ni – # counters alive over the last i windows
Fact: Σ_{i=1}^{j} i·ni ≤ j·w for j = 1, 2, …, m
(a counter must average 1 increment per window to survive)
Claim: Σ_{i=1}^{j} ni ≤ Σ_{i=1}^{j} w/i for j = 1, 2, …, m
# active counters = Σ_{i=1}^{m} ni ≤ w·Σ_{i=1}^{m} 1/i ≤ w·log m = (1/ε)·log εN
Enhancements
Frequency Errors
For counter (X, c), true frequency is in [c, c+εN]
Trick: track the number of windows t for which the counter has been active
For counter (X, c, t), true frequency is in [c, c+t−1]
Batch Processing
Decrements after k windows
If (t = 1), no error!
Algorithm 2: Sticky Sampling
Create counters by sampling; maintain exact counts thereafter
What is the sampling rate?
[Figure: a stream of items, a few of which are sampled into counters]
Sticky Sampling (continued)
For finite stream of length N
Sampling rate = (2/εN)·log(1/(sδ)), where δ = probability of failure
Same rule of thumb: Set ε = 10% of support s
Example: Given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Output: elements with counter values exceeding (s − ε)N
Same error guarantees as Lossy Counting, but probabilistic
Approximation guarantees (probabilistic)
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s − ε)N
Number of Counters?
Finite stream of length N – sampling rate (2/εN)·log(1/(sδ))
Infinite stream with unknown N – gradually adjust the sampling rate
In either case, expected number of counters = (2/ε)·log(1/(sδ)) – independent of N
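A simplified fixed-rate sketch of Sticky Sampling for a finite stream of known length N (names mine; the paper's version adjusts the rate as the stream grows):

import math
import random

def sticky_sampling(stream, N, s, epsilon, delta):
    """Counters are created by sampling, then maintain exact counts."""
    rate = (2 / (epsilon * N)) * math.log(1 / (s * delta))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1              # exact counting once a counter exists
        elif random.random() < rate:
            counts[x] = 1               # create a counter by sampling
    return [x for x, c in counts.items() if c > (s - epsilon) * N]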
References – Synopses
Synopsis data structures for massive data sets. Gibbons and Matias. DIMACS 1999.
Tracking Join and Self-Join Sizes in Limited Storage, Alon, Gibbons, Matias, and Szegedy. PODS 1999.
Join Synopses for Approximate Query Answering, Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
Random Sampling for Histogram Construction: How much is enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets, Manku, Rajagopalan, and Lindsay. SIGMOD 1999.
Space-efficient online computation of quantile summaries, Greenwald and Khanna. SIGMOD 2001.
References – Sampling
Random Sampling with a Reservoir. Vitter. ACM Transactions on Mathematical Software 11(1):37-57 (1985).
On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering (1999).
On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD 1999.
Congressional Samples for Approximate Answering of Group-By Queries, Acharya, Gibbons, and Poosala. SIGMOD 2000.
Overcoming Limitations of Sampling for Aggregation Queries, Chaudhuri, Das, Datar, Motwani and Narasayya. ICDE 2001.
A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, Chaudhuri, Das and Narasayya. SIGMOD 01.
Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
Sampling algorithms: lower bounds and applications. Bar-Yossef–Kumar–Sivakumar. STOC 2001.
References – Sketches
Probabilistic counting algorithms for data base applications. Flajolet and Martin. JCSS (1985).
The space complexity of approximating the frequency moments. Alon, Matias, and Szegedy. STOC 1996.
Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB 2002.
Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP 2002.
An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS 1999.
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS 2000.