CS 361A (Advanced Data Structures and Algorithms)
Lectures 16 & 17 (Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani
Game Plan for Week
Last Class
Models for Streaming/Massive Data Sets
Negative results for Exact Distinct Values
Hashing for Approximate Distinct Values
Today
Synopsis Data Structures
Sampling Techniques
Frequency Moments Problem
Sketching Techniques
Finding High-Frequency Items
Synopsis Data Structures
Synopses
Webster – a condensed statement or outline (as of a narrative or treatise)
CS 361A – succinct data structure that lets us answer queries efficiently
Synopsis Data Structure = "lossy" summary (of a data stream)
Advantages – fits in memory + easy to communicate
Disadvantage – lossiness implies approximation error
Negative Results – bound the best we can do
Key Techniques – randomization and hashing
Numerical Examples
Approximate Query Processing [AQUA/Bell Labs]
Database Size – 420 MB
Synopsis Size – 420 KB (0.1%)
Approximation Error – within 10%
Running Time – 0.3% of time for exact query
Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]
Data Size – 10^9 items
Synopsis Size – 1249 items
Approximation Error – within 1%
Synopses: Desiderata
Small Memory Footprint
Quick Update and Query
Provable, low-error guarantees
Composable – for distributed scenario
Applicability?
General-purpose – e.g. random samples
Specific-purpose – e.g. distinct values estimator
Granularity?
Per database – e.g. sample of entire table
Per distinct value – e.g. customer profiles
Structural – e.g. GROUP-BY or JOIN result samples
Examples of Synopses
Synopses need not be fancy!
Simple Aggregates – e.g. mean/median/max/min
Variance?
Random Samples
Aggregates on small samples represent entire data
Leverage extensive work on confidence intervals
Random Sketches
structured samples
Tracking High-Frequency Items
Random Samples
Types of Samples
Oblivious sampling – at item level
o Limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]
Value-based sampling – e.g. distinct-value samples
Structured samples – e.g. join sampling
Naïve approach – keep samples of each relation
Problem – sample-of-join ≠ join-of-samples
Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]
What if value A is sampled from L and value B from R?
[Figure: relations L and R joining on values A and B]
Basic Scenario
Goal – maintain uniform sample of item-stream
Sampling Semantics?
Coin flip
o select each item with probability p
o easy to maintain
o undesirable – sample size is unbounded
Fixed-size sample without replacement
o Our focus today
Fixed-size sample with replacement
o Show – can generate from previous sample
Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]
Reservoir Sampling [Vitter]
Input – stream of items X1, X2, X3, …
Goal – maintain uniform random sample S of size n (without replacement) of stream so far
Reservoir Sampling
Initialize – include first n elements in S
Upon seeing item Xt
o Add Xt to S with probability n/t
o If added, evict a random previous item
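To make the update rule concrete, here is a minimal Python sketch of reservoir sampling (function and variable names are my own, not from the slides):

import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample S of size n, without replacement."""
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            S.append(x)                    # initialize with the first n items
        elif random.random() < n / t:      # include X_t with probability n/t
            S[random.randrange(n)] = x     # evict a uniformly random prior item
    return S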
Analysis
Correctness?
Fact: At each instant, |S| = n
Theorem: At time t, each Xi is in S with probability n/t
Exercise – prove via induction on t
Efficiency? Let N be stream size
Naïve implementation – N coin flips ⇒ time O(N)
E[# updates to S] = n + Σ_{t=n+1}^{N} n/t = n(1 + H_N − H_n) = O(n(1 + ln(N/n)))
Remark: Verify this is optimal.
Improving Efficiency
Random variable Jt – number jumped over after time t
Idea – generate Jt and skip that many items
Cumulative Distribution Function – F(s) = P[Jt ≤ s], for t>n & s≥0
[Figure: items X1 … X14, with the items inserted into sample S marked (n = 3); skips J3 = 2 and J9 = 4]
F(s) = 1 − ∏_{T=t+1}^{t+s+1} (1 − n/T) = 1 − (t+1−n)^{(s+1)} / (t+1)^{(s+1)}
where a^{(b)} = a(a+1)(a+2)···(a+b−1)
Analysis
Number of calls to RANDOM()?
one per insertion into sample
this is optimal!
Generating Jt?
Pick random number U ∈ [0,1]
Find smallest j such that U ≤ F(j)
How?
o Linear scan ⇒ O(N) time
o Binary search with Newton's interpolation ⇒ O(n²(1 + polylog N/n)) time
Remark – see paper for optimal algorithm
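As an illustration of the linear-scan option, the skip Jt can be drawn by accumulating the product form of F(s) until it first reaches a uniform draw U; a hedged sketch (names mine, and Vitter's optimal algorithm is more involved):

import random

def next_skip(t, n):
    """Sample J_t, the number of items jumped over after time t,
    using F(s) = 1 - prod_{T=t+1}^{t+s+1} (1 - n/T) by linear scan."""
    U = random.random()
    s = 0
    tail = 1.0                  # running product of (1 - n/T)
    T = t
    while True:
        T += 1
        tail *= 1.0 - n / T     # extend the product to T = t+s+1
        if U <= 1.0 - tail:     # smallest s with F(s) >= U
            return s
        s += 1

The sampling loop then jumps over s items and inserts X_{t+s+1} into the sample.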
Sampling over Sliding Windows [Babcock-Datar-Motwani]
Sliding Window W – last w items in stream
Model – item Xt expires at time t+w
Why?
Applications may require ignoring stale data
Type of approximation
Only way to define JOIN over streams
Goal – Maintain uniform sample of size n of sliding window
Reservoir Sampling?
Observe
any item in sample S will expire eventually
must replace with random item of current window
Problem
no access to items in W-S
storing entire window requires O(w) memory
Oversampling
Backing sample B – select each item with probability Θ((n log w)/w)
sample S – select n items from B at random
upon expiry in S, replenish from B
Claim: n < |B| < n log w with high probability
Index-Set Approach
Pick random index set I = {i1, …, in} ⊆ {0, 1, …, w−1}
Sample S – items Xi with i mod w ∈ {i1, …, in} in the current window
Example
Suppose – w=2, n=1, and I={1}
Then – sample is always Xi with odd i
Memory – only O(n)
Observe
S is a uniform random sample of each window
But the sample is periodic (union of arithmetic progressions)
Correlation across successive windows
Problems
Correlation may hurt in some applications
Some data (e.g. time-series) may be periodic
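A small sketch of the index-set scheme, assuming items arrive at positions 0, 1, 2, … (class and method names are my own):

import random

class IndexSetSample:
    """Keep the items whose position mod w lies in a fixed random index set I;
    this yields a uniform sample of size n from every window of size w."""
    def __init__(self, w, n):
        self.w = w
        self.I = set(random.sample(range(w), n))  # n random residues mod w
        self.S = {}                               # residue -> latest matching item
    def insert(self, pos, x):
        r = pos % self.w
        if r in self.I:
            self.S[r] = x      # overwrites the expired item with that residue
    def sample(self):
        return list(self.S.values())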
Chain-Sample Algorithm
Idea
Fix expiry problem in Reservoir Sampling
Advance planning for expiry of sampled items
Focus on sample size 1 – keep n independent such samples
Chain-Sampling
Initially – standard Reservoir Sampling up to time w
Add Xt to S with probability 1/min{t,w} – evict earlier sample
Pre-select Xt's replacement Xr ∈ Wt+w = {Xt+1, …, Xt+w}
o Xt expires ⇒ must replace from Wt+w
o At time r, save Xr and pre-select its own replacement ⇒ building "chain" of potential replacements
Note – if evicting earlier sample, discard its “chain” as well
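A sketch of chain-sampling for sample size 1, under my reading of the slide (expiry is driven by item positions; keep n independent copies for a size-n sample):

import random

class ChainSample:
    """Single-element chain sample over a sliding window of the last w items."""
    def __init__(self, w):
        self.w = w
        self.chain = []   # list of (index, value); the head is the current sample
        self.t = 0        # number of items seen so far

    def _preselect(self):
        # pre-select a replacement index in (tail, tail + w] for the chain's tail
        tail_idx = self.chain[-1][0]
        self.r = random.randint(tail_idx + 1, tail_idx + self.w)

    def insert(self, x):
        self.t += 1
        if random.random() < 1.0 / min(self.t, self.w):
            self.chain = [(self.t, x)]      # new sample; discard the old chain
            self._preselect()
        elif self.chain and self.t == self.r:
            self.chain.append((self.t, x))  # save the pre-selected replacement
            self._preselect()
        if self.chain and self.chain[0][0] <= self.t - self.w:
            self.chain.pop(0)               # head expired; replacement takes over

    def sample(self):
        return self.chain[0][1] if self.chain else None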
Example
[Figure: chain-sampling on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, shown at four successive time steps]
Expectation for Chain-Sample
T(x) = E[chain length for Xt at time t+x]
E[chain length] = T(w) ≤ e ≈ 2.718
E[memory required for sample size n] = O(n)
where T(x) = 1 for x ≤ 1, and T(x) = 1 + (1/w)·Σ_{i<x} T(i) for x > 1
Tail Bound for Chain-Sample
Chain = "hops" of total length at most w
Chain of h hops ⇔ ordered (h+1)-partition of w:
h hops of total length less than w, plus the remainder
Each partition has probability w^{−h}
Number of partitions ≤ (w choose h) ≤ (ew/h)^h
h = O(log w) ⇒ probability of a partition is O(w^{−c})
Thus – memory O(n log w) with high probability
Comparison of Algorithms
Chain-Sample beats Oversample:
Expected memory – O(n) vs O(n log w)
High-probability memory bound – both O(n log w)
Oversample may have sample size shrink below n!
Algorithm      Expected      High-Probability
Periodic       O(n)          O(n)
Oversample     O(n log w)    O(n log w)
Chain-Sample   O(n)          O(n log w)
Sketches and Frequency Moments
Generalized Stream Model
Input Element (i,a)
a copies of domain-value i
increment to ith dimension of m by a
a need not be an integer
Negative value – captures deletions
Data stream: 2, 0, 1, 3, 1, 2, 4, …
[Figure: frequency histogram m = (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)]
Example
Initially: m = (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)
On seeing element (i,a) = (2,2): m2 increases from 2 to 4 ⇒ m = (1, 2, 4, 1, 1)
On seeing element (i,a) = (1,−1): m1 decreases from 2 to 1 ⇒ m = (1, 1, 4, 1, 1)
Frequency Moments
Input Stream
values from U = {0,1,…,N-1}
frequency vector m = (m0,m1,…,mN-1)
Kth Frequency Moment: Fk(m) = Σi mi^k
F0: number of distinct values (Lecture 15)
F1: stream size
F2: Gini index, self-join size, Euclidean norm
Fk: for k>2, measures skew, sometimes useful
F∞: maximum frequency
Problem – estimation in small space
Sketches – randomized estimators
Naive Approaches
Space N – counter mi for each distinct value i
Space O(1)
if input sorted by i
single counter recycled when new i value appears
Goal
Allow arbitrary input
Use small (logarithmic) space
Settle for randomization/approximation
Sketching F2
Random Hash h(i): {0,1,…,N−1} → {−1,+1}
Define Zi = h(i)
Maintain X = Σi miZi
Easy for update streams (i,a) – just add aZi to X
Claim: X² is an unbiased estimator for F2
Proof: E[X²] = E[(Σi miZi)²]
= E[Σi mi²Zi²] + E[Σ_{i≠j} mimjZiZj]
= Σi mi²·E[Zi²] + Σ_{i≠j} mimj·E[Zi]·E[Zj]
= Σi mi² + 0 = F2
Last line? – Zi² = 1 and E[Zi] = 0 as Zi is uniform on {−1,1}; E[ZiZj] = E[Zi]·E[Zj] from independence
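A compact tug-of-war sketch for F2 in Python; the degree-3 polynomial hash modulo a prime gives 4-wise independent values, and taking their parity as the sign is a common trick (the prime, seeding, and class layout are my choices, not the slides'):

import random
import statistics

P = 2_147_483_647  # prime 2^31 - 1

class F2Sketch:
    def __init__(self, copies):
        # one 4-wise independent hash (a random cubic mod P) per copy
        self.coeffs = [[random.randrange(P) for _ in range(4)]
                       for _ in range(copies)]
        self.X = [0] * copies

    def _sign(self, c, i):
        a3, a2, a1, a0 = c
        v = (((a3 * i + a2) * i + a1) * i + a0) % P
        return 1 if v % 2 == 0 else -1          # Z_i in {+1, -1}

    def update(self, i, a=1):                   # stream element (i, a)
        for k, c in enumerate(self.coeffs):
            self.X[k] += a * self._sign(c, i)

    def estimate(self):
        return statistics.median(x * x for x in self.X)

For the bounds on the next slides one would average s = 8/λ² copies and take a median across O(log 1/ε) groups; here a single median stands in for both steps.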
Estimation Error?
Chebyshev bound: P[|Y − E[Y]| ≥ λ·E[Y]] ≤ Var[Y]/(λ²·E[Y]²)
Define Y = X² ⇒ E[Y] = E[X²] = Σi mi² = F2
Observe E[X⁴] = E[(Σi miZi)⁴]
= E[Σ mi⁴Zi⁴] + 4·E[Σ mimj³ZiZj³] + 6·E[Σ mi²mj²Zi²Zj²] + 12·E[Σ mimjmk²ZiZjZk²] + 24·E[Σ mimjmkml·ZiZjZkZl]
= Σ mi⁴ + 6·Σ mi²mj²
(Why? – every term containing some Zi to an odd power has expectation 0, since E[Zi] = E[Zi³] = 0)
By definition Var[Y] = E[Y²] − E[Y]² = E[X⁴] − E[X²]²
= [Σ mi⁴ + 6·Σ mi²mj²] − [Σ mi⁴ + 2·Σ mi²mj²]
= 4·Σ mi²mj² ≤ 2·E[X²]² = 2F2²
Estimation Error?
Chebyshev bound:
P[relative estimation error > λ] = P[|Y − E[Y]| ≥ λ·E[Y]] ≤ Var[Y]/(λ²·E[Y]²) ≤ 2F2²/(λ²F2²) = 2/λ²
Problem – What if we want λ really small?
Solution
Compute s = 8/λ² independent copies of X
Estimator Y = mean(Xi²)
Variance reduces by factor s:
P[relative estimation error > λ] ≤ 2F2²/(s·λ²·F2²) = 2/(s·λ²) = 1/4
Boosting Technique
Algorithm A: randomized λ-approximate estimator f with
P[(1−λ)f* ≤ f ≤ (1+λ)f*] = 3/4
Heavy Tail Problem: the estimate may equal f*−z, f*, f*+z with probabilities 1/16, 3/4, 3/16 – the tails are asymmetric, so the mean is biased
Boosting Idea
O(log 1/ε) independent estimates from A(X)
Return median of estimates
Claim: P[median is λ-approximate] > 1 − ε
Proof:
P[specific estimate is λ-approximate] = ¾
Bad event only if >50% estimates not λ-approximate
Binomial tail – probability less than ε
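The boosting step is mechanical – average within groups to cut variance, then take a median across groups; a sketch (names mine):

import statistics

def boosted_estimate(copies, group_size):
    """copies: independent estimates (e.g. the X^2 values).
    Mean within groups reduces variance; median across groups boosts
    the success probability via the binomial tail argument above."""
    groups = [copies[j:j + group_size]
              for j in range(0, len(copies), group_size)]
    return statistics.median(sum(g) / len(g) for g in groups)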
Overall Space Requirement
Observe
Let m = Σmi
Each hash needs O(log m)-bit counter
s = 8/λ2 hash functions for each estimator
O(log 1/ε) such estimators
Total O(λ-2 log 1/ε log m) bits
Question – Space for storing hash function?
Sketching Paradigm
Random Sketch: inner product ⟨m, Z⟩ = Σi miZi
frequency vector m = (m0,m1,…,mN-1)
random vector Z (currently, uniform {-1,1})
Observe
Linearity: Sketch(m1) ± Sketch(m2) = Sketch(m1 ± m2)
Ideal for distributed computing
Observe
Suppose: given i, can efficiently generate Zi
Then: can maintain sketch for update streams
Problem
o Must generate Zi = h(i) on first appearance of i
o Need Ω(N) memory to store h explicitly
o Need Ω(N) random bits
Two Birds, One Stone
Pairwise Independent Z1, Z2, …, Zn
for all Zi and Zk: P[Zi = x, Zk = y] = P[Zi = x]·P[Zk = y]
property E[ZiZk] = E[Zi].E[Zk]
Example – linear hash function
Seed S = ⟨a, b⟩ from [0..p−1], where p is prime
Zi = h(i) = ai+b (mod p)
Claim: Z1,Z2, …, Zn are pairwise independent
Zi = x and Zk = y ⇔ x = ai + b (mod p) and y = ak + b (mod p)
fixing i, k, x, y ⇒ unique solution for a, b
P[Zi = x, Zk = y] = 1/p² = P[Zi = x]·P[Zk = y]
Memory/Randomness: n log p → 2 log p
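The linear hash family in code, as a small sketch (the prime p is a parameter; it should exceed the domain size):

import random

def make_pairwise_hash(p):
    """Return h(i) = a*i + b (mod p) for a random seed (a, b);
    the values h(0), h(1), ... are pairwise independent."""
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda i: (a * i + b) % p

Only the seed (a, b) is stored – 2 log p bits – instead of an explicit table of n hash values.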
Wait a Minute!
Doesn't pairwise independence screw up the proofs?
No – E[X2] calculation only has degree-2 terms
But – what about Var[X2]?
Need 4-wise independence
Application – Join-Size Estimation
Given
join attribute frequency vectors f1 and f2
Join size = f1.f2
Define – X1 = f1.Z and X2 = f2.Z
Choose – Z as 4-wise independent & uniform {-1,1}
Exercise: Show, as before,
E[X1 X2] = f1.f2
Var[X1X2] ≤ 2(|f1|·|f2|)²
Hint: a.b ≤ |a|.|b|
Bounding Error Probability
Using s copies of the X's and taking their mean Y:
Pr[|Y − f1.f2| ≥ λ·f1.f2] ≤ Var(Y)/(λ²(f1.f2)²)
≤ 2|f1|²|f2|² / (s·λ²·(f1.f2)²)
= 2/(s·λ²·cos²θ)
Bounding error probability? Need – s > 2/(λ²·cos²θ)
Memory? – O(log 1/ε · cos⁻²θ · λ⁻² · (log N + log m))
Problem
To choose s – need an a priori lower bound on cos θ = f1.f2/(|f1|·|f2|)
What if cos θ is really small?
Sketch Partitioning
[Figure: example frequency distributions over dom(R1.A) and dom(R2.B), with the heavy values of the two relations falling in different regions of the domain]
Without partitioning: self-join(R1.A)·self-join(R2.B) = 205·205 ≈ 42K
With a two-region domain partition: self-join(R1.A)·self-join(R2.B), summed over the regions, = 200·5 + 200·5 = 2K
Idea for dealing with the f1²f2²/(f1.f2)² issue – partition the domain into regions where the self-join sizes are smaller, to compensate for small join-size (cos θ)
Sketch Partitioning
Idea
intelligently partition join-attribute space
need coarse statistics on stream
build independent sketches for each partition
Estimate = Σ partition sketches
Variance = Σ partition variances
Sketch Partitioning
Partition Space Allocation?
Can solve optimally, given domain partition
Optimal Partition: find the K-partition minimizing Σ_{i=1}^{K} Var[Xi], where each Var[Xi] is governed by the self-join sizes within partition i
Results
Dynamic Programming – optimal solution for single join
NP-hard – for queries with multiple joins
Fk for k > 2
Assume – stream length m is known (Exercise: show this can be fixed, with log m space overhead, by a repeated-doubling estimate of m)
Choose – random stream item ap, with p uniform over {1, 2, …, m}
Suppose – ap = v ∈ {0, 1, …, N−1}
Count subsequent frequency of v
r = | {q | q≥p, aq=v} |
Define X = m·(r^k − (r−1)^k)
Example
Stream:
7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8
m = 20
p = 9
ap = 5
r = 3
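Here is one copy of this estimator as a single-pass Python sketch (names mine; the stream length m is assumed known, per the previous slide):

import random

def fk_estimate_once(stream, m, k):
    """One copy of the estimator X = m * (r^k - (r-1)^k)."""
    p = random.randint(1, m)   # uniform position in {1, ..., m}
    v = None
    r = 0
    for q, a in enumerate(stream, start=1):
        if q == p:
            v = a              # the sampled value a_p
        if q >= p and a == v:
            r += 1             # frequency of v from position p onward
    return m * (r ** k - (r - 1) ** k)

On the slide's stream, p = 9 picks v = 5 and r = 3, giving X = 20·(3^k − 2^k).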
Fk for k > 2
Claim: E[X] = Fk
Summing over the m equally likely choices of the sampled stream position:
E[X] = (m/m)·{ [1^k + (2^k − 1^k) + … + (m0^k − (m0−1)^k)] + … + [1^k + (2^k − 1^k) + … + (m_{N−1}^k − (m_{N−1}−1)^k)] }
= Σi mi^k = Fk   (each bracket telescopes to mi^k)
Var(X) ≤ k·N^{1−1/k}·Fk²
Bounded error probability ⇒ s = O(k·N^{1−1/k}/λ²) copies
Boosting ⇒ memory bound O(k·N^{1−1/k}·λ⁻²·(log 1/ε)(log N + log m))
Frequency Moments
F0 – distinct values problem (Lecture 15)
F1 – sequence length; for the case with deletions, use Cauchy distribution
F2 – self-join size/Gini index (Today)
Fk for k > 2 – omitting grungy details,
can achieve space bound O(k·N^{1−1/k}·λ⁻²·(log 1/ε)(log N + log m))
F∞ – maximum frequency
Communication Complexity
Alice and Bob cooperatively compute function f(A,B)
Minimize bits communicated
Unbounded computational power
Communication Complexity C(f) – bits exchanged by optimal protocol Π
Protocols?
1-way versus 2-way
deterministic versus randomized
Cδ(f) – randomized complexity for error probability δ
[Figure: Alice holds input A; Bob holds input B]
Streaming & Communication Complexity
Stream algorithm ⇒ 1-way communication protocol
Simulation Argument
Given – algorithm S computing f over streams
Alice – initiates S, providing A as input stream prefix
Communicates to Bob – S’s state after seeing A
Bob – resumes S, providing B as input stream suffix
Theorem – Stream algorithm’s space requirement is at least the communication complexity C(f)
Example: Set Disjointness
Set Disjointness (DIS)
A, B subsets of {1, 2, …, N}
Output: 1 if A ∩ B ≠ ∅, 0 if A ∩ B = ∅
Theorem: Cδ(DIS) = Ω(N), for any δ < 1/2
Lower Bound for F∞
Theorem: Fix ε < 1/3, δ < 1/2. Any stream algorithm S with
P[(1−ε)F∞ < S < (1+ε)F∞] > 1−δ
needs Ω(N) space
Proof
Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
Alice streams set A to S
Communicates S's state to Bob
Bob streams set B to S
Observe: F∞ = 1 if A ∩ B = ∅, and F∞ = 2 if A ∩ B ≠ ∅
Relative error ε < 1/3 ⇒ DIS solved exactly!
P[error] ≤ δ < ½ ⇒ Ω(N) space
Extensions
Observe
Used only 1-way communication in proof
Cδ(DIS) bound was for arbitrary communication
Exercise – extend lower bound to multi-pass algorithms
Lower Bound for Fk, k>2
Need to increase gap beyond 2
Multiparty Set Disjointness – t players
Theorem: Fix ε,δ<½ and k > 5. Any stream algorithm S with
P[ (1-ε)Fk < S < (1+ε)Fk ] > 1-δ
needs Ω(N^{1−(2+δ)/k}) space
Implies Ω(N^{1/2}) even for multi-pass algorithms
Tracking High-Frequency Items
Problem 1 – Top-K List [Charikar-Chen-Farach-Colton]
The Google Problem
Return list of k most frequent items in stream
Motivation
search engine queries, network traffic, …
Remember
Saw lower bound recently!
Solution
Data structure Count-Sketch maintaining count-estimates of high-frequency elements
Definitions
Notation
Assume {1, 2, …, N} in order of frequency
mi is frequency of ith most frequent element
m = Σmi is number of elements in stream
FindCandidateTop
Input: stream S, int k, int p
Output: list of p elements containing the top k
Naive sampling gives a solution with p = Θ((m log k)/mk)
FindApproxTop
Input: stream S, int k, real ε
Output: list of k elements, each of frequency mi > (1−ε)·mk
Naive sampling gives no solution
Main Idea
Consider
single counter X
hash function h(i): {1, 2, …, N} → {−1, +1}
Input element i ⇒ update counter X += Zi = h(i)
For each r, use XZr as estimator of mr
Theorem: E[XZr] = mr
Proof
X = Σi miZi
E[XZr] = E[Σi miZiZr] = Σi miE[Zi Zr] = mrE[Zr2] = mr
Cross-terms cancel
Finding Max Frequency Element
Problem – Var[X] = F2 = Σi mi²
Idea – t counters, with independent 4-wise hashes h1, …, ht, each hr: i → {+1, −1}
Use t = O(log m · Σi mi² / (m1)²)
Claim: new variance < (Σi mi²)/t = (m1)²/log m
Overall Estimator
repeat + median of averages
with high probability, approximate m1
Problem with "Array of Counters"
Variance – dominated by highest frequency
Estimates for less-frequent elements (e.g. the kth most frequent)
corrupted by higher frequencies
variance >> mk
Avoiding Collisions?
spread out high frequency elements
replace each counter with hashtable of b counters
Count Sketch
Hash Functions
4-wise independent hashes h1, …, ht and s1, …, st
hashes independent of each other
sr: i → {1, …, b} (bucket) and hr: i → {+1, −1} (sign)
Data structure: t hashtables of b counters each, X(r, c) for r = 1, …, t and c = 1, …, b
Overall Algorithm
sr(i) – one of b counters in the rth hashtable
Input i ⇒ for each r, update X(r, sr(i)) += hr(i)
Estimator(mi) = medianr { X(r,sr(i)) • hr(i) }
Maintain heap of k top elements seen so far
Observe
Collisions with high-frequency items are not completely eliminated
A few of the estimates X(r, sr(i))·hr(i) could have high variance
The median is not sensitive to these poor estimates
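A hedged Count-Sketch implementation, with Python's built-in hashing standing in for the 4-wise independent families of the previous slide (class and parameter names are mine):

import random
import statistics

class CountSketch:
    def __init__(self, t, b, seed=0):
        self.t, self.b = t, b
        self.X = [[0] * b for _ in range(t)]   # t hashtables of b counters
        rng = random.Random(seed)
        self.salts = [rng.random() for _ in range(t)]

    def _bucket(self, r, i):                   # stand-in for s_r(i)
        return hash((self.salts[r], "bucket", i)) % self.b

    def _sign(self, r, i):                     # stand-in for h_r(i)
        return 1 if hash((self.salts[r], "sign", i)) % 2 == 0 else -1

    def update(self, i, a=1):
        for r in range(self.t):
            self.X[r][self._bucket(r, i)] += a * self._sign(r, i)

    def estimate(self, i):
        return statistics.median(
            self.X[r][self._bucket(r, i)] * self._sign(r, i)
            for r in range(self.t))

The top-k heap of the algorithm sits on top of this: on each update, re-estimate the arriving item's count and adjust the heap.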
Avoiding Large Items
b > O(k) ⇒ with probability Ω(1), no collision with top-k elements
t hashtables represent independent trials
Need O(log(m/δ)) trials to estimate with probability 1 − δ
Also need – small variance for colliding small elements
Claim: P[variance due to small items in each estimate < (Σ_{i>k} mi²)/b] = Ω(1)
Final bound: b = O(k + Σ_{i>k} mi² / (ε·mk)²)
Final Results
Zipfian Distribution: mi ∝ 1/i^z [Power Law]
FindApproxTop space: O([k + (Σ_{i>k} mi²)/(ε·mk)²] · log(m/δ))
Roughly: the sampling bound with frequencies squared
Zipfian – gives improved results
FindCandidateTop
Zipf parameter z = 0.5 ⇒ space O(k log N log m)
Compare: sampling bound O((kN)^{0.5} log k)
Problem 2 – Elephants-and-Ants [Manku-Motwani]
Identify items whose current frequency exceeds support threshold s = 0.1% [Jacobson 2000, Estan-Verghese 2001]
[Figure: a stream of items]
Algorithm 1: Lossy Counting
Step 1: Divide the stream into 'windows'
Window-size w is a function of support s – specified later…
[Figure: stream divided into Window 1, Window 2, Window 3, …]
Lossy Counting in Action…
[Figure: frequency counts start empty; counters are created/incremented as the first window's items arrive, and at the window boundary all counters are decremented by 1]
Lossy Counting (continued)
[Figure: the next window's items are added to the frequency counts; again, at the window boundary all counters are decremented by 1]
Error Analysis
If current size of stream = N and window-size w = 1/ε,
then # windows = εN
Rule of thumb: Set ε = 10% of support s
Example: Given support frequency s = 1%, set error frequency ε = 0.1%
How much do we undercount? A counter is decremented once per window boundary, so the frequency error is at most # windows = εN
Putting it all together…
Output: elements with counter values exceeding (s − ε)N
Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s − ε)N
How many counters do we need?
Worst-case bound: (1/ε) log εN counters
Implementation details…
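A minimal sketch of the windowed Lossy Counting loop described above (names mine):

def lossy_counting(stream, s, epsilon):
    """Report items whose frequency appears to exceed s*N,
    with undercounting error at most epsilon*N."""
    w = int(1 / epsilon)            # window size
    counts = {}
    N = 0
    for x in stream:
        N += 1
        counts[x] = counts.get(x, 0) + 1
        if N % w == 0:              # window boundary: decrement all counters
            for y in list(counts):
                counts[y] -= 1
                if counts[y] == 0:
                    del counts[y]   # drop counters that reach zero
    return [x for x, c in counts.items() if c > (s - epsilon) * N]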
Number of Counters?
Window size w = 1/ε
Number of windows m = εN
ni – # counters alive over the last i windows
Fact: Σ_{i=1}^{j} i·ni ≤ j·w for j = 1, 2, …, m
(a counter must average 1 increment per window to survive)
Claim: Σ_{i=1}^{j} ni ≤ Σ_{i=1}^{j} w/i for j = 1, 2, …, m
# active counters = Σ_{i=1}^{m} ni ≤ w·Σ_{i=1}^{m} 1/i ≤ w·log m = (1/ε)·log εN
Enhancements
Frequency Errors
For counter (X, c), true frequency is in [c, c+εN]
Trick: track the number of windows t for which the counter has been active
For counter (X, c, t), true frequency is in [c, c+t−1]
Batch Processing
Decrements after k windows
If (t = 1), no error!
Algorithm 2: Sticky Sampling
Create counters by sampling; maintain exact counts thereafter
What is the sampling rate?
[Figure: a stream of items, a few of which are sampled into counters]
Sticky Sampling (continued)
For finite stream of length N
Sampling rate = (2/εN)·log(1/(sδ)), where δ = probability of failure
Same rule of thumb: Set ε = 10% of support s
Example: Given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Output: elements with counter values exceeding (s − ε)N
Same error guarantees as Lossy Counting, but probabilistic
Approximation guarantees (probabilistic)
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s − ε)N
Number of Counters?
Finite stream of length N – sampling rate (2/εN)·log(1/(sδ))
Infinite stream with unknown N – gradually adjust the sampling rate
In either case, expected number of counters = (2/ε)·log(1/(sδ)) – independent of N
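A simplified fixed-rate sketch of Sticky Sampling for a finite stream of known length N (names mine; the paper's version adjusts the rate as the stream grows):

import math
import random

def sticky_sampling(stream, N, s, epsilon, delta):
    """Counters are created by sampling, then maintain exact counts."""
    rate = (2 / (epsilon * N)) * math.log(1 / (s * delta))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1              # exact counting once a counter exists
        elif random.random() < rate:
            counts[x] = 1               # create a counter by sampling
    return [x for x, c in counts.items() if c > (s - epsilon) * N]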
References – Synopses
Synopsis data structures for massive data sets. Gibbons and Matias. DIMACS 1999.
Tracking Join and Self-Join Sizes in Limited Storage, Alon, Gibbons, Matias, and Szegedy. PODS 1999.
Join Synopses for Approximate Query Answering, Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
Random Sampling for Histogram Construction: How much is enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets, Manku, Rajagopalan, and Lindsay. SIGMOD 1999.
Space-efficient online computation of quantile summaries, Greenwald and Khanna. SIGMOD 2001.
References – Sampling
Random Sampling with a Reservoir. Vitter. ACM Transactions on Mathematical Software 11(1):37-57 (1985).
On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering (1999).
On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD 1999.
Congressional Samples for Approximate Answering of Group-By Queries, Acharya, Gibbons, and Poosala. SIGMOD 2000.
Overcoming Limitations of Sampling for Aggregation Queries, Chaudhuri, Das, Datar, Motwani and Narasayya. ICDE 2001.
A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, Chaudhuri, Das and Narasayya. SIGMOD 01.
Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
Sampling algorithms: lower bounds and applications. Bar-Yossef–Kumar–Sivakumar. STOC 2001.
References – Sketches
Probabilistic counting algorithms for data base applications. Flajolet and Martin. JCSS (1985).
The space complexity of approximating the frequency moments. Alon, Matias, and Szegedy. STOC 1996.
Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB 2002.
Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP 2002.
An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS 1999.
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS 2000.