Optimal Approximations of the Frequency Moments of Data Streams
Piotr Indyk
David Woodruff
The Streaming Model
Example stream: 7 1 1 3 7 3 4 …
Stream of elements $a_1, \ldots, a_n$, each in {1, …, m}
Want to compute statistics on the stream
Elements arranged in adversarial order
Algorithms are given one pass over the stream
Goal: minimum-space algorithm
Frequency Moments [AMS96]
n = stream size, m = universe size
$f_i$ = number of occurrences of item i
k-th moment: $F_k = \sum_{i=1}^{m} f_i^k$
Why are frequency moments important?
$F_0$ = number of distinct elements
$F_1$ = n = stream size
$F_2$ = self-join size
Applications
Estimating distinct elements with low space
Estimate query selectivity to a huge DB without sorting
Routers gather the number of distinct destinations
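$F_2$ estimates the size of self-joins:
First table: (Bob, x), (Alice, y), (Bob, z)
Second table: (Bob, a), (Alice, b), (Bob, c)
Join on name: (Alice, b, y), (Bob, a, x), (Bob, a, z), (Bob, c, x), (Bob, c, z)
$f_B^2 + f_A^2 = 4 + 1 = 5$
$F_k$ measures data skewness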
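The self-join example can be checked with a short sketch. The row values mirror the slide; the helper names `join_on_name` and `second_moment` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Rows copied from the slide's example: (name, value) pairs.
left = [("Bob", "x"), ("Alice", "y"), ("Bob", "z")]
right = [("Bob", "a"), ("Alice", "b"), ("Bob", "c")]

def join_on_name(left_rows, right_rows):
    """Nested-loop join on the name column (illustrative helper)."""
    return [(n1, b, a)
            for (n1, a) in left_rows
            for (n2, b) in right_rows
            if n1 == n2]

def second_moment(names):
    """F_2 of a column: the sum of squared frequencies of its values."""
    return sum(c * c for c in Counter(names).values())

joined = join_on_name(left, right)
f2 = second_moment(name for name, _ in left)
```

Here the join has as many rows as the second moment of the name column, because both tables carry the same name frequencies (Bob twice, Alice once).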
The Best Deterministic Algorithm
Trivial algorithm for $F_k$
Store/update $f_i$ for each item i, sum $f_i^k$ at end
Space = $O(m \log n)$: m items i, $\log n$ bits to count $f_i$
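A minimal sketch of the trivial algorithm, assuming the stream fits in a Python list; the function name `exact_frequency_moment` is hypothetical. It keeps one counter per distinct item, which is exactly the $O(m \log n)$ space cost noted above.

```python
from collections import Counter

def exact_frequency_moment(stream, k):
    """Compute F_k exactly by storing one counter per distinct item."""
    counts = Counter()
    for item in stream:          # single pass over the stream
        counts[item] += 1
    if k == 0:
        return len(counts)       # F_0 = number of distinct elements
    return sum(c ** k for c in counts.values())

# The example stream from the streaming-model slide.
stream = [7, 1, 1, 3, 7, 3, 4]
```

On this stream the frequencies are 2, 2, 2 and 1, so `exact_frequency_moment(stream, 2)` sums 4 + 4 + 4 + 1.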
Negative Results [AMS96]:
Computing $F_k$ exactly requires $\Omega(m)$ space
Any deterministic alg. that outputs X with $|F_k - X| < \varepsilon F_k$ must use $\Omega(m)$ space
What about randomized algorithms?
Randomized Approx Algs for Fk
Randomized alg. $\varepsilon$-approximates $F_k$ if it outputs X s.t.
$\Pr[|F_k - X| < \varepsilon F_k] > 2/3$
Previous work (table suppresses polylog(mn) factors)
         Upper                                      Lower
$F_0$    $1/\varepsilon^2$ [FM85, GT02, BJKST02]    $1/\varepsilon^2$ [IW03, W04]
$F_1$    1 -                                        1 -
$F_2$    $1/\varepsilon^2$ [AMS96]                  $1/\varepsilon^2$ [W04]
$F_k$    $m^{1-1/(k-1)}$ [CK04, G04]                $m^{1-2/k}$ [BJKS02]
Matching Upper Bound
Our Contribution:
For every k there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\varepsilon$-approximate $F_k$
Additional Features:
1. Works even if we allow deletions, that is, a stream of elements (i, +), (i, -)
2. Constant update time
Techniques
Our “algorithm”
1. Divide frequencies into “buckets” 0, [1, 2), [2, 4), [4, 8), …, $[2^{i-1}, 2^i)$, …
2. Estimate the size $s_i$ of each bucket
3. Output $X = \sum_i s_i 2^{ik}$
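The three steps above can be sketched offline, assuming exact bucket sizes are available (the streaming algorithm only estimates them). The names `bucket_index` and `bucketed_moment` are hypothetical. Since every frequency f in bucket i satisfies $f < 2^i \le 2f$, the output X lies between $F_k$ and $2^k F_k$.

```python
from collections import Counter

def bucket_index(f):
    """Return i such that f lies in [2^(i-1), 2^i), for f >= 1."""
    return f.bit_length()

def bucketed_moment(stream, k):
    """Output X = sum_i s_i * 2^(i*k), using exact bucket sizes s_i."""
    counts = Counter(stream)
    # s_i = number of items whose frequency falls in bucket i
    sizes = Counter(bucket_index(f) for f in counts.values())
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

stream = [7, 1, 1, 3, 7, 3, 4]
x = bucketed_moment(stream, 2)
```

With power-of-two buckets this is only a constant-factor approximation for fixed k; the sketch is meant to show the bucketing idea, not the accuracy of the final algorithm.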
Previous Algorithms [AMS96, CK04, G04]
1. Cleverly construct small-space estimator X s.t.
$E[X] = F_k$
Var[X] small
2. Apply Chebyshev’s inequality
What’s Left?
Remaining Problem: Estimate $s_i$ = # of elements with frequency in each bucket $[2^{i-1}, 2^i)$
Is this always easy? No.
Suppose it were always easy: then we could approximate the maximum frequency
This is HARD: $\Omega(m)$ space [AMS96]
However, $\Omega(m)$ only applies to “worst-case” streams; otherwise we can do better: CountSketch [CCF-C]
For the moment, let’s assume:
1. There exists a 1-pass oracle Max returning the maximum frequency using O(B) space (we remove this using CountSketch)
2. We have a very long RAM of random bits, e.g. 0 1 1 0 0 0 1 … (we remove this using Nisan’s generator)
[Figure: frequency plotted against items, with Max marking the largest frequency]
General Idea: Max + Sampling
Restrict the input stream to a random subset of items in {1, …, m}, where each item is included independently with probability p.
Example: stream 7 1 1 3 7 3 4 …, random subset = {1, 3}
Restricted stream: 1 1 3 3 …
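The restriction step can be sketched as follows. The helper names `sample_items` and `restrict` are hypothetical, and the sketch samples the subset explicitly rather than through the hash functions the algorithm uses later.

```python
import random

def sample_items(m, p, rng):
    """Include each item of {1, ..., m} independently with probability p."""
    return {item for item in range(1, m + 1) if rng.random() < p}

def restrict(stream, kept):
    """Keep only the stream elements whose item was sampled."""
    return [a for a in stream if a in kept]

stream = [7, 1, 1, 3, 7, 3, 4]
restricted = restrict(stream, {1, 3})   # the slide's example subset
```

Sampling items rather than stream positions matters here: a sampled item keeps all of its occurrences, so its frequency in the restricted stream is unchanged.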
General Idea: Max + Sampling
Restrict the input to a random subset of items in {1, …, m}, where each item is included independently with probability p.
What are the chances that the maximum lies in $S_i$ = elements r such that $f_r \in [2^{i-1}, 2^i)$?
$q = (1-p)^{\sum_{j > i} s_j} \cdot (1 - (1-p)^{s_i})$
Idea:
1. Estimate q as q’ by taking independent trials and computing the fraction of trials in which the max lies in $S_i$
2. If we have already estimated $s_j$ for j > i, solve this expression for $s_i$.
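A small simulation can illustrate both steps. It assumes the sum of the higher class sizes is known exactly and is passed in as `higher`; all function names are hypothetical. Solving the expression for the class size gives log(1 - q / (1-p)^higher) / log(1-p).

```python
import math
import random

def max_class_probability(p, s_i, higher):
    """q = (1-p)^higher * (1 - (1-p)^s_i): no higher-class item is
    sampled, and at least one item of class i is sampled."""
    return (1 - p) ** higher * (1 - (1 - p) ** s_i)

def solve_for_class_size(q, p, higher):
    """Invert the expression for q to recover s_i (requires 0 < p < 1)."""
    inner = 1 - q / (1 - p) ** higher
    return math.log(inner) / math.log(1 - p)

def estimate_q(p, s_i, higher, trials, rng):
    """Fraction of independent trials in which the max falls in class i."""
    hits = 0
    for _ in range(trials):
        none_higher = all(rng.random() >= p for _ in range(higher))
        some_in_class = any(rng.random() < p for _ in range(s_i))
        if none_higher and some_in_class:
            hits += 1
    return hits / trials

p, s_i, higher = 0.1, 5, 3
q = max_class_probability(p, s_i, higher)
q_hat = estimate_q(p, s_i, higher, trials=20000, rng=random.Random(1))
s_hat = solve_for_class_size(q_hat, p, higher)
```

With the exact q the inversion recovers the class size exactly; with the sampled q’ the recovered size is only as good as the concentration of q’ around q.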
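When is this estimate any good?
Recall $q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$, so estimate $s_i$:
$s'_i = \log\left(1 - q' / (1-p)^{\sum_{j > i} s'_j}\right) / \log(1-p)$
Need
1. $s'_j \approx s_j$ for all j > i (holds inductively)
2. $q' \approx q$
Requires that there exists p so that q > 1/R, where R = # trials used to estimate q (tight concentration of q’)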
When is this estimate any good?
$q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$
p too large? → q too small
p too small? → q too small
Motivates the following:
Say a class $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
If R = (log n), then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$
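The Idealized Algorithm
1. Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$
2. Restrict stream Str to $Str_j^r$, those items i with $h_j^r(i) = 1$
3. For each $Str_j^r$, compute Max($Str_j^r$)
4. To estimate $s_i$ given $s'_t$ for t > i, find some j for which “enough” of the Max($Str_j^r$) come from $S_i$, and then set $s'_i = \log\left(1 - q' / (1-p)^{\sum_{t > i} s'_t}\right) / \log(1-p)$, where $p = 2^{-j}$ and q’ is the fraction of $r \in [R]$ for which Max($Str_j^r$) comes from $S_i$
5. Output $F'_k = \sum_i s'_i 2^{ik}$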
Removing the assumptions
[CCF-C02]: There exists a 1-pass O(B)-space algorithm CountSketch which, given stream Str, outputs all x for which $f_x^2 \ge F_2/B$
1. Assumption: There exists a 1-pass oracle Max returning the maximum frequency using O(B) space
Lemma: If $S_i = [2^{i-1}, 2^i)$ contributes, then
Proof: Hölder’s inequality.
Recall: $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
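For intuition about the CountSketch building block, here is a toy frequency estimator. It is a sketch under stated assumptions, not the construction from [CCF-C02]: Python's tuple hash with a random salt per row stands in for the limited-independence hash families, the class and method names are hypothetical, and the step that reports all heavy items is omitted.

```python
import random
from statistics import median

class CountSketch:
    """Toy CountSketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth, width, seed=0):
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        # Assumption: one random salt per row replaces proper hash families.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _bucket_and_sign(self, salt, item):
        h = hash((salt, item))
        bucket = h % self.width
        sign = 1 if (h >> 32) & 1 else -1
        return bucket, sign

    def update(self, item, delta=1):
        """Process one stream update; a negative delta is a deletion."""
        for row, salt in zip(self.table, self.salts):
            bucket, sign = self._bucket_and_sign(salt, item)
            row[bucket] += sign * delta

    def estimate(self, item):
        """Median over rows of the signed counter the item maps to."""
        guesses = []
        for row, salt in zip(self.table, self.salts):
            bucket, sign = self._bucket_and_sign(salt, item)
            guesses.append(sign * row[bucket])
        return median(guesses)

sketch = CountSketch(depth=5, width=64)
sketch.update(42, 1000)              # one heavy item
for light in range(100, 150):        # fifty items of frequency 1
    sketch.update(light)
heavy_estimate = sketch.estimate(42)
```

In each row the error on the heavy item is a signed sum over the light items that collide with it, so here it is at most 50 in absolute value. Because updates are linear, deletions are handled by passing a negative `delta`.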
Removing the assumptions
2. We have an infinite string of random bits
Consider a space-S algorithm A and a function f, with random strings $R_1, \ldots, R_n$, that, when processing a stream, maintains a variable C and updates it as follows: $C = C + f(i, R_i)$
[Indyk00] Then $R_1, \ldots, R_n$ can be generated using Nisan’s PRG, and:
1. The new algorithm A’ has space $\tilde{O}(S)$
2. The outputs of A’ and A are indistinguishable
Our algorithm follows this framework
Conclusions
Result: Tight $\tilde{O}(m^{1-2/k})$ upper bound
Handle deletions (j, -)
$\tilde{O}(1)$ update time
Open Problem: Reduce $\tilde{O}$ factors