
Optimal Approximations of the Frequency Moments of Data Streams

Piotr Indyk

David Woodruff

The Streaming Model

Example stream: 7 1 1 3 7 3 4 …

Stream of elements $a_1, …, a_n$, each in $\{1, …, m\}$

Want to compute statistics on the stream

Elements arranged in adversarial order

Algorithms are given one pass over the stream

Goal: minimum-space algorithm

Frequency Moments [AMS96]

$n$ = stream size, $m$ = universe size

$f_i$ = # occurrences of item $i$

k-th moment: $F_k = \sum_{i=1}^{m} f_i^k$

Why are frequency moments important?

$F_0$ = # of distinct elements

$F_1$ = $n$ = stream size

$F_2$ = self-join size

Applications

Estimating distinct elements with low space

Estimate query selectivity to huge DB without sorting

Routers gather # distinct destinations

$F_2$ estimates size of self-joins:

Table 1: (Bob, x), (Alice, y), (Bob, z)

Table 2: (Bob, a), (Alice, b), (Bob, c)

Join on name: (Alice, b, y), (Bob, a, x), (Bob, a, z), (Bob, c, x), (Bob, c, z)

$f_{Bob}^2 + f_{Alice}^2 = 4 + 1 = 5$

$F_k$ measures data skewness
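The self-join example can be checked with a few lines of Python. This is only an illustrative sketch: the names `table_1`, `table_2`, `join_on_name` and `second_moment` are made up here, and the join is computed by brute force rather than in a streaming fashion.

```python
from collections import Counter

# The two small tables from the example, as (name, value) rows.
table_1 = [("Bob", "x"), ("Alice", "y"), ("Bob", "z")]
table_2 = [("Bob", "a"), ("Alice", "b"), ("Bob", "c")]


def join_on_name(left, right):
    """Brute-force equi-join on the name column."""
    return [
        (name_l, val_r, val_l)
        for name_l, val_l in left
        for name_r, val_r in right
        if name_l == name_r
    ]


def second_moment(names):
    """F2 of a column: the sum of squared frequencies of its values."""
    return sum(f * f for f in Counter(names).values())


join_rows = join_on_name(table_1, table_2)
f2 = second_moment(name for name, _ in table_1)
print(len(join_rows), f2)  # prints "5 5"
```

Both tables have the same name frequencies (Bob twice, Alice once), so the join size equals $F_2$ of the name column.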

The Best Deterministic Algorithm

Trivial algorithm for Fk

Store/update $f_i$ for each item $i$, sum $f_i^k$ at end

Space = $O(m \log n)$: $m$ items $i$, $\log n$ bits to count $f_i$
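A minimal sketch of the trivial algorithm, assuming the stream is any Python iterable of hashable items. The function name `exact_fk` is hypothetical; the dictionary of counters is exactly the $O(m \log n)$-space state described above.

```python
from collections import Counter


def exact_fk(stream, k):
    """Compute F_k exactly by keeping one counter per distinct item."""
    counts = Counter()
    for item in stream:          # one pass over the stream
        counts[item] += 1        # store/update f_i
    # Only items that actually occur are stored, so k = 0 counts distinct items.
    return sum(f ** k for f in counts.values())


stream = [7, 1, 1, 3, 7, 3, 4]
print(exact_fk(stream, 0), exact_fk(stream, 1), exact_fk(stream, 2))  # prints "4 7 13"
```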

Negative Results [AMS96]:

Computing $F_k$ exactly requires $\Omega(m)$ space

Any deterministic alg. that outputs $X$ with $|F_k - X| < \epsilon F_k$ must use $\Omega(m)$ space

What about randomized algorithms?

Randomized Approx Algs for Fk

Randomized alg. $\epsilon$-approximates $F_k$ if it outputs $X$ s.t.

$\Pr[\,|F_k - X| < \epsilon F_k\,] > 2/3$

Previous work (table suppresses polylog(mn) factors)

        Upper                                   Lower
F0      $1/\epsilon^2$ [FM85, GT02, BJKST02]    $1/\epsilon^2$ [IW03, W04]
F1      1   -                                   1   -
F2      $1/\epsilon^2$ [AMS96]                  $1/\epsilon^2$ [W04]
Fk      $m^{1-1/(k-1)}$ [CK04, G04]             $m^{1-2/k}$ [BJKS02]

Matching Upper Bound

Our Contribution:

For every k there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\epsilon$-approximate $F_k$

Additional Features:

1. Works even if we allow deletions, that is, stream of elements (i, +), (i,-)

2. Constant update time

Techniques

Our “algorithm”

1. Divide frequencies into “buckets” $0, [1, 2), [2, 4), [4, 8), …, [2^{i-1}, 2^i), …$

2. Estimate size $s_i$ of each bucket

3. Output $X = \sum_i s_i 2^{ik}$
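The three steps can be sketched as follows, with one loud simplification: the bucket sizes are computed exactly from full frequency counts, which is the part the real algorithm must estimate in small space. The function names are hypothetical. With power-of-two buckets, every item of bucket $i$ satisfies $f^k < 2^{ik} \le (2f)^k$, so this $X$ lies within a factor $2^k$ of $F_k$; a finer bucketing would presumably be needed for an $\epsilon$-approximation.

```python
from collections import Counter


def bucket_index(f):
    """For f >= 1, return i such that f lies in [2^(i-1), 2^i)."""
    return f.bit_length()


def bucket_sizes(stream):
    """Step 1 and 2 (computed exactly here): s_i = # items with frequency in bucket i."""
    sizes = Counter()
    for f in Counter(stream).values():
        sizes[bucket_index(f)] += 1
    return sizes


def bucketed_fk(sizes, k):
    """Step 3: X = sum_i s_i * 2^(i*k)."""
    return sum(s * 2 ** (i * k) for i, s in sizes.items())


stream = [7, 1, 1, 3, 7, 3, 4]          # frequencies: 2, 2, 2, 1
sizes = bucket_sizes(stream)            # bucket 2 holds three items, bucket 1 holds one
print(bucketed_fk(sizes, 2))            # prints "52" (exact F_2 is 13)
```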

Previous Algorithms [AMS96, CK04, G04]

1. Cleverly construct small-space estimator X s.t.

E[X] = Fk

Var[X] small

2. Apply Chebyshev’s inequality
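As an illustration of this estimator-plus-Chebyshev recipe, here is a sketch in the spirit of the [AMS96] estimator for $F_2$: give every item a random sign, keep the signed sum $Z$ of the stream, and use $Z^2$, whose expectation is $F_2$. The simplifications are assumptions of this sketch, not the published algorithm: signs are fully random and stored in a dictionary (so the space is not small), and independent copies are simply averaged to reduce the variance.

```python
import random


def ams_f2_estimate(stream, copies=2000, seed=0):
    """Average of `copies` independent estimators Z^2 with E[Z^2] = F_2."""
    rng = random.Random(seed)
    signs = [{} for _ in range(copies)]   # per-copy random sign of each item
    z = [0] * copies                      # per-copy signed sum of the stream
    for item in stream:
        for c in range(copies):
            sign = signs[c].get(item)
            if sign is None:
                sign = signs[c][item] = rng.choice((-1, 1))
            z[c] += sign
    return sum(v * v for v in z) / copies


stream = [7, 1, 1, 3, 7, 3, 4]            # exact F_2 = 4 + 4 + 4 + 1 = 13
estimate = ams_f2_estimate(stream)
print(round(estimate, 2))                 # close to 13; the exact value depends on the seed
```

Averaging more copies shrinks the variance, which is what makes Chebyshev's inequality give a useful guarantee.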

What’s Left?

Remaining Problem: Estimate $s_i$ = # of elements with frequency in each bucket $[2^{i-1}, 2^i)$

Is this always easy? No.

Suppose it were always easy: then we could approximate the maximum frequency

This is HARD: $\Omega(m)$ space [AMS96]

However, $\Omega(m)$ only applies to “worst-case” streams, otherwise can do better: CountSketch [CCF-C02]

For the moment, let’s assume:

1. There exists a 1-pass oracle Max returning the maximum frequency using O(B) space (we remove this using CountSketch)

2. We have a very long RAM of random bits (we remove this using Nisan’s generator)

[Figure: a string of random bits 0 1 1 0 0 0 1 …, and a plot of frequency against items with Max marking the largest frequency]

General Idea: Max + Sampling

Restrict input stream to a random subset of items in $\{1, …, m\}$, where items are included independently with probability $p$.

Example: stream 7 1 1 3 7 3 4 …, random subset = {1, 3}, restricted stream 1 1 3 3 …

General Idea: Max + Sampling

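Restrict input to a random subset of items in $\{1, …, m\}$, where items are included independently with probability $p$.

What are the chances that the maximum lies in $S_i$ = the set of elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?

$q = (1-p)^{\sum_{j>i} s_j} \cdot (1 - (1-p)^{s_i})$

(no element of a higher bucket is sampled, and at least one element of $S_i$ is sampled)

Idea:

1. Estimate $q$ as $q'$ by taking independent trials and computing the fraction of trials in which the max lies in $S_i$

2. If $s_j$ is already estimated for all $j > i$, solve this expression for $s_i$.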
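A small sketch of both steps, using the formula $q = (1-p)^{\sum_{j>i} s_j}(1 - (1-p)^{s_i})$. The function names and the parameter `s_higher` (standing for $\sum_{j>i} s_j$) are assumptions made for illustration, and the trials sample the items directly rather than running over a stream. The inversion requires $0 < p < 1$ and $q < (1-p)^{\sum_{j>i} s_j}$.

```python
import math
import random


def max_in_bucket_probability(p, s_i, s_higher):
    """q: no higher-bucket item is sampled and at least one item of S_i is."""
    return (1 - p) ** s_higher * (1 - (1 - p) ** s_i)


def solve_for_bucket_size(q, p, s_higher):
    """Invert the expression for q to recover s_i."""
    ratio = q / (1 - p) ** s_higher        # equals 1 - (1-p)^s_i
    return math.log(1 - ratio) / math.log(1 - p)


def estimate_q(p, s_i, s_higher, trials, seed=0):
    """Step 1: fraction of independent trials in which the max lies in S_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        higher_sampled = any(rng.random() < p for _ in range(s_higher))
        bucket_sampled = any(rng.random() < p for _ in range(s_i))
        if bucket_sampled and not higher_sampled:
            hits += 1
    return hits / trials


q = max_in_bucket_probability(0.1, s_i=10, s_higher=5)
q_hat = estimate_q(0.1, s_i=10, s_higher=5, trials=20000)
print(round(q, 4))                                   # prints "0.3846"
print(round(solve_for_bucket_size(q, 0.1, 5), 6))    # prints "10.0"
print(solve_for_bucket_size(q_hat, 0.1, 5))          # close to 10 when q_hat is close to q
```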

When is this estimate any good?

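Recall $q = (1-p)^{\sum_{j>i} s_j} (1 - (1-p)^{s_i})$, so estimate $s_i$ as

$s'_i = \log\left(1 - q' / (1-p)^{\sum_{j>i} s'_j}\right) / \log(1-p)$

Need

1. $s'_j \approx s_j$ for all $j > i$ (holds inductively)

2. $q' \approx q$ (tight concentration of $q'$)

Requires that there exists $p$ such that $q > 1/R$, where $R$ = # trials used to estimate $q$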

When is this estimate any good?

Motivates the following:

Say a class $S_i$ contributes if and only if $s_i > \left(\sum_{j>i} s_j\right) / R$

If $R$ is on the order of $\log n$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$
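The definition can be made concrete with a short sketch. It assumes the bucket sizes are given as a dictionary mapping $i$ to $s_i$, and the function names are hypothetical. The example uses bucket sizes chosen by hand so that the two low classes do not contribute.

```python
def contributing_classes(sizes, R):
    """Return the set of i with s_i > (sum of s_j over j > i) / R."""
    result = set()
    higher = 0                                  # running sum of s_j for j > i
    for i in sorted(sizes, reverse=True):
        if sizes[i] > higher / R:
            result.add(i)
        higher += sizes[i]
    return result


def bucketed_moment(sizes, k, classes=None):
    """Sum of s_i * 2^(i*k) over the given classes (all classes by default)."""
    if classes is None:
        classes = sizes.keys()
    return sum(sizes[i] * 2 ** (i * k) for i in classes)


sizes = {1: 2, 3: 1, 6: 500}                    # hand-picked example
good = contributing_classes(sizes, R=10)
print(sorted(good))                             # prints "[6]"
print(bucketed_moment(sizes, 2, good), bucketed_moment(sizes, 2))  # prints "2048000 2048072"
```

In this example the non-contributing classes account for well under 0.01% of the bucketed moment, which is the behaviour the approximation relies on.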

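$q = (1-p)^{\sum_{j>i} s_j} (1 - (1-p)^{s_i})$

$p$ too large ⇒ $q$ too small (an element of a higher bucket is likely to be sampled)

$p$ too small ⇒ $q$ too small (no element of $S_i$ is likely to be sampled)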
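The trade-off is easy to see numerically. The sketch below evaluates $q$ on a grid of sampling probabilities; the function names, the grid resolution and the example sizes ($s_i = 10$, $\sum_{j>i} s_j = 5$) are arbitrary choices for illustration.

```python
def max_in_bucket_probability(p, s_i, s_higher):
    """q = (1-p)^s_higher * (1 - (1-p)^s_i)."""
    return (1 - p) ** s_higher * (1 - (1 - p) ** s_i)


def best_sampling_probability(s_i, s_higher, grid=1000):
    """Grid search for the p in (0, 1) that maximizes q."""
    candidates = [t / grid for t in range(1, grid)]
    return max(candidates, key=lambda p: max_in_bucket_probability(p, s_i, s_higher))


for p in (0.001, 0.1, 0.9):
    print(p, max_in_bucket_probability(p, s_i=10, s_higher=5))
# q is about 0.0099 at p = 0.001, about 0.38 at p = 0.1, and about 0.00001 at p = 0.9

print(best_sampling_probability(10, 5))   # near 0.104 on this grid
```

Setting the derivative of $q$ to zero gives $(1-p)^{s_i} = \sum_{j>i} s_j / (s_i + \sum_{j>i} s_j)$, which for these sizes is $p \approx 0.104$, in line with the grid search.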

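The Idealized Algorithm

1. Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$

2. Restrict stream $Str$ to $Str_j^r$, those items $i$ with $h_j^r(i) = 1$

3. For each $Str_j^r$, compute $Max(Str_j^r)$

4. To estimate $s_i$ given $s'_t$ for $t > i$, find some $j$ for which “enough” of the $Max(Str_j^r)$ come from $S_i$, and then set $s'_i$ by solving $q' = (1-p)^{\sum_{t>i} s'_t} (1 - (1-p)^{s'_i})$ with $p = 2^{-j}$, where $q'$ is the fraction of the $R$ values $Max(Str_j^r)$ that come from $S_i$

5. Output $F'_k = \sum_i s'_i 2^{ik}$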
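Steps 1 to 3 can be sketched as below; steps 4 and 5 are not covered. Everything concrete here is an assumption for illustration: the hash family is a simple linear hash modulo a prime rather than one derived from a shared random string, the levels run over $j = 0, …, \lceil \log_2 m \rceil$, and `exact_max` stands in for the Max oracle by counting everything (so it does not respect the space bound).

```python
import math
import random
from collections import Counter

PRIME = (1 << 61) - 1


def make_hash(j, rng):
    """A hash h : items -> {1, ..., 2^j}; an item is kept when h(item) == 1."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % (2 ** j) + 1


def restrict(stream, h):
    """Step 2: keep only occurrences of items that hash to 1."""
    return [x for x in stream if h(x) == 1]


def exact_max(stream):
    """Stand-in for the Max oracle: the largest frequency (0 if empty)."""
    counts = Counter(stream)
    return max(counts.values()) if counts else 0


def max_per_substream(stream, m, R, seed=0):
    """Steps 1-3: Max of every restricted stream, indexed by (j, r)."""
    rng = random.Random(seed)
    result = {}
    for j in range(math.ceil(math.log2(m)) + 1):
        for r in range(R):
            h = make_hash(j, rng)
            result[(j, r)] = exact_max(restrict(stream, h))
    return result


maxima = max_per_substream([7, 1, 1, 3, 7, 3, 4], m=8, R=3)
print(maxima[(0, 0)])   # prints "2": level j = 0 keeps every item
```

Because the decision to keep an item depends only on the item, either all of its occurrences survive the restriction or none do, so frequencies inside a restricted stream are unchanged.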

Removing the assumptions

1. Assumption: there exists a 1-pass oracle Max returning the maximum frequency using O(B) space

[CCF-C02]: There exists a 1-pass O(B)-space algorithm CountSketch which, given stream $Str$, outputs all $x$ for which $f_x^2 \ge F_2/B$

Lemma: If $S_i$ (the elements with frequency in $[2^{i-1}, 2^i)$) contributes, then

Proof: Hölder’s inequality.

Recall: $S_i$ contributes if and only if $s_i > \left(\sum_{j>i} s_j\right) / R$
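For readers unfamiliar with CountSketch, here is a simplified sketch of the data structure. It is not the exact construction of [CCF-C02]: the linear hashes, the parameter names `rows` and `width`, and the seed are assumptions of this illustration, and the code only answers point queries for a frequency rather than listing all heavy items. Each row hashes an item to one counter and adds a random sign times the update; a query takes the median of the signed counters.

```python
import random
from statistics import median


class CountSketch:
    PRIME = (1 << 61) - 1

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(rows)]
        # Per row: (a, b) choose the counter, (c, d) choose the sign.
        self.params = [
            tuple(rng.randrange(1, self.PRIME) for _ in range(4))
            for _ in range(rows)
        ]

    def _bucket_and_sign(self, row, x):
        a, b, c, d = self.params[row]
        bucket = ((a * x + b) % self.PRIME) % self.width
        sign = 1 if ((c * x + d) % self.PRIME) & 1 else -1
        return bucket, sign

    def update(self, x, delta=1):
        """Process (x, +) with delta = 1 or (x, -) with delta = -1."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            self.table[row][bucket] += sign * delta

    def estimate(self, x):
        """Median over rows of the signed counter that x maps to."""
        values = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            values.append(sign * self.table[row][bucket])
        return median(values)


sketch = CountSketch(rows=5, width=64, seed=1)
for _ in range(1000):
    sketch.update(42)                 # one heavy item
for x in range(100, 150):
    sketch.update(x)                  # fifty light items, once each
print(sketch.estimate(42))            # within 50 of 1000
```

The updates are linear, so deletions are handled by passing `delta=-1`, which matches the deletion feature claimed for the overall algorithm.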

Removing the assumptions

2. We have an infinite string of random bits

Consider a space-$S$ algorithm $A$ and a function $f$, with random strings $R_1, …, R_n$, that, when processing a stream, maintains a variable $C$ and updates it as follows: $C = C + f(i, R_i)$

[Indyk00] Then $R_1, …, R_n$ can be generated using Nisan’s PRG, and:

1. The new algorithm $A'$ has space $\tilde{O}(S)$

2. The outputs of $A'$ and $A$ are indistinguishable

Our algorithm follows this framework

Conclusions

Result:

Tight $\tilde{O}(m^{1-2/k})$ upper bound

Handles deletions $(j, -)$

$\tilde{O}(1)$ update time

Open Problem: Reduce O~ factors
