
Statistic Estimation over Data Streams

Slides modified from Minos Garofalakis (Yahoo! Research) and S. Muthukrishnan (Rutgers University)

2

Outline

Introduction

Frequency moment estimation

Element frequency estimation

3

Data Stream Processing Algorithms

Generally, algorithms compute approximate answers

– Provably difficult to compute answers accurately with limited memory

Approximate answers - Deterministic bounds

– Algorithms compute only an approximate answer, but with bounds on the error

Approximate answers - Probabilistic bounds

– Algorithms compute an approximate answer with high probability

• With probability at least 1 − δ, the computed answer is within a factor of (1 ± ε) of the actual answer

4

Sampling: Basics

Idea: A small random sample S of the data often well-represents all the data

– For a fast approximate answer, apply a “modified” query to S

– Example: select agg from R (n=12)

– If agg is avg, return average of the elements in S

– Number of odd elements?

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

Sample S: 9 5 1 8

answer: 11.5
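To make the sample-then-scale idea concrete, here is a minimal Python sketch (the function and the scaled count-query estimator are illustrative, not from the slides) that estimates the number of odd elements from a uniform sample:

```python
import random

def sample_count_estimate(stream, sample_size, predicate):
    """Estimate a COUNT query by scaling the count observed in a
    uniform random sample (illustrative; samples without replacement)."""
    sample = random.sample(stream, sample_size)
    matches = sum(1 for x in sample if predicate(x))
    # Scale the sample count up to the full stream size.
    return matches * len(stream) / sample_size

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
# Estimate the number of odd elements from a sample of 4 (true answer: 8).
print(sample_count_estimate(stream, 4, lambda x: x % 2 == 1))
```

Any single sample can be far off; the tail inequalities below quantify how likely that is.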

5

Probabilistic Guarantees

Example: Actual answer is within 11.5 ± 1 with prob 0.9

Randomized algorithms: Answer returned is a specially-built random variable

Use Tail Inequalities to give probabilistic bounds on returned answer

– Markov Inequality

– Chebyshev’s Inequality

– Chernoff/Hoeffding Bound

6

Basic Tools: Tail Inequalities

General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)

Basic Inequalities: Let X be a random variable with expectation μ and variance Var[X]. Then, for any ε > 0:

Markov (for X ≥ 0): Pr( X ≥ ε ) ≤ μ/ε

Chebyshev: Pr( |X − μ| ≥ ε ) ≤ Var[X] / ε²

[Figure: a probability distribution with the tail probability shaded where X deviates far from μ]
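As a quick sanity check of how these bounds behave, here is a small Monte Carlo experiment (the binomial example is my own, chosen for illustration) comparing the empirical tail against the Chebyshev bound:

```python
import random

# X = number of heads in 100 fair coin flips: mu = 50, Var[X] = 25.
mu, var, eps = 50.0, 25.0, 10.0
trials = [sum(random.randint(0, 1) for _ in range(100)) for _ in range(20_000)]
empirical = sum(1 for x in trials if abs(x - mu) >= eps) / len(trials)
print("empirical tail: ", empirical)       # roughly 0.06
print("Chebyshev bound:", var / eps ** 2)  # 0.25, valid but loose
```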

7

Tail Inequalities for Sums

Possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials

Chernoff Bound: Let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (and Pr[Xi = 0] = 1 − p). Let X = Σ_i X_i and μ = mp be the expectation of X. Then, for any ε > 0:

Application to count queries:

– m is size of sample S (4 in example)

– p is fraction of odd elements in stream (2/3 in example)

Pr( |X − μ| ≥ εμ ) ≤ 2·exp(−μ·ε²/2)

No need to compute Var(X), but the independence assumption is required!

8

The Streaming Model

Underlying signal: One-dimensional array A[1…N] with values A[i] all initially zero

–Multi-dimensional arrays as well (e.g., row-major)

Signal is implicitly represented via a stream of updates

–j-th update is <k, c[j]> implying

• A[k] := A[k] + c[j] (c[j] can be >=0, <0)

Goal: Compute functions on A[] subject to

–Small space

–Fast processing of updates

–Fast function computation

–…
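The following toy simulation (illustrative only: it stores A[] explicitly, which is exactly the linear-space cost that sketches are designed to avoid) shows the turnstile update rule:

```python
from collections import defaultdict

A = defaultdict(int)  # the underlying signal A[1...N], implicitly all zero

def process_update(k, c):
    """Apply an update <k, c>: A[k] := A[k] + c (c may be negative)."""
    A[k] += c

for k, c in [(3, 1), (1, 1), (2, 1), (4, 1), (2, 1), (3, 1), (5, 1), (2, -1)]:
    process_update(k, c)
print(dict(A))  # {3: 2, 1: 1, 2: 1, 4: 1, 5: 1}
```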

9

Streaming Model: Special Cases

Time-Series Model

–Only j-th update updates A[j] (i.e., A[j] := c[j])

Cash-Register Model

– c[j] is always >= 0 (i.e., increment-only)

–Typically, c[j]=1, so we see a multi-set of items in one pass

Turnstile Model

–Most general streaming model

– c[j] can be >=0 or <0 (i.e., increment or decrement)

Problem difficulty varies depending on the model

–E.g., MIN/MAX in Time-Series vs. Turnstile!

10

Frequent moment computation

Problem:

• Data arrives online: a1, a2, a3, ..., am, with each a_j ∈ {1, 2, ..., n}

• Let f(i) = |{ j | a_j = i }| (the frequency of value i, represented by A[i])

• The k-th frequency moment is F_k = Σ_{i=1}^{n} f(i)^k

Example

Data stream: 3, 1, 2, 4, 2, 3, 5, . . .

f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1

F0 = 5 (the number of distinct elements), F1 = 7 (the stream length), F2 = 1·1 + 2·2 + 2·2 + 1·1 + 1·1 = 11 (the “surprise index”). What is F∞ (the maximum frequency)?
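As a baseline, the moments can be computed exactly with a hash table of counters; this is the linear-space computation that the streaming algorithms below approximate in sublinear space (a minimal sketch):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Compute F_k = sum_i f(i)^k exactly (linear space)."""
    freq = Counter(stream)
    if k == float("inf"):
        return max(freq.values())  # F_inf is the maximum frequency
    return sum(f ** k for f in freq.values())

stream = [3, 1, 2, 4, 2, 3, 5]
print(frequency_moment(stream, 0))             # 5, the distinct elements
print(frequency_moment(stream, 1))             # 7, the stream length
print(frequency_moment(stream, 2))             # 11, the surprise index
print(frequency_moment(stream, float("inf")))  # 2, answering F_inf above
```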

11

Frequency moment computation

Easy for F1: just maintain one counter of the stream length

How about the others?

– Focus on F2 and F0

– Then estimation of general F_k

12

Linear-Projection (AMS) Sketch Synopses

Goal: Build a small-space summary for the distribution vector f(i) (i = 1, ..., N) seen as a stream of i-values

Basic Construct: Randomized linear projection of f() = inner/dot product of the f-vector with a random vector ξ:

⟨f, ξ⟩ = Σ_i f(i)·ξ_i, where ξ = vector of random values from an appropriate distribution

– Simple to compute over the stream: add ξ_i whenever the i-th value is seen

– Generate the ξ_i's in small O(logN) space using pseudo-random generators

– Tunable probabilistic guarantees on approximation error

– Delete-proof: just subtract ξ_i to delete an occurrence of the i-th value

Data stream: 3, 1, 2, 4, 2, 3, 5, . . .

f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1

13

AMS (sketch) cont.

Key Intuition: Use randomized linear projections of f() to define a random variable X such that

– X is easily computed over the stream (in small space)

– E[X] = F2

– Var[X] is small

Basic Idea:

– Define a family of 4-wise independent {-1, +1} random variables ξ_i

– Pr[ξ_i = +1] = Pr[ξ_i = -1] = 1/2

• Expected value of each ξ_i: E[ξ_i] = ? E[ξ_i²] = ?

– The variables are 4-wise independent

• Expected value of the product of any 4 distinct ξ_i's is 0: E[ξ_1·ξ_2·ξ_3·ξ_4] = 0

– The variables can be generated using a pseudo-random generator using only O(log N) space (for seeding)!

Probabilistic error guarantees (e.g., actual answer is 10 ± 1 with probability 0.9)
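One standard way to realize such a family (an assumption here; the slides only mention pseudo-random generators) is to evaluate a random degree-3 polynomial over a prime field and map the result to a sign. Only the four coefficients, O(log N) bits, need to be stored:

```python
import random

P = 2_147_483_647  # a prime (2^31 - 1), assumed larger than the domain N

class FourWiseSigns:
    """4-wise independent {-1,+1} variables via a random degree-3
    polynomial over GF(P); the low bit is only negligibly biased
    because P is odd."""
    def __init__(self, seed=None):
        rng = random.Random(seed)
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def xi(self, i):
        a, b, c, d = self.coeffs
        h = (((a * i + b) * i + c) * i + d) % P  # Horner evaluation
        return 1 if h & 1 else -1

xi = FourWiseSigns(seed=42)
print([xi.xi(i) for i in range(1, 6)])  # signs for elements 1..5
```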

14

AMS (sketch) cont.

Example

Data stream R: 4 1 2 4 1 4, so f(1) = 2, f(2) = 1, f(3) = 0, f(4) = 3

Z = Σ_i f(i)·ξ_i = 2ξ_1 + ξ_2 + 3ξ_4, and X = Z²

Suppose {ξ_i}:

1) ξ_1, ξ_2 ∈ {1} and ξ_3, ξ_4 ∈ {-1}; then Z = ?

2) ξ_2 ∈ {1} and ξ_1, ξ_3, ξ_4 ∈ {-1}; then Z = ?
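Working the quiz in code (note that the true F2 here is 2² + 1² + 3² = 14, which is what X equals in expectation over the random signs):

```python
# f for the stream R = 4 1 2 4 1 4; element 3 never appears.
f = {1: 2, 2: 1, 4: 3}

def X(xi):
    """One AMS estimate: X = Z^2 with Z = sum_i f(i) * xi[i]."""
    Z = sum(fi * xi[i] for i, fi in f.items())
    return Z * Z

# Case 1: xi_1 = xi_2 = +1, xi_3 = xi_4 = -1  ->  Z = 2 + 1 - 3 = 0
print(X({1: +1, 2: +1, 3: -1, 4: -1}))  # X = 0
# Case 2: xi_2 = +1, the rest -1          ->  Z = -2 + 1 - 3 = -4
print(X({1: -1, 2: +1, 3: -1, 4: -1}))  # X = 16
```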

15

AMS (sketch) cont.

Expected value of X = F2:

E(X) = E(Z²) = E[ (Σ_i f(i)·ξ_i)² ] = E[ Σ_i f(i)²·ξ_i² ] + 2·E[ Σ_{i<i'} f(i)·f(i')·ξ_i·ξ_i' ] = Σ_i f(i)² = F2

using E(ξ_i²) = 1 and E(ξ_i·ξ_i') = 0 for i ≠ i'

Using 4-wise independence, it is possible to show that

Var[X] = E(X²) − E(X)² ≤ 2·F2²

since E(X²) = E(Z⁴) = Σ_i f(i)⁴ + 6·Σ_{i<i'} f(i)²·f(i')²

16

Boosting Accuracy

Chebyshev’s Inequality: Pr( |X − E[X]| ≥ ε·E[X] ) ≤ Var[X] / (ε²·E[X]²)

Boost accuracy by averaging over s independent copies of X (averaging reduces the variance): Y = (x1 + x2 + ... + xs)/s, with s = 16/ε² copies

E[Y] = E[X] = F2 and Var[Y] = Var[X]/s ≤ ε²·F2²/8

By Chebyshev: Pr( |Y − F2| ≥ ε·F2 ) ≤ Var[Y] / (ε²·F2²) ≤ 1/8

17

Boosting Confidence

Boost confidence to 1 − δ by taking the median of 2log(1/δ) independent copies of Y

Each Y is a Bernoulli trial, where “FAILURE” means |Y − F2| ≥ ε·F2 and happens with probability ≤ 1/8

Pr[ |median(Y) − F2| ≥ ε·F2 ] = Pr[ # failures in 2log(1/δ) trials ≥ log(1/δ) ] ≤ δ (by the Chernoff bound)

(The median of the copies can fall outside (1 ± ε)·F2 only if at least half of them fail.)

18

Summary of AMS Sketching for F2

Step 1: Compute random variables Z = Σ_i f(i)·ξ_i

Step 2: Define X = Z²

Steps 3 & 4: Average s = 16/ε² independent copies of X to get each Y; return the median of 2log(1/δ) such averages

Main Theorem: Sketching approximates F2 to within a relative error of ε with probability ≥ 1 − δ using space O( (log(1/δ)·logN) / ε² )

– Remember: O(log N) space for “seeding” the construction of each X
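Putting the steps together, here is a compact, illustrative implementation of the median-of-averages F2 estimator (it draws fully independent random signs per copy for brevity, whereas the actual construction needs only the 4-wise independent family above, and it re-reads the stream per copy, whereas a real sketch runs all copies in one pass):

```python
import random
import statistics

def ams_f2(stream, num_averages, num_medians, seed=0):
    """Median-of-averages AMS estimator for F2 (illustrative)."""
    rng = random.Random(seed)
    averages = []
    for _ in range(num_medians):
        copies = []
        for _ in range(num_averages):
            signs = {}
            Z = 0
            for a in stream:  # Z = sum_i f(i) * xi_i, built incrementally
                if a not in signs:
                    signs[a] = rng.choice([-1, 1])
                Z += signs[a]
            copies.append(Z * Z)  # X = Z^2
        averages.append(sum(copies) / len(copies))  # Y = average of copies
    return statistics.median(averages)

stream = [3, 1, 2, 4, 2, 3, 5]
print(ams_f2(stream, num_averages=200, num_medians=5))  # close to F2 = 11
```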

19

Binary-Join COUNT Query

Problem: Compute the answer for the query COUNT(R ⋈_A S)

Example:

Data stream R.A: 4 1 2 4 1 4, so f_R(1) = 2, f_R(2) = 1, f_R(3) = 0, f_R(4) = 3

Data stream S.A: 3 1 2 4 2 4, so f_S(1) = 1, f_S(2) = 2, f_S(3) = 1, f_S(4) = 2

COUNT(R ⋈_A S) = Σ_i f_R(i)·f_S(i) = 2 + 2 + 0 + 6 = 10

Exact solution: too expensive, requires O(N) space!

– N = sizeof(domain(A))

20

Basic AMS Sketching Technique [AMS96]

Key Intuition: Use randomized linear projections of f() to define a random variable X such that

– X is easily computed over the stream (in small space)

– E[X] = COUNT(R ⋈_A S)

– Var[X] is small

Basic Idea:

– Define a family of 4-wise independent {-1, +1} random variables { ξ_i : i = 1, ..., N }

21

AMS Sketch Construction

Compute random variables X_R = Σ_i f_R(i)·ξ_i and X_S = Σ_i f_S(i)·ξ_i

– Simply add ξ_i to X_R (X_S) whenever the i-th value is observed in the R.A (S.A) stream

Define X = X_R·X_S to be the estimate of the COUNT query

Example:

Data stream R.A: 4 1 2 4 1 4, so f_R(1) = 2, f_R(2) = 1, f_R(3) = 0, f_R(4) = 3

Data stream S.A: 3 1 2 4 2 4, so f_S(1) = 1, f_S(2) = 2, f_S(3) = 1, f_S(4) = 2

For instance, on seeing value 4 in R.A: X_R = X_R + ξ_4; on seeing value 1 in S.A: X_S = X_S + ξ_1

After the full streams: X_R = 2ξ_1 + ξ_2 + 3ξ_4 and X_S = ξ_1 + 2ξ_2 + ξ_3 + 2ξ_4
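A compact illustrative version of the join-size estimator follows (fully random signs again stand in for the 4-wise independent family; the essential point is that both streams must share the same ξ's):

```python
import random

def join_size_estimate(stream_r, stream_s, copies=500, seed=0):
    """Average of independent X = X_R * X_S estimates (illustrative)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(copies):
        signs = {}  # one shared sign assignment for both streams
        def xi(i):
            if i not in signs:
                signs[i] = rng.choice([-1, 1])
            return signs[i]
        X_R = sum(xi(a) for a in stream_r)  # X_R = sum_i f_R(i) * xi_i
        X_S = sum(xi(a) for a in stream_s)  # X_S = sum_i f_S(i) * xi_i
        total += X_R * X_S
    return total / copies

R = [4, 1, 2, 4, 1, 4]
S = [3, 1, 2, 4, 2, 4]
print(join_size_estimate(R, S))  # close to COUNT(R join_A S) = 10
```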

22

Binary-Join AMS Sketching Analysis

Expected value of X = COUNT(R ⋈_A S):

E[X] = E[X_R·X_S] = E[ (Σ_i f_R(i)·ξ_i)·(Σ_i f_S(i)·ξ_i) ] = E[ Σ_i f_R(i)·f_S(i)·ξ_i² ] + E[ Σ_{i≠i'} f_R(i)·f_S(i')·ξ_i·ξ_i' ] = Σ_i f_R(i)·f_S(i)

using E(ξ_i²) = 1 and E(ξ_i·ξ_i') = 0 for i ≠ i'

Using 4-wise independence, it is possible to show that

Var[X] ≤ 2·SJ(R)·SJ(S)

where SJ(R) = Σ_i f_R(i)² is the self-join size of R (its second/L2 moment)

23

Boosting Accuracy

Chebyshev’s Inequality: Pr( |X − E[X]| ≥ ε·E[X] ) ≤ Var[X] / (ε²·E[X]²)

Boost accuracy by averaging over s independent copies of X (averaging reduces the variance): Y = (x1 + ... + xs)/s, with s = 16·SJ(R)·SJ(S) / (ε²·COUNT²) copies

E[Y] = E[X] = COUNT(R ⋈_A S) and Var[Y] = Var[X]/s ≤ ε²·COUNT²/8

By Chebyshev: Pr( |Y − COUNT| ≥ ε·COUNT ) ≤ Var[Y] / (ε²·COUNT²) ≤ 1/8

24

Boosting Confidence

Boost confidence to 1 − δ by taking the median of 2log(1/δ) independent copies of Y

Each Y is a Bernoulli trial, where “FAILURE” means |Y − COUNT| ≥ ε·COUNT and happens with probability ≤ 1/8

Pr[ |median(Y) − COUNT| ≥ ε·COUNT ] = Pr[ # failures in 2log(1/δ) trials ≥ log(1/δ) ] ≤ δ (by the Chernoff bound)

(The median of the copies can fall outside (1 ± ε)·COUNT only if at least half of them fail.)

25

Summary of Binary-Join AMS Sketching

Step 1: Compute random variables X_R = Σ_i f_R(i)·ξ_i and X_S = Σ_i f_S(i)·ξ_i

Step 2: Define X = X_R·X_S

Steps 3 & 4: Average s = 16·SJ(R)·SJ(S) / (ε²·COUNT²) independent copies of X to get each Y; return the median of 2log(1/δ) such averages

Main Theorem (AGMS99): Sketching approximates COUNT to within a relative error of ε with probability ≥ 1 − δ using space O( SJ(R)·SJ(S)·log(1/δ)·logN / (ε²·COUNT²) )

– Remember: O(log N) space for “seeding” the construction of each X

26

Distinct Value Estimation ( F0 )

Problem: Find the number of distinct values in a stream of values with domain [0,...,N-1]

– Zeroth frequency moment

– Statistics: number of species or classes in a population

– Important for query optimizers

– Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.

Example (N=64)

Hard problem for random sampling!

– Must sample almost the entire table to guarantee the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!

Data stream: 3 0 5 3 0 1 7 5 1 0 3 7

Number of distinct values: F0 = 5

27

Assume a hash function h(x) that maps incoming values x in [0,…, N-1] uniformly across [0,…, 2^L-1], where L = O(logN)

Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y

– A value x is mapped to lsb(h(x))

Maintain Hash Sketch = BITMAP array of L bits, initialized to 0

– For each incoming value x, set BITMAP[ lsb(h(x)) ] = 1

Prob[ lsb(h(x)) = i ] = ?

Hash (aka FM) Sketches for Distinct Value Estimation [FM85]

Example: x = 5, h(x) = 101100 (binary), lsb(h(x)) = 2, so set BITMAP[2] = 1, giving BITMAP = 0 0 0 1 0 0 (bit positions 5 4 3 2 1 0)
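A small illustrative FM sketch in Python (the salted SHA-1 hash is my stand-in for the practical hash functions discussed on a later slide; phi = 0.7735 is the constant from the analysis on the next slide):

```python
import hashlib

def lsb(y):
    """Position of the least-significant 1 bit of y (y > 0)."""
    return (y & -y).bit_length() - 1

class FMSketch:
    def __init__(self, L=32, salt=""):
        self.L, self.salt = L, salt
        self.bitmap = 0  # L bits, initially all zero

    def add(self, x):
        h = int(hashlib.sha1(f"{self.salt}{x}".encode()).hexdigest(), 16)
        h %= 1 << self.L  # uniform in [0, 2^L - 1]
        if h:
            self.bitmap |= 1 << lsb(h)  # set BITMAP[lsb(h(x))] = 1

    def estimate(self):
        r = 0  # R = position of the rightmost zero in BITMAP
        while (self.bitmap >> r) & 1:
            r += 1
        return 2 ** r / 0.7735  # d is approximately 2^R / phi

fm = FMSketch(salt="demo")
for x in [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]:
    fm.add(x)
print(fm.estimate())  # a rough estimate of the 5 distinct values
```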

28

Hash (FM) Sketches for Distinct Value Estimation [FM85]

By uniformity through h(x): Prob[ BITMAP[k] = 1 ] = 1/2^(k+1)

– Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], . . .

Let R = position of the rightmost zero in BITMAP

– Use R as an indicator of log(d)

[FM85] prove that E[R] = log(φd), where φ ≈ 0.7735

– Estimate d = 2^R / φ

– Average several iid instances (different hash functions) to reduce estimator variance

BITMAP layout: positions << log(d) are almost all 1s, positions >> log(d) are almost all 0s, with a fringe of mixed 0s/1s around position log(d)

29

Accuracy of FM

Keep m = O( (1/ε²)·log(1/δ) ) independent BITMAP sketches (BITMAP 1, ..., BITMAP m, each with its own hash function) and average their R values

This yields an approximation to within a factor of (1 ± ε) with probability at least 1 − δ

30

Hash (FM) Sketches for Distinct Value Estimation

[FM85] assume “ideal” hash functions h(x) (N-wise independence)

– In practice: h(x) = (a·x + b) mod N, where a, b are random binary vectors in [0, ..., 2^L − 1]

Composable: Component-wise OR/add distributed sketches together

– Estimate |S1 ∪ S2 ∪ … ∪ Sk| = set-union cardinality

31

Cash Register Sketch (AMS)

• Choose a random stream position p from 1..m, and count the occurrences of a_p from position p onward: r = |{ q : q ≥ p, a_q = a_p }|

• Estimator: X = m·( r^k − (r − 1)^k )

• Using F2 (k = 2) as an example:

Data stream: 3, 1, 2, 4, 2, 3, 5, . . . (m = 7)

If we choose the first element, a_1 = 3: r = 2 and X = 7·(2·2 − 1·1) = 21. And for a_2: r = ?, X = ? For a_5: r = ?, X = ?

• A more general algorithm for F_k (below)
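A small simulation of this estimator (averaging many independent samples of X so the mean approaches E[X] = F2 = 11 for the example stream):

```python
import random

def cash_register_sample(stream, k=2, rng=random):
    """One sample of the AMS cash-register estimator for F_k:
    pick a random position p, set r = occurrences of a_p from p onward,
    and return X = m * (r^k - (r-1)^k)."""
    m = len(stream)
    p = rng.randrange(m)
    r = sum(1 for q in range(p, m) if stream[q] == stream[p])
    return m * (r ** k - (r - 1) ** k)

stream = [3, 1, 2, 4, 2, 3, 5]
samples = [cash_register_sample(stream) for _ in range(10_000)]
print(sum(samples) / len(samples))  # approaches F2 = 11
```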

32

Cash Register Sketch (AMS)

Y = average of A copies of X, and Z = median of B copies of Y

[Figure: A copies of X averaged into each Y; the median is taken over B copies of Y]

• Claim: This is a (1 + ε)-approximation to F2, and the space used is O(AB) = O( (√n/ε²)·log(1/δ) ) words of size O(logn + logm), with probability at least 1 − δ

33

Analysis: Cash Register Sketch

• E(X) = F2

• V(X) = E(X²) − (E(X))²

• Using (a² − b²) ≤ 2(a − b)·a, we have V(X) ≤ 2·F1·F3

• Also, V(X) ≤ 2·F1·F3 ≤ 2·√n·F2². Hence,

E(Y_i) = E(X) = F2 and V(Y_i) = V(X)/A ≤ 2·√n·F2²/A

34

Analysis Contd.

• Applying Chebyshev’s inequality:

Pr( |Y_i − F2| ≥ ε·F2 ) ≤ V(Y_i) / (ε²·F2²) ≤ 2·√n / (A·ε²) ≤ 1/8, taking A = 16·√n/ε²

• Hence, by Chernoff bounds, the probability that more than B/2 of the Y_i's deviate that far is at most δ if we take B = O(log(1/δ)) copies of Y_i. Hence, the median gives the correct approximation.

35

Computation of Fk

E(X) = F_k, with r = |{ q : q ≥ p, a_q = a_p }| and X = m·( r^k − (r − 1)^k )

When A = 8·k·n^(1 − 1/k)/ε² and B = 2·log(1/δ), we get a (1 ± ε)-approximation with probability at least 1 − δ

36

Estimate the element frequency

Ask for f(1) = ? f(4) = ?

– AMS-based algorithm

– Count-Min sketch

Data stream: 3, 1, 2, 4, 2, 3, 5, . . .

f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1

37

AMS (sketch) based algorithm

Key Intuition: Use randomized linear projections of f() to define a random variable Z such that, for a given element i,

E( ξ_i·Z ) = ||A[i]|| = f_i

Similarly, we have E( ξ_j·Z ) = f_j

Basic Idea:

– Define a family of 4-wise independent {-1, +1} random variables ξ_i (same as before)

– Pr[ξ_i = +1] = Pr[ξ_i = -1] = 1/2

– Let Z = ⟨f, ξ⟩ = Σ_i f(i)·ξ_i

So E( ξ_i·Z ) = E[ f(i)·ξ_i² ] + E[ Σ_{i'≠i} f(i')·ξ_i·ξ_i' ] = f(i), using E(ξ_i²) = 1 and E(ξ_i·ξ_i') = 0

38

AMS cont.

Keep an array of w × d counters Z[i, j], with a separate sign family ξ_{i,·} for each of the d rows

Use d hash functions h_1, ..., h_d to map each element a to [1..w]

On seeing element a, update every row: Z[i, h_i(a)] += ξ_{i,a}

Est(f_a) = median_i ( ξ_{i,a}·Z[i, h_i(a)] )
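This w × d structure with per-row signs is essentially the Count-Sketch; here is a compact illustrative implementation (Python’s built-in hash stands in for the per-row hash and sign families, deterministic only within a single run):

```python
class CountSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.Z = [[0] * w for _ in range(d)]  # d rows of w counters

    def _h(self, i, x):  # row i's bucket for element x
        return hash((i, "bucket", x)) % self.w

    def _xi(self, i, x):  # row i's {-1,+1} sign for element x
        return 1 if hash((i, "sign", x)) & 1 else -1

    def update(self, x, c=1):
        for i in range(self.d):
            self.Z[i][self._h(i, x)] += self._xi(i, x) * c

    def estimate(self, x):
        vals = sorted(self._xi(i, x) * self.Z[i][self._h(i, x)]
                      for i in range(self.d))
        return vals[self.d // 2]  # median over the d rows

cs = CountSketch(w=16, d=5)
for a in [3, 1, 2, 4, 2, 3, 5]:
    cs.update(a)
print(cs.estimate(1), cs.estimate(4))  # about f(1) = 1 and f(4) = 1
```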

39

The Count Min (CM) Sketch

Simple sketch idea that can be used for point queries (f_i), range queries, quantiles, and join size estimation

Creates a small summary as an array of w × d counters C

Uses d hash functions to map elements to [1..w], with

w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉

40

CM Sketch Structure

Each element x_i is mapped to one counter per row:

C[ k, h_k(x_i) ] = C[ k, h_k(x_i) ] + 1 (−1 if deletion, or +c[j] if the incoming update is <j, c[j]>)

Estimate A[j] by taking min_k C[ k, h_k(j) ]

[Figure: element x_i is hashed by h_1, ..., h_d into one counter (+1) in each of the d rows of the w-column array]
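A compact illustrative Count-Min implementation with the parameters above (Python’s built-in hash stands in for pairwise-independent hash functions, deterministic only within a single run):

```python
import math

class CountMin:
    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)         # columns: ceil(e / eps)
        self.d = math.ceil(math.log(1 / delta))  # rows: ceil(ln(1 / delta))
        self.C = [[0] * self.w for _ in range(self.d)]

    def update(self, x, c=1):
        for k in range(self.d):
            self.C[k][hash((k, x)) % self.w] += c

    def query(self, x):
        # Each counter only overestimates A[x] (counts are non-negative),
        # so the smallest counter over the d rows is the best upper bound.
        return min(self.C[k][hash((k, x)) % self.w] for k in range(self.d))

cm = CountMin(eps=0.1, delta=0.01)
for a in [3, 1, 2, 4, 2, 3, 5]:
    cm.update(a)
print(cm.query(2), cm.query(5))  # at least the true counts 2 and 1
```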

41

CM Sketch Summary

The CM sketch guarantees an approximation error on point queries of less than ε·F1 using space O( (1/ε)·log(1/δ) )

– The probability of exceeding this error bound is less than δ

Hints

– Counts are biased (they can only overestimate)! Can you limit the expected amount of extra “mass” at each bucket? (Use Markov)
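Following the hint, here is a sketch of the standard argument (assuming pairwise-independent hash functions and non-negative counts): fix a row k and a queried element j. The extra mass in its counter is Σ_{j'≠j, h_k(j')=h_k(j)} f(j'), whose expectation is at most F1/w = ε·F1/e. By Markov’s inequality, Pr[ extra mass > ε·F1 ] ≤ 1/e in each row; the d rows use independent hash functions, so Pr[ every row overestimates by more than ε·F1 ] ≤ e^(−d) ≤ δ, which is exactly the guarantee above.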