Statistic Estimation over Data Streams
Slides modified from Minos Garofalakis (Yahoo! Research)
and S. Muthukrishnan (Rutgers University)
3
Data Stream Processing Algorithms
Generally, algorithms compute approximate answers
– Provably difficult to compute answers accurately with limited memory
Approximate answers - Deterministic bounds
– Algorithms only compute an approximate answer, but provide bounds on the error
Approximate answers - Probabilistic bounds
– Algorithms compute an approximate answer with high probability
• With probability at least 1 − δ, the computed answer is within a factor of (1 ± ε) of the actual answer
4
Sampling: Basics
Idea: A small random sample S of the data often well-represents all the data
– For a fast approximate answer, apply “modified” query to S
– Example: select agg from R (n=12)
– If agg is avg, return average of the elements in S
– Number of odd elements?
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
answer: 11.5
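To make the idea concrete, here is a minimal Python sketch (not from the original slides; the reservoir-sampling helper and the scaled-up count are illustrative assumptions) of answering aggregates approximately from a uniform sample:

```python
import random

def uniform_sample(stream, k, seed=0):
    """Reservoir sampling: maintain a uniform random sample of k elements in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
S = uniform_sample(stream, k=4)

# "Modified" queries run on the sample only:
avg_estimate = sum(S) / len(S)                                           # estimates avg
odd_count_estimate = sum(1 for x in S if x % 2) * len(stream) / len(S)   # scaled-up count of odd elements
print(S, avg_estimate, odd_count_estimate)
```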
5
Probabilistic Guarantees
Example: Actual answer is within 11.5 ± 1 with prob 0.9
Randomized algorithms: Answer returned is a specially-built random variable
Use Tail Inequalities to give probabilistic bounds on returned answer
– Markov Inequality
– Chebyshev’s Inequality
– Chernoff/Hoeffding Bound
6
Basic Tools: Tail Inequalities
General bounds on tail probability of a random variable (that is, probability that a random variable deviates far from its expectation)
Basic Inequalities: Let X be a random variable with expectation E[X] and variance Var[X]. Then, for any ε > 0:
[Figure: probability distribution with its tail probability region shaded]
Markov: Pr(X ≥ ε) ≤ E[X] / ε
Chebyshev: Pr(|X − E[X]| ≥ ε) ≤ Var[X] / ε²
7
Tail Inequalities for Sums
Possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials
Chernoff Bound: Let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (Pr[Xi = 0] = 1 − p). Let X = Σi Xi and μ = mp be the expectation of X. Then, for any δ > 0:
Pr(|X − μ| ≥ δμ) ≤ 2·exp(−μδ² / 2)
Application to count queries:
– m is the size of the sample S (4 in the example)
– p is the fraction of odd elements in the stream (2/3 in the example)
No need to compute Var(X), but the independence assumption is required!
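As a quick numeric illustration of the bound (an added sketch; the helper name chernoff_bound is hypothetical), applied to the count-query example above:

```python
import math

def chernoff_bound(m, p, delta):
    """Upper bound 2*exp(-mu*delta^2/2) on Pr(|X - mu| >= delta*mu), where mu = m*p."""
    mu = m * p
    return 2 * math.exp(-mu * delta ** 2 / 2)

# Count-query example from the slide: sample size m = 4, fraction of odd elements p = 2/3
print(chernoff_bound(m=4, p=2 / 3, delta=0.5))   # bound on deviating by more than 50%
```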
8
The Streaming Model
Underlying signal: One-dimensional array A[1…N] with values A[i] all initially zero
–Multi-dimensional arrays as well (e.g., row-major)
Signal is implicitly represented via a stream of updates
–j-th update is <k, c[j]> implying
• A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0)
Goal: Compute functions on A[] subject to
–Small space
–Fast processing of updates
–Fast function computation
–…
9
Streaming Model: Special Cases
Time-Series Model
–Only j-th update updates A[j] (i.e., A[j] := c[j])
Cash-Register Model
– c[j] is always >= 0 (i.e., increment-only)
–Typically, c[j]=1, so we see a multi-set of items in one pass
Turnstile Model
–Most general streaming model
– c[j] can be >=0 or <0 (i.e., increment or decrement)
Problem difficulty varies depending on the model
–E.g., MIN/MAX in Time-Series vs. Turnstile!
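A minimal added sketch (assuming a small domain N and an explicit in-memory array, which a real streaming algorithm would avoid keeping) of how turnstile-style updates <k, c> modify the signal A; the cash-register model is simply the case where every c is non-negative:

```python
N = 8
A = [0] * N                      # the underlying signal A[0..N-1], initially zero

# Turnstile model: each update <k, c> may increment or decrement A[k]
updates = [(3, +2), (1, +1), (3, -1), (5, +4)]
for k, c in updates:
    A[k] += c                    # cash-register model is the special case c >= 0

print(A)                         # [0, 1, 0, 1, 0, 4, 0, 0]
```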
10
Frequency moment computation
Problem
• Data arrives online (a1, a2, a3, ..., am), with each ai ∈ {1, 2, ..., n}
• Let f(i) = |{ j | aj = i }| (the frequency of value i, represented by ||A[i]||)
• The k-th frequency moment is Fk = Σi=1..n f(i)^k
Example
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
F0 = 5 (number of distinct elements), F1 = 7, F2 = 11 (1·1 + 2·2 + 2·2 + 1·1 + 1·1, the “surprise index”)
What is F∞?
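A direct, non-streaming computation of the moments, added here only to pin down the definitions (Python, using an explicit frequency table, which defeats the purpose of streaming but clarifies what Fk means):

```python
from collections import Counter

stream = [3, 1, 2, 4, 2, 3, 5]
f = Counter(stream)                          # f[i] = frequency of value i

def F(k):
    """k-th frequency moment: sum over distinct values i of f(i)^k."""
    return sum(c ** k for c in f.values())

print(F(0), F(1), F(2))                      # 5, 7, 11
print(max(f.values()))                       # F_infinity: count of the most frequent element
```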
11
Frequency moment computation
Easy for F1
How about the others?
- Focus on F2 and F0
- Estimation of Fk
12
Linear-Projection (AMS) Sketch Synopses
Goal: Build a small-space summary for the distribution vector f(i) (i = 1, ..., N) seen as a stream of i-values
Basic Construct: Randomized linear projection of f() = inner/dot product of the f-vector with a random vector ξ:
⟨f, ξ⟩ = Σi f(i)·ξi, where ξ = vector of random values from an appropriate distribution
– Simple to compute over the stream: add ξi whenever the i-th value is seen
– Generate ξi's in small O(log N) space using pseudo-random generators
– Tunable probabilistic guarantees on approximation error
– Delete-proof: just subtract ξi to delete an i-th value occurrence
Example
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
⟨f, ξ⟩ = ξ1 + 2·ξ2 + 2·ξ3 + ξ4 + ξ5
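A minimal sketch of maintaining the projection Σi ξi·f(i) over the stream (the seeded-hash sign generator below is an illustrative stand-in for the 4-wise independent family the slides rely on):

```python
import hashlib

def xi(i, seed=0):
    """Pseudo-random sign in {-1, +1} for value i (illustrative stand-in for a
    4-wise independent family seeded in O(log N) space)."""
    h = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return 1 if h[0] % 2 == 0 else -1

Z = 0
for a in [3, 1, 2, 4, 2, 3, 5]:
    Z += xi(a)        # add xi_i when value i arrives; subtract xi_i to process a deletion

print(Z)              # equals sum_i xi_i * f(i)
```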
13
AMS (sketch) cont.
Key Intuition: Use randomized linear projections of f() to define a random variable X such that
– X is easily computed over the stream (in small space)
– E[X] = F2
– Var[X] is small
Basic Idea:
– Define a family of 4-wise independent {-1, +1} random variables {ξi : i = 1, ..., N}
– Pr[ξi = +1] = Pr[ξi = -1] = 1/2
• Expected value of each ξi: E[ξi] = ? E[ξi²] = ?
– Variables are 4-wise independent
• Expected value of a product of 4 distinct ξi's is 0: E(ξ1·ξ2·ξ3·ξ4) = 0
– Variables ξi can be generated using a pseudo-random generator using only O(log N) space (for seeding)!
Probabilistic error guarantees
(e.g., actual answer is 10 ± 1 with probability 0.9)
14
AMS (sketch) cont.
Example
Data stream R: 4 1 2 4 1 4
f(1) = 2, f(2) = 1, f(3) = 0, f(4) = 3
Z = Σi ξi·f(i) = 2·ξ1 + ξ2 + 3·ξ4, and X = Z²
Suppose the {ξi} take the values:
1) ξ1, ξ2 ∈ {+1} and ξ3, ξ4 ∈ {-1}; then Z = ?
2) ξ2 ∈ {+1} and ξ1, ξ3, ξ4 ∈ {-1}; then Z = ?
15
AMS (sketch) cont.
Expected value of X = F2:
E(X) = E[(Σi ξi·f(i))²] = E[Σi ξi²·f(i)²] + 2·E[Σi<i' ξi·ξi'·f(i)·f(i')] = Σi f(i)² = F2
(using E(ξi²) = 1 and E(ξi·ξi') = 0)
Using 4-wise independence, possible to show that
E(X²) = Σi f(i)⁴ + 6·Σi<i' f(i)²·f(i')²
Var[X] = E(X²) − (E(X))² ≤ 2·F2²
16
Boosting Accuracy
Boost accuracy to ε by averaging over several independent copies of X (reduces variance):
Y = average of s = 16/ε² independent copies of X   [Figure: x x x → Average → y]
E[Y] = E[X] = F2
Var[Y] = Var[X] / s ≤ (2·F2²)·(ε²/16) = ε²·F2² / 8
Chebyshev's Inequality: Pr(|X − E[X]| ≥ ε·E[X]) ≤ Var[X] / (ε²·E[X]²)
By Chebyshev:
Pr(|Y − F2| ≥ ε·F2) ≤ Var[Y] / (ε²·F2²) ≤ 1/8
17
Boosting Confidence
Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
Each Y is a Bernoulli trial, where “FAILURE” means Y falls outside [(1 − ε)·F2, (1 + ε)·F2] (probability ≤ 1/8)
Pr[ |median(Y) − F2| ≥ ε·F2 ]
= Pr[ # failures in 2·log(1/δ) trials ≥ log(1/δ) ]
≤ δ   (by Chernoff bound)
[Figure: 2·log(1/δ) copies y ... y → median; the median lands within (1 ± ε)·F2 with probability ≥ 1 − δ]
18
Summary of AMS Sketching for F2
Step 1: Compute random variables: Z = Σi ξi·f(i)
Step 2: Define X = Z²
Steps 3 & 4: Average 16/ε² independent copies of X; return the median of 2·log(1/δ) such averages
[Figure: groups of 16/ε² copies → Average → y, repeated 2·log(1/δ) times → median]
Main Theorem: Sketching approximates F2 to within a relative error of ε with probability ≥ 1 − δ using space O((1/ε²)·log(1/δ)·log N)
– Remember: O(log N) space for “seeding” the construction of each X
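Putting the steps together, a hedged end-to-end sketch of the F2 estimator (fully independent random signs stored per distinct value are used for clarity; the real construction uses 4-wise independent generators needing only O(log N) space per copy):

```python
import math
import random
import statistics

def ams_f2_estimate(stream, eps=0.5, delta=0.1, seed=0):
    """Median-of-averages AMS estimator for F2 (illustrative)."""
    rng = random.Random(seed)
    groups = max(1, round(2 * math.log(1 / delta)))     # number of averages to take the median of
    copies = max(1, round(16 / eps ** 2))                # copies of X per average
    averages = []
    for _ in range(groups):
        xs = []
        for _ in range(copies):
            signs = {}
            Z = 0
            for a in stream:
                if a not in signs:
                    signs[a] = rng.choice((-1, 1))
                Z += signs[a]
            xs.append(Z * Z)                             # X = Z^2, with E[X] = F2
        averages.append(sum(xs) / len(xs))
    return statistics.median(averages)

print(ams_f2_estimate([3, 1, 2, 4, 2, 3, 5]))            # true F2 is 11
```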
19
Binary-Join COUNT Query
Problem: Compute the answer for the query COUNT(R ⋈A S)
Example:
Exact solution: too expensive, requires O(N) space!
– N = sizeof(domain(A))
Data stream R.A: 4 1 2 4 1 4    fR(1) = 2, fR(2) = 1, fR(3) = 0, fR(4) = 3
Data stream S.A: 3 1 2 4 2 4    fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
COUNT(R ⋈A S) = Σi fR(i)·fS(i) = 2 + 2 + 0 + 6 = 10
20
Basic AMS Sketching Technique [AMS96]
Key Intuition: Use randomized linear projections of f() to define a random variable X such that
– X is easily computed over the stream (in small space)
– E[X] = COUNT(R ⋈A S)
– Var[X] is small
Basic Idea:
– Define a family of 4-wise independent {-1, +1} random variables {ξi : i = 1, ..., N}
21
AMS Sketch Construction
Compute random variables: XR = Σi fR(i)·ξi and XS = Σi fS(i)·ξi
– Simply add ξi to XR (XS) whenever the i-th value is observed in the R.A (S.A) stream
Define X = XR·XS to be the estimate of the COUNT query
Example:
Data stream R.A: 4 1 2 4 1 4    fR(1) = 2, fR(2) = 1, fR(3) = 0, fR(4) = 3
Data stream S.A: 3 1 2 4 2 4    fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
XR = 2·ξ1 + ξ2 + 3·ξ4
XS = ξ1 + 2·ξ2 + ξ3 + 2·ξ4
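A minimal sketch of the construction above (the shared signs for both streams are fully random here, standing in for the 4-wise independent family on the slides):

```python
import random

def make_signs(n, seed=0):
    """Shared {-1, +1} signs xi_i for domain values 1..n (illustrative)."""
    rng = random.Random(seed)
    return {i: rng.choice((-1, 1)) for i in range(1, n + 1)}

xi = make_signs(n=4)

X_R = sum(xi[a] for a in [4, 1, 2, 4, 1, 4])   # add xi_i for each arrival in R.A
X_S = sum(xi[a] for a in [3, 1, 2, 4, 2, 4])   # add xi_i for each arrival in S.A

print(X_R * X_S)   # one copy of the estimate of COUNT(R join S); true value is 10
```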
22
Binary-Join AMS Sketching Analysis
Expected value of X = COUNT(R ⋈A S):
E[X] = E[XR·XS] = E[Σi fR(i)·fS(i)·ξi²] + E[Σi≠i' fR(i)·fS(i')·ξi·ξi'] = Σi fR(i)·fS(i)
(using E(ξi²) = 1 and E(ξi·ξi') = 0)
Using 4-wise independence, possible to show that
Var[X] ≤ 2·SJ(R)·SJ(S)
where SJ(R) = Σi fR(i)² is the self-join size of R (its second/L2 moment), and similarly for SJ(S)
23
Boosting Accuracy
Boost accuracy to ε by averaging over several independent copies of X (reduces variance):
Y = average of s = 8·(2·SJ(R)·SJ(S)) / (ε²·COUNT²) independent copies of X   [Figure: x x x → Average → y]
E[Y] = E[X] = COUNT(R ⋈A S)
Var[Y] = Var[X] / s ≤ ε²·COUNT² / 8
Chebyshev's Inequality: Pr(|X − E[X]| ≥ ε·E[X]) ≤ Var[X] / (ε²·E[X]²)
By Chebyshev:
Pr(|Y − COUNT| ≥ ε·COUNT) ≤ Var[Y] / (ε²·COUNT²) ≤ 1/8
24
Boosting Confidence
Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
Each Y is a Bernoulli trial, where “FAILURE” means Y falls outside [(1 − ε)·COUNT, (1 + ε)·COUNT] (probability ≤ 1/8)
Pr[ |median(Y) − COUNT| ≥ ε·COUNT ]
= Pr[ # failures in 2·log(1/δ) trials ≥ log(1/δ) ]
≤ δ   (by Chernoff bound)
[Figure: 2·log(1/δ) copies y ... y → median]
25
Summary of Binary-Join AMS Sketching
Step 1: Compute random variables: XR = Σi fR(i)·ξi and XS = Σi fS(i)·ξi
Step 2: Define X = XR·XS
Steps 3 & 4: Average 8·(2·SJ(R)·SJ(S)) / (ε²·COUNT²) independent copies of X; return the median of 2·log(1/δ) such averages
[Figure: groups of copies → Average → y, repeated 2·log(1/δ) times → median]
Main Theorem (AGMS99): Sketching approximates COUNT to within a relative error of ε with probability ≥ 1 − δ using space O( SJ(R)·SJ(S)·log(1/δ)·log N / (ε²·COUNT²) )
– Remember: O(log N) space for “seeding” the construction of each X
26
Distinct Value Estimation ( F0 )
Problem: Find the number of distinct values in a stream of values with domain [0,...,N-1]
– Zeroth frequency moment
– Statistics: number of species or classes in a population
– Important for query optimizers
– Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
Example (N=64)
Hard problem for random sampling!
– Must sample almost the entire table to guarantee the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!
Data stream: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: F0 = 5
27
Hash (aka FM) Sketches for Distinct Value Estimation [FM85]
Assume a hash function h(x) that maps incoming values x in [0, ..., N-1] uniformly across [0, ..., 2^L − 1], where L = O(log N)
Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
– A value x is mapped to lsb(h(x))
Maintain Hash Sketch = BITMAP array of L bits, initialized to 0
– For each incoming value x, set BITMAP[ lsb(h(x)) ] = 1
Prob[ lsb(h(x)) = i ] = ?
Example: x = 5, h(x) = 101100, lsb(h(x)) = 2
BITMAP (positions 5 4 3 2 1 0): 0 0 0 1 0 0
28
Hash (FM) Sketches for Distinct Value Estimation [FM85]
By uniformity through h(x): Prob[ BITMAP[k] = 1 ] = 1 / 2^(k+1)
– Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], . . .
[Figure: BITMAP positions L−1 ... 0, with 1s at positions << log(d), 0s at positions >> log(d), and a fringe of mixed 0/1s around log(d)]
Let R = position of the rightmost zero in BITMAP
– Use R as an indicator of log(d)
[FM85] prove that E[R] = log(φd), where φ ≈ 0.7735
– Estimate d = 2^R / φ
– Average several iid instances (different hash functions) to reduce estimator variance
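A small illustrative FM sketch in Python (a single BITMAP driven by one simple linear hash, whose parameters a, b are assumptions; in practice several instances with different hash functions are averaged, as the slide notes):

```python
import random

def lsb(y):
    """Position of the least-significant 1 bit of y (y > 0)."""
    return (y & -y).bit_length() - 1

def fm_estimate(stream, L=32, seed=0):
    """Single FM BITMAP sketch (illustrative)."""
    rng = random.Random(seed)
    a, b = rng.getrandbits(L) | 1, rng.getrandbits(L)
    bitmap = [0] * L
    for x in stream:
        h = (a * x + b) % (1 << L)
        bitmap[lsb(h) if h else L - 1] = 1
    R = bitmap.index(0)                    # position of the lowest zero bit
    return (2 ** R) / 0.7735               # d is estimated as 2^R / phi

stream = [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]
print(fm_estimate(stream))                 # true number of distinct values is 5
```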
29
Accuracy of FM
Using m = O((1/ε²)·log(1/δ)) independent BITMAP instances (BITMAP 1, ..., BITMAP m, each with its own hash function) and combining their estimates:
[Figure: BITMAP 1 through BITMAP m]
(1 ± ε)-approximation with probability at least 1 − δ
30
Hash (FM) Sketches for Distinct Value Estimation
[FM85] assume “ideal” hash functions h(x) (N-wise independence)
– In practice:
• h(x) = (a·x + b) mod N, where a, b are random binary vectors in [0, ..., 2^L − 1]
Composable: Component-wise OR/add distributed sketches together
– Estimate |S1 ∪ S2 ∪ … ∪ Sk| = set-union cardinality
31
Cash Register Sketch (AMS)
• Choose a random position p from 1..m of the stream (stream sampling) and let r = |{ q : q ≥ p, aq = ap }|
• Estimator: X = m·(r^k − (r − 1)^k)
• Using F2 (k = 2) as an example:
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
If we choose the first element a1: r = 2 and X = 7·(2·2 − 1·1) = 21
And for a2: r = ?, X = ?   For a5: r = ?, X = ?
• A more general algorithm for Fk
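A minimal sketch of one copy of this estimator (averaging over many seeds below stands in for the A copies discussed on the next slide):

```python
import random

def cash_register_estimate(stream, k=2, seed=0):
    """One copy of the AMS estimator X = m * (r^k - (r-1)^k)."""
    rng = random.Random(seed)
    m = len(stream)
    p = rng.randrange(m)                                   # random stream position
    r = sum(1 for q in range(p, m) if stream[q] == stream[p])
    return m * (r ** k - (r - 1) ** k)

stream = [3, 1, 2, 4, 2, 3, 5]
# E[X] = F_k; average many copies to reduce variance (true F2 here is 11)
print(sum(cash_register_estimate(stream, 2, s) for s in range(1000)) / 1000)
```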
32
Cash Register Sketch (AMS)
Y = average of A copies of X, and Z = median of B copies of Y's
[Figure: A copies of X → Average → y, repeated B times → median]
• Claim: This is a (1 + ε)-approximation to F2 with probability at least 1 − δ, and the space used is
O(AB) = O( (√n / ε²)·log(1/δ) ) words of size O(log n + log m)
33
Analysis: Cash Register Sketch
• E(X) = F2
• V(X) = E(X²) − (E(X))².
• Using (a² − b²) ≤ 2·(a − b)·a, we have V(X) ≤ 2·F1·F3 ≤ 2·√n·F2². Hence,
E(Yi) = E(X) = F2
V(Yi) = V(X) / A ≤ 2·√n·F2² / A
34
Analysis Contd.
• Applying Chebyshev's inequality:
Pr( |Yi − F2| ≥ ε·F2 ) ≤ V(Yi) / (ε²·F2²) ≤ 2·√n·F2² / (A·ε²·F2²) ≤ 1/8
• Hence, by Chernoff bounds, the probability that more than B/2 of the Yi's deviate that far is at most δ, if we take B = O(log(1/δ)) Yi's. Hence, the median gives the correct approximation.
35
Computation of Fk
E(X) = Fk, with r = |{ q : q ≥ p, aq = ap }| and X = m·(r^k − (r − 1)^k) as before
When A = 8·k·n^(1 − 1/k) / ε²
and B = 2·log(1/δ),
get an ε-approximation with probability at least 1 − δ
36
Estimate the element frequency
Ask for f(1) = ? f(4) = ?
- AMS-based algorithm
- Count-Min sketch
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
37
AMS (sketch) based algorithm
Key Intuition: Use randomized linear projections of f() to define a random variable Z such that, for a given element i,
E(ξi·Z) = ||A[i]|| = f(i)
Similarly, we have E(ξj·Z) = f(j)
Basic Idea:
– Define a family of 4-wise independent {-1, +1} random variables (same as before)
– Pr[ξi = +1] = Pr[ξi = -1] = 1/2
Let Z = ⟨f, ξ⟩ = Σi f(i)·ξi
So E(ξi·Z) = E[ξi²·f(i)] + E[Σi'≠i ξi·ξi'·f(i')] = f(i)   (using E(ξi²) = 1 and E(ξi·ξi') = 0)
38
AMS cont.
Keep an array of w ╳ d counters Z[i, j]
Use d hash functions h1, ..., hd to map an element a to [1..w]
[Figure: d rows of w counters; element a is hashed by h1(a), ..., hd(a) to one counter per row]
Update: Z[i, hi(a)] += ξi,a   (ξi,a is the {-1, +1} random variable for row i and element a)
Estimate: Est(fa) = median_i ( ξi,a · Z[i, hi(a)] )
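A hedged sketch of the counter-array estimator described above (the multiplicative hash and sign functions are illustrative assumptions, not the exact 4-wise independent families from the slides):

```python
import random
import statistics

class AMSPointSketch:
    """d x w array of counters with a per-row hash and sign function (illustrative)."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.Z = [[0] * w for _ in range(d)]
        # Hypothetical per-row hash/sign parameters (simple multiplicative hashing).
        self.h = [(rng.randrange(1, 2 ** 31, 2), rng.randrange(2 ** 31)) for _ in range(d)]
        self.s = [(rng.randrange(1, 2 ** 31, 2), rng.randrange(2 ** 31)) for _ in range(d)]

    def _bucket(self, i, a):
        A, B = self.h[i]
        return (A * a + B) % (2 ** 31) % self.w

    def _sign(self, i, a):
        A, B = self.s[i]
        return 1 if ((A * a + B) % (2 ** 31)) % 2 == 0 else -1

    def update(self, a, c=1):
        for i in range(self.d):
            self.Z[i][self._bucket(i, a)] += self._sign(i, a) * c

    def estimate(self, a):
        return statistics.median(self._sign(i, a) * self.Z[i][self._bucket(i, a)]
                                 for i in range(self.d))

sk = AMSPointSketch(w=16, d=5)
for x in [3, 1, 2, 4, 2, 3, 5]:
    sk.update(x)
print(sk.estimate(1), sk.estimate(4))   # true f(1) = 1, f(4) = 1
```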
39
The Count Min (CM) Sketch
Simple sketch idea, can be used for point queries (fi), range queries, quantiles, join size estimation
Creates a small summary as an array of w ╳ d counters C
Use d hash functions to map elements to [1..w]
w = ⌈e/ε⌉
d = ⌈log(1/δ)⌉
40
CM Sketch Structure
Each element xi is mapped to one counter per row:
C[k, hk(xi)] = C[k, hk(xi)] + 1   (−1 if deletion)
or + c[j] if the incoming update is <j, c[j]>
Estimate A[j] by taking min_k C[k, hk(j)]
[Figure: element xi is hashed by h1(xi), ..., hd(xi) into a d ╳ w array of counters; +1 is added to one counter in each of the d rows]
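A minimal Count-Min sketch implementation following the structure above (the per-row linear hash functions are illustrative; real implementations use pairwise-independent hashes):

```python
import math
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters; point-query estimates are biased upward."""
    def __init__(self, eps=0.01, delta=0.01, seed=0):
        rng = random.Random(seed)
        self.w = math.ceil(math.e / eps)                 # w = ceil(e / eps)
        self.d = math.ceil(math.log(1 / delta))          # d = ceil(log(1 / delta))
        self.C = [[0] * self.w for _ in range(self.d)]
        self.p = (1 << 61) - 1                           # large prime for the row hashes
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(self.d)]

    def _hash(self, k, x):
        a, b = self.ab[k]
        return (a * x + b) % self.p % self.w

    def update(self, x, c=1):
        for k in range(self.d):
            self.C[k][self._hash(k, x)] += c             # +c per row; c may be negative for deletions

    def estimate(self, x):
        return min(self.C[k][self._hash(k, x)] for k in range(self.d))

cm = CountMinSketch(eps=0.1, delta=0.05)
for x in [3, 1, 2, 4, 2, 3, 5]:
    cm.update(x)
print(cm.estimate(1), cm.estimate(4))                    # true f(1) = 1, f(4) = 1
```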