Maintaining Stream Statistics Over Sliding Windows

Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani. Presentation by Adam Morrison.

Upload: slone

Post on 25-Feb-2016


TRANSCRIPT

Page 1: Maintaining Stream Statistics Over Sliding Windows

Maintaining Stream Statistics Over Sliding Windows
Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presentation by Adam Morrison.

Page 2: Maintaining Stream Statistics Over Sliding Windows

Sliding Window Intro
Infinite stream.
Only last N elements relevant.
– Packet streams. N is huge.
– Stronger model…

Page 3: Maintaining Stream Statistics Over Sliding Windows

Model
Count memory bits.
Online algorithm.
[Figure: a stream with Arrival labels 1 2 3 4 5 6 7 and Timestamp labels 3 2 1, the latter repeated across animation frames.]

Page 4: Maintaining Stream Statistics Over Sliding Windows

Plan
Basic Counting
– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.
Sum
– Given an integer stream, maintain the sum of the last N elements.
Everything else

Page 5: Maintaining Stream Statistics Over Sliding Windows

Basic Counting
Exact Solution? (Counter?)
[Figure: counter example with values 2 1 1 1 0 2.]
Exact solution requires $\Theta(N)$ bits.
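To make the memory cost of the exact approach concrete, here is a minimal sketch that keeps the whole window. The class name `ExactWindowCounter` and its methods are illustrative assumptions, not something from the slides.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1s among the last n bits (hypothetical helper).

    The whole window is stored, which is where the linear memory goes.
    """

    def __init__(self, n):
        self.window = deque(maxlen=n)  # holds up to n bits
        self.ones = 0

    def add(self, bit):
        # If the window is full, the oldest bit is about to be dropped.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones
```

A plain running counter is not enough on its own: when a bit leaves the window we must know whether it was a 1, which is why the bits themselves are kept.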

Page 6: Maintaining Stream Statistics Over Sliding Windows

Approximate Basic Counting
Solution: Approximate the answer and bound the relative error
$\text{relative error} = \frac{|\text{answer} - \text{approx answer}|}{\text{answer}} = \frac{\text{abs error}}{\text{answer}}$
Example: answer 100, approximate answer 95 or 105: relative error = 0.05
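The relative error used on this slide can be written as a one-line function. This is only a sketch; the function name `relative_error` is an assumption made for illustration.

```python
def relative_error(answer, estimate):
    """Absolute error divided by the true answer (answer must be nonzero)."""
    return abs(answer - estimate) / answer


# The slide's example: a true answer of 100 estimated as 95 or as 105.
low = relative_error(100, 95)
high = relative_error(100, 105)
```

Both estimates are off by 5 out of 100, so both calls give the same relative error.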

Page 7: Maintaining Stream Statistics Over Sliding Windows

The idea
Dynamic histogram of active 1s.
New 1s go into right most bucket.
For each bucket keep the timestamp of the most recent 1 and the bucket's size.
When timestamp expires, free bucket.
Bucket sizes?
Policy for creating new buckets?
What is it good for?

Page 8: Maintaining Stream Statistics Over Sliding Windows

Example (N=4)
[Figure: example histogram with a Timestamp row and a Size row; the individual values are not recoverable.]

Page 9: Maintaining Stream Statistics Over Sliding Windows

(Timestamps are easy)
Cyclic counter mod N. N=15
Arrival positions: 9 10 11 12 13 14; current position: 14; timestamps: 5 4 3 2 1 0
Arrival positions: 9 10 11 12 13 14 0; current position: 0; timestamps: 6 5 4 3 2 1 0
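The cyclic-counter trick can be sketched as follows, assuming each element stores its arrival position mod N and that its timestamp (age) is recovered by modular subtraction. The function name `age` is hypothetical.

```python
def age(now, stamp, n):
    """Age of an element stamped `stamp`, with positions counted mod n."""
    return (now - stamp) % n


N = 15
# Current position 14: elements stamped 9..14 have ages 5..0.
before_wrap = [age(14, s, N) for s in [9, 10, 11, 12, 13, 14]]
# The counter wraps to 0: the same elements are one step older.
after_wrap = [age(0, s, N) for s in [9, 10, 11, 12, 13, 14, 0]]
```

Because only ages below N matter inside a window of N elements, a counter of about log N bits is enough even though the stream is unbounded.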

Page 10: Maintaining Stream Statistics Over Sliding Windows

What does the histogram buy us?
Active bucket: Contains an active 1.
Only the last bucket might contain expired 1s.

Page 11: Maintaining Stream Statistics Over Sliding Windows

Estimating number of 1s
Conclusion:
T – sum of all bucket sizes but last.
– So there are at least T 1s.
C – size of last bucket.
– Actual # of 1s can be anything from 1 to C.
Estimate: $T + \frac{C+1}{2}$
Absolute error $\le \frac{C-1}{2}$
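A small sketch of the estimate, assuming the reading T + (C + 1) / 2 (the midpoint of the possible range T + 1 to T + C) and a list of bucket sizes ordered newest first. The name `estimate_ones` is illustrative only.

```python
def estimate_ones(bucket_sizes):
    """Estimate the number of active 1s from bucket sizes, newest first.

    All buckets but the last are fully active (sum T); the last bucket of
    size C holds between 1 and C active 1s, so we take the midpoint.
    """
    if not bucket_sizes:
        return 0.0
    *recent, last = bucket_sizes
    t = sum(recent)
    return t + (last + 1) / 2


# Buckets of sizes 1, 1, 2 and a last bucket of size 4: T = 4, C = 4.
value = estimate_ones([1, 1, 2, 4])
```

Here the true count lies between 5 and 8, so the midpoint is off by at most (C - 1) / 2 = 1.5.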

Page 12: Maintaining Stream Statistics Over Sliding Windows

Absolute → Relative
Bucket sizes: $C_1, \dots, C_m$
True count $\ge 1 + \sum_{i=1}^{m-1} C_i$ $(= 1 + T)$
relerr $\le \frac{(C_m - 1)/2}{\text{true count}} \le \frac{(C_m - 1)/2}{1 + \sum_{i=1}^{m-1} C_i}$

Page 13: Maintaining Stream Statistics Over Sliding Windows

Bounding the error
Goal: Relative error at most $\epsilon = 1/k$.
If at all times we'd have that for all j,
$\frac{(C_j - 1)/2}{1 + \sum_{i=1}^{j-1} C_i} \le \frac{1}{k}$

Page 14: Maintaining Stream Statistics Over Sliding Windows

How can we do that? (With as few buckets as possible?)
Non-decreasing bucket sizes.
Bucket sizes constrained to $1, 2, 4, \dots, 2^{m'}$, with $m' \le m$ and $m' \le \log\frac{2N}{k} + 1$.
At most $\frac{k}{2} + 1$ buckets of each size.
For all sizes but that of last bucket, at least $\frac{k}{2}$ buckets of each size.
Exponential Histogram

Page 15: Maintaining Stream Statistics Over Sliding Windows

[Figure: animated example of the histogram evolving as 1s arrive, showing bucket timestamps and sizes at each step.]
New 1 – create bucket
Too many buckets – merge
Check if invariant violated.
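The create, check and merge steps can be put together in a short sketch. It assumes an even k, at most k/2 + 1 buckets per size, merging the two oldest buckets of a size once that limit is exceeded, and the estimate T + (C + 1) / 2. The class and method names are illustrative, not taken from the slides or the paper.

```python
class ExponentialHistogram:
    """Approximate count of 1s in the last `window` bits (sketch)."""

    def __init__(self, window, k):
        self.window = window
        self.max_per_size = k // 2 + 1   # at most k/2 + 1 buckets per size
        self.buckets = []                # (timestamp, size), newest first
        self.total = 0                   # sum of all bucket sizes
        self.time = 0

    def add(self, bit):
        self.time += 1
        # When the oldest bucket's timestamp expires, free the bucket.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.total -= self.buckets.pop()[1]
        if bit:
            self.buckets.insert(0, (self.time, 1))  # new 1: create bucket
            self.total += 1
            self._merge()

    def _merge(self):
        start, size = 0, 1
        while True:
            end = start
            while end < len(self.buckets) and self.buckets[end][1] == size:
                end += 1
            if end - start <= self.max_per_size:
                return                   # invariant holds for this size
            # Too many buckets: merge the two oldest of this size and keep
            # the timestamp of the more recent one. This may cascade.
            stamp = self.buckets[end - 2][0]
            del self.buckets[end - 1]
            self.buckets[end - 2] = (stamp, 2 * size)
            start, size = end - 2, 2 * size

    def estimate(self):
        if not self.buckets:
            return 0.0
        last = self.buckets[-1][1]
        return self.total - (last - 1) / 2  # same as T + (C + 1) / 2
```

A Python list is used for readability; inserting at the front is linear in the number of buckets, so a real implementation would likely prefer a linked structure per bucket size.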

Page 16: Maintaining Stream Statistics Over Sliding Windows

Why it works (correctness)
If $C_j = 2^r$, there are at least $\frac{k}{2}$ buckets of sizes $1, 2, \dots, 2^{r-1}$.
$\sum_{i=1}^{j-1} C_i \ge \frac{k}{2}(1 + 2 + \dots + 2^{r-1}) = \frac{k}{2}(2^r - 1) = \frac{k}{2}(C_j - 1)$
$\frac{(C_j - 1)/2}{1 + \sum_{i=1}^{j-1} C_i} \le \frac{1}{k}$
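The correctness argument can be checked numerically. The sketch below assumes the worst case named in the argument: bucket j has size 2^r and exactly k/2 buckets exist for each smaller size. The function name is hypothetical.

```python
def worst_relative_error(r, k):
    """Bound ((C_j - 1) / 2) / (1 + sum of newer buckets) for C_j = 2**r."""
    c_j = 2 ** r
    # k/2 buckets of each size 1, 2, ..., 2**(r-1) sum to (k/2) * (2**r - 1).
    newer = (k // 2) * (2 ** r - 1)
    return ((c_j - 1) / 2) / (1 + newer)


# The bound should never exceed 1/k, whatever the bucket size.
holds = all(
    worst_relative_error(r, k) <= 1 / k
    for k in (2, 4, 10, 100)
    for r in range(0, 30)
)
```

Having more than k/2 buckets of some smaller size only enlarges the denominator, so this case is the one that matters.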

Page 17: Maintaining Stream Statistics Over Sliding Windows

Why it works (space)
$\sum_{i=1}^{m} C_i \ge \frac{k}{2}\sum_{r=0}^{m'} 2^r \ge \frac{k}{2}\cdot\frac{2N}{k} = N$
Can account for all 1s with just $m \le \left(\frac{k}{2} + 1\right)\left(\log\left(\frac{2N}{k} + 1\right) + 1\right)$ buckets.

Page 18: Maintaining Stream Statistics Over Sliding Windows

Space usage
# of buckets: $O(k \log\frac{2N}{k})$
Bucket size: $\log N + \log\log\frac{2N}{k}$
Total space: $O(k \log^2 N)$
T counter for estimation: $O(\log N)$

Page 19: Maintaining Stream Statistics Over Sliding Windows

Operations
Estimation: O(1)
Insertion: Cascading makes it $O(\log\frac{2N}{k})$ worst case.
But only O(1) amortized!
Bucket of size B accounts for all operations related to it: B inserts, B−1 merges (& maybe delete).
Sum of all buckets in lifetime (including deleted) is ≤ all past insertions.

Page 20: Maintaining Stream Statistics Over Sliding Windows

Plan
Basic Counting
– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.
– Solved using $O(k \log^2 N)$ space, $O(1)$ query time, $O(1)$ amortized insert time [$O(\log N)$ worst case].
Sum
– Given an integer stream, maintain the sum of the last N elements.
Everything else

Page 21: Maintaining Stream Statistics Over Sliding Windows

Extending to Sum
Integers in range [0, R].
On value V, insert V 1s.
Timestamps: still $\log N$.
Bucket counter: $\log(\log N + \log R)$
# of buckets: $O(k \log\frac{2NR}{k})$
Total space: $O(k(\log N + \log R)\log N)$
Insertion takes $\Omega(R)$!

Page 22: Maintaining Stream Statistics Over Sliding Windows

Reducing insertion time
If we had a way to rebuild the entire histogram…
We could buffer new values…
And rebuild histogram when buffer reaches size B.
If it takes $O(B + k(\log N + \log R))$, amortized is $O\left(1 + \frac{k(\log N + \log R)}{B}\right)$.
Picking $B = \Theta(k(\log N + \log R))$ gives $O(\log R / \log N)$ amortized time.

Page 23: Maintaining Stream Statistics Over Sliding Windows

k/2 canonical representation
The k/2 canonical representation of S:
$S = \sum_{i=0}^{j} k_i 2^i$, with $\frac{k}{2} \le k_i \le \frac{k}{2} + 1$ for $i < j$, and $0 \le k_j \le \frac{k}{2} + 1$
Is this representation unique?
If S is the total size of the buckets, computing its k/2 canonical representation would help us rebuild the histogram.
Would it really?

Page 24: Maintaining Stream Statistics Over Sliding Windows

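$S = \sum_{i=0}^{j} k_i 2^i$
$\sum_{i=0}^{j-1} \frac{k}{2} 2^i = \frac{k}{2}(2^j - 1)$
Find the largest j for which $\frac{k}{2}(2^j - 1) \le S$, i.e. $2^j \le \frac{2S}{k} + 1$.
$S' = S - \frac{k}{2}(2^j - 1)$
If $S' \ge 2^j$ find m: $m 2^j \le S' < (m+1) 2^j$
$k_j = m$
$S'' = S' - k_j 2^j$
Let $b_0, \dots, b_{j-1}$ be the binary representation of $S''$.
$k_i = \frac{k}{2} + b_i$
Example: j=2, S'=5, binary representation 01.
Total time required is O(log S).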
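A sketch of the procedure, under the following reading of the slide: find the largest j with (k/2)(2^j - 1) <= S, subtract that amount, put the quotient of the remainder by 2^j in k_j, and spread the bits of what is left over the lower positions. The function name, the use of `divmod`, and the sample values k = 4, S = 11 are assumptions chosen for illustration.

```python
def canonical_representation(S, k):
    """Return [k_0, ..., k_j] with S == sum(k_i * 2**i), for even k.

    k_i is k/2 or k/2 + 1 for i < j; k_j is whatever multiple of 2**j is
    left over (it can be 0 in this sketch).
    """
    half = k // 2
    if S <= 0:
        return []
    # Largest j such that half * (2**j - 1) <= S.
    j = 0
    while half * (2 ** (j + 1) - 1) <= S:
        j += 1
    rest = S - half * (2 ** j - 1)          # S'
    m, low = divmod(rest, 2 ** j)           # k_j = m, S'' = low < 2**j
    ks = [half + ((low >> i) & 1) for i in range(j)]  # k_i = k/2 + b_i
    ks.append(m)
    return ks


# With k = 4 and S = 11: j = 2, S' = 5, S'' = 1 (binary 01).
example = canonical_representation(11, 4)
```

The loop and the bit extraction each touch about log S positions, which matches the O(log S) total time stated on the slide.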

Page 25: Maintaining Stream Statistics Over Sliding Windows

[Figure: bucket timestamps at three successive instants: 8 6 4 3 2 1; 9 7 5 4 3 2; 10 8 6 5 4 3. The labels for S1 and S2 are not recoverable.]
Calculate S1+S2 representation:
[Figure: rebuilt histogram with labels 10 6 2 1 1 1 1.]
If a value gets "unindexed", it will never be indexed in the future.

Page 26: Maintaining Stream Statistics Over Sliding Windows

Plan
Basic Counting
– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.
Sum
– Given an integer stream, maintain the sum of the last N elements.
– Solved using $O(k(\log N + \log R)\log N)$ space, $O(1)$ query time, $O(\log R / \log N)$ amortized insert time [$O(\log N + \log R)$ worst case].
Everything else
• Lower Bounds
• More about timestamps.
• Applications.
• More problems

Page 27: Maintaining Stream Statistics Over Sliding Windows

Lower bounds
Basic Counting and Sum algorithms are optimal.
Similar techniques will show that lots of other problems are intractable. (Later.)

Page 28: Maintaining Stream Statistics Over Sliding Windows

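Basic Counting bound
[Figure: a window of size N divided into blocks of sizes $B, 2B, 4B, \dots, 2^i B, \dots$, with $k/4$ marked in each block.]
$L = \binom{B}{k/4}^{\log\frac{N}{B}}$
With $B = \sqrt{Nk}$: $\log L \ge \frac{k}{16}\log^2\frac{N}{k}$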

Page 29: Maintaining Stream Statistics Over Sliding Windows

[Figure: big block d and the left most such subblock; labeled quantities: $\frac{k}{4}(2^d - 1)$, $c\,2^d$, $(c-1)2^d$, with $1 \le c \le \frac{k}{4}$.]
abs err $\ge 2^{d-1}$
rel err $\ge \frac{2^{d-1}}{\frac{k}{4}2^d + \frac{k}{4}(2^d - 1)} \ge \frac{1}{k}$
Same idea works for Sum.

Page 30: Maintaining Stream Statistics Over Sliding Windows

Randomized bound
Yao minimax principle:
Expected space complexity of optimal algorithm for an input distribution is a lower bound on expected space complexity of randomized algorithm.
Lower bound applies to randomized algorithms.

Page 31: Maintaining Stream Statistics Over Sliding Windows

Timestamps
Define window based on real time – equate timestamp with clock.
No work needs to be done when items don't arrive, so deletions can be deferred.
If far fewer than N items can arrive during the window, memory usage is reduced.

Page 32: Maintaining Stream Statistics Over Sliding Windows

Applications
Adapting algorithms to the sliding window model using EH to replace counters.
Counters require $\log N$ bits, EH takes $k \log^2 N$.
Also factor $(1 + \epsilon)$ loss in accuracy.

Page 33: Maintaining Stream Statistics Over Sliding Windows

More Problems
Min/Max
– Storing subsequence of (say) mins is optimal.
Distinct values
– Basic Counting reduces to it.
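One common way to store a subsequence of minimum candidates is a monotonic deque. The sketch below is offered as an illustration of that general technique; the slides do not say which structure the authors have in mind, and the class name `WindowMin` is an assumption.

```python
from collections import deque


class WindowMin:
    """Minimum of the last n values, keeping only candidate minima."""

    def __init__(self, n):
        self.n = n
        self.time = 0
        self.candidates = deque()  # (timestamp, value), values increasing

    def add(self, value):
        self.time += 1
        # An older value that is not smaller can never be the minimum again.
        while self.candidates and self.candidates[-1][1] >= value:
            self.candidates.pop()
        self.candidates.append((self.time, value))
        # Drop the front candidate once it leaves the window.
        if self.candidates[0][0] <= self.time - self.n:
            self.candidates.popleft()

    def minimum(self):
        return self.candidates[0][1]


w = WindowMin(3)
mins = []
for v in [5, 3, 4, 6, 7, 2]:
    w.add(v)
    mins.append(w.minimum())
```

On an increasing stream every value stays a candidate, so the stored subsequence can grow as large as the window itself.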

Page 34: Maintaining Stream Statistics Over Sliding Windows

Other Problems
Distinct values with deletions.
– Factor 2 estimation requires $\Omega(N)$ space.
– Map 1s in a bit string to distinct values. Pad with zeros to infer value of last bit, then use deletion to cancel that bit.
– Repeat.

Page 35: Maintaining Stream Statistics Over Sliding Windows

Other Problems
Sum with negative integers.
– Factor 2 estimation requires $\Omega(N)$ space.
– Maps 1s in bit string to (-1,1) and 0s to (1,-1).
– Pad with 0s and query at odd time instants.

Page 36: Maintaining Stream Statistics Over Sliding Windows