Space-Efficient Online Computation of Quantile Summaries

SIGMOD 01

Michael Greenwald & Sanjeev Khanna

Presented by ellery

Outline

• Introduction
• The summary data structure
• Operation and algorithm
• Tree representation
• Analysis and experimental results
• Conclusion

Introduction

• Space-efficient computation of quantile summaries of very large data sets in a single pass.

• Quantile queries: given a quantile ψ, return the value whose rank is ψN

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

sorting

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12

N = 16

0.5 quantile returns the element ranked 8 (0.5 × 16), which is 8

0.75 quantile returns the element ranked 12 (0.75 × 16), which is 10
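The worked example above can be reproduced with a minimal sketch that sorts the whole stream. This is the exact baseline that a quantile summary approximates, not the summary algorithm itself. The function name `exact_quantile` and the rounding of ψN up to an integer rank are assumptions made for illustration.

```python
import math

def exact_quantile(values, psi):
    """Return the element whose 1-based rank in sorted order is ceil(psi * N)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(psi * len(ordered)))  # clamp so tiny psi still maps to rank 1
    return ordered[rank - 1]

# The 16 observations t0..t15 from the slide.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(exact_quantile(stream, 0.5))
print(exact_quantile(stream, 0.75))
```

Sorting needs all N observations in memory, which is exactly the cost the requirements on the next slide rule out.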

Requirements

• Explicit & tunable a priori guarantees on the precision of the approximation
• As small a memory footprint as possible
• Online: single pass over the data
• Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations.
• Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order).

ε-approximate

• A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r’ is guaranteed to be within the interval [r - εN, r + εN]

Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v.

Here r = 50 and εN = 10, so the true rank of v is within [40, 60]
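As a small illustration of the definition, the sketch below computes the allowed rank window and checks whether a candidate answer's true rank falls inside it. The helper names `rank_window` and `is_eps_approximate` are hypothetical.

```python
def rank_window(r, eps, n):
    """Interval [r - eps*n, r + eps*n] of acceptable ranks for a query at rank r."""
    return (r - eps * n, r + eps * n)

def is_eps_approximate(true_rank, r, eps, n):
    """True if an answer with the given true rank is acceptable for target rank r."""
    lo, hi = rank_window(r, eps, n)
    return lo <= true_rank <= hi

# Slide example: 100 elements, 0.5-quantile (r = 50), eps = 0.1.
print(rank_window(50, 0.1, 100))
print(is_eps_approximate(45, 50, 0.1, 100))
```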

The Summary Data Structure

• Let rmin(v) and rmax(v) denote the lower and upper bounds on the rank of v

• Each tuple ti = (vi , gi ,Δi)

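• $g_i = r_{\min}(v_i) - r_{\min}(v_{i-1})$
• $\Delta_i = r_{\max}(v_i) - r_{\min}(v_i)$
• $r_{\min}(v_i) = \sum_{j \le i} g_j$
• $r_{\max}(v_i) = \sum_{j \le i} g_j + \Delta_i$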

Example

ε = .01, N = 1750

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
201   [529, 536]     {28, 7}
204   [539, 540]     {10, 1}
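The rank intervals in this example follow from the tuple definitions: rmin is a running sum of g, and rmax adds Δ on top. A minimal sketch, assuming tuples are stored as (v, g, Δ) in increasing order of v; the parameter `rmin_before` is a hypothetical stand-in for the sum of g over the tuples to the left of 192, which the slide does not show (486 is implied by rmin(192) = 501 and g = 15).

```python
def rank_bounds(tuples, rmin_before=0):
    """Return (v, rmin, rmax) for each (v, g, delta) tuple, in order."""
    bounds = []
    rmin = rmin_before
    for v, g, delta in tuples:
        rmin += g                      # rmin(v_i) = sum of g_j for j <= i
        bounds.append((v, rmin, rmin + delta))  # rmax(v_i) = rmin(v_i) + delta_i
    return bounds

excerpt = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
for v, lo, hi in rank_bounds(excerpt, rmin_before=486):
    print(v, [lo, hi])
```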

Query

• Sketch S is ε-approximate; that is, for each ψ ∈ (0,1], there is a (vi, rmin(vi), rmax(vi)) in S such that

$\psi N - \varepsilon N \le r_{\min}(v_i) \le r_{\max}(v_i) \le \psi N + \varepsilon N$

• vi is our answer for the ψ-quantile

Corollary

• If at any time n, the summary S(n) satisfies the property that

$\max_i (g_i + \Delta_i) \le 2\varepsilon n$

then we can answer any ψ-quantile query to within an εn precision.
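The corollary's condition is cheap to test on a stored summary. A minimal sketch, assuming tuples are stored as (v, g, Δ); the function name `satisfies_corollary` is hypothetical.

```python
def satisfies_corollary(tuples, eps, n):
    """True if max_i (g_i + delta_i) <= 2 * eps * n for the given summary."""
    widest = max((g + delta for _, g, delta in tuples), default=0)
    return widest <= 2 * eps * n

excerpt = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(satisfies_corollary(excerpt, 0.01, 1800))  # widest tuple is 28 + 7 = 35
```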

Overview of Summary Data Structure

• Quantile ψ = .29? Compute r and choose best vi

ε = .01, N = 1800

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
201   [529, 536]     {28, 7}
204   [539, 540]     {10, 1}

ψ = .29

r = ψN = 522
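The lookup on this slide can be sketched as a single left-to-right scan that returns the first tuple whose rank interval lies within εN of r. This is an illustrative sketch, not the authors' code: the function name `query`, the rounding of ψN up to an integer, and the `rmin_before` offset (the unseen sum of g to the left of 192, taken as 486) are assumptions.

```python
import math

def query(tuples, psi, eps, n, rmin_before=0):
    """Return the first v whose [rmin, rmax] lies within eps*n of r, else None."""
    r = math.ceil(psi * n)
    tol = eps * n
    rmin = rmin_before
    for v, g, delta in tuples:
        rmin += g
        rmax = rmin + delta
        if r - rmin <= tol and rmax - r <= tol:
            return v
    return None

excerpt = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(query(excerpt, 0.29, 0.01, 1800, rmin_before=486))
```

With r = 522 and εN = 18, the tuple for 192 is rejected because r - rmin = 21 exceeds 18, and 201 is accepted. On paper 204 also meets the bound, so either is a valid answer.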


• If (rmax(vi) - rmin(vi-1)) ≤ 2εN for every i, then the summary is ε-approximate.

• Our goal: always maintain this property.
• Tuple formulation of this rule: gi + Δi ≤ 2εN

ε = .01, N = 1800, 2εN = 36

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
201   [529, 536]     {28, 7}
204   [539, 540]     {10, 1}


• Goal: always maintain an ε-approximate summary: (rmax(vi) - rmin(vi-1)) = (gi + Δi) ≤ 2εN

• Insert new observations into summary

New observation 197 arrives between 192 and 201; its rank lies in [502, 536].

ε = .01, N = 1800, 2εN = 36

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
197   [502, 536]     (new)
201   [529, 536]     {28, 7}
204   [539, 540]     {10, 1}


• Goal: always maintain an ε-approximate summary: (rmax(vi) - rmin(vi-1)) = (gi + Δi) ≤ 2εN

• Insert new observations into summary

• Insert tuple before the ith tuple: gnew = 1; Δnew = gi + Δi - 1

ε = .01, N = 1801, 2εN = 36.02

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
197   [502, 536]     {1, 34}
201   [530, 537]     {28, 7}
204   [540, 541]     {10, 1}
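The insertion rule can be sketched as follows, reproducing the 197 example. The list-of-lists layout and the function name `insert` are assumptions. The special case for a new minimum or maximum (Δ = 0) is not stated on this slide; it is included on the assumption that the rank of an extreme value is known exactly.

```python
import bisect

def insert(summary, v):
    """Insert observation v into summary, a list of [v, g, delta] sorted by v."""
    keys = [t[0] for t in summary]
    i = bisect.bisect_right(keys, v)          # smallest i with v < v_i
    if i == 0 or i == len(summary):
        summary.insert(i, [v, 1, 0])          # assumed rule for a new min or max
    else:
        _, g_i, delta_i = summary[i]
        summary.insert(i, [v, 1, g_i + delta_i - 1])

summary = [[192, 15, 2], [201, 28, 7], [204, 10, 1]]
insert(summary, 197)
print(summary[1])   # the new tuple for 197
```

The new tuple inherits the uncertainty of its right neighbour: 28 + 7 - 1 = 34, matching {1, 34} in the figure.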



• Goal: always maintain an ε-approximate summary: (rmax(vi) - rmin(vi-1)) = (gi + Δi) ≤ 2εN
• Insert new observations into summary
• Delete all “superfluous” entries: gi = gi + gi-1

Before deletion (ε = .01, N = 1801, 2εN = 36.02):

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
197   [502, 536]     {1, 34}
201   [530, 537]     {28, 7}
204   [540, 541]     {10, 1}

After deleting 197, the tuple for 201 absorbs its g (28 + 1 = 29), and g + Δ = 36 ≤ 36.02:

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
201   [530, 537]     {29, 7}
204   [540, 541]     {10, 1}
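The deletion step can be sketched as a merge into the right-hand neighbour, guarded by the rule that the merged tuple must not exceed 2εN. The names `can_delete` and `delete` and the list-of-lists layout are assumptions for illustration.

```python
def can_delete(summary, i, eps, n):
    """True if folding tuple i into tuple i+1 keeps g + delta within 2*eps*n."""
    g_i = summary[i][1]
    _, g_next, delta_next = summary[i + 1]
    return g_i + g_next + delta_next <= 2 * eps * n

def delete(summary, i):
    """Remove tuple i, adding its g to the tuple on its right."""
    summary[i + 1][1] += summary[i][1]
    del summary[i]

summary = [[192, 15, 2], [197, 1, 34], [201, 28, 7], [204, 10, 1]]
if can_delete(summary, 1, 0.01, 1801):   # 1 + 28 + 7 = 36 <= 36.02
    delete(summary, 1)
print(summary)
```

Because g is only moved, never dropped, the rmin and rmax of every remaining tuple are unchanged by the merge.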


• Insert: gnew = 1; Δnew = gi + Δi - 1

• Delete: gi = gi + gi-1

ε = .01, N = 1801, 2εN = 36.02

v     [rmin, rmax]   {g, Δ}
192   [501, 503]     {15, 2}
201   [530, 537]     {29, 7}
204   [540, 541]     {10, 1}

Terminology

• Full tuple: a tuple is full if gi + Δi = ⌊2εN⌋
• Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one
• Capacity: number of observations that can be counted by gi before the tuple becomes full (= ⌊2εN⌋ - Δi)

The general strategy is to delete tuples with small capacity and preserve tuples with large capacity.
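These definitions translate directly into small helpers. A minimal sketch; the function names are hypothetical, and the floor on 2εN follows the terminology slide as reconstructed here.

```python
import math

def capacity(delta, eps, n):
    """Observations that g can still count before the tuple becomes full."""
    return math.floor(2 * eps * n) - delta

def is_full(g, delta, eps, n):
    """A tuple is full when g + delta equals floor(2 * eps * n)."""
    return g + delta == math.floor(2 * eps * n)

def is_full_pair(left, right, eps, n):
    """A pair is full if deleting the left tuple would overfill the right one."""
    return left[1] + right[1] + right[2] > math.floor(2 * eps * n)

# With eps = .01 and N = 1801, floor(2 * eps * N) = 36.
print(capacity(7, 0.01, 1801), is_full(29, 7, 0.01, 1801))
```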

Operations

• Insert(v): find the smallest i such that $v_{i-1} \le v < v_i$, and insert the tuple $(v, 1, g_i + \Delta_i - 1)$
• Delete(vi): to delete $(v_i, g_i, \Delta_i)$ from S, replace $(v_i, g_i, \Delta_i)$ and $(v_{i+1}, g_{i+1}, \Delta_{i+1})$ by the new tuple $(v_{i+1}, g_i + g_{i+1}, \Delta_{i+1})$
• Compress(): from right to left, merge all mergeable pairs.

GK Algorithm

To add the (n+1)st observation, v, to summary S(n):

• If n ≡ 0 mod 1/(2ε): COMPRESS(), then INSERT(v)
• Otherwise: INSERT(v)
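The control flow above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name `GKSummary` is hypothetical, new minima and maxima are inserted with Δ = 0, and `compress` merges any adjacent pair that stays within 2εn without using the band and tree rules described on the following slides. It therefore keeps the ε-approximation property but makes no claim about the space bound.

```python
import bisect
import math
import random

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                      # [v, g, delta], sorted by v

    def insert(self, v):
        period = max(1, math.floor(1 / (2 * self.eps)))
        if self.n > 0 and self.n % period == 0:
            self.compress()                   # COMPRESS() before INSERT
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        if i == 0 or i == len(self.tuples):
            delta = 0                         # assumed rule for a new min or max
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = 2 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:                         # right to left; keep the minimum
            g = self.tuples[i][1]
            _, g_next, delta_next = self.tuples[i + 1]
            if g + g_next + delta_next <= limit:
                self.tuples[i + 1][1] += g    # fold tuple i into tuple i+1
                del self.tuples[i]
            i -= 1

    def query(self, psi):
        r = math.ceil(psi * self.n)
        tol = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= tol and rmin + delta - r <= tol:
                return v
        return self.tuples[-1][0]             # fallback: the maximum

random.seed(0)
data = list(range(1, 1001))
random.shuffle(data)
sketch = GKSummary(0.05)
for x in data:
    sketch.insert(x)
print(len(sketch.tuples), sketch.query(0.5))
```

Since the values are 1..1000, each value equals its own rank, so the answer to the 0.5-quantile query can be checked against the window 500 ± 50 by inspection.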

Tree Representation

ε = .001, N = 7,000, 2εN = 14

Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0

[Figure: tuples labeled by band: six in band 0, eight in band 1, three in band 2, four in band 3]

• Group tuples with similar capacities into bands

• First (least index) node to the right with higher capacity band becomes parent.


[Figure: the same tuples (2εN = 14) arranged level by level by band, from band 0 through band 3, under a root node R]
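The band table and the parent rule above can be sketched as two small functions. The band formula below is chosen only because it reproduces the table on this slide (band = floor(log2(p - Δ + 1)) with p = 2εN = 14); the original paper's definition may differ in its boundary details. The function names and the sample band sequence are assumptions.

```python
def band(delta, p):
    """Band of a tuple with the given delta, for p = floor(2 * eps * N)."""
    return (p - delta + 1).bit_length() - 1   # floor(log2(p - delta + 1))

def parents(bands):
    """For each tuple, the index of the first tuple to its right in a higher
    band; None means the tuple hangs off the root R."""
    result = [None] * len(bands)
    waiting = []                              # indices still looking for a parent
    for j, b in enumerate(bands):
        while waiting and bands[waiting[-1]] < b:
            result[waiting.pop()] = j
        waiting.append(j)
    return result

print([band(d, 14) for d in (0, 7, 8, 11, 12, 13, 14)])
print(parents([0, 1, 0, 0, 2, 1, 3]))
```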

Operation (compress)

General strategy: delete tuples with small capacity and preserve tuples with large capacity.

1) Deletion cannot leave descendants unmerged --- it must delete entire subtrees

2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.

3) Deletion cannot create an over-full tuple (i.e., with g + Δ > ⌊2εN⌋)

Analysis

• Theorem

At any time n, the total number of tuples stored in S(n) is at most

$\frac{11}{2\varepsilon} \log(2\varepsilon n)$
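To get a feel for the bound, the sketch below evaluates it numerically. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name `gk_tuple_bound` is hypothetical.

```python
import math

def gk_tuple_bound(eps, n):
    """Evaluate (11 / (2 * eps)) * log2(2 * eps * n), assuming a base-2 logarithm."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)

for n in (10**6, 10**9):
    print(n, round(gk_tuple_bound(0.001, n)))
```

The bound grows only logarithmically in n: multiplying n by 1000 adds about 10 to the logarithm rather than multiplying the bound by 1000.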

Experimental Result

• Measurement:
  • |S|
  • Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
  • Optimal max observed ε
• Compared 3 algorithms
  • MRL
  • Preallocated (1/3 number of stored observations as MRL)
  • Adaptive: allocate a new quantile only when observed error is about to exceed desired ε

Conclusion

• Better worst-case behavior than previous algorithms

• It does not require a priori knowledge of the parameter N

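• Space: $O\left(\frac{1}{\varepsilon}\log(\varepsilon N)\right)$, compared with $O\left(\frac{1}{\varepsilon}\log^{2}(\varepsilon N)\right)$ for the previous best algorithm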

Any questions?
