Space-Efficient Online Computation of Quantile Summaries
SIGMOD 01
Michael Greenwald & Sanjeev Khanna
Presented by ellery
Outline
• Introduction
• The summary data structure
• Operation and algorithm
• Tree representation
• Analysis and experimental result
• Conclusion
Introduction
• Space-efficient computation of quantile summaries of very large data sets in a single pass.
• Quantile queries: Given a quantile ψ, return the value whose rank is ⌈ψN⌉
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
sorting
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12
N = 16
• The 0.5 quantile returns the element ranked 8 (0.5 × 16), which is 8.
• The 0.75 quantile returns the element ranked 12 (0.75 × 16), which is 10.
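The worked example above can be reproduced with a few lines of Python. This is only a sketch of the exact, sort-based definition (not the streaming algorithm); the function name `exact_quantile` is made up for illustration.

```python
import math

def exact_quantile(values, psi):
    """Return the element of 1-based rank ceil(psi * N) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(psi * len(ordered)))
    return ordered[rank - 1]

# The 16 observations t0 ... t15 from the slide.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(exact_quantile(stream, 0.5))   # 8
print(exact_quantile(stream, 0.75))  # 10
```

Sorting needs all N values in memory, which is exactly what the summary structure is meant to avoid.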
Requirements
• Explicit & tunable a priori guarantees on the precision of the approximation
• As small a memory footprint as possible
• Online: single pass over the data
• Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations.
• Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order).
ε-approximate
• A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r' is guaranteed to be within the interval [r − εN, r + εN].
Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v. The true rank of v is within [40, 60].
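As a rough illustration of the definition, the check below tests whether a returned value is an acceptable ε-approximate answer for rank r. It is a sketch that assumes the full data is available for verification; `is_eps_approximate_answer` is a hypothetical helper name.

```python
import bisect

def is_eps_approximate_answer(values, v, r, eps):
    """True if some rank of v in sorted(values) lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(values)
    n = len(ordered)
    lo = bisect.bisect_left(ordered, v) + 1   # smallest 1-based rank of v
    hi = bisect.bisect_right(ordered, v)      # largest 1-based rank of v
    if lo > hi:
        return False                          # v does not occur in the data
    return lo <= r + eps * n and hi >= r - eps * n

data = list(range(1, 101))  # 100 distinct elements, so the rank of x is x
print(is_eps_approximate_answer(data, 45, 50, 0.1))  # True: rank 45 is in [40, 60]
print(is_eps_approximate_answer(data, 39, 50, 0.1))  # False: rank 39 is outside
```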
The Summary Data Structure
• Let rmin(v) and rmax(v) denote the lower and upper bounds on the rank of v
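• Each tuple $t_i = (v_i, g_i, \Delta_i)$, where
• $g_i = r_{\min}(v_i) - r_{\min}(v_{i-1})$
• $\Delta_i = r_{\max}(v_i) - r_{\min}(v_i)$
• $r_{\min}(v_i) = \sum_{j \le i} g_j$
• $r_{\max}(v_i) = \sum_{j \le i} g_j + \Delta_i$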
Example
ε = .01, N = 1750

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
201     [529, 536]     {28, 7}
204     [539, 540]     {10, 1}
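The rank intervals in this example follow from the tuple definitions by a running sum over g. The sketch below recomputes them; the `offset` parameter (the sum of g over earlier tuples not shown on the slide) is an assumption inferred from rmin(192) = 501 with g = 15.

```python
from itertools import accumulate

def rank_bounds(tuples, offset=0):
    """tuples: list of (v, g, delta). Returns a list of (v, rmin, rmax)."""
    rmins = [offset + s for s in accumulate(g for _, g, _ in tuples)]
    return [(v, rmin, rmin + d) for (v, _, d), rmin in zip(tuples, rmins)]

shown = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
# offset=486 is inferred from the slide, not stated on it.
for row in rank_bounds(shown, offset=486):
    print(row)
# (192, 501, 503)
# (201, 529, 536)
# (204, 539, 540)
```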
Query
• Sketch S is ε-approximate; that is, for each ψ ∈ (0, 1], there is a (vi, rmin(vi), rmax(vi)) in S such that
$\lceil \psi N \rceil - \varepsilon N \le r_{\min}(v_i) \le r_{\max}(v_i) \le \lceil \psi N \rceil + \varepsilon N$
• vi is our answer for the ψ-quantile
Corollary
• If at any time n, the summary S(n) satisfies the property that
$\max_i (g_i + \Delta_i) \le 2\varepsilon n$
then we can answer any ψ-quantile query to within an εn precision.
Overview of Summary Data Structure
• Quantile ψ = .29? Compute r and choose the best vi
ε = .01, N = 1800, ψ = .29, r = ⌈ψN⌉ = 522

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
201     [529, 536]     {28, 7}
204     [539, 540]     {10, 1}
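One way to "choose the best vi" is to scan for a tuple whose rank interval stays within εN of r on both sides. This is a minimal sketch over the three tuples shown; the function name `query` and the (v, rmin, rmax) layout are illustrative choices.

```python
import math

def query(summary, psi, eps, n):
    """summary: list of (v, rmin, rmax) sorted by v.
    Returns a v whose rank interval is within eps*n of r = ceil(psi*n), or None."""
    r = math.ceil(psi * n)
    for v, rmin, rmax in summary:
        if r - rmin <= eps * n and rmax - r <= eps * n:
            return v
    return None

summary = [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
print(query(summary, 0.29, 0.01, 1800))  # 201
```

Here εN = 18, so 192 is rejected (522 − 501 = 21 > 18) while 201 qualifies (536 − 522 = 14).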
Overview of Summary Data Structure
• If (rmax(vi) − rmin(vi−1)) ≤ 2εN for every i, then the summary is ε-approximate.
• Our goal: always maintain this property.
• Tuple formulation of this rule: gi + Δi ≤ 2εN
ε = .01, N = 1800, 2εN = 36

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
201     [529, 536]     {28, 7}
204     [539, 540]     {10, 1}
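The tuple form of the rule is easy to check mechanically. The helper below is a sketch (the name `satisfies_invariant` is hypothetical) applied to the three tuples on the slide.

```python
def satisfies_invariant(tuples, eps, n):
    """True if g + delta <= 2*eps*n for every (v, g, delta) tuple."""
    return all(g + d <= 2 * eps * n for _, g, d in tuples)

tuples = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(satisfies_invariant(tuples, 0.01, 1800))   # True: largest g + delta is 35 <= 36
print(satisfies_invariant(tuples, 0.005, 1800))  # False: 35 > 18
```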
Overview of Summary Data Structure
• Goal: always maintain an ε-approximate summary: (rmax(vi) − rmin(vi−1)) = (gi + Δi) ≤ 2εN
• Insert new observations into summary
ε = .01, N = 1800, 2εN = 36; new observation: 197

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
201     [529, 536]     {28, 7}
204     [539, 540]     {10, 1}
Overview of Summary Data Structure
• Goal: always maintain an ε-approximate summary: (rmax(vi) − rmin(vi−1)) = (gi + Δi) ≤ 2εN
• Insert new observations into summary
ε = .01, N = 1800, 2εN = 36

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
197     [502, 536]     (new)
201     [529, 536]     {28, 7}
204     [539, 540]     {10, 1}
Overview of Summary Data Structure
• Goal: always maintain an ε-approximate summary: (rmax(vi) − rmin(vi−1)) = (gi + Δi) ≤ 2εN
• Insert new observations into summary
• Insert tuple before the ith tuple: gnew = 1; Δnew = gi + Δi − 1
ε = .01, N = 1801, 2εN = 36.02

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
197     [502, 536]     {1, 34}
201     [530, 537]     {28, 7}
204     [540, 541]     {10, 1}
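The insertion rule on this slide can be sketched as follows. Tuples are stored as mutable `[v, g, delta]` lists; giving a new minimum or maximum Δ = 0 (its rank is known exactly) is an assumption not shown on the slide.

```python
def insert(summary, v):
    """summary: list of [v, g, delta] sorted by v. Inserts v before the
    first tuple with a larger value, using delta_new = g_i + delta_i - 1."""
    for i, (vi, gi, di) in enumerate(summary):
        if v < vi:
            # Assumption: a new minimum has an exactly known rank (delta = 0).
            delta = 0 if i == 0 else gi + di - 1
            summary.insert(i, [v, 1, delta])
            return
    summary.append([v, 1, 0])  # new maximum: rank is known exactly

summary = [[192, 15, 2], [201, 28, 7], [204, 10, 1]]
insert(summary, 197)
print(summary)  # [[192, 15, 2], [197, 1, 34], [201, 28, 7], [204, 10, 1]]
```

The new tuple gets Δ = 28 + 7 − 1 = 34, matching {1, 34} on the slide.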
Overview of Summary Data Structure
• Goal: always maintain an ε-approximate summary: (rmax(vi) − rmin(vi−1)) = (gi + Δi) ≤ 2εN
• Insert new observations into summary
• Delete all "superfluous" entries.
ε = .01, N = 1801, 2εN = 36.02

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
197     [502, 536]     {1, 34}
201     [530, 537]     {28, 7}
204     [540, 541]     {10, 1}
Overview of Summary Data Structure
• Goal: always maintain an ε-approximate summary: (rmax(vi) − rmin(vi−1)) = (gi + Δi) ≤ 2εN
• Insert new observations into summary
• Delete all "superfluous" entries.
ε = .01, N = 1801, 2εN = 36.02

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
(197)   (deleted)      {1, 34}
201     [530, 537]     {28, 7}
204     [540, 541]     {10, 1}
Overview of Summary Data Structure
• Goal: always maintain an ε-approximate summary: (rmax(vi) − rmin(vi−1)) = (gi + Δi) ≤ 2εN
• Insert new observations into summary
• Delete all "superfluous" entries: gi = gi + gi−1
ε = .01, N = 1801, 2εN = 36.02

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
201     [530, 537]     {29, 7}
204     [540, 541]     {10, 1}
Overview of Summary Data Structure
• Insert: gnew = 1; Δnew = gi + Δi − 1
• Delete: gi = gi + gi−1
ε = .01, N = 1801, 2εN = 36.02

Value   [rmin, rmax]   {g, Δ}
192     [501, 503]     {15, 2}
201     [530, 537]     {29, 7}
204     [540, 541]     {10, 1}
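The delete rule folds the removed tuple's g into its right-hand neighbor, so no observation count is lost. A minimal sketch, with tuples as `[v, g, delta]` lists and a hypothetical function name `delete`:

```python
def delete(summary, i):
    """Remove tuple i, adding its g to the tuple on its right (index i + 1)."""
    _, g, _ = summary[i]
    summary[i + 1][1] += g
    del summary[i]

summary = [[192, 15, 2], [197, 1, 34], [201, 28, 7], [204, 10, 1]]
delete(summary, 1)  # drop the tuple for 197
print(summary)  # [[192, 15, 2], [201, 29, 7], [204, 10, 1]]
```

The sum of all g values is unchanged, which is why rmin and rmax of the remaining tuples stay the same.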
Terminology
• Full tuple: A tuple is full if gi + Δi = ⌊2εN⌋
• Full tuple pair: A pair of tuples is full if deleting the left-hand tuple would overfill the right one
• Capacity: number of observations that can be counted by gi before the tuple becomes full (= ⌊2εN⌋ − Δi)
The general strategy is to delete tuples with small capacity and preserve tuples with large capacity.
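These three terms translate directly into small predicates. The sketch below assumes the threshold is ⌊2εN⌋, as used on the compress slide; the helper names are made up for illustration.

```python
import math

def capacity(delta, eps, n):
    """Observations g may count before the tuple becomes full."""
    return math.floor(2 * eps * n) - delta

def is_full(g, delta, eps, n):
    return g + delta >= math.floor(2 * eps * n)

def pair_is_full(left, right, eps, n):
    """left, right: (v, g, delta). True if merging left into right overfills right."""
    return left[1] + right[1] + right[2] > math.floor(2 * eps * n)

eps, n = 0.01, 1801                                          # floor(2*eps*n) = 36
print(capacity(34, eps, n))                                  # 2: small capacity
print(capacity(7, eps, n))                                   # 29: large capacity
print(pair_is_full((197, 1, 34), (201, 28, 7), eps, n))      # False: 1 + 28 + 7 = 36
print(is_full(29, 7, eps, n))                                # True after the merge
```

With the values from the earlier slides, the tuple for 197 has small capacity and can be merged into 201 without overfilling it.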
Operations
• Insert(v): Find the smallest i such that $v_{i-1} \le v < v_i$, and insert $(v, 1, g_i + \Delta_i - 1)$
• Delete($v_i$): to delete $(v_i, g_i, \Delta_i)$ from S, replace $(v_i, g_i, \Delta_i)$ and $(v_{i+1}, g_{i+1}, \Delta_{i+1})$ by the new tuple $(v_{i+1}, g_i + g_{i+1}, \Delta_{i+1})$
• Compress(): from right to left, merge all mergeable pairs.
GK Algorithm
To add the (n+1)st observation, v, to summary S(n):
• Is $n \equiv 0 \pmod{1/(2\varepsilon)}$?
• Yes: COMPRESS(), then INSERT(v)
• No: INSERT(v)
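Putting the operations together gives the loop below. It is a simplified sketch, not the authors' implementation: `compress` here merges any pair that does not overfill the right-hand tuple and ignores the band and subtree conditions of the tree representation, so it keeps the gi + Δi ≤ 2εn property but makes no claim about the space bound. All function names are illustrative.

```python
import math
import random

def insert(summary, v):
    """Insert v into a sorted list of [v, g, delta] tuples."""
    for i, (vi, gi, di) in enumerate(summary):
        if v < vi:
            delta = 0 if i == 0 else gi + di - 1  # assumption: new minimum is exact
            summary.insert(i, [v, 1, delta])
            return
    summary.append([v, 1, 0])                     # new maximum is exact

def compress(summary, n, eps):
    """Right to left, merge tuple i into i + 1 when the result is not over-full.
    The first and last tuples (minimum and maximum) are never deleted."""
    limit = math.floor(2 * eps * n)
    for i in range(len(summary) - 2, 0, -1):
        g = summary[i][1]
        right = summary[i + 1]
        if g + right[1] + right[2] <= limit:
            right[1] += g
            del summary[i]

def add(summary, v, n, eps):
    """Add the (n+1)st observation; returns the new count."""
    period = max(1, math.floor(1 / (2 * eps)))
    if n % period == 0:
        compress(summary, n, eps)
    insert(summary, v)
    return n + 1

random.seed(0)
summary, n = [], 0
for x in (random.random() for _ in range(2000)):
    n = add(summary, x, n, 0.05)
print(n, len(summary))  # 2000 observations, far fewer tuples
```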
Tree Representation
ε = .001, N = 7,000, 2εN = 14

Δ-range   Capacity   Band
0-7       8-15       3
8-11      4-7        2
12-13     2-3        1
14        1          0

• Group tuples with similar capacities into bands
• First (least index) node to the right with higher capacity band becomes parent.
[Figure: tuples labeled with their bands (six in band 0, eight in band 1, three in band 2, four in band 3), arranged level by level as a tree under the root R]
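The band table and the parent rule can be sketched in a few lines. The `band` formula below is fitted to the table on this slide (band = ⌊log2(capacity)⌋ with capacity = 2εN − Δ + 1); it is an assumption and may differ in detail from the paper's exact band definition. Both function names are hypothetical.

```python
def band(delta, p):
    """Band for a tuple with the given delta, where p = floor(2*eps*N).
    Fitted to the slide's table: floor(log2(p - delta + 1))."""
    return (p - delta + 1).bit_length() - 1

def parents(bands):
    """For each node, the index of the first node to its right with a
    higher band, or None if there is none (child of the root R)."""
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

print([band(d, 14) for d in (0, 7, 8, 11, 12, 13, 14)])  # [3, 3, 2, 2, 1, 1, 0]
print(parents([0, 1, 0, 0, 2, 3]))                       # [1, 4, 4, 4, 5, None]
```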
Operation (compress)
General strategy: delete tuples with small capacity and preserve tuples with large capacity.
1) Deletion cannot leave descendants unmerged --- it must delete entire subtrees
2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.
3) Deletion cannot create an over-full tuple (i.e., with g + Δ > ⌊2εN⌋)
Analysis
• Theorem
At any time n, the total number of tuples stored in S(n) is at most
$\frac{11}{2\varepsilon} \log(2\varepsilon n)$
Experimental Result
• Measurement:
  • |S|
  • Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
  • Optimal max observed ε
• Compared 3 algorithms
  • MRL
  • Preallocated (1/3 number of stored observations as MRL)
  • Adaptive: allocate a new quantile only when observed error is about to exceed desired ε
Conclusion
• Better worst-case behavior than previous algorithms
• It does not require a priori knowledge of the parameter N
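• Worst-case space: $O\left(\frac{1}{\varepsilon}\log(\varepsilon N)\right)$, versus $O\left(\frac{1}{\varepsilon}\log^2(\varepsilon N)\right)$ for the previous best (MRL)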
Any questions?