streaming quantiles - edoliberty.github.io
TRANSCRIPT
StreamingQuantiles
EdoLibertyPrincipalScientistAmazonWebServices
StreamingQuantiles
Manku,Rajagopalan,Lindsay.Randomsamplingtechniquesforspaceefficientonlinecomputationoforderstatisticsoflargedatasets.Munro,Paterson.Selectionandsortingwithlimitedstorage.Greenwald,Khanna.Space-efficientonlinecomputationofquantilesummaries.Wang,Luo,Yi,Cormode.Quantilesoverdatastreams:Anexperimentalstudy.Greenwald,Khanna.Quantilesandequidepth histogramsoverstreams.Agarwal,Cormode,Huang,Phillips,Wei,Yi.Mergeable summaries.Felber,Ostrovsky.ArandomizedonlinequantilesummaryinO((1/ε)log(1/ε))words.Lang,Karnin,Liberty,OptimalQuantileApproximationinStreams.
ProblemDefinition
n
0 n
R( ) = 0.6 · n
CreateasketchforsuchthatR0 |R0(x)�R(x)| "n
• UniformsamplingFastandsimpleFullymergeable Space
• GreenwaldKhanna(GK)sketchSlow,complexNotmergeable Space
• Felber-Ostrovsky,combinessamplingandGK(2015)Slow,complexNotmergeable Space
Solutions
log(1/")/✏
O(1/"2)
log(n)/"
Previouslyconjecturedspaceoptimalforallalgorithms.lowerboundfordeterministic algorithmsbyHungandTing2010.log(1/")/✏
• UniformsamplingFast,simpleFullymergeable Space
• GreenwaldKhanna(GK)sketchSlow,complexNotmergeable Space
• Felber-Ostrovsky,combinessamplingandGK(2015)Slow,complexNotmergeable Space
• Manku-Rajagopalan-Lindsay(MRL)FastsimpleFullymergeable Space
• Agarwal,Cormode,Huang,Phillips,Wei,YiFast,complexFullymergeable Space
Solutionscont’
log(1/")/✏
O(1/"2)
log(n)/"
log
2(n)/"
log
3/2(1/")/"
Bufferbasedsolutions
Thebasicbufferidea
1 0 35 4 7
Bufferofsizek
Thebasicbufferidea
Storeskstreamentries
1
03
5
47
Thebasicbufferidea
Thebuffersortskstreamentries
10
3
54
7
Thebasicbufferidea
Deleteseveryotheritem
10
3
54
7
Thebasicbufferidea
Andoutputstherestwithdoubletheweight
035
Thebasicbufferidea
0
0
x x
1 54 7
1
3
3
4
5
7
R(x) = 2
R
0(x) = 2
R
0(x) = 2
R(x) = 5
R
0(x) = 4
R
0(x) = 6
Thebasicbufferidea
Repeattimeuntiltheendofthestream
0
|R0(x)�R(x)| < n/k
nn/2
n/k
1 0 355
n
Buffersofsize k
1 0 35
Manku-Rajagopalan-Lindsay(MRL)sketch
|R0(x)�R(x)| n/k · log2(n)
H = log2(n)
k = log2(n)/"Ifwesetweget
whilemaintainingatmoststreamitems.
|R0(x)�R(x)| "n
Manku-Rajagopalan-Lindsay(MRL)sketch
H · k log
22(n)/"
Manku-Rajagopalan-Lindsay(MRL)sketchFast,SimpleFullymergeable Space
Agarwal,Cormode,Huang,Phillips,Wei,Yi(1)
Buffersofsize klog(1/")
startsamplingafteritemsO(1/"2)
log
2(1/")/"Reducesspaceusagetoitemsfromthestream.
1 0 35
Agarwal,Cormode,Huang,Phillips,Wei,Yi(2)
E[R0(x)] = R(x)
R
0(x) isarandomvariablenowand
R(x) = 1
R
0(x) = 2
R
0(x) = 0
x
Reducesspaceusagetoitemsfromthestream.log
3/2(1/")/"
5 7
5
7
• UniformsamplingFast,simpleFullymergeable Space
• GreenwaldKhanna(GK)sketchSlow,complexNotmergeable Space
• Felber-Ostrovsky,combinessamplingandGK(2015)Slow,complexNotmergeable Space
• Manku-Rajagopalan-Lindsay(MRL)Fast,simpleFullymergeable Space
• Agarwal,Cormode,Huang,Phillips,Wei,YiFast,complexFullymergeable Space
Recap
log(1/")/✏
O(1/"2)
log(n)/"
log
2(n)/"
log
3/2(1/")/"
• UniformsamplingFast,simpleFullymergeable Space
• GreenwaldKhanna(GK)sketchSlow,complexNotmergeable Space
• Felber-Ostrovsky,combinessamplingandGK(2015)Slow,complexNotmergeable Space
• Manku-Rajagopalan-Lindsay(MRL)Fast,simpleFullymergeable Space
• Agarwal,Cormode,Huang,Phillips,Wei,YiFast,complexFullymergeable Space
• Karnin,Lang,LibertyFast,simpleFullymergeable Space
Ourgoal
log(1/")/✏
O(1/"2)
log(n)/"
log
2(n)/"
log
3/2(1/")/"
log(1/")/✏
Observation
h = H
Thefirstbufferscontributeverylittletotheerror.Theyare“toogood”.
wh = 2h�1
Numberofcompactions
Weightofitemsinthelevel
h = 2 h = 1....
mh = 2H�h�1
Idea
h = H
h = 2h = 1
kh � kcH�hLetbuffersshrinkat-most-exponentially
wh = 2h�1
H log(n/ck) + 2
mh (2/c)H�h�1
Numberofcompactions
TBDlater
Pr [|R(x,H 0)�R(x)| � "n] 2 exp
⇣�C"2k222(H�H0)
⌘
therankof among1. Theitemsyieldedbythecompactoratheight2. Alltheitemsstoredinthecompactorsofheights h0 h
hR(h, x) x
Claim,for
ProofUseHoeffding’s inequalityon
HX
h=1
[R(x, h)�R(x, h� 1)]
C = c2(2c� 1)
Analysis
Setand
Solution1
kh = dkcH�he+ 1c = 2/3• Karnin,Lang,Liberty(1)
Fast,simpleFullymergeable Spacep
log(1/")/"
exponentiallydecreasingcapacitybuffers
log(n) samplerreplacesall buffersofsize2
Betterthanpreviouslyconjecturedoptimal!
exponentiallydecreasingcapacitybuffers
Setandexceptthatthetopbuffersallhavecapacity.
Solution2(KLL+MRL)
kh = dkcH�he+ 1c = 2/3
• Karnin,Lang,Liberty(2)Fast,simpleFullymergeable Space
log log(1/") k
log
2log(1/")/"
log log(1/")Buffersofcapacityk
log(n) samplerreplacesall buffersofsize2
• Karnin,Lang,Liberty(3)Fast,simpleFullymergeable Space
exponentiallydecreasingcapacitybuffers
SetandreplacethetopwithaGKsketch
Solution3(KLL+GK)
kh = dkcH�he+ 1c = 2/3log log(1/") k
log log(1/")GKsketchreplaces
toplevelslog(n) samplerreplaces
all buffersofsize2
GKSketch
log log(1/")/"
CountDistinct(DemoOnly)
$ head data.csv0103023732
Inthisone,rowi tasksavaluefrom[0,i]uniformlyatrandom.
Assumeyouneedtoestimatethedistributionofnumbersinafile
$ time wc -lc data.csv10000000 76046666 data.csv
real 0m0.101suser 0m0.072ssys 0m0.021s
Readingthefiletake~1/10seconds.Wedon’tforeseeIObeinganissue.
Somestats:thereare10,000,000suchnumbersinthis~76Mbfile.
$ time cat data.csv | python quantiles.py > /dev/null
real 0m13.406suser 0m12.937ssys 0m0.407s
Inpythonitlookslikethis:
$ cat quantiles.pyimport sysints = sorted([int(x) for x in sys.stdin])for i in range(0,len(ints),int(len(ints)/100)):
print(str(ints[i]))
$ time cat data.csv | sketch rank > /dev/null
real 0m1.495suser 0m1.878ssys 0m0.141s
Thisisthewaytodothiswiththesketchinglibrary
$ time cat data.csv | sketch rank
ToofasttousethesystemmonitorUI...
Ituses~4kofmemory!
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
8000000
9000000
10000000
1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97 103
exactandapproximatequantiles
approximatequantiles exactquantiles
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
8000000
9000000
10000000
0 2000000 4000000 6000000 8000000 10000000
exactvsapproximatequantiles
0
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.008
0.009
0.01
100 1000 10000 100000 1e+06
Err
or
Number of Items in Randomly Permuted Stream
Lazy KLL versus (Sketch Library and Two Variants)
Sketch LibraryVariant 1Variant 2Lazy KLL
0
500
1000
1500
2000
2500
3000
3500
4000
100 1000 10000 100000 1e+06
Space
Use
d F
or
Sto
ring S
am
ple
s
Number of Items in Randomly Permuted Stream
Lazy KLL versus (Sketch Library and Two Variants)
Sketch LibraryVariant 1Variant 2Lazy KLL
Someexperimentalresults
Thankyou!