The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao U. Hong Kong
Christos Faloutsos CMU
Dimitris Papadias Hong Kong UST
Tao, Faloutsos, Papadias 2
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Target query types
DB = set of m-d points.
Range search (RS)
k nearest neighbor (kNN)
Regional distance (self-)join (RDJ): e.g., in Louisiana, find all pairs of music stores closer than 1 mi to each other
Target problem
Estimate: query selectivity, query (I/O) cost
for any Lp metric, using a single method
Target Problem
For any Lp metric, using a single method:

      RS   kNN   RDJ
Sel.   ?    X     ?
I/O    ?    ?     ?
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Older query estimation approaches
Vast literature: sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc.
BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear
Main competitors
Local method
Representative methods: histograms
Global method
Provides a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
Representative methods: fractal and power law
Rationale and problems of histograms
Partition the data space into a set of buckets and assume (local) uniformity
[figure: query q and its vicinity circle intersecting buckets b1, b2, b3]
Problems:
uniformity
tricky/slow estimations, for all but the L∞ norm
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Inherent defect of histograms
Density trap – what is the density in the vicinity of q?
[figure: query point q with radius r on a line of points, 10 units long inside a unit-width bucket]
diameter = 10: 10/100 = 0.1; diameter = 100: 100/10,000 = 0.01
Q: What is going on?
A: we ask a silly question: ~ "what is the area of a line?"
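A quick numeric sketch of the trap (our own illustration, not from the slides' figure): 100 points spaced one unit apart along a line, measured through square windows of two diameters, reproduce the 0.1 vs. 0.01 "densities" above.

```python
import numpy as np

# Hypothetical illustration of the density trap: points spaced one unit
# apart along a line (a perfectly behaving 1-d Euclidean object).
pts = np.arange(1.0, 101.0)            # points at 1, 2, ..., 100

def density(diameter):
    """Points per unit AREA inside a square window of the given diameter."""
    count = np.sum(pts <= diameter)    # points falling in the window
    return count / diameter ** 2       # divide by the window's area

print(density(10.0))                   # 10 points / 100     -> 0.1
print(density(100.0))                  # 100 points / 10,000 -> 0.01
```

The "density" changes with the measurement scale, so no single density value describes the vicinity of q for non-uniform data.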
“Density Trap”
Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly behaving Euclidean object!
This ‘trap’ will appear for any non-uniform dataset
Almost ALL real point-sets are non-uniform -> the trap is real
“Density Trap”
In short: count_of_neighbors / area is meaningless
What should we do instead? A: log(count_of_neighbors) vs log(area)
Local power law
In more detail, the ‘local power law’ (LPL):

nb_p(r) = c_p · r^{n_p}

nb_p: # neighbors of point p, within radius r
c_p: ‘local constant’
n_p: ‘local exponent’ (= local intrinsic dimensionality)
[figure: query point q with radius r]
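The coefficients c_p and n_p can be estimated by a straight-line fit in log-log space, exactly the log(count)-vs-log(area) idea above. A minimal sketch (our own code, with hypothetical radii; it is not the paper's implementation):

```python
import numpy as np

def fit_lpl(data, p, radii):
    """Fit nb_p(r) ~ c_p * r**n_p by least squares on log(count) vs log(r)."""
    dists = np.max(np.abs(data - p), axis=1)               # L-infinity distances to p
    counts = np.array([np.sum(dists <= r) for r in radii])
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p                              # (local constant, local exponent)

# Points on a 1-d line embedded in 2-d: the local exponent should be close
# to 1, even though the 2-d 'density' around p is scale-dependent.
line = np.column_stack([np.linspace(0.0, 1.0, 2001), np.zeros(2001)])
c_p, n_p = fit_lpl(line, line[1000], radii=[0.01, 0.02, 0.05, 0.1])
print(round(n_p, 2))                                       # close to 1
```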
Local power law
Intuitively: to avoid the ‘density trap’, use n_p (the local intrinsic dimensionality) instead of density
[figure: query point q with radius r]
Does LPL make sense?
For point q, the LPL gives nb_q(r) = <constant> · r^1
(no need for ‘density’, nor uniformity)
[figure: query point q with radius r on the same line of points; diameter = 10: 10/100 = 0.1; diameter = 100: 100/10,000 = 0.01]
Local power law and Lx
If a point obeys the L.P.L. under L∞, ditto for any other Lx metric, with the same ‘local exponent’:

nb_p^{Lx}(r) = c_p · (VolSphere_Lx / VolSphere_L∞)^{n_p/m} · r^{n_p}

-> LPL works easily, for ANY Lx metric
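A small empirical check of this claim (our own sketch, not the paper's proof; dataset, query point, and radii are illustrative assumptions): for uniform 2-d data, the fitted local exponent under L1, L2, and L∞ comes out nearly the same, close to the intrinsic dimensionality 2.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(0.0, 1.0, size=(10_000, 2))   # hypothetical uniform 2-d set
q = np.array([0.5, 0.5])
radii = np.array([0.05, 0.1, 0.15, 0.2])

def local_exponent(ord):
    """Local exponent of q under the given Lx metric (log-log slope)."""
    d = np.linalg.norm(data - q, ord=ord, axis=1)
    counts = [np.sum(d <= r) for r in radii]
    return np.polyfit(np.log(radii), np.log(counts), 1)[0]

exps = [local_exponent(o) for o in (1, 2, np.inf)]
print([round(e, 2) for e in exps])               # all close to 2
```

Only the local constant differs across metrics, by the ratio of unit-sphere volumes.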
Examples
[log-log plot: #neighbors(<= r) vs. radius r for two points p1, p2, spanning r = 0.001 to 0.1 and counts 1 to 10k]
p1 has a higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Proposed method
Main idea: if we know (or can approximate) the c_p and n_p of every point p, we can solve all the problems:
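For instance, once (c_q, n_q) near a query q are known or approximated, range-search selectivity follows directly from the LPL itself. A hedged sketch of the idea, not the paper's exact theorem statements:

```python
def range_selectivity(c_q, n_q, r, N):
    """Estimated fraction of the N points within radius r of q,
    assuming the Local Power Law nb_q(r) = c_q * r**n_q holds near q."""
    return min(c_q * r ** n_q / N, 1.0)   # clamp: a fraction cannot exceed 1

# e.g. a locally 1-d neighborhood (n_q = 1) in a 40k-point dataset:
print(range_selectivity(c_q=40.0, n_q=1.0, r=0.05, N=40_000))  # 5e-05
```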
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Theoretical results
Interesting observation (Thm 3.4): the cost of a kNN query q depends only on the ‘local exponent’, and NOT on the ‘local constant’, nor on the cardinality of the dataset
Implementation
Given a query point q, we need its local exponent and constant to perform estimation
but: too expensive to store these for every point. Q: What to do?
A: exploit locality:
Implementation
Nearby points usually have similar local constants and exponents. Thus, one solution:
‘anchors’: pre-compute the LPLaw for a set of representative points (anchors), and use the nearest ‘anchor’ to q
Implementation
Choose anchors: with sampling, DBS, or any other method.
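A minimal sketch of the anchor scheme (our own code; the anchor count, radii, and sampling choice are illustrative assumptions, not the paper's tuned settings):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.uniform(0.0, 1.0, size=(5_000, 2))    # stand-in dataset

def lpl_coefficients(p, radii=(0.02, 0.05, 0.1)):
    """LPL coefficients (c_p, n_p) of p from L-infinity neighborhoods."""
    d = np.max(np.abs(data - p), axis=1)
    counts = [np.sum(d <= r) for r in radii]
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p

# Pre-compute the LPL for a sampled set of anchors ...
anchors = data[rng.choice(len(data), size=50, replace=False)]
coeffs = [lpl_coefficients(a) for a in anchors]

# ... and at query time reuse the coefficients of the nearest anchor.
def lpl_for_query(q):
    i = int(np.argmin(np.max(np.abs(anchors - q), axis=1)))
    return coeffs[i]

c_q, n_q = lpl_for_query(np.array([0.3, 0.7]))
print(round(n_q, 1))   # roughly 2 for this uniform 2-d stand-in
```

Storing 50 (c, n) pairs replaces storing coefficients for all 5,000 points, at the cost of the locality approximation.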
Implementation
(In addition to ‘anchors’, we also tried ‘patches’ of near-constant c_p and n_p – it gave similar accuracy, with a more complicated implementation)
Experiments - Settings
Datasets:
SC: 40k points representing the coast lines of Scandinavia
LB: 53k points corresponding to locations in Long Beach county
Structure: R*-tree
Compare the Power method to: Minskew, Global method (fractal)
Experiments - Settings
The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
Queries: biased (following the data distribution); a query workload contains 500 queries
We report the average error Σ_i |act_i − est_i| / Σ_i act_i
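As a tiny sketch, the reported measure is the workload-relative absolute error:

```python
def avg_error(act, est):
    """Average error over a workload: sum_i |act_i - est_i| / sum_i act_i."""
    return sum(abs(a - e) for a, e in zip(act, est)) / sum(act)

# e.g. three queries with actual vs. estimated values (made-up numbers):
print(avg_error([10, 20, 30], [12, 18, 33]))   # (2 + 2 + 3) / 60 ~ 0.1167
```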
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Range search selectivity
[two plots: estimation error (%) vs. r, for r = 0 to 0.1; methods: minskew, power, global]
the LPL method wins
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Regional distance join selectivity
[two plots: estimation error (%) vs. t, for t = 0 to 0.01; methods: minskew, power]
No known global method in this case
The LPL method wins, with a higher margin
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Range search query cost
[two plots: estimation error (%) vs. r, for r = 0 to 0.1; methods: minskew, power, global]
k nearest neighbor cost
[two plots: estimation error (%) vs. k, for k = 1 to 100; methods: local uniformity, power, global]
Regional distance join cost
[two plots: estimation error (%) vs. t, for t = 0 to 0.01; methods: minskew, power]
Conclusions
We spotted the “density trap” problem of the local uniformity assumption (<- histograms)
We showed how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’)
And we solved all the posed problems:
Conclusions – cont’d
For any Lp metric (Lemma 3.2), using a single method (LPL & ‘anchors’):

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5