The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao U. Hong Kong
Christos Faloutsos CMU
Dimitris Papadias Hong Kong UST
Tao, Faloutsos, Papadias 2
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Target query types
DB = set of m-d points.
Range search (RS)
k nearest neighbor (kNN)
Regional distance (self-)join (RDJ): e.g., in Louisiana, find all pairs of music stores closer than 1 mi to each other
Target problem
Estimate: query selectivity, query (I/O) cost
for any Lp metric, using a single method
Target Problem
For any Lp metric, using a single method:

      RS   kNN   RDJ
Sel.   ?    X     ?
I/O    ?    ?     ?
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Older query estimation approaches
Vast literature: sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc.
BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear
Main competitors
Local method
Representative methods: histograms
Global method
Provides a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
Representative methods: fractal and power law
Rationale and problems of histograms
Partition the data space into a set of buckets and assume (local) uniformity
[figure: query q and its vicinity circle intersecting buckets b1, b2, b3]
Problems:
uniformity
tricky/slow estimations, for all but the L∞ norm
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Inherent defect of histograms
Density trap – what is the density in the vicinity of q?
[figure: query point q with radius r on a line of points, 10 units long inside a unit-width bucket]
diameter = 10: 10/100 = 0.1; diameter = 100: 100/10,000 = 0.01
Q: What is going on?
A: we ask a silly question: ~ "what is the area of a line?"
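A quick numeric sketch of the trap (our own illustration, not from the slides' figure): 100 points spaced one unit apart along a line, measured through square windows of two diameters, reproduce the 0.1 vs. 0.01 "densities" above.

```python
import numpy as np

# Hypothetical illustration of the density trap: points spaced one unit
# apart along a line (a perfectly behaving 1-d Euclidean object).
pts = np.arange(1.0, 101.0)            # points at 1, 2, ..., 100

def density(diameter):
    """Points per unit AREA inside a square window of the given diameter."""
    count = np.sum(pts <= diameter)    # points falling in the window
    return count / diameter ** 2       # divide by the window's area

print(density(10.0))                   # 10 points / 100     -> 0.1
print(density(100.0))                  # 100 points / 10,000 -> 0.01
```

The "density" changes with the measurement scale, so no single density value describes the vicinity of q for non-uniform data.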
“Density Trap”
Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly behaving Euclidean object!
This ‘trap’ will appear for any non-uniform dataset
Almost ALL real point-sets are non-uniform -> the trap is real
“Density Trap”
In short: count_of_neighbors / area is meaningless
What should we do instead? A: log(count_of_neighbors) vs log(area)
Local power law
In more detail, the ‘local power law’ (LPL):

nb_p(r) = c_p · r^{n_p}

nb_p: # neighbors of point p, within radius r
c_p: ‘local constant’
n_p: ‘local exponent’ (= local intrinsic dimensionality)
[figure: query point q with radius r]
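The coefficients c_p and n_p can be estimated by a straight-line fit in log-log space, exactly the log(count)-vs-log(area) idea above. A minimal sketch (our own code, with hypothetical radii; it is not the paper's implementation):

```python
import numpy as np

def fit_lpl(data, p, radii):
    """Fit nb_p(r) ~ c_p * r**n_p by least squares on log(count) vs log(r)."""
    dists = np.max(np.abs(data - p), axis=1)               # L-infinity distances to p
    counts = np.array([np.sum(dists <= r) for r in radii])
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p                              # (local constant, local exponent)

# Points on a 1-d line embedded in 2-d: the local exponent should be close
# to 1, even though the 2-d 'density' around p is scale-dependent.
line = np.column_stack([np.linspace(0.0, 1.0, 2001), np.zeros(2001)])
c_p, n_p = fit_lpl(line, line[1000], radii=[0.01, 0.02, 0.05, 0.1])
print(round(n_p, 2))                                       # close to 1
```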
Local power law
Intuitively: to avoid the ‘density trap’, use n_p (the local intrinsic dimensionality) instead of density
[figure: query point q with radius r]
Does LPL make sense?
For point q, the LPL gives nb_q(r) = <constant> · r^1
(no need for ‘density’, nor uniformity)
[figure: query point q with radius r on the same line of points; diameter = 10: 10/100 = 0.1; diameter = 100: 100/10,000 = 0.01]
Local power law and Lx
If a point obeys the L.P.L. under L∞, ditto for any other Lx metric, with the same ‘local exponent’:

nb_p^{Lx}(r) = c_p · (VolSphere_Lx / VolSphere_L∞)^{n_p/m} · r^{n_p}

-> LPL works easily, for ANY Lx metric
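A small empirical check of this claim (our own sketch, not the paper's proof; dataset, query point, and radii are illustrative assumptions): for uniform 2-d data, the fitted local exponent under L1, L2, and L∞ comes out nearly the same, close to the intrinsic dimensionality 2.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(0.0, 1.0, size=(10_000, 2))   # hypothetical uniform 2-d set
q = np.array([0.5, 0.5])
radii = np.array([0.05, 0.1, 0.15, 0.2])

def local_exponent(ord):
    """Local exponent of q under the given Lx metric (log-log slope)."""
    d = np.linalg.norm(data - q, ord=ord, axis=1)
    counts = [np.sum(d <= r) for r in radii]
    return np.polyfit(np.log(radii), np.log(counts), 1)[0]

exps = [local_exponent(o) for o in (1, 2, np.inf)]
print([round(e, 2) for e in exps])               # all close to 2
```

Only the local constant differs across metrics, by the ratio of unit-sphere volumes.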
Examples
[log-log plot: #neighbors(<= r) vs. radius r for two points p1, p2, spanning r = 0.001 to 0.1 and counts 1 to 10k]
p1 has a higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2
Roadmap
Problem – motivation
Survey
Proposed method – main idea
Proposed method – details
Experiments
Conclusions
Proposed method
Main idea: if we know (or can approximate) the c_p and n_p of every point p, we can solve all the problems:
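For instance, once (c_q, n_q) near a query q are known or approximated, range-search selectivity follows directly from the LPL itself. A hedged sketch of the idea, not the paper's exact theorem statements:

```python
def range_selectivity(c_q, n_q, r, N):
    """Estimated fraction of the N points within radius r of q,
    assuming the Local Power Law nb_q(r) = c_q * r**n_q holds near q."""
    return min(c_q * r ** n_q / N, 1.0)   # clamp: a fraction cannot exceed 1

# e.g. a locally 1-d neighborhood (n_q = 1) in a 40k-point dataset:
print(range_selectivity(c_q=40.0, n_q=1.0, r=0.05, N=40_000))  # 5e-05
```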
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Theoretical results
Interesting observation (Thm 3.4): the cost of a kNN query q depends only on the ‘local exponent’, and NOT on the ‘local constant’, nor on the cardinality of the dataset
Implementation
Given a query point q, we need its local exponent and constant to perform estimation
but: too expensive to store these for every point. Q: What to do?
A: exploit locality:
Implementation
Nearby points usually have similar local constants and exponents. Thus, one solution:
‘anchors’: pre-compute the LPLaw for a set of representative points (anchors), and use the nearest ‘anchor’ to q
Implementation
Choose anchors: with sampling, DBS, or any other method.
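A minimal sketch of the anchor scheme (our own code; the anchor count, radii, and sampling choice are illustrative assumptions, not the paper's tuned settings):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.uniform(0.0, 1.0, size=(5_000, 2))    # stand-in dataset

def lpl_coefficients(p, radii=(0.02, 0.05, 0.1)):
    """LPL coefficients (c_p, n_p) of p from L-infinity neighborhoods."""
    d = np.max(np.abs(data - p), axis=1)
    counts = [np.sum(d <= r) for r in radii]
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return np.exp(log_c), n_p

# Pre-compute the LPL for a sampled set of anchors ...
anchors = data[rng.choice(len(data), size=50, replace=False)]
coeffs = [lpl_coefficients(a) for a in anchors]

# ... and at query time reuse the coefficients of the nearest anchor.
def lpl_for_query(q):
    i = int(np.argmin(np.max(np.abs(anchors - q), axis=1)))
    return coeffs[i]

c_q, n_q = lpl_for_query(np.array([0.3, 0.7]))
print(round(n_q, 1))   # roughly 2 for this uniform 2-d stand-in
```

Storing 50 (c, n) pairs replaces storing coefficients for all 5,000 points, at the cost of the locality approximation.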
Implementation
(In addition to ‘anchors’, we also tried ‘patches’ of near-constant c_p and n_p – it gave similar accuracy, with a more complicated implementation)
Experiments - Settings
Datasets:
SC: 40k points representing the coast lines of Scandinavia
LB: 53k points corresponding to locations in Long Beach county
Structure: R*-tree
Compare the Power method to: Minskew, Global method (fractal)
Experiments - Settings
The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
Queries: biased (following the data distribution); a query workload contains 500 queries
We report the average error Σ_i |act_i − est_i| / Σ_i act_i
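As a tiny sketch, the reported measure is the workload-relative absolute error:

```python
def avg_error(act, est):
    """Average error over a workload: sum_i |act_i - est_i| / sum_i act_i."""
    return sum(abs(a - e) for a, e in zip(act, est)) / sum(act)

# e.g. three queries with actual vs. estimated values (made-up numbers):
print(avg_error([10, 20, 30], [12, 18, 33]))   # (2 + 2 + 3) / 60 ~ 0.1167
```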
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Range search selectivity
[two plots: estimation error (%) vs. r, for r = 0 to 0.1; methods: minskew, power, global]
the LPL method wins
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Regional distance join selectivity
[two plots: estimation error (%) vs. t, for t = 0 to 0.01; methods: minskew, power]
No known global method in this case
The LPL method wins, with a higher margin
Target Problem
For any Lp metric (Lemma 3.2), using a single method:

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5
Range search query cost
[two plots: estimation error (%) vs. r, for r = 0 to 0.1; methods: minskew, power, global]
k nearest neighbor cost
[two plots: estimation error (%) vs. k, for k = 1 to 100; methods: local uniformity, power, global]
Regional distance join cost
[two plots: estimation error (%) vs. t, for t = 0 to 0.01; methods: minskew, power]
Conclusions
We spotted the “density trap” problem of the local uniformity assumption (<- histograms)
We showed how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’)
And we solved all the posed problems:
Conclusions – cont’d
For any Lp metric (Lemma 3.2), using a single method (LPL & ‘anchors’):

      RS        kNN       RDJ
Sel.  Thm 3.1   X         Thm 3.2
I/O   Thm 3.3   Thm 3.4   Thm 3.5