
A Spatial Index Structure for High Dimensional Point Data

Wei Wang, Jiong Yang, and Richard Muntz
Data Mining Lab
Department of Computer Science
University of California, Los Angeles

Outline

• Introduction
• Structure of PK-tree
• Operations on PK-tree
• Performance
• Conclusions

Introduction

• Dynamic spatial indexing has been an active research area.
– Index structures based on spatial decomposition
  • PR-Quad tree, K-D tree, K-D-B tree, ...
  • No overlap among sibling nodes
  • Achieving high disk page utilization at high dimensionality with skewed data distributions remains a challenge.
– R-tree family of index structures
  • R*-tree, SR-tree, X-tree, ...
  • Overlap among sibling nodes increases with dimensionality and degrades performance severely.

Introduction

• PK-tree
– Spatial decomposition
  • No overlap among sibling nodes
– Bound on height
– Bounds on number of children
– Uniqueness for any data set
  • Independent of the order of insertions and deletions
– Solid theoretical foundation
– Fast retrieval and updates

Structure of PK-tree

• The space is divided recursively along rectilinear (axis-parallel) boundaries.

[Figure: a two-dimensional space (dim 1, dim 2) in which a cell at the ith level is divided into sub-cells at the (i+1)th level.]

Set notation is used to express relationships among cells, for example a point belonging to a cell or one cell being a sub-cell of another.


• Space is recursively divided until a level LD such that each cell contains at most one point.
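As a rough illustration of this stopping rule, the Python sketch below searches for the first level at which no cell holds more than one point. It assumes that the points are pairwise distinct, that they lie in the unit cube [0, 1)^d, and that every level halves each dimension. The slides allow other dividing ratios, and the names `cell_of` and `separating_level` are hypothetical.

```python
def cell_of(point, level):
    """Grid coordinates of the cell containing `point` at `level`.

    Assumption: the index space is [0, 1)^d and each level halves
    every dimension, so level `level` has 2**level cells per axis.
    """
    n = 1 << level
    return tuple(min(int(x * n), n - 1) for x in point)


def separating_level(points, max_level=32):
    """Smallest level at which every cell contains at most one point."""
    for level in range(max_level + 1):
        cells = [cell_of(p, level) for p in points]
        if len(set(cells)) == len(cells):  # no two points share a cell
            return level
    raise ValueError("points are not separated within max_level levels")
```

For three points such as (0.1, 0.1), (0.6, 0.1), and (0.1, 0.3), the first and third still share a cell at level 1, so the division has to continue to level 2.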

[Figure: the same point set overlaid with the cell grid at Level 0, Level 1, Level 2, and Level 3.]


• Point cell: a non-empty cell at level LD

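• A cell $C$ is K-instantiable iff
– $C$ is a point cell, or
– there do not exist $K-1$ or fewer K-instantiable sub-cells $C_1, \ldots, C_{K-1} \subset C$ such that $\forall d \in D \; (d \in C \Rightarrow d \in \bigcup_{i=1}^{K-1} C_i)$.

In other words, the points inside $C$ cannot all be covered by fewer than $K$ K-instantiable sub-cells of $C$.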
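The definition can be evaluated bottom-up. The sketch below relies on one observation that is not stated on the slide: because cells form a hierarchy, the smallest set of K-instantiable sub-cells that covers every point of a cell is the set of maximal ones, so a cell qualifies when it has at least K of them. The representation (a cell as a `(level, coords)` pair, each level halving every dimension) and the name `k_instantiable_cells` are assumptions made for illustration only.

```python
from collections import defaultdict


def k_instantiable_cells(point_cells, leaf_level, k):
    """Return all K-instantiable cells as (level, coords) pairs.

    point_cells: set of integer coordinate tuples of the non-empty
    cells at level `leaf_level` (the point cells).
    """
    # frontier[c] lists the maximal K-instantiable cells inside cell c
    frontier = {c: [(leaf_level, c)] for c in point_cells}
    result = {(leaf_level, c) for c in point_cells}

    for level in range(leaf_level - 1, -1, -1):
        merged = defaultdict(list)
        for coords, cells in frontier.items():
            parent = tuple(x >> 1 for x in coords)  # enclosing cell one level up
            merged[parent].extend(cells)

        frontier = {}
        for coords, cells in merged.items():
            if len(cells) >= k:
                # K or more maximal sub-cells are needed: the cell qualifies
                result.add((level, coords))
                frontier[coords] = [(level, coords)]
            else:
                # fewer than K sub-cells cover all points: not instantiable
                frontier[coords] = cells
    return result
```

On a 4 × 4 grid (leaf level 2) with K = 3 and point cells (0,0), (1,0), (0,1), (3,3), (2,0), the lower-left level-1 cell holds three point cells and is K-instantiable, while the level-1 cells around (3,3) and (2,0) hold one point cell each and are not.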

[Figure: K-instantiable cells for K = 3, shown on the same point set at Level 3 (LD), Level 2, and Level 1.]


[Figure: example of a PK-tree of rank 3 over an 8 × 8 grid (columns a to h, rows 1 to 8), drawn at Level 3 (LD), Level 2, and Level 1. The leaves under the root are the point cells a2, c2, d1, b4, d2, c3, e1, e2, f3, h1, g1, g2, h2, g3, a7, g5, f6, f5, e5, d8, c8, d7, b8, b7; the instantiated intermediate cells are labeled B, D, K, M, N, U, and R.]


• Given a finite set of points D over index space C0 and dividing ratio R, a PK-tree of rank K (K > 1) is defined as follows.

– The cell at level 0 (C0) is always instantiated and serves as the root of the PK-tree.

– Every other node in the PK-tree (every node except the root) corresponds one-to-one to a K-instantiable cell.

– For any two nodes C1 and C2 in the PK-tree, C1 is a child of C2 (or C2 is the parent of C1) iff

• C1 is a proper sub-cell of C2, i.e., $C_1 \subset C_2$, and

• there does not exist C3 in the PK-tree such that $C_1 \subset C_3$ and $C_3 \subset C_2$.

• Properties: existence and uniqueness, bounds on node outdegree, bounded storage space, bounds on expected height, no overlapping among sibling nodes, and so on.
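The parent-child rule in the definition says that the parent of a node is its nearest instantiated proper ancestor cell. A minimal sketch of that linking step follows, assuming cells are `(level, coords)` pairs on a grid where each level halves every dimension; the helper names `ancestor` and `link_children` are hypothetical and do not come from the slides.

```python
def ancestor(cell, level):
    """The cell at `level` that encloses `cell` (requires level <= cell's level)."""
    cell_level, coords = cell
    return (level, tuple(x >> (cell_level - level) for x in coords))


def link_children(cells):
    """Map every instantiated cell to the list of its children.

    cells: set of (level, coords) pairs containing the root (level 0)
    and all K-instantiable cells.
    """
    children = {c: [] for c in cells}
    for cell in cells:
        # walk upward and stop at the first ancestor that is instantiated
        for level in range(cell[0] - 1, -1, -1):
            candidate = ancestor(cell, level)
            if candidate in children:
                children[candidate].append(cell)
                break
    return children
```

Because the result depends only on the set of instantiated cells and not on the order in which points arrive, the same point set always yields the same links, which matches the uniqueness property listed above.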

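Properties of PK-tree

Expected Height of a PK-tree

[Figure: the longest root-to-leaf path of the tree, of length H, running from the root (N points) down to a leaf. Each level along the path is annotated "at least K-1", and for consecutive cells Ci and Ci+1 on the path, $P(d \in C_{i+1} \mid d \in C_i) < 1$.]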

Properties of PK-tree

• M-Level Clustering Spatial Distribution
– 0-level: uniform distribution over C0, so that $P(d \in C_{i+1} \mid d \in C_i) = 1/r$.
– 1-level: let $A \subset C_0$ be some subset of C0 and $A^c = C_0 - A$. The points in A and the points in $A^c$ each follow a 0-level clustering spatial distribution.

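[Figure: a region A and its complement Ac inside C0, with a cell Ci and one of its sub-cells Ci+1. The accompanying formula expresses $P(d \in C_{i+1} \mid d \in C_i)$ for the 1-level case in terms of $r$, $D_A$, and $D_{A^c}$.]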
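To make the 0-level case concrete, here is a short worked calculation. It assumes that $r$ is the number of equal sub-cells each cell is divided into and that the $N$ points are drawn independently; it is meant as an illustration of why the depth grows logarithmically, not as the authors' proof.

```latex
P(d \in C_i) \;=\; \prod_{j=0}^{i-1} P(d \in C_{j+1} \mid d \in C_j) \;=\; r^{-i},
\qquad
E\bigl[\,\lvert \{ d \in D : d \in C_i \} \rvert\,\bigr] \;=\; N \, r^{-i}.
```

For example, with N = 100,000 and r = 4 (a two-dimensional space halved along both dimensions), the expected number of points in a level-8 cell is about 1.5 and in a level-9 cell about 0.4, so the expected count drops below one point per cell once $i > \log_4 100000 \approx 8.3$.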

Operations on PK-tree

• Pagination of the PK-tree
– Pick the parameter K and the number of dimensions to split at each level such that the maximum node size is close to the page size.
– Allocate one node to a page.
– Space utilization is guaranteed to be at least 50% and is much higher than 50% in experiments.
• Insertion
– First follow the path from the root to locate all (potential) ancestors of the inserted leaf cell.
– Then walk from the leaf level back to the root along the same path, making all necessary changes (e.g., instantiating or de-instantiating cells).
• Search
– K Nearest Neighbor Query
– Range Query
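The slides name the two query types without giving their algorithms. The sketch below shows one generic way such queries can run on a tree whose sibling cells do not overlap: depth-first pruning for range queries and best-first search with a priority queue for nearest neighbors. It is not necessarily the authors' method. It assumes the tree is a dictionary from `(level, coords)` cells to child lists, that points sit on an integer grid at the leaf level, and that each level halves every dimension; all function names are hypothetical.

```python
import heapq


def cell_bounds(cell, leaf_level):
    """Inclusive lower and upper corners of `cell` in leaf-level grid units."""
    level, coords = cell
    shift = leaf_level - level
    lo = tuple(c << shift for c in coords)
    hi = tuple(((c + 1) << shift) - 1 for c in coords)
    return lo, hi


def range_query(children, root, leaf_level, q_lo, q_hi):
    """Point cells inside the inclusive box [q_lo, q_hi], sorted."""
    hits, stack = [], [root]
    while stack:
        cell = stack.pop()
        lo, hi = cell_bounds(cell, leaf_level)
        # skip any cell that misses the query box in some dimension
        if any(h < ql or l > qh for l, h, ql, qh in zip(lo, hi, q_lo, q_hi)):
            continue
        if cell[0] == leaf_level:
            hits.append(cell[1])
        else:
            stack.extend(children[cell])
    return sorted(hits)


def min_dist2(cell, leaf_level, q):
    """Squared distance from query point q to the nearest part of `cell`."""
    lo, hi = cell_bounds(cell, leaf_level)
    total = 0
    for l, h, x in zip(lo, hi, q):
        if x < l:
            total += (l - x) ** 2
        elif x > h:
            total += (x - h) ** 2
    return total


def knn_query(children, root, leaf_level, q, k):
    """The k point cells closest to q, nearest first (best-first search)."""
    heap, found = [(0, root)], []
    while heap and len(found) < k:
        _, cell = heapq.heappop(heap)
        if cell[0] == leaf_level:
            found.append(cell[1])  # a point cell popped first is the next nearest
        else:
            for child in children[cell]:
                heapq.heappush(heap, (min_dist2(child, leaf_level, q), child))
    return found
```

Since sibling cells are disjoint, a pruned subtree can never contain a qualifying point, so each query only descends into the cells that can still contribute to the answer.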

Performance

• Setup: Sparc 10 workstation (SunOS 5.5) with 208 MB main memory and a local disk with 9 GB capacity
• Synthetic data sets (each contains 100,000 points)
– u: uniform distribution
– c1, c2: 20% of the data are uniformly distributed and 80% are distributed in disjoint clusters
• Height of generated trees

Dimension       2    4    8    16   32   64
PK-tree (u)     4    4    5    6    7    9
PK-tree (c1)    5    7    7    6    7    8
PK-tree (c2)    7    7    6    7    8    9
X-tree          4    4    4    4    5    6
SR-tree         4    4    5    5    6    7


• Size of index in MB with 100,000 points

Dimension   2                4                8                16
Data set    u     c1    c2   u     c1    c2   u     c1    c2   u     c1    c2
PK-tree     1.8   1.9   1.9  2.8   2.8   2.8  4.9   4.8   4.9  9.4   9.3   9.4
X-tree      1.8   1.8   1.8  3.0   3.0   3.0  5.6   5.5   5.6  10.7  10.4  10.6
SR-tree     69    70    70   74    73    74   74    74    75   90    91    92


• Range query on uniform data distribution


• Range query on clustered data distribution


• KNN query on uniform data distribution


• KNN query on clustered data distribution


• Real data set: NASA Sky Telescope Data
– 200,000 two-dimensional points (the coordinates of crater locations on the surface of Mars)

           Height   Size     KNN CPU   KNN I/O   Range CPU   Range I/O
PK-tree    5        3.7 MB   4 ms      4         3 ms        4
X-tree     4        5.7 MB   90 ms     4         10 ms       4
SR-tree    5        120 MB   28 ms     8         14 ms       6
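To put the real-data figures in relative terms, the short script below computes how many times more CPU time the other two structures need compared with the PK-tree. The numbers are copied from the table; the dictionary layout and the function name `cpu_ratio` are only illustrative.

```python
# (KNN CPU ms, KNN I/O, range CPU ms, range I/O), copied from the table
results = {
    "PK-tree": (4, 4, 3, 4),
    "X-tree": (90, 4, 10, 4),
    "SR-tree": (28, 8, 14, 6),
}


def cpu_ratio(other, base="PK-tree"):
    """CPU time of `other` divided by that of `base`, as (KNN, range)."""
    knn_other, _, ran_other, _ = results[other]
    knn_base, _, ran_base, _ = results[base]
    return knn_other / knn_base, ran_other / ran_base


for name in ("X-tree", "SR-tree"):
    knn, ran = cpu_ratio(name)
    print(f"{name}: KNN CPU x{knn:.1f}, range CPU x{ran:.1f}")
```

On this data set the X-tree needs 22.5 times the KNN CPU time of the PK-tree and the SR-tree needs 7 times as much, while the I/O counts differ far less.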

Conclusions

• PK-tree: employs spatial decomposition to ensure no overlap among sibling nodes, while avoiding the large number of nodes that usually results from a skewed spatial distribution of objects.
– The total number of nodes in a PK-tree is O(N) and the expected height of a PK-tree is O(log N) under some general conditions.
• Other properties: uniqueness, bounds on number of children.
• Empirical studies show that the PK-tree outperforms the SR-tree and X-tree by a wide margin.
