A Spatial Index Structure for High Dimensional Point Data
Wei Wang, Jiong Yang, and Richard Muntz
Data Mining Lab
Department of Computer Science
University of California, Los Angeles
Outline
• Introduction
• Structure of PK-tree
• Operations on PK-tree
• Performance
• Conclusions
Introduction
• Dynamic spatial indexing has been an active research area.
– Index structures based on spatial decomposition
  • PR-Quad tree, K-D tree, K-D-B tree, ...
  • No overlap among sibling nodes
  • Achieving high disk page utilization at large dimensionality with skewed data distributions remains a challenge.
– R-tree family of index structures
  • R*-tree, SR-tree, X-tree, ...
  • Overlap among sibling nodes increases with dimensionality and degrades performance severely.
Introduction
• PK-tree
– Spatial decomposition
  • no overlap among sibling nodes
– Bound on height
– Bounds on number of children
– Uniqueness for any data set
  • independent of the order of insertions and deletions
– Solid theoretical foundation
– Fast retrieval and updates
Structure of PK-tree
• Space is divided recursively and rectilinearly.
[Figure: a cell at the ith level, with axes dim 1 and dim 2, divided into sub-cells at the (i+1)th level]
Set notation (e.g., $\in$, $\subset$, $\cup$) is used to express relationships among cells.
Structure of PK-tree
• Space is recursively divided until a level LD at which each cell contains at most one point.
[Figure: the same point set shown on the grids of Level 0, Level 1, Level 2, and Level 3]
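A minimal sketch of how the level LD could be found for a given point set. It assumes, for illustration only, that the index space is the unit cube [0,1)^d and that each level halves every axis; the slides allow a general dividing ratio. The names `cell_of` and `find_ld` are hypothetical.

```python
from collections import Counter

def cell_of(point, level):
    """Integer grid coordinates of the cell containing `point` at `level`.

    Assumption: every axis of [0, 1) is cut into 2**level equal intervals.
    """
    n = 1 << level
    return tuple(min(int(x * n), n - 1) for x in point)

def find_ld(points, max_level=32):
    """Return the smallest level at which every cell holds at most one point."""
    for level in range(max_level + 1):
        counts = Counter(cell_of(p, level) for p in points)
        if all(c <= 1 for c in counts.values()):
            return level
    raise ValueError("points are not separated at any level up to max_level")

points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9)]
ld = find_ld(points)  # the two nearby points first fall into different cells at level 3
```

The `max_level` cap is there because duplicate points would never be separated, however deep the division goes.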
Structure of PK-tree
• Point cell: a non-empty cell at level LD
• A cell C is K-instantiable iff
– C is a point cell, or
– there do not exist K-1 or fewer K-instantiable sub-cells $C_1, \ldots, C_{K-1} \subset C$ such that $\forall d \in D\ (d \in C \Rightarrow d \in \bigcup_{i=1}^{K-1} C_i)$.
[Figure: K-instantiable cells marked on the grids of Level 3 (LD), Level 2, and Level 1, for K = 3]
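The K-instantiable test can be evaluated bottom-up. The sketch below is one possible way to do it, not the authors' algorithm. It assumes points in [0,1)^d, a division that halves every axis at each level, and an `ld` that already separates the points. Because cells are nested, the smallest set of K-instantiable sub-cells that covers a cell's points is the set of outermost such sub-cells, so it is enough to count those. The function name is hypothetical.

```python
from collections import defaultdict

def k_instantiable_cells(points, ld, k):
    """Return the set of (level, coords) cells at levels 1..ld that are K-instantiable."""
    n = 1 << ld
    # cover[coords] = outermost K-instantiable cells inside the cell `coords`
    cover = {}
    for p in points:
        coords = tuple(min(int(x * n), n - 1) for x in p)
        cover[coords] = [(ld, coords)]          # point cells are K-instantiable
    instantiable = {cells[0] for cells in cover.values()}
    for level in range(ld - 1, 0, -1):          # the root (level 0) is handled separately
        merged = defaultdict(list)
        for coords, cells in cover.items():
            parent = tuple(c // 2 for c in coords)
            merged[parent].extend(cells)
        cover = {}
        for coords, cells in merged.items():
            if len(cells) >= k:                 # no cover with K-1 or fewer sub-cells exists
                instantiable.add((level, coords))
                cover[coords] = [(level, coords)]
            else:
                cover[coords] = cells           # not instantiable: pass its sub-cells upward
    return instantiable

pts = [(0.1, 0.1), (0.3, 0.1), (0.1, 0.3), (0.9, 0.9)]
cells_k3 = k_instantiable_cells(pts, ld=2, k=3)
```

Here the lower-left level-1 cell holds three point cells and is 3-instantiable, while the upper-right level-1 cell holds a single point cell and is not.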
Structure of PK-tree
• Example of a PK-tree of rank 3
[Figure: a tree diagram and three 8 × 8 grids (columns a to h, rows 1 to 8) for Level 3 (LD), Level 2, and Level 1. Internal cells are labeled U, R, B, D, K, M, and N below the root. The leaves are the point cells a2, c2, d1, b4, d2, c3, e1, e2, f3, h1, g1, g2, h2, g3, a7, g5, f6, f5, e5, d8, c8, d7, b8, b7.]
Structure of PK-tree
• Given a finite set of points D over index space C0 and dividing ratio R, a PK-tree of rank K (K > 1) is defined as follows.
– The cell at level 0 (C0) is always instantiated and serves as the root of the PK-tree.
– Every other node in the PK-tree is mapped one-to-one to a K-instantiable cell.
– For any two nodes C1 and C2 in the PK-tree, C1 is a child of C2 (or C2 is the parent of C1) iff
  • C1 is a proper sub-cell of C2, i.e., $C_1 \subset C_2$, and
  • there does not exist C3 in the PK-tree such that $C_1 \subset C_3$ and $C_3 \subset C_2$.
• Properties: existence and uniqueness, bounds on node outdegree, bounded storage space, bounds on expected height, no overlap among sibling nodes, and so on.
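The parent rule in the definition amounts to attaching each instantiated cell to its nearest enclosing instantiated cell. A minimal sketch follows. It assumes cells are written as (level, coords) pairs on a grid that halves every axis at each level, so the enclosing cell one level up is obtained by integer-halving the coordinates. `link_children` is a hypothetical name.

```python
def link_children(cells, dims):
    """Map every node (root included) to the list of its children."""
    root = (0, (0,) * dims)                  # C0 is always instantiated
    nodes = set(cells) | {root}
    children = {node: [] for node in nodes}
    for level, coords in sorted(nodes):
        anc_level, anc = level, coords
        # walk up through enclosing cells until an instantiated one is found
        while anc_level > 0:
            anc_level, anc = anc_level - 1, tuple(c // 2 for c in anc)
            if (anc_level, anc) in nodes:
                children[(anc_level, anc)].append((level, coords))
                break
    return children

cells = {(2, (0, 0)), (2, (1, 0)), (2, (0, 1)), (2, (3, 3)), (1, (0, 0))}
tree = link_children(cells, dims=2)
```

The result depends only on the set of instantiated cells, not on the order in which they are processed, which matches the uniqueness property listed above.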
Properties of PK-tree
• Expected height of a PK-tree
[Figure: the longest path, of length H, from the root (N points) down to a leaf. Each node on the path is annotated "at least K-1", and consecutive cells Ci and Ci+1 on the path satisfy $P(d \in C_{i+1} \mid d \in C_i) < 1$.]
Properties of PK-tree
• M-level clustering spatial distribution
– 0-level: uniform distribution over C0, $P(d \in C_{i+1} \mid d \in C_i) = 1/r$
– 1-level: Let $A \subset C_0$ be some subset of C0 and $A^c = C_0 - A$. The distributions of the points in A and in Ac are each 0-level clustering spatial distributions.
[Figure: region A and its complement Ac, with nested cells Ci and Ci+1]
[Equation not recoverable from the extracted text: an expression for $P(d \in C_{i+1} \mid d \in C_i)$ in terms of r and the point sets $D_A$ and $D_{A^c}$]
Operations on PK-tree
• Pagination of the PK-tree
– Pick the parameter K and the number of dimensions to split at each level such that the maximum node size is close to the page size.
– Allocate one node to a page.
– Space utilization can be guaranteed to be at least 50% and is much higher than 50% in experiments.
• Insertion
– First follow the path from the root to locate all (potential) ancestors of the inserted leaf cell.
– Then go from the leaf level back to the root along the same path, making all necessary changes (e.g., instantiating or de-instantiating cells).
• Search
– K nearest neighbor query
– Range query
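Because sibling cells never overlap, a range query can discard any subtree whose cell does not intersect the query box. The sketch below illustrates this pruning under stated assumptions: cells are (level, coords) pairs on a grid that halves every axis of [0,1) at each level, `children` maps a node to its child nodes, and `points_in` maps each point cell to its stored point. All names are hypothetical, and paging is ignored.

```python
def cell_box(level, coords):
    """Half-open interval [lo, hi) of the cell along each axis."""
    side = 1.0 / (1 << level)
    return [(c * side, (c + 1) * side) for c in coords]

def range_query(children, points_in, node, query):
    """Return the points inside `query`, a list of inclusive (lo, hi) pairs per axis."""
    box = cell_box(*node)
    if any(hi <= qlo or lo > qhi for (lo, hi), (qlo, qhi) in zip(box, query)):
        return []                            # cell and query are disjoint: prune the subtree
    if node in points_in:                    # point cell: test the stored point itself
        p = points_in[node]
        inside = all(qlo <= x <= qhi for x, (qlo, qhi) in zip(p, query))
        return [p] if inside else []
    found = []
    for child in children.get(node, []):
        found.extend(range_query(children, points_in, child, query))
    return found

root = (0, (0, 0))
children = {root: [(1, (0, 0)), (2, (3, 3))],
            (1, (0, 0)): [(2, (0, 0)), (2, (1, 0)), (2, (0, 1))]}
points_in = {(2, (0, 0)): (0.1, 0.1), (2, (1, 0)): (0.3, 0.1),
             (2, (0, 1)): (0.1, 0.3), (2, (3, 3)): (0.9, 0.9)}
hits = range_query(children, points_in, root, [(0.0, 0.2), (0.0, 0.5)])
```

In this example the cells (2, (1, 0)) and (2, (3, 3)) are pruned without inspecting their points, since both lie entirely to the right of x = 0.2.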
Performance
• Setup: Sparc 10 workstation (SunOS 5.5) with 208 MB main memory and a local disk with 9 GB capacity
• Synthetic data sets (each contains 100,000 points)
– u: uniform distribution
– c1, c2: 20% of the data are uniformly distributed and 80% are distributed in disjoint clusters
• Height of generated trees
Dimension       2    4    8    16   32   64
PK-tree (u)     4    4    5    6    7    9
PK-tree (c1)    5    7    7    6    7    8
PK-tree (c2)    7    7    6    7    8    9
X-tree          4    4    4    4    5    6
SR-tree         4    4    5    5    6    7
Performance
• Size of index in MB with 100,000 points
Dimension   2                 4                 8                 16
Data set    u     c1    c2    u     c1    c2    u     c1    c2    u     c1    c2
PK-tree     1.8   1.9   1.9   2.8   2.8   2.8   4.9   4.8   4.9   9.4   9.3   9.4
X-tree      1.8   1.8   1.8   3.0   3.0   3.0   5.6   5.5   5.6   10.7  10.4  10.6
SR-tree     69    70    70    74    73    74    74    74    75    90    91    92
Performance
• Range query on uniform data distribution
Performance
• Range query on clustered data distribution
Performance
• KNN query on uniform data distribution
Performance
• KNN query on clustered data distribution
Performance
• Real data set: NASA Sky Telescope Data
– 200,000 two-dimensional points (the coordinates of crater locations on the surface of Mars)
Index     height  size    KNN CPU  KNN I/O  RAN CPU  RAN I/O
PK-tree   5       3.7MB   4ms      4        3ms      4
X-tree    4       5.7MB   90ms     4        10ms     4
SR-tree   5       120MB   28ms     8        14ms     6
Conclusions
• PK-tree: employs spatial decomposition to ensure no overlap among sibling nodes, while avoiding the large number of nodes that usually results from a skewed spatial distribution of objects.
– The total number of nodes in a PK-tree is O(N) and the expected height of a PK-tree is O(log N) under some general conditions.
• Other properties: uniqueness, bounds on number of children.
• Empirical studies show that the PK-tree outperforms the SR-tree and X-tree by a wide margin.