iDistance -- Indexing the Distance
An Efficient Approach to KNN Indexing
C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing, VLDB 2001.
• Similarity queries: Similarity range and KNN queries
• Similarity range query: Given a query point, find all data points within a given distance r to the query point.
• KNN query: Given a query point, find the K data points nearest to it in distance.
[Figure: a query point with range radius r and its Kth nearest neighbour (Kth NN)]
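A brute-force baseline makes the two query types concrete. The sketch below is illustrative only: the function names `range_query` and `knn_query` and the sample points are assumptions, not taken from the slides, and it scans every point rather than using an index.

```python
import heapq
import math

def dist(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(points, q, r):
    """Similarity range query: all points within distance r of q."""
    return [p for p in points if dist(p, q) <= r]

def knn_query(points, q, k):
    """KNN query: the k points nearest to q, closest first."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, q))

# Hypothetical sample data for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest = knn_query(points, (0.0, 0.0), 2)
```

An exact index-based method should return the same answers as this linear scan while touching far fewer points.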
Query Requirement
• SS-tree: R-tree based index structure; uses bounding spheres in internal nodes
• Metric-tree: R-tree based, but uses metric distance and bounding spheres
• VA-file: uses compression via bit strings for sequential filtering of unwanted data points
• Psphere-tree: two-level index structure; uses clusters and duplicates data based on sample queries; intended for approximate KNN
• A-tree: R-tree based, but uses relative bounding boxes
• Problems: hard to integrate into existing DBMSs
Other Methods
Basic Definition
• Euclidean distance: dist(p, q) = sqrt((p1 - q1)^2 + ... + (pd - qd)^2) for d-dimensional points p and q
• Relationship between data points:
• Theorem 1: Let q be the query object, Oi the reference point for partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q) holds, then dist(Oi, q) - querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q).
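Theorem 1 follows from the triangle inequality of the metric. A short derivation, using the slide's symbols with $O_i$ for the reference point and writing $r_q$ for querydist(q) (the shorthand $r_q$ is introduced here only for brevity):

```latex
\begin{align*}
\mathrm{dist}(O_i, p) &\le \mathrm{dist}(O_i, q) + \mathrm{dist}(q, p)
  \le \mathrm{dist}(O_i, q) + r_q, \\
\mathrm{dist}(O_i, q) &\le \mathrm{dist}(O_i, p) + \mathrm{dist}(p, q)
  \;\Longrightarrow\;
  \mathrm{dist}(O_i, p) \ge \mathrm{dist}(O_i, q) - \mathrm{dist}(p, q)
  \ge \mathrm{dist}(O_i, q) - r_q .
\end{align*}
```

Together the two lines give the bound of Theorem 1, so only points whose distance to $O_i$ lies in that interval need to be examined.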
Basic Concept of iDistance
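• Indexing points based on similarity: y = i * c + dist(Si, p), where Si is the reference point of partition i and c is a constant that separates the key ranges of different partitions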
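[Figure: reference/anchor points S1, S2, S3, ..., Sk, Sk+1 laid out on the one-dimensional key axis at intervals of c; a point at distance d from S1 is placed at S1 + d]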
• Data points are partitioned into clusters/partitions.
• For each partition, there is a Reference Point that every data point in the partition makes reference to.
• Data points are indexed based on similarity (metric distance) to such a point using a classical B+-tree
• Iterative range queries are used in KNN searching.
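The mapping and the one-dimensional index can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a sorted Python list searched with `bisect` stands in for the B+-tree, each point is assigned to its closest reference point, and the names `build_index` and `scan` are hypothetical. The constant c is assumed to exceed the largest distance inside any partition, so that the key ranges of different partitions do not overlap.

```python
import bisect
import math

def dist(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_index(points, refs, c):
    """Return (key, point) pairs sorted by the key i * c + dist(Si, p)."""
    entries = []
    for p in points:
        # Assumption: partition i is the one whose reference point is closest to p.
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        entries.append((i * c + dist(refs[i], p), p))
    entries.sort()
    return entries

def scan(entries, lo, hi):
    """Points whose key lies in [lo, hi]: one range scan over the sorted keys."""
    keys = [k for k, _ in entries]
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return [p for _, p in entries[start:end]]

refs = [(0.0, 0.0), (1.0, 1.0)]            # two reference points in the unit square
points = [(0.0, 0.25), (1.0, 0.5), (0.5, 0.0)]
index = build_index(points, refs, c=2.0)   # 2.0 exceeds the square's diagonal
```

Here `scan(index, 2.0, 4.0)` returns every point of partition 1, because all of its keys fall between 1 * c and 2 * c.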
iDistance
• The search region is enlarged until the K nearest neighbours are found.
[Figure: a search range in the B+-tree]
KNN Searching
[Figure: KNN search with increasing search radius r around query point q, showing reference points S1 and S2, dist(S1, q), dist(S2, q), and each partition's Dis_min and Dis_max mapped onto the B+-tree key axis]
KNN Searching
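For a given radius r, the key range to scan in one partition follows from Theorem 1, clipped to the partition's own distance bounds. A minimal sketch, assuming `dist_min` and `dist_max` are the smallest and largest distances from the partition's points to its reference point (the Dis_min and Dis_max labels above) and `d_q` is the distance from the reference point to q; the function name `partition_key_range` is hypothetical:

```python
def partition_key_range(i, c, d_q, dist_min, dist_max, r):
    """Key interval of partition i that can hold points within r of the query.

    Returns None when the query sphere cannot contain any point of the
    partition, so the partition is skipped for this radius.
    """
    lo = max(dist_min, d_q - r)   # Theorem 1 lower bound, clipped to the partition
    hi = min(dist_max, d_q + r)   # Theorem 1 upper bound, clipped to the partition
    if lo > hi:
        return None
    return (i * c + lo, i * c + hi)

# Partition 1 with c = 2.0: the sphere cuts a band out of the partition.
band = partition_key_range(1, 2.0, 0.5, 0.25, 0.75, 0.125)
# Partition 0: the query is too far away for this radius.
missed = partition_key_range(0, 2.0, 1.5, 0.0, 0.5, 0.5)
```

As r grows, the interval widens in both directions until it covers the whole partition, which is why each enlargement only needs to scan the newly exposed keys.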
Inefficient situation:
• When K = 3, the query sphere with radius r retrieves 3 candidate NNs (o1, o2, o3).
• Among them, only o1 is guaranteed to be a true NN. Hence the search continues with enlarged r until r > dist(q, o3).
[Figure: reference point S, query point q with radius r and dist(S, q), and data points o1, o2, o3]
Over Search?
Stopping Criterion
• Theorem 2: The KNN search algorithm terminates when K NNs are found and the answers are correct.
Case 1: dist(furthest(KNN’), q) < r
Case 2: dist(furthest(KNN’), q) > r
[Figure: query sphere of radius r; in case 2 the Kth NN is still uncertain]
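The iterative search and its stopping test can be sketched as below. This is a simplified illustration, not the paper's algorithm verbatim: a sorted list with `bisect` stands in for the B+-tree, each round reruns the range query from scratch instead of extending the previous scan, points are assigned to their closest reference point, and the class name `IDistanceIndex`, the step `dr`, and the cap `max_r` are assumptions.

```python
import bisect
import math

def dist(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class IDistanceIndex:
    def __init__(self, points, refs, c):
        self.refs, self.c = refs, c
        self.dist_max = [-1.0] * len(refs)   # -1 marks an empty partition
        entries = []
        for p in points:
            i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
            d = dist(refs[i], p)
            self.dist_max[i] = max(self.dist_max[i], d)
            entries.append((i * c + d, p))
        entries.sort()
        self.keys = [k for k, _ in entries]
        self.points = [p for _, p in entries]

    def range_candidates(self, q, r):
        """Points whose key lies in the Theorem 1 interval of some partition."""
        found = []
        for i, ref in enumerate(self.refs):
            d_q = dist(ref, q)
            lo = max(0.0, d_q - r)
            hi = min(self.dist_max[i], d_q + r)
            if lo > hi:                      # query sphere misses this partition
                continue
            a = bisect.bisect_left(self.keys, i * self.c + lo)
            b = bisect.bisect_right(self.keys, i * self.c + hi)
            found.extend(self.points[a:b])
        return found

    def knn(self, q, k, dr, max_r):
        """Enlarge r by dr until the K-th candidate lies inside the query sphere."""
        r = dr
        while True:
            cands = sorted(self.range_candidates(q, r), key=lambda p: dist(p, q))
            best = cands[:k]
            # Case 1: the furthest of the K candidates is within r; the answer is final.
            if len(best) == k and dist(best[-1], q) < r:
                return best
            # Case 2, or too few candidates: keep enlarging, up to max_r.
            if r >= max_r:
                return best
            r += dr

# Hypothetical sample data in the unit square.
points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.8, 0.9), (0.4, 0.5), (0.1, 0.8)]
index = IDistanceIndex(points, refs=[(0.0, 0.0), (1.0, 1.0)], c=2.0)
q = (0.6, 0.6)
result = index.knn(q, k=3, dr=0.1, max_r=2.0)
```

In this run three candidates are first available at r = 0.4, but the furthest of them, (0.1, 0.8), lies outside the sphere (case 2), so the loop runs one more round; at r = 0.5 the point (0.9, 0.9) replaces it and the stopping test passes.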
[Figure panels: (centroid of hyperplane, closest distance); (external point, closest distance)]
Space-based Partitioning: Equal-partitioning
[Figure: (centroid of hyperplane, furthest distance)]
Space-based Partitioning: Equal-partitioning from furthest points
[Figure: (external point, furthest distance)]
• Using an external point reduces the search area
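The space-based choices of reference point can be sketched for the unit hypercube as follows. This is an illustration under assumptions: the data space is taken to be [0, 1]^d, an external point is modelled by pushing a face centroid outward along the face normal by a hypothetical `offset`, and the names `face_reference_points` and `assign` are not from the slides.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def face_reference_points(d, offset=0.0):
    """One reference point per face (hyperplane) of the unit hypercube [0, 1]^d.

    offset = 0 gives the face centroids; offset > 0 moves each point outside
    the data space along the face normal (an "external point").
    """
    refs = []
    for axis in range(d):
        for side in (0.0, 1.0):
            p = [0.5] * d
            p[axis] = side + (offset if side == 1.0 else -offset)
            refs.append(tuple(p))
    return refs

def assign(p, refs, furthest=False):
    """Partition number of p: its closest (or furthest) reference point."""
    pick = max if furthest else min
    return pick(range(len(refs)), key=lambda j: dist(refs[j], p))

centroids = face_reference_points(2)
external = face_reference_points(2, offset=0.5)
closest = assign((0.1, 0.5), centroids)                   # closest-distance scheme
opposite = assign((0.1, 0.5), centroids, furthest=True)   # furthest-distance scheme
```

With the furthest-distance scheme a point near the left face is indexed against the reference point on the right face, which is the pairing the (centroid, furthest distance) and (external point, furthest distance) variants rely on.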
Effect of Reference Points on Query Space
• Using (centroid, furthest distance) can greatly reduce search area
The area bounded by these arcs is the affected search area.
Effect on Query Space
[Figure: using cluster centroids as reference points]
Data-based Partitioning I
[Figure: using edge points as reference points]
Data-based Partitioning II
• 100K uniform data set
• Using (external point, furthest distance)
• Effect of search radius on query accuracy
[Figure panels: Dimension = 8; Dimension = 16; Dimension = 30]
Performance Study: Effect of Search Radius
• 10-NN queries on 100K uniform data sets
• Using (external point, furthest distance)
• Effect of search radius on query cost
I/O Cost vs Search Radius
• 10-NN queries on 100K 30-d uniform data set
• Different reference points
Effect of Reference Points
• KNN queries on 100K 30-d clustered data set
• Effect of query radius on query accuracy for different numbers of partitions
Effect of Clustered # of Partitions on Accuracy
• 10-NN queries on 100K 30-d clustered data set
• Effect of # of partitions on I/O and CPU costs
Effect of # of Partitions on I/O and CPU Cost
• KNN queries on 100K and 500K 30-d clustered data sets
• Effect of query radius on query accuracy for different data set sizes
Effect of Data Sizes
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Effect of query radius on query cost for different data set sizes
Effect of Clustered Data Sets
• 10-NN queries on 100K 30-d clustered data set
• Effect of reference points: cluster edge vs cluster centroid
Effect of Reference Points on Clustered Data Sets
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Query cost at varying query accuracy for different data set sizes
iDistance ideal for Approximate KNN?
• 10-NN queries on 100K 30-d clustered data sets
• C. Yu, B. C. Ooi, K. L. Tan. Progressive KNN Search Using B+-trees.
Performance Study -- Compare iMinMax and iDistance