Post on 02-Jan-2016
Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München
A Cost Model and Index Architecture for the Similarity Join
Feature Based Similarity
Simple Similarity Queries
Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
Join Applications: Catalogue Matching
Catalogue matching
• E.g. astronomical catalogues
Figure (labels: R, S)
Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r, s) ≤ ε then
          CacheLoad (r); CacheLoad (s);
          r_tree_sim_join (r, s, ε);
  else (* assume R, S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p − q| ≤ ε then report (p, q);
Figure (labels: R, S)
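The recursion of the RSJ procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the `leaf` and `root` helpers, and the sample points are hypothetical, the cache (`CacheLoad`) is left out, and both trees are assumed to have the same height.

```python
from dataclasses import dataclass, field
from math import dist, sqrt

@dataclass
class Node:
    """Hypothetical R-tree-like page: a bounding box plus children or points."""
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)  # filled for directory pages
    points: list = field(default_factory=list)    # filled for data pages

def mindist(r, s):
    """Smallest Euclidean distance between the bounding boxes of r and s."""
    gaps = (max(rl - sh, sl - rh, 0.0)
            for rl, rh, sl, sh in zip(r.lo, r.hi, s.lo, s.hi))
    return sqrt(sum(g * g for g in gaps))

def sim_join(r, s, eps, out):
    """Collect all point pairs (p, q) with |p - q| <= eps into out."""
    if r.children and s.children:
        # Directory pages: descend only into child pairs that can mate.
        for rc in r.children:
            for sc in s.children:
                if mindist(rc, sc) <= eps:
                    sim_join(rc, sc, eps, out)
    else:
        # Assume both are data pages: compare all point pairs.
        for p in r.points:
            for q in s.points:
                if dist(p, q) <= eps:
                    out.append((p, q))
    return out

def leaf(points):
    """Build a data page whose box tightly encloses the given points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return Node(lo, hi, points=points)

def root(leaves):
    """Build a directory page over the given data pages."""
    lo = tuple(min(c) for c in zip(*(l.lo for l in leaves)))
    hi = tuple(max(c) for c in zip(*(l.hi for l in leaves)))
    return Node(lo, hi, children=leaves)

R = root([leaf([(0.0, 0.0), (0.1, 0.1)]), leaf([(0.9, 0.9), (1.0, 1.0)])])
S = root([leaf([(0.15, 0.1), (0.2, 0.2)]), leaf([(0.5, 0.5), (0.6, 0.5)])])
pairs = sim_join(R, S, 0.1, [])
print(pairs)
```

Only one of the four data page pairs passes the `mindist` test here, so the point comparisons of the other three pairs are skipped entirely.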
Cost Modeling
Single similarity queries: the access probability of pages is modeled using the concept of the Minkowski sum
Cost Modeling
Binomial formula:
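The formula itself did not survive in this transcript. As a hedged sketch, a common binomial form for the Minkowski sum of a d-dimensional cube with side length a and a sphere with radius ε is the sum over k = 0..d of C(d, k) · a^(d−k) · V_k(ε), where V_k(ε) is the volume of a k-dimensional sphere. The function names below are hypothetical. Reading the volume as an access probability additionally assumes uniformly distributed data in a unit data space and ignores boundary effects.

```python
from math import comb, gamma, pi

def sphere_volume(k, radius):
    """Volume of a k-dimensional ball; equals 1 for k = 0."""
    return pi ** (k / 2) / gamma(k / 2 + 1) * radius ** k

def minkowski_volume(d, a, eps):
    """Volume of a d-dimensional cube (side a) enlarged by a ball of radius eps."""
    return sum(comb(d, k) * a ** (d - k) * sphere_volume(k, eps)
               for k in range(d + 1))

def access_probability(d, a, eps):
    """Assumed model: probability that a range query with radius eps touches the page."""
    return min(1.0, minkowski_volume(d, a, eps))

for d in (2, 8, 16):
    print(d, access_probability(d, a=0.5, eps=0.1))
```

For d = 1 the sum reduces to a + 2ε and for d = 2 to a² + 4aε + πε², which is a quick sanity check on the terms.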
Cost Modeling
Mating probability of index pages: probability that the distance between two pages is ≤ ε
Two-fold application of the Minkowski sum
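One way to read the two-fold application, sketched under strong simplifying assumptions that are not taken from the slides: both pages are cubes with the same side length a, page positions are uniform in a unit data space, and boundary effects are ignored. The first Minkowski sum (cube with cube) yields a cube of side 2a; the second enlarges that cube by a sphere of radius ε. All function names are hypothetical.

```python
from math import comb, gamma, pi

def minkowski_volume(d, a, eps):
    """Volume of a d-dimensional cube (side a) enlarged by a ball of radius eps."""
    return sum(comb(d, k) * a ** (d - k)
               * pi ** (k / 2) / gamma(k / 2 + 1) * eps ** k
               for k in range(d + 1))

def mating_probability(d, a, eps):
    """Assumed model: probability that two cube-shaped pages lie within eps."""
    # First Minkowski sum: cube (side a) with cube (side a) gives a cube of side 2a.
    # Second Minkowski sum: that cube with a ball of radius eps.
    return min(1.0, minkowski_volume(d, 2 * a, eps))

print(mating_probability(1, 0.1, 0.05))
print(mating_probability(8, 0.3, 0.1))
```

In one dimension this gives 2a + 2ε: two intervals of length a mate exactly when their centers differ by at most a + ε.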
Page Capacity Optimization
Cost model can determine index selectivity which depends on various parameters
Page capacity (number of stored points) is an important parameter
Known from similarity search: Page capacity optimization yields considerable improvement
Analysis of the Index Overhead
Assuming 100% selectivity (index doesn't work): how much more expensive is index usage?
CPU:
• Distance between boxes is more expensive to compute than distance between points
• Smaller capacity ⇒ more box distance computations
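The effect of the capacity on the number of box distance computations can be illustrated with a simple count. This sketch assumes a single level of pages and 100% selectivity, so every pair of pages mates; the function name and the data set sizes are hypothetical.

```python
def cpu_operations(n_r, n_s, capacity):
    """Count distance computations for a join in which every page pair mates."""
    pages_r = -(-n_r // capacity)  # ceiling division
    pages_s = -(-n_s // capacity)
    box_distances = pages_r * pages_s   # index overhead
    point_distances = n_r * n_s         # same work as a plain nested loop
    return box_distances, point_distances

for capacity in (10, 100, 1000):
    boxes, points = cpu_operations(100_000, 100_000, capacity)
    print(capacity, boxes, points)
```

The number of point distance computations is independent of the capacity, while the number of box distance computations grows quadratically as the capacity shrinks.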
Analysis of the Index Overhead
Disk I/O:
• High constant cost per page access (move disk head)
• A page access is more expensive than the continuous reading of a point by a factor of 10000 / d
• Smaller capacity ⇒ more disk head movement
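A small worked example of the factor 10000 / d, with cost measured in units of "continuously reading one point". The overhead model is an assumption for illustration: each page read costs one head movement plus the transfer of its points. The function names are hypothetical.

```python
def page_access_cost(d, access_factor=10_000):
    """Cost of one page access, in units of continuously reading one point."""
    return access_factor / d

def io_overhead(d, capacity):
    """Cost of reading pages one access at a time, relative to one continuous scan."""
    return (page_access_cost(d) + capacity) / capacity

for capacity in (100, 625, 62_500):
    print(capacity, io_overhead(16, capacity))
```

For d = 16 a page access costs as much as reading 625 points, so under this model pages holding 625 points double the I/O cost of a scan, while pages holding several 10,000 points add only about one percent.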
Analysis of the Index Overhead
What selectivity is needed for the index to pay off?
Optimization
I/O cost function:
is optimized by
CPU cost function:
is optimized by:
Optimization
I/O cost:
• Large capacity optimum (typically several 10,000 points)
CPU cost:
• Small capacity optimum (typically < 100 points)
• No compromise achievable
Multipage Index (MuX)
CPU performance like a CPU-optimized index
I/O performance like an I/O-optimized index
Separate optimization
Experimental Evaluation
Figure (panels: Uniform 4D, Uniform 8D)
Experimental Evaluation
Figure (panels: CAD Data 16D, Color Images 64D)
Conclusions
Summary
• High potential for performance gains of the similarity join by page capacity optimization
• Necessary to optimize I/O and CPU separately
Future research potential
• Similarity join for metric index structures
• Approximate similarity join
• Parallel similarity join algorithms
Consequences
Assume for I/O optimization selectivity
Page accesses in a nested-block-loop-like style:
  fill cache with pages of R (1 page free);
  foreach S-page s do
    if s joins some of the cached R-pg then
      load (s);
      foreach joining R-page r in cache do
        if mindist (r, s) ≤ ε then join (r, s);
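The page access pattern above can be sketched as follows. This is an illustration under assumptions, not the authors' code: pages are modeled as one-dimensional intervals `(lo, hi)`, the outer loop over successive cache fills is made explicit, and the names `nested_block_loop`, `joins`, and `join` are hypothetical.

```python
def nested_block_loop(r_pages, s_pages, cache_size, joins, join):
    """Process R in cache-sized blocks; return the number of page loads."""
    if cache_size < 2:
        raise ValueError("need room for at least one R-page and one S-page")
    loads = 0
    block_size = cache_size - 1  # keep one cache page free for the S-page
    for start in range(0, len(r_pages), block_size):
        block = r_pages[start:start + block_size]  # fill cache with pages of R
        loads += len(block)
        for s in s_pages:
            partners = [r for r in block if joins(r, s)]
            if partners:          # load s only if it mates with a cached R-page
                loads += 1
                for r in partners:
                    join(r, s)
    return loads

eps = 0.05

def joins(r, s):
    """True if the intervals r and s are at most eps apart (mindist <= eps)."""
    return max(r[0] - s[1], s[0] - r[1], 0.0) <= eps

r_pages = [(0.0, 0.1), (0.2, 0.3), (0.4, 0.5), (0.6, 0.7)]
s_pages = [(0.1, 0.2), (0.8, 0.9)]
result = []
loads = nested_block_loop(r_pages, s_pages, cache_size=3, joins=joins,
                          join=lambda r, s: result.append((r, s)))
print(loads, result)
```

Deciding from the page regions alone whether an S-page needs to be loaded is what keeps the number of loads below that of a plain nested block loop, which would read every S-page once per cache fill.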