Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München
A Cost Model and Index Architecture for the Similarity Join
Feature Based Similarity
Simple Similarity Queries
Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
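The two query types above can be sketched as brute-force scans. This is a minimal illustration that assumes objects are feature vectors compared by Euclidean distance; the function names `range_query` and `knn_query` and the sample points are hypothetical, not taken from the slides.

```python
import heapq
import math

def range_query(points, query, eps):
    """Return all points within Euclidean distance eps of the query object."""
    return [p for p in points if math.dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical 2-D feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

An index structure replaces the full scan with a search that visits only pages whose regions can contain answers.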
Join Applications: Catalogue Matching
Catalogue matching
• E.g. astronomical catalogues (R and S)
Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε) ;
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p − q| ≤ ε then report (p,q);
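The recursive join can be sketched in Python as follows. This is an in-memory illustration under stated assumptions: the `Node` class and the helpers `leaf` and `directory` are hypothetical, both trees are assumed to have the same height (as the slide assumes), page regions are axis-aligned boxes, and the `CacheLoad` calls are omitted because nothing is read from disk.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                     # lower corner of the page region
    hi: tuple                                     # upper corner of the page region
    children: list = field(default_factory=list)  # non-empty for directory pages
    points: list = field(default_factory=list)    # non-empty for data pages

    def is_dir(self):
        return bool(self.children)

def mindist(r, s):
    """Minimum Euclidean distance between two axis-aligned boxes."""
    gaps = (max(rl - sh, sl - rh, 0.0)
            for rl, rh, sl, sh in zip(r.lo, r.hi, s.lo, s.hi))
    return math.sqrt(sum(g * g for g in gaps))

def r_tree_sim_join(r, s, eps, out):
    if r.is_dir() and s.is_dir():
        for rc in r.children:
            for sc in s.children:
                if mindist(rc, sc) <= eps:        # prune page pairs that are too far apart
                    r_tree_sim_join(rc, sc, eps, out)
    else:                                         # assume both are data pages
        for p in r.points:
            for q in s.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
    return out

def leaf(points):
    """Build a data page whose box tightly encloses its points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return Node(lo, hi, points=points)

def directory(children):
    """Build a directory page whose box encloses all child boxes."""
    lo = tuple(min(c) for c in zip(*(ch.lo for ch in children)))
    hi = tuple(max(c) for c in zip(*(ch.hi for ch in children)))
    return Node(lo, hi, children=children)

R = directory([leaf([(0.0, 0.0), (0.1, 0.1)]), leaf([(5.0, 5.0)])])
S = directory([leaf([(0.0, 0.15)]), leaf([(9.0, 9.0)])])
pairs = r_tree_sim_join(R, S, 0.2, [])
```

In this example only one of the four data page pairs passes the `mindist` test, so only that pair is joined point by point.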
Cost Modeling
Single similarity queries: the access probability of pages is modeled using the concept of the Minkowski sum
Cost Modeling
Binomial formula:
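A standard binomial form of the Minkowski sum volume is sketched here under assumptions that are not stated on the slide: the page region is a d-dimensional cube with side length a, and the query is a sphere with radius ε. The symbols a, d and V are introduced for illustration only.

```latex
V_{\oplus}(a, \varepsilon)
  = \sum_{k=0}^{d} \binom{d}{k} \, a^{d-k} \,
    \frac{\pi^{k/2}}{\Gamma\!\left(\frac{k}{2} + 1\right)} \, \varepsilon^{k}
```

The k-th term combines the (d−k)-dimensional faces of the cube with the volume of a k-dimensional sphere of radius ε. For d = 2 this gives a² + 4aε + πε².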
Cost Modeling
Mating probability of index pages:
Probability that the distance between two pages is ≤ ε
Two-fold application of the Minkowski sum
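The two-fold application can be sketched numerically. The sketch assumes cube-shaped page regions with side length `a`, uniformly distributed page positions in a data space of volume 1, and no boundary effects; these assumptions and all function names are hypothetical. The first Minkowski sum combines the two cubes into a cube of side 2a, the second enlarges that cube by a sphere of radius `eps`, using the standard binomial expansion of the volume.

```python
import math

def unit_ball_volume(k):
    """Volume of the k-dimensional unit ball (equals 1 for k = 0)."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1)

def minkowski_volume(a, eps, d):
    """Volume of a d-dimensional cube of side a enlarged by a sphere of radius eps."""
    return sum(math.comb(d, k) * a ** (d - k) * unit_ball_volume(k) * eps ** k
               for k in range(d + 1))

def access_probability(a, eps, d):
    """Probability that a range query of radius eps touches a page of side a."""
    return min(1.0, minkowski_volume(a, eps, d))

def mating_probability(a, eps, d):
    """Probability that two pages of side a lie within distance eps of each other."""
    # Cube plus cube gives a cube of side 2a; that cube plus the sphere gives the mating region.
    return min(1.0, minkowski_volume(2 * a, eps, d))

p_access = access_probability(0.1, 0.05, 8)
p_mate = mating_probability(0.1, 0.05, 8)
```

Because the mating region is built from a cube of twice the side length, two pages mate more often than a single query touches a page of the same size.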
Page Capacity Optimization
Cost model can determine index selectivity which depends on various parameters
Page capacity (number of stored points) is an important parameter
Known from similarity search: Page capacity optimization yields considerable improvement
Analysis of the Index Overhead
Assuming 100% selectivity (the index doesn't work): how much more expensive is index usage?
CPU:
• Distance between boxes is more expensive to compute than distance between points
• Smaller capacity → more box distance computations
Analysis of the Index Overhead
Disk I/O:
• High constant cost per page access (move disk head)
• Page access is by a factor of 10000 / d more expensive than continuous reading of a point
• Smaller capacity → more disk head movement
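The effect of capacity on I/O overhead can be illustrated with simple arithmetic. The sketch reads the 10000 / d factor as the fixed cost of one page access, measured in units of sequentially reading one d-dimensional point; that reading and the function names are assumptions for illustration.

```python
def page_access_cost(capacity, d, seek_factor=10000):
    """Cost of one random page access, in units of reading one point sequentially."""
    return seek_factor / d + capacity

def io_overhead_factor(capacity, d):
    """How much more a random page access costs than scanning the same points."""
    return page_access_cost(capacity, d) / capacity

small_pages = io_overhead_factor(100, 10)     # many small pages
large_pages = io_overhead_factor(10000, 10)   # few large pages
```

Under this reading, small pages pay the fixed head movement cost for few points each, which is why the I/O side favors large capacities.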
Analysis of the Index Overhead
What selectivity is needed for the index to pay off?
Optimization
I/O cost function:
is optimized by
CPU cost function:
is optimized by:
Optimization
I/O cost:
• Large capacity optimum (typically several 10,000 points)
CPU cost:
• Small capacity optimum (typically < 100 points)
No compromise achievable
Multipage Index (MuX)
CPU performance like a CPU-optimized index
I/O performance like an I/O-optimized index
Separate optimization
Experimental Evaluation
[Figure panels: Uniform 4D, Uniform 8D]
Experimental Evaluation
[Figure panels: CAD Data 16D, Color Images 64D]
Conclusions
Summary
• High potential for performance gains of the similarity join by page capacity optimization
• Necessary to separately optimize I/O and CPU
Future research potential
• Similarity join for metric index structures
• Approximate similarity join
• Parallel similarity join algorithms
Consequences
Assume for I/O optimization selectivity
Page accesses in a nested-block-loop-like style:
fill cache with pages of R (1 page free) ;
foreach S-page s do
  if s joins some of the cached R-pg then
    load (s) ;
    foreach joining R-page r in cache do
      if mindist(r,s) ≤ ε then join (r,s) ;
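The page access pattern can be sketched as a loop that counts page loads. Several details here are assumptions for illustration: the outer loop that refills the cache until all R-pages are processed, the function name `block_loop_join`, and the 1-D interval pages in the example. The `joins` predicate stands for the mindist test on page regions, which is assumed to be available without loading the page.

```python
def block_loop_join(r_pages, s_pages, cache_size, joins, join):
    """Nested-block-loop-like page access; returns the number of page loads."""
    if cache_size < 2:
        raise ValueError("cache must hold at least one R-page and one S-page")
    loads = 0
    block = cache_size - 1                       # one cache page stays free for S
    for start in range(0, len(r_pages), block):
        cached = r_pages[start:start + block]    # fill cache with pages of R
        loads += len(cached)
        for s in s_pages:
            partners = [r for r in cached if joins(r, s)]
            if partners:                         # load s only if it has a join partner
                loads += 1
                for r in partners:
                    join(r, s)
    return loads

# Hypothetical 1-D pages given as (low, high) intervals.
eps = 0.1
r_pages = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0)]
s_pages = [(1.05, 1.5), (10.0, 11.0)]
joined = []
loads = block_loop_join(
    r_pages, s_pages, cache_size=3,
    joins=lambda r, s: max(r[0] - s[1], s[0] - r[1], 0.0) <= eps,
    join=lambda r, s: joined.append((r, s)),
)
```

Each R-page is loaded exactly once, and an S-page is loaded once per cache fill in which it has at least one join partner.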