indexing for similarity search - cs.princeton.edu€¦ · reference: indexing survey • searching...
TRANSCRIPT
Indexing for Similarity Search
Qin Lv
COS598E Spring 2005
Outline
• Problem • High-dimensional indexing: previous study
– Quadtree, kd-tree, R-tree, … • Locality sensitive hashing (LSH)• Navigating nets• Cover trees• Conclusion
Problem• Feature vectors: points in high-
dimensional space• Range query• Nearest neighbor query
– K-nearest neighbor query– Approximate nearest neighbor query
• Linear scan• Indexing: preprocess/organize data points
in order to answer queries efficiently
Reference: Indexing Survey
• Searching in high-dimensional spaces -Index structures for improving the performance of multimedia databases– Christian Böhm, Stefan Berchtold, Daniel A. Keim– ACM Computing Surveys (CSUR) – Volume 33 , Issue 3, Pages: 322 - 373– September 2001
Quadtree
• At each level, splits a space into 2d equal subspaces– 4 subspaces in 2-dimensional space, hence the name
• Very simple data structure• But
– Empty spaces– Space exponential in d– Time exponential in d
k-d Tree
• Splits in one dimension each time • Adaptive: instead of splitting in the middle,
choose the split carefully • No (or less) empty spaces• Space linear to d• Exponential query time
still possible
R-tree• [Gut 84] Splits space using minimum bounding
rectangles (MBRs)• Insertion: starting from root, each time picks a
child region, splits a region when needed • Rules:
– p is contained in exactly one region
– p is contained in multiple regions
– No region contains p
R-tree (cnt’d)
• Basic problem: – Overlap at high index levels and propagates
down by misled insert operations
R*-tree• [BKSS 90] an extension of R-tree• Minimize overlap between regions
– Picks the data region that yields the smallest enlargement of overlap
– Picks the split plane that minimizes the overlap between regions• Minimize the surface of regions
– When splitting, picks the dimension that yields the smallest surface areas of all MBRs
• Minimize the volume covered by internal nodes– Picks the internal region that yields the smallest volume
enlargement• Maximize the storage utilization
– forced re-insert : certain percentage of points with the largest distances from the region center are deleted and re-inserted
R*-tree (cnt’d)
• 10% - 75% improvements over R-tree• In higher-dimensional spaces
– Deteriorated directory (internal nodes)– Needs to load the entire index in order to process
most queries• Heuristics to optimize for regions with smaller
surface is beneficial
R+-tree
• [SSH 86; SRF 87]• An overlap-free variant of R-tree• Uses forced-split to avoid overlap• High dimensionality leads to many forced
split operations• Low storage utilization
X-tree
• [BKK 96] An extension of R*-tree• Overlap-free split according to a split history• Supernodes with enlarged page capacity
X-tree (cnt’d)
• Small dimensions– similar performance to R-tree
• Medium dimensions – high performance gain compared to R*-tree
for all query types• High dimensions
– Also needs to visit large number of nodes– Linear scan is less expensive
SS-tree
• [WJ 96] uses spheres as regions – (centroid point, minimum radius)
• Insertion: at each level, chooses the child sphere whose centroid is closest to p
• Forced re-insert: 30% points with largest distances to centroid are deleted and re-inserted
SS-tree (cnt’d)
• Although spheres are theoretically superior to volume-equivalent MRBs– Overlap-free split is difficult for spheres
• Performance– Outperforms R*-tree by a factor of 2– Not as good as LSDh-tree and X-tree
SR-tree
• [KS 97] A combination of R*-tree and SS-tree– Region: intersection between a rectangle and
a sphere– 2d values for MBRs– d+1 values for spheres
• Insert and split operationssimilar to SS-tree
SR-tree (cnt’d)
• Reports better performance than SS-tree and R*-tree
• Probably not as good as LSDh-tree and X-tree
Space Filling Curves• Mappings from d-dimensional space to one-
dimensional space• Points that are close in original space are likely to be
close in the embedded space• Embedded space can be indexed by B-tree
noAccording to space filling curve
Unique due to complete, disjoint partition
YesYesUnion of rectangles
Space filling curves
yesVarianceProximity to centroidNoNoIntersect sphere/MBR
SR-tree
yesVarianceProximity to centroidNoNoSphereSS-tree
noCyclic change of dim.# of distinct values
Unique due to complete, disjoint partition
No/yesYesK-d-tree region
LSDh-tree
noSplit historySurface/overlapDead space coverage
Overlap enlargementVolume enlargementVolume
NoNoMBRX-tree
YesSurface areaOverlapDead space coverage
Overlap enlargementVolume enlargementVolume
NoNoMBRR*-tree
noVarious algorithmsVolume enlargementVolume
NoNoMBRR-tree
Re-insert
Criteria for SplitCriteria for InsertCompleteDisjointRegionName
Locality Sensitive Hashing (LSH)
• (r1,r2, p1, p2)-sensitive hashing
• L hash tables• Each hash table
examines k random bits
22
11
)]()([Pr ),( )]()([Pr ),(
pphqhthenrqBpifpphqhthenrqBpif
H
H
≤=∉≥=∈
ρ
ρ
)/(
)/(log/1ln/1ln
2/1
2
1
Bnl
Bnkpp
p
=
=
=
LSH Performance
LSH Performance (cnt’d)
Intrinsic Dimensionality
• Metric space (X, d)• Closed ball• Doubling dimension
– Minimum value ρ such that every set in X can be covered by 2ρ sets of half the diameter
• Expansion constant– Smallest value c ≥ 2 such that
|),(||)2,(| rpBcrpB SS ≤
}),(:{),( ryxdSyrxBS ≤∈=
Navigating Nets
• Leveled directed acyclic graph – Multiple paths may exist from top to a lower-level
point• Each consequent level “covers” the dataset on
a finer scale• Adjacent levels are connected by pointers
allowing for navigation between scales
Navigating Nets
• scale r navigation list of y is defined by}),(:{ 2/, ryzdYzL rry ⋅≤∈= γ
),( (2)),(,, (1)
satisfiesit if of -an is subset A
εε
ε
yBXyxdYyx
XnetXY
Yy∈⊆≥∈∀
⊆
U
. -2 theamong topointsnearby oflist a stores structure data the,every and scaleevery For
. of - a be ,}:2{Let
2/
2/
r
rri
YnetryYyrYnetrYi
∈Γ∈Ζ∈=Γ
Navigating Nets: Query
minimal is ),(for which return 5.2/set 4.
}),(),(:{set 3.proper is and ),()11(2 while2.
}{ and set 1.) 0 and (Input NNS-APPROX
,2/
max
zqdZzrr
rZqdyqdLyZZZqdr
yZrrXq
r
rrzZzr
rr
topr
r
∈=
+≤∈=>+
==>∈
∈U
ε
ε
Cover Trees• a leveled tree where each level is a “cover” for the level
beneath it– Nesting: – Covering tree: For every node , there exists a
node satisfying and exactly one such q is a parent of p
– Separation: For all nodes ,
1−iC
iC
1−⊂ ii CC1−∈ iCp
iCq∈ iqpd 2),( ≤
iCqp ∈, iqpd 2),( >
Cover Trees: Incremental Construct
found"parent no"return else (c) found"parent "return (iii)
)(Children into insert (ii) 2),( satisfying pick (i)
2),( and found"parent no" )1,,(Insert if (b)
}2),(:{set (a) else (3)
found"parent no"return then 2),( if )2(
}:)(Children{set (1)) level ,set covert ,(point Insert
1
1
qpqpdQq
QpdiQp
qpdQqQ
Qpd
QqqQiQp
ii
iii
ii
ii
i
≤∈
≤=−
≤∈=
>
∈=
−
−
Cover Trees: Query
),(minreturn (3)}2),(),(:{
:setcover next form (b) }:)Children({
ofchildren ofset heconsider t (a) down to from for (2)
set (1))point query ,ver treeNearest(co-Find
1
qpdQpdqpdQqQ
QqqQQ
iCQ
pT
ii
i
i
∞−∈
−
∞∞
+≤∈=
∈=
∞−∞=
Cover Trees: Batch Construct
><
⋅=><
−⋅>=<
Φ≠
−⋅=><
>Φ<Φ=
><
−
−
FAR , return (d) NEAR toNEAR-NEW and FAR toFAR-NEW add (v)
UNUSED), 2),,(SPLIT( FAR-NEWNEAR,-NEWlet (iv)
)Children( toCHILD add (iii) )1 FAR), NEAR,,2),,(SPLIT( , Construct(UNUSEDCHILD, (ii)
NEARin pick (i) NEAR while(c)
) Children( toSELF add (b) )1 NEAR), ,2),,(SPLIT( ,Construct( NEAR SELF, (a)
else (3),return then (2)
NEAR if (1)) level ,FAR NEAR, setspoint ,point Construct(
1
1
i
ii
i
i
i
p
qd
piqdq
q
pipdp
p
ip
Cover Trees: Batch Query
),(Nearest -All-Find
)(Children each for else (b)
),(Nearest -All-Find (iii)
}22),(min),(:{Set (ii)
}:)(Children{Set (i) then if (a)
else (2)
ofneighbor nearest theas ),(min argreturn
)(each for then if (1)
)set cover , cover tree(query Nearest -All-Find
1
1
1
21
ij
jj
ij
jijQqji
i
Qb
j
ij
pq
Qp
qpdqpdQqQ
QqqQij
abad
pLai
Qp
−
−
−
+∈−
∈
∈
++≤∈=
∈=<
∈−∞=
∞−
Cover Trees: Space Complexity
www.lems.brown.edu/vision/ independentStudy/Voctoria_covertree.ppt
Cover Trees vs. Navigating Nets
Batch Query
Query
Insertion/Removal
Construction Time
Construction Space
Navigating NetsCover Trees
nncO ln)1(
ncO ln)1(
nncO ln)1(
ncO ln)1(
ncO )1(
)( 16ncO)ln( 12 ncO
)ln( 6 nncO
)(nO
)ln( 6 ncO
7.56134.9814.414575174Bio_train
74.12885.538.92410000078Phy_test
52.2724.513.8675000078Phy_train
1.41.2080.8716984096Image
928.372301.377.958101255covtype
7.75672.9741.613965874Bio_test
2.44581.61944.060000784Mnist
6.237.6336.0572000017Letter
5.3203.238.5693774932Corel
2.63.8721.493382365Optdigits_B
19.46.6060.340749415Pendigits_B
3.00.8110.277179765Optdigits_A
15.11.3670.091349815Pendigits_A
3.00.0170.00635135Ionosphere
4.90.0470.0107689Pima
2.00.0040.00221410Glass
1.80.0030.00117814Wine
1.50.0070.0053457Bupa
1.20.00140.00121504Iris
Speedup factorBrute force (s)Cover tree (s)NdDataset
Cover Trees: Performance
Cover Trees: Performance (cnt’d)
* Results from Zhe Wang
Expansion Constant
Doubling Dimension*
0.471.383
10.9619922
6.43861
ρ2ρLog2(r)
0.00061.00046
0.231.175
4.6525.104
7.12139.33
5.6650.412
241
ρ2ρLog2(r)
0.00131.000915
0.0031.00214
0.00061.000413
0.0041.00312
0.011.0111
0.031.0210
0.051.049
0.131.098
0.151.117
0.401.326
0.781.725
2.435.404
7.55187.903
7.81224.082
3.1791
ρ2ρLog2(r)Audio
Image
Shape
* Results from Emily Huang
Conclusion
• Indexing high-dimensional data for similarity search is hard
• Better feature vectors and distance functions
• A hybrid approach?– Cover trees– Locality sensitive hashing– Linear scan