indexing for similarity search - cs.princeton.edu€¦ · reference: indexing survey • searching...

Indexing for Similarity Search

Qin Lv

COS598E Spring 2005

Outline

• Problem • High-dimensional indexing: previous study

– Quadtree, kd-tree, R-tree, … • Locality sensitive hashing (LSH)• Navigating nets• Cover trees• Conclusion

Problem• Feature vectors: points in high-

dimensional space• Range query• Nearest neighbor query

– K-nearest neighbor query– Approximate nearest neighbor query

• Linear scan• Indexing: preprocess/organize data points

in order to answer queries efficiently

Reference: Indexing Survey

• Searching in high-dimensional spaces -Index structures for improving the performance of multimedia databases– Christian Böhm, Stefan Berchtold, Daniel A. Keim– ACM Computing Surveys (CSUR) – Volume 33 , Issue 3, Pages: 322 - 373– September 2001

Quadtree

• At each level, splits a space into 2d equal subspaces– 4 subspaces in 2-dimensional space, hence the name

• Very simple data structure• But

– Empty spaces– Space exponential in d– Time exponential in d

k-d Tree

• Splits in one dimension each time • Adaptive: instead of splitting in the middle,

choose the split carefully • No (or less) empty spaces• Space linear to d• Exponential query time

still possible

R-tree• [Gut 84] Splits space using minimum bounding

rectangles (MBRs)• Insertion: starting from root, each time picks a

child region, splits a region when needed • Rules:

– p is contained in exactly one region

– p is contained in multiple regions

– No region contains p

R-tree (cnt’d)

• Basic problem: – Overlap at high index levels and propagates

down by misled insert operations

R*-tree• [BKSS 90] an extension of R-tree• Minimize overlap between regions

– Picks the data region that yields the smallest enlargement of overlap

– Picks the split plane that minimizes the overlap between regions• Minimize the surface of regions

– When splitting, picks the dimension that yields the smallest surface areas of all MBRs

• Minimize the volume covered by internal nodes– Picks the internal region that yields the smallest volume

enlargement• Maximize the storage utilization

– forced re-insert : certain percentage of points with the largest distances from the region center are deleted and re-inserted

R*-tree (cnt’d)

• 10% - 75% improvements over R-tree• In higher-dimensional spaces

– Deteriorated directory (internal nodes)– Needs to load the entire index in order to process

most queries• Heuristics to optimize for regions with smaller

surface is beneficial

R+-tree

• [SSH 86; SRF 87]• An overlap-free variant of R-tree• Uses forced-split to avoid overlap• High dimensionality leads to many forced

split operations• Low storage utilization

X-tree

• [BKK 96] An extension of R*-tree• Overlap-free split according to a split history• Supernodes with enlarged page capacity

X-tree (cnt’d)

• Small dimensions– similar performance to R-tree

• Medium dimensions – high performance gain compared to R*-tree

for all query types• High dimensions

– Also needs to visit large number of nodes– Linear scan is less expensive

SS-tree

• [WJ 96] uses spheres as regions – (centroid point, minimum radius)

• Insertion: at each level, chooses the child sphere whose centroid is closest to p

• Forced re-insert: 30% points with largest distances to centroid are deleted and re-inserted

SS-tree (cnt’d)

• Although spheres are theoretically superior to volume-equivalent MRBs– Overlap-free split is difficult for spheres

• Performance– Outperforms R*-tree by a factor of 2– Not as good as LSDh-tree and X-tree

SR-tree

• [KS 97] A combination of R*-tree and SS-tree– Region: intersection between a rectangle and

a sphere– 2d values for MBRs– d+1 values for spheres

• Insert and split operationssimilar to SS-tree

SR-tree (cnt’d)

• Reports better performance than SS-tree and R*-tree

• Probably not as good as LSDh-tree and X-tree

Space Filling Curves• Mappings from d-dimensional space to one-

dimensional space• Points that are close in original space are likely to be

close in the embedded space• Embedded space can be indexed by B-tree

noAccording to space filling curve

Unique due to complete, disjoint partition

YesYesUnion of rectangles

Space filling curves

yesVarianceProximity to centroidNoNoIntersect sphere/MBR

SR-tree

yesVarianceProximity to centroidNoNoSphereSS-tree

noCyclic change of dim.# of distinct values

Unique due to complete, disjoint partition

No/yesYesK-d-tree region

LSDh-tree

noSplit historySurface/overlapDead space coverage

Overlap enlargementVolume enlargementVolume

NoNoMBRX-tree

YesSurface areaOverlapDead space coverage

Overlap enlargementVolume enlargementVolume

NoNoMBRR*-tree

noVarious algorithmsVolume enlargementVolume

NoNoMBRR-tree

Re-insert

Criteria for SplitCriteria for InsertCompleteDisjointRegionName

Locality Sensitive Hashing (LSH)

• (r1,r2, p1, p2)-sensitive hashing

• L hash tables• Each hash table

examines k random bits

22

11

)]()([Pr ),( )]()([Pr ),(

pphqhthenrqBpifpphqhthenrqBpif

H

H

≤=∉≥=∈

ρ

ρ

)/(

)/(log/1ln/1ln

2/1

2

1

Bnl

Bnkpp

p

=

=

=

LSH Performance

LSH Performance (cnt’d)

Intrinsic Dimensionality

• Metric space (X, d)• Closed ball• Doubling dimension

– Minimum value ρ such that every set in X can be covered by 2ρ sets of half the diameter

• Expansion constant– Smallest value c ≥ 2 such that

|),(||)2,(| rpBcrpB SS ≤

}),(:{),( ryxdSyrxBS ≤∈=

Navigating Nets

• Leveled directed acyclic graph – Multiple paths may exist from top to a lower-level

point• Each consequent level “covers” the dataset on

a finer scale• Adjacent levels are connected by pointers

allowing for navigation between scales

Navigating Nets

• scale r navigation list of y is defined by}),(:{ 2/, ryzdYzL rry ⋅≤∈= γ

),( (2)),(,, (1)

satisfiesit if of -an is subset A

εε

ε

yBXyxdYyx

XnetXY

Yy∈⊆≥∈∀

⊆

U

. -2 theamong topointsnearby oflist a stores structure data the,every and scaleevery For

. of - a be ,}:2{Let

2/

2/

r

rri

YnetryYyrYnetrYi

∈Γ∈Ζ∈=Γ

Navigating Nets: Query

minimal is ),(for which return 5.2/set 4.

}),(),(:{set 3.proper is and ),()11(2 while2.

}{ and set 1.) 0 and (Input NNS-APPROX

,2/

max

zqdZzrr

rZqdyqdLyZZZqdr

yZrrXq

r

rrzZzr

rr

topr

r

∈=

+≤∈=>+

==>∈

∈U

ε

ε

Cover Trees• a leveled tree where each level is a “cover” for the level

beneath it– Nesting: – Covering tree: For every node , there exists a

node satisfying and exactly one such q is a parent of p

– Separation: For all nodes ,

1−iC

iC

1−⊂ ii CC1−∈ iCp

iCq∈ iqpd 2),( ≤

iCqp ∈, iqpd 2),( >

Cover Trees: Incremental Construct

found"parent no"return else (c) found"parent "return (iii)

)(Children into insert (ii) 2),( satisfying pick (i)

2),( and found"parent no" )1,,(Insert if (b)

}2),(:{set (a) else (3)

found"parent no"return then 2),( if )2(

}:)(Children{set (1)) level ,set covert ,(point Insert

1

1

qpqpdQq

QpdiQp

qpdQqQ

Qpd

QqqQiQp

ii

iii

ii

ii

i

≤∈

≤=−

≤∈=

>

∈=

−

−

Cover Trees: Query

),(minreturn (3)}2),(),(:{

:setcover next form (b) }:)Children({

ofchildren ofset heconsider t (a) down to from for (2)

set (1))point query ,ver treeNearest(co-Find

1

qpdQpdqpdQqQ

QqqQQ

iCQ

pT

Qq

ii

i

i

∞−∈

−

∞∞

+≤∈=

∈=

∞−∞=

Cover Trees: Batch Construct

><

⋅=><

−⋅>=<

Φ≠

−⋅=><

>Φ<Φ=

><

−

−

FAR , return (d) NEAR toNEAR-NEW and FAR toFAR-NEW add (v)

UNUSED), 2),,(SPLIT( FAR-NEWNEAR,-NEWlet (iv)

)Children( toCHILD add (iii) )1 FAR), NEAR,,2),,(SPLIT( , Construct(UNUSEDCHILD, (ii)

NEARin pick (i) NEAR while(c)

) Children( toSELF add (b) )1 NEAR), ,2),,(SPLIT( ,Construct( NEAR SELF, (a)

else (3),return then (2)

NEAR if (1)) level ,FAR NEAR, setspoint ,point Construct(

1

1

i

ii

i

i

i

p

qd

piqdq

q

pipdp

p

ip

Cover Trees: Batch Query

),(Nearest -All-Find

)(Children each for else (b)

),(Nearest -All-Find (iii)

}22),(min),(:{Set (ii)

}:)(Children{Set (i) then if (a)

else (2)

ofneighbor nearest theas ),(min argreturn

)(each for then if (1)

)set cover , cover tree(query Nearest -All-Find

1

1

1

21

ij

jj

ij

jijQqji

i

Qb

j

ij

Qq

pq

Qp

qpdqpdQqQ

QqqQij

abad

pLai

Qp

−

−

−

+∈−

∈

∈

++≤∈=

∈=<

∈−∞=

∞−

Cover Trees: Space Complexity

www.lems.brown.edu/vision/ independentStudy/Voctoria_covertree.ppt

Cover Trees vs. Navigating Nets

Batch Query

Query

Insertion/Removal

Construction Time

Construction Space

Navigating NetsCover Trees

nncO ln)1(

ncO ln)1(

nncO ln)1(

ncO ln)1(

ncO )1(

)( 16ncO)ln( 12 ncO

)ln( 6 nncO

)(nO

)ln( 6 ncO

7.56134.9814.414575174Bio_train

74.12885.538.92410000078Phy_test

52.2724.513.8675000078Phy_train

1.41.2080.8716984096Image

928.372301.377.958101255covtype

7.75672.9741.613965874Bio_test

2.44581.61944.060000784Mnist

6.237.6336.0572000017Letter

5.3203.238.5693774932Corel

2.63.8721.493382365Optdigits_B

19.46.6060.340749415Pendigits_B

3.00.8110.277179765Optdigits_A

15.11.3670.091349815Pendigits_A

3.00.0170.00635135Ionosphere

4.90.0470.0107689Pima

2.00.0040.00221410Glass

1.80.0030.00117814Wine

1.50.0070.0053457Bupa

1.20.00140.00121504Iris

Speedup factorBrute force (s)Cover tree (s)NdDataset

Cover Trees: Performance

Cover Trees: Performance (cnt’d)

* Results from Zhe Wang

Expansion Constant

Doubling Dimension*

0.471.383

10.9619922

6.43861

ρ2ρLog2(r)

0.00061.00046

0.231.175

4.6525.104

7.12139.33

5.6650.412

241

ρ2ρLog2(r)

0.00131.000915

0.0031.00214

0.00061.000413

0.0041.00312

0.011.0111

0.031.0210

0.051.049

0.131.098

0.151.117

0.401.326

0.781.725

2.435.404

7.55187.903

7.81224.082

3.1791

ρ2ρLog2(r)Audio

Image

Shape

* Results from Emily Huang

Conclusion

• Indexing high-dimensional data for similarity search is hard

• Better feature vectors and distance functions

• A hybrid approach?– Cover trees– Locality sensitive hashing– Linear scan

indexing for similarity search - cs.princeton.edu€¦ · reference: indexing survey • searching...

Documents