## Slide 1: Algorithmic High-Dimensional Geometry 1

Alex Andoni (Microsoft Research SVC)
## Slide 2: Prism of nearest neighbor search (NNS)

High-dimensional geometry → NNS, via:

- dimension reduction
- space partitions
- embeddings
- …
- ???
## Slide 3: Nearest Neighbor Search (NNS)

- Preprocess: a set of points P
- Query: given a query point q, report a point p ∈ P with the smallest distance to q
## Slide 4: Motivation

Generic setup:
- Points model objects (e.g., images)
- Distance models a (dis)similarity measure

Application areas: machine learning (k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc.

Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc.

Primitive for other problems: finding the similar pairs in a set D, clustering, …

Example (two bit strings under Hamming distance):

    q = 000000011100010100000100010100011111
    p = 000000001100000100000100110100111111
## Slide 5: 2D case

- Compute the Voronoi diagram of the point set
- Given a query q, perform point location
- Performance: space O(n), query time O(log n)
## Slide 6: High-dimensional case

All exact algorithms degrade rapidly with the dimension d.

| Algorithm | Query time | Space |
|---|---|---|
| Full indexing | O(log n · d) | n^{O(d)} (Voronoi diagram size) |
| No indexing (linear scan) | O(n · d) | O(n · d) |
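The linear-scan baseline in the table is short enough to spell out in code. A minimal NumPy sketch (the point set and query below are made up for illustration):

```python
import numpy as np

def linear_scan_nns(points, q):
    """Exact NNS by brute force: O(n * d) per query, no preprocessing."""
    dists = np.linalg.norm(points - q, axis=1)  # Euclidean distance to each point
    i = int(np.argmin(dists))
    return i, dists[i]

# Tiny example: 5 points in 3 dimensions.
P = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [3.0, 3.0, 3.0],
              [0.9, 0.1, 0.0]])
idx, d = linear_scan_nns(P, np.array([1.0, 0.0, 0.1]))
# idx is the row of P closest to the query
```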
## Slide 7: Approximate NNS

- c-approximate r-near neighbor: given a query point q, report a point p′ ∈ P with ‖p′ − q‖ ≤ cr if there exists a point p at distance ‖p − q‖ ≤ r
- Randomized: such a point is returned with 90% probability
## Slide 8: Heuristic for Exact NNS

- r-near neighbor: given a query point q, report a set R with all points p s.t. ‖p − q‖ ≤ r (each with 90% probability)
- R may also contain some c-approximate neighbors p′ with ‖p′ − q‖ ≤ cr
- Can filter out bad answers by computing the exact distances to the points in R
## Slide 9: Approximation Algorithms for NNS

A vast literature:

- Milder dependence on dimension: [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], [Chan’02], …
- Little to no dependence on dimension: [Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ’01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06], …
## Slide 10: Dimension Reduction
## Slide 11: Motivation

If high dimension is an issue, reduce it?!

- “Flatten” dimension d into dimension k ≪ d
- Not possible in general: packing bound
- But possible if we only need to preserve distances for a fixed subset of ℝ^d
- Johnson–Lindenstrauss Lemma [JL’84]
- Application: NNS in ℝ^d
  - Trivial scan: O(n · d) query time
  - Reduce to O(n · k + d · k) time after preprocessing, where O(d · k) is the time to reduce the dimension of the query point
## Slide 12: Dimension Reduction

Johnson–Lindenstrauss Lemma: there is a randomized linear map φ: ℝ^d → ℝ^k, with k = O(ε⁻² · log(1/δ)), that preserves the distance between any two fixed vectors up to a 1 ± ε factor:

‖φ(x) − φ(y)‖ = (1 ± ε) · ‖x − y‖

with probability at least 1 − δ.

- Preserves distances among n points for k = O(ε⁻² · log n)
- Time to apply the map: O(d · k)
## Slide 13

Idea: project onto a random subspace of dimension k!
## Slide 14: 1D embedding

How about one dimension (k = 1)?

Map φ(x) = ⟨g, x⟩ = Σᵢ gᵢ xᵢ, where the gᵢ are i.i.d. normal (Gaussian) random variables.

Why Gaussian? Stability property: Σᵢ gᵢ xᵢ is distributed as ‖x‖ · g, where g is also Gaussian.

Equivalently: the vector g = (g₁, …, g_d) is centrally (spherically) distributed, i.e., it has a random direction, and the projection onto a random direction depends only on the length of x.

pdf(t) = (1/√(2π)) · e^{−t²/2}
## Slide 15: 1D embedding

Map φ(x) = ⟨g, x⟩.

- Linear: φ(x − y) = φ(x) − φ(y) for any x, y
- Want: |φ(x − y)| ≈ ‖x − y‖; since φ is linear, it is enough to consider y = 0
- Claim: for any x, E[φ(x)²] = ‖x‖², with standard deviation O(‖x‖²)
- Proof (expectation): E[φ(x)²] = E[(Σᵢ gᵢ xᵢ)²] = Σᵢ xᵢ² · E[gᵢ²] = ‖x‖², using E[gᵢ gⱼ] = 0 for i ≠ j and E[gᵢ²] = 1
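The expectation claim is easy to check numerically. A small Monte Carlo sketch (the vector, dimension, and trial count are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 20
x = rng.normal(size=d)            # an arbitrary fixed vector
trials = 100_000

# phi(x) = <g, x> with a fresh Gaussian vector g per trial
g = rng.normal(size=(trials, d))
phi = g @ x

# E[phi(x)^2] should match ||x||^2
est = np.mean(phi**2)
true = np.dot(x, x)
rel_err = abs(est - true) / true
```

With 10⁵ trials the relative error is a fraction of a percent, matching the O(‖x‖²) standard deviation of a single sample.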
## Slide 16: Full Dimension Reduction

Just repeat the 1D embedding k times!

φ(x) = (1/√k) · G x, where G is a k × d matrix of i.i.d. Gaussian random variables.

Again, want to prove: ‖φ(x)‖ = (1 ± ε) · ‖x‖ for fixed x, with probability 1 − δ.
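The full construction is a few lines of NumPy. A sketch that embeds some arbitrary made-up points and checks how well pairwise distances survive (d, k, ε, and n are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(1)

d, k, eps = 1000, 500, 0.2
# Random Gaussian JL map: phi(x) = (1/sqrt(k)) * G @ x
G = rng.normal(size=(k, d)) / np.sqrt(k)

# Embed n random points and compare pairwise distances before/after.
n = 20
X = rng.normal(size=(n, d))
Y = X @ G.T

ratios = []
for i in range(n):
    for j in range(i + 1, n):
        orig = np.linalg.norm(X[i] - X[j])
        emb = np.linalg.norm(Y[i] - Y[j])
        ratios.append(emb / orig)
ratios = np.array(ratios)
# Each ratio should be close to 1 (within roughly 1 +/- eps)
```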
## Slide 17: Concentration

‖φ(x)‖² is distributed as (1/k) · Σᵢ₌₁ᵏ gᵢ² · ‖x‖², where each gᵢ is a standard Gaussian.

The norm Σᵢ₌₁ᵏ gᵢ² follows what is called the chi-squared distribution with k degrees of freedom.

Fact: chi-squared is very well concentrated: it equals k · (1 ± ε) with probability 1 − e^{−Ω(ε² k)}. Akin to the central limit theorem.
## Slide 18: Dimension Reduction: wrap-up

‖φ(x)‖ = (1 ± ε) · ‖x‖ with high probability for k = O(ε⁻² · log n).

Beyond:
- Can use ±1 entries instead of Gaussians [AMS’96, Ach’01, TZ’04, …]
- Fast JL: can compute φ(x) faster than the naive O(d · k) time [AC’06, AL’08’11, DKS’10, KN’12, …]
- Other norms, such as ℓ₁? 1-stability suggests the Cauchy distribution, but it is heavy-tailed! Essentially no analogous dimension reduction in ℓ₁ [CS’02, BC’03, LN’04, JN’10, …]. But we will see a useful substitute later! For n points and a given approximation, the achievable dimension lies between known lower and upper bounds [BC03, NR10, ANN10, …]
## Slide 19: Space Partitions
## Slide 20: Locality-Sensitive Hashing

[Indyk-Motwani’98]

A random hash function g on ℝ^d s.t. for any points p, q:

- Close: if ‖p − q‖ ≤ r, then Pr[g(p) = g(q)] = P₁ is “high”
- Far: if ‖p − q‖ > cr, then Pr[g(p) = g(q)] = P₂ is “small” (but “not-so-small”)

Use several hash tables. The quality of g is measured by the exponent ρ defined by P₁ = P₂^ρ, i.e., ρ = log(1/P₁) / log(1/P₂).
## Slide 21: NNS for Euclidean space

[Datar-Immorlica-Indyk-Mirrokni’04]

The hash function g is usually a concatenation of k “primitive” functions: g(p) = (h₁(p), …, h_k(p)).

LSH primitive h: pick a random line ℓ, project the point onto it, and quantize:

h(p) = ⌊(⟨a, p⟩ + b) / w⌋

where a is a random Gaussian vector, b is random in [0, w], and w is a parameter (e.g., 4).
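The primitive above is one line of arithmetic. A sketch in NumPy (the dimension, width, and test points are made up; the constructor name is ours, not from the paper):

```python
import numpy as np

def make_lsh_primitive(d, w=4.0, rng=None):
    """One DIIM'04-style primitive: project onto a random Gaussian
    direction, shift by a random offset in [0, w), quantize into cells."""
    rng = rng or np.random.default_rng()
    a = rng.normal(size=d)       # random Gaussian direction
    b = rng.uniform(0.0, w)      # random shift
    return lambda p: int(np.floor((np.dot(a, p) + b) / w))

rng = np.random.default_rng(42)
h = make_lsh_primitive(3, w=4.0, rng=rng)

p = np.array([1.0, 2.0, 0.5])
q_close = p + 0.01               # tiny perturbation: usually the same bucket
```

Nearby points project to nearby values, so their quantized cells agree unless the projection happens to straddle a cell boundary.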
## Slide 22: Putting it all together

The data structure is just L hash tables:
- Each hash table uses a fresh random function g(p) = (h₁(p), …, h_k(p))
- Hash all dataset points into the table

Query:
- Check for collisions in each of the L hash tables

Performance:
- Space: O(n^{1+ρ}) (plus the dataset itself)
- Query time: O(n^ρ · d)
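The whole structure (L tables, each keyed by k concatenated primitives) fits in a short class. A minimal sketch, assuming the DIIM-style primitive from the Euclidean slide; the class name, parameters, and data are all illustrative:

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Minimal LSH index sketch: L tables, each keyed by k concatenated
    primitives h(p) = floor((<a, p> + b) / w)."""

    def __init__(self, points, k=4, L=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d = points.shape[1]
        self.points = points
        # One (k x d) Gaussian projection matrix and k shifts per table.
        self.A = [rng.normal(size=(k, d)) for _ in range(L)]
        self.B = [rng.uniform(0.0, w, size=k) for _ in range(L)]
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        for i, p in enumerate(points):
            for t in range(L):
                self.tables[t][self._key(t, p)].append(i)

    def _key(self, t, p):
        return tuple(np.floor((self.A[t] @ p + self.B[t]) / self.w).astype(int))

    def query(self, q):
        """Return the indices of all points colliding with q in any table."""
        cands = set()
        for t in range(len(self.tables)):
            cands.update(self.tables[t].get(self._key(t, q), []))
        return cands

rng = np.random.default_rng(1)
P = rng.normal(size=(100, 10))
idx = LSHIndex(P, k=4, L=8)
cands = idx.query(P[0] + 0.001)   # near-duplicate of point 0
```

A real implementation would rank the candidates by exact distance; here `query` just returns the colliding set.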
## Slide 23: Analysis of LSH Scheme

Choice of parameters k, L?

- Concatenating k primitives drives both collision probabilities down: at k = 1 they are P₁ and P₂, at k = 2 they are P₁² and P₂², and so on
- Set k s.t. Pr[collision of far pair] = P₂^k = 1/n
- Then Pr[collision of close pair] = P₁^k = (P₂^ρ)^k = n^{−ρ}
- Hence L = n^ρ “repetitions” (tables) suffice!
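The parameter choice above is a two-line computation. A sketch with made-up probabilities (P₁, P₂ here are illustrative, not from any particular hash family):

```python
import math

def lsh_params(n, P1, P2):
    """Pick k so that P2^k = 1/n, then L = n^rho tables,
    where rho = log(1/P1) / log(1/P2)."""
    k = math.ceil(math.log(n) / math.log(1.0 / P2))
    rho = math.log(1.0 / P1) / math.log(1.0 / P2)
    L = math.ceil(n ** rho)
    return k, rho, L

# Illustrative values: a million points, P1 = 0.9, P2 = 0.5.
k, rho, L = lsh_params(n=1_000_000, P1=0.9, P2=0.5)
# k = 20 primitives per table, rho ~ 0.152, L = 9 tables
```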
## Slide 24: Better LSH?

[A-Indyk’06]

- Regular grid → grid of balls: p can hit empty space, so take more such grids until p is in a ball
- Need (too) many grids of balls in high dimension, so start by projecting into dimension t
- Analysis gives ρ ≈ 1/c²
- Choice of reduced dimension t? Tradeoff between the number of hash tables, n^ρ, and the time to hash, t^{O(t)}
- Total query time: d · n^{1/c² + o(1)}
![Page 25: Algorithmic High-Dimensional Geometry 1](https://reader031.vdocuments.us/reader031/viewer/2022032709/568130b0550346895d96c1e4/html5/thumbnails/25.jpg)
x
Proof idea
Claim: , i.e.,
P(r)=probability of collision when ||p-q||=r
Intuitive proof: Projection approx preserves distances
[JL] P(r) = intersection / union P(r)≈random point u beyond the dashed
line Fact (high dimensions): the x-coordinate
of u has a nearly Gaussian distribution → P(r) exp(-A·r2)
pqr
q
P(r)
u
p
![Page 26: Algorithmic High-Dimensional Geometry 1](https://reader031.vdocuments.us/reader031/viewer/2022032709/568130b0550346895d96c1e4/html5/thumbnails/26.jpg)
Open question:
More practical variant of above hashing? Design space partitioning of that is
efficient: point location in poly(t) time qualitative: regions are “sphere-like”
[Prob. needle of length 1 is not cut]
[Prob needle of length c is not cut]
≥1/c2
𝑝
![Page 27: Algorithmic High-Dimensional Geometry 1](https://reader031.vdocuments.us/reader031/viewer/2022032709/568130b0550346895d96c1e4/html5/thumbnails/27.jpg)
Time-Space Trade-offs
[AI’06]
[KOR’98, IM’98, Pan’06]
[Ind’01, Pan’06]
Space Time Comment Reference
[DIIM’04, AI’06]
[IM’98]
querytime
space
medium medium
lowhigh
highlow
ω(1) memory lookups [AIP’06]
ω(1) memory lookups [PTW’08, PTW’10]
[MNP’06, OWZ’11]
1 mem lookup
![Page 28: Algorithmic High-Dimensional Geometry 1](https://reader031.vdocuments.us/reader031/viewer/2022032709/568130b0550346895d96c1e4/html5/thumbnails/28.jpg)
LSH Zoo
28
Hamming distance : pick a random coordinate(s)
[IM98] Manhattan distance:
: random grid [AI’06] Jaccard distance between sets:
: pick a random permutation on the universe
min-wise hashing [Bro’97,BGMZ’97]
[Cha’02,…]
To be or not to be
To Simons or not to Simons
…21102…
be toornot
Sim
ons
…01122…
be toornot
Sim
ons
…11101… …01111…
{be,not,or,to} {not,or,to, Simons}
1 1
not
not
=be,to,Simons,or,not
be to
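The min-wise hashing example can be run directly. A sketch that estimates the collision probability for the two word sets above by sampling random permutations (trial count and helper name are ours):

```python
import random

def minhash_collision_prob(A, B, universe, trials=20_000, seed=0):
    """Estimate Pr[minhash(A) == minhash(B)] over random permutations
    of the universe; this should approach the Jaccard similarity."""
    rng = random.Random(seed)
    universe = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(universe)
        rank = {w: i for i, w in enumerate(universe)}
        if min(A, key=rank.get) == min(B, key=rank.get):
            hits += 1
    return hits / trials

# The slide's example sets:
A = {"be", "not", "or", "to"}
B = {"not", "or", "to", "Simons"}
jaccard = len(A & B) / len(A | B)     # 3/5 = 0.6
est = minhash_collision_prob(A, B, A | B)
# est converges to jaccard as trials grow
```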
![Page 29: Algorithmic High-Dimensional Geometry 1](https://reader031.vdocuments.us/reader031/viewer/2022032709/568130b0550346895d96c1e4/html5/thumbnails/29.jpg)
LSH in the wild
29
If want exact NNS, what is ? Can choose any parameters Correct as long as Performance:
trade-off between # tables and false positives will depend on dataset “quality” Can tune to optimize for given dataset
Further advantages: Dynamic: point insertions/deletions easy Natural to parallelize
𝐿
𝑘
safety not guaranteed
fewer false positives
fewer tables
## Slide 30: Space partitions beyond LSH

Data-dependent partitions…

Practice:
- Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees, …
- Often no guarantees

Theory:
- Better NNS by data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn]: improved exponents for ℓ₂ (cf. [AI’06, OWZ’10]) and for Hamming space (cf. [IM’98, OWZ’10])
## Slide 31: Recap for today

High-dimensional geometry → NNS, via:

- dimension reduction
- space partitions
- embedding
- …

Plus: small dimension, sketching.