Locality-Sensitive Hashing: Theory and Applications (ludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf)
TRANSCRIPT
Locality-Sensitive Hashing: Theory and Applications
Ludwig Schmidt MIT → Google Brain → UC Berkeley
Based on joint works with Alex Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Kunal Talwar.
Nearest Neighbor Search
1. Build a data structure for a given set of points in Rd.
2. (Repeated) For a given query, find the closest point in the dataset with “good” probability.
Main goal: fast queries with high accuracy.
Motivation
Large number of applications.
Data retrieval: • Images (SIFT, …) • Text (tf-idf, …) • Audio (i-vectors, …)
Sub-routine in other algorithms: • Optimization • Cryptanalysis • Classification
A Simple Problem?
In 2D: Voronoi diagram • O(n log n) setup time • O(log n) query time
Almost ideal data structure!
Problem? Many applications require high-dimensional spaces Rd.
“Curse of dimensionality”
Rescue from High Dimensionality
Real data often has structure.
Example: a significant gap between the distance from the query to the nearest neighbor and the distance to an average point.
Common Methods
Tree data structures
Locality-sensitive hashing
Vector quantization
Nearest-neighbor graphs
1. Locality-Sensitive Hashing
2. Cross-Polytope Hash
3. LSH in Neural Networks
Formal Problem
Spherical case:
• Points are on the unit sphere.
• Angles between most points are around 90°.
Why?
• Theory: the general case reduces to this case.
• Practice: a good model for many (pre-processed) instances.
[Figure: a query and its nearest neighbor on the sphere, separated by angle α]
Similarity measures (all equivalent here): • Cosine similarity • Angular distance • Euclidean distance
Why? Often (approximately) the goal in practice.
Locality-Sensitive Hashing
Introduced in [Indyk, Motwani 1998].
Main idea: partition Rd randomly such that nearby points are more likely to appear in the same cell of the partition.
What about hash functions? A random hash function = a random space partitioning.
LSH: Formal Definition
A family of hash functions is (r, cr, p1, p2)-locality-sensitive if for every p, q in Rd, the following holds:
• if ||p − q|| < r, then P[h(p) = h(q)] > p1
• if ||p − q|| > cr, then P[h(p) = h(q)] < p2
LSH: Data Structure
Multiple hash tables with independent hash functions.
Query:
1. Find candidates in the hash buckets h(q).
2. Compute exact distances for the candidates.
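The two query steps above can be sketched in a few lines (a minimal illustration using hyperplane hashes, not the talk's implementation; the class name and the parameters `k`, `L` are invented for this example):

```python
import numpy as np

class HyperplaneLSH:
    """Minimal LSH index: L tables, each keyed by k random-hyperplane bits."""

    def __init__(self, data, k=8, L=10, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data                        # (n, d) array of points
        self.planes = rng.standard_normal((L, k, data.shape[1]))
        self.tables = []
        for g in self.planes:
            table = {}
            for idx, p in enumerate(data):
                table.setdefault(self._key(g, p), []).append(idx)
            self.tables.append(table)

    @staticmethod
    def _key(g, p):
        # One bit per hyperplane: which side of the hyperplane p falls on.
        return tuple((g @ p > 0).astype(int))

    def query(self, q):
        # Step 1: gather candidates from the bucket h(q) of every table.
        cand = set()
        for g, table in zip(self.planes, self.tables):
            cand.update(table.get(self._key(g, q), []))
        if not cand:
            return None
        # Step 2: compute exact distances only for the candidates.
        cand = np.fromiter(cand, dtype=int)
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argmin(dists)]
```

More tables (larger `L`) raise the success probability; more bits per table (larger `k`) shrink the buckets and thus the candidate set.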
![Page 18: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/18.jpg)
LSH: Theory
Query time is quantified as a function of the sensitivity ρ.
Intuitively: ρ measures the gap between the “nearby” collision probability and the “far away” collision probability.
Formally: ρ = log(1/p1) / log(1/p2).
Query time is O(n^ρ), data structure size O(n^(1+ρ)).
For example: often ρ ≈ 1/c, where “nearby” means distance r and “far away” means distance cr.
Most Common LSH Family: Hyperplane LSH
Introduced in [Charikar 2002], inspired by [Goemans, Williamson 1995].
Hash function: partition the sphere with a random hyperplane.
Locality-sensitive: let α be the angle between points p and q. Then
P[h(p) = h(q)] = 1 − α/π
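The collision probability 1 − α/π is easy to check empirically (a quick Monte Carlo sketch; the angle 60° is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 50, 200_000

# Two unit vectors at a 60-degree angle.
alpha = np.pi / 3
p = np.zeros(d); p[0] = 1.0
q = np.zeros(d); q[0] = np.cos(alpha); q[1] = np.sin(alpha)

# Hyperplane hash: h(x) = sign(<g, x>) for a Gaussian vector g.
g = rng.standard_normal((trials, d))
collisions = np.mean(np.sign(g @ p) == np.sign(g @ q))

print(collisions, 1 - alpha / np.pi)  # both close to 2/3
```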
Empirical State of the Art, 2015
GloVe dataset (word embeddings): 100-dimensional, 1.2 million vectors.
Source: ann-benchmarks by Erik Bernhardsson.
SIFT dataset (image descriptors): 128-dimensional, 1 million vectors.
Source: ann-benchmarks by Erik Bernhardsson.
Annoy (Approximate Nearest Neighbors Oh Yeah)
“Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.” (at Spotify in 2013)
Algorithm: a hybrid between the hyperplane hash and a kd-tree.
Progress Since Hyperplane

| Algorithm | ρ |
| --- | --- |
| Hyperplane hash [Charikar 2002] | 1/c |
| [Andoni, Indyk 2006] | 1/c² |
| [Andoni, Indyk, Nguyen, Razenshteyn 2014] | 7/(8c²) |
| Voronoi hash [Andoni, Razenshteyn 2015] | 1/(2c²) |

Query time is O(n^ρ); c is the gap between “near” and “far”.
For near = 45°: exponent 0.42 (hyperplane) vs. 0.18 [AR’15].
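The hyperplane exponent 0.42 follows directly from ρ = log(1/p1)/log(1/p2) with the collision probability 1 − α/π, taking "near" = 45° and "far" ≈ 90° (the typical angle between random points on the sphere):

```python
import math

# Hyperplane hash: collision probability is 1 - angle/pi.
p1 = 1 - (math.pi / 4) / math.pi   # "near" = 45 degrees -> 0.75
p2 = 1 - (math.pi / 2) / math.pi   # "far" ~ 90 degrees  -> 0.5

rho = math.log(1 / p1) / math.log(1 / p2)
print(round(rho, 2))  # 0.42, the hyperplane exponent quoted above
```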
![Page 31: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/31.jpg)
Voronoi Hash
For each hash function, sample T random unit vectors g1, g2, …, gT. To hash a given point p, find the closest gi.
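As a sketch (the function name is invented; on the unit sphere "closest gi" means "largest inner product"):

```python
import numpy as np

def voronoi_hash(rng, T, d):
    """One Voronoi hash function: T random unit vectors g_1, ..., g_T."""
    G = rng.standard_normal((T, d))
    G /= np.linalg.norm(G, axis=1, keepdims=True)
    # On the unit sphere the closest g_i maximizes the inner product,
    # so one hash evaluation costs T inner products: O(dT) time.
    return lambda p: int(np.argmax(G @ p))
```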
Cost of Hash Computation
The Voronoi hash requires many inner products for good sensitivity ρ.
[Plot: sensitivity ρ (0.15–0.4) as a function of the number of parts T (10^0–10^16)]
Time complexity: O(dT).
The hyperplane hash works with a single inner product: O(d) time.
Can we get fast hash functions with good sensitivity?
1. Locality-Sensitive Hashing
2. Cross-Polytope Hash
3. LSH in Neural Networks
Cross-Polytope Hash
Cross-polytope = ℓ1 unit ball.
Hash function:
1. Apply a random rotation to the input point.
2. Map to the closest vertex of the cross-polytope.
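The two steps can be sketched as follows (a minimal illustration; a dense Gaussian matrix stands in for the random rotation, as in the analysis, and the function name is invented). The closest cross-polytope vertex ±e_i is the coordinate of largest magnitude, so there are 2d buckets:

```python
import numpy as np

def cross_polytope_hash(rng, d):
    """One CP hash function with a dense Gaussian (pseudo-)rotation."""
    A = rng.standard_normal((d, d))

    def h(p):
        y = A @ p                            # step 1: random rotation
        i = int(np.argmax(np.abs(y)))        # step 2: closest vertex +/- e_i
        return (i, int(np.sign(y[i])))       # bucket id among the 2d vertices
    return h
```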
![Page 39: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/39.jpg)
Our Contributions
1. Analyze the CP hash
2. Multiprobe for the CP hash
3. Fast Implementation
![Page 40: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/40.jpg)
Analysis
With a Gaussian random rotation, the CP hash has sensitivity ρ ≈ 1/(2c²). (Caveats: points on the sphere, r2 = √2.)
We also establish a lower bound for the trade-off between ρ and the number of parts.
[Plot: sensitivity ρ vs. number of parts T, the cross-polytope LSH curve against the lower bound]
[Figure: collision regions bounded by x = ±X1 and αx + βy = ±(αX1 + βY1)]
Multiprobe
Problem with LSH in some regimes: it requires many tables.
Example: for 10^6 points and queries at angle 45°, hyperplane LSH needs 725 tables for success probability 90%. That is more memory than the dataset itself.
Idea: use multiple hash locations in the same few tables. [Lv, Josephson, Wang, Charikar, Li 2007]
We develop a multiprobe scheme for the CP hash. How to score hash buckets?
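The talk leaves the scoring question open; one natural rule (an illustration, not necessarily the paper's scheme) ranks the alternative CP buckets by the magnitudes of the rotated query's coordinates, probing the runner-up vertices first:

```python
import numpy as np

def cp_probes(y, num_probes):
    """Rank CP-hash buckets for a rotated query y: the best bucket is the
    coordinate with the largest magnitude; further probes are the runner-up
    coordinates, scored by their magnitudes."""
    order = np.argsort(-np.abs(y))
    return [(int(i), int(np.sign(y[i]))) for i in order[:num_probes]]

probes = cp_probes(np.array([0.1, -2.0, 0.5]), 2)
print(probes)  # [(1, -1), (2, 1)]
```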
![Page 46: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/46.jpg)
Fast Implementation
Algorithmic side: use fast pseudo-random rotations, similar to the fast JL transform [Ailon, Chazelle 2009]. Overall O(d log d) time for one hash function: we get 2d hash cells in O(d log d) time.
Analysis: [Kennedy, Ward 2016], [Choromanski, Fagan, Gouy-Pailler, Morvan, Sarlós, Atif 2016]
Implementation side: C++, vectorized code (AVX), etc.
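A common way to build such a pseudo-random rotation (a sketch under the usual fast-JL recipe, with invented function names; d must be a power of two) alternates random sign flips with fast Walsh–Hadamard transforms, i.e. x → HD₃HD₂HD₁x in O(d log d):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d); len(x) a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(len(x))   # orthonormal: preserves the Euclidean norm

def pseudo_rotation(rng, d, rounds=3):
    """x -> HD_3 HD_2 HD_1 x: random sign flips alternated with Hadamard."""
    signs = rng.choice([-1.0, 1.0], size=(rounds, d))
    def rotate(x):
        for s in signs:
            x = fwht(s * x)
        return x
    return rotate
```

Since each HD factor is orthogonal, the composition preserves norms exactly while being far cheaper than a dense d×d rotation.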
![Page 49: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/49.jpg)
Experiments vs Hyperplane
![Page 50: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/50.jpg)
Experiments on GloVe
![Page 51: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/51.jpg)
Experiments vs Annoy
![Page 52: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/52.jpg)
Library: FALCONN (Fast Approximate Look-up of COsine Nearest Neighbors)
![Page 53: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/53.jpg)
Since then: Graph Search
![Page 54: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/54.jpg)
Since then: Graph Search
But: 1000x longer setup time
![Page 55: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/55.jpg)
1. Locality-Sensitive Hashing
2. Cross-Polytope Hash
3. LSH in Neural Networks
![Page 56: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/56.jpg)
Why Nearest Neighbors and Neural Networks?
Feature vectors for images, audio, text, etc. are now often generated by deep neural networks.
Neural networks are also often used with many output classes: language models, recommender systems, image annotation.
Neural Networks with Many Output Classes
Input x → embedding f(x), computed by the lower layers → top layer (fully connected, one vector ci per class) → softmax over C = #classes.
Softmax function:
p(class i | x) = exp(f(x)^T ci) / Σ_{j=1..C} exp(f(x)^T cj)
Prediction Problem
Input: the class vectors ci.
Goal: given a new embedding f(x), quickly find the class vector with the maximum inner product.
This is nearest neighbor search under maximum inner product “similarity”. [Vijayanarasimhan, Shlens, Monga, Yagnik 2015], [Spring, Shrivastava 2016]
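A small numerical check of why this reduces to angular nearest neighbor once the class vectors have equal norms (which is what the normalization discussed later enforces): with unit-norm ci, the class of maximum inner product and the class of minimum angle coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 1000, 64

# Class vectors with equal (unit) norms.
c = rng.standard_normal((C, d))
c /= np.linalg.norm(c, axis=1, keepdims=True)

f = rng.standard_normal(d)  # a query embedding f(x)

# Maximum inner product vs. minimum angle pick the same class.
by_ip = np.argmax(c @ f)
cosines = np.clip(c @ f / np.linalg.norm(f), -1, 1)
by_angle = np.argmin(np.arccos(cosines))
assert by_ip == by_angle
```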
![Page 62: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/62.jpg)
LSH in Neural Networks
Crucial property: the angle to the nearest neighbor.
Result: 10× faster softmax.
Training for Larger Angles
Goal of training the network: minimize the cross-entropy loss
loss(x, y) ≈ −f(x)^T cy = −‖f(x)‖ · ‖cy‖ · cos ∠(f(x), cy)
Constrain ‖f(x)‖ via batch normalization and ‖cy‖ via projected gradient descent. Result: larger angles.
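The projection step of projected gradient descent can be sketched as follows (a minimal illustration; `project_rows` is an invented helper name): after each gradient update, any class vector whose norm exceeds the allowed radius is rescaled back onto the ball.

```python
import numpy as np

def project_rows(C, radius):
    """Projection step of projected gradient descent: rescale every row
    of C whose Euclidean norm exceeds `radius`; leave the rest unchanged."""
    norms = np.maximum(np.linalg.norm(C, axis=1, keepdims=True), 1e-12)
    return C * np.minimum(1.0, radius / norms)

C = np.array([[3.0, 4.0],   # norm 5 -> projected to norm 1
              [0.3, 0.4]])  # norm 0.5 -> unchanged
print(np.linalg.norm(project_rows(C, 1.0), axis=1))  # [1.  0.5]
```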
![Page 68: Locality-Sensitive Hashing: Theory and Applicationsludo.mit.edu › ~ludo › iowa_talk_2017_lsh.pdf · many inner products for good sensitivity ρ. 100 104 108 1012 1016 0.15 0.2](https://reader033.vdocuments.us/reader033/viewer/2022060503/5f1cd4f2e989f414fb06d4b3/html5/thumbnails/68.jpg)
Experiments for Softmax Normalization
We control the norm of the class vectors ci via projected gradient descent.
This lets us use angular distance instead of maximum inner product.
Conclusions
• Locality-sensitive hashing for nearest neighbor search.
• The cross-polytope hash has good practical performance.
• LSH can be used for fast inference in deep networks.

Thank You!