TRANSCRIPT
Advanced Topics in Artificial Intelligence
Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presenter
Maruf Aytekin, PhD Student
Computer Engineering Department, Bahcesehir University
Apr 21, 2015
Outline
• LSH
• Locality-Sensitive Functions
• Banding Technique
• LSH Families for Cosine
• Applications of LSH
• Conclusion
LSH
One general approach to LSH:
• “Hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.
• We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair.
• We check only the candidate pairs for similarity.
LSH
• Most of the dissimilar pairs will never hash to the same bucket, and therefore will never be checked.
• Those dissimilar pairs that do hash to the same bucket are false positives: a small fraction of all pairs.
• We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the hash functions.
• Those that do not are false negatives; only a small fraction of the truly similar pairs.
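The bucketing idea above can be sketched in a few lines of Python. This is an illustration, not the paper's implementation; the hash functions here are toy stand-ins for real locality-sensitive functions, and the helper names are my own:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(items, hash_fns):
    """Pairs of item names that share a bucket under at least one hash."""
    candidates = set()
    for h in hash_fns:
        buckets = defaultdict(list)
        for name, x in items.items():
            buckets[h(x)].append(name)
        for members in buckets.values():
            # every pair inside a bucket becomes a candidate
            candidates.update(combinations(sorted(members), 2))
    return candidates

# Toy items and toy hashes (real LSH would use locality-sensitive functions)
items = {"a": (1, 2), "b": (1, 3), "c": (9, 9)}
hashes = [lambda x: x[0], lambda x: x[1] % 2]
print(candidate_pairs(items, hashes))  # {('a', 'b'), ('b', 'c')}
```

Only the candidate pairs returned here would then be checked for actual similarity; pairs that never collide (here, a and c) are never compared.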
Locality-Sensitive Functions
In many cases, the function f will “hash” items, and the decision will be based on whether or not the result is equal.
• f(x) = f(y) means “yes; make x and y a candidate pair.”
• f(x) ≠ f(y) means “do not make x and y a candidate pair.”
A collection of functions of this form will be called a family of functions.
Locality-Sensitive Functions
Let d1 < d2 be two distances according to some distance measure d. A family F of functions is said to be (d1, d2, p1, p2)-sensitive if for every f in F:
1. If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least p1.
2. If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at most p2.
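As a concrete illustration (not from the slides): for Hamming distance on n-bit vectors, the classic bit-sampling family, where each f_i picks one random coordinate and returns x[i], is (d1, d2, 1 − d1/n, 1 − d2/n)-sensitive, because f_i(x) = f_i(y) exactly when the sampled coordinate agrees. A minimal sketch:

```python
def collision_prob(x, y):
    """Pr over a uniformly random coordinate i that x[i] == y[i].

    For bit-sampling this equals 1 - hamming(x, y) / n, which is exactly
    the p1/p2 bound at distance d1/d2."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

x = [0, 0, 0, 0]
y = [0, 0, 1, 1]   # Hamming distance 2, n = 4
print(collision_prob(x, y))  # 1 - 2/4 = 0.5
```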
Locality-Sensitive Functions
Behavior of a (d1, d2, p1, p2)-sensitive function:
• d1 and d2 can be made as close as possible.
• The penalty is that p1 and p2 become close as well.
Banding Technique
An effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each.
Dividing a signature matrix into four bands of three rows per band
Analysis of the Banding Technique
If s is the similarity of two items, the probability that their signatures become a candidate pair in at least one band is 1 − (1 − s^r)^b.
This function has the form of an S-curve:
The threshold (the value of the similarity s) at which the probability of becoming a candidate is 1/2 is a function of b and r (here b = 16, r = 4).
Analysis of the Banding Technique
Values of the S-curve for b = 20 and r = 5
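The values behind this slide can be reproduced directly from the formula 1 − (1 − s^r)^b; a quick sketch:

```python
def candidate_prob(s, b=20, r=5):
    """Probability that two items with similarity s become a candidate pair
    in at least one of b bands of r rows: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b

# The S-curve stays near 0 for low similarity and near 1 for high similarity
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s}: {candidate_prob(s):.3f}")
```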
Analysis of the Banding Technique
• Choose a threshold t that defines how similar items have to be in order to become a “candidate pair.”
• Pick b and r such that br = n, and the threshold t is approximately (1/b)^(1/r).
• If avoiding false negatives is important, select b and r to produce a threshold lower than t.
• If speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
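The rule of thumb t ≈ (1/b)^(1/r) can be checked directly; a small sketch:

```python
def approx_threshold(b, r):
    """Similarity at which the S-curve 1 - (1 - s**r)**b rises steeply."""
    return (1.0 / b) ** (1.0 / r)

print(approx_threshold(16, 4))  # ~0.5, matching the earlier plot (b=16, r=4)
print(approx_threshold(20, 5))  # ~0.55 for the b=20, r=5 S-curve
```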
LSH for Cosine
Let u be user u's rating vector, v be user v's rating vector, and r a randomly generated vector. The family of hash functions H consists of functions h_r, where

h_r(u) = 1 if u · r ≥ 0, and h_r(u) = 0 if u · r < 0.

For such a function, Pr[h_r(u) = h_r(v)] = 1 − θ(u, v)/π, where θ(u, v) is the angle between u and v, which gives the probability of u and v being declared a candidate pair.
LSH for Cosine
A new family G of hash functions g is defined, where each function g is obtained by concatenating (AND-ing) functions h1, h2, ..., hr from the family H:

g(t) = [h1(t), ..., hr(t)].

We then generate a random function g for each band (hash table) and construct b hash tables.
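A minimal sketch of this AND/OR construction for the cosine family, using random ±1 hyperplane vectors (the function and variable names are my own, not from the slides):

```python
import random

def make_g(r_vecs):
    """AND-construction: concatenate r hyperplane bits into one bucket key."""
    def g(u):
        return tuple(1 if sum(a * b for a, b in zip(u, rv)) >= 0 else 0
                     for rv in r_vecs)
    return g

def build_tables(users, b, r, dim, seed=0):
    """OR-construction: b independent hash tables, one random g per table."""
    rng = random.Random(seed)
    tables = []
    for _ in range(b):
        r_vecs = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(r)]
        g = make_g(r_vecs)
        table = {}
        for name, u in users.items():
            table.setdefault(g(u), []).append(name)
        tables.append(table)
    return tables

users = {"u1": [5, 4, 0, 4, 1], "u3": [4, 3, 0, 5, 2]}
tables = build_tables(users, b=4, r=3, dim=5)
```

At query time, a new vector is hashed by each table's g, and only the users found in the matching buckets are compared exactly.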
LSH for Cosine
Example:

u1 = [5, 4, 0, 4, 1]
u2 = [2, 1, 1, 1, 4]
u3 = [4, 3, 0, 5, 2]
u4 = [2, 1, 2, 1, 4]

r1 = [-1, 1, 1, -1, -1]
r2 = [1, 1, 1, -1, -1]
r3 = [-1, -1, 1, -1, 1]
r4 = [-1, 1, -1, 1, -1]

u1 · r1 = -6 => hr1(u1) = 0
u1 · r2 = 4 => hr2(u1) = 1
u1 · r3 = -12 => hr3(u1) = 0
u1 · r4 = 2 => hr4(u1) = 1

g(u1) = 0101
g(u2) = 0010
g(u3) = 0101
g(u4) = 0110
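The example above can be checked with a few lines of Python, using the sign convention h(u) = 1 when u · r ≥ 0:

```python
rs = [[-1, 1, 1, -1, -1],   # r1
      [1, 1, 1, -1, -1],    # r2
      [-1, -1, 1, -1, 1],   # r3
      [-1, 1, -1, 1, -1]]   # r4
users = {"u1": [5, 4, 0, 4, 1], "u2": [2, 1, 1, 1, 4],
         "u3": [4, 3, 0, 5, 2], "u4": [2, 1, 2, 1, 4]}

def h(u, r):
    """One hyperplane hash: 1 if u is on the nonnegative side of r."""
    return 1 if sum(a * b for a, b in zip(u, r)) >= 0 else 0

def g(u):
    """AND-construction: concatenate the four hyperplane bits."""
    return "".join(str(h(u, r)) for r in rs)

for name, u in users.items():
    print(name, g(u))  # u1 and u3 land in the same bucket (0101)
```

Note that u1 and u3, whose ratings point in similar directions, collide in this table, while u2 and u4 (which are similar to each other but not to u1) fall into different buckets.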
Applications of LSH
• Near neighbor search
• Entity Resolution
• Matching Fingerprints
• Matching Newspaper Articles
Thank You
Q & A