TRANSCRIPT
Advanced Topics in Artificial Intelligence
Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presenter
Maruf Aytekin, PhD Student
Computer Engineering Department, Bahcesehir University
Apr 21, 2015
Outline
• LSH
• Locality-Sensitive Functions
• Banding Technique
• LSH Families for Cosine
• Applications of LSH
• Conclusion
LSH
One general approach to LSH:
• “Hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.
• We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair.
• We check only the candidate pairs for similarity.
LSH
• Most of the dissimilar pairs will never hash to the same bucket, and therefore will never be checked.
• Those dissimilar pairs that do hash to the same bucket are false positives: a small fraction of all pairs.
• We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the hash functions.
• Those that do not are false negatives; only a small fraction of the truly similar pairs.
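The bucketing idea above can be sketched in a few lines of Python. This is an illustration, not the paper's implementation; the hash functions here are toy stand-ins for real locality-sensitive functions, and the helper names are my own:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(items, hash_fns):
    """Pairs of item names that share a bucket under at least one hash."""
    candidates = set()
    for h in hash_fns:
        buckets = defaultdict(list)
        for name, x in items.items():
            buckets[h(x)].append(name)
        for members in buckets.values():
            # every pair inside a bucket becomes a candidate
            candidates.update(combinations(sorted(members), 2))
    return candidates

# Toy items and toy hashes (real LSH would use locality-sensitive functions)
items = {"a": (1, 2), "b": (1, 3), "c": (9, 9)}
hashes = [lambda x: x[0], lambda x: x[1] % 2]
print(candidate_pairs(items, hashes))  # {('a', 'b'), ('b', 'c')}
```

Only the candidate pairs returned here would then be checked for actual similarity; pairs that never collide (here, a and c) are never compared.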
Locality-Sensitive Functions
In many cases, the function f will “hash” items, and the decision will be based on whether or not the result is equal.
• f(x) = f(y) means “yes; make x and y a candidate pair.”
• f(x) ≠ f(y) means “do not make x and y a candidate pair.”
A collection of functions of this form will be called a family of functions.
Locality-Sensitive Functions
Let d1 < d2 be two distances according to some distance measure d. A family F of functions is said to be (d1, d2, p1, p2)-sensitive if for every f in F:
1. If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least p1.
2. If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at most p2.
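As a concrete illustration (not from the slides): for Hamming distance on n-bit vectors, the classic bit-sampling family, where each f_i picks one random coordinate and returns x[i], is (d1, d2, 1 − d1/n, 1 − d2/n)-sensitive, because f_i(x) = f_i(y) exactly when the sampled coordinate agrees. A minimal sketch:

```python
def collision_prob(x, y):
    """Pr over a uniformly random coordinate i that x[i] == y[i].

    For bit-sampling this equals 1 - hamming(x, y) / n, which is exactly
    the p1/p2 bound at distance d1/d2."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

x = [0, 0, 0, 0]
y = [0, 0, 1, 1]   # Hamming distance 2, n = 4
print(collision_prob(x, y))  # 1 - 2/4 = 0.5
```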
Locality-Sensitive Functions
Behavior of a (d1, d2, p1, p2)-sensitive function:
• d1 and d2 can be made as close as possible.
• The penalty is that p1 and p2 become close as well.
Banding Technique
An effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each.
Dividing a signature matrix into four bands of three rows per band
Analysis of the Banding Technique
If s is the similarity of two items, the probability that their signatures become a candidate pair in at least one band is 1 − (1 − s^r)^b.
This function has the form of an S-curve:
The threshold (the value of the similarity s) at which the probability of becoming a candidate is 1/2 is a function of b and r (here b = 16, r = 4).
Analysis of the Banding Technique
Values of the S-curve for b = 20 and r = 5
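The values behind this slide can be reproduced directly from the formula 1 − (1 − s^r)^b; a quick sketch:

```python
def candidate_prob(s, b=20, r=5):
    """Probability that two items with similarity s become a candidate pair
    in at least one of b bands of r rows: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b

# The S-curve stays near 0 for low similarity and near 1 for high similarity
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s}: {candidate_prob(s):.3f}")
```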
Analysis of the Banding Technique
• Choose a threshold t that defines how similar items have to be in order to become a “candidate pair.”
• Pick b and r such that br = n, and the threshold t is approximately (1/b)^(1/r).
• If avoiding false negatives is important, select b and r to produce a threshold lower than t.
• If speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
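The rule of thumb t ≈ (1/b)^(1/r) can be checked directly; a small sketch:

```python
def approx_threshold(b, r):
    """Similarity at which the S-curve 1 - (1 - s**r)**b rises steeply."""
    return (1.0 / b) ** (1.0 / r)

print(approx_threshold(16, 4))  # ~0.5, matching the earlier plot (b=16, r=4)
print(approx_threshold(20, 5))  # ~0.55 for the b=20, r=5 S-curve
```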
LSH for Cosine
Let u be user u's rating vector, v be user v's rating vector, and r a randomly generated vector. The family of hash functions H consists of functions h_r, where

h_r(u) = 1 if u · r ≥ 0, and h_r(u) = 0 if u · r < 0.

For such a function, Pr[h_r(u) = h_r(v)] = 1 − θ(u, v)/π, where θ(u, v) is the angle between u and v, which gives the probability of u and v being declared a candidate pair.
LSH for Cosine
A new family G of hash functions g is defined, where each function g is obtained by concatenating (AND-ing) functions h1, h2, ..., hr from the family H:

g(t) = [h1(t), ..., hr(t)].

We then generate a random function g for each band (hash table) and construct b hash tables.
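A minimal sketch of this AND/OR construction for the cosine family, using random ±1 hyperplane vectors (the function and variable names are my own, not from the slides):

```python
import random

def make_g(r_vecs):
    """AND-construction: concatenate r hyperplane bits into one bucket key."""
    def g(u):
        return tuple(1 if sum(a * b for a, b in zip(u, rv)) >= 0 else 0
                     for rv in r_vecs)
    return g

def build_tables(users, b, r, dim, seed=0):
    """OR-construction: b independent hash tables, one random g per table."""
    rng = random.Random(seed)
    tables = []
    for _ in range(b):
        r_vecs = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(r)]
        g = make_g(r_vecs)
        table = {}
        for name, u in users.items():
            table.setdefault(g(u), []).append(name)
        tables.append(table)
    return tables

users = {"u1": [5, 4, 0, 4, 1], "u3": [4, 3, 0, 5, 2]}
tables = build_tables(users, b=4, r=3, dim=5)
```

At query time, a new vector is hashed by each table's g, and only the users found in the matching buckets are compared exactly.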
LSH for Cosine
Example:

u1 = [5, 4, 0, 4, 1]
u2 = [2, 1, 1, 1, 4]
u3 = [4, 3, 0, 5, 2]
u4 = [2, 1, 2, 1, 4]

r1 = [-1, 1, 1, -1, -1]
r2 = [1, 1, 1, -1, -1]
r3 = [-1, -1, 1, -1, 1]
r4 = [-1, 1, -1, 1, -1]

u1 · r1 = -6 => hr1(u1) = 0
u1 · r2 = 4 => hr2(u1) = 1
u1 · r3 = -12 => hr3(u1) = 0
u1 · r4 = 2 => hr4(u1) = 1

g(u1) = 0101
g(u2) = 0010
g(u3) = 0101
g(u4) = 0110
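The example above can be checked with a few lines of Python, using the sign convention h(u) = 1 when u · r ≥ 0:

```python
rs = [[-1, 1, 1, -1, -1],   # r1
      [1, 1, 1, -1, -1],    # r2
      [-1, -1, 1, -1, 1],   # r3
      [-1, 1, -1, 1, -1]]   # r4
users = {"u1": [5, 4, 0, 4, 1], "u2": [2, 1, 1, 1, 4],
         "u3": [4, 3, 0, 5, 2], "u4": [2, 1, 2, 1, 4]}

def h(u, r):
    """One hyperplane hash: 1 if u is on the nonnegative side of r."""
    return 1 if sum(a * b for a, b in zip(u, r)) >= 0 else 0

def g(u):
    """AND-construction: concatenate the four hyperplane bits."""
    return "".join(str(h(u, r)) for r in rs)

for name, u in users.items():
    print(name, g(u))  # u1 and u3 land in the same bucket (0101)
```

Note that u1 and u3, whose ratings point in similar directions, collide in this table, while u2 and u4 (which are similar to each other but not to u1) fall into different buckets.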
Applications of LSH
• Near neighbor search
• Entity Resolution
• Matching Fingerprints
• Matching Newspaper Articles
Thank You
Q & A