Finding Similar Items

Course: Big Data Processing
Professor: Amir Averbuch
Student: Nave Frost
16/11/2014

Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman

Word Count

• Problem: Given a set of strings, count how many times each string appears.
• Naive solution: Compare each string with all the other strings.
• Solution: Hash each string and compare only strings with the same hash value.
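
The hashing idea can be sketched in a few lines of Python. This is only an illustration: `count_strings` is a hypothetical helper name, and it relies on the fact that `collections.Counter` is a dictionary, so each string is hashed and compared only against strings that land in the same bucket.

```python
from collections import Counter


def count_strings(strings):
    """Count how many times each string appears, using a hash table."""
    return Counter(strings)


counts = count_strings(["to", "be", "or", "not", "to", "be"])
print(counts["to"], counts["be"], counts["or"], counts["not"])
```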

Similar Sets

Problem

Given a collection of sets, find all pairs of similar sets.

Applications

• Near-duplicate Web pages
– Plagiarism
– Mirror pages
• Collaborative filtering

Jaccard Similarity

• Definition

The Jaccard similarity of sets $S$ and $T$ is:

$$SIM(S, T) = \frac{|S \cap T|}{|S \cup T|}$$

Jaccard Similarity

• Example: $S = \{a, b, c\}$, $T = \{a, b, d, e\}$

$$SIM(S, T) = \frac{|S \cap T|}{|S \cup T|} = \frac{|\{a, b\}|}{|\{a, b, c, d, e\}|} = \frac{2}{5}$$
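
A minimal Python sketch of the definition, applied to the example sets. The function name `jaccard_similarity` and the convention of returning 1.0 for two empty sets are assumptions made here, not part of the slides.

```python
def jaccard_similarity(s, t):
    """Return |S ∩ T| / |S ∪ T| for two collections treated as sets."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(s & t) / len(union)


S = {"a", "b", "c"}
T = {"a", "b", "d", "e"}
similarity = jaccard_similarity(S, T)
print(similarity)
```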

Shingling

• Definition

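Any substring of length $k$ found within a document is called a $k$-shingle.

• Example

For $k = 2$ and the string "abcdabd", the set of 2-shingles is $\{ab, bc, cd, da, bd\}$. The shingle $ab$ occurs twice in the string but appears only once in the set.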
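
As a sketch, the shingle set can be built with a sliding window over the string. The helper name `shingles` is hypothetical, and characters are used as the unit (shingling on words would work the same way).

```python
def shingles(text, k):
    """Return the set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


two_shingles = shingles("abcdabd", 2)
print(sorted(two_shingles))
```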

Shingle Size

Too small $k$: almost all documents will look similar, because short shingles appear in most documents.

Too large $k$: documents will be similar only to nearly identical documents.

Shingle Size

𝑘 should be picked large enough that the probability of any given shingle appearing in any given document is low.

Shingle Size

Example

Let $E = \{e_1, \dots, e_N\}$ be a corpus of emails.

Assume each email contains only letters and a white-space character.

There are then $27^k$ possible $k$-shingles.

For each $e_i$, the length satisfies $|e_i| \ll 14{,}348{,}907 = 27^5$.

Hence, we would expect $k = 5$ to work well.

Hashing Shingles

To reduce the size of the $k$-shingles, use a hash function

$$h : \{k\text{-shingles}\} \to \{0, 1, \dots, 2^{32} - 1\}$$

that maps each string of length $k$ to a 4-byte integer.
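
One possible sketch of this step in Python. The slides do not name a hash function; `zlib.crc32` is used here purely as a stand-in because it returns an unsigned 32-bit value, and `hashed_shingles` is a hypothetical helper name.

```python
import zlib


def hashed_shingles(text, k):
    """Map every k-shingle of text to an integer in [0, 2**32 - 1]."""
    return {
        zlib.crc32(text[i:i + k].encode("utf-8"))  # stand-in 32-bit hash
        for i in range(len(text) - k + 1)
    }


tokens = hashed_shingles("abcdabd", 2)
print(sorted(tokens))
```

Each shingle now costs 4 bytes regardless of $k$, at the price of occasional collisions between distinct shingles.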

Signature

• Problem

Even if we hash the $k$-shingles to 4 bytes each, the space needed to store a set is still roughly four times the space taken by the document.

• Goal

1. Find a signature, i.e., a much smaller representation of each set.

2. Compare the signatures of two sets to estimate their Jaccard similarity.

Matrix Representation

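Columns: documents (sets)

Rows: elements

An entry is 1 if the row's element belongs to the column's set, and 0 otherwise.

Example

$S_1 = \{a, d\}$, $S_2 = \{c\}$, $S_3 = \{b, d, e\}$, $S_4 = \{a, c, d\}$.

      S1  S2  S3  S4
a      1   0   0   1
b      0   0   1   0
c      0   1   0   1
d      1   0   1   1
e      0   0   1   0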
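
The matrix can be derived mechanically from the four example sets. The sketch below is illustrative only; `characteristic_matrix` is a hypothetical helper, and columns are listed in the order $S_1, \dots, S_4$.

```python
def characteristic_matrix(sets, universe):
    """Rows follow `universe`, columns follow `sets`; entry is 1 on membership."""
    return [[1 if element in s else 0 for s in sets] for element in universe]


sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]  # S1..S4
universe = ["a", "b", "c", "d", "e"]
matrix = characteristic_matrix(sets, universe)
for element, row in zip(universe, matrix):
    print(element, row)
```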

Minhashing

Pick a random permutation of the rows.

The minhash value of a column is the first row, in the permuted order, in which the column has a 1.

Minhashing

Example

Permutation $h$: $b, e, a, d, c$

$h(S_1) = a$, $h(S_2) = c$, $h(S_3) = b$, $h(S_4) = a$

      S1  S2  S3  S4
a      1   0   0   1
b      0   0   1   0
c      0   1   0   1
d      1   0   1   1
e      0   0   1   0
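
A small sketch that replays this example, assuming the sets $S_1 = \{a, d\}$, $S_2 = \{c\}$, $S_3 = \{b, d, e\}$, $S_4 = \{a, c, d\}$ from the matrix slide. The function name `minhash` is a choice made here; it simply scans the elements in permuted order.

```python
def minhash(s, permutation):
    """Return the first element of `permutation` that belongs to the set s."""
    for element in permutation:
        if element in s:
            return element
    return None  # empty set: no row contains a 1


sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]  # S1..S4
values = [minhash(s, "beadc") for s in sets]
print(values)
```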

Minhashing

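• Theorem: $\Pr_h[h(S_1) = h(S_2)] = SIM(S_1, S_2)$

• Proof: Classify the rows by their entries in the columns of $S_1$ and $S_2$:
$X$: rows that have 1 in both columns.
$Y$: rows that have 1 in one of the columns and 0 in the other.
$Z$: rows that have 0 in both columns.
Denote $x = |X|$ and $y = |Y|$. Then

$$SIM(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{x}{x + y}$$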

Minhashing

• Scan the rows in the permuted order, ignoring type $Z$ rows. The probability that we meet a type $X$ row before we meet a type $Y$ row is $\frac{x}{x + y}$. In that case $h(S_1) = h(S_2)$.

• If we meet a type $Y$ row before we meet a type $X$ row, then $h(S_1) \neq h(S_2)$.

Hence, $\Pr_h[h(S_1) = h(S_2)] = \frac{x}{x + y}$.
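
For a small universe the statement can be checked exactly by enumerating every permutation. The sketch below is only a sanity check, not part of the proof; it assumes the example sets $S_1 = \{a, d\}$ and $S_4 = \{a, c, d\}$ from the matrix slide, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def minhash(s, order):
    """First element of `order` that belongs to s (None for an empty set)."""
    return next((element for element in order if element in s), None)


def agreement_probability(s1, s2, universe):
    """Exact fraction of permutations on which the two minhash values agree."""
    orders = list(permutations(universe))
    agree = sum(1 for order in orders if minhash(s1, order) == minhash(s2, order))
    return Fraction(agree, len(orders))


S1, S4 = {"a", "d"}, {"a", "c", "d"}
probability = agreement_probability(S1, S4, "abcde")
jaccard = Fraction(len(S1 & S4), len(S1 | S4))
print(probability, jaccard)
```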

Locality Sensitive Hashing

Generate from the collection of all elements (signatures in our example) a small list of candidate pairs: pairs of elements whose similarity must be evaluated.

Signature Matrix

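For two sets $S_1$ and $S_2$ such that $SIM(S_1, S_2) = 0.8$, the probability that $h(S_1) \neq h(S_2)$ is 0.2.

We will use $n$ permutations $h_1, \dots, h_n$ and build a signature matrix $M$ such that $M(i, j) = h_i(S_j)$.

Example: number the rows $a, \dots, e$ as $0, \dots, 4$ and simulate two permutations with the hash functions $h_1(x) = (x + 1) \bmod 5$ and $h_2(x) = (3x + 1) \bmod 5$.

      S1  S2  S3  S4
h1     1   3   0   1
h2     0   2   0   0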
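
The example signature matrix can be recomputed with a short sketch. It assumes the rows $a, \dots, e$ are numbered 0 to 4 and uses the example sets $S_1 = \{a, d\}$, $S_2 = \{c\}$, $S_3 = \{b, d, e\}$, $S_4 = \{a, c, d\}$; `signature_matrix` is a hypothetical helper name.

```python
def signature_matrix(sets, universe, hash_functions):
    """M[i][j] = minimum of hash_functions[i] over the row numbers of sets[j]."""
    row_index = {element: i for i, element in enumerate(universe)}
    return [
        [min(h(row_index[element]) for element in s) for s in sets]
        for h in hash_functions
    ]


def h1(x):
    return (x + 1) % 5


def h2(x):
    return (3 * x + 1) % 5


sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]  # S1..S4
M = signature_matrix(sets, "abcde", [h1, h2])
print(M)
```

The fraction of rows on which two columns of `M` agree is the estimate of their Jaccard similarity; with only two hash functions that estimate is of course very coarse.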

Candidate Generation

• Pick a similarity threshold 0 < 𝑡 < 1.

• We want a pair of columns $c$ and $d$ of the signature matrix $M$ to be a candidate pair if and only if their signatures agree in at least a fraction $t$ of the rows.

Partition into Bands

• Divide the signature matrix $M$ into $b$ bands of $r$ rows each.

• For each band, hash its portion of each column to a hash table with $k$ buckets.

• Candidate column pairs are those that hash to the same bucket for at least one band.
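
The three steps above can be sketched as follows. This is a simplified illustration: instead of a table with a fixed number $k$ of buckets, it keys a dictionary on the band's values directly, so two columns share a bucket only when they agree on every row of the band. The name `lsh_candidates` and the toy signatures are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, b, r):
    """signatures[j] is the signature column of set j, with b * r entries."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # a fresh hash table per band
        for column, signature in enumerate(signatures):
            key = tuple(signature[band * r:(band + 1) * r])
            buckets[key].append(column)
        for columns in buckets.values():
            candidates.update(combinations(columns, 2))
    return candidates


signatures = [
    [1, 2, 3, 4],
    [1, 2, 9, 9],
    [5, 6, 3, 4],
    [7, 7, 7, 7],
]
pairs = lsh_candidates(signatures, b=2, r=2)
print(sorted(pairs))
```

Only the candidate pairs then need their similarity evaluated, rather than all pairs of columns.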

Analysis of Banding

Let $S_1$ and $S_2$ be a pair of documents with $SIM(S_1, S_2) = s$:

1. The probability that the signatures agree in all rows of one particular band is $s^r$.
2. The probability that the signatures disagree in at least one row of a particular band is $1 - s^r$.
3. The probability that the signatures disagree in at least one row of every band is $(1 - s^r)^b$.
4. The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is $1 - (1 - s^r)^b$.

Analysis of Banding

Example

• Suppose $n = 100$ signature rows are divided into 20 bands of 5 rows each.
• Let $S_1$ and $S_2$ be 80% similar.
– Probability that $S_1, S_2$ are identical in one particular band: $0.8^5 = 0.328$
– Probability that $S_1, S_2$ are not identical in any of the 20 bands: $(1 - 0.328)^{20} = 0.00035$
• Let $S_1$ and $S_2$ be 40% similar.
– Probability that $S_1, S_2$ are identical in any one particular band: $0.4^5 \approx 0.01$
– Probability that $S_1, S_2$ are identical in at least one of the 20 bands: $\leq 20 \cdot 0.01 = 0.2$
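
The banding probabilities are easy to evaluate numerically. The sketch below uses the formula $1 - (1 - s^r)^b$ with the example's $r = 5$ and $b = 20$; the function name `candidate_probability` is a choice made here. For the 40% case it computes the exact value, which sits slightly below the union bound of 0.2 quoted in the example.

```python
def candidate_probability(s, r, b):
    """Probability that two sets with similarity s agree on at least one band."""
    return 1 - (1 - s ** r) ** b


r, b = 5, 20
band_80 = 0.8 ** r                              # identical in one given band
miss_80 = 1 - candidate_probability(0.8, r, b)  # identical in none of the bands
band_40 = 0.4 ** r
hit_40 = candidate_probability(0.4, r, b)       # exact value, below b * band_40

print(f"80% similar: one band {band_80:.3f}, missed {miss_80:.5f}")
print(f"40% similar: one band {band_40:.3f}, candidate {hit_40:.3f}")
```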

Analysis of Banding

Distance Measures

Definition

A distance measure $d(x, y)$ takes two points in a space and produces a real number, and satisfies the following axioms:

1. $d(x, y) \geq 0$
2. $d(x, y) = 0$ if and only if $x = y$
3. $d(x, y) = d(y, x)$
4. $d(x, y) \leq d(x, z) + d(z, y)$

$L_r$-norm

Definition: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}$

Interesting Cases

• $L_1$-norm: Manhattan distance, $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• $L_2$-norm: Euclidean distance, $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• $L_\infty$-norm: Max distance, $d(x, y) = \max_i |x_i - y_i|$

$L_r$-norm

Example (two points whose coordinates differ by 4 and by 3)

• $L_1$-norm: $d(x, y) = 4 + 3 = 7$
• $L_2$-norm: $d(x, y) = \sqrt{4^2 + 3^2} = 5$
• $L_\infty$-norm: $d(x, y) = \max(4, 3) = 4$
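
A short sketch of the $L_r$ family in Python. The concrete points `x = (0, 0)` and `y = (4, 3)` are an assumption chosen only so that the coordinate differences are 4 and 3; the function names are hypothetical.

```python
def lr_distance(x, y, r):
    """L_r distance: the r-th root of the sum of |x_i - y_i|**r."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)


def linf_distance(x, y):
    """L_infinity distance: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0, 0), (4, 3)  # assumed points with coordinate differences 4 and 3
print(lr_distance(x, y, 1), lr_distance(x, y, 2), linf_distance(x, y))
```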

Jaccard Distance (Sets)

• Definition: $d(S_i, S_j) = 1 - SIM(S_i, S_j)$

• Example: $S = \{a, b, c\}$, $T = \{a, b, d, e\}$

$$d(S, T) = 1 - SIM(S, T) = 1 - \frac{2}{5} = \frac{3}{5}$$

Cosine Distance (Vectors)

• Definition: $d(v_i, v_j)$ = angle between the vectors

• Example: [figure not recovered]
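
As an illustration, the angle can be computed from the dot product via $\cos\theta = \frac{u \cdot v}{\|u\| \, \|v\|}$. The sketch assumes non-zero vectors and reports the angle in degrees; `cosine_distance_degrees` is a hypothetical name.

```python
import math


def cosine_distance_degrees(u, v):
    """Angle between two non-zero vectors, in degrees (0 to 180)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


print(cosine_distance_degrees((1, 0), (0, 1)))
print(cosine_distance_degrees((1, 0), (1, 1)))
```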

Edit Distance (Strings)

• Definition: $d(Str_i, Str_j)$ = number of inserts and deletes of single characters needed to change one string into the other

• Example: $d(\text{"kitten"}, \text{"sitting"}) = 5$

1. Delete k at 0
2. Insert s at 0
3. Delete e at 4
4. Insert i at 4
5. Insert g at 6
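
With only inserts and deletes allowed, the edit distance equals $|s| + |t| - 2 \cdot |LCS(s, t)|$, where $LCS$ is a longest common subsequence. The sketch below uses that identity with a standard dynamic program; the function names are choices made for this example.

```python
def lcs_length(s, t):
    """Length of a longest common subsequence of s and t."""
    previous = [0] * (len(t) + 1)
    for a in s:
        current = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def edit_distance(s, t):
    """Insert/delete edit distance: characters outside the LCS are edited."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


distance = edit_distance("kitten", "sitting")
print(distance)
```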

Hamming Distance (Bit Vectors)

• Definition: $d(v_i, v_j)$ = number of positions in which they differ

• Example:

$v_1$ = 1 1 0 1 0 0 1
$v_2$ = 1 0 0 1 0 1 1

$d(v_1, v_2) = 2$

Locality-Sensitive Functions

Definition:

Let $d_1 < d_2$ be two distances according to some distance measure $d$.

A family of functions $H$ is said to be $(d_1, d_2, p_1, p_2)$-sensitive if for every $f \in H$:

1. If $d(x, y) \leq d_1$, then $\Pr[f(x) = f(y)] \geq p_1$.

2. If $d(x, y) \geq d_2$, then $\Pr[f(x) = f(y)] \leq p_2$.

Locality-Sensitive Functions

Locality-Sensitive Functions

Example:

The family of minhash functions is a $(d_1, d_2, 1 - d_1, 1 - d_2)$-sensitive family for any $d_1$ and $d_2$, where $0 \leq d_1 < d_2 \leq 1$.

Recall that: $\Pr_h[h(S_1) = h(S_2)] = SIM(S_1, S_2) = 1 - d(S_1, S_2)$, where $d$ is the Jaccard distance.

Improvements

Goal:

Given $H$: $(d_1, d_2, p_1, p_2)$-sensitive

Generate $H'$: $(d_1, d_2, p_1', p_2')$-sensitive

with $p_1' \approx 1$ and $p_2' \approx 0$

AND Construction

• Theorem: Given that $H$ is $(d_1, d_2, p_1, p_2)$-sensitive, we can generate $H'$ that is $(d_1, d_2, p_1^r, p_2^r)$-sensitive.

• Proof: Each $h \in H'$ consists of $r$ functions from $H$. For $h = \{h_1, \dots, h_r\}$ in $H'$, $h(x) = h(y)$ if and only if $h_i(x) = h_i(y)$ for all $i$.

OR Construction

• Theorem: Given that $H$ is $(d_1, d_2, p_1, p_2)$-sensitive, we can generate $H'$ that is $(d_1, d_2, 1 - (1 - p_1)^b, 1 - (1 - p_2)^b)$-sensitive.

• Proof: Each $h \in H'$ consists of $b$ functions from $H$. For $h = \{h_1, \dots, h_b\}$ in $H'$, $h(x) = h(y)$ if and only if $h_i(x) = h_i(y)$ for some $i$.

Composing Constructions

We can cascade AND and OR constructions in any order to make 𝑝2 close to 0 and 𝑝1 close to 1.

Composing Constructions

Example:

$H \xrightarrow{AND} H_1 \xrightarrow{OR} H_2$

AND-construction with $r = 3$.

OR-construction with $b = 5$.

Member of $H_2$ built from 15 members of $H$.

p      p^3      1 - (1 - p^3)^5
0.2    0.008    0.039
0.3    0.027    0.128
0.4    0.064    0.282
0.5    0.125    0.487
0.6    0.216    0.704
0.7    0.343    0.878
0.8    0.512    0.972
0.9    0.729    0.999

Composing Constructions

Example:

$H \xrightarrow{OR} H_1 \xrightarrow{AND} H_2$

AND-construction with $r = 5$.

OR-construction with $b = 3$.

Member of $H_2$ built from 15 members of $H$.

p      1 - (1 - p)^3    (1 - (1 - p)^3)^5
0.2    0.488            0.028
0.3    0.657            0.122
0.4    0.784            0.296
0.5    0.875            0.513
0.6    0.936            0.718
0.7    0.973            0.872
0.8    0.992            0.961
0.9    0.999            0.995
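
Both cascades can be tabulated with two one-line transforms: $p \mapsto p^r$ for the AND-construction and $p \mapsto 1 - (1 - p)^b$ for the OR-construction. The sketch below is illustrative; the function names are choices made here, and the parameters follow the two examples (AND with $r = 3$ then OR with $b = 5$, and OR with $b = 3$ then AND with $r = 5$).

```python
def and_construction(p, r):
    """All r chosen functions must agree."""
    return p ** r


def or_construction(p, b):
    """At least one of b chosen functions must agree."""
    return 1 - (1 - p) ** b


def and_then_or(p, r=3, b=5):
    return or_construction(and_construction(p, r), b)


def or_then_and(p, b=3, r=5):
    return and_construction(or_construction(p, b), r)


for tenth in range(2, 10):
    p = tenth / 10
    print(f"{p:.1f}  {and_then_or(p):.3f}  {or_then_and(p):.3f}")
```

In both tables low values of $p$ are pushed toward 0 and high values toward 1, which is the S-curve behaviour the constructions are meant to produce.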

LSH For Hamming Distance

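$h(x, y)$: Hamming distance between vectors $x$ and $y$ in $d$-dimensional space.

Define: $f_i(x) = x[i]$

Hence, $f_i(x) = f_i(y)$ if and only if $x[i] = y[i]$.

For a position $i$ chosen uniformly at random: $\Pr[f_i(x) = f_i(y)] = 1 - \frac{h(x, y)}{d}$

$\{f_1, \dots, f_d\}$ is a $(d_1, d_2, 1 - \frac{d_1}{d}, 1 - \frac{d_2}{d})$-sensitive family.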
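
Because the family has only $d$ members, the agreement probability can be computed exactly by trying every position. The sketch below does so for two 7-bit vectors taken from the Hamming distance example; the helper names are hypothetical.

```python
from fractions import Fraction


def hamming_distance(x, y):
    """Number of positions in which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def agreement_probability(x, y):
    """Exact Pr[f_i(x) = f_i(y)] for i uniform over positions, f_i(x) = x[i]."""
    d = len(x)
    agree = sum(1 for i in range(d) if x[i] == y[i])
    return Fraction(agree, d)


v1 = [1, 1, 0, 1, 0, 0, 1]
v2 = [1, 0, 0, 1, 0, 1, 1]
print(hamming_distance(v1, v2), agreement_probability(v1, v2))
```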

LSH For Cosine Distance

Random Hyperplanes

To pick a random hyperplane, we pick a random vector 𝑣.

The hyperplane is then the set of points whose dot product with 𝑣 is 0.

$f_v(x) = Sign(v \cdot x)$

LSH For Cosine Distance

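$f_v(x) = f_v(y)$

$\Leftrightarrow Sign(v \cdot x) = Sign(v \cdot y)$

$\Leftrightarrow$ $x$ and $y$ are on the same side of the hyperplane

With angles measured in degrees, the family is $(d_1, d_2, \frac{180 - d_1}{180}, \frac{180 - d_2}{180})$-sensitive.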
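
A rough simulation can illustrate the collision probability $\frac{180 - \theta}{180}$ for two vectors at angle $\theta$. Everything specific in this sketch is an assumption: the random vector $v$ is drawn with independent Gaussian coordinates, a dot product of exactly 0 is counted as positive, and the trial count and seed are arbitrary. The result is an estimate, not an exact value.

```python
import random


def sign(value):
    return 1 if value >= 0 else -1  # assumption: treat 0 as positive


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hyperplane_agreement(x, y, trials=20000, seed=0):
    """Estimate Pr[Sign(v . x) == Sign(v . y)] over random vectors v."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in x]  # random hyperplane normal
        if sign(dot(v, x)) == sign(dot(v, y)):
            agree += 1
    return agree / trials


x, y = (1.0, 0.0), (1.0, 1.0)  # angle of 45 degrees
estimate = hyperplane_agreement(x, y)
expected = (180 - 45) / 180
print(estimate, expected)
```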

Thanks

Any Questions?