
Finding Similar Items

Course: Big Data Processing
Professor: Amir Averbuch
Student: Nave Frost
16/11/2014

Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman

Word Count

• Problem: Given a set of strings, count how many times each string appears.
• Naive solution: Compare each string with all the other strings.
• Solution: Hash each string and compare only strings with the same hash value.
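A minimal illustration in Python (the input strings are arbitrary examples): a dict-based Counter already implements the described scheme, hashing each string and comparing it only against strings that landed in the same bucket.

```python
from collections import Counter

# Each string is hashed; equality is checked only against strings
# that share the same hash bucket, not against every other string.
counts = Counter(["big", "data", "big", "data", "big"])
print(counts)  # Counter({'big': 3, 'data': 2})
```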

Similar Sets

Problem

Given a set of sets, find all pairs of similar sets.

Applications

• Near duplicate Web pages

– Plagiarism

– Mirror pages

• Collaborative filtering

Jaccard Similarity

• Definition

The Jaccard similarity of sets 𝑆 and 𝑇 is:

SIM(S, T) = |S ∩ T| / |S ∪ T|

Jaccard Similarity

• Example: S = {a, b, c}, T = {a, b, d, e}

SIM(S, T) = |S ∩ T| / |S ∪ T| = |{a, b}| / |{a, b, c, d, e}| = 2/5
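A direct sketch of the definition in Python, reproducing the 2/5 of the example:

```python
def jaccard_similarity(s, t):
    """|S ∩ T| / |S ∪ T| for two sets."""
    return len(s & t) / len(s | t)

S = {"a", "b", "c"}
T = {"a", "b", "d", "e"}
print(jaccard_similarity(S, T))  # 0.4 == 2/5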

Shingling

• Definition

Any substring of length k is called a k-shingle.

• Example

For k = 2 and the string "abcdabd",

the set of 2-shingles is {ab, bc, cd, da, bd}.
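A small sketch of shingle extraction, reproducing the 2-shingles of "abcdabd":

```python
def shingles(text, k):
    """Return the set of all substrings of length k (the k-shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("abcdabd", 2))  # {'ab', 'bc', 'cd', 'da', 'bd'}
```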

Shingle Size

Too small 𝑘 : All documents will be similar.

Too large 𝑘 : documents will be similar only to identical documents.

Shingle Size

𝑘 should be picked large enough that the probability of any given shingle appearing in any given document is low.

Shingle Size

Example

Let E = {e1, …, eN} be a corpus of emails.

Assume each email contains only letters and the white-space character, so there are 27^k possible shingles.

For each e_i, |e_i| ≪ 14,348,907 = 27^5.

Hence, we would expect 𝑘 = 5 to work well.

Hashing Shingles

To reduce the size of the k-shingles,

use a hash function h: k-shingles → {0, 1, …, 2^32 − 1}

that maps each string of length k to a 4-byte integer.
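One possible sketch in Python; CRC32 is used here only as a convenient stand-in for any hash into 0 … 2^32 − 1:

```python
import zlib

def hashed_shingles(text, k):
    """Map each k-shingle of text to a 32-bit integer in 0 .. 2**32 - 1."""
    return {zlib.crc32(text[i:i + k].encode()) for i in range(len(text) - k + 1)}

print(hashed_shingles("abcdabd", 2))
```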

Signature

• Problem

Even if we hash the 𝑘 − 𝑠ℎ𝑖𝑛𝑔𝑙𝑒 to 4 bytes each, the space needed to store a set is still roughly four times the space taken by the document.

• Goal

1. Find a signature, i.e., a smaller representation.

2. Compare the signatures of two sets to estimate the Jaccard similarity.

Matrix Representation

Columns: Documents

Rows: Elements

Example

S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, S4 = {a, c, d}

      S1  S2  S3  S4
a      1   0   0   1
b      0   0   1   0
c      0   1   0   1
d      1   0   1   1
e      0   0   1   0

Minhashing

Pick a permutation of the rows.

The minhash value of a column is the first row, in the permuted order, in which the column has a 1.

Minhashing

Example

Permutation h: b e a d c (the permuted order of the rows)

h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a

      S1  S2  S3  S4
a      1   0   0   1
b      0   0   1   0
c      0   1   0   1
d      1   0   1   1
e      0   0   1   0
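A sketch of minhashing with an explicit permutation, reproducing the values of the example:

```python
def minhash(column, permuted_rows):
    """First row, in permuted order, in which the column has a 1."""
    for row in permuted_rows:
        if row in column:
            return row

sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"}, "S4": {"a", "c", "d"}}
permutation = ["b", "e", "a", "d", "c"]
print({name: minhash(s, permutation) for name, s in sets.items()})
# {'S1': 'a', 'S2': 'c', 'S3': 'b', 'S4': 'a'}
```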

Minhashing

• Theorem: Pr_h[h(S1) = h(S2)] = SIM(S1, S2)

• Proof: Restrict attention to the rows of columns S1 and S2.
X: rows that have 1 in both columns.
Y: rows that have 1 in one of the columns and 0 in the other.
Z: rows that have 0 in both columns.
Denote x = |X| and y = |Y|.

SIM(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| = x / (x + y)

Minhashing

• The probability that we meet a type X row before a type Y row is x / (x + y); in that case h(S1) = h(S2).

• If we meet a type Y row before a type X row, then h(S1) ≠ h(S2).

Hence, Pr_h[h(S1) = h(S2)] = x / (x + y) = SIM(S1, S2).

Locality Sensitive Hashing

Generate from the collection of all elements (signatures in our example) a small list of candidate pairs: pairs of elements whose similarity must be evaluated.

Signature Matrix

For two sets S1 and S2 such that SIM(S1, S2) = 0.8,

the probability that h(S1) ≠ h(S2) is 0.2.

We will have n permutations h1, …, hn and will build a signature matrix M

such that M[i, j] = h_i(S_j).

Example, with hash functions simulating the permutations (rows a, …, e are numbered 0, …, 4, and M[i, j] is the minimum of h_i over the rows in which column S_j has a 1):

h1(x) = (x + 1) mod 5
h2(x) = (3x + 1) mod 5

      S1  S2  S3  S4
h1     1   3   0   1
h2     0   2   0   0
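A sketch that reproduces the signature matrix above under these assumptions (rows a–e numbered 0–4):

```python
rows = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]        # S1 .. S4
hash_funcs = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]     # h1, h2

# M[i][j] = min of h_i(row index) over the rows in which S_{j+1} has a 1
M = [[min(h(rows[e]) for e in s) for s in sets] for h in hash_funcs]
print(M)  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```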

Candidate Generation

• Pick a similarity threshold 0 < 𝑡 < 1.

• We want a pair of columns c and d of the signature matrix M to be a candidate pair if and only if their signatures agree in at least a fraction t of the rows.

Partition into Bands

• Divide the matrix M into b bands of r rows each.

• For each band, hash its portion of each column to a hash table with k buckets.

• Candidate column pairs are those that hash to the same bucket for at least one band.

Let S1 and S2 be a pair of documents with SIM(S1, S2) = s:

1. The probability that the signatures agree in all rows of one particular band is s^r.
2. The probability that the signatures do not agree in at least one row of a particular band is 1 − s^r.
3. The probability that the signatures do not agree in all rows of any of the bands is (1 − s^r)^b.
4. The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1 − (1 − s^r)^b.

Analysis of Banding

Example

• Suppose n = 100, divided into 20 bands with 5 rows each.

• Let S1 and S2 be 80% similar.
– Probability that S1, S2 are identical in one particular band: 0.8^5 ≈ 0.328
– Probability that S1, S2 agree in all the rows of none of the 20 bands: (1 − 0.328)^20 ≈ 0.00035

• Let S1 and S2 be 40% similar.
– Probability that S1, S2 are identical in any one particular band: 0.4^5 ≈ 0.01
– Probability that S1, S2 are identical in at least one of the 20 bands: ≤ 20 · 0.01 = 0.2
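These numbers follow from the formulas of the previous slide; a quick check in Python:

```python
def candidate_probability(s, r, b):
    """Probability that two columns with similarity s become a candidate pair."""
    return 1 - (1 - s ** r) ** b

r, b = 5, 20
print(0.8 ** r)                          # ~0.328   (agree in one particular band)
print((1 - 0.8 ** r) ** b)               # ~0.00035 (agree in none of the bands)
print(candidate_probability(0.4, r, b))  # ~0.186, below the union bound 20 * 0.01 = 0.2
```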

Analysis of Banding

[S-curve: the probability 1 − (1 − s^r)^b of becoming a candidate pair, plotted as a function of the similarity s]

Distance Measures

Definition

A distance measure d(x, y) takes two points in space and produces a real number, and satisfies the following axioms:

1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x)
4. d(x, y) ≤ d(x, z) + d(z, y)

L_r-norm

Definition: d(x, y) = (Σ_{i=1}^{n} |x_i − y_i|^r)^{1/r}

Interesting Cases

• L1-norm: Manhattan distance, d(x, y) = Σ_{i=1}^{n} |x_i − y_i|

• L2-norm: Euclidean distance, d(x, y) = √(Σ_{i=1}^{n} (x_i − y_i)^2)

• L∞-norm: Max distance, d(x, y) = max_i |x_i − y_i|

L_r-norm

Example (for two points in the plane whose coordinates differ by 4 and 3):

• L1-norm: d(x, y) = 4 + 3 = 7

• L2-norm: d(x, y) = √(4² + 3²) = 5

• L∞-norm: d(x, y) = max(4, 3) = 4
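A sketch of the three norms, assuming for illustration that the two points are x = (0, 0) and y = (4, 3), so the coordinate differences are 4 and 3:

```python
def l_r_distance(x, y, r):
    """L_r-norm distance: (sum of |x_i - y_i|^r) ** (1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 0), (4, 3)
print(l_r_distance(x, y, 1))                   # 7.0  (Manhattan)
print(l_r_distance(x, y, 2))                   # 5.0  (Euclidean)
print(max(abs(a - b) for a, b in zip(x, y)))   # 4    (L-infinity)
```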

Jaccard Distance (Sets)

• Definition: d(S_i, S_j) = 1 − SIM(S_i, S_j)

• Example: S = {a, b, c}, T = {a, b, d, e}

d(S, T) = 1 − SIM(S, T) = 1 − 2/5 = 3/5

Cosine Distance (Vectors)

• Definition: d(v_i, v_j) = the angle between the vectors v_i and v_j
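A sketch computing the angle between two vectors; the vectors chosen here are illustrative:

```python
import math

def cosine_distance(u, v):
    """Angle in degrees between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / (norm_u * norm_v)))

print(cosine_distance((1, 0), (1, 1)))  # 45.0
```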

Edit Distance (Strings)

• Definition: d(Str_i, Str_j) = the number of insertions and deletions needed to change one string into the other

• Example: d("kitten", "sitting") = 5
Delete k at 0, insert s at 0, delete e at 4, insert i at 4, insert g at 6.
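Since only insertions and deletions are allowed, the distance equals len(a) + len(b) − 2·LCS(a, b), where LCS is the longest common subsequence; a sketch reproducing d("kitten", "sitting") = 5:

```python
def edit_distance(a, b):
    """Insert/delete-only edit distance via the longest common subsequence."""
    # lcs[i][j] = length of the LCS of a[:i] and b[:j]
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return len(a) + len(b) - 2 * lcs[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 5
```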

Hamming Distance (Bit Vectors)

• Definition: d(v_i, v_j) = the number of positions in which they differ

• Example:
v1 = 1 1 0 1 0 0 1
v2 = 1 0 0 1 0 1 1
d(v1, v2) = 2
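A one-line sketch for the example above:

```python
def hamming_distance(v1, v2):
    """Number of positions in which the two vectors differ."""
    return sum(a != b for a, b in zip(v1, v2))

print(hamming_distance([1, 1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 0, 1, 1]))  # 2
```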

Locality-Sensitive Functions

Definition:

Let d1 < d2 be two distances according to some distance measure d.

A family of functions H is said to be (d1, d2, p1, p2)-sensitive if for every f ∈ H:

1. If d(x, y) ≤ d1, then Pr[f(x) = f(y)] ≥ p1.

2. If d(x, y) ≥ d2, then Pr[f(x) = f(y)] ≤ p2.

Locality-Sensitive Functions

Example:

The family of minhash functions is a (d1, d2, 1 − d1, 1 − d2)-sensitive family for any d1 and d2 with 0 ≤ d1 < d2 ≤ 1.

Recall that Pr_h[h(S1) = h(S2)] = SIM(S1, S2) = 1 − d(S1, S2).

Improvements

Goal:

Given H: (d1, d2, p1, p2)-sensitive,

generate H′: (d1, d2, p1′, p2′)-sensitive

with p1′ ≈ 1 and p2′ ≈ 0.

AND Construction

• Theorem: Given H that is (d1, d2, p1, p2)-sensitive, we can construct H′ that is (d1, d2, p1^r, p2^r)-sensitive.

• Proof: Each h ∈ H′ consists of r functions from H. For h = {h1, …, hr} in H′, h(x) = h(y) if and only if h_i(x) = h_i(y) for all i. Since the r functions are chosen independently, the agreement probabilities multiply, giving p1^r and p2^r.

OR Construction

• Theorem: Given H that is (d1, d2, p1, p2)-sensitive, we can construct H′ that is (d1, d2, 1 − (1 − p1)^b, 1 − (1 − p2)^b)-sensitive.

• Proof: Each h ∈ H′ consists of b functions from H. For h = {h1, …, hb} in H′, h(x) = h(y) if and only if h_i(x) = h_i(y) for some i. The probability that none of the b independent functions agree is (1 − p)^b, so the probability that at least one agrees is 1 − (1 − p)^b.

Composing Constructions

We can cascade AND and OR constructions in any order to make 𝑝2 close to 0 and 𝑝1 close to 1.

Composing Constructions

Example: apply to H an AND-construction with r = 3, giving H1, and then an OR-construction with b = 5, giving H2.

Each member of H2 is built from 15 members of H.

p      p^3     1 − (1 − p^3)^5
0.2    0.008   0.039
0.3    0.027   0.127
0.4    0.064   0.282
0.5    0.125   0.487
0.6    0.216   0.704
0.7    0.343   0.878
0.8    0.512   0.972
0.9    0.729   0.999

Composing Constructions

Example: apply to H an OR-construction with b = 3, giving H1, and then an AND-construction with r = 5, giving H2.

Each member of H2 is built from 15 members of H.

p      1 − (1 − p)^3   (1 − (1 − p)^3)^5
0.2    0.488           0.028
0.3    0.657           0.122
0.4    0.784           0.296
0.5    0.875           0.513
0.6    0.936           0.718
0.7    0.973           0.872
0.8    0.992           0.961
0.9    0.999           0.995
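Both tables can be reproduced by composing the AND and OR formulas; a sketch:

```python
def and_construction(p, r):
    """Agreement probability after an r-way AND construction."""
    return p ** r

def or_construction(p, b):
    """Agreement probability after a b-way OR construction."""
    return 1 - (1 - p) ** b

for p in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    and_then_or = or_construction(and_construction(p, 3), 5)   # first table
    or_then_and = and_construction(or_construction(p, 3), 5)   # second table
    print(f"{p:.1f}  {and_then_or:.3f}  {or_then_and:.3f}")
```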

LSH For Hamming Distance

h(x, y): the Hamming distance between vectors x and y in d-dimensional space.

Define f_i(x) = x[i], the i-th coordinate of x.

Then f_i(x) = f_i(y) if and only if x[i] = y[i], so

Pr[f_i(x) = f_i(y)] = 1 − h(x, y)/d.

Hence {f_1, …, f_d} is a (d1, d2, 1 − d1/d, 1 − d2/d)-sensitive family.
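A sketch of this bit-sampling family; the vectors and the number of sampled coordinates are illustrative:

```python
import random

def make_bit_sampler(i):
    """f_i(x) = x[i]; f_i(x) == f_i(y) with probability 1 - h(x, y)/d."""
    return lambda x: x[i]

d = 8
x = [1, 0, 1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 0, 1, 1, 0]   # Hamming distance 2 from x

sampled = [make_bit_sampler(i) for i in random.sample(range(d), 3)]
print([f(x) == f(y) for f in sampled])
```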

LSH For Cosine Distance

Random Hyperplanes

To pick a random hyperplane, we pick a random vector 𝑣.

The hyperplane is then the set of points whose dot product with 𝑣 is 0.

f_v(x) = sign(v · x)

LSH For Cosine Distance

f_v(x) = f_v(y)

⟺ sign(v · x) = sign(v · y)

⟺ x and y are on the same side of the hyperplane.

The family of functions f_v is (d1, d2, (180 − d1)/180, (180 − d2)/180)-sensitive.
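A sketch of the random-hyperplane hash f_v(x) = sign(v · x); the test vectors are illustrative:

```python
import random

def random_hyperplane_hash(dim):
    """Return f_v(x) = sign(v . x) for a randomly chosen vector v."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    return lambda x: 1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1

f = random_hyperplane_hash(3)
x, y = (1.0, 2.0, 0.5), (0.9, 2.1, 0.4)   # two nearly parallel vectors
print(f(x) == f(y))   # True with probability 1 - angle(x, y)/180
```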

Thanks

Any Questions?
