![Page 1: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/1.jpg)
Document duplication(exact or approximate)
Paolo FerraginaDipartimento di Informatica
Università di Pisa
Slides only!
![Page 2: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/2.jpg)
Duplicate documents
The web is full of duplicated content Few exact duplicate detection Many cases of near duplicates
E.g., Last modified date the only difference between two copies of a page
Sec. 19.6
![Page 3: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/3.jpg)
Near-Duplicate Detection
Problem Given a large collection of documents Identify the near-duplicate documents
Web search engines Proliferation of near-duplicate documents
Legitimate – mirrors, local copies, updates, … Malicious – spam, spider-traps, dynamic URLs, … Mistaken – spider errors
30% of web-pages are near-duplicates [1997]
![Page 4: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/4.jpg)
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible
Stream Processing: once sketch computed, source is
unavailable
Error Guarantees problem scale small biases have large impact need formal guarantees – heuristics will not do
![Page 5: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/5.jpg)
Natural Approaches
Fingerprinting: only works for exact matches Karp Rabin (rolling hash) – collision probability
guarantees MD5 – cryptographically-secure string hashes
Edit-distance metric for approximate string-matching expensive – even for one pair of documents impossible – for billion web documents
Random Sampling sample substrings (phrases, sentences, etc) hope: similar documents similar samples But – even samples of same document will differ
![Page 6: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/6.jpg)
Karp-Rabin Fingerprints
Consider – m-bit string A = 1 a1 a2 … am
Basic values: Choose a prime p in the universe U ≈ 264
Fingerprint: f(A) = A mod p
Rolling hash given B = a2 … am am+1
f(B) = [2m-1 (A – 2m - a1 2m-1) + 2m + am+1 ] mod p
Prob[false hit] = Prob p divides (A-B) = #div(A-B)/ #prime(U)
< (log (A+B)) / #prime(U) ) ≈ (m log U)/U
![Page 7: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/7.jpg)
Basic Idea [Broder 1997]
Shingling dissect document into q-grams (shingles) represent documents by shingle-sets reduce problem to set intersection
[ Jaccard ] They are near-duplicates if large shingle-sets
intersect enough
![Page 8: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/8.jpg)
#1. Doc Similarity Set Intersection
DocB SB
SADocA
• Jaccard measure – similarity of SA, SB
• Claim: A & B are near-duplicates if sim(SA,SB) is high
BA
BABA SS
SS )S,sim(S
We need to cope with “Set Intersection”fingerprints of shingles (for space/time efficiency)min-hash to estimate intersections sizes (further efficiency)
![Page 9: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/9.jpg)
Multiset ofFingerprints
Doc shinglingMultiset ofShingles
fingerprint
#2. Sets of 64-bit fingerprints
Fingerprints:• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)• In practice, use 64-bit fingerprints, i.e., U=264
• Prob[collision] ≈ (8q * 64)/264 << 1
This reduces space for storing the multi-setsand the time to intersect them, but...
![Page 10: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/10.jpg)
#3. Sketch of a document
Sets are large, so their intersection is still too costly
Create a “sketch vector” (of size ~200) for each shingle-set
Documents that share ≥ t (say 80%) of the sketch-elements are claimed to be near duplicates
Sec. 19.6
![Page 11: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/11.jpg)
Sketching by Min-Hashing
Consider SA, SB {0,…,p-1}
Pick a random permutation π of the whole set P (such as ax+b mod p)
Define = min{π(SA)} , = min{π(SB)} minimal element under permutation π
Lemma: BA
BA
SS
SS β]P[α
![Page 12: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/12.jpg)
Strengthening it…
Similarity sketch sk(A) = k minimal elements under π(SA)
We might also take K permutations and the min of each
Note: we can reduce the variance by using a larger k
![Page 13: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/13.jpg)
Computing Sketch[i] for Doc1
Document 1
264
264
264
264
Start with 64-bit f(shingles)
Permute with i
Pick the min value
Sec. 19.6
![Page 14: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/14.jpg)
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Document 1 Document 2
264
264
264
264
264
264
264
264
Are these equal?
Test for 200 random permutations: , ,… 200
A B
Sec. 19.6
![Page 15: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/15.jpg)
However…
Document 1 Document 2
264
264
264
264
264
264
264
264
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
Claim: This happens with probability Size_of_intersection / Size_of_union
BA
Sec. 19.6
![Page 16: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/16.jpg)
#4. Detecting all duplicates
Brute-force (quadratic time): compare sk(A) vs. sk(B) for all the pairs of docs A and B. Still (num docs)^2 is too much computing even if it is
executed in internal memory
Locality sensitive hashing (LSH) for sk(A) sk(B) Sample h elements of sk(A) as ID (may induce
false positives) Create t IDs (to reduce the false negatives) If at least one ID matches with another one (wrt
same h-selection), then A and B are probably near-duplicates (hence compare).
![Page 17: Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!](https://reader036.vdocuments.us/reader036/viewer/2022083009/5697bf7b1a28abf838c83ab8/html5/thumbnails/17.jpg)
#4. do you implement this?GOAL: If at least one ID matches with another one (wrt same h-selection), then A and B are probably near-duplicates (hence compare).
SOL 1:Create t hash tables (with chaining), one per ID [recall that this is an h-sample of sk()].Insert the docID in the slots of each ID (using some hash)Scan every bucket and check which docID share >=1 ID.
SOL 2:Sort by each ID, and then check the consecutive equal ones.