Document duplication (exact or approximate)
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Slides only!
Duplicate documents
The web is full of duplicated content:
- few cases of exact duplicates
- many cases of near duplicates; e.g., the last-modified date is the only difference between two copies of a page
Sec. 19.6
Near-Duplicate Detection
Problem: given a large collection of documents, identify the near-duplicate ones.

Web search engines face a proliferation of near-duplicate documents:
- Legitimate: mirrors, local copies, updates, …
- Malicious: spam, spider-traps, dynamic URLs, …
- Mistaken: spider errors

30% of web pages were estimated to be near-duplicates [1997]
Desiderata
- Storage: only small sketches of each document
- Computation: the fastest possible
- Stream processing: once the sketch is computed, the source is unavailable
- Error guarantees: at this problem's scale even small biases have a large impact, so we need formal guarantees; heuristics will not do
Natural Approaches
- Fingerprinting: works only for exact matches
  - Karp-Rabin (rolling hash): collision-probability guarantees
  - MD5: cryptographically-secure string hashes
- Edit-distance metric for approximate string matching: expensive even for one pair of documents, impossible for billions of web documents
- Random sampling: sample substrings (phrases, sentences, etc.), hoping that similar documents yield similar samples; but even samples of the same document will differ
Karp-Rabin Fingerprints
Consider an m-bit string A = 1 a1 a2 … am (a leading 1 is prepended)

Basic values: choose a random prime p in the universe U ≈ 2^64

Fingerprint: f(A) = A mod p

Rolling hash: given B = 1 a2 … am a(m+1) (the window shifted by one bit),

  f(B) = [ 2 (A − 2^m − a1 2^(m−1)) + 2^m + a(m+1) ] mod p

computable in constant time from f(A)

Prob[false hit] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≤ log(A + B) / #primes(U) ≈ (m log U) / U
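The rolling computation above can be sketched in Python as follows; the function name and the bit-list input format are illustrative assumptions, not part of the slides.

```python
def rolling_fingerprints(bits, m, p):
    """Karp-Rabin fingerprints of every m-bit window of `bits`,
    where each window A = 1 a1 ... am carries a leading 1 bit."""
    # fingerprint of the first window: f(A) = A mod p
    A = 1
    for b in bits[:m]:
        A = (A * 2 + b) % p
    fps = [A]
    pow_m = pow(2, m, p)       # 2^m mod p
    pow_m1 = pow(2, m - 1, p)  # 2^(m-1) mod p
    for i in range(m, len(bits)):
        a_out, a_in = bits[i - m], bits[i]
        # f(B) = [2*(A - 2^m - a_out*2^(m-1)) + 2^m + a_in] mod p
        A = (2 * (A - pow_m - a_out * pow_m1) + pow_m + a_in) % p
        fps.append(A)
    return fps
```

Each new fingerprint costs O(1) arithmetic operations, instead of rehashing the whole m-bit window.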
Basic Idea [Broder 1997]
Shingling: dissect each document into q-grams (shingles), represent documents by their shingle-sets, and reduce the problem to set intersection.

[Jaccard] Two documents are near-duplicates if their shingle-sets intersect enough.
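A minimal sketch of shingling at the character level (the slides leave the granularity of q-grams open; characters are an assumption here):

```python
def shingles(text, q=4):
    """The set of q-gram (character-level) shingles of a document."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}
```

Word-level shingles work the same way, with tokens in place of characters.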
#1. Doc Similarity as Set Intersection

Represent DocA by its shingle-set SA, and DocB by SB.

• Jaccard measure of the similarity of SA, SB:

  sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|

• Claim: A & B are near-duplicates if sim(SA, SB) is high
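The Jaccard measure above translates directly into Python sets (a hedged sketch; the function name is an assumption):

```python
def jaccard(sa, sb):
    """sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|"""
    if not sa and not sb:
        return 1.0  # two empty sets: treat as identical
    return len(sa & sb) / len(sa | sb)
```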
We need to cope with "set intersection":
- fingerprints of shingles (for space/time efficiency)
- min-hash to estimate intersection sizes (further efficiency)

Pipeline: Doc → shingling → multiset of shingles → fingerprinting → multiset of fingerprints
#2. Sets of 64-bit fingerprints
Fingerprints:
• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
• In practice, use 64-bit fingerprints, i.e., U = 2^64
• Prob[collision] ≈ (8q · 64) / 2^64 ≪ 1

This reduces the space for storing the multisets and the time to intersect them, but…
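A hedged sketch of this step: mapping each shingle to a 64-bit fingerprint. Here a truncated cryptographic hash stands in for the Karp-Rabin fingerprint of the slides; the function name is an illustrative assumption.

```python
import hashlib

def fingerprint64(shingle):
    """Map a shingle to a 64-bit fingerprint.
    (A truncated blake2b digest stands in for Karp-Rabin here.)"""
    digest = hashlib.blake2b(shingle.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")
```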
#3. Sketch of a document
Sets are large, so their intersection is still too costly.

Create a "sketch vector" (of size ~200) for each shingle-set. Documents that share ≥ t (say 80%) of the sketch elements are claimed to be near-duplicates.
Sketching by Min-Hashing
Consider SA, SB ⊆ {0, …, p−1}

Pick a random permutation π of the whole universe P (e.g., π(x) = ax + b mod p)

Define α = min{π(SA)}, β = min{π(SB)}: the minimal elements under the permutation π

Lemma:  P[α = β] = |SA ∩ SB| / |SA ∪ SB|
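The lemma can be checked empirically by drawing many random permutations and counting how often the two minima agree (a hypothetical sketch; function name and sampling scheme are assumptions):

```python
import random

def minhash_agree_rate(sa, sb, universe, trials=20000, seed=42):
    """Empirical Pr[min(pi(SA)) == min(pi(SB))] over random permutations pi."""
    rng = random.Random(seed)
    universe = list(universe)
    hits = 0
    for _ in range(trials):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)               # a uniformly random permutation
        pi = dict(zip(universe, ranks))
        hits += min(pi[x] for x in sa) == min(pi[x] for x in sb)
    return hits / trials
```

For SA = {0,1,2} and SB = {1,2,3}, the Jaccard similarity is 2/4 = 0.5, and the agreement rate concentrates around that value.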
Strengthening it…
Similarity sketch sk(A) = the k minimal elements under π(SA).

We might also take k permutations π1, …, πk and take the min of each.

Note: we can reduce the variance of the estimator by using a larger k.
Computing Sketch[i] for Doc1
[Figure: start with the 64-bit fingerprints f(shingles) of Document 1, lying in the universe 2^64; permute them with π_i; pick the minimum value as Sketch[i].]
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: the fingerprint sets of Document 1 and Document 2 (values in 2^64), each permuted with π_i; are their minima α and β equal?]

Test for 200 random permutations: π1, π2, …, π200
However…
[Figure: the permuted fingerprint sets of Document 1 and Document 2.]

α = β iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)

Claim: this happens with probability |SA ∩ SB| / |SA ∪ SB| (size of intersection / size of union)
#4. Detecting all duplicates
Brute-force (quadratic time): compare sk(A) vs. sk(B) for all pairs of docs A and B. Still, (num docs)^2 is too much computation, even if executed in internal memory.

Locality-sensitive hashing (LSH) for sk(A) vs. sk(B):
- Sample h elements of sk(A) as an ID (this may induce false positives)
- Create t IDs (to reduce the false negatives)
- If at least one ID matches with another one (w.r.t. the same h-selection), then A and B are probably near-duplicates (hence compare them).
#4. How do you implement this?

GOAL: if at least one ID matches with another one (w.r.t. the same h-selection), then A and B are probably near-duplicates (hence compare them).

SOL 1: create t hash tables (with chaining), one per ID [recall that each ID is an h-sample of sk()]. Insert each docID in the slot of each of its IDs (using some hash). Then scan every bucket and check which docIDs share ≥ 1 ID.

SOL 2: sort by each ID, and then check the consecutive equal ones.
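SOL 1 can be sketched as follows. As a simplifying assumption, each of the t IDs is taken to be a band of h consecutive sketch positions (one fixed h-selection per table); the function name is illustrative.

```python
from collections import defaultdict

def lsh_candidates(sketches, h, t):
    """SOL 1 sketch: t hash tables, one per ID; each ID here is a band of h
    consecutive sketch positions. Docs landing in the same bucket of any
    table become candidate near-duplicate pairs (to be verified)."""
    candidates = set()
    for band in range(t):
        buckets = defaultdict(list)  # one hash table (with chaining) per ID
        for doc, sk in sketches.items():
            band_id = tuple(sk[band * h:(band + 1) * h])  # the h-sample ID
            buckets[band_id].append(doc)
        # scan every bucket: docs sharing >= 1 ID become candidate pairs
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add(tuple(sorted((docs[i], docs[j]))))
    return candidates
```

Candidate pairs are then verified by comparing their full sketches, filtering out the false positives that the h-sampling may induce.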