
Document duplication (exact or approximate)

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Slides only!

Duplicate documents

The web is full of duplicated content
• Few cases of exact duplicates
• Many cases of near-duplicates

E.g., the last-modified date is the only difference between two copies of a page

Sec. 19.6

Near-Duplicate Detection

Problem: given a large collection of documents, identify the near-duplicate documents

Web search engines: proliferation of near-duplicate documents
• Legitimate: mirrors, local copies, updates, …
• Malicious: spam, spider-traps, dynamic URLs, …
• Mistaken: spider errors

30% of web pages are near-duplicates [1997]

Desiderata

Storage: only small sketches of each document.

Computation: as fast as possible

Stream processing: once the sketch is computed, the source is unavailable

Error guarantees: at this problem scale, small biases have a large impact, so we need formal guarantees; heuristics will not do

Natural Approaches

Fingerprinting: only works for exact matches
• Karp-Rabin (rolling hash): collision probability guarantees
• MD5: cryptographic string hash

Edit-distance: a metric for approximate string matching
• expensive, even for one pair of documents
• impossible for billions of web documents

Random sampling: sample substrings (phrases, sentences, etc.)
• hope: similar documents ⇒ similar samples
• but: even two samples of the same document will differ

Karp-Rabin Fingerprints

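Consider an m-bit string $A = 1\,a_1 a_2 \ldots a_m$ (a leading 1 is prepended, so $A$ is read as an $(m+1)$-bit integer)

Basic values: choose a random prime $p$ in the universe $U \approx 2^{64}$

Fingerprint: $f(A) = A \bmod p$

Rolling hash: given $B = 1\,a_2 \ldots a_m a_{m+1}$,

$$f(B) = \left[\, 2\,(A - 2^m - a_1 2^{m-1}) + 2^m + a_{m+1} \,\right] \bmod p$$

(the same update holds with $f(A)$ in place of $A$, since everything is taken mod $p$)

$$\Pr[\text{false hit}] = \Pr[\,p \text{ divides } (A-B)\,] = \frac{\#\mathrm{div}(A-B)}{\#\mathrm{prime}(U)} < \frac{\log (A+B)}{\#\mathrm{prime}(U)} \approx \frac{m \log U}{U}$$

where $\#\mathrm{div}(x)$ is the number of prime divisors of $x$ and $\#\mathrm{prime}(U)$ is the number of primes up to $U$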
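The rolling update above can be checked with a minimal Python sketch. It assumes bits are given as a string of '0'/'1' characters and, for simplicity, uses a fixed Mersenne prime instead of a randomly chosen one; the names `fingerprint` and `roll` are hypothetical helpers, not part of the slides.

```python
# Assumption: a fixed prime (2^61 - 1) is used only for illustration;
# the analysis above requires p to be chosen at random.
P = (1 << 61) - 1

def fingerprint(bits: str, p: int = P) -> int:
    """f(A) = A mod p, where A is the integer with binary digits '1' + bits."""
    return int("1" + bits, 2) % p

def roll(f_a: int, a1: int, a_next: int, m: int, p: int = P) -> int:
    """Fingerprint of 1 a2..am a_{m+1}, given f_a = f(1 a1..am)."""
    top = pow(2, m, p)       # weight of the prepended leading 1
    lead = pow(2, m - 1, p)  # weight of the outgoing bit a1
    return (2 * (f_a - top - a1 * lead) + top + a_next) % p

# Slide a window of m bits over a longer bit string and compare the
# rolled fingerprint with the one recomputed from scratch.
text = "1011001110001011"
m = 8
f = fingerprint(text[:m])
for i in range(1, len(text) - m + 1):
    f = roll(f, int(text[i - 1]), int(text[i + m - 1]), m)
    assert f == fingerprint(text[i:i + m])
```

Each update costs a constant number of modular operations, whereas recomputing the fingerprint from scratch costs time proportional to m.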

Basic Idea [Broder 1997]

Shingling
• dissect each document into q-grams (shingles)
• represent documents by their shingle-sets
• reduce the problem to set intersection [Jaccard]

Two documents are near-duplicates if their large shingle-sets intersect enough

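#1. Doc Similarity → Set Intersection

[Figure: Doc A is mapped to its shingle-set $S_A$, Doc B to its shingle-set $S_B$]

• Jaccard measure: similarity of $S_A$ and $S_B$

$$\mathrm{sim}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}$$

• Claim: A and B are near-duplicates if $\mathrm{sim}(S_A, S_B)$ is high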
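A minimal sketch of shingling and of the Jaccard measure follows. It assumes shingles are q consecutive lowercase words (the slides do not say whether q-grams are taken over words or characters); `shingles` and `jaccard` are hypothetical helper names.

```python
def shingles(text: str, q: int = 4) -> set:
    """Set of all q-grams of the text (assumption: q consecutive words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(sa: set, sb: set) -> float:
    """|sa ∩ sb| / |sa ∪ sb|; two empty sets are treated as identical."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

doc_a = "the web is full of duplicated content and near duplicates"
doc_b = "the web is full of duplicated pages and near duplicates"
sa, sb = shingles(doc_a, q=3), shingles(doc_b, q=3)
print(f"{len(sa & sb)} shared shingles out of {len(sa | sb)}: {jaccard(sa, sb):.3f}")
```

A single changed word invalidates q shingles, so a larger q makes the measure more sensitive to small edits.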

We need to cope with “Set Intersection”
• fingerprints of shingles (for space/time efficiency)
• min-hash to estimate intersection sizes (further efficiency)

[Figure: Doc → (shingling) → multiset of shingles → (fingerprint) → multiset of fingerprints]

#2. Sets of 64-bit fingerprints

Fingerprints:
• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
• In practice, use 64-bit fingerprints, i.e., $U = 2^{64}$
• $\Pr[\text{collision}] \approx (8q \cdot 64) / 2^{64} \ll 1$
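As a rough worked instance of this bound, assume shingles of q = 8 bytes (a value picked only for illustration):

$$\Pr[\text{collision}] \approx \frac{8q \cdot 64}{2^{64}} = \frac{64 \cdot 64}{2^{64}} = \frac{2^{12}}{2^{64}} = 2^{-52} \approx 2.2 \times 10^{-16}$$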

This reduces the space for storing the multisets and the time to intersect them, but...

#3. Sketch of a document

Sets are large, so their intersection is still too costly

Create a “sketch vector” (of size ~200) for each shingle-set

Documents that share ≥ t (say 80%) of the sketch-elements are claimed to be near duplicates


Sketching by Min-Hashing

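Consider $S_A, S_B \subseteq P = \{0, \ldots, p-1\}$

Pick a random permutation $\pi$ of the whole set $P$ (such as $\pi(x) = (ax + b) \bmod p$, with $a \neq 0$)

Define $\alpha = \min\{\pi(S_A)\}$ and $\beta = \min\{\pi(S_B)\}$, the minimal elements under the permutation $\pi$

Lemma:

$$P[\alpha = \beta] = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}$$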

Strengthening it…

Similarity sketch: $sk(A)$ = the $k$ minimal elements of $\pi(S_A)$

Alternatively, we might take $k$ permutations and keep the min of each

Note: we can reduce the variance of the estimate by using a larger $k$
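The variant with k permutations can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: fingerprints are assumed to be integers below the fixed prime `P`, the permutations are the linear maps (ax + b) mod P (which only approximate truly random permutations), and all function names are hypothetical.

```python
import random

P = (1 << 61) - 1  # assumption: a fixed prime larger than every fingerprint

def make_permutations(k: int, seed: int = 0) -> list:
    """k random maps x -> (a*x + b) mod P, each with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def sketch(fingerprints: set, perms: list) -> list:
    """sketch[i] = minimum of pi_i(x) over all fingerprints x."""
    return [min((a * x + b) % P for x in fingerprints) for a, b in perms]

def estimate_similarity(sk_a: list, sk_b: list) -> float:
    """Fraction of sketch positions on which the two documents agree."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

# Two synthetic fingerprint sets sharing 800 of 1200 distinct values,
# so the true Jaccard similarity is 800 / 1200.
values = random.Random(42).sample(range(P), 1200)
s_a, s_b = set(values[:1000]), set(values[200:])

perms = make_permutations(200)
est = estimate_similarity(sketch(s_a, perms), sketch(s_b, perms))
print(f"estimated similarity: {est:.2f} (true value: {800 / 1200:.2f})")
```

Each position agrees with probability equal to the Jaccard similarity, so the fraction of agreeing positions estimates it, with an error that shrinks as k grows.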

Computing Sketch[i] for Doc1

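[Figure: Document 1, with its shingle fingerprints laid out over the range $0 \ldots 2^{64}$]

• Start with the 64-bit fingerprints f(shingles)
• Permute them with $\pi_i$
• Pick the min value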


Test if Doc1.Sketch[i] = Doc2.Sketch[i]

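[Figure: Document 1 and Document 2; the fingerprints of each are permuted over the range $0 \ldots 2^{64}$, and A and B denote the respective minimum values]

Are these equal?

Test for 200 random permutations: $\pi_1, \pi_2, \ldots, \pi_{200}$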


However…

[Figure: Document 1 and Document 2, with minimum values A and B over the range $0 \ldots 2^{64}$]

A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)

Claim: this happens with probability Size_of_intersection / Size_of_union


#4. Detecting all duplicates

Brute-force (quadratic time): compare sk(A) vs. sk(B) for all pairs of docs A and B. Still, (num docs)^2 comparisons are too much computing, even if executed in internal memory

Locality-sensitive hashing (LSH) for sk(A) ≈ sk(B)
• Sample h elements of sk(A) as an ID (may induce false positives)
• Create t IDs (to reduce the false negatives)
• If at least one ID matches with another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare)
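To see how h and t trade off, assume each sketch element of A and B agrees independently with probability $s = \mathrm{sim}(S_A, S_B)$ (as with one min-hash per permutation). One ID of h elements then matches with probability $s^h$, and at least one of the t IDs matches with probability

$$\Pr[\text{A and B become candidates}] = 1 - (1 - s^h)^t$$

With the hypothetical values h = 5 and t = 20, a pair with s = 0.8 is caught with probability $1 - (1 - 0.8^5)^{20} \approx 0.9996$, while a pair with s = 0.3 is caught with probability $1 - (1 - 0.3^5)^{20} \approx 0.05$.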

#4. How do you implement this?

GOAL: If at least one ID matches with another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare)

SOL 1: Create t hash tables (with chaining), one per ID [recall that an ID is an h-sample of sk()]. Insert each docID into the slot of its ID in each table (using some hash). Scan every bucket and check which docIDs share ≥ 1 ID.

SOL 2: Sort the docs by each ID, and then check the consecutive equal ones.
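Both solutions can be sketched in a few lines of Python. The sketch assumes that each document's sketch is a list of integers of equal length and that an h-selection is a tuple of sketch positions; a Python dict stands in for the hash table with chaining, and all function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

def make_selections(sketch_len: int, h: int, t: int, seed: int = 0) -> list:
    """t random h-subsets of sketch positions (the 'h-selections')."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(sketch_len), h))) for _ in range(t)]

def candidates_by_hashing(sketches: dict, selections: list) -> set:
    """SOL 1: one table per ID; docs falling in the same bucket are candidates."""
    pairs = set()
    for sel in selections:
        table = defaultdict(list)
        for doc_id, sk in sketches.items():
            table[tuple(sk[i] for i in sel)].append(doc_id)
        for bucket in table.values():
            pairs.update(combinations(sorted(bucket), 2))
    return pairs

def candidates_by_sorting(sketches: dict, selections: list) -> set:
    """SOL 2: sort docs by each ID and pair up runs of equal consecutive IDs."""
    pairs = set()
    for sel in selections:
        keyed = sorted(
            (tuple(sk[i] for i in sel), doc_id) for doc_id, sk in sketches.items()
        )
        start = 0
        for end in range(1, len(keyed) + 1):
            if end == len(keyed) or keyed[end][0] != keyed[start][0]:
                run = [doc_id for _, doc_id in keyed[start:end]]
                pairs.update(combinations(run, 2))
                start = end
    return pairs

# Toy sketches of length 4; d1 and d2 agree on positions 0, 1 and 3.
sketches = {"d1": [3, 7, 1, 9], "d2": [3, 7, 2, 9], "d3": [5, 8, 4, 6]}
selections = [(0, 1), (2, 3)]  # explicit selections, to keep the example deterministic
print(candidates_by_hashing(sketches, selections))  # {('d1', 'd2')}
print(candidates_by_sorting(sketches, selections))  # {('d1', 'd2')}
```

The returned pairs are only candidates: each one still has to be verified by comparing the full sketches, as the GOAL above requires.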