Lecture 18: Syntactic Web Clustering
CS 728 - 2007
Outline

Previously:
- Studied web clustering based on web link structure
- Some discussion of term-document vector spaces

Today: syntactic clustering of the web
- Identifying syntactic duplicates
- Locality-sensitive hash functions
- Resemblance and shingling
- Min-wise independent permutations
- The sketching model
- Hamming distance and edit distance
Motivation: Near-Duplicate Elimination

Many web pages are duplicates or near-duplicates of other pages:
- Mirror sites
- FAQs, manuals, legal documents
- Different versions of the same document
- Plagiarism

Duplicates are bad for search engines:
- They increase index size
- They harm the quality of search results

Question: How can we efficiently process the repository of crawled pages and eliminate (near-)duplicates?
Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97]

- U: space of all possible documents
- S ⊆ U: collection of documents
- sim: U × U → [0,1]: a similarity measure among documents
  - If p, q are very similar, sim(p,q) is close to 1
  - If p, q are very dissimilar, sim(p,q) is close to 0
  - Usually sim(p,q) = 1 − d(p,q), where d(p,q) is a normalized distance between p and q
- G: a threshold graph on S: p, q are connected by an edge iff sim(p,q) ≥ t (t = threshold)

Goal: find the connected components of G
Main Challenges

- S is huge: the web has 10 billion pages, and documents are not compressed
  - Storing S needs many disks
- Each sim computation is costly
- Documents in S should be processed in a stream
- Main memory is small relative to S
- Cannot afford more than O(|S|) time

How to create the graph G? Naively, this requires |S| passes and |S|^2 similarity computations.
Sketching Schemes

- T: a small set (|S| < |T| << |U|)
- A sketching scheme for sim consists of:
  - A compression function: a randomized mapping σ: U → T
  - A reconstruction function: ρ: T × T → [0,1]
- For every pair p, q, with high probability, ρ(σ(p), σ(q)) ≈ sim(p,q)
Syntactic Clustering by Sketching

    P ← empty table of size |S|
    G ← empty graph on |S| nodes
    for i = 1,…,|S|:
        read document p_i from the stream
        P[i] ← σ(p_i)
    for i = 1,…,|S|:
        for j = 1,…,|S|:
            if ρ(P[i], P[j]) ≥ t: add edge (i,j) to G
    output connected components of G
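The loop above can be sketched in Python. Here the "sketch" is simply the document's word set and the reconstruction function is Jaccard similarity (both are illustrative stand-ins, not the paper's construction), and union-find replaces an explicit graph plus connected-components pass:

```python
def jaccard(a, b):
    """Stand-in reconstruction function: Jaccard similarity of two set sketches."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_by_sketching(docs, sketch, est_sim, t):
    """Group documents whose estimated pairwise similarity is >= t."""
    P = [sketch(d) for d in docs]          # one pass over the stream
    parent = list(range(len(docs)))        # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # |S|^2 sketch comparisons build the threshold graph G implicitly
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            if est_sim(P[i], P[j]) >= t:
                parent[find(i)] = find(j)  # merge connected components

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

docs = ["a rose is a rose", "a rose is a rose!", "an entirely different page"]
clusters = cluster_by_sketching(docs, lambda d: set(d.split()), jaccard, t=0.5)
```

The first two toy documents share 3 of 4 distinct words (Jaccard 0.75), so they land in one cluster; the third is alone.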
Analysis

- Sketches can be computed in one pass
- Table P can be stored in a single file on a single machine
- Creating G requires |S|^2 applications of ρ
  - Easier than full-fledged computations of sim, but quadratic time is still a problem
- The connected-components algorithm is heavy but feasible
- We need a linear-time algorithm, even an approximate one

Idea: use hashing.
Sketching vs. Fingerprinting vs. Hashing

Hashing h: U → {0,1}^k
- Set membership testing for a set S of size n
- Desire a uniform distribution over the N = 2^k bin addresses
- Minimize collisions per bin, to reduce lookup time
- Minimize hash table size: n ≈ N = 2^k

Fingerprinting f: U → {0,1}^k
- Object equality testing over a set S of size n
- Distribution over {0,1}^k is irrelevant
- Avoid collisions altogether
- Tolerate larger k: typically N > n^2

Sketching φ: U → {0,1}^k
- Similarity testing for a set S of size n
- Distribution over {0,1}^k is irrelevant
- Minimize collisions of dissimilar sets
- Minimize table size: n ≈ N = 2^k
Sketching via Locality Sensitive Hashing (LSH) [Indyk, Motwani 98]

- H = { h | h: U → T }: a family of hash functions
- H is locality sensitive w.r.t. sim if for all p, q ∈ U:

      Pr[h(p) = h(q)] = sim(p,q)

- The probability is over the random choice of h from H
- Probability of collision = similarity between p and q
Syntactic Clustering by LSH

    P ← empty table of size |S|
    choose a random h ∈ H
    for i = 1,…,|S|:
        read document p_i from the stream
        P[i] ← h(p_i)
    sort P and group by value
    output groups
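A minimal Python rendering of this sort-and-group step. The toy LSH h returns a document's minimum token under a simulated random permutation (random ranks assigned lazily); these are illustrative stand-ins, not the lecture's shingle-based h:

```python
import random

# Toy LSH: h(p) = the document's minimum token under one random order.
_rng = random.Random(0)
_rank = {}

def h(doc):
    toks = doc.split()
    for tok in toks:
        if tok not in _rank:
            _rank[tok] = _rng.random()   # lazily simulate a random permutation
    return min(toks, key=_rank.__getitem__)

def lsh_group(docs, h):
    """Hash every document in one pass, then sort and group by hash value."""
    order = sorted(range(len(docs)), key=lambda i: h(docs[i]))
    groups, current = [], [order[0]]
    for i in order[1:]:
        if h(docs[i]) == h(docs[current[-1]]):
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return groups

docs = ["a rose is a rose", "a rose is a rose a rose", "x y z"]
groups = lsh_group(docs, h)   # near-duplicates land in the same group
```

The first two documents have identical token sets, hence identical min-hash values, so they end up in one group.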
Analysis

- Hash values can be computed in one pass
- Table P can be stored in a single file on a single machine
- Sorting and grouping takes O(|S| log |S|) simple comparisons
- Each group consists of pages with the same hash value; by the LSH property, they are likely to be similar to each other

Let's apply this to the web and see if it makes sense. We need a sim measure. Idea: shingling.
Shingling and Resemblance [Broder et al 97]

- Tokens: words, numbers, HTML tags, etc.
- tokenization(p): the sequence of tokens produced from document p
- w: a small integer
- S_w(p) = the w-shingling of p = the set of all distinct contiguous subsequences of tokenization(p) of length w
  - Example: p = “a rose is a rose is a rose”, w = 4
    S_w(p) = { (a rose is a), (rose is a rose), (is a rose is) }
  - It is also possible to use multisets

resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|
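A small sketch of w-shingling and resemblance in Python, assuming a toy whitespace tokenizer in place of a real HTML-aware one:

```python
def shingles(doc, w):
    """Set of all distinct contiguous token subsequences of length w."""
    toks = doc.split()   # toy tokenizer: whitespace-separated words only
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}

def resemblance(p, q, w):
    """Jaccard similarity of the two documents' w-shingle sets."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

s = shingles("a rose is a rose is a rose", 4)
# three distinct 4-shingles, as in the example above
```

For the slide's example document, `s` contains exactly the three shingles listed.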
Shingling Example

A = “a rose is a rose is a rose”
B = “a rose is a flower which is a rose”

Preserving multiplicity:
- w=1: sim(S_A, S_B) = 0.7
  - S_A = {a, a, a, is, is, rose, rose, rose}
  - S_B = {a, a, a, is, is, rose, rose, flower, which}
- w=2: sim(S_A, S_B) = 0.5
- w=3: sim(S_A, S_B) = 0.3

Disregarding multiplicity:
- w=1: sim(S_A, S_B) = 0.6
- w=2: sim(S_A, S_B) = 0.5
- w=3: sim(S_A, S_B) = 3/7 ≈ 0.43
LSH for Resemblance

resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|

- π: a random permutation on Σ^w, the set of all length-w token sequences
- π induces a random order on all length-w sequences of tokens
- π also induces a random order on any subset X ⊆ Σ^w: for each such subset and each x ∈ X,

      Pr[min(π(X)) = π(x)] = 1/|X|

- LSH for resemblance: h(p) = min(π(S_w(p)))
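The min-wise hash can be simulated by drawing a fresh random rank for every shingle in place of a true random permutation; averaging collisions over many draws should then recover the resemblance. A toy sketch (helper names are hypothetical):

```python
import random

def shingles(doc, w):
    toks = doc.split()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}

def minhash(shingle_set, rank):
    """h(p) = the element of S_w(p) that is minimal under the random order."""
    return min(shingle_set, key=rank.get)

def estimate_resemblance(p, q, w, trials=2000, seed=0):
    """Empirical Pr[h(p) = h(q)] over many simulated random permutations."""
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    hits = 0
    for _ in range(trials):
        # fresh random order on the union simulates one random permutation
        rank = {sh: rng.random() for sh in sorted(sp | sq)}
        hits += minhash(sp, rank) == minhash(sq, rank)
    return hits / trials

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
est = estimate_resemblance(A, B, w=3)   # should be close to 3/7 ≈ 0.43
```

With w=3 this matches the set-based similarity from the shingling example (3/7).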
LSH for Resemblance (cont.)

Lemma: Pr[min(π(S_w(p))) = min(π(S_w(q)))] = resemblance_w(p,q).

Proof: The minimum of π over S_w(p) ∪ S_w(q) is equally likely to be the image of any element of the union. The two minima coincide iff that element lies in S_w(p) ∩ S_w(q), which happens with probability |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|.
Problems

- How do we pick π?
  - Need a random choice
  - Need to efficiently find the min element
- How many possible values of π? |Σ^w|! So we need O(|Σ|^w log |Σ|^w) bits just to represent π
- And we still need to compute the min element
Some Theory: Pairwise Independence

Universal hash functions (pairwise independent):
- H: a finite collection (family) of hash functions mapping U to {0,…,m−1}
- H is universal if, for h ∈ H picked uniformly at random, and for all x1, x2 ∈ U with x1 ≠ x2:

      Pr[h(x1) = h(x2)] ≤ 1/m

- The class of hash functions h_{a,b}(x) = ((a·x + b) mod p) mod m is universal
  (p ≥ m a prime, a ∈ {1,…,p−1}, b ∈ {0,…,p−1})
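A minimal implementation of this family; the choice of the Mersenne prime p = 2^31 − 1 is an assumption for illustration (any suitable prime with keys below p works):

```python
import random

def make_universal_hash(m, p=2_147_483_647, rng=None):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m from the universal family.

    p is prime (here 2^31 - 1); integer keys x should be smaller than p.
    """
    rng = rng or random.Random()
    a = rng.randrange(1, p)   # a in {1, ..., p-1}
    b = rng.randrange(0, p)   # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m

h = make_universal_hash(m=100, rng=random.Random(42))
buckets = [h(x) for x in range(1000)]   # every value lands in {0, ..., 99}
```

Each call to `make_universal_hash` draws one random member of the family; the same function always maps the same key to the same bucket.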
Some Theory: Min-wise Independence

Min-wise independent permutations:
- S_n: a finite collection (family) of permutations mapping {1,…,n} to {1,…,n}
- The family is min-wise independent if, for π ∈ S_n picked uniformly at random, for every X ⊆ {1,…,n} and every x ∈ X:

      Pr[min{π(X)} = π(x)] = 1/|X|

- It is actually hard to find a “compact” collection of permutations that is min-wise independent, but we can use an approximation
- In practice, universal hash functions work well!
Back to Similarity and Resemblance

If π ∈ S_n and S_n is min-wise independent, then:

    Pr[min(π(S_w(A))) = min(π(S_w(B)))] = |S_w(A) ∩ S_w(B)| / |S_w(A) ∪ S_w(B)| = r_w(A,B)

- This suggests we could just keep one minimum value as our “sketch”, but our confidence would be low (high variance)
- What we want for a sketch of size k: either use k different π’s, or keep the k minimum values for one π
Multiple Permutations: Better Variance Reduction

- Instead of a larger k, stick with k=1 but use multiple, independent permutations
- Sketch construction: pick p random permutations of U: π_1, π_2, …, π_p
- sk(A) = the minimal elements under π_1(S_A), …, π_p(S_A)
- Claim: E[sim(sk(A), sk(B))] = sim(S_A, S_B)
  - The earlier lemma gives the case p=1
  - Linearity of expectation
  - Variance reduction follows from the independence of π_1, …, π_p
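A sketch of this construction, simulating the p permutations with a deterministic MD5-based rank function (a stand-in for truly min-wise independent permutations; names are illustrative):

```python
import hashlib

def shingles(doc, w=2):
    toks = doc.split()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}

def rank(j, sh):
    """Deterministic pseudo-random rank of shingle sh under 'permutation' j."""
    return hashlib.md5((str(j) + "|" + " ".join(sh)).encode()).hexdigest()

def sketch(doc, p=256, w=2):
    """sk(A): the minimal shingle under each of the p simulated permutations."""
    s = shingles(doc, w)
    return [min(s, key=lambda sh: rank(j, sh)) for j in range(p)]

def estimated_sim(ska, skb):
    """Fraction of sketch coordinates on which the two documents agree."""
    return sum(x == y for x, y in zip(ska, skb)) / len(ska)

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
est = estimated_sim(sketch(A), sketch(B))   # resemblance_2(A,B) = 0.5, so est is near 0.5
```

Because each sketch depends only on its own document, sketches can be computed in one streaming pass and compared later, exactly as the clustering algorithm requires.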
Other Known Sketching Schemes

- Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
- Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
- Cosine similarity [Charikar 02]
- Earth mover distance [Charikar 02]
- Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]
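For intuition on the Hamming-distance case: the classic LSH there is bit sampling, h(x) = x_i for a uniformly random coordinate i, so Pr[h(x) = h(y)] = 1 − d_H(x,y)/n. A toy estimate of this quantity (the underlying idea only, not the cited papers' constructions):

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))

def bit_sampling_estimate(x, y, k=1000, seed=0):
    """Estimate 1 - d_H(x,y)/n by comparing k randomly sampled coordinates."""
    rng = random.Random(seed)
    n = len(x)
    idx = [rng.randrange(n) for _ in range(k)]
    return sum(x[i] == y[i] for i in idx) / k

x = "1010101010" * 10   # n = 100
y = "1010101011" * 10   # differs from x in 10 of 100 positions
est = bit_sampling_estimate(x, y)   # should be near 1 - 10/100 = 0.9
```

Each sampled coordinate is an unbiased coin with bias 1 − d_H(x,y)/n, so averaging k samples concentrates around the true value.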
The General Sketching Model

- Alice holds x, Bob holds y; they share randomness
- Alice sends σ(x) and Bob sends σ(y) to a referee
- The k vs. r gap problem:
  - Promise: d(x,y) ≤ k or d(x,y) ≥ r
  - Goal: decide which of the two holds
- The gap between k and r plays the role of an approximation factor
Applications

- Large data sets: clustering, nearest-neighbor schemes, data streams
- Management of files over the network: differential backup, synchronization
- Theory: low-distortion embeddings, simultaneous-messages communication complexity