![Page 1: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/1.jpg)
1
Edit Distance and
Large Data Sets
Ziv Bar-Yossef
Robert Krauthgamer
Ravi Kumar
T.S. Jayram
IBM Almaden, Technion
![Page 2: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/2.jpg)
2
Motivating Example: Near-Duplicate Elimination
Web Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]
• Group pages into clusters of “similar” pages
• Keep one “representative” from each cluster
Crawler
Duplicate elimination
Page Repository
Page Repository
![Page 3: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/3.jpg)
3
Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]
Challenges
• Corpus is huge (billions of pages, 10K/page)
• Streaming access
• Limited main memory
• Linear running time
p → h(p)
Locality Sensitive Hashes [Indyk, Motwani 98]
$\Pr_h[h(p) = h(q)] = \mathrm{sim}(p,q)$
Cluster:
Collection of pages that have a common sketch
• Can compute sketches in one pass
• Sketches can be stored and processed on a single machine
![Page 4: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/4.jpg)
4
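Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
w-shingling:
$S_w(p)$ = all substrings of p of length w
$\mathrm{resemblance}_w(p,q) = \dfrac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
For a random permutation $\pi$:
$\Pr_\pi[\min(\pi(S_w(p))) = \min(\pi(S_w(q)))] = \dfrac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
[Figure: the shingle sets $S_w(p)$ and $S_w(q)$]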
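The min-wise identity above can be checked numerically. The following is a minimal sketch, not code from the talk: the helper names (`shingles`, `resemblance`, `minhash_estimate`) are hypothetical, and the random permutation is simulated by shuffling only the union of the two shingle sets, which is enough to decide which shingle is minimal.

```python
import random


def shingles(p: str, w: int) -> set[str]:
    """Return S_w(p): the set of all length-w substrings of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}


def resemblance(p: str, q: str, w: int) -> float:
    """Exact resemblance: |S_w(p) & S_w(q)| / |S_w(p) | S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    union = sp | sq
    return len(sp & sq) / len(union) if union else 1.0


def minhash_estimate(p: str, q: str, w: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate resemblance as the fraction of random orderings under which
    both shingle sets have the same minimal element.
    Assumes both strings have length at least w."""
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)
    hits = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)  # a uniformly random ordering of the union
        rank = {s: r for r, s in enumerate(order)}
        if min(sp, key=rank.__getitem__) == min(sq, key=rank.__getitem__):
            hits += 1
    return hits / trials
```

For example, `resemblance("abcdef", "abcxef", 2)` is 3/7 (three shared 2-shingles out of seven distinct ones), and `minhash_estimate` with the same arguments should land close to that value.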
![Page 5: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/5.jpg)
5
The Sketching Model
Alice holds x, Bob holds y.
Using shared randomness, Alice sends a sketch of x and Bob sends a sketch of y to the Referee.
Referee decides: $d(x,y) \le k$ or $d(x,y) \ge r$.
k vs. r Gap Problem
Promise: $d(x,y) \le k$ or $d(x,y) \ge r$
Goal: Decide which of the two holds.
Approximation
![Page 6: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/6.jpg)
6
Applications of Sketching
Large data sets
• Clustering
• Nearest Neighbor schemes
• Data streams
Management of Files over the Network
• Differential backup
• Synchronization
Theory
• Low distortion embeddings
• Simultaneous messages communication complexity
![Page 7: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/7.jpg)
7
Known Sketching Schemes
• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
• Cosine similarity [Charikar 02]
• Earth mover distance [Charikar 02]
In this talk: Edit Distance
![Page 8: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/8.jpg)
8
Edit Distance
ED(x,y): Minimum number of character insertions, deletions and substitutions that transform x to y, where $x \in \Sigma^n$, $y \in \Sigma^m$.
Examples:
ED(00000, 1111) = 5
ED(01010, 10101) = 2
Applications
• Genomics
• Text processing
• Web search
For simplicity: m = n, $\Sigma = \{0,1\}$.
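The definition can be made concrete with the textbook quadratic-time dynamic program. This is a minimal sketch for illustration (the function name `edit_distance` is an assumption of this example, not something from the talk); it keeps only two rows of the table at a time.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

On the two examples from the slide, `edit_distance("00000", "1111")` gives 5 (four substitutions and one deletion) and `edit_distance("01010", "10101")` gives 2 (delete the leading 0, append a 1).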
![Page 9: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/9.jpg)
9
Computing Edit Distance
Exact Computation
• Dynamic programming (1970): $O(n^2)$
• Masek and Paterson (1980): $O(n^2/\log n)$
• Impractical for comparing two very long strings.
• Natural question 1: can we do it in linear time?
• Impractical for handling massive document repositories.
• Natural question 2: are there constant size sketches of edit distance?
Can we solve the above problems if we settle for approximation? (Focus of this talk)
![Page 10: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/10.jpg)
10
Sketching Schemes for Edit Distance
| Algorithm | Gap | Sketch size |
|---|---|---|
| Batu et al | $O(n^{\alpha})$ vs. $\Omega(n)$ | $O(n^{\max(\alpha/2,\, 2\alpha - 1)})$ |
| This paper | k vs. $O((kn)^{2/3})$ | O(1) |
| This paper (non-repetitive strings) | k vs. $O(k^2)$ | O(1) |

Negative Indications
• No known embeddings of Edit distance into a normed space.
• Every embedding of Edit distance into $L_1$ incurs $\ge 3/2$ distortion [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]
• Weak nearest neighbor schemes [Indyk 04]
![Page 11: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/11.jpg)
11
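Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]
Ham(x,y) = # of positions in which x,y differ
Gap: k vs. 2k
Sketch size: O(1)
Shared randomness:
$r_1,\dots,r_n \in \{0,1\}$ are independent and …
Sketch: $h(x) = \left(\sum_i x_i r_i\right) \bmod 2$
$h(y) = \left(\sum_i y_i r_i\right) \bmod 2$
Analysis:
$\Pr[h(x) \ne h(y)]$
$= \Pr[h(x) + h(y) \equiv 1 \pmod 2]$
$= \Pr\left[\sum_{i:\, x_i \ne y_i} r_i \equiv 1 \pmod 2\right]$
$= \tfrac{1}{2}\left(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)}\right)$
Full sketches: $(h_1(x),\dots,h_t(x))$ for x and $(h_1(y),\dots,h_t(y))$ for y, with t = O(1)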
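The closed form in the analysis can be checked by simulation. The sketch below is illustrative only, and it rests on one labeled assumption: each shared bit is 1 with probability 1/(2k). The slide's own statement of the bias did not survive extraction; 1/(2k) is chosen here because the parity of d independent bits with bias p is odd with probability ½(1 − (1 − 2p)^d), which matches the slide's formula exactly when p = 1/(2k). The function names are hypothetical.

```python
import random


def parity_sketch(x: list[int], r: list[int]) -> int:
    """One-bit sketch h(x) = (sum_i x_i * r_i) mod 2."""
    return sum(xi & ri for xi, ri in zip(x, r)) % 2


def empirical_disagreement(x, y, k, trials=20000, seed=1):
    """Fraction of trials in which h(x) != h(y).
    ASSUMPTION: Pr[r_i = 1] = 1/(2k); the bias is not legible in the source."""
    rng = random.Random(seed)
    p = 1 / (2 * k)
    n = len(x)
    differ = 0
    for _ in range(trials):
        r = [1 if rng.random() < p else 0 for _ in range(n)]  # shared randomness
        differ += parity_sketch(x, r) != parity_sketch(y, r)
    return differ / trials


def predicted_disagreement(k: int, ham: int) -> float:
    """Closed form from the analysis: 1/2 * (1 - (1 - 1/k)^Ham(x,y))."""
    return 0.5 * (1 - (1 - 1 / k) ** ham)
```

With k = 4, the predicted disagreement probability is about 0.342 at Hamming distance 4 and about 0.450 at distance 8. A constant gap of this kind is what lets a constant number t of independent one-bit sketches separate the two cases.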
![Page 12: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/12.jpg)
12
Edit Distance Sketches: Basic Framework
Underlying Principle
ED(x,y) is small iff x and y share many common substrings at nearby positions.
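$S_x$ = set of pairs of the form (substring, h(i)):
• substring: a substring of x
• h(i): a “locality sensitive” encoding of the substring’s position
ED(x,y) small iff intersection $S_x \cap S_y$ large
[Figure: x and y mapped to the sets $S_x$ and $S_y$; the intersection corresponds to common substrings at nearby positions]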
![Page 13: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/13.jpg)
13
Basic Framework (cont.)
• Need to estimate size of symmetric difference
• Hamming distance computation of characteristic vectors
• Use constant size sketches [KOR]
ED(x,y) small iff symmetric difference $S_x \,\Delta\, S_y$ small
Reduced Edit Distance to Hamming Distance
![Page 14: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/14.jpg)
14
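General Case: Encoding Scheme
Gap: k vs. $O((kn)^{2/3})$
$B = n^{2/3}/k^{1/3}$, $W = n/B$
B windows of size W each.
$S_x$ = { (1st substring of x, 1), (2nd substring of x, 1), (3rd substring of x, 2), …, (i-th substring of x, win(i)), … }
$S_y$ = { (1st substring of y, 1), (2nd substring of y, 1), (3rd substring of y, 2), …, (i-th substring of y, win(i)), … }
[Figure: x and y with positions 1 to 14 grouped into windows 1, 2, 3]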
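The encoding can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' construction: it assumes the i-th substring is the length-B substring starting at position i (0-indexed) and that win(i) is simply i // W. Both choices, and the names `encode` and `set_hamming`, are hypothetical.

```python
def encode(x: str, B: int, W: int) -> set[tuple[str, int]]:
    """Build S_x as pairs (substring, window index).
    ASSUMPTIONS: substrings have length B and start at every position i;
    the window of position i is i // W."""
    return {(x[i:i + B], i // W) for i in range(len(x) - B + 1)}


def set_hamming(sx: set, sy: set) -> int:
    """Hamming distance between characteristic vectors = size of the symmetric difference."""
    return len(sx ^ sy)
```

Under these assumptions a single substitution touches at most B substrings on each side, so the symmetric difference grows by at most 2B, in the same spirit as the "at most kB marked substrings" count in the analysis. The resulting sets are what a constant-size Hamming sketch would then be applied to.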
![Page 15: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/15.jpg)
15
Analysis
Case 1: $\mathrm{ED}(x,y) \le k$
• If substring i is “unmarked”, it has a matching “companion” j
• (i-th substring, win(i)) $\in S_x \setminus S_y$ only if:
  • either i is “marked”
  • or i is unmarked, but $\mathrm{win}(i) \ne \mathrm{win}(j)$
• At most kB marked substrings
• At most $k \cdot n/W = kB$ companions with mismatched windows
• Therefore, $\mathrm{Ham}(S_x,S_y) \le 4kB$
[Figure: substring i of x and its companion j in y, positions 1 to 14]
![Page 16: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/16.jpg)
16
Analysis (cont.)
Case 2: $\mathrm{Ham}(S_x,S_y) \le 8kB$
• If i has a “companion” j and win(i) = win(j), can align i with j using at most W operations
• Otherwise, substitute first character of i
• At most 8kB substrings of x have no companion
• Therefore, $\mathrm{ED}(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$
[Figure: substrings of x starting at positions 1, B+1, 2B+1, … aligned against y]
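As a check on the exponents, substituting the parameters $B = n^{2/3}/k^{1/3}$ and $W = n/B$ into the final bound gives the following (a worked calculation added for clarity, not taken from the slides):

```latex
\begin{align*}
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = (kn)^{2/3}, \\
W &= \frac{n}{B} = n^{1/3} k^{1/3} = (kn)^{1/3}, \\
W \cdot \frac{n}{B} &= W^2 = (kn)^{2/3}, \\
8kB + W \cdot \frac{n}{B} &= 9\,(kn)^{2/3} = O\big((kn)^{2/3}\big).
\end{align*}
```

The choice of B balances the two terms: both kB and W · n/B come out to $(kn)^{2/3}$.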
![Page 17: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/17.jpg)
17
Non-repetitive Case: Encoding Scheme
Gap: k vs. O(k W)
$t \ge 1$: “non-repetitiveness” parameter, $W = O(k \cdot t)$; no substring of length t repeats within a window of size W
Alice and Bob choose a sequence of “anchors” in a coordinated way:
• First permutation: a random permutation on $\{0,1\}^t$
• First anchor of x: minimal length-t substring of x1 (under the first permutation)
• First anchor of y: minimal length-t substring of y1 (under the first permutation)
[Figure: x and y with windows x1, x2 and y1, y2 of size W]
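The coordinated anchor choice can be illustrated as follows. This is a sketch under assumptions, not the scheme from the talk: a keyed hash (`hashlib.blake2b` with a shared key) stands in for the shared random permutation on $\{0,1\}^t$, and the names `rank` and `pick_anchor` are hypothetical. The point it shows is that both parties, using the same shared randomness, pick the same substring whenever their windows contain the same set of length-t substrings.

```python
import hashlib


def rank(sub: str, seed: bytes) -> bytes:
    """Pseudo-random rank of a length-t substring.
    ASSUMPTION: a keyed hash stands in for a truly random permutation."""
    return hashlib.blake2b(sub.encode(), key=seed, digest_size=8).digest()


def pick_anchor(window: str, t: int, seed: bytes) -> tuple[int, str]:
    """Return (position, substring) of the minimal length-t substring of the
    window under the shared ordering; ties go to the leftmost position."""
    candidates = [(rank(window[i:i + t], seed), i)
                  for i in range(len(window) - t + 1)]
    _, i = min(candidates)
    return i, window[i:i + t]
```

For instance, with a shared seed, the windows `"00010111"` and `"000010111"` (the same text after one insertion at the front) contain the same set of length-3 substrings, so both parties select the same anchor substring even though its position may shift.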
![Page 18: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/18.jpg)
18
Encoding scheme (cont.)
$S_x$ = { (1st substring of x, 1), …, (8th substring of x, 8) }
$S_y$ = { (1st substring of y, 1), …, (8th substring of y, 8) }
[Figure: anchors split x and y into substrings numbered 1 to 8]
![Page 19: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/19.jpg)
19
Analysis
Case 1: $\mathrm{ED}(x,y) \le k$.
• All anchors are “unmarked” with probability $1 - kt/W = \Omega(1)$
• If the i-th anchors of x and y are unmarked, they are aligned
• # of mismatching substrings $\le 2k$
• $\mathrm{Ham}(S_x,S_y) \le 2k$
![Page 20: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/20.jpg)
20
Analysis (cont.)
Case 2: $\mathrm{Ham}(S_x,S_y) \le 4k$
• # of mismatching substrings $\le 4k$
• $\mathrm{ED}(x,y) \le 2 \cdot W \cdot 4k = O(kW)$.
![Page 21: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/21.jpg)
21
Approximation in Linear Time
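Arbitrary Strings

| Algorithm | Gap | Time | Approx. factor in O(n) time |
|---|---|---|---|
| Dynamic Programming | k vs. k+1 | O(kn) | None |
| Batu et al | $O(n^{\alpha})$ vs. $\Omega(n)$ | $O(n^{\max(\alpha/2,\, 2\alpha - 1)})$ | None |
| Cole, Hariharan | k vs. 2k | $O(n + k^4)$ | $O(n^{3/4})$ |
| This paper | k vs. $k^{7/4}$ | O(n) | $O(n^{3/7})$ |

Non-repetitive Strings

| Algorithm | Gap | Time | Approx. factor in O(n) time |
|---|---|---|---|
| Cole, Hariharan | k vs. 2k | $O(n + k^3)$ | $O(n^{2/3})$ |
| This paper | k vs. $k^{3/2}$ | O(n) | $O(n^{1/3})$ |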
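One way to read the "This paper" rows, offered as an interpretation rather than the authors' stated derivation: a gap of k vs. $k^c$ is only informative while $k^c \le n$, because $\mathrm{ED}(x,y) \le n$ always holds when m = n. At the largest such k the ratio between the two sides of the gap is:

```latex
\begin{align*}
k^{c} = n \;\Longrightarrow\; k = n^{1/c}, \qquad
\frac{k^{c}}{k} = k^{c-1} = n^{(c-1)/c}. \\
c = \tfrac{7}{4}: \quad n^{(3/4)/(7/4)} = n^{3/7}, \qquad
c = \tfrac{3}{2}: \quad n^{(1/2)/(3/2)} = n^{1/3}.
\end{align*}
```

Both values agree with the approximation factors listed for arbitrary and non-repetitive strings respectively.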
![Page 22: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/22.jpg)
22
Summary and Open Problems
• Designed efficient approximation schemes for edit distance.
  – Best sketching and linear-time approximations to date
• Subsequent work:
  – $O(n^{2/3})$ distortion embedding of edit distance into $L_1$ [Indyk 04], [Rabani 04]
  – Better embeddings of edit distance into $L_1$ [Ostrovsky, Rabani, 05]
  – Embeddings of the Ulam metric into $L_1$ [Charikar, Krauthgamer, 05]
• Open Problems
  – Sketch size lower bounds
  – Constant factor approximations in linear time
  – Better embeddings of edit distance
  – Sketching schemes for other distance measures
![Page 23: 1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d625503460f94a44e1b/html5/thumbnails/23.jpg)
23
Thank You