
Edit Distance and Large Data Sets

Ziv Bar-Yossef

Robert Krauthgamer

Ravi Kumar

T.S. Jayram

IBM Almaden, Technion

Motivating Example: Near-Duplicate Elimination

Web → Crawler → Duplicate elimination → Page Repository

Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]

• Group pages into clusters of “similar” pages

• Keep one “representative” from each cluster

Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]

Challenges

• Corpus is huge (billions of pages, 10K/page)

• Streaming access

• Limited main memory

• Linear running time

p → h(p)

Locality Sensitive Hashes [Indyk, Motwani 98]

Pr_h[h(p) = h(q)] = sim(p,q)

Cluster:

Collection of pages that have a common sketch

• Can compute sketches in one pass

• Sketches can be stored and processed on a single machine

Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]

w-shingling:

S_w(p) = all substrings of p of length w

resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|

Pr_π[min(π(S_w(p))) = min(π(S_w(q)))] = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|

(π: a random permutation of the shingle universe)
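As a rough illustration of the definitions above, the following Python sketch computes w-shingles and resemblance, then estimates the resemblance by counting how often the minimum shingles agree under a random order. The function names, the seeded random ranking (standing in for the permutation π) and the trial count are assumptions made for this example only.

```python
import random

def shingles(p, w):
    """S_w(p): the set of all length-w substrings of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}

def resemblance(p, q, w):
    """|S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    union = sp | sq
    return len(sp & sq) / len(union) if union else 1.0

def minhash_estimate(p, q, w, trials=2000, seed=0):
    """Fraction of random orders under which both shingle sets share their minimum.

    Assumes both strings have length at least w, so neither set is empty.
    """
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)
    agree = 0
    for _ in range(trials):
        # A random ranking of the universe induces a uniformly random order on it.
        rank = {s: rng.random() for s in universe}
        if min(sp, key=rank.__getitem__) == min(sq, key=rank.__getitem__):
            agree += 1
    return agree / trials

print(resemblance("abcdef", "abcdxf", 2))
print(minhash_estimate("abcdef", "abcdxf", 2))
```

Each trial plays the role of one sketch coordinate; in a real deployment only the minimum per permutation would be stored, not the shingle sets.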

The Sketching Model

Alice (input x)     Bob (input y)

Shared Randomness

Referee: d(x,y) ≤ k or d(x,y) ≥ r?

k vs. r Gap Problem

Promise: d(x,y) ≤ k or d(x,y) ≥ r

Decide which of the two holds.

Approximation

Applications of Sketching

Large data sets

• Clustering

• Nearest Neighbor schemes

• Data streams

Management of Files over the Network

• Differential backup

• Synchronization

Theory

• Low distortion embeddings

• Simultaneous messages communication complexity

Known Sketching Schemes

• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]

• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]

• Cosine similarity [Charikar 02]

• Earth mover distance [Charikar 02]

In this talk: Edit Distance

Edit Distance

x ∈ Σ^n, y ∈ Σ^m

ED(x,y): Minimum number of character insertions, deletions and substitutions that transform x to y.

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2

Applications

• Genomics

• Text processing

• Web search

For simplicity: m = n, Σ = {0,1}.

Computing Edit Distance

Exact Computation

• Dynamic programming (1970): O(n^2)

• Masek and Paterson (1980): O(n^2/log n)
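For reference, the quadratic dynamic program can be sketched in a few lines of Python. This is the textbook recurrence written with two rolling rows; the function name is chosen for this example, and the code is not taken from the talk.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(edit_distance("00000", "1111"))
print(edit_distance("01010", "10101"))
```

The two calls reproduce the examples from the Edit Distance slide. Running time is O(nm) and space is O(m), which is what makes exact computation impractical for very long strings.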

• Impractical for comparing two very long strings.

• Natural question 1: can we do it in linear time?

• Impractical for handling massive document repositories.

• Natural question 2: are there constant size sketches of edit distance?

Can we solve the above problems if we settle for approximation?

Focus of this

Sketching Schemes for Edit Distance

Algorithm                              Gap                      Sketch size
Batu et al                             O(n^α) vs. Ω(n)          O(n^max(α/2, 2α − 1))
This paper                             k vs. O((kn)^(2/3))      O(1)
This paper (non-repetitive strings)    k vs. O(k^2)             O(1)

Negative Indications

• No known embeddings of Edit distance into a normed space.

• Every embedding of Edit distance into L1 incurs ≥ 3/2 distortion [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]

• Weak nearest neighbor schemes [Indyk 04]

Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]

Ham(x,y) = # of positions in which x,y differ

Gap: k vs. 2k     Sketch size: O(1)

Shared randomness:

r_1,…,r_n ∈ {0,1} are independent and …

Sketch: h(x) = (Σ_i x_i r_i) mod 2

h(y) = (Σ_i y_i r_i) mod 2

Analysis:

Pr[h(x) ≠ h(y)] =

Pr[h(x) + h(y) = 1 (mod 2)] =

Pr[Σ_{i: x_i ≠ y_i} r_i = 1 (mod 2)] =

½(1 − (1 − 1/k)^Ham(x,y))

Full sketches: (h_1(x),…,h_t(x)) for x and (h_1(y),…,h_t(y)) for y, t = O(1)
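The parity sketch can be simulated directly. The Python sketch below is a minimal illustration under one loud assumption: the bias of the shared bits is cut off in the source, so the code uses Pr[r_i = 1] = 1/(2k), the value for which the stated formula ½(1 − (1 − 1/k)^Ham(x,y)) holds. All function names are hypothetical, and t is set large here only to make the empirical fractions stable.

```python
import random

def shared_bits(n, k, t, seed=0):
    """t independent rows r_1..r_n of biased bits, available to both parties."""
    rng = random.Random(seed)
    p = 1.0 / (2 * k)  # ASSUMPTION: bias not given in the source
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(t)]

def sketch(x, rows):
    """One parity bit h_j(x) = (sum_i x_i * r_i) mod 2 per row."""
    return [sum(xi & ri for xi, ri in zip(x, row)) % 2 for row in rows]

def disagreement(sx, sy):
    """Fraction of sketch coordinates on which the two sketches differ."""
    return sum(a != b for a, b in zip(sx, sy)) / len(sx)

def predicted_disagreement(k, ham):
    """The collision formula from the analysis above."""
    return 0.5 * (1 - (1 - 1 / k) ** ham)

n, k, t = 200, 10, 4000
rows = shared_bits(n, k, t)
x = [0] * n
y_close = [1] * k + [0] * (n - k)            # Ham(x, y_close) = k
y_far = [1] * (2 * k) + [0] * (n - 2 * k)    # Ham(x, y_far) = 2k
close = disagreement(sketch(x, rows), sketch(y_close, rows))
far = disagreement(sketch(x, rows), sketch(y_far, rows))
print(close, predicted_disagreement(k, k))
print(far, predicted_disagreement(k, 2 * k))
```

A referee would compare the observed disagreement against a threshold between the two predicted values to decide the k vs. 2k gap problem.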

Edit Distance Sketches: Basic Framework

Underlying Principle

ED(x,y) is small iff x and y share many common substrings at nearby positions.

S_x = set of pairs of the form (s, h(i))

s: a substring of x

h(i): a “locality sensitive” encoding of the substring’s position i

ED(x,y) small iff intersection S_x ∩ S_y large

S_x ∩ S_y: common substrings at nearby positions

Basic Framework (cont.)

ED(x,y) small iff symmetric difference S_x Δ S_y small

• Need to estimate size of symmetric difference

• Hamming distance computation of characteristic vectors

• Use constant size sketches [KOR]

Reduced Edit Distance to Hamming Distance

General Case: Encoding Scheme

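Gap: k vs. O((kn)^(2/3))

B = n^(2/3)/k^(1/3), W = n/B

B windows of size W each.

[Figure: x and y, positions 1–14, divided into windows]

S_x = { (s_1,1), (s_2,1), (s_3,2), …, (s_i, win(i)), … }

S_y = { … } (analogously for y)

s_i: substring of x at position i; win(i): index of the window containing position i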
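The window encoding can be sketched in Python as follows. Several details are assumptions made for this illustration and are not spelled out on the slide: substrings are taken to have length B, win(i) is the 0-based index of the size-W window containing start position i, parameters are rounded to integers, and S_x is treated as a plain set. The function names are hypothetical.

```python
import random

def parameters(n, k):
    """B = n^(2/3) / k^(1/3) and W = n / B, rounded to integers (assumption)."""
    B = max(1, round(n ** (2 / 3) / k ** (1 / 3)))
    W = max(1, round(n / B))
    return B, W

def window_encoding(x, B, W):
    """S_x: pairs (substring at position i, index of the window containing i)."""
    return {(x[i:i + B], i // W) for i in range(len(x) - B + 1)}

rng = random.Random(0)
n, k = 512, 8
B, W = parameters(n, k)
x = "".join(rng.choice("01") for _ in range(n))
# y differs from x by a single substitution at position 100.
y = x[:100] + ("1" if x[100] == "0" else "0") + x[101:]
sx, sy = window_encoding(x, B, W), window_encoding(y, B, W)
print(B, W, len(sx ^ sy))
```

The size of the symmetric difference `sx ^ sy` is the Hamming distance between the characteristic vectors of the two sets, which is the quantity the constant-size Hamming sketches then estimate.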

Analysis

Case 1: ED(x,y) ≤ k

• If i is “unmarked”, it has a matching “companion” j

• (s_i, win(i)) ∈ S_x \ S_y only if:

  • either i is “marked”

  • or i is unmarked, but win(i) ≠ win(j)

• At most kB marked substrings

• At most k * n/W = kB companions with mismatched windows

• Therefore, Ham(S_x,S_y) ≤ 4kB

Analysis (cont.)

Case 2: Ham(S_x,S_y) ≤ 8kB

• If i has a “companion” j and win(i) = win(j), can align i with j using at most W operations

• Otherwise, substitute first character of i

• At most 8kB substrings of x have no companion

• Therefore, ED(x,y) ≤ 8kB + W * n/B = O((kn)^(2/3))

[Figure: positions B+1, 2B+1]
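The final bound follows by substituting the parameter choices B = n^(2/3)/k^(1/3) and W = n/B; the constant 9 below is just what this substitution yields, written out as a worked check:

```latex
\begin{align*}
B &= \frac{n^{2/3}}{k^{1/3}}, \qquad W = \frac{n}{B} = (kn)^{1/3},\\
8kB &= 8\,k^{2/3} n^{2/3} = 8\,(kn)^{2/3},\\
W \cdot \frac{n}{B} &= W^{2} = (kn)^{2/3},\\
ED(x,y) &\le 8kB + W \cdot \frac{n}{B} = 9\,(kn)^{2/3} = O\big((kn)^{2/3}\big).
\end{align*}
```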

Non-repetitive Case: Encoding Scheme

Gap: k vs. O(k W)

t ≥ 1 “non-repetitiveness” parameter, W = O(k * t)

No substring of length t repeats within a window of size W

Alice and Bob choose a sequence of “anchors” in a coordinated way

π_1: a random permutation on {0,1}^t

a_1: minimal length-t substring of x_1 (under π_1)

b_1: minimal length-t substring of y_1 (under π_1)
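Selecting the first anchor can be illustrated with a short Python sketch. It covers only the step shown on this slide (one window, one permutation); how later windows and anchors are chosen is not recoverable from the slides and is not modeled. Enumerating all of {0,1}^t is an assumption that is practical only for small t, and the function names are hypothetical.

```python
import itertools
import random

def random_permutation_rank(t, seed):
    """A random permutation of {0,1}^t, returned as a map from string to rank.

    The seed plays the role of the shared randomness.
    """
    strings = ["".join(bits) for bits in itertools.product("01", repeat=t)]
    random.Random(seed).shuffle(strings)
    return {s: rank for rank, s in enumerate(strings)}

def first_anchor(window, t, rank):
    """(offset, substring) of the minimal length-t substring of the window.

    For a non-repetitive window the minimum is unique; ties, if any,
    are broken by the leftmost offset.
    """
    candidates = [(rank[window[i:i + t]], i) for i in range(len(window) - t + 1)]
    _, i = min(candidates)
    return i, window[i:i + t]

rank = random_permutation_rank(3, seed=7)
x1 = "0010111"          # all length-3 substrings are distinct
y1 = x1[1:]             # x1 with its first character deleted
print(first_anchor(x1, 3, rank))
print(first_anchor(y1, 3, rank))
```

Because both parties rank substrings with the same permutation, they pick the same substring as anchor whenever it survives the edits, even though its offset shifts.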

Encoding scheme (cont.)

[Figure: x and y partitioned by their anchors into substrings 1–8]

S_x = { (s_1,1),…,(s_8,8) }

S_y = { (s'_1,1),…,(s'_8,8) }

s_i, s'_i: the i-th substrings of x and y delimited by the anchors

Analysis

Case 1: ED(x,y) ≤ k.

• All anchors are “unmarked” with probability 1 − kt/W = Ω(1)

• If a_i, b_i are unmarked, they are aligned

• # of mismatching substrings ≤ 2k

• Ham(S_x,S_y) ≤ 2k

Analysis (cont.)

Case 2: Ham(S_x,S_y) ≤ 4k

• # of mismatching substrings ≤ 4k

• ED(x,y) ≤ 2 · W · 4k = O(k W).

Approximation in Linear Time

Arbitrary Strings

Algorithm              Gap                 Time                     Approx. factor in O(n) time
Dynamic Programming    k vs. k+1           O(kn)                    None
Batu et al             O(n^α) vs. Ω(n)     O(n^max(α/2, 2α − 1))    None
Cole, Hariharan        k vs. 2k            O(n + k^4)               O(n^(3/4))
This paper             k vs. k^(7/4)       O(n)                     O(n^(3/7))

Non-repetitive Strings

Algorithm              Gap                 Time                     Approx. factor in O(n) time
Cole, Hariharan        k vs. 2k            O(n + k^3)               O(n^(2/3))
This paper             k vs. k^(3/2)       O(n)                     O(n^(1/3))

Summary and Open Problems

• Designed efficient approximation schemes for edit distance.

– Best sketching and linear-time approximations to date

• Subsequent work:

– O(n^(2/3)) distortion embedding of edit distance into L1 [Indyk 04] [Rabani 04]

– Better embeddings of edit distance into L1 [Ostrovsky, Rabani, 05]

– Embeddings of the Ulam metric into L1 [Charikar, Krauthgamer, 05]

• Open Problems

– Sketch size lower bounds

– Constant factor approximations in linear time

– Better embeddings of edit distance

– Sketching schemes for other distance measures

Thank You
