On Embedding Edit Distance into L_1
On Embedding Edit Distance into L1
Robert Krauthgamer (Weizmann Institute and IBM Almaden)
Based on joint work
(i) with Moses Charikar,
(ii) with Yuval Rabani,
(iii) with Parikshit Gopalan and T.S. Jayram,
(iv) with Alex Andoni.
Edit Distance
Let $x \in \Sigma^n$ and $y \in \Sigma^m$.
ED(x,y) = Minimum number of character insertions, deletions and substitutions that transform x to y. [aka Levenshtein distance]
Examples:
ED(00000, 1111) = 5
ED(01010, 10101) = 2
Applications:
• Genomics
• Text processing
• Web search
For simplicity: m = n.
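The definition of ED can be checked on the two examples with the textbook dynamic program. This is a minimal sketch; the function name `edit_distance` is an illustrative choice, not something from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the standard dynamic program, O(len(x) * len(y)) time."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x to prefixes of y
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cx
                curr[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),   # substitute (free when the characters match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("00000", "1111"))   # the first example on the slide
print(edit_distance("01010", "10101"))  # the second example on the slide
```

The second example shows why edit distance differs from Hamming distance: the two strings disagree in every position, yet one deletion and one insertion suffice.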
Embedding into L1
An embedding of (X,d) into l1 is a map $f : X \to \ell_1$. It has distortion $K \ge 1$ if
$d(x,y) \le \|f(x)-f(y)\|_1 \le K \cdot d(x,y)$ for all $x,y \in X$.
Very powerful concept (when distortion is small)
Goal: Embed edit distance into l1 with small distortion
Motivation:
• Reduce algorithmic problems to l1. E.g. Nearest-Neighbor Search
• Study a simple metric space without norm. E.g. Hamming cube w/cyclic shifts.
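For a finite point set, the distortion of a given map can be computed directly from the definition. The sketch below assumes the map may be rescaled by a constant, so it reports the ratio between the largest and smallest stretch; the names `l1` and `distortion` are illustrative, and the map is assumed to be injective.

```python
from itertools import combinations


def l1(u, v):
    """l1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distortion(points, d, f):
    """Distortion of f on a finite metric (points, d), allowing a global rescaling of f.

    Equals (max stretch) / (min stretch) over all pairs of distinct points.
    """
    ratios = [l1(f(x), f(y)) / d(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


# Toy example: three points on a line, mapped by x -> (x^2,).
points = [0, 1, 2]
print(distortion(points, lambda x, y: abs(x - y), lambda x: (x * x,)))
```

An isometric map (for instance the identity on the Hamming cube) has distortion exactly 1 under this definition.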
Known Results for Edit Distance
Embed ({0,1}^n, ED) into L1:
• Upper bound: $2^{O(\sqrt{\log n})}$ [Ostrovsky-Rabani’05]. Previous bound: $O(n^{2/3})$ [Bar-Yossef-Jayram-K.-Kumar’04]
• Lower bound: $\Omega(\log n)$ [K.-Rabani’06]. Previous bounds: $(\log n)^{1/2-o(1)}$ [Khot-Naor’05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova’03]
Large gap … despite significant effort!!!
Submetrics (Restricted Strings)
Why focus on submetrics of edit distance?
• May admit smaller distortion
• Partial progress towards general case
• A framework for analyzing non-worst-case instances
Example (a la computational biology): Handle only “typical” strings
Class 1: A string is k-non-repetitive if all its k-substrings are distinct
• A random 0-1 string is WHP (2 log n)-non-repetitive
• Yields a submetric containing 1-o(1) fraction of the strings
Class 2: Ulam metric = edit distance on all permutations (here $\Sigma = \{1,\ldots,n\}$)
• Every permutation is 1-non-repetitive
• Note: k-non-repetitive strings embed into Ulam with distortion k.
[Figure: the string “Theory of Computation Seminar, Computer Science Department”, labeled k=7]
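The k-non-repetitive condition is easy to test mechanically. The following is a small sketch that reads "k-substring" as a contiguous window of length k; the function name is hypothetical.

```python
def is_k_non_repetitive(s, k: int) -> bool:
    """Return True if all contiguous length-k substrings of s are pairwise distinct."""
    seen = set()
    for i in range(len(s) - k + 1):
        window = tuple(s[i:i + k])  # tuple() makes windows hashable for strings and lists alike
        if window in seen:
            return False
        seen.add(window)
    return True


print(is_k_non_repetitive("0110", 2))      # windows 01, 11, 10
print(is_k_non_repetitive("0101", 2))      # window 01 occurs twice
print(is_k_non_repetitive([3, 1, 2], 1))   # a permutation: every symbol occurs once
```

The last call illustrates the remark that every permutation is 1-non-repetitive.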
Known Results for Ulam Metric
Embed Ulam metric into L1 (near-tight!):
• Upper bound: $O(\log n)$ [Charikar-K.’06] (New proof by [Gopalan-Jayram-K.])
• Lower bound: $\Omega(\log n/\log\log n)$ [Andoni-K.’07] (Actually qualitatively stronger)
Embed ({0,1}^n, ED) into L1 (large gap):
• Upper bound: $2^{O(\sqrt{\log n})}$ [Ostrovsky-Rabani’05]
• Lower bound: $\Omega(\log n)$ [K.-Rabani’06]
Embedding of permutations
Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into l1 with distortion $O(\log n)$.
Proof. Define $f$ with one coordinate $f_{a,b}$ per pair of symbols $\{a,b\}$ [the defining formula is missing from the transcript].
Intuition: $\mathrm{sign}(f_{a,b}(P))$ is an indicator for “a appears before b” in P. Thus, $|f_{a,b}(P)-f_{a,b}(Q)|$ “measures” if {a,b} is an inversion in P vs. Q.
Claim 1: $\|f(P)-f(Q)\|_1 \le O(\log n) \cdot ED(P,Q)$
• Suppose Q is obtained from P by moving one symbol, say ‘s’. The total contribution of
    • coordinates $\{a,b\}$ with $s \in \{a,b\}$ is $2\sum_k (1/k) \le O(\log n)$
    • other coordinates is $\sum_k k\,(1/k - 1/(k+1)) \le O(\log n)$
• General case then follows by applying the triangle inequality on P,P’,P’’,…,Q
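To make the intuition concrete, here is a sketch of one embedding with the stated sign property. It ASSUMES the coordinate formula $f_{a,b}(P) = 1/(P^{-1}[b] - P^{-1}[a])$ for $a < b$, which matches the sign intuition and the $1/k$ terms in Claim 1, but the slide's own formula did not survive in the transcript. The names `ulam_embedding` and `l1_distance` are illustrative.

```python
from itertools import combinations


def ulam_embedding(P):
    """Map a permutation P (sequence of distinct symbols) to one coordinate per pair a < b.

    ASSUMED formula: f_{a,b}(P) = 1 / (pos(b) - pos(a)), where pos is the index in P.
    The coordinate is positive exactly when a appears before b.
    """
    pos = {s: i for i, s in enumerate(P)}
    return {(a, b): 1.0 / (pos[b] - pos[a]) for a, b in combinations(sorted(P), 2)}


def l1_distance(fP, fQ):
    """l1 distance between two embedded permutations over the same symbol set."""
    return sum(abs(fP[key] - fQ[key]) for key in fP)


P = [1, 2, 3]
Q = [2, 1, 3]  # obtained from P by moving the symbol 1 one step to the right
print(ulam_embedding(Q)[(1, 2)] < 0)  # 1 no longer appears before 2, so the sign flips
print(l1_distance(ulam_embedding(P), ulam_embedding(Q)))
```

Under the assumed formula, the pair {1,2} contributes the most to the distance here because it is the only pair whose order changes; the other two pairs only shift by one position.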
Embedding of permutations – cont.
Claim 1: $\|f(P)-f(Q)\|_1 \le O(\log n) \cdot ED(P,Q)$
Claim 2: $\|f(P)-f(Q)\|_1 \ge \tfrac{1}{2}\, ED(P,Q)$
• Assume wlog that P=identity
• Edit Q into an increasing sequence (thus into P) using quicksort:
    • Choose a random pivot
    • Delete all characters inverted wrt the pivot
    • Repeat recursively on left and right portions
• Now argue $\|f(P)-f(Q)\|_1 \ge E[\#\text{quicksort deletions}] \ge \tfrac{1}{2}\, ED(P,Q)$
    • Surviving subsequence is increasing, so ED(P,Q) ≤ 2 #deletions
    • For every inversion (a,b) in Q: $\Pr[a \text{ deleted “by” pivot } b] \le 1/|Q^{-1}[a]-Q^{-1}[b]+1| \le 2\,|f_{a,b}(P) - f_{a,b}(Q)|$
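The quicksort-style editing procedure from Claim 2 can be sketched as follows, assuming P is the identity so that "inverted wrt the pivot" means larger and to the left, or smaller and to the right. The function name and the optional `rng` parameter are illustrative additions.

```python
import random


def quicksort_deletions(Q, rng=random):
    """Randomized quicksort-style editing of a sequence Q of distinct symbols.

    Returns (survivors, deleted): survivors form an increasing subsequence of Q,
    and deleted lists every symbol removed for being inverted w.r.t. some pivot.
    """
    if len(Q) <= 1:
        return list(Q), []
    p = rng.randrange(len(Q))  # choose a random pivot position
    pivot = Q[p]
    # Keep symbols on the correct side of the pivot; the rest are inverted w.r.t. it.
    left = [s for s in Q[:p] if s < pivot]
    right = [s for s in Q[p + 1:] if s > pivot]
    deleted = [s for s in Q[:p] if s > pivot] + [s for s in Q[p + 1:] if s < pivot]
    # Recurse on the left and right portions.
    left_kept, left_deleted = quicksort_deletions(left, rng)
    right_kept, right_deleted = quicksort_deletions(right, rng)
    return left_kept + [pivot] + right_kept, deleted + left_deleted + right_deleted


kept, deleted = quicksort_deletions([3, 1, 4, 5, 2], random.Random(0))
print(kept, deleted)
```

Since the survivors are increasing, re-inserting each deleted symbol in its sorted place turns Q into the identity, which is the source of the bound ED(P,Q) ≤ 2 #deletions.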
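Lower bound for 0-1 strings
Theorem [K.-Rabani’06]: Embedding of ({0,1}^n,ED) into L1 requires distortion $\Omega(\log n)$
Proof sketch:
• Suppose f embeds with distortion $D \ge 1$, and let $V=\{0,1\}^n$.
• By the cut-cone characterization of L1:
    • For every two symmetric probability distributions over $V \times V$, inequality (*) holds
    • The embedding f into L1 can be written as a combination of cuts
    • Hence, (*) follows
[The formulas for these three steps are missing from the transcript.]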
Lower bound for 0-1 strings – cont.
By the cut-cone characterization of L1: for every two symmetric probability distributions over $V \times V$, inequality (*) holds [formula missing from the transcript].
We choose one distribution uniform over $V \times V$ and the other $= \tfrac{1}{2}(H+S)$, where
• H = random point + random bit flip (uniform over $E_H=\{(x,y): \|x-y\|_1=1\}$)
• S = random point + a cyclic shift (uniform over $E_S=\{(x,S(x))\}$)
The RHS of (*) evaluates to $O(D/n)$ by a counting argument.
Main Lemma: For all $A \subseteq V$, the LHS of (*) is $\Omega(\log n)/n$. [Analysis of Boolean functions on the hypercube]
Lower bound for 0-1 strings – cont.
Recall that one distribution is $\tfrac{1}{2}(H+S)$, where
• H = random point + random bit flip
• S = random point + a cyclic shift
Lemma: For all $A \subseteq V$, the LHS of (*) is $\Omega(\log n)/n$.
Proof sketch: Assume to the contrary, and define $f = 1_A$.
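Lower bound for 0-1 strings – cont.
Claim: $I_j \ge 1/n^{1/8} \Rightarrow I_{j+1} \ge 1/(2n^{1/8})$
Proof: [Diagram: a commuting square starting at x. Flipping bit j maps x to $x+e_j$; a cyclic shift maps x to S(x) and $x+e_j$ to $S(x+e_j)$; flipping bit j+1 maps S(x) to $S(x)+e_{j+1} = S(x+e_j)$.]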
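The identity behind the diagram, $S(x+e_j) = S(x)+e_{j+1}$, can be verified exhaustively for small n. The sketch assumes S moves the bit at index j to index j+1 (mod n); the slides do not fix the direction of the shift, and the helper names are illustrative.

```python
from itertools import product


def flip(x, j):
    """Return x + e_j: the bit tuple x with bit j flipped."""
    return x[:j] + (1 - x[j],) + x[j + 1:]


def shift(x):
    """Cyclic shift S (assumed direction): the bit at index j moves to index j+1 mod n."""
    return x[-1:] + x[:-1]


n = 4
commutes = all(
    shift(flip(x, j)) == flip(shift(x), (j + 1) % n)
    for x in product((0, 1), repeat=n)
    for j in range(n)
)
print(commutes)
```

This is what lets the proof transfer a statement about bit j to one about bit j+1: an edge in direction j is carried by S to an edge in direction j+1.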
Communication Complexity Approach
Alice holds $x \in \Sigma^n$; Bob holds $y \in \Sigma^n$; they share randomness and exchange $CC_A$ bits.
Distance Estimation Problem: decide whether $d(x,y) \ge R$ or $d(x,y) \le R/A$
Communication complexity model:
• Two-party protocol
• Shared randomness
• Promise (gap) version
• A = approximation factor
• $CC_A$ = min. # bits to decide whp…
Previous communication lower bounds:
• $\ell_\infty$ [Saks-Sun’02, Bar-Yossef-Jayram-Kumar-Shivakumar’04]
• $\ell_1$ [Woodruff’04]
• Earthmover [Andoni-Indyk-K.’07]
Communication Bounds for Edit Distance
A tradeoff between approximation and communication
Theorem [Andoni-K.’07]: [the tradeoff inequality relating $CC_A$, A and n is garbled in the transcript]
For Hamming distance: $CC_{1+\epsilon} = \Theta(1/\epsilon^2)$ [Kushilevitz-Ostrovsky-Rabani’98], [Woodruff’04]
First computational model where edit is provably harder than Hamming!
Corollary 1: Approximation A=O(1) requires $CC_A \ge \Omega(\log\log n)$
Corollary 2: Communication $CC_A=O(1)$ requires $A \ge \Omega^*(\log n)$
Implications to embeddings:
• Embedding ED into L1 (or squared-L2) requires distortion $\Omega^*(\log n)$
• Furthermore, holds for both 0-1 strings and permutations (Ulam)
Proof Outline
Step 1 [Yao’s minimax Theorem]: Reduce to distributional complexity
• If $CC_A \le k$ then for every two distributions $\mu_{far}, \mu_{close}$ there is a k-bit deterministic protocol with success probability $\ge 2/3$
Step 2 [Andoni-Indyk-K.’07]: Reduce to 1-bit protocols
• Further to above, there are Boolean functions $s_A, s_B : \Sigma^n \to \{0,1\}$ with advantage
  $\Pr_{(x,y)\in\mu_{far}}[s_A(x) \ne s_B(y)] - \Pr_{(x,y)\in\mu_{close}}[s_A(x) \ne s_B(y)] \ge \Omega(2^{-k})$
Step 3 [Fourier expansion]: Reduce to one Fourier level
• Furthermore, $s_A, s_B$ depend only on fixed positions $j_1,\ldots,j_\ell$
Step 4 [Choose distribution]: Analyze $(x,y) \in \mu$ projected on these positions
• Let $\mu_{close}, \mu_{far}$ include $\epsilon$-noise, to handle a high level
• Let $\mu_{close}, \mu_{far}$ include (few/more) block rotations, to handle a low level
• Key property: the distribution of $(x_{j_1},\ldots,x_{j_\ell}, y_{j_1},\ldots,y_{j_\ell})$ is “statistically close” under $\mu_{far}$ vs. under $\mu_{close}$
• Compare this additive analysis to our previous analysis: [comparison formula missing from the transcript]
Step 5: Reduce Ulam to {0,1}^n
• A random mapping $\Sigma \to \{0,1\}$ works
Summary of Known Results
Embed Ulam metric into L1:
• Upper bound: $O(\log n)$ [Charikar-K.’06] (New proof by [Gopalan-Jayram-K.])
• Lower bound: $\Omega(\log n/\log\log n)$ [Andoni-K.’07] (Qualitatively much stronger)
Embed ({0,1}^n, ED) into L1:
• Upper bound: $2^{O(\sqrt{\log n})}$ [Ostrovsky-Rabani’05]
• Lower bound: $\Omega(\log n)$ [K.-Rabani’06]
Concluding Remarks
The computational lens
• Study Distance Estimation problems rather than embeddings
Open problems:
• Still large gap for 0-1 strings
• Variants of edit distance (e.g. edit distance with block-moves)
• Rule out other algorithms (e.g. “CC model” capturing Indyk’s NNS for $\ell_\infty$)
Recent progress:
• Bypass L1-embedding by devising new techniques. E.g. using max ($\ell_\infty$) product for NNS under Ulam metric [Andoni-Indyk-K.]
• Analyze/design “good” heuristics. E.g. smoothed analysis [Andoni-K.]