Faster Algorithm for String Matching with k Mismatches (II)
Amihood Amir, Moshe Lewenstein, Ely Porat
Journal of Algorithms, Vol. 50, 2004, pp. 257-275
Date: Dec. 24, 2004
Created by: Hsing-Yen Ann




2004/11/22 Hsing-Yen Ann

Problem Definition

String matching with k mismatches:

Input:
    Text T = t1 t2 ... tn
    Pattern P = p1 p2 ... pm
    A natural number k

Output:
    All pairs <i, ham(P, T[i, i+m-1])>,
    where 1 ≤ i ≤ n-m+1 and ham(P, T[i, i+m-1]) ≤ k

ham(): Hamming distance (number of mismatching positions)
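As a baseline for the definition above, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper; it only pins down the expected output format. The function names `hamming` and `k_mismatch_naive` are illustrative choices, not names from the source.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """Return (i, distance) for every 1-based start i with distance <= k.

    Runs in O(n * m) time; intended only as a reference implementation.
    """
    n, m = len(text), len(pattern)
    result = []
    for start in range(n - m + 1):
        d = hamming(pattern, text[start:start + m])
        if d <= k:
            result.append((start + 1, d))  # 1-based i, as in the definition
    return result
```

For example, `k_mismatch_naive("abcabd", "abd", 1)` reports position 1 with one mismatch and position 4 with none.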


Algorithm for Solving this Problem

Two-stage algorithm:

Marking stage
    Identifies the potential starts of the pattern.
    Reduces the number of locations to be verified.
    This stage is the focus of the paper.

Verification stage
    Verifies which of the potential candidates is indeed a pattern occurrence.
    Uses the Kangaroo method for speed-up: O(1) time to jump to the next mismatch.
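The verification stage can be sketched as follows. The Kangaroo method jumps from one mismatch to the next using longest-common-extension queries; the O(1) jump normally relies on a preprocessed structure (such as a suffix tree with constant-time lowest common ancestor queries). The sketch below substitutes a naive character-by-character scan for that structure, so it shows the control flow only, not the time bound. `lce_naive` and `verify_candidate` are hypothetical names.

```python
def lce_naive(text: str, i: int, pattern: str, j: int) -> int:
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Stand-in only: the real method answers this query in O(1)
    after preprocessing; this version scans character by character.
    """
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def verify_candidate(text: str, pattern: str, start: int, k: int):
    """Return the mismatch count at 0-based `start` if it is <= k, else None."""
    m = len(pattern)
    if start < 0 or start + m > len(text):
        return None
    mismatches = 0
    j = 0
    while j < m:
        j += lce_naive(text, start + j, pattern, j)  # jump over the matching run
        if j >= m:
            break
        mismatches += 1          # position j of the pattern is a mismatch
        if mismatches > k:
            return None          # give up after k + 1 mismatches
        j += 1
    return mismatches
```

With constant-time jumps, each candidate costs O(k) time, because the loop stops after at most k + 1 mismatches.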


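Previous Conclusion

This problem can be solved by the previously presented algorithm in $O(n\sqrt{k \log m})$.

When $k \ge m^{1/3}$: $\log m = O(\log k)$, so $O(n\sqrt{k \log m}) = O(n\sqrt{k \log k})$.

When $k < m^{1/3}$: use another algorithm, which runs in $O((n + nk^3/m)\log k)$. Since $k < m^{1/3}$, $O((n + nk^3/m)\log k) = O(n \log k)$.

Finally, this problem can be solved in $O(n\sqrt{k \log k})$.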
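The case split can be checked with a short calculation. This is a sketch that assumes the two bounds are $O(n\sqrt{k\log m})$ for the earlier algorithm and $O((n + nk^3/m)\log k)$ for the algorithm of this part, as the formula fragments on the slide suggest.

```latex
\begin{align*}
k \ge m^{1/3} &\;\Rightarrow\; \log m \le 3\log k
  \;\Rightarrow\; O\!\left(n\sqrt{k\log m}\right) = O\!\left(n\sqrt{k\log k}\right),\\
k < m^{1/3} &\;\Rightarrow\; k^3 < m
  \;\Rightarrow\; \frac{nk^3}{m} < n
  \;\Rightarrow\; O\!\left(\left(n + \frac{nk^3}{m}\right)\log k\right) = O(n\log k).
\end{align*}
```

Since $\log k \le \sqrt{k\log k}$ for $k \ge 2$, the second bound is never worse than $O(n\sqrt{k\log k})$, which gives the overall bound.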


Periodicity

periodic: S is periodic if S = u^j w, where j ≥ 2 and w is a prefix of u.

aperiodic: a string that is not periodic.

Periodic            Aperiodic
A A A A A A A       A B C D E
AB AB AB A          AB A
ABCD ABCD ABC       ABCD ABC
A A                 A
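The definition can be tested mechanically. A string of length n can be written as u^j w with j ≥ 2 and w a prefix of u exactly when its smallest period p satisfies 2p ≤ n. The sketch below computes the smallest period with the standard border (failure) array; `smallest_period` and `is_periodic` are illustrative names, not taken from the paper.

```python
def smallest_period(s: str) -> int:
    """Smallest p >= 1 such that s[i] == s[i + p] for all valid i (0 for empty s)."""
    n = len(s)
    if n == 0:
        return 0
    border = [0] * n  # border[i]: longest proper border of s[:i + 1]
    for i in range(1, n):
        b = border[i - 1]
        while b > 0 and s[i] != s[b]:
            b = border[b - 1]
        if s[i] == s[b]:
            b += 1
        border[i] = b
    return n - border[-1]


def is_periodic(s: str) -> bool:
    """True if s = u^j w with j >= 2 and w a prefix of u."""
    return len(s) > 0 and 2 * smallest_period(s) <= len(s)
```

On the examples from the table, `is_periodic` returns True for AAAAAAA, ABABABA, ABCDABCDABC and AA, and False for ABCDE, ABA, ABCDABC and A.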


Breaks

break: an aperiodic substring of a string S.
l-break: a break of length l.

Cole and Hariharan [9] give a linear-time algorithm that finds all l-breaks for a given l.

[Figure: a string S with periodic stretches 1 to 4 and the breaks (aperiodic substrings of S) marked.]


Breaks (cont’d)

The useful property of a break: if an l-break of P matches T exactly at position i, then the next position in T at which this l-break matches exactly is at least i + l/2.

[Figure: text T with an l-break matching at position i; the next exact match of the same l-break starts at i + l/2 or later.]
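The spacing property can be observed on small inputs. If an aperiodic string occurred twice at distance d ≤ l/2, it would have period d with 2d ≤ l and would therefore be periodic. The helper names below (`occurrences`, `min_gap`) are illustrative only.

```python
def occurrences(text: str, piece: str) -> list[int]:
    """All 0-based start positions of `piece` in `text`, overlaps included."""
    positions = []
    pos = text.find(piece)
    while pos != -1:
        positions.append(pos)
        pos = text.find(piece, pos + 1)
    return positions


def min_gap(positions: list[int]):
    """Smallest distance between consecutive positions, or None if fewer than two."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return min(gaps) if gaps else None
```

For the aperiodic string ABA (l = 3) in the text ABABABA, the occurrences are 2 apart, which respects the l/2 bound. For the periodic string AAAA (l = 4) in AAAAAA, consecutive occurrences are only 1 apart, so the bound fails without aperiodicity.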


Some Lemmas

Lemma 3:
Let P be a pattern with 2k disjoint l-breaks and let T be a text. In each match (with at most k mismatches) of P in T, at least k of the l-breaks match exactly.

Lemma 4:
Let P be a pattern of length m with fewer than 2k l-breaks, and let T be a text of length 2m. Then all matches of P in T lie in a substring of T that has O(k) l-breaks.
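Lemma 3 follows from a counting argument, sketched here. The notation $e_j$ for the number of mismatches falling inside the j-th break is introduced only for this sketch.

```latex
\begin{align*}
&\text{Let } b_1, \dots, b_{2k} \text{ be the disjoint } l\text{-breaks and let } e_j \ge 0
  \text{ be the number of mismatches inside } b_j.\\
&\text{The breaks are disjoint, so } \sum_{j=1}^{2k} e_j \le k
  \;\Rightarrow\; \left|\{\, j : e_j \ge 1 \,\}\right| \le k\\
&\Rightarrow\; \left|\{\, j : e_j = 0 \,\}\right| \ge 2k - k = k .
\end{align*}
```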


Time Complexity on Different Cases

Case 1: There are at least 2k disjoint k-breaks in P.
    Time: O(n+m) = O(n)

Case 2: There are at least 2k disjoint l-breaks in P, where 2 ≤ l ≤ k-1.
    Time: O(k log k) for each local match

Case 3: There are not even 2k disjoint 2-breaks.
    Dominated pattern: $O(n + m \log k + (nk^3 \log k)/m)$
    Non-dominated pattern: $O(n + m \log k + (nk^4 \log k)/m)$


Algorithm for 2k k-breaks in P

Algorithm:
1. Find all exact matches of all breaks in the text.
2. For every such match, mark every text location at which a pattern occurrence would align this break with the match.
3. Discard every text location that has fewer than k marks.

Result:
1. There are at most (4n)/k candidates left.
2. The candidates can be marked in O(n+m) time.
3. The verification stage needs O(n) time.
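The three marking steps can be sketched as follows. This is an illustration under simplifying assumptions, not the linear-time procedure: the breaks are passed in as (offset, length) pairs rather than computed, and each break is located with Python's `str.find` instead of a single multi-pattern scan. The name `mark_candidates` is hypothetical.

```python
def mark_candidates(text: str, pattern: str,
                    breaks: list[tuple[int, int]], k: int) -> list[int]:
    """Return 0-based pattern starts that collect at least k marks.

    `breaks` holds (offset, length) pairs of disjoint breaks inside `pattern`.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []
    marks = [0] * (n - m + 1)
    for offset, length in breaks:
        piece = pattern[offset:offset + length]
        pos = text.find(piece)
        while pos != -1:                 # step 1: every exact match of the break
            start = pos - offset         # step 2: the start this match votes for
            if 0 <= start <= n - m:
                marks[start] += 1
            pos = text.find(piece, pos + 1)
    # step 3: keep only the locations with at least k marks
    return [i for i, count in enumerate(marks) if count >= k]
```

For instance, with pattern `abcdef`, breaks `abc` and `def`, and k = 1, the text `xxabcdefxxabcxefxx` keeps starts 2 and 10; start 10 is a genuine occurrence with one mismatch.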


Algorithm for 2k k-breaks in P (cont’d)

[Figure: text T with the exact matches of breaks b1 to b5 of P; a location i that is marked by three breaks has mark[i] = 3. Separate rows show the matches of b1, b2 and b3 in T; matches of the same break overlap by at most l/2.]


Algorithm for 2k l-breaks in P

Algorithm:
1. Let S = {b1, …, b2k} be a set of 2k disjoint l-breaks of P.
2. Let S' = {b1', …, bf'} be the set of distinct breaks in S. S' can be found in O(m) time.

[Figure: a pattern P containing breaks b1 to b5 gives S = {b1, b2, b3, b4, b5}; keeping only the distinct breaks gives S' = {b1', b2', b3'}.]


Algorithm for 2k l-breaks in P (cont’d)

3. Partition the text T into the local matching form $T' = \{T'_1, T'_2, \ldots, T'_{2n/k-1}\}$.

Local match: Split the text T into 2n/k - 1 overlapping substrings, each of length k, giving $T' = \{T'_1, T'_2, \ldots, T'_{2n/k-1}\}$. Then solve the problem by performing the local match on each substring separately.
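A minimal sketch of the split, assuming that consecutive pieces overlap by half their length (which is what yields 2n/k - 1 pieces) and that k is even and divides n. The function name `split_overlapping` is illustrative.

```python
def split_overlapping(text: str, k: int) -> list[str]:
    """Pieces of length k that start every k // 2 positions.

    Assumes k is even and divides len(text), so that exactly
    2 * len(text) // k - 1 pieces are produced.
    """
    step = k // 2
    return [text[s:s + k] for s in range(0, len(text) - k + 1, step)]
```

With n = 12 and k = 4 this gives 2·12/4 - 1 = 5 pieces. Every window of at most k/2 + 1 consecutive characters lies entirely inside at least one piece.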

[Figure: text T covered by the overlapping pieces $T'_1, T'_2, \ldots, T'_{2n/k-1}$.]


Algorithm for 2k l-breaks in P (cont’d)

4. For each piece Ti' and each break bj' in S' create a balanced binary tree Tree(i,j).

The height of each tree is O(log k). The number of trees is at most

|T'| × |S'| = (2n)/k × 2k = O(n).

[Figure: break b3' matches in piece T2' at locations 3, 14, 27 and 34; these four locations are stored in Tree(2,3).]
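The role of Tree(i,j) can be illustrated with a sorted list and binary search, which offers the same logarithmic lookup as a balanced binary tree for a static set of positions. This is a stand-in, not the data structure of the paper, and it assumes (as the figure suggests) that each tree stores the match locations of one break in one piece. `build_index` and `matches_in_range` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


def build_index(positions: list[int]) -> list[int]:
    """Sorted list of match locations; plays the role of Tree(i, j)."""
    return sorted(positions)


def matches_in_range(index: list[int], lo: int, hi: int) -> list[int]:
    """Stored locations p with lo <= p <= hi, found by binary search."""
    return index[bisect_left(index, lo):bisect_right(index, hi)]
```

Using the locations from the figure, `matches_in_range(build_index([3, 14, 27, 34]), 10, 30)` returns the two locations 14 and 27.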


Algorithm for 2k l-breaks in P (cont’d)

There are at most n leaf nodes in all trees.
=> The trees can be constructed in O(n) time.

Given l contiguous text locations, the (at most 4) candidates can be identified in time |S'| × O(log k) = O(k log k).
=> All the candidates can be marked in time |T'| × O(k log k) = O(n log k).

There are at most 4n/l candidates.
The verification stage needs O(n) time.


Algorithm for no 2k 2-breaks in P

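Definitions:

l-segments: the segments obtained by partitioning P into equal segments of size l.

Dominated pattern: a pattern in which at most 4k segments do not have the general period w.

Bad l-segment: an l-segment that is not fully within a periodic stretch of S.

[Figure: a string made of repetitions of w; a good l-segment lies inside the repetitions of w, while a bad l-segment does not.]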


Algorithm for no 2k 2-breaks in P (cont’d)

Lemma 6. Let P be a pattern with a dominating period w. In the partition of P into l-segments there are at most 8k bad l-segments.

The algorithm for dominated patterns runs in $O(n + m \log k + (nk^3 \log k)/m)$ time.

For a non-dominated pattern P, there exists a sparsifying substring P' of length Ω(m/k). Then P' is a dominated pattern. The algorithm runs in $O(n + (nk^4 \log k)/m)$ time.


Algorithm for no 2k 2-breaks in P (cont’d)

1. Find all matches of P in T at overlapping (bad l-segment) locations.
2. For each bad l-segment B, do pattern matching with pattern B and text w2l*.
3. Do pattern matching with mismatches, with pattern w and text w2l*.
4. Compute the number of mismatches of P at the first |w| locations of T using steps 2 and 3.
5. i ← |w| + 1.
6. While the end of the text is not reached:
    6a. if i is not an overlapping location
        6aa. number of mismatches at location i ← number of mismatches at location i-|w|,
        6ab. i ← i + 1;
    6b. else, if j is the next non-overlapping location
        6ba. for each of the bad l-segments that participate in an overlap in the overlapping locations (bad segment vs. bad segment) from i to j, update the number of mismatches it accrues in the next |w| locations,
        6bb. i ← j.
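The copy rule of step 6aa can be illustrated with a simplified sketch. It is not the procedure of the paper: instead of tracking bad-segment overlaps, it copies the count from location i - |w| whenever the text window covering both alignments is itself periodic with period |w| (in which case both alignments read identical characters), and otherwise recomputes the count by brute force rather than updating it incrementally as in step 6ba. The name `mismatch_counts` and the prefix-sum test are assumptions of this sketch.

```python
def mismatch_counts(text: str, pattern: str, period: int) -> list[int]:
    """Hamming distance of `pattern` at every 0-based start in `text`.

    Assumes period >= 1. Copies the count from i - period when
    text[i - period : i + m] has period `period`; otherwise recomputes.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []
    # pre[q] = number of positions r < q with text[r] == text[r + period]
    pre = [0]
    for q in range(n - period):
        pre.append(pre[-1] + (text[q] == text[q + period]))
    counts = []
    for i in range(n - m + 1):
        if i >= period and pre[i + m - period] - pre[i - period] == m:
            counts.append(counts[i - period])        # safe to copy
        else:
            window = text[i:i + m]                   # brute-force fallback
            counts.append(sum(a != b for a, b in zip(pattern, window)))
    return counts
```

On a text that is mostly repetitions of w, almost every location takes the constant-time copy branch, which is the effect the loop in step 6 exploits.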


Algorithm for no 2k 2-breaks in P (cont’d)

[Figure: all overlaps between bad l-segments of T and bad l-segments of P; there are at most O(k2) overlaps. Further panels show a bad segment of T aligned against w2l*, and w aligned against w2l*.]


Algorithm for no 2k 2-breaks in P (cont’d)

[Figure: two cases for location i, compared with location i-|w|.]

Overlap: add the number of mismatches at i-|w|, and compute the number of mismatches in the overlapping region.

No overlap: copy the number of mismatches at i-|w|.