Fast Parallel and Serial Approximate String Matching
Journal of Algorithms, Vol. 10 (1989), pp. 157-169
G. Landau and U. Vishkin
Advisor: Prof. R. C. T. Lee
Speaker: L. Y. Huang
Problem
• Given two arrays, P = p1p2…pm (the pattern) and T = t1t2…tn (the text), and an integer k (k ≧ 1), find all occurrences of the pattern in the text with edit distance at most k.
• This algorithm improves the Alternative Dynamic Programming Computation.
• First, we introduce the Dynamic Programming Computation.
The Dynamic Programming Algorithm[S80]
• In the dynamic programming approach, we construct an (n+1) × (m+1) matrix D, where Di,j is the minimum edit distance between P(1, j) and any substring of T which ends at Ti.
• Example:
T = gggtcta
P = gttc
k = 2
    g g g t c t a
  0 0 0 0 0 0 0 0
g 1 0 0 0 1 1 1 1
t 2 1 1 1 0 1 1 2
t 3 2 2 2 1 1 1 2
c 4 3 3 3 2 1 2 2
(Columns are text positions i = 0 to 7; rows are pattern positions j = 0 to 4.)
• We found:
– (1) gt, aligned with gttc: Distance = 2
– (2) gtc, aligned with gttc: Distance = 1
– (3) gtct, aligned with gttc: Distance = 2
– (4) gtct, aligned with gttc in a second way: Distance = 2
– (5) gtcta, aligned with gttc: Distance = 2
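The table above can be reproduced with a short script. This is only a sketch of the dynamic programming computation, not code from the paper: the function names `sellers_table` and `occurrences` are made up for illustration, and the table is stored as `D[j][i]` (pattern row first), whereas the slides write Di,j.

```python
def sellers_table(text, pattern):
    """Return D with D[j][i] = min edit distance between pattern[:j]
    and any substring of text that ends at text position i (1-based)."""
    n, m = len(text), len(pattern)
    # Row 0 is all zeros: an occurrence may start anywhere in the text.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, m + 1):
        D[j][0] = j  # j pattern characters against an empty substring
        for i in range(1, n + 1):
            cost = 0 if pattern[j - 1] == text[i - 1] else 1
            D[j][i] = min(D[j - 1][i - 1] + cost,  # match or substitution
                          D[j][i - 1] + 1,         # text character left unmatched
                          D[j - 1][i] + 1)         # pattern character left unmatched
    return D


def occurrences(text, pattern, k):
    """List (end position, distance) for every text position with distance <= k."""
    last_row = sellers_table(text, pattern)[len(pattern)]
    return [(i, dist) for i, dist in enumerate(last_row) if i >= 1 and dist <= k]
```

For T = gggtcta, P = gttc and k = 2, `occurrences` returns the end positions 4, 5, 6 and 7 with distances 2, 1, 2 and 2, which agrees with the five alignments listed above (two of them end at position 6).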
An alternative Dynamic Programming Computation
• We make heavy use of the concept of a diagonal.
• Diagonal d is defined as the set of all Di,j where d = i – j.
    a b c
  0 0 0 0
b 1 1 0 1
c 2 2 1 0
(Diagonal 0 runs through D0,0, D1,1 and D2,2; Diagonal 2 runs through D2,0 and D3,1.)
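To make the definition concrete, the entries of one diagonal can be read off a table directly. The snippet below is an illustrative sketch: the helper name `diagonal` is invented here, and the table is hard-coded from the small example above (text abc, pattern bc), stored as `D[j][i]`.

```python
def diagonal(D, d):
    """Entries D[j][i] with i - j == d, listed by increasing row j."""
    rows, cols = len(D), len(D[0])
    return [D[j][j + d] for j in range(rows) if 0 <= j + d < cols]


# Table for text "abc" and pattern "bc"; D[j][i], columns i = 0..3.
D = [
    [0, 0, 0, 0],  # j = 0 (empty pattern prefix)
    [1, 1, 0, 1],  # j = 1 (b)
    [2, 2, 1, 0],  # j = 2 (c)
]
```

Here `diagonal(D, 0)` gives `[0, 1, 1]` and `diagonal(D, 2)` gives `[0, 1]`. Values never decrease along a diagonal, which is why asking for the largest row holding a given value is well defined.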
• We first have the following:
– (a) If Ti = Pj, then Di,j = Di-1,j-1;
– (b) otherwise, Di,j is the minimum of Di-1,j-1 + 1 (substitution), Di,j-1 + 1 (deletion) and Di-1,j + 1 (insertion).
• Consider any diagonal d. Let us find the largest j, if it exists, such that (i,j) is on Diagonal d (i - j = d) and Di,j = 0.
• Let us now label all of these locations.
    g g g t c t a
  0 0 0 0 0 0 0 0
g   0 0 0
t         0
t
c
(Zero entries labelled on Diagonal 0, Diagonal 1 and Diagonal 2.)
• Having found the above locations (i, j) where Di,j = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and Di,j = 1.
• To do this, we use the following observation: Each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.
• Let us consider any (i, j) location on Diagonal d.
• Why can Di,j suddenly become 1?
– It can only be influenced by its three neighbours:
– Di-1,j-1, on Diagonal d (substitution);
– Di,j-1, on Diagonal d+1 (deletion);
– Di-1,j, on Diagonal d-1 (insertion).
• Thus, we conclude that we only need to consider Diagonals d-1, d and d+1.
• Let us consider the following table.
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0 0
t 0       0
t 0       ?
c 0
(The entry marked ? is D4,3, on Diagonal d = 1.)
• Question: what is the value of D4,3?
– It cannot be 0, because we have already determined that the largest j on Diagonal 1 with Di,j = 0 is 1.
– It is at most 1, because D4,2 = 0 and D4,3 ≦ D4,2 + 1. Thus D4,3 = 1.
• Question: What is the value of D5,4?
– Since T5 =P4, D5,4 =D4,3 =1.
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0 0
t 0       0
t 0       1
c 0         ?
(The entry marked ? is D5,4, on Diagonal d = 1.)
• Based upon the above discussion, we can find all (i, j)'s where Di,j = 1 after finding all (i', j')'s where Di',j' = 0.
• In fact, after finding all Di,j's where Di,j = e, we can find all (i', j')'s where Di',j' = e+1. Thus the dynamic programming table does not have to be computed.
• In the following, we shall give the Alternative Dynamic Programming Computation method formally.
• Let Ld,e denote the largest row j such that Di,j is on the Diagonal d (i- j = d) and Di,j =e.
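• Based upon this definition, e is the minimum edit distance between P(1, Ld,e) and any substring of T ending at text position Ld,e + d, and the pattern character at position Ld,e + 1 differs from the text character at position Ld,e + d + 1.
• Let d = 3. Then L3,0 = 0, L3,1 = 3 and L3,2 = 4.
    g g g t c t a
  0 0 0 0 0 0 0 0
g 1 0 0 0 1 1 1 1
t 2 1 1 1 0 1 1 2
t 3 2 2 2 1 1 1 2
c 4 3 3 3 2 1 2 2
(Diagonal 3 runs through D3,0 = 0, D4,1 = 1, D5,2 = 1, D6,3 = 1 and D7,4 = 2.)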
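The definition of Ld,e can be checked against the full table by brute force. The sketch below is for illustration only: it reads the values off a table that has already been computed, which is exactly the work the alternative computation avoids. The helper name `L_from_table` is hypothetical and the table is stored as `D[j][i]`.

```python
# Full table for T = gggtcta, P = gttc; D[j][i], columns i = 0..7.
D = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 1, 1],  # g
    [2, 1, 1, 1, 0, 1, 1, 2],  # t
    [3, 2, 2, 2, 1, 1, 1, 2],  # t
    [4, 3, 3, 3, 2, 1, 2, 2],  # c
]


def L_from_table(D, d, e):
    """Largest row j on diagonal d (i - j == d) with D[j][i] == e, or None."""
    best = None
    for j in range(len(D)):
        i = j + d
        if 0 <= i < len(D[0]) and D[j][i] == e:
            best = j
    return best
```

With this table, `L_from_table(D, 3, 0)`, `L_from_table(D, 3, 1)` and `L_from_table(D, 3, 2)` give 0, 3 and 4, matching the values stated for d = 3.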
• Example:
– T = gggtcta
– P = gttc
– k = 2
• Now, L3,1 = 3. It means that we have found a substring A = T(3,6) = gtct, ending at text position L3,1 + d = 3 + 3 = 6, such that the edit distance between A and P(1,3) = gtt is 1.
• Moreover, the next characters differ: P3+1 ≠ T3+3+1, that is, P4 = c ≠ T7 = a.
    g g g t c t a
  0 0 0 0 0 0 0 0
g 1 0 0 0 1 1 1 1
t 2 1 1 1 0 1 1 2
t 3 2 2 2 1 1 1 2
c 4 3 3 3 2 1 2 2
• Example:
– T = gggtcta
– P = gttc
– k = 2
• Now, L1,1 = 4 = m. It means that we have found a substring A = T(2,5) = ggtc, ending at text position L1,1 + d = 4 + 1 = 5, such that the edit distance between A and P(1,4) = gttc is 1.
• That is, T(2,5) = ggtc matches P = gttc with one difference.
• The alternative dynamic programming computation computes the values Ld,e instead of the entries Di,j.
An alternative Dynamic Programming Computation
• First, we set the initial values.
• Example:
– T = gggtcta
– P = gttc
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0
t 0
t 0
c 0
• e = 0
• For d = 0 to d = n, find the largest j such that P(1, j) is equal to T(d+1, d+j), and set Ld,0 = j.
• d = 0
• P1 = T1, L0,0 = 1
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0
t 0
t 0
c 0
• e = 0
• d = 1
• P1 = T2, L1,0 = 1
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0
t 0
t 0
c 0
• e = 0
• d = 2
• P1 = T3, P2 = T4, L2,0 = 2
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0 0
t 0       0
t 0
c 0
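The e = 0 phase above amounts to measuring, for every diagonal d, how far the pattern matches the text exactly when it is placed at text position d+1. A minimal sketch of that step follows; the function name `exact_prefix_rows` is an assumption of this example, and indices are 0-based inside the code.

```python
def exact_prefix_rows(text, pattern):
    """Return [L(d, 0) for d = 0..n]: the length of the longest prefix of
    pattern that matches text exactly starting at text position d + 1."""
    n, m = len(text), len(pattern)
    rows = []
    for d in range(n + 1):
        row = 0
        # Slide down diagonal d while pattern and text characters agree.
        while row < m and row + d < n and pattern[row] == text[row + d]:
            row += 1
        rows.append(row)
    return rows
```

For T = gggtcta and P = gttc the first three entries are 1, 1 and 2, matching L0,0, L1,0 and L2,0 above, and the entry for d = 3 is 0.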
• Our approach is based upon Rule 1 proposed by Professor Lee.
• Consider two strings A1 and A2 as shown below:
A1 = P1 S1
A2 = P2 S2
• If d(A1, A2) ≦ k and S1 = S2, then d(P1, P2) ≦ k.
• Observe the following:
• If d(A1, A2) = k, S1 = S2 and x ≠ y, then d(A1+S1+x, A2+S2+y) ≦ k+1.
(Figure: A1 followed by S1 and x, aligned with A2 followed by S2 and y.)
• For e ≠ 0, we search from d = -e to d = n.
• Let row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)], where the three terms correspond to substitution, deletion and insertion, respectively.
• Find the largest j, if it exists, such that P(row+1, j) = T(row+1+d, i), where i = j + d, and set Ld,e = j. If no such j exists, set Ld,e = row.
• Let row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)].
(Figure: Ld,e-1 on Diagonal d gives the substitution term, Ld-1,e-1 on Diagonal d-1 gives the deletion term, and Ld+1,e-1 on Diagonal d+1 gives the insertion term.)
• d = -1, e = 1
• row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[(L-1,0 + 1), (L-2,0), (L0,0 + 1)] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2, using the boundary values L-1,0 = 0 and L-2,0 = 0.
• P(row+1, j) ≠ T(row+1+d, i): P3 ≠ T2
• L-1,1 = 2
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0 0
t 0       0
t 0
c 0
• d = 0, e = 1
• row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[(L0,0 + 1), (L-1,0), (L1,0 + 1)] = max[1+1, 0, 1+1] = max[2, 0, 2] = 2, using the boundary value L-1,0 = 0.
• P(row+1, j) ≠ T(row+1+d, i): P3 ≠ T3
• L0,1 = 2
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0 0
t 0 1     0
t 0
c 0
• d = 1, e = 1
• row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[1+1, 1, 2+1] = max[2, 1, 3] = 3
• P(row+1, j) = T(row+1+d, i): P4 = T5 = c
• L1,1 = 4 = m
• We find an occurrence of the pattern in the text with edit distance at most 1 that ends at Td+m = T1+4 = T5.
    g g g t c t a
  0 0 0 0 0 0 0 0
g 0 0 0 0
t 0 1 1   0
t 0
c 0
• d = 3, e = 1
• row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[0+1, 2, 0+1] = max[1, 2, 1] = 2
• P(row+1, j) = T(row+1+d, i): P3 = T6, P4 ≠ T7
• L3,1 = 3
(Figure: the entries labelled so far, with Diagonal d = 3 highlighted.)
• d = 3, e = 2
• row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[3+1, 3, 2+1] = max[4, 3, 3] = 4
• L3,2 = 4 = m
• We find an occurrence of the pattern in the text with edit distance at most 2 that ends at Td+m = T3+4 = T7.
(Figure: the entries labelled so far, with Diagonal d = 3 highlighted.)
An alternative Dynamic Programming Computation
Initialization
  for all d, 0 ≦ d ≦ n: Ld,-1 = -1
  for all d, -(k+1) ≦ d ≦ -1: Ld,|d|-1 = |d|-1, Ld,|d|-2 = |d|-2
  for all e, -1 ≦ e ≦ k: Ln+1,e = -1
For e = 0 to k do
  For d = -e to n do
    row = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)]
    row = min(row, m)
    while row < m and row + d < n and p[row+1] = t[row+1+d] do
      row = row + 1
    Ld,e = row
    if Ld,e = m then
      print *there is an occurrence ending at t[d+m]*
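The procedure above translates fairly directly into Python. The following is a sketch under stated assumptions, not the authors' code: the function name `diagonal_matches` and the dictionary layout are invented here, the boundary values are chosen so that a diagonal d < 0 starts at row |d| with e = |d| differences, and an extra guard on the reported end position (not part of the procedure as written) discards rows that fall beyond the end of the text.

```python
def diagonal_matches(text, pattern, k):
    """Return {end position in text (1-based): smallest e <= k} for every
    position where the pattern occurs with at most k differences."""
    n, m = len(text), len(pattern)
    L = {}
    # Boundary values (an assumption of this sketch, see the lead-in).
    for d in range(0, n + 2):
        L[(d, -1)] = -1
    for d in range(-(k + 1), 0):
        L[(d, -d - 1)] = -d - 1
        L[(d, -d - 2)] = -d - 2
    for e in range(-1, k + 1):
        L[(n + 1, e)] = -1

    found = {}
    for e in range(k + 1):
        for d in range(-e, n + 1):
            row = max(L[(d, e - 1)] + 1,      # step along diagonal d
                      L[(d - 1, e - 1)],      # arrive from diagonal d-1
                      L[(d + 1, e - 1)] + 1)  # arrive from diagonal d+1
            row = min(row, m)
            # Extend along the diagonal while the characters agree.
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1
            L[(d, e)] = row
            end = d + m
            if row == m and 1 <= end <= n:
                found.setdefault(end, e)  # e only grows, so the first hit is minimal
    return found
```

For T = gggtcta, P = gttc and k = 2 this yields end positions 4, 5, 6 and 7 with 2, 1, 2 and 2 differences, the same positions the full table gives. The inner `while` loop is the part that the suffix tree and LCA queries later replace with a constant-time step.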
Difference in this algorithm
• In the alternative dynamic programming computation, we must search for j such that P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i).
• Essentially, we are looking for S1 and S2 in T and P respectively, as shown below:
(Figure: A1 followed by S1 and x, aligned with A2 followed by S2 and y.)
• This paper uses LCA (lowest common ancestor) queries to improve this searching part.
Algorithm
• This algorithm has two steps:
– Concatenate the text and the pattern into one string t1,…,tn,p1,…,pm. Compute the “suffix tree” of this string.
– Find all occurrences of the pattern in the text with edit distance at most k.
T = ABCDEA
P = DDBE
S = ABCDEADDBE
The suffix tree of a string of length n can be constructed in O(n) time (Weiner, 1973; McCreight, 1976; Ukkonen, 1995).
(Figure: suffix tree of S$ = ABCDEADDBE$, with leaves numbered 1 to 10.)
After O(n) preprocessing of the tree, the lowest common ancestor of two leaf nodes can be found in O(1) time (Harel and Tarjan, 1984).
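• To find such S, if it exists, we may concatenate T and P to form a new string.
• Obviously, on the suffix tree, suffixes S1 and S2 have a common ancestor S.
(Figure: in T, A1 is followed by S and then x; in P, A2 is followed by S and then y. S1 is the suffix of the concatenated string that starts with S x, and S2 is the suffix that starts with S y.)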
• If we want to compute L3,1, we will use L2,0, L3,0, L4,0 to decide the row value (row = 2).
• In this paper, we find that the length of LCA2,3 is 2, so q = 2 and L3,1 = row + 2 = 4.
(Figure: partially filled table for a second example, with Diagonal d = 3 highlighted and the matching substrings S1 and S2 marked.)
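The quantity q returned by the LCA query is the length of the longest common prefix of the remaining pattern and the remaining text on the current diagonal. The sketch below computes the same number by direct character comparison, which takes O(q) time instead of the O(1) achieved with the suffix tree; it is shown only to make the meaning of q concrete. The function name `common_extension` is hypothetical and positions are 0-based.

```python
def common_extension(text, pattern, row, d):
    """Length q of the longest common prefix of pattern[row:] and
    text[row + d:], i.e. how far diagonal d can be followed from `row`
    without a mismatch."""
    q = 0
    while (row + q < len(pattern)
           and row + d + q < len(text)
           and pattern[row + q] == text[row + d + q]):
        q += 1
    return q
```

In the earlier example (T = gggtcta, P = gttc), `common_extension(text, pattern, 2, 3)` is 1, so L3,1 = row + q = 2 + 1 = 3, as computed before.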
(Figure: suffix tree of the concatenated string S = gggtctacgttac, the text followed by the pattern.)
Time Complexity
• The alternative Dynamic Programming Computation takes O(mn) time.
• The suffix tree has O(n) nodes.
• An LCA query is answered in O(1) time.
• For each of the n+k+1 diagonals, we evaluate k+1 values Ld,e, each with one LCA query.
• This algorithm therefore takes O(nk) time.
Reference
• [AHU-74] A. V. AHO, J. E. HOPCROFT, AND J. D. ULLMAN, “The Design and Analysis of Computer Algorithms,” Addison-Wesley, Reading, MA, 1974.
• [AILSV-88] A. APOSTOLICO, C. ILIOPOULOS, G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree with applications, Algorithmica 3 (1988), 347-365.
• [BM-77] R. S. BOYER AND J. S. MOORE, A fast string searching algorithm, Comm. ACM 20 (1977), 762-772.
• [CS-85] M. T. CHEN AND J. SEIFERAS, Efficient and elegant subword tree construction, in “Combinatorial Algorithms on Words” (A. Apostolico and Z. Galil, Eds.), NATO ASI Series F: Computer and System Sciences Vol. 12, pp. 97-107, Springer-Verlag, New York/Berlin, 1985.
• [G-84] Z. GALIL, Optimal parallel algorithms for string matching, in “Proceedings, 16th ACM Symposium on Theory of Computing, 1984,” pp. 240-248; Inform. and Control 67 (1985), 144-157.
• [GG-86] Z. GALIL AND R. GIANCARLO, Improved string matching with k mismatches, SIGACT News 17, No. 4 (1986), 52-54.
• [GG-87] Z. GALIL AND R. GIANCARLO, Parallel string matching with k mismatches, Theoret. Comput. Sci. 51 (1987), 341-348.
• [GS-83] Z. GALIL AND J. I. SEIFERAS, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983), 280-294.
• [HT-84] D. HAREL AND R. E. TARJAN, Fast algorithms for finding nearest common ancestors, SIAM J. Comput. 13, No. 2 (1984), 338-355.
• [KMP-77] D. E. KNUTH, J. H. MORRIS, AND V. R. PRATT, Fast pattern matching in strings, SIAM J. Comput. 6 (1977), 323-350.
• [KR-87] R. KARP AND M. O. RABIN, Efficient randomized pattern-matching algorithms, IBM J. Res. Develop. 31, No. 2 (1987), 249-260.
• [LSV-87] G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree, in “Proceedings 14th ICALP,” Lecture Notes in Computer Science Vol. 267, pp. 314-325, Springer-Verlag, New York/Berlin, 1987.
• [LV-86a] G. M. LANDAU AND U. VISHKIN, Introducing efficient parallelism into approximate string matching, in “Proc. 18th ACM Symposium on Theory of Computing, 1986,” pp. 220-230.
• [LV-86b] G. M. LANDAU AND U. VISHKIN, Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986), 239-249.
• [LV-88] G. M. LANDAU AND U. VISHKIN, Fast string matching with k differences, J. Comput. System Sci. 37, No. 1 (1988), 63-78.
• [S80] P. H. SELLERS, The theory and computation of evolutionary distances: Pattern recognition, J. Algorithms 1 (1980), 359-373.
• [SK-83] D. SANKOFF AND J. B. KRUSKAL (Eds.), “Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison,” Addison-Wesley, Reading, MA, 1983.
• [SV-88] B. SCHIEBER AND U. VISHKIN, Parallel computation of lowest common ancestor in trees, SIAM J. Comput., in press.
• [U-83] E. UKKONEN, On approximate string matching, in “Proceedings Int. Conf. Found. Comput. Theory,” Lecture Notes in Computer Science Vol. 158, pp. 487-495, Springer-Verlag, Berlin/New York, 1983.
• [U-85] E. UKKONEN, Finding approximate patterns in strings, J. Algorithms 6 (1985), 132-137.
• [V-83] U. VISHKIN, “Synchronous parallel computation: A survey,” TR-71, Department of Computer Science, Courant Institute, NYU, 1983.
• [V-85] U. VISHKIN, Optimal parallel pattern matching in strings, in “Proceedings 12th ICALP,” Lecture Notes in Computer Science Vol. 194, pp. 497-508, Springer-Verlag, New York/Berlin; Inform. and Control 67 (1985), 91-113.
Thank you