1
Parameterized Pattern Matching by Boyer-Moore-type Algorithms
Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, 1995, pp. 541-550
Brenda S. Baker
Advisor: Prof. R. C. T. Lee
Speaker: Kuei-hao Chen
2
Let us consider two strings:
A=a1a2a3a4a5=xaxby
B=b1b2b3b4b5=bacbc
If the edit distance concept is used, A may be transformed to B by substituting a1 by b1, a3 by b3 and a5 by b5.
3
In this paper, we define a new transformation in which a character may be substituted by another character, but the substitution is global: if x in A is substituted by a, then every x in A is substituted by a.
4
A=a1a2a3a4a5=xaxby
B=b1b2b3b4b5=bacbc
Consider the above example again. To transform A to B, the first x must be substituted by b. But this is global. Thus,
A’=babby
It can easily be seen that if this kind of substitution is used, A=xaxby cannot be transformed to B.
5
For A=xaxby and B=babbc, A can be transformed to B by substituting x by b and y by c.
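The global substitution can be sketched in a few lines of Python. This is only an illustration: the function name apply_substitution and the dictionary representation of a substitution are assumptions, not notation from the paper.

```python
def apply_substitution(s: str, mapping: dict[str, str]) -> str:
    """Apply a global substitution: every occurrence of a mapped
    character is replaced; unmapped characters are left unchanged."""
    return "".join(mapping.get(ch, ch) for ch in s)


# Substituting only x by b turns xaxby into babby, as on the previous slide.
print(apply_substitution("xaxby", {"x": "b"}))            # babby
# Substituting x by b and y by c turns xaxby into babbc.
print(apply_substitution("xaxby", {"x": "b", "y": "c"}))  # babbc
```

All replacements are applied simultaneously, one lookup per character, so a character introduced by one replacement is never rewritten by another.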
6
We define a bijection to be a global substitution of a set of distinct characters into another set of characters.
A string P p-matches a string Q if P can be transformed to Q by a bijection.
7
Let
A=ababc
B=bcbcd
Then A p-matches B because there is a bijection, namely a→b, b→c, c→d, which transforms A to B.
8
On the other hand, for A=ababc and B=bcbdc, A does not p-match B.
It is actually easy to determine whether A p-matches B. Given A=a1a2…aN and B=b1b2…bN, A p-matches B if and only if, for every i, if ai=x and bi=y, then whenever aj=x, bj must be y.
9
For A=ababc and B=bcbcc, it can be seen that every a in A is matched with b and every b is matched with c. This is not true for A=ababc and B=bcbdc.
Thus, given a string A and a string B of the same length, it is trivial to determine whether A p-matches B.
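The test described above can be sketched as a single left-to-right pass that records which character of B each character of A has been paired with. The function name p_matches and the one_to_one flag are assumptions made for this sketch. With the flag off, it checks exactly the condition stated on the slide; with the flag on, it additionally rejects two different characters of A being paired with the same character of B, which a strict bijection would require.

```python
def p_matches(a: str, b: str, one_to_one: bool = False) -> bool:
    """Return True if every character of `a` is consistently paired
    with one character of `b` (the slide's condition).

    one_to_one=True (an assumption of this sketch) also requires that
    no two different characters of `a` share the same partner in `b`.
    """
    if len(a) != len(b):
        return False
    forward = {}   # character of a -> character of b
    backward = {}  # character of b -> character of a
    for x, y in zip(a, b):
        if forward.setdefault(x, y) != y:
            return False
        if one_to_one and backward.setdefault(y, x) != x:
            return False
    return True


print(p_matches("ababc", "bcbcd"))  # True
print(p_matches("ababc", "bcbdc"))  # False: b is paired with c, then with d
print(p_matches("xaxby", "bacbc"))  # False: x is paired with b, then with c
```

Each character is looked up once, so the check takes time proportional to the length of the strings when dictionary operations are treated as constant time.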
10
There is another important property: if A p-matches B and B p-matches C, then A p-matches C. This holds because applying the bijection from A to B and then the bijection from B to C is again a bijection, from A to C.
11
This paper considers the following problem:
Given a text T and a pattern P, find all occurrences where P p-matches a substring of T.
For example, let
T=abcadbcbdabccacbd
and
P=abaec
We can see that P p-matches two substrings of T: S1=bcbda (positions 6 to 10) and S2=cacbd (positions 13 to 17).
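To make the problem statement concrete, here is a brute-force sketch that tests every window of T against P. It is not the Boyer-Moore-type algorithm of the paper, only a baseline; the names is_p_match and p_match_positions are assumptions, and the check requires a one-to-one pairing of characters.

```python
def is_p_match(p: str, s: str) -> bool:
    """True if p can be turned into s by a one-to-one global substitution."""
    if len(p) != len(s):
        return False
    forward, backward = {}, {}
    for x, y in zip(p, s):
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


def p_match_positions(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of all windows of `text`
    that `pattern` p-matches, by checking every window in turn."""
    m = len(pattern)
    return [i + 1
            for i in range(len(text) - m + 1)
            if is_p_match(pattern, text[i:i + m])]


print(p_match_positions("abcadbcbdabccacbd", "abaec"))  # [6, 13]
```

This baseline does O(m) work for each of the n-m+1 windows; the point of the good suffix rules is to skip windows instead of testing all of them.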
12
For P=abaec and S2=cacbd, the substitution a→c, b→a, c→d, e→b will transform P to S2.
For S2=cacbd and S1=bcbda, the substitution a→c, b→d, c→b, d→a transforms S2 to S1.
It can be seen that P=abaec will be transformed to S1=bcbda by a→b, b→c, c→a, e→d.
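The three substitutions can be checked mechanically: composing the substitution from P to S2 with the one from S2 to S1 should give the substitution from P to S1. The sketch below does this; derive_substitution and compose are hypothetical helper names introduced only for illustration.

```python
def derive_substitution(src: str, dst: str):
    """Return the one-to-one substitution turning src into dst as a dict,
    or None if no such substitution exists."""
    if len(src) != len(dst):
        return None
    mapping = {}
    for x, y in zip(src, dst):
        if mapping.setdefault(x, y) != y:
            return None          # x would need two different images
    if len(set(mapping.values())) != len(mapping):
        return None              # two characters share an image
    return mapping


def compose(first: dict, second: dict) -> dict:
    """Substitution equivalent to applying `first`, then `second`."""
    return {x: second[y] for x, y in first.items()}


p, s2, s1 = "abaec", "cacbd", "bcbda"
p_to_s2 = derive_substitution(p, s2)    # a->c, b->a, e->b, c->d
s2_to_s1 = derive_substitution(s2, s1)  # c->b, a->c, b->d, d->a
print(compose(p_to_s2, s2_to_s1) == derive_substitution(p, s1))  # True
```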
13
The substitution can be visualized as follows:
[Figure: P aligned with the substrings S1 and S2 of T.]
14
This paper is based upon Good Suffix Rule 1 and Good Suffix Rule 2 of the Boyer-Moore algorithm.
15
Good Suffix Rule 1 for p-match
Let T1 be the largest suffix which p-matches a suffix P1 of P. If there is a rightmost substring zP2 of P which p-matches yP1, with z≠y, we can move P as follows:
[Figure: alignment of P against the window of T before and after the shift. Labels: T, T1, x, window, P, P1, y, P2, z, shift.]
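Both good suffix rules start from T1, the largest suffix of the window that p-matches a suffix of P. A minimal sketch of how T1 can be found by scanning from right to left is given below. The function name longest_p_matching_suffix is an assumption, the pairing of characters is required to be one-to-one, and the shift computation itself is not covered.

```python
def longest_p_matching_suffix(window: str, pattern: str) -> int:
    """Return the length of the longest suffix of `window` that
    p-matches the suffix of `pattern` of the same length.

    If a pair of suffixes p-match, every shorter pair does too, so the
    scan can stop at the first character that breaks the pairing.
    """
    if len(window) != len(pattern):
        raise ValueError("window and pattern must have the same length")
    forward, backward = {}, {}
    k = 0
    for x, y in zip(reversed(pattern), reversed(window)):
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            break
        k += 1
    return k


# Window = first ten characters of T = vvxxvvvvxxwvvww, P = uuuvvvwwvv.
k = longest_p_matching_suffix("vvxxvvvvxx", "uuuvvvwwvv")
print(k)        # 4: the suffix vvxx of the window p-matches wwvv
print(10 - k)   # 6: position in P where the p-mismatch occurs
```

With a window of length m, the p-mismatch is found at position m-k of P, counting from 1, which is the position the shift rules work from.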
16
Example
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
T: v v x x v v v v x x w v v w w
P: u u u v v v w w v v (numbered 1 to 10)
P’ (P after Transform): u u u x x x v v x x
Comparing from right to left gives a p-mismatch, so P is shifted.
P after Shift: u u u v v v w w v v (numbered 1 to 10)
17
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
T: v v x x v v v v x x w v v w w
P: u u u v v v w w v v (numbered 1 to 10)
P’ (P after Transform): v v v x x w v v w w
After moving, we compare T and P from right to left and find that T6,15≡P1,10.
18
Good Suffix Rule 2 for p-match
Let T1 be the largest suffix of the window of P which p-matches with a suffix P1 of P.
Let P1' be a suffix of P1 which p-matches with a prefix P2 of P. If P1' exists, we move P as follows:
[Figure: alignment of P against T before and after the shift. Labels: T, x, T1, T1', P, y, P1, P1', P2, P2', shift.]
19
Example
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13
T: x x v v v v x x w v v w w
P: u v v v w w v v (numbered 1 to 8)
P’ (P after Transform): u x x x v v x x
Comparing from right to left gives a p-mismatch, so P is shifted.
P after Shift: u v v v w w v v (numbered 3 to 10)
20
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13
T: x x v v v v x x w v v w w
P: u v v v w w v v (numbered 3 to 10)
P’ (P after Transform): u x x x v v x x
21
The shift function ∆ is
[Equation: piecewise definition of ∆, with one case for Good suffix rule 1 and one case for Good suffix rule 2, written with min and max over the indices j and m of P.]
22
Example
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
T: G A T C G A T C A A T C A T A T C A T C A T
P: A T C A C A T C A T C A (numbered 1 to 12)
Transform: A→C, T→A, C→T
P’ (P after Transform): C A T C T C A T C A T C
p-mismatch: j’=7, j=9
P after Shift: A T C A C A T C A T C A (numbered 1 to 12)
23
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
T: G A T C G A T C A A T C A T A T C A T C A T
P: A T C A C A T C A T C A (numbered 1 to 12)
Transform: A→C, T→A, C→T
P’ (P after Transform): C A T C T C A T C A T C
p-mismatch: j’=7, j=9
P after Shift: A T C A C A T C A T C A (numbered 1 to 12)
24
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
T: G A T C G A T C A A T C A T A T C A T C A T
P: A T C A C A T C A T C A (numbered 1 to 12)
Transform: A→T, T→C, C→A
P’ (P after Transform): T C A T A T C A T C A T
25
Time Complexity
• In the average case, the preprocessing phase takes O(m log min(m, Π)) time and the searching phase takes O(n log min(m, Π)) time, with space complexity O(n).
26
References
• [AFM94] Amihood Amir, Martin Farach, and S. Muthukrishnan, Alphabet dependence in parameterized matching. Info. Proc. Letters, Vol. 49, pp.111-115, 1994.
• [Bak] Brenda S. Baker, Parameterized pattern matching: algorithms and applications, J. Comput. Syst. Sci., to appear.
• [Bak92] Brenda S. Baker, A program for identifying duplicated code, In Computing Science and Statistics Vol.24: Proceedings of the 24th Symposium on the Interface, pp.49-57, 1992.
• [Bak93a] Brenda S. Baker, Parameterized duplication in strings: algorithms and an application to software maintenance, submitted for publication, 1993.
• [Bak93b] Brenda S. Baker, A theory of parameterized pattern matching: Algorithms and applications, In Proceedings of the 25th Annual Symposium on Theory of Computing, pp.71-80, 1993.
• [BM77] Robert S. Boyer and J. Strother Moore, A fast string searching algorithm, Commun. ACM,Vol.20, No.10, pp.762-772, 1977.
27
References
• [BYGR90] Ricardo A. Baeza-Yates, Gaston H. Gonnet, and Mireille Regnier, Analysis of Boyer-Moore-type string searching algorithms. In Proc. of First Annual ACM-SIAM Symposium on Discrete Algorithms, pp.328-343, 1990.
• [BYR92] Ricardo A. Baeza-Yates and Mireille Regnier, Average running time of the Boyer-Moore-Horspool algorithm, Theoretical Computer Sci., Vol. 92, pp.19-31, 1992.
• [CLC+92] Maxime Crochemore, Thierry Lecroq, Artur Czumaj, Leszek Gasieniec, S. Jarominek, and W. Plandowski, Speeding up two string-matching algorithms, In 9th Annual Symposium on Theoretical Aspects of Computer Science, LNCS Vol.577, pp.589-600, 1992.
• [Col 91] Richard Cole, Tight bounds on the complexity of the Boyer-Moore string matching algorithm, In Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, pp.224-234, 1991.
• [Hor 80] R. Nigel Horspool, Practical fast searching in strings, Soft. Pract. And Exp., Vol.10, pp.501-506, 1980.
28
References
• [HS91] Andrew Hume and Daniel Sunday, Fast string search, Soft. Pract. And Exp., Vol. 21, No.11, pp.1221-1248, 1991.
• [IS94] Ramana M. Idury and Alejandro A. Schaffer, Multiple matching of parameterized patterns, In Proc. of 5th Symposium on Combinatorial Pattern Matching, pp.226-239, 1994.
• [KMP77] D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast pattern matching in strings, SIAM J. Comput., Vol.6, No.2, pp.323-350, 1977.
• [Ryt80] Wojciech Rytter, A correct preprocessing algorithm for Boyer-Moore string-searching, SIAM J. Comput., Vol.9, No.3, pp.509-512, 1980.
• [Sch88] R. Schaback, On the expected sublinearity of the Boyer-Moore algorithm. SIAM J. on Comput., Vol. 17, No.4, pp.648-659, 1988.
• [Sun 90] Daniel M. Sunday, A very fast substring search algorithm, Commun. ACM, Vol.33, No.8, pp.132-139, 1990.
29
THANK YOU