Post on 16-Dec-2015
CSE 30331 Lecture 23 – String Matching
Simple (Brute-Force) Approach
Knuth-Morris-Pratt Algorithm
Boyer-Moore Algorithm
The Problem
Find the first occurrence of the pattern P in text T.
The number of characters in P is m
The number of characters in T is n
The Simple Approach
For each position j in the text:
    If T[ j .. j+m ) matches P[ 0 .. m ), stop: pattern found at position j
Advantage: simple to implement
Disadvantage: may require the ability to push previously read characters back into the input stream
Worst Case Efficiency: O(m*n)
The pattern moves forward only one position after each mismatch, no matter how much of the pattern matched before the mismatched character
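The simple approach above can be sketched in C++ as follows. This is a minimal illustration, assuming the text and pattern are held in std::string objects; the name bruteForceFind is hypothetical and does not come from the lecture.

```cpp
#include <string>

// Hypothetical helper (not from the lecture): returns the index of the
// first occurrence of P in T, or -1 if P does not occur.
int bruteForceFind(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    for (int j = 0; j + m <= n; ++j) {        // each candidate position j
        int k = 0;
        while (k < m && T[j + k] == P[k])     // compare T[j..j+m) with P[0..m)
            ++k;
        if (k == m)
            return j;                         // whole pattern matched at j
        // otherwise the pattern moves forward by exactly one position
    }
    return -1;
}
```

Each of the roughly n candidate positions may cost up to m comparisons, which is where the O(m*n) worst case comes from.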
Knuth-Morris-Pratt (KMP)
Based on an FSA (finite-state automaton) for recognizing the pattern P
The FSA is represented by a KMP flowchart
States are letters in the pattern P
Arcs are SUCCESS or FAIL
On success ( T[ j ] == P[ k ] ): move forward with the match ( j++ and k++ )
On failure ( T[ j ] != P[ k ] ): move backward in the pattern (equivalently, shift the pattern forward over the text) so that P[ fail[ k ] ] is aligned with text character T[ j ], preserving the longest matching prefix
KMP Fail Links: hubbahubba
Example pattern: hubbahubba
P:        H  U  B  B  A  H  U  B  B  A
k:        0  1  2  3  4  5  6  7  8  9
fail[k]: -1  0  0  0  0  0  1  2  3  4
Match to text: hubbahubbletelescope...
hubbahubba                 last A != L, fail[9] = 4
     hubbahubba            first A != L, fail[4] = 0
         hubbahubba        H != L, fail[0] = -1
          hubbahubba
hubbahubbletelescope...
         ^
KMP – Building Fail Links
Pattern: ABABDD
If P[ k ] != T[ j ] then new k = fail[ k ] is the pattern position to resume from: the length of the longest prefix of P that matches the text immediately before the mismatched character T[ j ]
Finding fail[k]:
Go to P[ k-1 ] and find its fail[ k-1 ] (the prefix that matches up to P[ k-2 ])
If P[ fail[k-1] ] matches P[ k-1 ], then fail[ k ] becomes fail[ k-1 ] + 1
Else follow the next fail arrow, fail[ fail[ k-1 ] ], and repeat
[Flowchart: Read char -> A (0) -> B (1) -> A (2) -> B (3) -> D (4) -> D (5) -> * (match)]
KMP – Building Fail Links
void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++)        // for each P[k], left to right
    {
        s = fail[k-1];             // s is previous fail link
        while (s >= 0)             // if not back to start
        {
            if (P[s] == P[k-1])    // duplicate char found
                break;             // so, stop following links
            s = fail[s];           // follow next fail link
        }
        fail[k] = s + 1;           // one past the matched prefix (0 if none)
    }
}
KMP – Building Fail Links
Pattern: A B A B D D
Fail:   -1 0
void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++) {      // for P[1]:'B'
        s = fail[k-1];             // s is fail[0]:-1
        while (s >= 0) {           // skip loop
            if (P[s] == P[k-1])
                break;
            s = fail[s];
        }
        fail[k] = s + 1;           // set fail[1] = -1+1 = 0
    }
}
KMP – Building Fail Links
Pattern: A B A B D D
Fail:   -1 0 0
void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++) {      // for P[2]:'A'
        s = fail[k-1];             // s is fail[1]:0
        while (s >= 0) {           // loop once
            if (P[s] == P[k-1])    // P[0]:'A' != P[1]:'B'
                break;
            s = fail[s];           // so s is fail[0]:-1
        }
        fail[k] = s + 1;           // fail[2] = -1+1 = 0
    }
}
KMP – Building Fail Links
Pattern: A B A B D D
Fail:   -1 0 0 1
void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++) {      // for P[3]:'B'
        s = fail[k-1];             // s is fail[2]:0
        while (s >= 0) {           // loop once
            if (P[s] == P[k-1])    // P[0]:'A' == P[2]:'A'
                break;             // so, break
            s = fail[s];
        }
        fail[k] = s + 1;           // fail[3] = 0+1 = 1
    }
}
KMP – Building Fail Links
Pattern: A B A B D D
Fail:   -1 0 0 1 2
void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++) {      // for P[4]:'D'
        s = fail[k-1];             // s is fail[3]:1
        while (s >= 0) {           // loop once
            if (P[s] == P[k-1])    // P[1]:'B' == P[3]:'B'
                break;             // so, break
            s = fail[s];
        }
        fail[k] = s + 1;           // fail[4] = 1+1 = 2
    }
}
KMP – Building Fail Links
Pattern: A B A B D D
Fail:   -1 0 0 1 2 0
void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++) {      // for P[5]:'D'
        s = fail[k-1];             // s is fail[4]:2
        while (s >= 0) {           // loop twice
            if (P[s] == P[k-1])    // P[2]:'A' != P[4]:'D', P[0]:'A' != P[4]:'D'
                break;
            s = fail[s];           // s = fail[2]:0, s = fail[0]:-1
        }
        fail[k] = s + 1;           // fail[5] = -1+1 = 0
    }
}
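The step-by-step trace above can be checked with a small self-contained version of the same loop. This is a sketch under stated assumptions: it returns a std::vector instead of filling a caller-supplied array, and the name buildFailLinks is hypothetical rather than taken from the lecture.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper around the kmpSetup logic: returns the fail-link
// table for pattern P (fail[0] is -1, as in the slides).
std::vector<int> buildFailLinks(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> fail(m);
    if (m == 0)
        return fail;                       // empty pattern: empty table
    fail[0] = -1;                          // mismatch at P[0]: read another char
    for (int k = 1; k < m; ++k) {
        int s = fail[k - 1];               // start from the previous fail link
        while (s >= 0 && P[s] != P[k - 1]) // follow links until P[s] extends a prefix
            s = fail[s];
        fail[k] = s + 1;                   // 0 when no prefix can be reused
    }
    return fail;
}
```

For the pattern ABABDD this should reproduce the table built on the slides, -1 0 0 1 2 0, and for hubbahubba the table -1 0 0 0 0 0 1 2 3 4.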
KMP Fail Links: on mismatch, new k = fail[k]
Example pattern: ABABDD    fail: -1 0 0 1 2 0
Before shift   After shift
ABABDD         .ABABDD      A != X so fail[0] = -1
X?????         X?????       Skip X & k=0

ABABDD         .ABABDD      B != X so fail[1] = 0
AX????         AX??????     k=0 (shifts pattern 1)

ABABDD         ..ABABDD     2nd A != X so fail[2] = 0
ABX???         ABX???       k=0 (shifts pattern 2)

ABABDD         ..ABABDD     2nd B != X so fail[3] = 1
ABAX??         ABAX????     k=1 (shifts pattern 2)
KMP Fail Links: on mismatch, new k = fail[k]
Example pattern: ABABDD    fail: -1 0 0 1 2 0
Before shift   After shift
ABABDD         ..ABABDD     D != X so fail[4] = 2
ABABX?         ABABX?       k=2 (shifts pattern 2)

ABABDD         .....ABABDD  2nd D != X so fail[5] = 0
ABABDX         ABABDX       k=0 (shifts pattern 5)
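The shift amounts quoted in these diagrams follow directly from the fail table: a mismatch at pattern index k moves the pattern forward by k - fail[k] positions, because the text index j stays put while the pattern index drops from k to fail[k]. A minimal sketch, with shiftsFromFail as a hypothetical helper name:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: for each pattern index k, how far the pattern
// slides forward over the text when a mismatch occurs at k.
std::vector<int> shiftsFromFail(const std::vector<int>& fail) {
    std::vector<int> shift(fail.size());
    for (std::size_t k = 0; k < fail.size(); ++k)
        shift[k] = static_cast<int>(k) - fail[k];   // k - fail[k]
    return shift;
}
```

For ABABDD, with fail = -1 0 0 1 2 0, this gives shifts of 1 1 2 2 2 5, matching the six cases drawn above.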
KMP Scan Algorithm
int kmpScan(char P[], char T[], int m, int fail[])
{
    int match = -1;                  // position of match in text
    int j = 0, k = 0;
    while (k < m && !atEndOfText(T, j)) {   // pattern incomplete and there is more text
        if (k == -1) {               // nothing in pattern matched last text char, so
            j++;                     // get next text character
            k = 0;                   // start pattern over
        }
        else if (T[j] == P[k]) {
            j++; k++;                // move forward one character in pattern and text
        }
        else {
            k = fail[k];             // follow fail link to best restart in pattern
        }
    }
    if (k == m)                      // tested after the loop so that a match ending
        match = j - m;               // exactly at the end of the text is still reported
    return match;
}
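For experimentation, the setup and scan steps can be combined into one self-contained function. The sketch below assumes std::string inputs, so the end-of-text test becomes j < n instead of the lecture's atEndOfText helper; kmpFind is a hypothetical name.

```cpp
#include <string>
#include <vector>

// Hypothetical all-in-one KMP search: returns the index of the first
// occurrence of P in T, or -1 if there is none.
int kmpFind(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    if (m == 0)
        return 0;                          // empty pattern matches at position 0

    // Build fail links (same logic as kmpSetup).
    std::vector<int> fail(m);
    fail[0] = -1;
    for (int k = 1; k < m; ++k) {
        int s = fail[k - 1];
        while (s >= 0 && P[s] != P[k - 1])
            s = fail[s];
        fail[k] = s + 1;
    }

    // Scan the text; j never moves backward.
    int j = 0, k = 0;
    while (j < n && k < m) {
        if (k == -1) {                     // mismatch at P[0]: skip this text char
            ++j;
            k = 0;
        } else if (T[j] == P[k]) {         // success arc
            ++j;
            ++k;
        } else {                           // fail arc
            k = fail[k];
        }
    }
    return (k == m) ? j - m : -1;
}
```

Because j only ever increases, the text could equally be read from a stream one character at a time, which avoids the push-back problem noted for the simple approach.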
KMP - Efficiency
Building Fail Links – O(m)
Scanning text – O(n)
Overall – O(m+n) = O(n), since m <= n
Boyer-Moore (BM) Heuristic # 1
Match pattern Right-to-Left
Create a charJump[ch] array with an entry for each character in the alphabet (ASCII code)
If T[ j ] != P[ k ] then
    If T[ j ] appears in P[0..k) then the rightmost occurrence is aligned with T[ j ]
    Else the pattern P is aligned beginning at T[ j+1 ]
jnew = j + charJump[ T[ j ] ]; matching resumes with T[ jnew ] and P[m-1]
This skips multiple text characters WITHOUT ever examining them
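A minimal sketch of the charJump table, assuming an alphabet of 256 single-byte characters and a std::string pattern. The name buildCharJump is hypothetical; the table values follow the same rule as the lecture's computeJumps, m - (k + 1) for the rightmost occurrence of each pattern character.

```cpp
#include <array>
#include <string>

// Hypothetical helper: charJump[ch] is how far j advances when the text
// character ch causes a mismatch.
std::array<int, 256> buildCharJump(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::array<int, 256> charJump;
    charJump.fill(m);                      // characters not in P: jump a full pattern length
    for (int k = 0; k < m; ++k) {
        // Later (more rightward) occurrences overwrite earlier ones, so the
        // table ends up holding the jump for the rightmost occurrence.
        charJump[static_cast<unsigned char>(P[k])] = m - (k + 1);
    }
    return charJump;
}
```

The cast to unsigned char is a design choice to keep the index non-negative on platforms where char is signed.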
Boyer Moore Algorithm
Heuristic # 2
matchJump[k] = slide[k] + m – k
    slide[k] is the amount of slide needed to align the substrings
    m – k is the length of the suffix (substring) being realigned, counting k from 1; with 0-based k, as in the code, it is m – (k+1)
Similar to KMP fail links, but calculated right to left
If a suffix has matched in P and T, and that same substring appears elsewhere in P, then upon a mismatch the pattern P is "slid" to align the rightmost such matching substring with the suffix in T
Matching resumes at the new end of the pattern, determined by matchJump[ k ]
BM - Example
Pattern: BATSANDCATS
BATSANDCATS                        first pattern alignment
  BATSANDCATS                      charJump[T[j]] aligns N's
       BATSANDCATS                 matchJump[k] aligns ATS's
TWOOLDGNATSCANBELIKEBATSANDCATS    the text
New j (where matching resumes) is at the end of the pattern P, but which one: (S =?= A) or (S =?= I)?
Use MAX(charJump[T[j]], matchJump[k])
Computing individual charJumps
// find cJ[ch] for each character ch in pattern P
void computeJumps(char P[], int m, int alpha, int charJump[])
{
    // assume jump distance is entire pattern length for all
    // characters that do not match a pattern letter.
    for (int ch = 0; ch < alpha; ch++)
        charJump[ch] = m;

    // for each pattern letter find the minimum jump to align
    // rightmost occurrence in string, with same current char
    // in the text
    for (int k = 0; k < m; k++)
        charJump[(int)P[k]] = m - (k + 1);
}
Computing substring matchJumps
void computeMatchJumps(char P[], int m, int matchJump[])
{
    int k, s, low, shift, *sufx = new int[m+1];
    // note: sufx[0] tells what suffix matches a prefix of P
    for (k = 0; k < m; k++)
        matchJump[k] = m + 1;        // initially, an impossibly large slide

    // Compute sufx links (like KMP fail links, but right-to-left).
    // Detect if substring equals matched suffix and is preceded
    // by mismatch at s; compute its slide.
    sufx[m] = m + 1;
    for (k = m-1; k >= 0; k--)       // P and matchJump indices 0..m-1, sufx indices 0..m
    {
        s = sufx[k+1];
        while (s <= m) {
            if (P[k] == P[s-1])
                break;
            if (s-(k+1) < matchJump[s-1])   // mismatch between P[k] and P[s-1]
                matchJump[s-1] = s-(k+1);
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }

    // if no suffix match at k+1, compute slide based on prefix that
    // matches suffix. Prefix length = (m - shift).
    low = 1;
    shift = sufx[0];
    while (shift <= m) {
        for (k = low; k <= shift; k++) {
            if (shift < matchJump[k-1])
                matchJump[k-1] = shift;
        }
        low = shift + 1;
        shift = sufx[shift];
    }

    // Add number of matched characters to slide amount
    for (k = 0; k < m; k++)
        matchJump[k] += (m-(k+1));

    delete[] sufx;                   // release the temporary sufx array
}
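The same computation can be written with std::vector so that the temporary array is released automatically. This is a sketch assuming a std::string pattern; buildMatchJump is a hypothetical name, and the logic is intended to mirror computeMatchJumps step for step.

```cpp
#include <string>
#include <vector>

// Hypothetical vector-based version of computeMatchJumps.
std::vector<int> buildMatchJump(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> matchJump(m, m + 1);  // start with an impossibly large slide
    std::vector<int> sufx(m + 1);          // sufx uses indices 0..m

    // Phase 1: right-to-left suffix links; record slides where a repeated
    // substring is preceded by a mismatching character.
    sufx[m] = m + 1;
    for (int k = m - 1; k >= 0; --k) {
        int s = sufx[k + 1];
        while (s <= m) {
            if (P[k] == P[s - 1])
                break;
            if (s - (k + 1) < matchJump[s - 1])
                matchJump[s - 1] = s - (k + 1);
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }

    // Phase 2: where no such substring exists, slide so that a prefix of P
    // lines up with a matched suffix.
    int low = 1;
    int shift = sufx[0];
    while (shift <= m) {
        for (int k = low; k <= shift; ++k)
            if (shift < matchJump[k - 1])
                matchJump[k - 1] = shift;
        low = shift + 1;
        shift = sufx[shift];
    }

    // Phase 3: add the number of characters already matched.
    for (int k = 0; k < m; ++k)
        matchJump[k] += m - (k + 1);
    return matchJump;
}
```

For the pattern WOWWOW this should produce 8 7 6 7 3 1, and for BATSANDCATS the entry at index 7 (the mismatch on C after ATS has matched) should be 10, a slide of 7 plus 3 matched characters.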
BM Scan Algorithm
int boyerMooreScan(char P[], char T[], int m, int charJump[], int matchJump[])
{
    int match = -1, j = m-1, k = m-1;
    int jump;
    // endOfText(T, j) is assumed to be true for every j at or past the end of T
    while (k >= 0 && !endOfText(T, j))
    {
        if (T[j] == P[k]) {
            j--; k--;                // continue match right-to-left
        }
        else {
            jump = matchJump[k];
            if (charJump[(int)T[j]] > matchJump[k])
                jump = charJump[(int)T[j]];
            j += jump;               // jump forward & restart matching at right
            k = m-1;
        }
    }
    if (k < 0)
        match = j + 1;               // entire pattern matches
    return match;
}
BM - Example
Pattern: WOWWOW    mJump: 8 7 6 7 3 1    cJump: 'W'=0, 'O'=1, others=6
WOWTHISISWOWXOWWOWWOW    the TEXT (21 chars)
     1  1111111111121    # of comparisons (15)
WOWWOW                   W != I, cJ[I]=6, mJ[5]=1
      WOWWOW             W != S, cJ[S]=6, mJ[2]=6
         WOWWOW          W != X, cJ[X]=6, mJ[3]=7
              WOWWOW     W != O, cJ[O]=1, mJ[5]=1
               WOWWOW    match
Note: cJump['W']=0 simply means that if the TEXT character is 'W', the realignment that places the rightmost pattern 'W' over the text 'W' is achieved by not moving the pattern
Note: the algorithm will NOT work using only cJump
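The comparison count in this example can be reproduced with a self-contained scan that builds both tables and counts character comparisons. This is a sketch under stated assumptions: std::string inputs, a 256-character alphabet, and hypothetical names (BmResult, boyerMooreFind) that are not part of the lecture code.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

struct BmResult {          // hypothetical result type
    int match;             // index of first occurrence, or -1
    int comparisons;       // number of text/pattern character comparisons
};

BmResult boyerMooreFind(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    if (m == 0) return {0, 0};

    // Heuristic 1: charJump for the rightmost occurrence of each character.
    std::array<int, 256> charJump;
    charJump.fill(m);
    for (int k = 0; k < m; ++k)
        charJump[static_cast<unsigned char>(P[k])] = m - (k + 1);

    // Heuristic 2: matchJump, following the lecture's computeMatchJumps.
    std::vector<int> matchJump(m, m + 1), sufx(m + 1);
    sufx[m] = m + 1;
    for (int k = m - 1; k >= 0; --k) {
        int s = sufx[k + 1];
        while (s <= m && P[k] != P[s - 1]) {
            matchJump[s - 1] = std::min(matchJump[s - 1], s - (k + 1));
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }
    for (int low = 1, shift = sufx[0]; shift <= m; shift = sufx[shift]) {
        for (int k = low; k <= shift; ++k)
            matchJump[k - 1] = std::min(matchJump[k - 1], shift);
        low = shift + 1;
    }
    for (int k = 0; k < m; ++k)
        matchJump[k] += m - (k + 1);

    // Scan right-to-left within the pattern, left-to-right over the text.
    int comparisons = 0;
    int j = m - 1, k = m - 1;
    while (k >= 0 && j < n) {
        ++comparisons;
        if (T[j] == P[k]) {
            --j;
            --k;
        } else {
            j += std::max(charJump[static_cast<unsigned char>(T[j])], matchJump[k]);
            k = m - 1;
        }
    }
    return {k < 0 ? j + 1 : -1, comparisons};
}
```

On the text WOWTHISISWOWXOWWOWWOW with pattern WOWWOW, this should report a match at index 15 after 15 comparisons, in line with the trace above.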
BM Algorithm Efficiency
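Building charJump[ ] – O(alpha + m), where alpha is the alphabet size (one pass over the alphabet and one over the pattern, as in computeJumps)
Building matchJump[ ] – O(m)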
Scanning text – O(n)
In practice, only about one in every 3 or 4 text characters is examined, so BM is quite fast
Overall – O(n)
String Matching Program
Program to demonstrate all three approaches to string matching
demos\strScan.cpp