CSE 30331 Lecture 23 – String Matching

Simple (Brute-Force) Approach
Knuth-Morris-Pratt Algorithm
Boyer-Moore Algorithm

The Problem

Find the first occurrence of the pattern P in text T.

The number of characters in P is m

The number of characters in T is n

The Simple Approach

For each position j in the text
    If T[ j .. j+m ) matches P[ 0 .. m )
        stop: pattern found at position j

Advantage: simple to implement

Disadvantage: may require the ability to push previously read characters back into the input stream

Worst-Case Efficiency: O(m*n)

The pattern moves forward only one position after each mismatch, no matter how much of the pattern matched before the mismatched character
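The loop above can be sketched in C++ as follows. This is a minimal illustration, not code from the lecture: the function name bruteForceFind and the use of std::string (rather than an input stream) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Brute-force search (hypothetical helper, not from the slides).
// Returns the index of the first occurrence of pattern in text, or -1.
int bruteForceFind(const std::string& text, const std::string& pattern) {
    const std::size_t n = text.size();
    const std::size_t m = pattern.size();
    if (m > n) return -1;
    for (std::size_t j = 0; j + m <= n; ++j) {   // each alignment j of P over T
        std::size_t k = 0;
        while (k < m && text[j + k] == pattern[k]) {
            ++k;                                 // compare T[j..j+m) with P[0..m)
        }
        if (k == m) return static_cast<int>(j);  // whole pattern matched at j
        // otherwise the pattern moves forward by exactly one position
    }
    return -1;
}
```

A text such as "aaaa...ab" searched for "aab" shows the O(m*n) behavior: almost every alignment matches most of the pattern before failing, and the next alignment re-reads the same text characters.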

Knuth-Morris-Pratt (KMP)

Based on a finite-state automaton (FSA) for recognizing the pattern P

The FSA is represented by a KMP flowchart
    States are letters in the pattern P
    Arcs are SUCCESS or FAIL

On success ( T[ j ] == P[ k ] )
    Move forward with the match ( j++ and k++ )

On failure ( T[ j ] != P[ k ] )
    Move backward in the pattern (equivalently, shift the pattern forward over the text) to align P[ fail[ k ] ] with text character T[ j ], preserving the longest matching prefix

KMP Fail Links: hubbahubba

Example pattern: hubbahubba

P:        H  U  B  B  A  H  U  B  B  A
k:        0  1  2  3  4  5  6  7  8  9
fail[k]: -1  0  0  0  0  0  1  2  3  4

Match to text: hubbahubbletelescope...

hubbahubba                last A != L, fail[9] = 4
     hubbahubba           first A != L, fail[4] = 0
         hubbahubba       H != L, fail[0] = -1
          hubbahubba      L is skipped, matching restarts with k = 0
hubbahubbletelescope...
         ^
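The chain of fail links followed at the mismatch in this example can be reproduced with a few lines of C++. This is only a sketch: failChain is a hypothetical helper, and the fail table is passed in as given on the slide rather than computed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Follows fail links from pattern index k until pattern[k] equals the text
// character ch or the chain reaches -1. Returns each index visited after k.
// (Hypothetical helper for illustration; not part of the lecture code.)
std::vector<int> failChain(const std::string& pattern,
                           const std::vector<int>& fail, int k, char ch) {
    assert(fail.size() == pattern.size());
    std::vector<int> visited;
    while (k >= 0 && pattern[k] != ch) {
        k = fail[k];            // fall back to the best restart position
        visited.push_back(k);
    }
    return visited;
}
```

With the table above, a mismatch of 'l' at k = 9 visits 4, then 0, then -1, which matches the three fail steps shown in the alignment.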

KMP – Building Fail Links

Pattern: ABABDD

If P[ k ] != T[ j ] then
    k_new = fail[ k ] is the pattern position at which matching resumes; P[ 0 .. fail[k] ) is the longest prefix of P that still matches the text just before the mismatched character T[ j ]

Finding fail[ k ]:
    Go to P[ k-1 ] and take its fail link s = fail[ k-1 ] (the length of the prefix that matches up to P[ k-2 ])
    If P[ s ] matches P[ k-1 ], then fail[ k ] becomes s + 1
    Else follow the next fail arrow, s = fail[ s ], and repeat; if s reaches -1, fail[ k ] = 0

[Figure: KMP flowchart for ABABDD. "Read char" feeds states 0 to 5, labeled A B A B D D, ending at * (match)]

KMP – Building Fail Links

void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++)        // for each P[k], left to right
    {
        s = fail[k-1];             // s is previous fail link
        while (s >= 0)             // if not back to start
        {
            if (P[s] == P[k-1])    // duplicate char found
                break;             // so, stop following links
            s = fail[s];           // follow next fail link
        }
        fail[k] = s + 1;           // one past the end of the matched prefix
    }
}

KMP – Building Fail Links

Pattern: A  B  A  B  D  D
Fail:   -1  0

void kmpSetup(char P[], int m, int fail[])
{
    int k, s;
    fail[0] = -1;                  // ch != P[0], read another ch
    for (k = 1; k < m; k++) {      // for P[1]:'B'
        s = fail[k-1];             // s is fail[0]: -1
        while (s >= 0) {           // skip loop
            if (P[s] == P[k-1])
                break;
            s = fail[s];
        }
        fail[k] = s + 1;           // set fail[1] = -1 + 1 = 0
    }
}

[Figure: KMP flowchart for ABABDD, states 0 to 5 labeled A B A B D D, ending at * (match)]

KMP – Building Fail Links

Pattern: A  B  A  B  D  D
Fail:   -1  0  0

    int k, s;
    for (k = 1; k < m; k++) {      // for P[2]:'A'
        s = fail[k-1];             // s is fail[1]: 0
        while (s >= 0) {           // loop once
            if (P[s] == P[k-1])    // P[0]:'A' != P[1]:'B'
                break;
            s = fail[s];           // so s is fail[0]: -1
        }
        fail[k] = s + 1;           // fail[2] = -1 + 1 = 0
    }

[Figure: KMP flowchart for ABABDD, states 0 to 5 labeled A B A B D D, ending at * (match)]

KMP – Building Fail Links

Pattern: A  B  A  B  D  D
Fail:   -1  0  0  1

    int k, s;
    for (k = 1; k < m; k++) {      // for P[3]:'B'
        s = fail[k-1];             // s is fail[2]: 0
        while (s >= 0) {
            if (P[s] == P[k-1])    // P[0]:'A' == P[2]:'A'
                break;             // so, break
            s = fail[s];
        }
        fail[k] = s + 1;           // fail[3] = 0 + 1 = 1
    }

[Figure: KMP flowchart for ABABDD, states 0 to 5 labeled A B A B D D, ending at * (match)]

KMP – Building Fail Links

Pattern: A  B  A  B  D  D
Fail:   -1  0  0  1  2

    int k, s;
    for (k = 1; k < m; k++) {      // for P[4]:'D'
        s = fail[k-1];             // s is fail[3]: 1
        while (s >= 0) {
            if (P[s] == P[k-1])    // P[1]:'B' == P[3]:'B'
                break;             // so, break
            s = fail[s];
        }
        fail[k] = s + 1;           // fail[4] = 1 + 1 = 2
    }

[Figure: KMP flowchart for ABABDD, states 0 to 5 labeled A B A B D D, ending at * (match)]

KMP – Building Fail Links

Pattern: A  B  A  B  D  D
Fail:   -1  0  0  1  2  0

    int k, s;
    for (k = 1; k < m; k++) {      // for P[5]:'D'
        s = fail[k-1];             // s is fail[4]: 2
        while (s >= 0) {           // loop twice
            if (P[s] == P[k-1])    // P[2]:'A' != P[4]:'D', P[0]:'A' != P[4]:'D'
                break;
            s = fail[s];           // s = fail[2]: 0, s = fail[0]: -1
        }
        fail[k] = s + 1;           // fail[5] = -1 + 1 = 0
    }

[Figure: KMP flowchart for ABABDD, states 0 to 5 labeled A B A B D D, ending at * (match)]
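The trace above can be checked mechanically. The sketch below restates kmpSetup with std::string and std::vector so that it can be run on its own; the name buildFailLinks and the container types are choices made for this example, not part of the lecture code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the KMP fail links for a non-empty pattern, following the same
// steps as kmpSetup: start from the previous fail link and keep following
// links until P[s] == P[k-1] or the chain runs out (s == -1).
std::vector<int> buildFailLinks(const std::string& pattern) {
    assert(!pattern.empty());
    const int m = static_cast<int>(pattern.size());
    std::vector<int> fail(m);
    fail[0] = -1;                          // mismatch at P[0]: read another char
    for (int k = 1; k < m; ++k) {
        int s = fail[k - 1];
        while (s >= 0 && pattern[s] != pattern[k - 1]) {
            s = fail[s];                   // follow next fail link
        }
        fail[k] = s + 1;
    }
    return fail;
}
```

For "ABABDD" this yields -1 0 0 1 2 0, and for "hubbahubba" it yields -1 0 0 0 0 0 1 2 3 4, the two tables shown in the slides.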

KMP Fail Links: on mismatch, new k = fail[k]

Example pattern: ABABDD
fail: -1 0 0 1 2 0

Before     After
ABABDD     .ABABDD       A != X, so fail[0] = -1
X?????     X?????        skip X and k = 0

ABABDD     .ABABDD       B != X, so fail[1] = 0
AX????     AX??????      k = 0 (shifts pattern 1)

ABABDD     ..ABABDD      2nd A != X, so fail[2] = 0
ABX???     ABX???        k = 0 (shifts pattern 2)

ABABDD     ..ABABDD      2nd B != X, so fail[3] = 1
ABAX??     ABAX????      k = 1 (shifts pattern 2)

KMP Fail Links: on mismatch, new k = fail[k]

Example pattern: ABABDD
fail: -1 0 0 1 2 0

Before     After
ABABDD     ..ABABDD      D != X, so fail[4] = 2
ABABX?     ABABX?        k = 2 (shifts pattern 2)

ABABDD     .....ABABDD   2nd D != X, so fail[5] = 0
ABABDX     ABABDX        k = 0 (shifts pattern 5)

KMP Scan Algorithm

int kmpScan(char P[], char T[], int m, int fail[])
{
    int match = -1;                    // position of match in text
    int j = 0, k = 0;
    while (!atEndOfText(T, j)) {       // there is more text
        if (k == m) {
            match = j - m;             // matched entire pattern, so stop
            break;
        }
        if (k == -1) {                 // nothing in pattern matched last text char, so
            j++;                       // get next text character
            k = 0;                     // start pattern over
        } else if (T[j] == P[k]) {
            j++; k++;                  // move forward one character in pattern and text
        } else {
            k = fail[k];               // follow fail link to best restart in pattern
        }
    }
    if (match == -1 && k == m)         // pattern ended exactly at the end of the text
        match = j - m;
    return match;
}
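For experimenting with the scan, here is a self-contained sketch that combines fail-link construction and scanning over std::string. The name kmpFind is hypothetical, and checking j against the text length stands in for the slides' atEndOfText helper, whose definition is not shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// KMP search over std::string (illustrative sketch, not the lecture code).
// Returns the index of the first occurrence of pattern in text, or -1.
int kmpFind(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());

    // Build fail links exactly as in kmpSetup.
    std::vector<int> fail(m);
    fail[0] = -1;
    for (int k = 1; k < m; ++k) {
        int s = fail[k - 1];
        while (s >= 0 && pattern[s] != pattern[k - 1]) s = fail[s];
        fail[k] = s + 1;
    }

    int j = 0, k = 0;
    while (j < n) {
        if (k == -1) {                     // nothing matched the last text char
            ++j;                           // get next text character
            k = 0;                         // start pattern over
        } else if (text[j] == pattern[k]) {
            ++j; ++k;                      // advance in both text and pattern
            if (k == m) return j - m;      // whole pattern matched
        } else {
            k = fail[k];                   // follow fail link; j never moves back
        }
    }
    return -1;
}
```

Testing for k == m immediately after advancing means a match that ends on the last text character is reported without a separate check after the loop.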

KMP - Efficiency

Building Fail Links – O(m)

Scanning text – O(n)

Overall – O(m+n), which is O(n) when m <= n

Boyer-Moore (BM) Heuristic #1

Match the pattern right-to-left

Create a charJump[ch] array with an entry for each character in the alphabet (ASCII code)

If T[ j ] != P[ k ] then
    If T[ j ] appears in P[0..k) then the rightmost occurrence is aligned with T[ j ]
    Else the pattern P is aligned beginning at T[ j+1 ]

j_new = j + charJump[ T[ j ] ]; matching resumes with T[ j_new ] and P[m-1]

This skips multiple text characters WITHOUT ever examining them
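As a small illustration of the charJump table, the sketch below builds it for a std::string pattern. The name buildCharJump and the fixed table size of 256 (one entry per unsigned char value) are assumptions of this example. The table records the rightmost occurrence of each character in the whole pattern, independent of the mismatch position k.

```cpp
#include <array>
#include <cassert>
#include <string>

// Builds the charJump table: for each character, how far j must move so
// that the rightmost occurrence of that character in the pattern lines up
// with T[j] and matching restarts at the right end of the pattern.
std::array<int, 256> buildCharJump(const std::string& pattern) {
    assert(!pattern.empty());
    const int m = static_cast<int>(pattern.size());
    std::array<int, 256> charJump;
    charJump.fill(m);                      // character not in pattern: jump m
    for (int k = 0; k < m; ++k) {
        // later (more rightward) occurrences overwrite earlier ones
        charJump[static_cast<unsigned char>(pattern[k])] = m - (k + 1);
    }
    return charJump;
}
```

For the pattern WOWWOW this gives 0 for 'W', 1 for 'O', and 6 for every other character; for BATSANDCATS the entry for 'N' is 5.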

Boyer-Moore Algorithm

Heuristic #2

matchJump[k] = slide[k] + m – k
    slide[k] is the amount of slide needed to align the substrings
    m – k is the length of the suffix (substring) being realigned (counting k from 1; with the 0-based indices used in the code it is m – (k+1))

Similar to KMP fail links, but calculated right to left

If a suffix has matched in P and T and that same substring appears elsewhere in P, then upon a mismatch the pattern P is "slid" to align the rightmost such matching substring with the suffix in T

Matching resumes at the new end of the pattern determined by matchJump[ k ]
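One way to get a feel for matchJump is to compute it straight from the description above, trying every slide in turn. The sketch below does that with 0-based k. It is a deliberately slow, cubic-time reading of the definition, not the linear-time method used in the lecture, and the name matchJumpByDefinition is hypothetical. It also assumes the character that lands on the mismatch position must differ from P[k].

```cpp
#include <cassert>
#include <string>
#include <vector>

// matchJump[k] for a mismatch at P[k] after P[k+1..m) has matched:
// the smallest slide that re-aligns the matched suffix with an equal
// substring of P (or lets a prefix of P overlap it), plus the number of
// matched characters, so that matching restarts at the pattern's right end.
std::vector<int> matchJumpByDefinition(const std::string& pattern) {
    assert(!pattern.empty());
    const int m = static_cast<int>(pattern.size());
    std::vector<int> matchJump(m);
    for (int k = 0; k < m; ++k) {
        int slide = 1;
        for (; slide <= m; ++slide) {      // slide == m always succeeds
            bool ok = true;
            // Every matched suffix character must agree with the slid pattern.
            for (int i = k + 1; i < m && ok; ++i) {
                if (i - slide >= 0 && pattern[i - slide] != pattern[i]) ok = false;
            }
            // The character landing on the mismatch position must differ.
            if (ok && k - slide >= 0 && pattern[k - slide] == pattern[k]) ok = false;
            if (ok) break;
        }
        matchJump[k] = slide + (m - (k + 1));
    }
    return matchJump;
}
```

For WOWWOW this produces 8 7 6 7 3 1, and for BATSANDCATS the entry at k = 7 (the 'C') is 10: a slide of 7 to line up the two ATS substrings plus the 3 matched characters.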

BM - Example

Pattern: BATSANDCATS

BATSANDCATS                        first pattern alignment
  BATSANDCATS                      charJump[T[j]] aligns N's
       BATSANDCATS                 matchJump[k] aligns ATS's
TWOOLDGNATSCANBELIKEBATSANDCATS    the text

New j (where matching resumes) is at the end of the pattern P, but which one: (S =?= A) or (S =?= I)?

Use MAX(charJump[T[j]], matchJump[k])

Computing individual charJumps

// find cJ[ch] for each character ch in pattern P
void computeJumps(char P[], int m, int alpha, int charJump[])
{
    // assume jump distance is entire pattern length for all
    // characters that do not match a pattern letter.
    for (int ch = 0; ch < alpha; ch++)
        charJump[ch] = m;

    // for each pattern letter find the minimum jump to align
    // rightmost occurrence in string, with same current char
    // in the text
    for (int k = 0; k < m; k++)
        charJump[(int)P[k]] = m - (k + 1);
}

Computing substring matchJumps

void computeMatchJumps(char P[], int m, int matchJump[])
{
    int k, s, low, shift, *sufx = new int[m+1];
    // note: sufx[0] tells what suffix matches a prefix of P
    for (k = 0; k < m; k++)
        matchJump[k] = m + 1;          // initially, an impossibly large slide

    // Compute sufx links (like KMP fail links, but right-to-left).
    // Detect if substring equals matched suffix and is preceded
    // by mismatch at s; compute its slide.
    sufx[m] = m + 1;
    for (k = m-1; k >= 0; k--)         // walk the pattern right to left
    {
        s = sufx[k+1];
        while (s <= m) {
            if (P[k] == P[s-1])        // P indices 0..m-1, sufx indices 0,1..m
                break;
            if (s-(k+1) < matchJump[s-1])   // mismatch between P[k] and P[s-1]
                matchJump[s-1] = s-(k+1);
            s = sufx[s];
        }
        sufx[k] = s - 1;
    }

    // if no suffix match at k+1, compute slide based on prefix that
    // matches suffix. Prefix length = (m - shift).
    low = 1;
    shift = sufx[0];
    while (shift <= m) {
        for (k = low; k <= shift; k++) {
            if (shift < matchJump[k-1])
                matchJump[k-1] = shift;
        }
        low = shift + 1;
        shift = sufx[shift];
    }

    // Add number of matched characters to slide amount
    for (k = 0; k < m; k++)
        matchJump[k] += (m-(k+1));

    delete[] sufx;                     // release the temporary sufx array
}

BM Scan Algorithm

int boyerMooreScan(char P[], char T[], int m, int charJump[], int matchJump[])
{
    int match = -1, j = m-1, k = m-1;
    int jump;                          // distance to move j after a mismatch
    while (!endOfText(T, j))
    {
        if (k < 0) {
            match = j + 1;
            break;                     // entire pattern matches, so stop
        }
        if (T[j] == P[k]) {
            j--; k--;                  // continue match right-to-left
        } else {
            jump = matchJump[k];
            if (charJump[(int)T[j]] > matchJump[k])
                jump = charJump[(int)T[j]];
            j += jump;                 // jump forward & restart matching at right
            k = m-1;
        }
    }
    return match;
}

BM - Example

Pattern: WOWWOW
mJump: 8 7 6 7 3 1
cJump: 'W'=0, 'O'=1, others=6

WOWTHISISWOWXOWWOWWOW    the TEXT (21 chars)
     1  1111111111121    # of comparisons at each text character (15 in total)
WOWWOW                   W != I, cJ[I]=6, mJ[5]=1
      WOWWOW             W != S, cJ[S]=6, mJ[2]=6
         WOWWOW          W != X, cJ[X]=6, mJ[3]=7
              WOWWOW     W != O, cJ[O]=1, mJ[5]=1
               WOWWOW    match

Note: cJump['W']=0 simply means that if the TEXT character is 'W', the realignment that places the rightmost pattern 'W' over the text 'W' is achieved by not moving the pattern

Note: the algorithm will NOT work using only cJump, because a cJump value that is no larger than the number of characters already matched leaves the pattern where it is (or moves it backward), so the scan may never advance
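The comparison count in this example can be reproduced with a short, self-contained sketch. It follows the structure of boyerMooreScan but works on std::string and counts character comparisons. The names ScanResult and boyerMooreFind are hypothetical, and matchJump is filled in by a slow, definition-based loop rather than by computeMatchJumps, so treat it as a checking aid only.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <vector>

struct ScanResult {
    int match;        // index of first occurrence, or -1
    int comparisons;  // number of text/pattern character comparisons made
};

ScanResult boyerMooreFind(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());

    // Heuristic #1: jump based on the mismatched text character.
    std::array<int, 256> charJump;
    charJump.fill(m);
    for (int k = 0; k < m; ++k)
        charJump[static_cast<unsigned char>(pattern[k])] = m - (k + 1);

    // Heuristic #2: smallest valid slide for each k, plus matched length.
    std::vector<int> matchJump(m);
    for (int k = 0; k < m; ++k) {
        int slide = 1;
        for (; slide <= m; ++slide) {
            bool ok = true;
            for (int i = k + 1; i < m && ok; ++i)
                if (i - slide >= 0 && pattern[i - slide] != pattern[i]) ok = false;
            if (ok && k - slide >= 0 && pattern[k - slide] == pattern[k]) ok = false;
            if (ok) break;
        }
        matchJump[k] = slide + (m - (k + 1));
    }

    ScanResult result{-1, 0};
    int j = m - 1, k = m - 1;
    while (j < n) {
        ++result.comparisons;
        if (text[j] == pattern[k]) {
            if (k == 0) { result.match = j; break; }   // entire pattern matched
            --j; --k;                                  // continue right-to-left
        } else {
            j += std::max(charJump[static_cast<unsigned char>(text[j])],
                          matchJump[k]);               // take the larger jump
            k = m - 1;                                 // restart at right end
        }
    }
    return result;
}
```

On the text WOWTHISISWOWXOWWOWWOW with pattern WOWWOW this reports a match at index 15 after 15 comparisons, in line with the slide.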

BM Algorithm Efficiency

Building charJump[ ] – O(alpha + m), where alpha is the alphabet size (the two loops in computeJumps)
Building matchJump[ ] – O(m)

Scanning text – O(n)
    In practice, only about every third or fourth text character is examined, so BM is quite fast

Overall – O(n)

String Matching Program

Program to demonstrate all three approaches to string matching

demos\strScan.cpp
