
A Fast String Searching Algorithm

Robert S. Boyer, and J Strother Moore.

Communications of the ACM, vol. 20, no. 10, Oct. 1977

Outline:
Introduction
The Knuth-Morris-Pratt algorithm
The Boyer-Moore algorithm
    Bad Character heuristic
    Good Suffix heuristic
    Matching Algorithm
Experimental Result
Conclusion

Introduction String Matching:

Searching for a pattern in a text or a longer string.

If the pattern exists in the string, return the position of the first character of the substring that matches the pattern.

[Figure: the pattern aligned under the string at shift s.]

Introduction (cont.) Some definitions:

m : the length of the pattern.
n : the length of the string (or text).
s (shift) : the distance between the first character of the matched substring and the start character of the string.
w ⊏ x : a string w is a prefix of a string x.
w ⊐ x : a string w is a suffix of a string x.

Introduction (cont.) The naive string-matching algorithm:

Time Complexity: Θ((n-m+1)m) in the worst case, which is Θ(n²) if m = n/2.

for s ← 0 to n-m
    do if pattern[1..m] = string[s+1..s+m]
        then print "Pattern occurs with shift" s
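The naive loop above can be sketched in Python as follows. This is only an illustration: the function name `naive_search` is hypothetical, and shifts are reported 0-based, as in the pseudocode.

```python
def naive_search(text, pattern):
    """Return every shift s at which pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 candidate shifts in turn.
    for s in range(n - m + 1):
        # Comparing a slice costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_search("ABABA", "ABA"))
```

Each of the n - m + 1 shifts may cost up to m comparisons, which is where the Θ((n-m+1)m) worst case comes from.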

Knuth-Morris-Pratt Algorithm

[Figure: the string BABCBAABABABCAB with the pattern aligned at shift s, where q characters match, and aligned again at shift s', where k characters match.]

s + q = s' + k

Knuth-Morris-Pratt Algorithm(cont.) Prefix Function:

f(j) = the largest i < j such that P[1..i] = P[j-i+1..j], or 0 if no such i exists.

[Figure: for Pq = ABABA, the longest proper prefix Pk that is also a suffix of Pq is ABA.]

Knuth-Morris-Pratt Algorithm(cont.) Prefix Function Algorithm:

f[1] ← 0
k ← 0
for q ← 2 to m
    do while k > 0 and P[k+1] ≠ P[q]
           do k ← f[k]
       if P[k+1] = P[q]
           then k ← k+1
       f[q] ← k
return f[1..m]
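A minimal Python sketch of the prefix function algorithm above. It assumes 0-based indexing, so `f[q]` here corresponds to f[q+1] on the slide. The name `prefix_function` is illustrative.

```python
def prefix_function(pattern):
    """f[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of pattern[:q+1]."""
    m = len(pattern)
    f = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[k] != pattern[q]:
            k = f[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        f[q] = k
    return f


if __name__ == "__main__":
    print(prefix_function("ABABACABABA"))
```

The variable k increases by at most one per iteration of the outer loop, so the inner while loop cannot run more than m times in total. That is the amortized argument behind the O(m) bound.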

Knuth-Morris-Pratt Algorithm(cont.) Example:

Time Complexity:
Prefix function: O(m), by amortized analysis
Matching function: O(n)
Total: O(m+n), linear complexity

Pattern: ABABACABABA

k      1  2  3  4  5  6  7  8  9 10 11
P[k]   A  B  A  B  A  C  A  B  A  B  A
f[k]   0  0  1  2  3  0  1  2  3  4  5
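The slides give only the prefix function. The matching phase can be sketched as below, assuming the usual KMP scan in which the text pointer never moves backwards. The function name `kmp_search` and the 0-based shifts are choices made for this illustration.

```python
def kmp_search(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    # Prefix function of the pattern (0-based), computed in O(m).
    f = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = f[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        f[q] = k
    # Scan the text once; q is the number of pattern characters matched.
    shifts = []
    q = 0
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = f[q - 1]  # reuse the matched prefix instead of restarting
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = f[q - 1]  # keep going to find overlapping matches
    return shifts


if __name__ == "__main__":
    print(kmp_search("BABCBAABABABCAB", "ABABC"))
```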

The Boyer-Moore Algorithm Symbols used:

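Σ : the alphabet (the set of characters)
patlen : the length of the pattern
m : the number of characters at the end of the pattern that have matched
char : the mismatched character in the string

[Figure: the pattern aligned under the string; the last m characters of the pattern match, and char is the string character at the mismatch.]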

Characteristic

Match the pattern from its rightmost character to its leftmost character.

When the pattern is relatively long and Σ is reasonably large, this algorithm is likely to be the most efficient string-matching algorithm.

Bad Character heuristic Observation 1:

If char does not occur in pat:
Pattern shift: j characters
String pointer shift: patlen characters

Example:

string:  A D C A B C A B A
pattern: C B A

Bad Character heuristic (cont.)

Observation 2: char occurs in the pattern.

Let δ1[char] be the position of the rightmost occurrence of char in the pattern, and let j be the current position of the pattern pointer.

If j < δ1[char], we shift the pattern right by 1.

If j > δ1[char], we shift the pattern right by j - δ1[char].

δ1[] is an array whose size is the size of Σ.

Bad Character heuristic (cont.) Example:

string:  A C B B A C A B C A
pattern: A B C

j = 3 and δ1[B] = 2
pattern shift: 1
string pointer shift: 1 (m + pattern shift, with m = 0)
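The two observations can be combined into one small function. The sketch below assumes 1-based pattern positions, as on the slides, and stores δ1 in a dictionary rather than an array of size |Σ|. The names `rightmost_positions` and `bad_character_shift` are hypothetical.

```python
def rightmost_positions(pat):
    """Map each character of pat to the 1-based position of its
    rightmost occurrence (later occurrences overwrite earlier ones)."""
    return {ch: pos for pos, ch in enumerate(pat, start=1)}


def bad_character_shift(pat, j, char):
    """Pattern shift after a mismatch at pattern position j (1-based)
    against the string character char."""
    rightmost = rightmost_positions(pat)
    if char not in rightmost:
        return j                   # Observation 1: char is not in pat
    pos = rightmost[char]
    if j > pos:
        return j - pos             # align the rightmost char with the string
    return 1                       # rightmost char lies right of j: shift by 1


if __name__ == "__main__":
    # The example above: pattern ABC, mismatch at j = 3 against B.
    print(bad_character_shift("ABC", 3, "B"))
```

In a real matcher the table would be built once per pattern rather than on every call. It is rebuilt here only to keep the sketch short.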

Good Suffix heuristic

Two sequences [c1..cn] and [d1..dn] unify if, for every i from 1 to n, either ci = di, or ci = $, or di = $, where $ is a character that does not occur in pat.

The position of the rightmost plausible reoccurrence, rpr(j), is the greatest k ≤ patlen such that [pat(j+1)..pat(patlen)] and [pat(k)..pat(k+patlen-j-1)] unify, and either k ≤ 1 or pat(k-1) ≠ pat(j). Here pat(i) is taken to be $ for i < 1.

Good Suffix heuristic (cont.) Example:

Pattern shift: j + 1 - rpr(j)
String pointer shift: m + j + 1 - rpr(j) = (patlen - j) + j + 1 - rpr(j) = patlen + 1 - rpr(j) = δ2[j]

position  -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9
pat        $  $  $  $  $  $  $  $  A  B  X  Y  C  D  E  X  Y
j                                  1  2  3  4  5  6  7  8  9
rpr(j)                            -7 -6 -5 -4 -3 -2  3  0  9
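The definition of rpr(j) can be checked with a brute-force sketch. It is not the preprocessing algorithm from the paper: it simply tries k = j, j-1, ... until the definition is satisfied, which takes O(patlen²) time or worse. It assumes, following the original paper, that k ranges over values at most patlen, so rpr(patlen) = patlen and δ2[patlen] = 1. The function names are illustrative.

```python
def is_plausible(pat, j, k):
    """True if pat(j+1..patlen) unifies with pat(k..k+patlen-j-1) and
    either k <= 1 or pat(k-1) != pat(j). All positions are 1-based;
    positions below 1 act as the wildcard '$'."""
    patlen = len(pat)
    for t in range(patlen - j):
        pos = k + t
        if pos >= 1 and pat[pos - 1] != pat[j + t]:
            return False
    return k <= 1 or pat[k - 2] != pat[j - 1]


def rpr_table(pat):
    """rpr(j) for j = 1..patlen, returned as a 0-based Python list."""
    table = []
    for j in range(1, len(pat) + 1):
        k = j  # the largest k that can satisfy the definition
        while not is_plausible(pat, j, k):
            k -= 1
        table.append(k)
    return table


def delta2_table(pat):
    """String pointer shift: delta2[j] = patlen + 1 - rpr(j)."""
    return [len(pat) + 1 - k for k in rpr_table(pat)]


if __name__ == "__main__":
    print(rpr_table("ABXYCDEXY"))
    print(delta2_table("ABXYCDEXY"))
```

The search for k always terminates: once k ≤ j + 1 - patlen, every compared position is a $ and k ≤ 1 holds.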

Good Suffix heuristic (cont.) Algorithm:

Boyer-Moore Matching Algorithm

i = patlen
if n < patlen return false
j = patlen

while j > 0 do {
    if string(i) = pat(j) {
        j = j - 1
        i = i - 1
    } else {
        i = i + max(δ1(string(i)), δ2(j))
        if i > n then return false
        j = patlen
    }
}
return i + 1
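The matching loop can be turned into a runnable Python sketch under a few assumptions. Here δ1 is the string pointer shift used in the loop (patlen for characters not in the pattern, otherwise patlen minus the rightmost position), and δ2 is computed by brute force from the rpr definition rather than by the paper's linear-time preprocessing. The function names, the 0-based return value, and the -1 result for "not found" are choices made for this example.

```python
def delta1_table(pat):
    """String pointer shift for each character that occurs in pat."""
    patlen = len(pat)
    return {ch: patlen - pos for pos, ch in enumerate(pat, start=1)}


def delta2_table(pat):
    """delta2[j] = patlen + 1 - rpr(j), for 1-based j (index 0 unused)."""
    patlen = len(pat)

    def plausible(j, k):
        for t in range(patlen - j):
            pos = k + t  # positions below 1 act as the wildcard '$'
            if pos >= 1 and pat[pos - 1] != pat[j + t]:
                return False
        return k <= 1 or pat[k - 2] != pat[j - 1]

    table = [0] * (patlen + 1)
    for j in range(1, patlen + 1):
        k = j
        while not plausible(j, k):
            k -= 1
        table[j] = patlen + 1 - k
    return table


def boyer_moore(text, pat):
    """Return the 0-based index of the first occurrence of pat, or -1."""
    n, patlen = len(text), len(pat)
    if patlen == 0:
        return 0
    d1 = delta1_table(pat)
    d2 = delta2_table(pat)
    i = patlen  # 1-based string pointer, starts under the pattern's end
    while i <= n:
        j = patlen
        # Compare right to left while characters agree.
        while j > 0 and text[i - 1] == pat[j - 1]:
            i -= 1
            j -= 1
        if j == 0:
            return i  # 1-based position i + 1, i.e. 0-based index i
        # Mismatch: advance by the larger of the two heuristics.
        i += max(d1.get(text[i - 1], patlen), d2[j])
    return -1


if __name__ == "__main__":
    print(boyer_moore("ACBBACABCA", "ABC"))
```

Because δ2[j] is always at least patlen - j + 1, the string pointer ends up to the right of where the current attempt began, so the loop makes progress on every mismatch.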

Boyer-Moore Matching Algorithm Time Complexity:

Bad Character heuristic: O(patlen)
Good Suffix heuristic: O(patlen)
Matching: O(n)
Total: O(n + patlen)

Experimental Result

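Conclusion

The Boyer-Moore algorithm has O(n+m) time complexity and is sublinear on average, since it typically inspects fewer than n characters of the string.

Boyer-Moore is likely to be the most efficient string-matching algorithm when the pattern is long and the alphabet is reasonably large.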
