A Fast String Searching Algorithm
Robert S. Boyer and J Strother Moore
Communications of the ACM, vol. 20, no. 10, Oct. 1977
Outline:
Introduction
The Knuth-Morris-Pratt algorithm
The Boyer-Moore algorithm
    Bad Character heuristic
    Good Suffix heuristic
    Matching Algorithm
Experimental Result
Conclusion
Introduction
String Matching:
Searching for a pattern in a text or a longer string.
If the pattern occurs in the string, return the position of the first character of the substring that matches the pattern.
[Figure: the pattern aligned under the string at shift s.]
Introduction (cont.)
Some definitions:
m : the length of the pattern.
n : the length of the string (or text).
s (shift) : the distance between the first character of the string and the first character of the matched substring.
w ⊏ x : string w is a prefix of string x.
w ⊐ x : string w is a suffix of string x.
Introduction (cont.)
The naive string-matching algorithm:
for s ← 0 to n-m
    do if pattern[1..m] = string[s+1..s+m]
        then print "Pattern occurs with shift" s
Time Complexity: Θ((n-m+1)m) in the worst case, which is Θ(n²) if m = n/2.
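The naive loop can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `naive_match` is hypothetical, and shifts are reported 0-based, matching the shift s in the pseudocode.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 possible shifts in turn.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_match("BABCBAABABABCAB", "ABABA"))
```

Each of the n - m + 1 shifts may cost m comparisons, which is where the Θ((n-m+1)m) bound comes from.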
Knuth-Morris-Pratt Algorithm
[Figure: the pattern aligned under the string BABCBAABABABCAB at shift s with q characters matched, and again at a later shift s' with k characters matched, where s + q = s' + k.]
Knuth-Morris-Pratt Algorithm (cont.)
Prefix Function:
f(j) = the largest i < j such that P[1..i] = P[j-i+1..j], or 0 if no such i exists.
[Figure: Pq = ABABA and Pk = ABA, where Pk is a suffix of Pq.]
Knuth-Morris-Pratt Algorithm (cont.)
Prefix Function Algorithm:
f[1] ← 0
k ← 0
for q ← 2 to m
    do while k > 0 and P[k+1] ≠ P[q]
           do k ← f[k]
       if P[k+1] = P[q]
           then k ← k+1
       f[q] ← k
return f[1..m]
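A direct Python transcription of the prefix-function pseudocode might look like this. It is a sketch under one stated assumption: Python strings are 0-indexed, so `f[q]` here holds the value the slides call f[q+1]. The name `prefix_function` is illustrative.

```python
def prefix_function(p: str) -> list[int]:
    """f[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    m = len(p)
    f = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until p[k] can extend the match.
        while k > 0 and p[k] != p[q]:
            k = f[k - 1]
        if p[k] == p[q]:
            k += 1
        f[q] = k
    return f


if __name__ == "__main__":
    print(prefix_function("ABABACABABA"))
```

Because k increases by at most one per iteration of the outer loop and every pass through the inner loop decreases it, the total work is O(m).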
Knuth-Morris-Pratt Algorithm (cont.)
Example:
k      1  2  3  4  5  6  7  8  9  10  11
P[k]   A  B  A  B  A  C  A  B  A  B   A
f[k]   0  0  1  2  3  0  1  2  3  4   5
Time Complexity:
Prefix function: O(m) by amortized analysis
Matching function: O(n)
Total: O(m+n), linear complexity
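The slides give only the prefix function, not the matching function. A common way to write the O(n) matching step is sketched here; the function names and the 0-based shifts are assumptions of this sketch rather than anything taken from the slides.

```python
def prefix_function(p: str) -> list[int]:
    """0-indexed prefix function: f[q] is the longest border of p[:q+1]."""
    f = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = f[k - 1]
        if p[k] == p[q]:
            k += 1
        f[q] = k
    return f


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    f = prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, reuse the longest border instead of rescanning text.
        while q > 0 and pattern[q] != ch:
            q = f[q - 1]
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = f[q - 1]  # keep going to find overlapping occurrences
    return shifts


if __name__ == "__main__":
    print(kmp_search("BABCBAABABABCAB", "ABABA"))
```

The text index never moves backwards, which is why the matching phase is linear in n.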
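The Boyer-Moore Algorithm
Symbols used:
Σ : the alphabet (the set of characters)
patlen : the length of the pattern
m : the number of characters at the end of the pattern that have matched
char : the mismatched character in the string
[Figure: the pattern aligned under the string, with the last m pattern characters matched and char marking the mismatched string character.]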
Characteristics:
The pattern is matched from its rightmost character to its leftmost character.
When the pattern is relatively long and Σ is reasonably large, this algorithm is likely to be the most efficient string-matching algorithm.
Bad Character heuristic
Observation 1: char does not occur in pat.
Pattern shift: j characters
String pointer shift: patlen characters
Example:
string:  A D C A B C A B A
pattern: C B A
Bad Character heuristic (cont.)
Observation 2: char occurs in the pattern.
Let δ1[char] be the position of the rightmost occurrence of char in the pattern, and let j be the current position of the pattern pointer.
If j < δ1[char], we shift the pattern right by 1.
If j > δ1[char], we shift the pattern right by j - δ1[char].
δ1[] is an array whose size is the size of Σ.
Bad Character heuristic (cont.)
Example:
string:  A C B B A C A B C A
pattern: A B C
j = 3 and δ1[B] = 2
pattern shift: 1
string pointer shift: 1 (m + pattern shift)
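The two bad-character observations can be combined into one small routine. The sketch below follows the convention of this section, where δ1[char] is the 1-based position of the rightmost occurrence of char in the pattern; it uses 0 for a character that does not occur, which is an assumption of the sketch. The function names are hypothetical.

```python
def rightmost_positions(pat: str) -> dict[str, int]:
    """Map each character of pat to the 1-based position of its rightmost occurrence."""
    # Later positions overwrite earlier ones, so the rightmost one wins.
    return {ch: pos for pos, ch in enumerate(pat, start=1)}


def bad_char_pattern_shift(pat: str, j: int, char: str) -> int:
    """Pattern shift after a mismatch between pat(j) and the string character char."""
    pos = rightmost_positions(pat).get(char, 0)  # 0: char is absent (assumption)
    if j > pos:
        # Observation 1 (pos == 0) gives j; Observation 2 gives j - pos.
        return j - pos
    # The rightmost occurrence lies to the right of j: shift by 1.
    return 1


if __name__ == "__main__":
    print(bad_char_pattern_shift("ABC", 3, "B"))
    print(bad_char_pattern_shift("CBA", 3, "D"))
```

The string pointer then advances by m plus this pattern shift, where m is the number of characters already matched.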
Good Suffix heuristic
Two sequences [c1..cn] and [d1..dn] unify if, for every i from 1 to n, either ci = di, or ci = $, or di = $, where $ is a character that does not occur in pat.
The position of the rightmost plausible reoccurrence, rpr(j), is the greatest k such that [pat(j+1)..pat(patlen)] and [pat(k)..pat(k+patlen-j-1)] unify, and either k ≤ 1 or pat(k-1) ≠ pat(j). Positions to the left of the pattern are treated as $.
Good Suffix heuristic (cont.)
Pattern shift: j + 1 - rpr(j)
String pointer shift: m + j + 1 - rpr(j)
    = patlen - j + j + 1 - rpr(j)
    = patlen + 1 - rpr(j) = δ2[j]
Example:
j        -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9
pat       $  $  $  $  $  $  $  $  A  B  X  Y  C  D  E  X  Y
rpr(j)                            -7 -6 -5 -4 -3 -2  3  0  1
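To make the definition of rpr(j) concrete, here is a brute-force sketch that searches for the greatest k directly. It is meant for checking small examples, not as the O(patlen) preprocessing. Two assumptions are made: candidates are limited to k ≤ j so that the reoccurrence stays inside the pattern, and only j < patlen is handled, since for j = patlen the suffix is empty and the result depends on how k is bounded. The names `rpr` and `delta2` mirror the slide symbols, but the code itself is illustrative.

```python
def rpr(pat: str, j: int) -> int:
    """Rightmost plausible reoccurrence of pat(j+1..patlen), for 1 <= j < patlen."""
    patlen = len(pat)
    if not 1 <= j < patlen:
        raise ValueError("this sketch only handles 1 <= j < patlen")
    suffix_len = patlen - j

    def unifies(k: int) -> bool:
        for t in range(suffix_len):
            pos = k + t  # 1-based position inside the candidate reoccurrence
            # Positions left of the pattern hold '$', which unifies with anything.
            if pos >= 1 and pat[pos - 1] != pat[j + t]:
                return False
        return True

    # Scan candidates from right to left; k = j + 1 - patlen always succeeds.
    for k in range(j, -patlen, -1):
        if unifies(k) and (k <= 1 or pat[k - 2] != pat[j - 1]):
            return k
    raise AssertionError("unreachable: the all-'$' candidate always unifies")


def delta2(pat: str, j: int) -> int:
    """String pointer shift patlen + 1 - rpr(j)."""
    return len(pat) + 1 - rpr(pat, j)


if __name__ == "__main__":
    print([rpr("ABXYCDEXY", j) for j in range(1, 9)])
```

For the pattern ABXYCDEXY, the only j below patlen with a reoccurrence inside the pattern is j = 7, where the suffix XY reappears at position 3.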
Boyer-Moore Matching Algorithm
i = patlen
if n < patlen return false
j = patlen
while j > 0 do {
    if string(i) = pat(j)
        j = j-1
        i = i-1
    else
        i = i + max(δ1(string(i)), δ2(j))
        j = patlen
        if i > n then return false
}
return i+1
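A runnable sketch of the matching loop follows. It rests on several assumptions that are not spelled out in the slides: δ1(char) is taken as the string pointer shift patlen minus the rightmost position of char (patlen if char is absent), δ2(j) = patlen + 1 - rpr(j) is computed by brute force from the rpr definition, and δ2(patlen) is set to 1 as the smallest safe shift. The function name `boyer_moore_search` and the 0-based return value are choices made for this sketch.

```python
def boyer_moore_search(text: str, pat: str) -> int:
    """Return the 0-based index of the first occurrence of pat in text, or -1."""
    patlen, n = len(pat), len(text)
    if patlen == 0:
        return 0

    # delta1 as a string pointer shift (assumption stated in the lead-in).
    rightmost = {ch: pos for pos, ch in enumerate(pat, start=1)}

    def delta1(ch: str) -> int:
        return patlen - rightmost.get(ch, 0)

    def rpr(j: int) -> int:
        # Brute-force rightmost plausible reoccurrence, for 1 <= j < patlen.
        for k in range(j, -patlen, -1):
            ok = all(k + t < 1 or pat[k + t - 1] == pat[j + t]
                     for t in range(patlen - j))
            if ok and (k <= 1 or pat[k - 2] != pat[j - 1]):
                return k
        raise AssertionError("unreachable")

    delta2 = [0] * (patlen + 1)  # 1-based table; index 0 is unused
    for j in range(1, patlen):
        delta2[j] = patlen + 1 - rpr(j)
    delta2[patlen] = 1  # assumption: minimal safe shift when nothing matched

    i = patlen  # 1-based string pointer, starts under the last pattern character
    while i <= n:
        j = patlen
        # Compare right to left while the characters agree.
        while j > 0 and text[i - 1] == pat[j - 1]:
            i -= 1
            j -= 1
        if j == 0:
            return i  # the match starts at 1-based i + 1, i.e. 0-based i
        i += max(delta1(text[i - 1]), delta2[j])
    return -1


if __name__ == "__main__":
    print(boyer_moore_search("ACBBACABCA", "ABC"))
```

Taking the maximum of the two shifts is safe because each heuristic on its own never skips a possible occurrence, and δ2(j) is always at least m + 1, so the pattern never moves backwards.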
Boyer-Moore Matching Algorithm
Time Complexity:
Bad Character heuristic: O(patlen)
Good Suffix heuristic: O(patlen)
Matching: O(n)
Total: O(n+patlen)