8/8/2019 Sabin M. Thomas - String Matching Algorithms
String Matching Algorithms
Sabin Thomas
-
History of String Search
The brute force algorithm:
invented in the dawn of computer history
re-invented many times, still common
Knuth & Pratt invented a better one in 1970
published 1977 as Knuth-Morris-Pratt
Boyer & Moore found a better one before 1976
Published 1977
Karp & Rabin found a better one in 1980
Published 1987
-
Brute-force
Worst O(m*n)
Best O(n)
algorithm brute-force:
    input: an array of characters, T (the string to be analyzed), length n
           an array of characters, P (the pattern to be searched for), length m
    for i := 0 to n-m do
        for j := 0 to m-1 do
            compare T[i+j] with P[j]
            if not equal, exit the inner loop
        if the inner loop ran to completion, report a match at i
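The brute-force pseudocode above can be sketched in Python roughly as follows. The function name `brute_force_search` and the choice to return a list of all match positions are assumptions made for illustration; the slide does not say what is reported.

```python
def brute_force_search(text, pattern):
    """Return every zero-based index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try every alignment of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Compare T[i+j] with P[j]; stop at the first mismatch.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:  # the inner loop ran to completion
            matches.append(i)
    return matches


print(brute_force_search("ABCEFGABCDE", "ABCD"))  # [6]
```

The worst case O(m*n) shows up when almost every alignment matches for many characters before failing, for example a text of repeated `A`s and a pattern of `A`s ending in `B`.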
-
Boyer-Moore
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
A     B     C     D
p[0]  p[1]  p[2]  p[3]
                  N
There is no E in the pattern: thus the pattern can't match if any of its characters lie
under t[3]. So, move four boxes to the right.
-
Boyer-Moore
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                        A     B     C     D
                        p[0]  p[1]  p[2]  p[3]
                                          N
Again, no match. But there is a B in the pattern. So move two boxes to the
right.
-
Boyer-Moore
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                                    A     B     C     D
                                    p[0]  p[1]  p[2]  p[3]
                                    Y     Y     Y     Y
-
Boyer-Moore
(Pseudocode)
Compares right to left
2 precomputed functions
Good suffix shift
Bad character shift
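As a rough illustration of the right-to-left comparison and the bad character shift, here is a minimal Python sketch. It deliberately omits the good suffix shift, so it is a simplified variant and not the full Boyer-Moore algorithm; the names `bad_character_table` and `boyer_moore_bad_char` and the returned list of shifts are assumptions made for illustration.

```python
def bad_character_table(pattern):
    """Map each character to the index of its last occurrence in pattern."""
    return {ch: idx for idx, ch in enumerate(pattern)}


def boyer_moore_bad_char(text, pattern):
    """Return (index of first match or -1, list of shifts taken).

    Uses only the bad character shift; the good suffix shift is omitted.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, []
    last = bad_character_table(pattern)
    shifts = []
    s = 0  # current alignment of pattern[0] against text[s]
    while s <= n - m:
        j = m - 1
        # Compare right to left.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            return s, shifts
        # Align the mismatched text character with its last occurrence
        # in the pattern, or move past it if it does not occur at all.
        shift = max(1, j - last.get(text[s + j], -1))
        shifts.append(shift)
        s += shift
    return -1, shifts


print(boyer_moore_bad_char("ABCEFGABCDE", "ABCD"))  # (6, [4, 2])
```

On the text and pattern from Example 1, the sketch takes the same two shifts as the slides (four boxes, then two) before matching at t[6].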
-
Boyer-Moore
(Performance)
Performance depends on length of pattern
Best case O(n/m)
Longer patterns = better performance
Smallest pattern = m = 1
O(n) linear search
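The effect of pattern length can be made concrete by counting character comparisons. The sketch below is an assumption-laden illustration: it uses only the bad character shift (no good suffix shift), the helper name `count_comparisons` is hypothetical, and the input is chosen as a best case in which no text character occurs in the pattern.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by a bad-character-only search."""
    last = {ch: idx for idx, ch in enumerate(pattern)}
    n, m = len(text), len(pattern)
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[s + j]:
                break
            j -= 1
        if j < 0:
            s += 1  # full match: step by one to keep the sketch simple
        else:
            s += max(1, j - last.get(text[s + j], -1))
    return comparisons


text = "a" * 1000
for m in (1, 10, 100):
    print(m, count_comparisons(text, "b" * m))
# 1 1000
# 10 100
# 100 10
```

With n = 1000, the comparison count is n/m in this best case, and it degrades to a plain linear scan when m = 1.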
-
Knuth-Morris-Pratt
searches for occurrences of a "word" W
within a main "text string" S
Bypasses re-examination of previously matched characters.
-
Knuth-Morris-Pratt
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
A     B     C           A     B     C     D     A     B           A     B     C
A     B     C     D     A     B     D
p[0]  p[1]  p[2]  p[3]  p[4]  p[5]  p[6]
Y     Y     Y     N
m = 0
-
Knuth-Morris-Pratt
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
A     B     C           A     B     C     D     A     B           A     B     C
                        A     B     C     D     A     B     D
                        p[0]  p[1]  p[2]  p[3]  p[4]  p[5]  p[6]
                        Y     Y     Y     Y     Y     Y     N
m = 4
-
Knuth-Morris-Pratt
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
A     B     C           A     B     C     D     A     B           A     B     C
                                                            A     B     C     D     A     B     D
                                                            p[0]  p[1]  p[2]  p[3]  p[4]  p[5]  p[6]
                                                            N
m = 10
-
Knuth-Morris-Pratt
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10] t[11] t[12] t[13]
A     B     C           A     B     C     D     A     B           A     B     C
                                                                  A     B     C     ..
                                                                  p[0]  p[1]  p[2]  ..
                                                                  Y     Y     Y
m = 11
-
Knuth-Morris-Pratt
PseudoCode
Search O(n)
algorithm kmp_search:
    input: an array of characters, S (the text to be searched)
           an array of characters, W (the word sought)
    output: an integer (the zero-based position in S at which W is found)
    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)
    while m + i is less than the length of S, do:
        if W[i] = S[m + i],
            let i ← i + 1
            if i equals the length of W,
                return m
        otherwise,
            let m ← m + i - T[i],
            if i > 0,
                let i ← T[i]
    (if we reach here, W does not occur in S)
    return not found
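A direct Python transcription of the search loop might look like the sketch below. It assumes the table is passed in, as the pseudocode says it is "computed elsewhere"; the hard-coded `TABLE_ABCDABD` is the table for the word `ABCDABD` with T[0] = -1. Returning -1 when the word is absent and 0 for an empty word are conventions chosen here, not something the slide specifies.

```python
def kmp_search(text, word, table):
    """Return the zero-based index of the first match of word in text, or -1."""
    if not word:
        return 0
    m = 0  # beginning of the current match in text
    i = 0  # position of the current character in word
    while m + i < len(text):
        if word[i] == text[m + i]:
            i += 1
            if i == len(word):
                return m
        else:
            # Skip ahead without re-examining characters already matched.
            m = m + i - table[i]
            if i > 0:
                i = table[i]
    return -1


# Table for "ABCDABD" (assumed precomputed).
TABLE_ABCDABD = [-1, 0, 0, 0, 0, 1, 2]

print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", TABLE_ABCDABD))  # 15
```

On the 14-character text shown in Example 1 (`ABC ABCDAB ABC`), the sketch runs off the end of the text and returns -1.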
-
Knuth-Morris-Pratt
PseudoCode
Partial Match Table O(k)
algorithm kmp_table:
    input: an array of characters, W (the word to be analyzed)
           an array of integers, T (the table to be filled)
    define variables:
        an integer, i ← 2 (the current position we are computing in T)
        an integer, j ← 0 (the zero-based index in W of the next character of the current candidate substring)
    let T[0] ← -1, T[1] ← 0
    while i is less than the length of W, do:
        (first case: the substring continues)
        if W[i - 1] = W[j], let T[i] ← j + 1, i ← i + 1, j ← j + 1
        (second case: it doesn't, but we can fall back)
        otherwise, if j > 0, let j ← T[j]
        (third case: we have run out of candidates. Note j = 0)
        otherwise, let T[i] ← 0, i ← i + 1
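The table construction above can be sketched in Python as follows. The function returns a new list instead of filling a caller-supplied array, and it tolerates words shorter than two characters; both are choices made for this illustration.

```python
def kmp_table(word):
    """Build the partial match table for word, with table[0] == -1."""
    n = len(word)
    table = [0] * n
    if n > 0:
        table[0] = -1
    i = 2  # current position being computed in the table
    j = 0  # index in word of the next character of the candidate prefix
    while i < n:
        if word[i - 1] == word[j]:
            # First case: the candidate prefix continues.
            table[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            # Second case: fall back to a shorter candidate.
            j = table[j]
        else:
            # Third case: no candidates left (j == 0).
            table[i] = 0
            i += 1
    return table


print(kmp_table("ABCDABD"))  # [-1, 0, 0, 0, 0, 1, 2]
```

The entries 1 and 2 at positions 5 and 6 record that `A` and `AB` reappear before those positions, which is what lets the search resume mid-pattern after a mismatch.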
-
Karp-Rabin
Slower for Single pattern match
Fast for Multiple pattern match
Trick is Hash compare.
-
Karp-Rabin
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     C     D     E     A     C     A     C     C     D     E
A     C     D     B
p[0]  p[1]  p[2]  p[3]
Hash(ACDB) = 5
Hash(ACDE) = 10
-
Karp-Rabin
(Example 1)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     C     D     E     A     C     A     C     C     D     E
      A     C     D     B
      p[0]  p[1]  p[2]  p[3]
Hash(ACDB) = 5
Hash(CDEA) = 6
-
Karp-Rabin
(Caveats)
Needs a good hashing function with few collisions
Hashing result must be a small number for faster compare
Rolling Hash
S.T1
-
Slide 19
S.T1 Rolling Hash - Allows for faster recomputing of the hash. Instead of completely recomputing from scratch, make use of the fact that we
are computing the hash for just one extra letter. Do addition and subtraction on the original hash value.
Sabin Thomas, 4/11/2007
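One common way to realize the rolling hash described in this note is a polynomial hash modulo a prime. The sketch below is an assumption: the slides do not say which hash function was used, and the constants `BASE` and `MOD` and the helper names `full_hash` and `roll` are illustrative, so the values will not reproduce the hash numbers shown in the examples.

```python
BASE = 256       # assumed alphabet size
MOD = 1_000_003  # assumed prime modulus; keeps hash values small


def full_hash(s):
    """Polynomial hash of s, computed from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def roll(h, out_ch, in_ch, high):
    """Slide the window one character to the right in O(1).

    high must equal BASE ** (m - 1) % MOD for window length m.
    """
    h = (h - ord(out_ch) * high) % MOD     # subtract the leaving character
    return (h * BASE + ord(in_ch)) % MOD   # shift and add the entering one


text, m = "ACDEACACCDE", 4
high = pow(BASE, m - 1, MOD)
h = full_hash(text[:m])                   # hash of "ACDE"
h = roll(h, text[0], text[m], high)       # now the hash of "CDEA"
print(h == full_hash("CDEA"))  # True
```

Each window after the first costs one subtraction, one multiplication, and one addition, regardless of the pattern length.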
-
Karp-Rabin
(Pseudocode)
algorithm RabinKarp:
    input: an array of characters, s, length n
           an array of characters, sub, length m
    hsub := hash(sub[1..m])
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs = hsub
            if s[i..i+m-1] = sub
                return i
        if i < n-m+1
            hs := hash(s[i+1..i+m])
    return not found
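The pseudocode above could be written in Python along these lines. This is a sketch under assumptions: it uses zero-based indices instead of the slide's one-based ones, a polynomial rolling hash with illustrative constants `BASE` and `MOD` (the slides do not specify the hash), and -1 as the "not found" value.

```python
BASE = 256       # assumed alphabet size
MOD = 1_000_003  # assumed prime modulus


def rabin_karp(text, pattern):
    """Return the zero-based index of the first match of pattern, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(BASE, m - 1, MOD)
    h_pat = 0
    h_win = 0
    for k in range(m):
        h_pat = (h_pat * BASE + ord(pattern[k])) % MOD
        h_win = (h_win * BASE + ord(text[k])) % MOD
    for i in range(n - m + 1):
        # Compare hashes first; verify characters only on a hash hit,
        # because different strings can share a hash value.
        if h_win == h_pat and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Rolling update: drop text[i], append text[i + m].
            h_win = ((h_win - ord(text[i]) * high) * BASE
                     + ord(text[i + m])) % MOD
    return -1


print(rabin_karp("ACDEACACCDE", "CCDE"))  # 7
print(rabin_karp("ACDEACACCDE", "ACDB"))  # -1
```

The character-by-character check after a hash hit is what guards against collisions; with a poor hash that check runs often, which is where the O(mn) worst case comes from.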
-
Karp-Rabin
(Example 2)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     C     D     E     A     C     A     C     C     D     E
Hash(ACDE) = 10
A     C     D     B
p[0]  p[1]  p[2]  p[3]
Hash(ACDB) = 5
F     M     D     T
j[0]  j[1]  j[2]  j[3]
Hash(FMDT) = 62
S.T2
-
Slide 21
S.T2 Multiple Pattern Search
Sabin Thomas, 4/11/2007
-
Karp-Rabin
(Example 2)
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     C     D     E     A     C     A     C     C     D     E
Hash(CDEA) = 6
      A     C     D     B
      p[0]  p[1]  p[2]  p[3]
Hash(ACDB) = 5
      F     M     D     T
      j[0]  j[1]  j[2]  j[3]
Hash(FMDT) = 62
S.T3
-
Slide 22
S.T3 Hashes of the patterns have already been precomputed.
Sabin Thomas, 4/11/2007
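The multiple-pattern idea from Example 2 can be sketched as follows: hash every pattern once up front, then compare each window's rolling hash against that precomputed set. This sketch assumes all patterns are non-empty and share one length; the polynomial hash, the constants `BASE` and `MOD`, and the name `rabin_karp_multi` are illustrative and not taken from the slides.

```python
BASE = 256       # assumed alphabet size
MOD = 1_000_003  # assumed prime modulus


def poly_hash(s):
    """Polynomial hash of s modulo MOD."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rabin_karp_multi(text, patterns):
    """Return (index, pattern) pairs for every occurrence of any pattern.

    Assumes all patterns are non-empty and have the same length.
    """
    lengths = {len(p) for p in patterns}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("patterns must be non-empty and of equal length")
    m = lengths.pop()
    n = len(text)
    # Precompute the pattern hashes once.
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(poly_hash(p), set()).add(p)
    found = []
    if m > n:
        return found
    high = pow(BASE, m - 1, MOD)
    h_win = poly_hash(text[:m])
    for i in range(n - m + 1):
        if h_win in by_hash:  # one lookup covers all patterns
            window = text[i:i + m]
            if window in by_hash[h_win]:
                found.append((i, window))
        if i < n - m:
            h_win = ((h_win - ord(text[i]) * high) * BASE
                     + ord(text[i + m])) % MOD
    return found


print(rabin_karp_multi("ACDEACACCDE", ["ACDB", "FMDT"]))  # []
print(rabin_karp_multi("ACDEACACCDE", ["ACDB", "CDEA", "CCDE"]))
# [(1, 'CDEA'), (7, 'CCDE')]
```

Because the per-window work is a single dictionary lookup, adding more patterns increases only the precomputation, not the scan over the text.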
-
Karp-Rabin
(Performance)
Single Pattern
BM O(n/m) (best case)
KMP O(n)
Karp-Rabin O(mn) (worst case)
Multiple Pattern (k patterns)
BM, KMP O(n k)
Karp-Rabin O(n + k) (expected)
-
(Applications)
BM - Text editors' search/replace
Karp-Rabin - Plagiarism finder
-
Questions?