String Matching Algorithms
Sirwe Saeedi Spring 2019
Advanced Algorithms and Data Structure
• A text as an array of characters T[1..n]
• A pattern as an array of characters P[1..m]
• m<=n
• The characters
Formalize String Matching Problem
[Figure: a 15-character text T[1..15] with the 5-character pattern P[1..5] aligned under T[11..15]]
T[11] = P[1], T[12] = P[2], T[13] = P[3], T[14] = P[4], T[15] = P[5]
String Matching Problem
s = 10: shift of the first occurrence of the pattern, i.e. T[s+1..s+m] = P[1..m]
Check P against each substring of T for all possible shifts s = 0, 1, ..., n-m
[Figure: pattern P[1..5] sliding under text T[1..15], one position per step]
for s=0 test T[1..5] = P[1..5]
for s=1 test T[2..6] = P[1..5]
for s=2 test T[3..7] = P[1..5]
...
for s=10 test T[11..15] = P[1..5]
https://labs.xjtudlc.com/labs/wldmt/reading%20list/books/Algorithms%20and%20optimization/Introduction%20to%20Algorithms.pdf
Naive String Matching Algorithm
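The shift-by-shift check can be sketched in Python as follows. This is a minimal illustration, not the pseudocode from the slide: the function name `naive_match` and the sample text are assumptions made for the example, and shifts are reported 0-based as in the slides (s = 0 tests T[1..m]).

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # all possible shifts 0..n-m
        if text[s:s + m] == pattern:      # test T[s+1..s+m] = P[1..m] in slide notation
            shifts.append(s)
    return shifts


# Hypothetical 15-character text whose only occurrence of the pattern is at shift 10
print(naive_match("LOLLOLEOLLHELLO", "HELLO"))   # [10]
```

If the pattern is longer than the text, the range of shifts is empty and the function returns an empty list.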
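Matching time in the worst case: O((n-m+1)m), which is O(n^2) when m is proportional to n (for example m = n/2)
Worst-case input: Text = a^n (n copies of a), Pattern = a^m (m copies of a); every one of the n-m+1 shifts requires m character comparisons
Naive String Matching Algorithm
Time Complexity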
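To make the worst-case bound concrete, the sketch below counts individual character comparisons made by the naive method. The helper name `count_comparisons` and the sizes n = 20, m = 10 are assumptions chosen only for illustration.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons performed by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                     # mismatch: move on to the next shift
    return comparisons


n, m = 20, 10
print(count_comparisons("a" * n, "a" * m))   # 110 = m * (n - m + 1)
print(count_comparisons("a" * n, "b" * m))   # 11: one comparison per shift
```

With Text = a^n and Pattern = a^m no shift ever fails early, so the count reaches m(n-m+1) exactly.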
• The Rabin-Karp algorithm calculates a hash value for the pattern and for each M-character substring of the text to be compared.
• If the hash values are unequal, the algorithm calculates the hash value for the next M-character substring.
• If the hash values are equal, the algorithm compares the pattern and the M-character substring character by character.
• In this way, there is only one hash comparison per text substring, and character matching is needed only when the hash values match.
Rabin-Karp String Matching Algorithm
• Consider an M-character sequence as an M-digit number in base b, where b is the number of letters in the alphabet. The substring t[i..i+M-1] is mapped to the number:
x(i) = t[i]*b^(M-1) + t[i+1]*b^(M-2) + … + t[i+M-1]
• Furthermore, given x(i) we can compute x(i+1) for the next substring t[i+1..i+M] in constant time, as follows:
x(i+1) = t[i+1]*b^(M-1) + t[i+2]*b^(M-2) + … + t[i+M]
Some mathematics
• x(i+1) = x(i)*b        (shift left one digit)
           - t[i]*b^M    (subtract leftmost digit)
           + t[i+M]      (add new rightmost digit)
• We adjust the existing value when we move over one character
• Each new M-digit number is computed from the previous one in constant time
Some mathematics
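The update rule can be checked numerically. The sketch below is an illustration under assumptions: the names `window_value` and `roll` are hypothetical, and the letters are encoded as a=0, b=1, c=2 so that the base is 3. It recomputes x(i+1) from scratch and compares it with the constant-time update.

```python
def window_value(digits: list[int], i: int, M: int, base: int) -> int:
    """x(i): the M-digit base-`base` number formed by digits[i..i+M-1]."""
    x = 0
    for d in digits[i:i + M]:
        x = x * base + d                  # Horner's rule
    return x


def roll(x: int, digits: list[int], i: int, M: int, base: int) -> int:
    """x(i+1) from x(i): shift left, subtract leftmost digit, add new rightmost digit."""
    return x * base - digits[i] * base**M + digits[i + M]


# Assumed encoding for the example: a=0, b=1, c=2, so base = 3
digits = [ord(ch) - ord("a") for ch in "aabbcaba"]
base, M = 3, 3

x = window_value(digits, 0, M, base)
for i in range(len(digits) - M):
    x = roll(x, digits, i, M, base)
    assert x == window_value(digits, i + 1, M, base)   # update agrees with recomputation
print("rolling update matches direct computation")
```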
• We hash the value by taking it mod a prime number q. The mod function is useful in this case:
1. [(x mod q) + (y mod q)] mod q = (x+y) mod q
2. (x mod q) mod q = x mod q
• For these reasons: hash(x(i)) = ((t[i]*b^(M-1) mod q) + (t[i+1]*b^(M-2) mod q) + … + (t[i+M-1] mod q)) mod q
• So: hash(x(i+1)) = (hash(x(i))*b mod q - t[i]*b^M mod q + t[i+M] mod q) mod q
Some mathematics
https://labs.xjtudlc.com/labs/wldmt/reading%20list/books/Algorithms%20and%20optimization/Introduction%20to%20Algorithms.pdf
Rabin-Karp String Matching Algorithm
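Putting the rolling hash and the verification step together gives a sketch like the one below. It is one possible implementation under stated assumptions, not the pseudocode shown on the slide: the function name `rabin_karp`, the use of `ord()` as the character code, and the defaults base = 256 and q = 101 are all choices made for this example.

```python
def rabin_karp(text: str, pattern: str, base: int = 256, q: int = 101):
    """Return (shifts, spurious_hits); shifts are 0-based positions of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0
    high = pow(base, m, q)                 # b^M mod q, used to remove the leftmost character
    hp = ht = 0
    for j in range(m):                     # preprocessing: hash pattern and first window
        hp = (hp * base + ord(pattern[j])) % q
        ht = (ht * base + ord(text[j])) % q
    shifts, spurious = [], 0
    for s in range(n - m + 1):
        if hp == ht:                       # hashes agree: verify character by character
            if text[s:s + m] == pattern:
                shifts.append(s)
            else:
                spurious += 1              # collision: equal hashes, different strings
        if s < n - m:                      # roll the hash to the next window
            ht = (ht * base - ord(text[s]) * high + ord(text[s + m])) % q
    return shifts, spurious


shifts, spurious = rabin_karp("aabbcaba", "cab")
print(shifts)   # [4]
```

Python's `%` returns a non-negative result for a positive modulus, so the subtraction in the update needs no extra correction; in languages where `%` can be negative, q would have to be added back.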
Text = ‘aabbcaba’
Pattern = ‘cab’, hash(‘cab’) = 0
hash(‘aab’) = 3
hash(‘abb’) = 0
hash(‘bbc’) = 3
hash(‘bca’) = 0
hash(‘aba’) = 0
hash(‘cba’) = 0
Collision happened in hashing, but the algorithm handles it: a window such as ‘abb’ has the same hash as the pattern, and the character-by-character comparison rejects it
Rabin-Karp Algorithm Example
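How collisions are handled can be traced window by window. The sketch below uses a hypothetical toy hash (base 3 with a=0, b=1, c=2, reduced mod q = 5), which is not the hash function behind the values listed above, so the numbers differ. For clarity it recomputes each hash from scratch instead of rolling it; `toy_hash` and `classify_windows` are names invented for this example.

```python
def toy_hash(s: str, q: int = 5) -> int:
    """Hypothetical hash: base-3 number with a=0, b=1, c=2, reduced mod q."""
    x = 0
    for ch in s:
        x = (x * 3 + (ord(ch) - ord("a"))) % q
    return x


def classify_windows(text: str, pattern: str, q: int = 5):
    """For each window return (window, hash, outcome)."""
    hp = toy_hash(pattern, q)
    m = len(pattern)
    report = []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        h = toy_hash(window, q)
        if h != hp:
            outcome = "skip"           # hashes differ: no character comparison needed
        elif window == pattern:
            outcome = "match"          # hashes equal and strings equal
        else:
            outcome = "collision"      # hashes equal, strings differ
        report.append((window, h, outcome))
    return report


for row in classify_windows("aabbcaba", "cab"):
    print(row)
# ('aab', 1, 'skip')
# ('abb', 4, 'collision')
# ('bbc', 4, 'collision')
# ('bca', 0, 'skip')
# ('cab', 4, 'match')
# ('aba', 3, 'skip')
```

A larger prime q makes collisions rarer, but the explicit comparison is what keeps the result correct either way.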
Preprocessing time: O(m)
Matching time in the worst case: O((n-m+1)m), which is O(n^2) when m is proportional to n
Performs better in the average case
Time Complexity
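• Knuth-Morris-Pratt Algorithm
• Improves the worst-case matching time to O(n), after O(m) preprocessing of the pattern
• Uses the degenerating property of the pattern (the same sub-patterns appear more than once in the pattern)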
KMP String Matching Algorithm
[Figure: pattern aligned under the text, then shifted one position after a mismatch]
Pattern shifted one position
Need preprocessing of pattern
KMP Algorithm Example
• pattern[]:  A  B  X  A  B
• index:      0  1  2  3  4
• LPS[]:      0  0  0  1  2
LPS[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i]
LPS[0] = 0, LPS[1] = 0, LPS[2] = 0, LPS[3] = 1, LPS[4] = 2
KMP Algorithm Preprocessing
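The LPS table can be built in O(m) time. The following is a minimal sketch of the usual construction; the function name `compute_lps` is an assumption for this example, and the pattern ABXAB is the one used in the preprocessing slides.

```python
def compute_lps(pattern: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix."""
    lps = [0] * len(pattern)
    length = 0                          # length of the current longest prefix-suffix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1                 # extend the current prefix-suffix
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]    # fall back to a shorter candidate, do not advance i
        else:
            lps[i] = 0                  # no prefix-suffix ends at position i
            i += 1
    return lps


print(compute_lps("ABXAB"))   # [0, 0, 0, 1, 2]
```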
• To search for the pattern in the main text, use the LPS array
• After a mismatch (or a full match), the LPS value tells us which pattern character to compare next
• The idea is to avoid re-matching characters that we already know will match
KMP Algorithm Searching the Pattern
• Text[]:     A  B  X  A  B  A  B  X  A  B
• pattern[]:  A  B  X  A  B
• index:      0  1  2  3  4
• LPS[]:      0  0  0  1  2
[Figure: the pattern slides along the text; labels mark the current character and the substring behind the current character, pattern[0..1] = ‘AB’]
KMP Algorithm Searching the Pattern
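The search phase can be sketched as below. This is an illustrative implementation, not necessarily the one the slides follow: the names `compute_lps` and `kmp_search` are assumptions, and the sample text ABXABABXAB is assumed to be the one in the figure. The text index never moves backwards; on a mismatch only the pattern index falls back, using the LPS table.

```python
def compute_lps(pattern: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix."""
    lps = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = lps[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    return lps


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based position at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    lps = compute_lps(pattern)
    matches = []
    j = 0                                   # number of pattern characters matched so far
    for i, ch in enumerate(text):           # i only moves forward
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]                  # reuse the prefix already known to match
        if ch == pattern[j]:
            j += 1
        if j == m:                          # full match ending at text[i]
            matches.append(i - m + 1)
            j = lps[j - 1]                  # keep the overlap for the next occurrence
    return matches


print(kmp_search("ABXABABXAB", "ABXAB"))   # [0, 5]
```

After the first match, j falls back to LPS[4] = 2: the two characters behind the current character, 'AB', are already known to equal pattern[0..1], so the next comparison is against pattern[2].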
References
• Introduction to Algorithms, Third Edition. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein
• https://www.ics.uci.edu/~eppstein/161/960227.html
• https://www.nayuki.io/
Thank you. Any questions?