
String Matching Algorithms

Sirwe Saeedi Spring 2019

Advanced Algorithms and Data Structure


Applications

• Bioinformatics

• DNA sequencing

• Web page search engines

Formalize String Matching Problem

• The text is an array of characters T[1..n]

• The pattern is an array of characters P[1..m]

• m ≤ n

• The characters of T and P are drawn from a finite alphabet

String Matching Problem

[Figure: a 15-character text T[1..15] and a 5-character pattern P[1..5] (letters H, E, L, L, O), with the pattern aligned under the text at successive positions]

• The pattern occurs with shift s if T[s+1..s+m] = P[1..m]

• Here T[11] = P[1], T[12] = P[2], T[13] = P[3], T[14] = P[4], T[15] = P[5]

• s = 10: first occurrence of the pattern

Naive String Matching Algorithm

• Check P against each substring of T for all possible shifts s = 0, 1, ..., n-m

• for s = 0 test T[1..5] = P[1..5]

• for s = 1 test T[2..6] = P[1..5]

• for s = 2 test T[3..7] = P[1..5]

• ...

• for s = 10 test T[11..15] = P[1..5]: match

https://labs.xjtudlc.com/labs/wldmt/reading%20list/books/Algorithms%20and%20optimization/Introduction%20to%20Algorithms.pdf
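The shift-by-shift test above can be sketched in a few lines of Python. This is a minimal illustration, not the pseudocode from the slide: the function name `naive_match` is hypothetical, shifts are 0-based, and the 15-character text in the usage note is an assumed stand-in for the one in the figure.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # all possible shifts 0..n-m
        if text[s:s + m] == pattern:    # test T[s+1..s+m] = P[1..m]
            shifts.append(s)
    return shifts
```

With the assumed text `"LOLLEOLOLEHELLO"` and the pattern `"HELLO"`, the only reported shift is 10, which agrees with s = 10 on the earlier slide.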

Naive String Matching Algorithm: Time Complexity

• Matching time in the worst case: O(m(n-m+1)), which is O(n^2) when m = ⌊n/2⌋

• Worst-case example: Text = a^n (n copies of a), Pattern = a^m (m copies of a)

• Every one of the n-m+1 shifts compares all m characters
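To make the worst case concrete, the sketch below counts character comparisons of the naive method on Text = a^n and Pattern = a^m. The helper name `count_comparisons` and the sizes n = 20, m = 10 are assumptions chosen only for illustration.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):              # every shift
        for j in range(m):                  # compare left to right
            comparisons += 1
            if text[s + j] != pattern[j]:   # stop this shift at the first mismatch
                break
    return comparisons

n, m = 20, 10
worst = count_comparisons("a" * n, "a" * m)
```

Here `worst` equals m(n-m+1), because no shift ever stops early.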

Rabin-Karp String Matching Algorithm

• The Rabin-Karp algorithm calculates a hash value for the pattern and for each M-character substring of the text to be compared.

• If the hash values are unequal, the algorithm moves on and calculates the hash value for the next M-character substring.

• If the hash values are equal, the algorithm compares the pattern and the M-character substring character by character.

• In this way, there is only one hash comparison per text substring, and character matching is only needed when the hash values match.

Some mathematics

• Consider an M-character substring as an M-digit number in base b, where b is the number of letters in the alphabet. The substring t[i..i+M-1] is mapped to the number:

x(i) = t[i]*b^(M-1) + t[i+1]*b^(M-2) + … + t[i+M-1]

• Furthermore, given x(i) we can compute x(i+1) for the next substring t[i+1..i+M] in constant time, as follows:

x(i+1) = t[i+1]*b^(M-1) + t[i+2]*b^(M-2) + … + t[i+M]

• x(i+1) = x(i)*b - t[i]*b^M + t[i+M], where:

x(i)*b: shift left one digit
- t[i]*b^M: subtract the leftmost digit
+ t[i+M]: add the new rightmost digit

• We adjust the existing value when we move over one character

• After the first substring, the M-digit number of each M-character substring is computed in constant time

• We hash the value by taking it mod a prime number q. The mod function is useful in this case because:

1. [(x mod q) + (y mod q)] mod q = (x+y) mod q
2. (x mod q) mod q = x mod q

• For these reasons:

hash(x(i)) = ((t[i]*b^(M-1) mod q) + (t[i+1]*b^(M-2) mod q) + … + (t[i+M-1] mod q)) mod q

• So:

hash(x(i+1)) = ((hash(x(i))*b mod q) - (t[i]*b^M mod q) + (t[i+M] mod q)) mod q
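The update rule can be checked numerically. The sketch below is only an illustration under stated assumptions: the letters a, b, c are mapped to the digits 0, 1, 2, the base is b = 3, the prime is q = 13, and the function names `window_hash` and `roll` are hypothetical.

```python
def window_hash(t: list[int], i: int, M: int, b: int, q: int) -> int:
    """hash(x(i)) computed directly from the M digits t[i..i+M-1]."""
    h = 0
    for k in range(M):
        h = (h * b + t[i + k]) % q      # Horner's rule, reduced mod q at every step
    return h

def roll(h: int, t: list[int], i: int, M: int, b: int, q: int) -> int:
    """Given h = hash(x(i)), return hash(x(i+1)) in constant time."""
    # shift left one digit, subtract the leftmost digit, add the new rightmost digit
    return (h * b - t[i] * pow(b, M, q) + t[i + M]) % q

# Assumed encoding: a -> 0, b -> 1, c -> 2
digits = [ord(c) - ord("a") for c in "aabbcaba"]
```

Rolling the hash from one window to the next gives the same value as recomputing it from scratch. In Python the `%` operator returns a non-negative result for a positive modulus, so the subtraction needs no extra correction; in languages where `%` can be negative, q is usually added before reducing.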

Rabin-Karp String Matching Algorithm

https://labs.xjtudlc.com/labs/wldmt/reading%20list/books/Algorithms%20and%20optimization/Introduction%20to%20Algorithms.pdf

Rabin-Karp Algorithm Example

Text = ‘aabbcaba’, Pattern = ‘cab’, hash(‘cab’) = 0

• hash(‘aab’) = 3: differs from hash(‘cab’), so no character comparison is needed

• hash(‘abb’) = 0: equal hash values, but the characters differ

• hash(‘bbc’) = 3: differs from hash(‘cab’), so no character comparison is needed

• hash(‘bca’) = 0: equal hash values, but the characters differ

• hash(‘cab’) = 0: equal hash values and the characters match, so the pattern is found

• hash(‘aba’) = 0: equal hash values, but the characters differ

• Collisions happen in hashing, but the algorithm handles them by comparing the characters whenever the hash values are equal
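The whole procedure can be sketched as below. This is a minimal illustration, not the pseudocode referenced on the slide: the name `rabin_karp`, the defaults b = 256 and q = 101, and the use of `ord` as the digit value of a character are all assumptions, so the hash values will not match the ones in the example above.

```python
def rabin_karp(text: str, pattern: str, b: int = 256, q: int = 101) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    top = pow(b, m, q)                  # b^M mod q, used to drop the leftmost digit
    p_hash = t_hash = 0
    for k in range(m):                  # O(m) preprocessing of pattern and first window
        p_hash = (p_hash * b + ord(pattern[k])) % q
        t_hash = (t_hash * b + ord(text[k])) % q
    shifts = []
    for s in range(n - m + 1):
        # compare characters only when the hash values agree (handles collisions)
        if p_hash == t_hash and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                   # roll the hash to the next window
            t_hash = (t_hash * b - ord(text[s]) * top + ord(text[s + m])) % q
    return shifts
```

For Text = ‘aabbcaba’ and Pattern = ‘cab’ the sketch reports the single shift 4. Choosing a very small q (for example q = 3) forces many collisions, yet the result stays the same because every hash match is verified character by character.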

Time Complexity

• Preprocessing time: O(m)

• Matching time in the worst case: O(m(n-m+1)), which is O(n^2) when m = ⌊n/2⌋

• Performs better in the average case

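KMP String Matching Algorithm

• Knuth-Morris-Pratt Algorithm

• Improves the worst-case matching time to O(n), after O(m) preprocessing of the pattern

• Uses the degenerating property of the pattern: sub-patterns that appear more than once in the pattern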

KMP Algorithm Example

[Figure: the pattern AAAA aligned under a text of A's and B's]

• Initial position

• Pattern shifted one position

• Need preprocessing of pattern

KMP Algorithm Preprocessing

• text = T[1..n]

• pattern = P[1..m]

• LPS[1..m]: an array with one entry per pattern character

• LPS[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i]

index     0 1 2 3 4
pattern[] A B X A B
LPS[]     0 0 0 1 2

• LPS[0] = 0

• LPS[1] = 0

• LPS[2] = 0

• LPS[3] = 1 (prefix ‘A’ = suffix ‘A’ of ‘ABXA’)

• LPS[4] = 2 (prefix ‘AB’ = suffix ‘AB’ of ‘ABXAB’)
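The LPS table above can be computed with the usual two-pointer loop. The sketch below is one common way to write it, using 0-based indices as in the table; the function name `compute_lps` is a hypothetical choice.

```python
def compute_lps(pattern: str) -> list[int]:
    """LPS[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of pattern[0..i]."""
    lps = [0] * len(pattern)
    length = 0                          # length of the prefix matched so far
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1                 # extend the current prefix/suffix match
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]    # fall back to a shorter candidate prefix
        else:
            lps[i] = 0                  # no prefix matches at this position
            i += 1
    return lps
```

For the pattern ‘ABXAB’ this reproduces the values 0 0 0 1 2 from the slides. The loop runs in O(m) time because `i` never decreases and every fallback shortens `length`.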

KMP Algorithm Searching the Pattern

• To search for the pattern in the main text, use the LPS array

• On a mismatch, the LPS value of the previous pattern position tells us which pattern character to compare next

• The idea is to avoid re-matching characters that we already know match

[Figure: the pattern ABXAB, with LPS = 0 0 0 1 2 at indices 0 to 4, slid along a 10-character text of A's, B's and X's over successive steps]

• Current character

• Substring behind the current character: pattern[0..1] = ‘AB’
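The search step can be sketched as follows. This is a minimal illustration with hypothetical function names; the LPS routine is repeated so the block runs on its own, and the 10-character text in the usage note is an assumed example, not necessarily the text shown in the figure.

```python
def compute_lps(pattern: str) -> list[int]:
    """Longest proper prefix that is also a suffix, for every pattern[0..i]."""
    lps = [0] * len(pattern)
    length, i = 0, 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]
        else:
            i += 1
    return lps

def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lps = compute_lps(pattern)
    shifts = []
    i = j = 0                           # i indexes the text, j the pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:                  # full match ending at text[i-1]
                shifts.append(i - m)
                j = lps[j - 1]          # keep the part that can still match
        elif j > 0:
            j = lps[j - 1]              # mismatch: reuse the known prefix, i stays put
        else:
            i += 1                      # mismatch at the first pattern character
    return shifts
```

With the assumed text ‘ABXABABXAB’ and the pattern ‘ABXAB’, the sketch reports the shifts 0 and 5. The text index `i` never moves backwards, which is what gives the O(n) matching time.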

References

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, Third Edition.

• https://www.ics.uci.edu/~eppstein/161/960227.html

• https://www.nayuki.io/

Thank you. Any questions?

