Contest Algorithms January 2016 Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp 13. String Searching 1 Contest Algorithms: 13. String Srch

Upload: tobias-russell

Post on 21-Jan-2016


Contest Algorithms, January 2016

13. String Searching

Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp

1. What is String Searching?

Definition: given a text string T and a search string (pattern) P, find P inside T.

T: "the rain in spain stays mainly on the plain"
P: "n th"

Applications: text editors, Web search engines (e.g. Google), image analysis
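As a quick illustration of the definition, Java's built-in `String.indexOf` performs this kind of search and returns the index where P first occurs in T, or -1 if it does not occur. The class name `FindDemo` and the helper `find` are only illustrative and are not part of the slides.

```java
class FindDemo {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);
    }

    public static void main(String[] args) {
        String t = "the rain in spain stays mainly on the plain";
        String p = "n th";
        // The only occurrence is inside "on the", starting at index 32.
        System.out.println(find(t, p));
    }
}
```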

String Concepts

Assume S is a string of size m.

A substring S[i .. j] of S is the string fragment between indexes i and j (inclusive).

A prefix of S is a substring S[0 .. i] (the "start of S").
A suffix of S is a substring S[i .. m-1] (the "end of S").

i is any index between 0 and m-1.

Examples

S: a n d r e w   (indexes 0 to 5)

Substring S[1..3] == "ndr"

All possible prefixes of S: "andrew", "andre", "andr", "and", "an", "a"

All possible suffixes of S: "andrew", "ndrew", "drew", "rew", "ew", "w"
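The prefix and suffix definitions can be checked with a short Java sketch. The class and method names are hypothetical; note that Java's `substring(begin, end)` excludes the end index, so S[1..3] in the slides' inclusive notation is `substring(1, 4)`.

```java
class PrefixSuffixDemo {
    // All non-empty prefixes S[0..i], longest first.
    static java.util.List<String> prefixes(String s) {
        java.util.List<String> out = new java.util.ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(s.substring(0, i + 1));   // end index is exclusive
        return out;
    }

    // All non-empty suffixes S[i..m-1], longest first.
    static java.util.List<String> suffixes(String s) {
        java.util.List<String> out = new java.util.ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(s.substring(i));
        return out;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(s.substring(1, 4));   // S[1..3]
        System.out.println(prefixes(s));
        System.out.println(suffixes(s));
    }
}
```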

2. The Brute Force Algorithm

Check each position in the text T to see if the pattern P starts in that position.

T: a n d r e w
P: r e w

T: a n d r e w
P:   r e w

. . . .

P moves 1 char at a time through T.

Code (see BruteSearch.java)

    public static int brute(String text, String pattern)
    {
      int n = text.length();
      int m = pattern.length();
      int j;
      for (int i = 0; i <= (n-m); i++) {   // try each start position in the text
        j = 0;
        while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)))
          j++;
        if (j == m)
          return i;   // match at i
      }
      return -1;   // no match
    }  // end of brute()

Properties of Brute-force Search

Easy to code.
No preprocessing needs to be done on the pattern.

Usually takes O(n+m) steps, which is not so bad (n = length of text; m = length of pattern).

The worst case is O(nm), e.g. when searching for "aaab" in "aaaaaaaaaaaaaaaaaaaaaaaab".
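To make the worst case concrete, here is a sketch that counts character comparisons for a pattern of three a's and a b against a text of 24 a's and a b. The counter and the helper `asThenB` are additions for illustration; the search loop follows the brute-force idea described in the slides.

```java
class BruteCount {
    static long comparisons;   // character comparisons made by the last call

    static int brute(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j)) break;
                j++;
            }
            if (j == m) return i;   // match at i
        }
        return -1;                  // no match
    }

    // Builds a string of `count` 'a' characters followed by one 'b'.
    static String asThenB(int count) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < count; k++) sb.append('a');
        return sb.append('b').toString();
    }

    public static void main(String[] args) {
        String text = asThenB(24);      // n = 25
        String pattern = asThenB(3);    // "aaab", m = 4
        int pos = brute(text, pattern);
        // Every one of the n-m+1 = 22 start positions costs m = 4 comparisons.
        System.out.println(pos + " " + comparisons);   // 21 88
    }
}
```

Each failed alignment only discovers the mismatch at the last pattern character, which is where the O(nm) behaviour comes from.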

3. The KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm shifts the pattern more intelligently than the brute force algorithm.

Its steps can be bigger than a move of just 1 character.


If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?

Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]

Example

(Figure: T and P aligned, with a mismatch between T[i] and P[j] at j = 5; after the shift, comparison resumes at jnew = 2.)

Why

j == 5

Find the largest prefix (start) of "a b a a b" ( P[0 .. j-1] ) which is a suffix (end) of "b a a b" ( P[1 .. j-1] ).

Answer: "a b"
Set j = 2   // the new j value

KMP Failure Function

KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

j = mismatch position in P[]
k = position before the mismatch (k = j-1)

The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

Failure Function Example

P: "a b a a b a"
j:  0 1 2 3 4 5

F(k) is the size of the largest prefix (k == j-1).

    k      0  1  2  3  4
    F(k)   0  0  1  1  2

In code, F() is represented by an array, like the table.

Why is F(4) == 2?

P: "abaaba"

F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that is also a suffix of "baab"
= find the size of "ab"
= 2
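The definition of F(k) can be evaluated directly, which is a handy way to check a table by hand. The sketch below is a deliberately naive version written for clarity, not the linear-time method KMP actually uses; the class name `NaiveFail` is hypothetical.

```java
class NaiveFail {
    // F(k) = size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int[] fail(String p) {
        int m = p.length();
        int[] f = new int[m];
        for (int k = 0; k < m; k++) {
            f[k] = 0;
            // A suffix of P[1..k] has at most k characters; try the longest first.
            for (int len = k; len >= 1; len--) {
                String prefix = p.substring(0, len);              // P[0..len-1]
                String suffix = p.substring(k - len + 1, k + 1);  // last len chars of P[0..k]
                if (prefix.equals(suffix)) {
                    f[k] = len;
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        // For "abaaba" this prints [0, 0, 1, 1, 2, 3]; F(4) is 2, as derived above.
        System.out.println(java.util.Arrays.toString(fail("abaaba")));
    }
}
```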

Using the Failure Function

Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.

If a mismatch occurs at P[j] (i.e. P[j] != T[i]), then

    k = j-1;
    j = F(k);   // obtain the new j

Code (see KmpSearch.java)

Return index where pattern starts, or -1.

    int kmpMatch(String text, String pattern)
    {
      int n = text.length();
      int m = pattern.length();

      int fail[] = computeFail(pattern);

      int i = 0;   // position in text
      int j = 0;   // position in pattern
      while (i < n) {
        if (pattern.charAt(j) == text.charAt(i)) {
          if (j == m - 1)
            return i - m + 1;   // match
          i++;
          j++;
        }
        else if (j > 0)
          j = fail[j-1];   // shift pattern using the failure function
        else
          i++;
      }
      return -1;   // no match
    }  // end of kmpMatch()

    int[] computeFail(String pattern)
    {
      int fail[] = new int[pattern.length()];
      fail[0] = 0;

      int m = pattern.length();
      int j = 0;
      int i = 1;
      while (i < m) {
        if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
          fail[i] = j + 1;
          i++;
          j++;
        }
        else if (j > 0)   // j follows matching prefix
          j = fail[j-1];
        else {   // no match
          fail[i] = 0;
          i++;
        }
      }
      return fail;
    }  // end of computeFail()

Similar code to kmpMatch().

Example

T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

(Figure: P is drawn at each of its successive alignments under T, with the character comparisons numbered 1 to 19.)

    k      0  1  2  3  4
    F(k)   0  0  1  0  1

Why is F(4) == 1?

P: "abacab"

F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that is also a suffix of "baca"
= find the size of "a"
= 1
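The numbered comparisons in the example can be reproduced by instrumenting the search with a counter. This is a sketch: the `comparisons` counter is an addition, and the text string used in `main` is an assumed reading of the example figure, whose layout did not survive cleanly in this transcript.

```java
class KmpTrace {
    static int comparisons;   // text-vs-pattern character comparisons in the last search

    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0, i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;   // j+1 chars match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                fail[i] = 0;
                i++;
            }
        }
        return fail;
    }

    static int kmpMatch(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        int[] fail = computeFail(pattern);
        comparisons = 0;
        int i = 0, j = 0;
        while (i < n) {
            comparisons++;   // exactly one comparison per loop iteration
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1) return i - m + 1;   // match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;
            }
        }
        return -1;   // no match
    }

    public static void main(String[] args) {
        String text = "abacaabaccabacabaabb";   // assumed reading of the figure
        int pos = kmpMatch(text, "abacab");
        System.out.println(pos + " " + comparisons);   // 10 19
    }
}
```

With this text the pattern is found at index 10 after 19 comparisons, which agrees with the comparison numbers 1 to 19 shown in the figure.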

Properties of KMP

Time to find a match is only O(n), with O(m) preprocessing time (n = length of text; m = length of the pattern).

Can be modified to search for multiple patterns in a single search.

KMP Disadvantage

KMP doesn't work so well as the size of the alphabet increases:

there is more chance of a mismatch (more possible mismatches)
mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later

KMP Extensions

The basic algorithm doesn't take into account the letter in the text that caused the mismatch.

(Figure: T and P at a mismatch, marked x, followed by a further shift of P that uses the mismatching text character. Basic KMP does not do this.)

5. The Rabin-Karp Algorithm

String search is based on a hash function applied to the pattern and to substrings of the text.

Look for a match by comparing the hash values, not the substrings.

Typical hash function

    long hash(String s)
    {
      long h = 0;
      for (int j = 0; j < s.length(); j++)
        h = (R * h + s.charAt(j)) % Q;   // % acts as mod
      return h;
    }

R == radix; often 10 for numeric data; 128 for ASCII, etc.

Q == a large prime number; e.g. 997

hash("26535") == 613 (with R = 10, Q = 997 and each character taken as its digit value, i.e. 26535 mod 997)
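The value 613 only comes out if each character contributes its digit value rather than its character code. A minimal sketch under that assumption, with R = 10 and Q = 997 fixed as constants (the class name `DigitHash` is hypothetical):

```java
class DigitHash {
    static final int R = 10;     // radix for decimal digit strings
    static final long Q = 997;   // small prime, as in the example

    // Horner's rule; assumes s contains only the characters '0'..'9'.
    static long hash(String s) {
        long h = 0;
        for (int j = 0; j < s.length(); j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;   // digit value, not char code
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("26535"));   // 613, i.e. 26535 mod 997
    }
}
```

Taking the remainder at every step keeps the intermediate value below R*Q, so the result is the same as 26535 mod 997 without ever forming the full number.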

Hash Function explained

T: t0 t1 t2 t3 ... tm-2 tm-1 tm tm+1 ...
P: p0 p1 p2 p3 ... pm-2 pm-1

The pattern has m chars; compute hash(P).

Examine m chars of the text at a time (= Xi); compute hash(Xi).

The hash function calculates:

$$\mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \cdots + t_{m-2} R + t_{m-1} \right) \bmod Q$$

Example

T = "31415926535" and P = "26"
R = 10; Q = 11
hash("ab") = (a*10 + b) mod 11

T: 3 1 4 1 5 9 2 6 5 3 5
P: 2 6

hash(P) == hash("26") == 26 mod 11 == 4

Iterate through the Text

T: 3 1 4 1 5 9 2 6 5 3 5

    31 mod 11 = 9    not equal to 4
    14 mod 11 = 3    not equal to 4
    41 mod 11 = 8    not equal to 4
    15 mod 11 = 4    equal to 4 -> wrong match
    59 mod 11 = 4    equal to 4 -> wrong match
    92 mod 11 = 4    equal to 4 -> wrong match
    26 mod 11 = 4    equal to 4 -> correct match
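The walk through the text can be reproduced with a short sketch that hashes every 2-digit window and reports the offsets whose hash equals hash("26"). The class and method names are hypothetical, and each window is hashed from scratch here (no rolling update) to keep the code minimal.

```java
class SpuriousHits {
    static final int R = 10;
    static final int Q = 11;   // deliberately tiny, so collisions are common

    // Hash of a digit string, using digit values.
    static int hash(String digits) {
        int h = 0;
        for (int j = 0; j < digits.length(); j++)
            h = (R * h + (digits.charAt(j) - '0')) % Q;
        return h;
    }

    // Offsets in text whose m-char window has the same hash as the pattern.
    static java.util.List<Integer> hashHits(String text, String pattern) {
        java.util.List<Integer> out = new java.util.ArrayList<>();
        int m = pattern.length();
        int target = hash(pattern);
        for (int i = 0; i + m <= text.length(); i++)
            if (hash(text.substring(i, i + m)) == target)
                out.add(i);
        return out;
    }

    public static void main(String[] args) {
        String t = "31415926535";
        System.out.println(hashHits(t, "26"));   // [3, 4, 5, 6]
        System.out.println(t.indexOf("26"));     // 6: the only true match
    }
}
```

Offsets 3, 4 and 5 ("15", "59", "92") are the wrong matches; only offset 6 survives a character-by-character check.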

Why Wrong Matches?

The hash() function uses modulo Q, so the range of results is 0 to Q-1.

If Q is small then it is likely that two different strings will hash to the same result; the probability is about 1/Q.

The solution is to make Q very big, which reduces the chance of a wrong match (e.g. Q = 2^32 - 1, about 4.3 billion).

Also double-check the match using string operations.

Gambling Names

This is an example of a Monte Carlo algorithm: it's fast but may output an incorrect answer with a small probability (1/Q).

The "double-checking" approach is known as a Las Vegas algorithm: it always gives the correct answer, but it can be slow.

Speeding up hash Calculation

After the hash() of the first substring of T, there is no need to keep calling hash() for the 2nd substring, 3rd substring, etc.

It is possible to calculate the next hash (e.g. hash(Xi+1)) based on the current hash value (hash(Xi)):

much faster (O(m) --> O(1) running time per substring)
less memory needed

Connection between hash()s

T: t0 t1 t2 t3 ... tm-2 tm-1 tm tm+1 ...

Xi covers t0 .. tm-1; Xi+1 covers t1 .. tm.

$$\mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \cdots + t_{m-2} R + t_{m-1} \right) \bmod Q$$

$$\mathrm{hash}(X_{i+1}) = \left( t_1 R^{m-1} + t_2 R^{m-2} + t_3 R^{m-3} + \cdots + t_{m-1} R + t_m \right) \bmod Q$$

Therefore:

$$
\begin{aligned}
\mathrm{hash}(X_{i+1}) &= \Big( \big( (\mathrm{hash}(X_i) - t_0 R^{m-1}) \bmod Q \big) \cdot R + t_m \bmod Q \Big) \bmod Q \\
&= \Big( \big( (\mathrm{hash}(X_i) + (t_0 Q - t_0 R^{m-1})) \bmod Q \big) \cdot R + t_m \bmod Q \Big) \bmod Q \\
&= \Big( \big( \mathrm{hash}(X_i) + t_0 \,( Q - (R^{m-1} \bmod Q) ) \big) \cdot R + t_m \Big) \bmod Q
\end{aligned}
$$

t_0 is the old front value.
t_m is the new end value.
The t_0 Q term is included so that the value being reduced mod Q is positive.
Q - (R^{m-1} mod Q) is a constant, which can be pre-calculated.
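A worked instance of the last line may help. It assumes the values R = 10, Q = 997 and m = 5 and the text "3141592653589793" used in the example later in these slides, with digits taken as their numeric values. The first window is 31415, whose hash is 31415 mod 997 = 508; the old front value is t_0 = 3 and the new end value is t_m = 9.

```latex
\begin{aligned}
R^{m-1} \bmod Q &= 10^{4} \bmod 997 = 30 \\
\mathrm{hash}(14159) &= \big( (508 + 3\,(997 - 30)) \cdot 10 + 9 \big) \bmod 997 \\
&= \big( (508 + 2901) \cdot 10 + 9 \big) \bmod 997 \\
&= 34099 \bmod 997 \\
&= 201
\end{aligned}
```

Computing the hash directly gives the same value: 14159 = 14 * 997 + 201.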


Using:

Modulo Properties
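The formulas for this slide are not present in the transcript. The derivation of the rolling hash relies on the following standard identities of modular arithmetic, which are presumably what the slide listed (an assumption):

```latex
\begin{aligned}
(a + b) \bmod Q &= \big( (a \bmod Q) + (b \bmod Q) \big) \bmod Q \\
(a \cdot b) \bmod Q &= \big( (a \bmod Q) \cdot (b \bmod Q) \big) \bmod Q \\
(a - b) \bmod Q &= \big( (a \bmod Q) - (b \bmod Q) + Q \big) \bmod Q
\end{aligned}
```

The added Q in the last identity keeps the intermediate value non-negative, which matters in Java because `%` can return a negative result for a negative left operand.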

Creating the Hash

We move through the text left-to-right, one character at a time, building up the hash for an m-character substring from preceding hash values.

Hash of the Pattern

P: "26535"
R = 10, Q = 997

(Figure: step-by-step calculation of the hash value for the pattern.)

Hashing the Text Substrings

T: "3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3"
M = 5, R = 10, Q = 997

In the code, RM = R^(M-1) mod Q.

(Figure: the hash values for the M-char substrings of T.)
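The per-substring hash values can be generated with the rolling update. This is a sketch under the example's parameters (R = 10, Q = 997, digits as numeric values); the class name `RollingHashDemo` and the method `windowHashes` are hypothetical, and the text is assumed to be at least m characters long.

```java
class RollingHashDemo {
    static final int R = 10;
    static final long Q = 997;

    // Hash of the first m digits of s, by Horner's rule.
    static long hash(String s, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;
        return h;
    }

    // Hash of every m-digit window of text; each step after the first is O(1).
    static long[] windowHashes(String text, int m) {
        int n = text.length();
        long rm = 1;                                  // R^(m-1) mod Q
        for (int k = 1; k <= m - 1; k++)
            rm = (R * rm) % Q;

        long[] out = new long[n - m + 1];
        long h = hash(text, m);
        out[0] = h;
        for (int i = m; i < n; i++) {
            int lead = text.charAt(i - m) - '0';      // old front value
            int trail = text.charAt(i) - '0';         // new end value
            h = (h + Q - rm * lead % Q) % Q;          // remove leading digit
            h = (h * R + trail) % Q;                  // append trailing digit
            out[i - m + 1] = h;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] hs = windowHashes("3141592653589793", 5);
        // Begins 508, 201, 715, 971, 442, 929, 613, ...
        System.out.println(java.util.Arrays.toString(hs));
        // hash("26535") is 613, which first appears at offset 6.
    }
}
```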

Code (see RabinKarp.java)

    public static void main(String[] args)
    {
      if (args.length != 2) {
        System.out.println("Usage: java RabinKarp <text> <pattern>");
        return;
      }

      RabinKarp searcher = new RabinKarp(args[1]);
      int pos = searcher.search(args[0]);
      showPos(args[0], args[1], pos);
    }  // end of main()

    import java.math.BigInteger;
    import java.util.Random;

    public class RabinKarp
    {
      private static final int R = 256;   // radix

      private String pat;     // the pattern; needs to be global for LV checking
      private long patHash;   // pattern hash value

      private int M;          // pattern length
      private long Q;         // a large prime, small enough to avoid long overflow
      private long RM;        // == R^(M-1) % Q

      public RabinKarp(String pat)
      {
        this.pat = pat;   // save pattern (needed only for Las Vegas)
        M = pat.length();
        Q = longRandomPrime();

        // precompute R^(M-1) % Q for use in removing leading digit
        RM = 1;
        for (int i = 1; i <= M - 1; i++)
          RM = (R * RM) % Q;
        patHash = hash(pat, M);
      }  // end of RabinKarp()
      :

      private static long longRandomPrime()
      // a random 31-bit probable prime
      {
        BigInteger prime = new BigInteger(31, 20, new Random());
        return prime.longValue();
      }

      private long hash(String key, int M)
      // Compute hash for key[0..M-1].
      {
        long h = 0;
        for (int j = 0; j < M; j++)
          h = (R * h + key.charAt(j)) % Q;
        return h;
      }  // end of hash()

      public int search(String txt)
      {
        int N = txt.length();
        if (N < M)
          return -1;
        long txtHash = hash(txt, M);

        // hash match found at offset 0, so double-check
        if ((patHash == txtHash) && check(txt, 0))
          return 0;

        // iterate through the text
        for (int i = M; i < N; i++) {
          // Calculate new hash by removing leading digit, add trailing digit
          txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
          txtHash = (txtHash * R + txt.charAt(i)) % Q;

          // found a hash match, so double-check
          int offset = i - M + 1;
          if ((patHash == txtHash) && check(txt, offset))
            return offset;
        }
        return -1;   // no match found
      }  // end of search()

      private boolean check(String txt, int i)
      // Las Vegas version: does pat[] match txt[i..i+M-1] ?
      {
        for (int j = 0; j < M; j++)
          if (pat.charAt(j) != txt.charAt(i + j))
            return false;
        return true;
      }  // end of check()

Properties of Rabin-Karp

Has a poor worst-case running time (O(nm)), and so KMP is probably better for string searching.

Rabin-Karp's hashing technique allows the search algorithm to be used on things other than text, e.g. image, audio and video search.

Rabin-Karp can be easily modified to do fast multiple pattern search: check whether the hash of a string in the text belongs to a set of hash values of patterns.
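The multiple pattern idea can be sketched as follows. This is an illustration rather than code from the slides: it assumes all patterns are non-empty and have the same length, uses a fixed prime Q = 1000000007 instead of a random one, and double-checks hash hits against a set of the actual patterns (the Las Vegas approach). The class name `MultiPatternRK` is hypothetical.

```java
class MultiPatternRK {
    static final int R = 256;             // radix
    static final long Q = 1000000007L;    // fixed large prime (assumption)

    // Hash of s[0..m-1].
    static long hash(String s, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + s.charAt(j)) % Q;
        return h;
    }

    // Offset of the first occurrence of any pattern in txt, or -1.
    // Assumes every pattern has the same length m >= 1.
    static int search(String txt, String[] patterns) {
        int m = patterns[0].length();
        int n = txt.length();
        if (n < m) return -1;

        java.util.Set<Long> hashes = new java.util.HashSet<>();
        java.util.Set<String> exact = new java.util.HashSet<>();
        for (String p : patterns) {
            hashes.add(hash(p, m));
            exact.add(p);
        }

        long rm = 1;                      // R^(m-1) mod Q
        for (int k = 1; k <= m - 1; k++)
            rm = (R * rm) % Q;

        long h = hash(txt, m);
        for (int i = 0; ; i++) {
            // One set lookup per window, however many patterns there are.
            if (hashes.contains(h) && exact.contains(txt.substring(i, i + m)))
                return i;
            if (i + m >= n) return -1;    // no more windows
            h = (h + Q - rm * txt.charAt(i) % Q) % Q;   // remove leading char
            h = (h * R + txt.charAt(i + m)) % Q;        // append trailing char
        }
    }

    public static void main(String[] args) {
        String[] pats = { "spa", "xyz", "ain" };
        System.out.println(search("the rain in spain", pats));   // 5 ("ain" in "rain")
    }
}
```

The cost per text position stays O(1) on average because the window hash is tested for set membership instead of being compared against each pattern hash in turn.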

6. Summary

n = text length; m = pattern length.

    Algorithm            Preprocessing time     Matching time (average, worst)
    Brute force          0 (no preprocessing)   O(n+m), O(nm)
    Knuth-Morris-Pratt   O(m)                   O(n)
    Rabin-Karp           O(m)                   O(n+m), O(nm)

35 algorithms with C code at http://www-igm.univ-mlv.fr/~lecroq/string/