
Contest Algorithms, January 2016

Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp

13. String Searching


1. What is String Searching?

Definition: given a text string T and a search string (pattern) P, find P inside T.

T: "the rain in spain stays mainly on the plain"

P: "n th"

Applications: text editors, Web search engines (e.g. Google), image analysis

String Concepts

Assume S is a string of size m.

A substring S[i .. j] of S is the string fragment between indexes i and j.

A prefix of S is a substring S[0 .. i].

A suffix of S is a substring S[i .. m-1].

Here i is any index between 0 and m-1.

Examples

S: "andrew" (the characters a n d r e w at indexes 0 to 5)

Substring S[1..3] == "ndr"

All possible prefixes of S (each begins at the start of S): "andrew", "andre", "andr", "and", "an", "a"

All possible suffixes of S (each finishes at the end of S): "andrew", "ndrew", "drew", "rew", "ew", "w"
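The prefix and suffix definitions can be checked with a few lines of Java. This is only a minimal sketch: the class name StringConcepts and its two helper methods are made up for illustration and are not part of the course code.

```java
import java.util.ArrayList;
import java.util.List;

class StringConcepts {
    // All prefixes S[0..i], longest first (hypothetical helper)
    static List<String> prefixes(String s) {
        List<String> result = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            result.add(s.substring(0, i + 1));   // substring()'s end index is exclusive
        return result;
    }

    // All suffixes S[i..m-1], longest first (hypothetical helper)
    static List<String> suffixes(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            result.add(s.substring(i));
        return result;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(s.substring(1, 4));   // S[1..3], prints ndr
        System.out.println(prefixes(s));
        System.out.println(suffixes(s));
    }
}
```

Note that Java's substring(begin, end) excludes the end index, so S[i .. j] in the slide notation corresponds to s.substring(i, j + 1).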

2. The Brute Force Algorithm

Check each position in the text T to see if the pattern P starts in that position.

T: a n d r e w
P: r e w

T: a n d r e w
P:   r e w

. . . .

P moves 1 char at a time through T.


Code (see BruteSearch.java)

    public static int brute(String text, String pattern)
    {
      int n = text.length();
      int m = pattern.length();
      int j;
      for (int i = 0; i <= (n-m); i++) {
        j = 0;
        while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)))
          j++;
        if (j == m)
          return i;   // match at i
      }
      return -1;   // no match
    }  // end of brute()

Properties of Brute-force Search

Easy to code.

No preprocessing needs to be done on the pattern.

Usually takes O(n+m) steps, which is not so bad (n = length of text; m = length of pattern).

The worst case is O(nm), e.g. when searching for "aaab" in "aaaaaaaaaaaaaaaaaaaaaaaab".
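To make the worst case concrete, the sketch below counts character comparisons during a brute force search. It is an illustration only: the class BruteCount and the method bruteCount() are hypothetical names, and the loop is a rearranged copy of the brute force search with a counter added.

```java
class BruteCount {
    // Returns {match index or -1, number of character comparisons made}
    static int[] bruteCount(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;                      // one text/pattern character comparison
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;
                j++;
            }
            if (j == m)
                return new int[] { i, comparisons };
        }
        return new int[] { -1, comparisons };
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < 24; k++)
            sb.append('a');
        sb.append('b');                             // 24 a's followed by one b

        int[] result = bruteCount(sb.toString(), "aaab");
        System.out.println("match at " + result[0] + " after " + result[1] + " comparisons");
    }
}
```

With a text of 24 a's followed by b and the pattern "aaab", every one of the 22 start positions costs 4 comparisons before the mismatch (or match) is found, so the search makes 88 comparisons, which is roughly n*m rather than n+m.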

3. The KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm shifts the pattern more intelligently than the brute force algorithm.

Its steps can be bigger than a move of just 1 character.

If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?

Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]

Example

(Figure: a mismatch occurs at j == 5, and the pattern is shifted so that matching resumes at jnew = 2.)

Why

Find the largest prefix (start) of: "a b a a b" ( P[0..j-1] )

which is a suffix (end) of: "b a a b" ( P[1..j-1] )

Answer: "a b"

Set j = 2 // the new j value
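The "largest prefix that is also a suffix" rule can be tried out directly from its definition. The sketch below is a slow, definition-based check, not the KMP preprocessing itself; the class name LargestBorder and the method largestBorder() are hypothetical.

```java
class LargestBorder {
    // Size of the largest prefix of s that is also a suffix of s[1..].
    // Tries the longest candidate first, so it is quadratic at best.
    static int largestBorder(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            String suffix = s.substring(s.length() - len);   // last len chars, never the whole of s
            if (s.startsWith(suffix))
                return len;
        }
        return 0;
    }

    public static void main(String[] args) {
        // mismatch at j == 5 in "abaaba", so examine P[0..4] == "abaab"
        System.out.println(largestBorder("abaab"));   // prints 2, the new j value
    }
}
```

Because len never reaches s.length(), the suffix is always taken from s[1..], which matches the P[1 .. j-1] restriction in the rule.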

KMP Failure Function

KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

j = mismatch position in P[]

k = position before the mismatch (k = j-1)

The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

Failure Function Example

P: "abaaba"

j:     0  1  2  3  4  5
P[j]:  a  b  a  a  b  a
F(j):  0  0  1  1  2  3

F(k) is the size of the largest prefix, and it is looked up with k == j-1 after a mismatch at P[j].

In code, F() is represented by an array, like the table.

Why is F(4) == 2?

P: "abaaba"

F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]

= find the size of the largest prefix of "abaab" that is also a suffix of "baab"

= find the size of "ab"

= 2

Using the Failure Function

Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.

If a mismatch occurs at P[j] (i.e. P[j] != T[i]), then:

    k = j-1;
    j = F(k);   // obtain the new j

Code (see KmpSearch.java)

Returns the index where the pattern starts, or -1.

    int kmpMatch(String text, String pattern)
    {
      int n = text.length();
      int m = pattern.length();

      int fail[] = computeFail(pattern);

      int i = 0;
      int j = 0;

      while (i < n) {
        if (pattern.charAt(j) == text.charAt(i)) {
          if (j == m - 1)
            return i - m + 1;   // match
          i++;
          j++;
        }
        else if (j > 0)
          j = fail[j-1];
        else
          i++;
      }
      return -1;   // no match
    }  // end of kmpMatch()

    int[] computeFail(String pattern)
    {
      int fail[] = new int[pattern.length()];
      fail[0] = 0;

      int m = pattern.length();
      int j = 0;
      int i = 1;

      while (i < m) {
        if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
          fail[i] = j + 1;
          i++;
          j++;
        }
        else if (j > 0)   // j follows matching prefix
          j = fail[j-1];
        else {   // no match
          fail[i] = 0;
          i++;
        }
      }
      return fail;
    }  // end of computeFail()

The code is similar to kmpMatch().

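Example

T: a b a c a a b a c a b a c a b a a b b

P: a b a c a b

(Figure: P is shifted along T; the numbered character comparisons from the original figure did not survive extraction.)

k:     0  1  2  3  4  5
P[k]:  a  b  a  c  a  b
F(k):  0  0  1  0  1  2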

Why is F(4) == 1?

P: "abacab"

F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]

= find the size of the largest prefix of "abaca" that is also a suffix of "baca"

= find the size of "a"

= 1
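The failure tables and the search can be exercised together. The sketch below copies kmpMatch() and computeFail() from the slides into a small test class; the class name KmpDemo and the choice of test strings are assumptions made for illustration, and a non-empty pattern is assumed.

```java
import java.util.Arrays;

class KmpDemo {
    // Failure function: fail[k] = size of the largest prefix of P[0..k]
    // that is also a suffix of P[1..k]. Assumes a non-empty pattern.
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0;
        int i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];                            // fall back to a shorter prefix
            } else {
                fail[i] = 0;                                // no match
                i++;
            }
        }
        return fail;
    }

    // Index where pattern starts in text, or -1. Assumes a non-empty pattern.
    static int kmpMatch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] fail = computeFail(pattern);
        int i = 0;
        int j = 0;
        while (i < n) {
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1)
                    return i - m + 1;                       // match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;
            }
        }
        return -1;                                          // no match
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeFail("abaaba")));   // [0, 0, 1, 1, 2, 3]
        System.out.println(Arrays.toString(computeFail("abacab")));   // [0, 0, 1, 0, 1, 2]
        System.out.println(kmpMatch("the rain in spain stays mainly on the plain", "n th"));   // 32
    }
}
```

The two printed tables agree with F(4) == 2 for "abaaba" and F(4) == 1 for "abacab", and the search finds "n th" at the end of "on" (index 32).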

Properties of KMP

Time to find a match is only O(n), with O(m) preprocessing time (n = length of text; m = length of the pattern).

Can be modified to search for multiple patterns in a single search.

KMP Disadvantage

KMP doesn't work so well as the size of the alphabet increases:

there is more chance of a mismatch (more possible mismatches)

mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later

KMP Extensions

The basic algorithm doesn't take into account the letter in the text that caused the mismatch.

(Figure: an example shift that takes the mismatching text letter into account. Basic KMP does not do this.)

5. The Rabin-Karp Algorithm

String search is based on a hash function applied to the pattern and to substrings of the text.

Look for a match by comparing the hash values, not the substrings.

Typical hash function

    long hash(String s)
    {
      long h = 0;
      for (int j = 0; j < s.length(); j++)
        h = (R * h + s.charAt(j)) % Q;   // % acts as mod
      return h;
    }

R == radix; often 10 for numeric data; 128 for ASCII, etc.

Q == a large prime number; e.g. 997

hash("26535") == 613 (with R = 10 and Q = 997, treating each character as its digit value)
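The value 613 can be reproduced with a small variation of the hash function. This is a sketch under one stated assumption: each character is converted to its digit value (charAt(j) - '0'), since using raw character codes with R = 10 would give a different number. The class name DigitHash and the extra parameters are hypothetical.

```java
class DigitHash {
    // Horner-style hash of a string of decimal digits.
    // Assumption: s contains only '0'..'9', and each char is used as its digit value.
    static long hash(String s, int R, long Q) {
        long h = 0;
        for (int j = 0; j < s.length(); j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("26535", 10, 997));   // prints 613, i.e. 26535 mod 997
    }
}
```

Taking the remainder at every step keeps h below Q, so the intermediate values stay small even for long strings.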

Hash Function explained

T: t0 t1 t2 t3 ... tm-2 tm-1 tm tm+1 ...

P: p0 p1 p2 p3 ... pm-2 pm-1

The pattern has m chars, and its hash is hash(P).

Examine m chars of the text at a time (a substring Xi), and calculate hash(Xi).

The hash function calculates:

$$\mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \cdots + t_{m-2} R + t_{m-1} \right) \bmod Q$$

Example

T = "31415926535" and P = "26"

R = 10; Q = 11

hash("ab") = (a*10 + b) mod 11

T: 3 1 4 1 5 9 2 6 5 3 5

P: 2 6

hash(P) == hash("26") == 26 mod 11 = 4

Iterate through the Text

"14": 14 mod 11 = 3, not equal to 4

"15": 15 mod 11 = 4, equal to 4 -> wrong match (the hash values agree, but "15" is not "26")

"26": 26 mod 11 = 4, equal to 4 -> correct match
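The iteration can be replayed in code to see every window whose hash equals hash(P). This is a rough sketch: the class WindowScan and the method hashHits() are hypothetical, digits are hashed by their digit value, and each window is hashed from scratch (no rolling update yet).

```java
import java.util.ArrayList;
import java.util.List;

class WindowScan {
    // Hash of a digit string, using each char as its digit value (assumption)
    static long hash(String s, int R, long Q) {
        long h = 0;
        for (int j = 0; j < s.length(); j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;
        return h;
    }

    // Offsets in t where the window's hash equals the pattern's hash
    static List<Integer> hashHits(String t, String p, int R, long Q) {
        List<Integer> hits = new ArrayList<>();
        long patHash = hash(p, R, Q);
        for (int i = 0; i + p.length() <= t.length(); i++) {
            String window = t.substring(i, i + p.length());
            if (hash(window, R, Q) == patHash)
                hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        String t = "31415926535";
        String p = "26";
        for (int i : hashHits(t, p, 10, 11)) {
            String window = t.substring(i, i + p.length());
            System.out.println(i + ": " + window
                    + (window.equals(p) ? " -> correct match" : " -> wrong match"));
        }
    }
}
```

With Q = 11 the hash hits are at offsets 3 ("15"), 4 ("59"), 5 ("92") and 6 ("26"); only the last one is a real match, which shows how often a small Q produces wrong matches.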

Why Wrong Matches?

The hash() function uses modulo Q, so the range of results is 0 to Q-1.

If Q is small then it is likely that two different strings will hash to the same result; the probability is 1/Q.

The solution is to make Q very big, which reduces the chance of a wrong match (e.g. Q = 2^32 - 1, about 4.3 billion).

Also double-check the match using string operations.

Gambling Names

Accepting a hash match without checking it is an example of a Monte Carlo algorithm: it's fast but may output an incorrect answer with a small probability (1/Q).

The "double-checking" approach is known as a Las Vegas algorithm: the answer is always correct, but it can be slow.

Speeding up hash Calculation

After the hash() of the first substring of T, there is no need to keep calling hash() for the 2nd substring, 3rd substring, etc.

It is possible to calculate the next hash, hash(Xi+1), based on the current hash value, hash(Xi):

much faster (O(m) --> O(1) running time)

less memory needed

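Connection between hash()s

$$\mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \cdots + t_{m-2} R + t_{m-1} \right) \bmod Q$$

$$\mathrm{hash}(X_{i+1}) = \left( t_1 R^{m-1} + t_2 R^{m-2} + t_3 R^{m-3} + \cdots + t_{m-1} R + t_m \right) \bmod Q$$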

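T: t0 t1 t2 t3 ... tm-2 tm-1 tm tm+1 ...

Therefore:

$$
\begin{aligned}
\mathrm{hash}(X_{i+1}) &= \left( \left( \left( \mathrm{hash}(X_i) - t_0 R^{m-1} \right) \bmod Q \right) R + t_m \bmod Q \right) \bmod Q \\
&= \left( \left( \left( \mathrm{hash}(X_i) + t_0 Q - t_0 R^{m-1} \right) \bmod Q \right) R + t_m \bmod Q \right) \bmod Q \\
&= \left( \left( \mathrm{hash}(X_i) + t_0 \left( Q - (R^{m-1} \bmod Q) \right) \right) R + t_m \right) \bmod Q
\end{aligned}
$$

t0 is the old front value.

tm is the new end value.

t0*Q is included so that the mod value is positive.

R^(m-1) mod Q is a constant, which can be pre-calculated.

Modulo Properties

The derivation uses:

(a + b) mod Q = ((a mod Q) + (b mod Q)) mod Q

(a * b) mod Q = ((a mod Q) * (b mod Q)) mod Q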

Creating the Hash

We move through the text left-to-right, one character at a time, building up the hash for an m-character substring from preceding hash values.

Hash of the Pattern

P: "26535"

R = 10, Q = 997

(Figure: the hash is built up one digit at a time; the final value is the hash value for the pattern.)

Hashing the Text Substrings

T: "3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3"

M = 5, R = 10, Q = 997

In the code, RM = R^(M-1) mod Q.

(Figure: the hash values for the M-char substrings.)
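The hash values for the M-char substrings can be regenerated with a rolling update. This is a minimal sketch under stated assumptions: the class RollingHashDemo and its method names are hypothetical, characters are used as digit values, and the text is assumed to be at least M characters long.

```java
class RollingHashDemo {
    static final int R = 10;       // radix for decimal digits
    static final long Q = 997;     // small prime, as in the slide example

    // Direct hash of the M digits of s starting at offset i
    static long hash(String s, int i, int M) {
        long h = 0;
        for (int j = i; j < i + M; j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;
        return h;
    }

    // Hash of every M-digit window; each one is derived from the previous in O(1)
    static long[] rollingHashes(String txt, int M) {
        int N = txt.length();
        long RM = 1;                                   // R^(M-1) mod Q
        for (int k = 1; k <= M - 1; k++)
            RM = (R * RM) % Q;

        long[] hashes = new long[N - M + 1];
        hashes[0] = hash(txt, 0, M);
        for (int i = 1; i <= N - M; i++) {
            long front = txt.charAt(i - 1) - '0';      // old front value
            long end = txt.charAt(i + M - 1) - '0';    // new end value
            long h = (hashes[i - 1] + front * (Q - RM)) % Q;   // remove front digit, staying positive
            hashes[i] = (h * R + end) % Q;             // shift left and add the new end digit
        }
        return hashes;
    }

    public static void main(String[] args) {
        String txt = "3141592653589793";
        long[] hashes = rollingHashes(txt, 5);
        for (int i = 0; i < hashes.length; i++)
            System.out.println(i + ": " + txt.substring(i, i + 5) + " -> " + hashes[i]);
    }
}
```

The window at offset 6 is "26535", so its rolling hash is 613, the same as the hash of the pattern.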

Code (see RabinKarp.java)

    import java.math.BigInteger;
    import java.util.Random;

    public class RabinKarp
    {
      private static final int R = 256;   // radix

      private String pat;      // the pattern; needs to be global for LV checking
      private long patHash;    // pattern hash value
      private int M;           // pattern length
      private long Q;          // a large prime, small enough to avoid long overflow
      private long RM;         // == R^(M-1) % Q

      public RabinKarp(String pat)
      {
        this.pat = pat;   // save pattern (needed only for Las Vegas)
        M = pat.length();
        Q = longRandomPrime();

        // precompute R^(M-1) % Q for use in removing leading digit
        RM = 1;
        for (int i = 1; i <= M - 1; i++)
          RM = (R * RM) % Q;
        patHash = hash(pat, M);
      }  // end of RabinKarp()

      private static long longRandomPrime()
      // a random 31-bit probable prime
      {
        BigInteger prime = new BigInteger(31, 20, new Random());
        return prime.longValue();
      }

      private long hash(String key, int M)
      // Compute hash for key[0..M-1].
      {
        long h = 0;
        for (int j = 0; j < M; j++)
          h = (R * h + key.charAt(j)) % Q;
        return h;
      }  // end of hash()

      public int search(String txt)
      {
        int N = txt.length();
        if (N < M)
          return -1;
        long txtHash = hash(txt, M);

        // hash match found at offset 0, so double-check
        if ((patHash == txtHash) && check(txt, 0))
          return 0;

        // iterate through the text
        for (int i = M; i < N; i++) {
          // Calculate new hash by removing leading digit, add trailing digit
          txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
          txtHash = (txtHash * R + txt.charAt(i)) % Q;

          // found a hash match, so double-check
          int offset = i - M + 1;
          if ((patHash == txtHash) && check(txt, offset))
            return offset;
        }
        return -1;   // no match found
      }  // end of search()

      private boolean check(String txt, int i)
      // Las Vegas version: does pat[] match txt[i..i+M-1] ?
      {
        for (int j = 0; j < M; j++)
          if (pat.charAt(j) != txt.charAt(i + j))
            return false;
        return true;
      }  // end of check()

      public static void main(String[] args)
      {
        if (args.length != 2) {
          System.out.println("Usage: java RabinKarp <text> <pattern>");
          return;
        }

        RabinKarp searcher = new RabinKarp(args[1]);
        int pos = searcher.search(args[0]);
        showPos(args[0], args[1], pos);   // showPos() is not shown on the slides
      }  // end of main()

    }  // end of RabinKarp class

Properties of Rabin-Karp

Has a poor worst-case running time (O(nm)), and so KMP is probably better for string searching.

Rabin-Karp's hashing technique allows the search algorithm to be used on things other than text, e.g. image, audio, video search.

Rabin-Karp can be easily modified to do fast multiple pattern search: check whether the hash of a string in the text belongs to a set of hash values of patterns.
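One possible way to do the multiple pattern search is sketched below. It is not taken from RabinKarp.java: the class MultiPatternRK is hypothetical, it assumes a non-empty list of patterns that all have the same length M, and it uses a fixed prime Q (1,000,000,007) instead of a random one to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiPatternRK {
    static final int R = 256;                  // radix
    static final long Q = 1_000_000_007L;      // fixed large prime (assumption)

    // Hash of key[0..M-1]
    static long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    // Offset of the first occurrence of any pattern in txt, or -1.
    // Assumes patterns is non-empty and all patterns have the same length.
    static int search(String txt, List<String> patterns) {
        int M = patterns.get(0).length();
        int N = txt.length();
        if (N < M)
            return -1;

        // group the patterns by hash value
        Map<Long, List<String>> byHash = new HashMap<>();
        for (String p : patterns)
            byHash.computeIfAbsent(hash(p, M), k -> new ArrayList<>()).add(p);

        long RM = 1;                           // R^(M-1) % Q
        for (int i = 1; i <= M - 1; i++)
            RM = (R * RM) % Q;

        long h = hash(txt, M);
        for (int offset = 0; ; offset++) {
            List<String> candidates = byHash.get(h);
            if (candidates != null)
                for (String p : candidates)
                    if (txt.startsWith(p, offset))     // Las Vegas double-check
                        return offset;
            if (offset + M >= N)
                return -1;                             // no more windows
            // roll the hash: remove the leading char, add the trailing char
            h = (h + Q - RM * txt.charAt(offset) % Q) % Q;
            h = (h * R + txt.charAt(offset + M)) % Q;
        }
    }

    public static void main(String[] args) {
        String txt = "the rain in spain stays mainly on the plain";
        List<String> patterns = new ArrayList<>();
        patterns.add("main");
        patterns.add("spai");
        patterns.add("plai");
        System.out.println(search(txt, patterns));     // prints 12, where "spai" starts
    }
}
```

Each text window costs one hash-table lookup regardless of how many patterns there are, which is where the speed-up over running a separate search per pattern comes from.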

6. Summary

n = text len; m = pat len.

Algorithm            Preprocessing time      Matching time (average, worst)
Brute force          0 (no preprocessing)    O(n+m), O(nm)
Knuth-Morris-Pratt   O(m)                    O(n)
Rabin-Karp           O(m)                    O(n+m), O(nm)

35 algorithms with C code at http://www-igm.univ-mlv.fr/~lecroq/string/
