Unit-5 String Matching


Upload: vamika-chandra

Post on 07-Apr-2015


Page 1: Unit-5 String Matching

@2008-09 Shankar Thawkar , Sr. Lect. IT dept.


String Matching

We formalize the string-matching problem as follows. We are given a text array T[1 . . n] of n characters and a pattern array P[1 . . m] of m characters, with m ≤ n. The problem is to find an integer s, called a valid shift, where 0 ≤ s ≤ n-m and T[s+1 . . s+m] = P[1 . . m]. In other words, we want to find whether P occurs in T, i.e., whether P is a substring of T.

1) Naïve String Matching

The naïve approach simply tests all the possible placements of the pattern P[1 . . m] relative to the text T[1 . . n]. Specifically, we try the shifts s = 0, 1, . . . , n-m successively and, for each shift s, compare T[s+1 . . s+m] to P[1 . . m].

NAÏVE_STRING_MATCHER (T, P)

1. n ← length [T]
2. m ← length [P]
3. for s ← 0 to n-m do
4.     if P[1 . . m] = T[s+1 . . s+m]
5.         then return valid shift s

Complexity: Worst case = O((n-m+1)m); if m = n/2, this is O(n^2).
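The pseudocode above can be sketched in Python as follows. This is only an illustration under a few assumptions: the function name naive_string_matcher is ours, Python strings are 0-indexed (so a returned shift s means the slice text[s:s+m] equals the pattern), and the sketch collects every valid shift instead of returning at the first one.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each placement s = 0, 1, ..., n - m of the pattern against the text.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_string_matcher("000010001010001", "0001"))
```

The outer loop runs n-m+1 times and each slice comparison costs at most m character comparisons, which is where the O((n-m+1)m) bound comes from.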

Q. Write an algorithm for the naïve string matcher. What is its worst-case complexity? Show the comparisons the naïve string matcher makes for the pattern P = 0001 in the text T = 000010001010001.

Page 2: Unit-5 String Matching


2) Rabin-Karp Algorithm

Key idea:

Treat the pattern P[1..m] as a key and transform (hash) it into an equivalent integer p.

Similarly, transform substrings of the text string T[] into integers: for s = 0, 1, …, n-m, transform T[s+1..s+m] into an equivalent integer t_s.

The pattern occurs at position s if and only if p = t_s.

If we can compute p and t_s quickly, then the pattern-matching problem is reduced to comparing p with n-m+1 integers.

How to compute p?

p = 2^(m-1) P[0] + 2^(m-2) P[1] + … + 2 P[m-2] + P[m-1]

(This formula indexes the pattern from 0 and uses radix 2, i.e. a binary alphabet.)

Using Horner's rule, this takes O(m) time, assuming each arithmetic operation can be done in O(1) time.
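As a rough sketch of how Horner's rule evaluates such a polynomial with one multiplication and one addition per character, consider the Python function below. The name horner_hash and its parameters radix and q are our own choices for illustration; the optional modulus q anticipates the MOD operation used by Rabin-Karp.

```python
def horner_hash(digits, radix=2, q=None):
    """Evaluate digits[0]*radix^(m-1) + ... + digits[m-1] by Horner's rule.

    digits is a sequence of integers in the range 0 .. radix-1.
    If q is given, the value is reduced modulo q after every step.
    """
    value = 0
    for d in digits:
        # One multiplication and one addition per digit: O(m) steps in total.
        value = radix * value + d
        if q is not None:
            value %= q
    return value


if __name__ == "__main__":
    print(horner_hash([1, 0, 1, 1]))            # binary pattern 1011
    print(horner_hash([2, 6], radix=10, q=11))  # decimal pattern 26 modulo 11
```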

How it works

Hash the pattern P into a numeric value. Let a string be represented by the sum of the digits assigned to its characters. Horner's rule (§ 30.1)

Example

{ A, B, C, ..., Z } → { 0, 1, 2, ... }

BAN → 1 + 0 + 13 = 14

CARD → 2 + 0 + 17 + 3 = 22

Upper limits

Problem: For long patterns, or for large alphabets, the number representing a given string may be too large to be practical.

Solution: Use the MOD operation. Let q be a prime number such that 2q can be stored in one computer word.

Example (q = 13)

BAN = 1 + 0 + 13 = 14; 14 mod 13 = 1; BAN → 1

CARD = 2 + 0 + 17 + 3 = 22; 22 mod 13 = 9; CARD → 9
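The letter-sum example can be checked with a few lines of Python. This is a sketch only: the function name letter_sum_hash is hypothetical, and it assumes upper-case letters A..Z mapped to 0, 1, 2, ... in alphabetical order, as in the example.

```python
def letter_sum_hash(word, q=None):
    """Sum the alphabet positions of the letters (A = 0, B = 1, ...).

    If q is given, the sum is reduced modulo q.
    """
    total = sum(ord(ch) - ord("A") for ch in word.upper())
    return total if q is None else total % q


if __name__ == "__main__":
    for word in ("BAN", "CARD", "NAB"):
        print(word, letter_sum_hash(word), letter_sum_hash(word, 13))
```

Because a plain sum ignores the order of the letters, BAN and NAB receive the same value. Different strings sharing one hash value is exactly the situation that forces a character-by-character check after the hashes agree.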

Page 3: Unit-5 String Matching


How it works

Once we use modular arithmetic, when p = t_s for some s, we can no longer be sure that P[1..m] is equal to T[s+1..s+m].

Therefore, after the equality test p = t_s, we should compare P[1..m] with T[s+1..s+m] character by character to ensure that we really have a match.

So the worst-case running time becomes O(nm), but it avoids a lot of unnecessary string matchings in practice.

• If the hash values match, the strings might still not match; such cases are called spurious hits.

Algorithm: RabinKarp(T[1..n], P[1..m])

1: hsub = hash(P[1..m])    (i.e. p)
2: hs = hash(T[1..m])    (i.e. t_0)
3: for s = 0 to n - m do
4:     if hs = hsub then
5:         if T[s+1..s+m] = P then
6:             print "Pattern occurs with shift" s
7:     if s < n - m then hs = hash(T[s+2..s+m+1])    (i.e. t_(s+1))

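Q. Write the Rabin-Karp algorithm for string matching. Working modulo q = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 3141592653589793 when looking for the pattern P = 26?

Ans: Given q = 11, T = 3141592653589793 and P = 26, we have p = P mod q = 26 mod 11 = 4. Then find t_s for every shift s of the text T: t_0 = 31 mod 11 = 9, t_1 = 14 mod 11 = 3, and so on.

T:      3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3

s        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
window  31 14 41 15 59 92 26 65 53 35 58 89 97 79 93
t_s      9  3  8  4  4  4  4 10  9  2  3  1  9  2  5

t_s = p = 4 at s = 3, 4, 5 and 6. The windows 15, 59 and 92 (s = 3, 4, 5) are spurious hits; the window 26 (s = 6) is the match.

Spurious hits = 3 and match = 1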
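A count of this kind can be reproduced with a short Python sketch. It assumes a text made of decimal digits (radix d = 10) and uses the usual rolling update t_(s+1) = (d(t_s - T[s+1]h) + T[s+m+1]) mod q with h = d^(m-1) mod q, which is one common way to compute t_s quickly; the function name and the 0-based shifts are our own conventions. The demonstration uses the first sixteen digits of π, 3141592653589793, as the text.

```python
def rabin_karp(text, pattern, q, d=10):
    """Return (valid shifts, number of spurious hits) for digit strings.

    Shifts are 0-based. q is the modulus and d the radix (10 for decimal digits).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0
    h = pow(d, m - 1, q)              # weight of the leading digit, modulo q
    p = t = 0
    for i in range(m):                # Horner's rule for p and t_0
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    matches, spurious = [], 0
    for s in range(n - m + 1):
        if p == t:
            # Hash values agree: verify character by character.
            if text[s:s + m] == pattern:
                matches.append(s)
            else:
                spurious += 1
        if s < n - m:
            # Rolling update: drop text[s], append text[s + m].
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches, spurious


if __name__ == "__main__":
    print(rabin_karp("3141592653589793", "26", q=11))
```

Python's % operator always returns a non-negative result for a positive modulus, so the subtraction inside the rolling update needs no extra correction here; in languages where % can be negative, q would have to be added back.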

Page 4: Unit-5 String Matching


3) The KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).

But it shifts the pattern more intelligently than the brute force algorithm.

If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?

Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]

Example

[Figure: text T and pattern P aligned at text position i, with labels j = 5 and jnew = 2 on the pattern.]

The Prefix Function

The KMP algorithm preprocesses the pattern P by computing a prefix function π that indicates the largest possible shift s using previously performed comparisons. Specifically, the prefix function π(q) is defined as the length of the longest prefix of P that is a proper suffix of P[1 . . q].

KNUTH-MORRIS-PRATT Prefix Function (P)

Input: Pattern with m characters

1. m = length[P]
2. π[1] = 0
3. k = 0
4. for q = 2 to m
5.     while k > 0 and P[k+1] ≠ P[q]
6.         k = π[k]
7.     if P[k+1] = P[q] then
8.         k = k + 1
9.     π[q] = k

Page 5: Unit-5 String Matching

Note that the prefix function for P, which maps q to the length of the longest prefix of P that is a suffix of P[1 . . q], encodes repeated substrings inside the pattern itself.

As an example, consider the pattern P = a b b a b a. The prefix function computed by the above algorithm is:

q      1  2  3  4  5  6
P[q]   a  b  b  a  b  a
π(q)   0  0  0  1  2  1
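The prefix-function values for this pattern can be reproduced with a short Python sketch. It is an illustration under one assumption: Python lists are 0-indexed, so pi[q-1] in the code holds the value that the notes call π(q). The function name prefix_function is ours.

```python
def prefix_function(pattern):
    """Return pi, where pi[q-1] is the length of the longest prefix of
    pattern that is also a proper suffix of pattern[:q]."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the current matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


if __name__ == "__main__":
    print(prefix_function("abbaba"))
```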

Example of Pattern matching:

Analysis

The running time of Knuth-Morris-Pratt algorithm is proportional to the time needed to read the characters in text and pattern. In other words, the worst-case running time of the algorithm is O(m+n) and it requires O(m) extra space. It is important to note that these quantities are independent of the size of the underlying alphabet.
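To make the O(m+n) behaviour concrete, here is a minimal Python sketch of a KMP matcher built on the prefix function. The names kmp_search and prefix_function and the 0-based shifts are our own conventions, not part of the notes. The text index i only moves forward, and k can decrease at most as often as it has increased, which is the intuition behind the linear bound.

```python
def prefix_function(pattern):
    """pi[q-1] = length of the longest prefix of pattern that is a
    proper suffix of pattern[:q] (0-based storage)."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text."""
    if not pattern:
        return []
    pi = prefix_function(pattern)           # O(m) time, O(m) extra space
    shifts = []
    k = 0                                   # number of pattern characters matched
    for i, ch in enumerate(text):           # each text character is read once
        while k > 0 and pattern[k] != ch:
            k = pi[k - 1]                   # reuse earlier comparisons
        if pattern[k] == ch:
            k += 1
        if k == len(pattern):
            shifts.append(i - len(pattern) + 1)
            k = pi[k - 1]                   # keep going to find later matches
    return shifts


if __name__ == "__main__":
    print(kmp_search("000010001010001", "0001"))
```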

Q. Explain the Knuth-Morris-Pratt string matching algorithm. Write an algorithm to find the prefix function. Calculate the prefix function for the pattern a b b a b a. [Ans: shown in the prefix function example above.]
