1 Average Case Analysis of an Exact String Matching Algorithm Advisor: Professor R. C. T. Lee Speaker: S. C. Chen

Upload: megan-mcnamara

Post on 27-Mar-2015



Average Case Analysis of an Exact String Matching Algorithm

Advisor: Professor R. C. T. Lee

Speaker: S. C. Chen


Problem Definition

We are given a text T=t1t2…tn of length n and a pattern P=p1p2…pm of length m, and we are asked to find all occurrences of P in T.

Example:

CCTAP

CCTAAGTCAGCCTAAGCTT

There are two occurrences of P in T, as shown below:

AGTCCCTAAGCTCCTAAG
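The problem statement can be made concrete with a brute-force reference search. This is only a sketch for checking results, not the algorithm analysed in these slides. The function name `find_all` and the pattern `CCTAAG` are assumptions chosen for illustration; positions are reported 1-based to match the t1…tn notation.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return the 1-based start position of every occurrence of pattern in text.

    Brute force: compare the pattern against every alignment, O(n*m) worst case.
    """
    n, m = len(text), len(pattern)
    return [s + 1 for s in range(n - m + 1) if text[s:s + m] == pattern]


T = "AGTCCCTAAGCTCCTAAG"  # illustrative text
P = "CCTAAG"              # assumed pattern, for illustration only
print(find_all(T, P))     # [5, 13]
```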


Exact string matching algorithms use many rules, for example the Suffix to Prefix Rule and the Substring Matching Rule.


This algorithm uses the idea of the Substring Matching Rule.


The Substring Matching Rule

For any substring S in T, find the nearest S in P which is to the left of it. If such an S in P exists, move P so that the two S’s match; otherwise, we may define a new partial window.

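[Figure: two diagrams of a window of T aligned with P. The first marks the exactly matched part and the substring S in both T and P; the second marks S in both T and P.]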


[Figure: two diagrams of a window of T aligned with P, with S marked and positions i-m+1, i-r+1 and i labelled.]


In this algorithm, we first check whether S=T[i-r+1…i] is a substring of P or not.

If S does not occur in P, we shift P to the right by m-r steps.

[Figure: two diagrams of a window of T aligned with P, with S marked and positions 1, m-r+1 and m labelled.]


• If S occurs in P, according to the Substring Matching Rule, we should slide P so that the two substrings S match as shown below.

[Figure: a window of T aligned with P, with S marked in both T and P.]


• Our algorithm, however, is not that smart. Instead of sliding P so that the two substrings S match, we simply examine the entire window from i-m+1 to i+m-r, whose length is 2m-r, to see whether P occurs in this window, as shown below.

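[Figure: a window of T of length 2m-r, from position i-m+1 to position i+m-r, with S = T[i-r+1…i] marked and P drawn below it.]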


• Note that our not so smart algorithm covers the case of sliding P to match the two substrings S.

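[Figure: a window of T of length 2m-r, from position i-m+1 to position i+m-r, with S = T[i-r+1…i] marked in T and the matching S marked in P.]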


Algorithm

Algorithm fast-on-average;
    i = m;
    while i ≤ n do begin
        if T[i-r+1…i] is a substring of P then
            compute all occurrences of P whose starting positions are in T[i-m+1…i-r+1], applying the KMP algorithm
        else { P does not start in T[i-m+1…i-r+1] }
        i = i+m-r
    end
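The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `kmp_search` and `fast_on_average` are hypothetical, r is taken as 2 log_α m rounded up and clamped to 1…m-1 so that the shift m-r is positive, and the substring check uses a hash set of the length-r substrings of P as a stand-in for a suffix tree. The alphabet size `alpha` is assumed to be at least 2 and P is assumed non-empty.

```python
import math


def kmp_search(text, pattern):
    """Return the 0-based start indices of all occurrences of pattern in text (KMP)."""
    m = len(pattern)
    fail = [0] * m                      # failure (border) table of the pattern
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[q] != pattern[k]:
            k = fail[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        fail[q] = k
    hits, k = [], 0
    for j, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == m:                      # full match ending at index j
            hits.append(j - m + 1)
            k = fail[k - 1]
    return hits


def fast_on_average(T, P, alpha):
    """Return the sorted 1-based start positions of P in T."""
    n, m = len(T), len(P)
    if m < 2:                           # no room for a shift m - r >= 1; fall back to KMP
        return [s + 1 for s in kmp_search(T, P)]
    # Assumption: r = 2 log_alpha(m), rounded up and clamped to 1..m-1.
    r = max(1, min(m - 1, math.ceil(2 * math.log(m, alpha))))
    # Stand-in for the suffix tree of P: all length-r substrings of P.
    grams = {P[k:k + r] for k in range(m - r + 1)}
    found = set()
    i = m                               # i is the 1-based right end of the window
    while i <= n:
        if T[i - r:i] in grams:         # S = T[i-r+1..i] occurs in P
            lo = i - m                  # 0-based index of position i-m+1
            hi = min(n, i + m - r)      # window T[i-m+1..i+m-r], length 2m-r
            for s in kmp_search(T[lo:hi], P):
                found.add(lo + s + 1)
        i += m - r                      # shift in both branches
    return sorted(found)


print(fast_on_average("AGTCCCTAAGCTCCTAAG", "CCTAAG", 4))  # [5, 13]
```

Consecutive windows share the start position i-r+1, so a set is used to avoid reporting the same occurrence twice.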


Analysis

First of all, let us note that the above algorithm has to determine whether the suffix S occurs in P or not. This is again an exact string matching problem. Let us assume that a pre-processing step constructs a suffix tree of P. Whether S occurs in P or not can then be determined by feeding S into the suffix tree of P. Because the length of S is r, we can determine whether S occurs in P in O(r) time.
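The O(r) check can be illustrated with a naive suffix trie, which is a simplification of the suffix tree assumed above: it answers the same query in O(r) steps but takes O(m²) space to build, whereas a real suffix tree takes O(m). The names `build_suffix_trie` and `occurs_in` and the example strings are hypothetical.

```python
def build_suffix_trie(P):
    """Insert every suffix of P into a trie of nested dicts (naive, O(m^2))."""
    root = {}
    for start in range(len(P)):
        node = root
        for ch in P[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs_in(trie, S):
    """Feed S into the trie: one step per character, so O(r) for |S| = r."""
    node = trie
    for ch in S:
        if ch not in node:
            return False                # fell off the trie: S is not a substring
        node = node[ch]
    return True


trie = build_suffix_trie("CCTAAG")      # illustrative pattern
print(occurs_in(trie, "TAA"))           # True
print(occurs_in(trie, "CCC"))           # False
```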


For reasons which will become clear later, we assume that $r = 2\log_\alpha m$.


We assume that the text is a random string and the size of the alphabet is α.


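There are $\alpha^r$ possible substrings with length r consisting of α distinct characters.

There are only m-r+1 substrings with length r in P, whose length is m.

Thus, with $r = 2\log_\alpha m$, the probability that S is a substring of P is not greater than

$$\frac{m-r+1}{\alpha^r} = \frac{m-r+1}{\alpha^{2\log_\alpha m}} = \frac{m-r+1}{m^2} \le \frac{m}{m^2} = \frac{1}{m}.$$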
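For a concrete pattern the probability can be computed exactly and compared with the 1/m bound. The sketch below assumes a uniformly random text, rounds r = 2 log_α m up to an integer, and counts the distinct length-r substrings of P. The function name `occurrence_probability` and the example pattern are hypothetical.

```python
import math


def occurrence_probability(P, alphabet):
    """Return (r, exact probability that a random length-r string occurs in P, 1/m)."""
    alpha, m = len(alphabet), len(P)
    r = math.ceil(2 * math.log(m, alpha))            # assumption: r rounded up
    distinct = {P[k:k + r] for k in range(m - r + 1)}  # length-r substrings of P
    return r, len(distinct) / alpha ** r, 1 / m


r, prob, bound = occurrence_probability("CCTAAG", "ACGT")
print(r, prob, prob <= bound)   # 3 0.0625 True
```

Here P has four distinct substrings of length 3 out of 4³ = 64 possible strings, so the probability is 4/64, which is below 1/m = 1/6.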


If S is a substring of P, we find all occurrences of P in T[i-m+1…i+m-r] using the KMP algorithm.

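[Figure: a window of T of length 2m-r, from position i-m+1 to position i+m-r, with S = T[i-r+1…i] marked and P drawn below it.]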


Because the length of T[i-m+1…i+m-r] is 2m-r, the time complexity of Step i using the KMP algorithm is O(m).


(1) The probability that S occurs in P is at most $\frac{1}{m}$.

(2) When S occurs in P, the time complexity of using the KMP algorithm to find all occurrences of P in T[i-m+1…i+m-r] is O(m).

Combining (1) and (2), the average time complexity of applying the KMP algorithm in one step is $O(m) \cdot \frac{1}{m} = O(1)$.

In the above, the time complexity of checking whether S occurs in P is O(r).

Thus, the average time complexity of one step, that is, the check plus the possible KMP run, is O(r) + O(1) = O(r).


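Thus, if S does not occur in P, the time complexity of Step i is only the checking time, which is O(r).

If it does, Step i costs O(r) for the check plus O(m) for the KMP algorithm, but this happens with probability at most 1/m, so the average time complexity of Step i is still O(r).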


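Each step advances the window by m-r positions, and $r = 2\log_\alpha m$ is small compared with m. Because there are therefore $O\left(\frac{n}{m}\right)$ windows with length m in T, the time complexity of this algorithm on average is

$$O\left(\frac{n}{m}\, r\right) = O\left(\frac{n}{m}\log m\right).$$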
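As a rough numerical illustration of this bound, take the assumed values n = 10^6, m = 256 and α = 4. These numbers are not from the slides, and constant factors are ignored.

```latex
\[
r = 2\log_4 256 = 8, \qquad
\frac{n}{m-r} = \frac{10^6}{248} \approx 4032 \text{ steps}, \qquad
4032 \cdot r \approx 3.2 \times 10^4 \ll n = 10^6 .
\]
```

Under these assumptions the checks inspect on the order of 3 × 10^4 text characters, far fewer than the 10^6 characters of T, which is what an average-case bound sublinear in n suggests.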


Reference

• [KMP77] D. E. Knuth, J. H. Morris and V. R. Pratt, Fast Pattern Matching in Strings, SIAM Journal on Computing 6 (2), 1977, pp. 323–350.

• [CR2002] M. Crochemore and W. Rytter, Section 2.2: Boyer-Moore algorithm and its variations, Jewels of Stringology, 2002, pp. 30–31.


Thank you