15 string matching
-
7/28/2019 15 String Matching
1/45
Jim Anderson Comp 750, Fall 2009 String Matching - 1
Chapter 32: String Matching
Given: Two strings T[1..n] and P[1..m] over alphabet Σ.
Want to find all occurrences of the pattern P[1..m] in the text T[1..n].
Example: Σ = {a, b, c}
text T:    a b c a b a a b c a a b a c
pattern P:       a b a a        (s = 3)
Terminology:
- P occurs with shift s.
- P occurs beginning at position s+1.
- s is a valid shift.
Goal: Find all valid shifts.
Applications: Text editors, search for patterns in DNA sequences
(actually, this is stretching the truth a little),
-
Notation and Terminology
w pre x -- w is a prefix of x.
Example: aba pre abaabc.
w suf x -- w is a suffix of x.
Example: abc suf abaabc.
Note: In the book the symbol ⊏ is used instead of pre,
and the symbol ⊐ is used instead of suf.
I couldn't figure out an easy way to reproduce these
symbols in PowerPoint.
-
Lemma 32.1
Lemma 32.1: Suppose x suf z and y suf z. If |x| ≤ |y| then x suf y. If |x| ≥ |y| then y suf x. If |x| = |y| then x = y.
[Figure: three diagrams, one per case, showing x and y aligned as suffixes of z.]
-
More Notation
P_k = P[1..k], where k ≤ m.
Thus, P_0 = ε, P_m = P[1..m] = P.
Similarly T_k = T[1..k], where k ≤ n.
Our Problem: Find all s, where 0 ≤ s ≤ n-m, such that P suf T_{s+m}.
Assumption: We assume the test x = y takes Θ(t + 1) time,
where t is the length of the longest string z such that z pre x and z pre y.
-
Naïve Brute-Force Algorithm
Naïve(T, P)
  n := length[T];
  m := length[P];
  for s := 0 to n-m do
    if P[1..m] = T[s+1..s+m] then
      print "pattern occurs with shift s"
    fi
  od
Running time is Θ((n - m + 1)m).
Bound is tight. Consider: T = a^n, P = a^m.
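The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not part of the original slides: the function name naive_match is hypothetical, and shifts are 0-based, which happens to coincide with the slides' shift s.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # The slice comparison plays the role of the test P[1..m] = T[s+1..s+m].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


# The example from the first slide: P = abaa occurs in T with shift 3.
print(naive_match("abcabaabcaabac", "abaa"))  # [3]
```

On T = a^n, P = a^m every one of the n - m + 1 comparisons scans all m characters, which is the tight case mentioned above.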
-
Example
T = a c a a b c, P = a a b
s = 0: T[1..3] = a c a; mismatch at the 2nd character.
s = 1: T[2..4] = c a a; mismatch at the 1st character.
s = 2: T[3..5] = a a b; match!
s = 3: T[4..6] = a b c; mismatch at the 2nd character.
-
Rabin-Karp Algorithm
Suppose Σ = {0, 1, 2, …, 9}.
Let us view P as a decimal number.
Example: View P = 31415 as 31,415.
Can also view substrings of T as decimal numbers.
Let t_s = the decimal number corresponding to T[s+1..s+m].
Let p = the decimal number corresponding to P.
We want to know all s such that t_s = p.
We can compute p in O(m) time using Horner's Rule:
p = P[m] + 10(P[m-1] + 10(P[m-2] + … + 10(P[2] + 10 P[1]) … ))
Can similarly compute t_0 in O(m) time.
-
RK Algorithm (Continued)
Can compute t_1, t_2, …, t_{n-m} in O(n-m) time as follows:
t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1].
Example: T = 314152, m = 5
t_{s+1} = 10(31415 - 10000·3) + 2 = 14152
Time Complexity:
O(n+m) + O(n-m+1) = O(n+m),
where O(n+m) is the time to compute p and t_0, …, t_{n-m},
and O(n-m+1) is the time to perform n-m+1 comparisons.
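The rolling update can be checked with a short Python sketch. It assumes the text is a string of decimal digits and uses exact (unbounded) integers, so it only illustrates the recurrence; the name window_values is made up for this example.

```python
def window_values(digits: str, m: int) -> list[int]:
    """Decimal value t_s of every length-m window of digits, via the rolling update."""
    n = len(digits)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)              # weight of the leading digit, 10^(m-1)
    t = 0
    for ch in digits[:m]:             # Horner's rule for t_0
        t = 10 * t + int(ch)
    values = [t]
    for s in range(n - m):
        # Drop the leading digit T[s+1], shift left, append T[s+m+1].
        t = 10 * (t - high * int(digits[s])) + int(digits[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```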
-
Two Problems
Might have |Σ| = d ≠ 10. Solution: Use radix-d arithmetic.
Numbers may be very large.
Solution: Perform computations modulo-q for some q.
-
What is q?
Select q to be a large prime such that dq fits in one memory word.
⇒ all computations can be performed using single-precision arithmetic.
To summarize, p is computed using
p = (P[m] + d(P[m-1] + d(P[m-2] + … + d(P[2] + d P[1]) … ))) mod q
t_0 is computed similarly.
Other t_i's are computed using
t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) mod q, where h ≡ d^{m-1} (mod q)
Unfortunately, we have a new problem: spurious hits.
-
Example
text T:    2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
pattern P: 3 1 4 1 5
p = 31415 mod 13 = 7
Window 3 1 4 1 5 (s = 6): 31415 mod 13 = 7, valid match.
Window 6 7 3 9 9 (s = 12): 67399 mod 13 = 7, spurious hit.
-
Algorithm
We deal with spurious hits by performing an explicit check whenever there is a potential match.
RK(T, P, d, q)
  n := length[T];
  m := length[P];
  h := d^{m-1} mod q;
  p := 0;
  t_0 := 0;
  for i := 1 to m do
    p := (d·p + P[i]) mod q;
    t_0 := (d·t_0 + T[i]) mod q
  od;
  for s := 0 to n-m do
    if p = t_s then
      if P[1..m] = T[s+1..s+m] then
        print "pattern occurs with shift s"
      fi
    fi;
    if s < n-m then
      t_{s+1} := (d(t_s - T[s+1]·h) + T[s+m+1]) mod q
    fi
  od
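As a rough Python rendering of RK, the sketch below uses character codes (ord) as digits. The defaults d = 256 and q = 1000003 are assumptions chosen for illustration, not values from the slides, and the function name rabin_karp is likewise hypothetical. An empty pattern is treated as having no occurrences, which is a choice made here.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all valid shifts, verifying each hash hit to rule out spurious hits."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                      # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                        # Horner's rule, modulo q
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # A hash hit is only a potential match; the slice comparison is the explicit check.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Python's % always returns a non-negative result, so no extra correction is needed.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("2359023141526739921", "31415", d=10, q=13))  # [6]
```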
-
Running Time
Worst-Case: Θ((n - m + 1)m). (Again, consider P = a^m, T = a^n.)
Average-Case:
Some assumptions
Assume O(1) valid shifts.
Think of 0, 1, …, q - 1 like hash buckets.
Assume each bucket is equally likely.
We expect O(n/q) spurious hits.
Expected running time is:
O(n) + O(m(number of valid shifts + n/q)) = O(n+m), choosing q ≥ m.
-
Finite Automata Algorithm
[Figure: state-transition diagram for P = a b a b a c a, with states 0 through 7.]
Transition function δ (rows: state; columns: input):
state   a b c   P
  0     1 0 0   a
  1     1 2 0   b
  2     3 0 0   a
  3     1 4 0   b
  4     5 0 0   a
  5     1 4 6   c
  6     7 0 0   a
  7     1 2 0
Trace:
i             --  1 2 3 4 5 6 7 8 9 10 11
T[i]          --  a b a b a b a c a b  a
state φ(T_i)  0   1 2 3 4 5 4 5 6 7 2  3
Processing time takes Θ(n). But have to first construct FA.
Main Issue: How to construct FA?
-
Need some Notation
φ(w) = state FA ends up in after processing w.
Example: φ(abab) = 4.
σ(x) = max{k: P_k suf x}. Called the suffix function.
Examples: Let P = ab.
σ(ε) = 0
σ(ccaca) = 1
σ(ccab) = 2
Note: If |P| = m, then σ(x) = m indicates a match.
[Figure: as T = a b a b b a b b a c is scanned, the state sequence starts at 0 and reaches m at the end of each match.]
Note Also: x suf y ⇒ σ(x) ≤ σ(y).
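The suffix function can be computed directly from its definition. The Python sketch below is a brute-force illustration (the name sigma mirrors the slide's σ; the pattern is passed explicitly, which is a choice made here).

```python
def sigma(pattern: str, x: str) -> int:
    """Suffix function: largest k such that pattern[:k] (the slides' P_k) is a suffix of x."""
    for k in range(min(len(pattern), len(x)), 0, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # P_0 is the empty string, which is a suffix of every x


print(sigma("ab", ""), sigma("ab", "ccaca"), sigma("ab", "ccab"))  # 0 1 2
```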
-
FA Construction
Given: P[1..m]
Let Q = states = {0, 1, …, m}; state 0 is initial and state m is final.
Define transition function δ as follows:
δ(q, a) = σ(P_q a) for each q and a.
Example: δ(5, b) = σ(P_5 b) = σ(ababab) = 4
Intuition: Encountering a b in state 5 means the current substring
doesn't match. But, you know this substring ends with "abab" -- and
this is the longest suffix that matches the beginning of P. Thus, we
go to state 4 and continue processing abab… .
-
Time Complexity
FA takes O(m|Σ|) time to construct.
(Book only gives a O(m^3|Σ|) algorithm.)
Total time is O(n + m|Σ|).
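For illustration, here is a Python sketch of the straightforward construction, which applies δ(q, a) = σ(P_q a) by brute force. It corresponds to the slower O(m^3|Σ|) approach, not the O(m|Σ|) one; the names build_transitions and fa_match, and the choice to pass the alphabet as a string, are assumptions of this sketch.

```python
def build_transitions(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a] = sigma(P_q a), computed directly from the definition."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)
            # Shrink k until P_k is a suffix of P_q a; k = 0 always succeeds.
            while not candidate.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def fa_match(text: str, pattern: str, alphabet: str) -> list[int]:
    """Scan text once with the automaton and report every valid shift."""
    delta = build_transitions(pattern, alphabet)
    m = len(pattern)
    q = 0
    shifts = []
    for i, ch in enumerate(text):
        q = delta[q].get(ch, 0)   # a character outside the alphabet resets to state 0
        if q == m:
            shifts.append(i - m + 1)
    return shifts


print(build_transitions("ababaca", "abc")[5])       # {'a': 1, 'b': 4, 'c': 6}
print(fa_match("abababacaba", "ababaca", "abc"))    # [2]
```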
-
Correctness
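Lemma 32.2: σ(xa) ≤ σ(x) + 1.
Proof:
Let r = σ(xa).
Case: r = 0. Clearly σ(xa) ≤ σ(x) + 1.
Case: r > 0.
[Figure: P_r aligned as a suffix of xa; dropping the final a leaves P_{r-1} aligned as a suffix of x.]
We have:
P_r suf xa ⇒ P_{r-1} suf x ⇒ r - 1 ≤ σ(x).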
-
Another Lemma
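Lemma 32.3: q = σ(x) ⇒ σ(xa) = σ(P_q a).
Proof:
Let q = σ(x). Then P_q suf x ⇒ P_q a suf xa.
Let r = σ(xa). By Lemma 32.2, r ≤ q + 1. We have:
[Figure: P_q a and P_r both aligned as suffixes of xa, with |P_r| ≤ |P_q a|.]
By Lemma 32.1, P_r suf P_q a, so r ≤ σ(P_q a). Also, P_q a suf xa gives σ(P_q a) ≤ σ(xa) = r.
⇒ σ(P_q a) = r.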
-
Main Theorem
Theorem 32.4: φ(T_i) = σ(T_i) for all i = 0, 1, …, n.
Implies: in accepting state if and only if string processed so far has a match at position (length of string) - m.
Proof:
Induction on i.
Basis: i = 0. T_0 = ε. φ(T_0) = σ(T_0) = 0.
Step: Assume φ(T_i) = σ(T_i).
Let q = φ(T_i), a = T[i+1].
Then, σ(T_i) = φ(T_i) = q, which by Lemma 32.3, implies
σ(T_i a) = σ(P_q a). (**)
-
Proof Continued
φ(T_{i+1}) = φ(T_i a)      , T_{i+1} = T_i a
           = δ(φ(T_i), a)  , φ(wa) = δ(φ(w), a)
           = δ(q, a)       , q = φ(T_i)
           = σ(P_q a)      , δ(q, a) = σ(P_q a)
           = σ(T_i a)      , by (**)
           = σ(T_{i+1})    , T_{i+1} = T_i a
-
Knuth-Morris-Pratt Algorithm
Achieves Θ(n + m) by avoiding precomputation of δ.
Instead, we precompute π[1..m] in O(m) time.
As T is scanned, π[1..m] is used to deduce information given by δ in FA algorithm.
-
Motivating Example
T:  b a c b a b a b a a b c b a b
P:          a b a b a c a            (shift s; the first q = 5 characters match)
Shift s is discovered to be invalid because of mismatch
of 6th character of P.
By definition of P, we also know s + 1 is an invalid shift.
However, s + 2 may be a valid shift.
-
Motivating Example
T:  b a c b a b a b a a b c b a b
P:              a b a b a c a        (shift s + 2; the first k = 3 characters are already known to match)
The shift s + 2. Note that the first 3 characters of T starting
at s + 2 don't have to be checked again -- we already know
what they are.
-
Motivating Example
P_q:  a b a b a
P_k:      a b a
The longest prefix of P that is also a proper suffix of P_5 is P_3.
We will define π[5] = 3.
In general, if q characters have matched successfully at shift s, the next potentially valid shift is s' = s + (q - π[q]).
-
The Prefix Function
π is called the prefix function for P.
π: {1, 2, …, m} → {0, 1, …, m-1}
π[q] = length of the longest prefix of P that is a proper suffix of P_q, i.e.,
π[q] = max{k: k < q and P_k suf P_q}.
Compute-π(P)
1  m := length[P];
2  π[1] := 0;
3  k := 0;
4  for q := 2 to m do
5    while k > 0 and P[k+1] ≠ P[q] do
6      k := π[k]
     od;
7    if P[k+1] = P[q] then
8      k := k + 1
     fi;
9    π[q] := k
   od;
10 return π
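A possible Python transcription of Compute-π is sketched below. To keep the indices aligned with the slides' 1-based π[q], the returned list has an unused entry at index 0; that layout and the name compute_prefix are choices made for this example.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Prefix function; pi[q] for q = 1..m matches the slides' π[q], pi[0] is unused."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # pattern[k] is the slides' P[k+1]; pattern[q - 1] is P[q].
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```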
-
Example
i     1 2 3 4 5 6 7
P[i]  a b a b a c a
π[i]  0 0 1 2 3 0 1
(Same as our FA example.)
q  P_q              longest prefix of P that is a proper suffix of P_q
7  a b a b a c a    a = P_1
6  a b a b a c      ε = P_0
5  a b a b a        a b a = P_3
4  a b a b          a b = P_2
3  a b a            a = P_1
2  a b              ε = P_0
1  a                ε = P_0
-
Another Explanation
[Figure: the states 0 through 7 in a row, joined by edges labeled a, b, a, b, a, c, a.]
Essentially KMP is computing a FA with epsilon moves. The "spine"
of the FA is implicit and doesn't have to be computed -- it's just the
pattern P. π gives the transitions. There are O(m) such transitions.
Recall from Comp 455 that a FA with epsilon moves is
conceptually able to be in several states at the same time (in
parallel). That's what's happening here -- we're exploring
pieces of the pattern in parallel.
-
Another Example
i     1 2 3 4 5 6 7 8 9 10
P[i]  a b b a b a a b b a
π[i]  0 0 0 1 2 1 1 2 3 4
q   P_q                    longest prefix of P that is a proper suffix of P_q
10  a b b a b a a b b a    a b b a = P_4
9   a b b a b a a b b      a b b = P_3
8   a b b a b a a b        a b = P_2
7   a b b a b a a          a = P_1
6   a b b a b a            a = P_1
5   a b b a b              a b = P_2
4   a b b a                a = P_1
3   a b b                  ε = P_0
2   a b                    ε = P_0
1   a                      ε = P_0
-
Time Complexity
Amortized Analysis --
Φ_0 → loop q = 2 (1st iteration) → Φ_1 → loop q = 3 (2nd iteration) → Φ_2 → loop q = 4 (3rd iteration) → … → loop q = m ((m-1)st iteration) → Φ_{m-1}
Φ = potential function = value of k
Amortized cost: ĉ_i = c_i + Φ_i - Φ_{i-1}, where c_i is the actual loop cost of iteration i.
-
Time Complexity (Continued)
Total amortized cost:
Σ_{i=1}^{m-1} ĉ_i = Σ_{i=1}^{m-1} (c_i + Φ_i - Φ_{i-1}) = Σ_{i=1}^{m-1} c_i + Φ_{m-1} - Φ_0
If Φ_{m-1} ≥ Φ_0, then amortized cost upper bounds real cost.
We have Φ_0 = 0 (initial value of k)
Φ_{m-1} ≥ 0 (final value of k).
We show ĉ_i = O(1).
-
Time Complexity (Continued)
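The value of ĉ_i obviously depends on how many times statement
6 is executed.
Note that k > π[k]. Thus, each execution of statement 6 decreases k by at least 1.
So, suppose that statements 5..6 iterate several times, decreasing the value of k.
We have: number of iterations ≤ k_old - k_new. Thus,
ĉ_i ≤ O(1) + 2(k_old - k_new) + Φ_i - Φ_{i-1},
where the O(1) term is for statements other than 5 & 6, Φ_i = k_new, and Φ_{i-1} = k_old.
(The factor 2 counts statements 5 and 6 separately. Charging one unit per pass through 5..6, or equivalently taking Φ = 2k, makes the k_old - k_new terms cancel; the possible increment in statement 8 adds only O(1).)
Hence, ĉ_i = O(1). Total cost is therefore O(m).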
-
Rest of the Algorithm
KMP(T, P)
  n := length[T];
  m := length[P];
  π := Compute-π(P);
  q := 0;
  for i := 1 to n do
    while q > 0 and P[q+1] ≠ T[i] do
      q := π[q]
    od;
    if P[q+1] = T[i] then
      q := q + 1
    fi;
    if q = m then
      print "pattern occurs with shift i - m";
      q := π[q]
    fi
  od
Time complexity of loop is O(n) (similar to the analysis of Compute-π).
Total time is O(m + n).
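The matcher can be sketched in Python as follows. The block repeats a prefix-function helper so that it runs on its own; the names kmp_match and compute_prefix, the 0-based text index, and the decision to report no matches for an empty pattern are all choices made for this illustration.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Prefix function with pi[q] for q = 1..m; pi[0] is unused."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_match(text: str, pattern: str) -> list[int]:
    """Return all valid shifts of pattern in text using the prefix function."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix(pattern)
    q = 0                                   # number of pattern characters matched so far
    shifts = []
    for i, ch in enumerate(text):           # i is 0-based, so the slides' i is i + 1
        while q > 0 and pattern[q] != ch:
            q = pi[q]                       # fall back to the next shorter border
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i + 1 - m)        # the slides' shift i - m
            q = pi[q]                       # keep going to find overlapping matches
    return shifts


print(kmp_match("abbabababc", "ababc"))  # [5]
```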
-
Example
i     1 2 3 4 5
P[i]  a b a b c
π[i]  0 0 1 2 0
P = a b a b c
    1 2 3 4 5 6 7 8 9 10
T = a b b a b a b a b c
Start of 1st loop: q = 0, i = 1 [a]
2nd loop: q = 1, i = 2 [b]
3rd loop: q = 2, i = 3 [b]   (mismatch detected)
4th loop: q = 0, i = 4 [a]
5th loop: q = 1, i = 5 [b]
6th loop: q = 2, i = 6 [a]
7th loop: q = 3, i = 7 [b]
8th loop: q = 4, i = 8 [a]   (mismatch detected)
9th loop: q = 3, i = 9 [b]
10th loop: q = 4, i = 10 [c]   (match detected)
Termination: q = 5
Please see the book for formal correctness proofs.
(They're very tedious.)