![Page 1: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/1.jpg)
Semi-Numerical String Matching
![Page 2: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/2.jpg)
All the methods we’ve seen so far have been based on comparisons.
We propose alternative methods of computation such as:
Arithmetic. Bit operations. The fast Fourier transform.
Semi-numerical String Matching
![Page 3: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/3.jpg)
We will survey three examples of such methods:
The Random Fingerprint method due to Karp and Rabin.
Shift–And method due to Baeza-Yates and Gonnet, and its extension to agrep due to Wu and Manber.
A solution to the match count problem using the fast Fourier transform due to Fischer and Paterson and an improvement due to Abrahamson.
Semi-numerical String Matching
![Page 4: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/4.jpg)
Exact match problem: we want to find all the occurrences of the pattern P in the text T.
The pattern P is of length n. The text T is of length m.
Karp-Rabin fingerprint - exact match
![Page 5: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/5.jpg)
Arithmetic replaces comparisons.
An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs and whose expected running time is efficient.
We will consider a binary alphabet: {0,1}.
Karp-Rabin fingerprint - exact match
![Page 6: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/6.jpg)
Strings are also numbers, H: strings → numbers. Let s be a string of length n; then

$$H(s) = \sum_{i=1}^{n} 2^{\,n-i}\, s(i)$$

Definition: let $T_r$ denote the length-n substring of T starting at position r.
Arithmetic replaces comparisons.
![Page 7: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/7.jpg)
Strings are also numbers, H: strings → numbers.
T = 1 0 1 1 0 1 0 1
P = 0 1 0 1
T = 1 0 1 1 0 1 0 1, T5 = 0 1 0 1: H(T5) = 5 = H(P), so P occurs at position 5.
T = 1 0 1 1 0 1 0 1, T2 = 0 1 1 0: H(T2) = 6 ≠ 5 = H(P), so P does not occur at position 2.
Arithmetic replaces comparisons.
![Page 8: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/8.jpg)
Theorem:
There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)
Proof:
Follows immediately from the unique representation of a number in base 2.
Arithmetic replaces comparisons.
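The theorem can be tried directly on the running example. This is a minimal sketch, assuming strings over {0,1}; the names `H` and `occurrences_by_value` are illustrative and not from the slides.

```python
def H(s: str) -> int:
    """Read a binary string as a base-2 number: sum of 2^(n-i) * s(i)."""
    value = 0
    for bit in s:
        value = 2 * value + int(bit)
    return value


def occurrences_by_value(P: str, T: str) -> list[int]:
    """Return every 1-based position r with H(T_r) == H(P)."""
    n = len(P)
    target = H(P)
    return [r for r in range(1, len(T) - n + 2)
            if H(T[r - 1:r - 1 + n]) == target]


print(occurrences_by_value("0101", "10110101"))
```

Each window is recomputed from scratch here, so this version only illustrates the equivalence and makes no claim about speed.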
![Page 9: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/9.jpg)
We can compute H(Tr) from H(Tr-1):

$$H(T_r) = 2\,H(T_{r-1}) - 2^{n}\,T(r-1) + T(r+n-1)$$

Example: T = 1 0 1 1 0 1 0 1, n = 4, T1 = 1 0 1 1, T2 = 0 1 1 0.

$$H(T_1) = H(1011) = 11$$

$$H(T_2) = 2\cdot 11 - 2^{4}\cdot 1 + 0 = 22 - 16 = 6 = H(0110)$$

Arithmetic replaces comparisons.
![Page 10: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/10.jpg)
A simple efficient algorithm:
Compute H(T1).
Run over T: compute H(Tr) from H(Tr-1) in constant time, and make the comparisons.
Total running time O(m)?
Arithmetic replaces comparisons.
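The update rule H(Tr) = 2·H(Tr-1) − 2^n·T(r−1) + T(r+n−1) can be sketched as follows. The function name `rolling_values` is hypothetical. Python integers are unbounded, so each update is only "constant time" if one assumes the numbers fit in a machine word, which is the caveat behind the question mark on the O(m) bound.

```python
def rolling_values(T: str, n: int) -> list[int]:
    """Return [H(T_1), H(T_2), ...] using one update per window."""
    bits = [int(c) for c in T]
    h = 0
    for b in bits[:n]:            # H(T_1) computed directly
        h = 2 * h + b
    values = [h]
    top = 1 << n                  # 2^n
    for r in range(2, len(bits) - n + 2):   # r is 1-based
        # drop T(r-1), shift, and append T(r+n-1)
        h = 2 * h - top * bits[r - 2] + bits[r + n - 2]
        values.append(h)
    return values


print(rolling_values("10110101", 4))
```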
![Page 11: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/11.jpg)
![Page 12: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/12.jpg)
Let’s use modular arithmetic; this will help us keep the numbers small.
For some integer p, the fingerprint of P is defined by Hp(P) = H(P) (mod p).
Karp-Rabin
![Page 13: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/13.jpg)
Lemma:

$$H_p(P) = \Big\{\big[\cdots\big(\{[P(1)\cdot 2 \bmod p + P(2)]\cdot 2 \bmod p + P(3)\}\cdot 2 \bmod p + P(4)\big)\cdots\big] \bmod p + P(n)\Big\} \bmod p = H(P) \bmod p$$

And during this computation no number ever exceeds 2p.
Karp-Rabin
![Page 14: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/14.jpg)
P = 1 0 1 1 1 1 H(P) = 47
p = 7 Hp(P) = 47 (mod 7) = 5
An example
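$$\begin{aligned}
1\cdot 2 \bmod 7 + 0 &= 2\\
2\cdot 2 \bmod 7 + 1 &= 5\\
5\cdot 2 \bmod 7 + 1 &= 4\\
4\cdot 2 \bmod 7 + 1 &= 2\\
2\cdot 2 \bmod 7 + 1 &= 5\\
5 \bmod 7 &= 5 = H_p(P)
\end{aligned}$$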
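The step-by-step evaluation can be reproduced with a short sketch. The function name `fingerprint_trace` and its return shape are assumptions made for illustration; it also records the largest intermediate value so the 2p bound from the lemma can be checked.

```python
def fingerprint_trace(s: str, p: int):
    """Evaluate H_p(s) left to right, reducing mod p after each doubling.

    Returns (fingerprint, intermediate values, largest number formed).
    """
    h = 0
    trace = []
    largest = 0
    for bit in s:
        doubled = 2 * h                 # at most 2p, since h <= p
        largest = max(largest, doubled)
        h = doubled % p + int(bit)
        trace.append(h)
    return h % p, trace, largest


fp, trace, largest = fingerprint_trace("101111", 7)
print(fp, trace, largest)
```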
![Page 15: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/15.jpg)
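Intermediate numbers are also kept small. We can still compute Hp(Tr) from Hp(Tr-1).
Arithmetic:

$$H(T_r) = 2\,H(T_{r-1}) - 2^{n}\,T(r-1) + T(r+n-1)$$

Modular arithmetic:

$$H_p(T_r) = \big[(2\,H(T_{r-1}) \bmod p) - (2^{n} \bmod p)\,T(r-1) + T(r+n-1)\big] \bmod p$$

Since $2\,H(T_{r-1}) \bmod p = 2\,H_p(T_{r-1}) \bmod p$, only the previous fingerprint has to be stored.
Karp-Rabin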
![Page 16: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/16.jpg)
Intermediate numbers are also kept small. We can still compute H(Tr) from H(Tr-1).
Arithmetic:

$$2^{n} = 2\cdot 2^{\,n-1}$$

Modular arithmetic:

$$2^{n} \bmod p = 2\,(2^{\,n-1} \bmod p) \bmod p$$

Karp-Rabin
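Both modular rules can be combined into a sliding fingerprint. This is a sketch under the binary-alphabet assumption of the slides; `rolling_fingerprints` is a hypothetical name.

```python
def rolling_fingerprints(T: str, n: int, p: int) -> list[int]:
    """Return [H_p(T_1), H_p(T_2), ...] without forming any large number."""
    bits = [int(c) for c in T]
    top = 1
    for _ in range(n):
        top = (2 * top) % p       # 2^n mod p = 2 * (2^(n-1) mod p) mod p
    h = 0
    for b in bits[:n]:
        h = ((2 * h) % p + b) % p
    out = [h]
    for r in range(2, len(bits) - n + 2):   # r is 1-based
        # Python's % returns a value in [0, p) even if the left side is negative
        h = ((2 * h) % p - top * bits[r - 2] + bits[r + n - 2]) % p
        out.append(h)
    return out


print(rolling_fingerprints("10110101", 4, 7))
```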
![Page 17: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/17.jpg)
How about the comparisons?
Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr).
Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hp(P) = Hp(Tr)
There are values of p for which the converse is not true!
Karp-Rabin
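A small experiment shows the converse failing. The sketch below reuses the earlier example strings; the function name `false_match_positions` is an assumption. It lists positions where the fingerprints agree although the window differs from P.

```python
def false_match_positions(P: str, T: str, p: int) -> list[int]:
    """1-based positions r with H_p(T_r) == H_p(P) but T_r != P."""
    n = len(P)
    hp = int(P, 2) % p
    out = []
    for r in range(1, len(T) - n + 2):
        window = T[r - 1:r - 1 + n]
        if int(window, 2) % p == hp and window != P:
            out.append(r)
    return out


print(false_match_positions("0101", "10110101", 3))
print(false_match_positions("0101", "10110101", 7))
```

With p = 3 the window 1011 (value 11) collides with P (value 5) because 3 divides their difference 6; with p = 7 no collision occurs on this input.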
![Page 18: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/18.jpg)
Definition:
If Hp(P) = Hp(Tr) but P doesn’t occur in T starting at position r, we say there is a false match between P and T at position r.
If there is some position r such that there is a false match between P and T at position r, we say there is a false match between P and T.
Karp-Rabin
![Page 19: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/19.jpg)
Our goal will be to choose a modulus p such that
p is small enough to keep computations efficient.
p is large enough so that the probability of a false match is kept small.
Karp-Rabin
![Page 20: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/20.jpg)
Definition: For a positive integer u, π(u) is the number of primes that are less than or equal to u.
Prime number theorem (without proof):

$$\frac{u}{\ln(u)} \le \pi(u) \le 1.26\,\frac{u}{\ln(u)}$$

Prime moduli limit false matches
![Page 21: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/21.jpg)
Lemma (without proof): if u ≥ 29, then the product of all the primes that are less than or equal to u is greater than $2^u$.
Example: u = 29; the prime numbers less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and their product is
6,469,693,230 ≥ 536,870,912 = $2^{29}$
Prime moduli limit false matches
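Both unproved statements can be spot-checked numerically for small u. This is not a proof, only a sanity check over a finite range chosen for illustration; the helper names are hypothetical.

```python
import math


def primes_up_to(u: int) -> list[int]:
    """Sieve of Eratosthenes: all primes <= u."""
    if u < 2:
        return []
    is_prime = [True] * (u + 1)
    is_prime[0] = is_prime[1] = False
    for q in range(2, math.isqrt(u) + 1):
        if is_prime[q]:
            for multiple in range(q * q, u + 1, q):
                is_prime[multiple] = False
    return [q for q, flag in enumerate(is_prime) if flag]


def prime_count(u: int) -> int:
    """pi(u): the number of primes <= u."""
    return len(primes_up_to(u))


def product_of_primes(u: int) -> int:
    """Product of all primes <= u."""
    return math.prod(primes_up_to(u))


print(prime_count(29), product_of_primes(29), 2 ** 29)
```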
![Page 22: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/22.jpg)
Corollary: If u ≥ 29 and x is any number less than or equal to $2^u$, then x has fewer than π(u) distinct prime divisors.
Proof: Assume x has k ≥ π(u) distinct prime divisors q1, …, qk. Then $2^u$ ≥ x ≥ q1 · … · qk, but q1 · … · qk is at least as large as the product of the first π(u) prime numbers, which by the lemma is greater than $2^u$, a contradiction.
Prime moduli limit false matches
![Page 23: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/23.jpg)
Theorem: Let I be a positive integer, and p a randomly chosen prime less than or equal to I. If nm ≥ 29, then the probability of a false match between P and T is less than or equal to π(nm) / π(I).
Prime moduli limit false matches
![Page 24: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/24.jpg)
Proof: Let R be the set of positions in T where P doesn’t begin. Each difference |H(P) − H(Ts)| is at most $2^n$ and R has at most m positions, so

$$\prod_{s \in R} |H(P) - H(T_s)| \le 2^{nm}$$

By the corollary the product has at most π(nm) distinct prime divisors. If there is a false match at position r then p divides $|H(P) - H(T_r)|$, and thus also divides $\prod_{s \in R} |H(P) - H(T_s)|$.
p must be in a set of size π(nm), but p was chosen randomly out of a set of size π(I).
Prime moduli limit false matches
![Page 25: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/25.jpg)
Choose a positive integer I. Pick a random prime p less than or equal to I, and
compute P’s fingerprint – Hp(P).
For each position r in T, compute Hp(Tr) and test to see if it equals Hp(P). If the numbers are equal, either declare a probable match or check and declare a definite match.
Running time: excluding verification O(m).
Random fingerprint algorithm
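The steps above can be sketched as follows. Assumptions: a binary alphabet, a plain sieve to enumerate the primes up to I (only reasonable for small I), and a direct character comparison when `verify` is set; the names `primes_up_to` and `random_fingerprint_search` are hypothetical.

```python
import random


def primes_up_to(limit: int) -> list[int]:
    """Sieve of Eratosthenes: all primes <= limit."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for q in range(2, int(limit ** 0.5) + 1):
        if is_prime[q]:
            for multiple in range(q * q, limit + 1, q):
                is_prime[multiple] = False
    return [q for q, flag in enumerate(is_prime) if flag]


def random_fingerprint_search(P, T, I, verify=True, rng=None):
    """Return 1-based positions where the fingerprints of P and T_r agree.

    With verify=True each candidate is checked directly, so only true
    occurrences are returned; with verify=False they are probable matches.
    """
    rng = rng or random.Random()
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    p = rng.choice(primes_up_to(I))     # random prime <= I
    top = pow(2, n, p)                  # 2^n mod p
    hp = ht = 0
    for i in range(n):
        hp = (2 * hp + int(P[i])) % p
        ht = (2 * ht + int(T[i])) % p
    hits = []
    for r in range(1, m - n + 2):
        if ht == hp and (not verify or T[r - 1:r - 1 + n] == P):
            hits.append(r)
        if r + n - 1 < m:               # slide the window by one character
            ht = (2 * ht - top * int(T[r - 1]) + int(T[r + n - 1])) % p
    return hits


print(random_fingerprint_search("0101", "10110101", I=4 * 8 ** 2))
```

The fingerprint pass touches each text character a constant number of times; the sieve and the direct verification are conveniences of this sketch and are not counted in the O(m) bound.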
![Page 26: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/26.jpg)
The smaller I is, the more efficient the computations are.
The larger I is, the smaller the probability of a false match.
Proposition: When I = $nm^2$:
1. The largest number used in the algorithm requires at most 4(log(n)+log(m)) bits.
2. The probability of a false match is at most 2.53/m.
How to choose I
![Page 27: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/27.jpg)
Proof:

$$\frac{\pi(nm)}{\pi(nm^2)} \le 1.26\,\frac{nm}{\ln(nm)}\cdot\frac{\ln(nm^2)}{nm^2} = \frac{1.26}{m}\cdot\frac{\ln(n) + 2\ln(m)}{\ln(n) + \ln(m)} \le \frac{2.53}{m}$$

How to choose I
![Page 28: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/28.jpg)
An idea: why not choose k primes?
Proposition: when k primes are chosen randomly and independently between 1 and I, the probability of a false match is at most

$$\left[\frac{\pi(nm)}{\pi(I)}\right]^{k}$$

Proof: We saw that if p allows an error it is in a set of at most π(nm) integers. A false match can occur only if each of the independently chosen k primes is in a set of size at most π(nm) integers.
Extensions
![Page 29: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/29.jpg)
k = 4, n = 250, m = 4000, I = $250 \cdot 4000^2 < 2^{32}$

$$\left[\frac{\pi(nm)}{\pi(I)}\right]^{k} \le \left(\frac{2.53}{4000}\right)^{k} < 10^{-12}$$

An illustration
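The numbers in this illustration are easy to recompute. The variable names below are illustrative only; the bound 2.53/m is the one from the proposition for I = nm².

```python
n, m, k = 250, 4000, 4

I = n * m ** 2                      # modulus bound I = n * m^2
bits_needed = I.bit_length()        # primes up to I fit in this many bits

single_prime_bound = 2.53 / m       # false-match bound with one prime
k_prime_bound = single_prime_bound ** k

print(I, bits_needed, single_prime_bound, k_prime_bound)
```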
![Page 30: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/30.jpg)
When k primes are used, the probability of a false match at any given position r is at most

$$\left[\frac{\pi(n)}{\pi(I)}\right]^{k}$$

Proof: Suppose a false match occurs at position r. That means that each of the primes must divide |H(P) − H(Tr)| ≤ $2^n$. There are at most π(n) primes that divide it. Each prime is chosen from a set of size π(I) and by chance is a part of a set of size π(n).
Summing over the at most m positions bounds the probability of a false match anywhere in T by m times this quantity.
Even lower limits on the error
![Page 31: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/31.jpg)
Consider the list L of locations in T where the Karp-Rabin algorithm declares P to be found.
A run is a maximal interval of starting locations l1, l2, …, lr in L such that every two successive numbers differ by at most n/2.
Let’s verify a run.
Checking for error in linear time
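Splitting L into runs is a single pass. The sketch assumes L is sorted and reads the definition as "consecutive entries differ by at most n/2"; the function name `split_into_runs` and the sample list are hypothetical.

```python
def split_into_runs(L: list[int], n: int) -> list[list[int]]:
    """Group sorted declared positions into maximal runs.

    Consecutive positions stay in the same run when they differ by at most n/2.
    """
    runs: list[list[int]] = []
    for loc in L:
        if runs and loc - runs[-1][-1] <= n / 2:
            runs[-1].append(loc)
        else:
            runs.append([loc])
    return runs


# Hypothetical declared positions for a pattern of length 14
print(split_into_runs([1, 4, 7, 10, 13, 40, 43], 14))
```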
![Page 32: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/32.jpg)
Check the first two declared occurrences explicitly.
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
If there is a false match, stop. Otherwise P is semi-periodic with period d = l2 - l1.
Checking for error in linear time
![Page 33: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/33.jpg)
d is the minimal period.
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
![Page 34: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/34.jpg)
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
For each i check that li+1 - li = d.
For each i check the last d characters of the declared occurrence at li.
Checking for error in linear time
![Page 35: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/35.jpg)
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
Check l1
![Page 36: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/36.jpg)
P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
Check l2
P is semi periodic with period 3.
![Page 37: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/37.jpg)
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
Check li+1 – li = 3
![Page 38: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/38.jpg)
For each i check the last 3 characters of li.
P = bab
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
![Page 39: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/39.jpg)
For each i check the last 3 characters of li.
P = bab
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
![Page 40: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/40.jpg)
For each i check the last 3 characters of li.
Report a false match or approve the run.
P = bab
T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time
![Page 41: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/41.jpg)
No character of T is examined more than twice during a single run.
Two runs are separated by at least n/2 positions and each run is at least n positions long. Thus no character of T is examined in more than two consecutive runs.
Total verification time O(m).
Time analysis
![Page 42: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/42.jpg)
When we have a false match we start again with a different prime.
The probability of a false match is O(1/m).
We have converted the algorithm into one that never errs and has expected linear running time.
Time analysis
![Page 43: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/43.jpg)
It is efficient and simple.
It is space efficient.
It can be generalized to solve harder problems such as 2-dimensional string matching.
Its performance is backed up by a concrete theoretical analysis.
Why use Karp-Rabin?
![Page 44: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/44.jpg)
The Shift-And Method
![Page 45: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/45.jpg)
We start with the exact match problem.
Define M to be a binary n by m matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j.
M(i,j) = 1 iff P[1 .. i] ≡ T[j-i+1 .. j]
The Shift-And Method
![Page 46: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/46.jpg)
Let T = california and P = for.
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j.

| M | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | m = 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| n = 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

How does M solve the exact match problem?
The Shift-And Method
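To make the definition concrete, M can be filled in directly from it. This brute-force sketch is not the Shift-And construction itself; it only builds the matrix by definition so that the last row can be read off. The function name `build_M_naive` is hypothetical.

```python
def build_M_naive(P: str, T: str) -> list[list[int]]:
    """M[i-1][j-1] = 1 iff P[1..i] equals T[j-i+1..j] (1-based indices)."""
    n, m = len(P), len(T)
    M = [[0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            if P[:i] == T[j - i:j]:
                M[i - 1][j - 1] = 1
    return M


M = build_M_naive("for", "california")
for row in M:
    print(row)

# A 1 in the last row at column j means P occurs in T ending at position j.
ends = [j for j in range(1, len("california") + 1) if M[-1][j - 1] == 1]
print(ends)
```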
![Page 47: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/47.jpg)
How to construct M
We will construct M column by column. Two definitions are in order.
Bit-Shift(j-1) is the vector derived by shifting the vector for column j-1 down by one and setting the first bit to 1.
Example (vectors written top to bottom):

$$\text{BitShift}\big((0,1,1,0,1)^{\top}\big) = (1,0,1,1,0)^{\top}$$
![Page 48: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/48.jpg)
We define the n-length binary vector U(x) for each character x in the alphabet. U(x) is set to 1 for the positions in P where character x appears.
Example: P = abaac

$$U(a) = (1,0,1,1,0)^{\top},\qquad U(b) = (0,1,0,0,0)^{\top},\qquad U(c) = (0,0,0,0,1)^{\top}$$

How to construct M
![Page 49: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/49.jpg)
Initialize column 0 of M to all zeros.
For j ≥ 1, column j is obtained by

$$M(j) = \text{BitShift}(j-1)\ \text{AND}\ U(T(j))$$

How to construct M
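The two definitions and the column rule translate almost literally into list operations. This is a sketch with columns stored as Python lists, top bit first; the names `bit_shift`, `build_U` and `next_column` are hypothetical.

```python
def bit_shift(column: list[int]) -> list[int]:
    """Shift the column down by one and set the first bit to 1."""
    return [1] + column[:-1]


def build_U(P: str) -> dict[str, list[int]]:
    """U[x][i-1] = 1 iff character x appears at position i of P."""
    return {x: [1 if ch == x else 0 for ch in P] for x in set(P)}


def next_column(prev: list[int], U: dict[str, list[int]], ch: str) -> list[int]:
    """M(j) = BitShift(M(j-1)) AND U(T(j)); unseen characters give all zeros."""
    u = U.get(ch, [0] * len(prev))
    return [a & b for a, b in zip(bit_shift(prev), u)]


U = build_U("abaac")
column = [0] * 5                      # column 0
columns = []
for ch in "xabxabaaxa":
    column = next_column(column, U, ch)
    columns.append(column)
print(columns[7])                     # column 8
```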
![Page 50: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/50.jpg)
An example, j = 1
T = x a b x a b a a x a (positions 1 to 10)
P = a b a a c (positions 1 to 5)

$$U(T(1)) = U(x) = (0,0,0,0,0)^{\top}$$

$$\text{BitShift}(0)\ \&\ U(T(1)) = (1,0,0,0,0)^{\top}\ \&\ (0,0,0,0,0)^{\top} = (0,0,0,0,0)^{\top}$$

Column 1 of M is (0,0,0,0,0).
![Page 51: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/51.jpg)
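An example, j = 2
T = x a b x a b a a x a (positions 1 to 10)
P = a b a a c (positions 1 to 5)

$$U(T(2)) = U(a) = (1,0,1,1,0)^{\top}$$

$$\text{BitShift}(1)\ \&\ U(T(2)) = (1,0,0,0,0)^{\top}\ \&\ (1,0,1,1,0)^{\top} = (1,0,0,0,0)^{\top}$$

Columns of M so far (rows 1 to 5, top to bottom): column 1 = (0,0,0,0,0), column 2 = (1,0,0,0,0).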
![Page 52: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/52.jpg)
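An example, j = 3
T = x a b x a b a a x a (positions 1 to 10)
P = a b a a c (positions 1 to 5)

$$U(T(3)) = U(b) = (0,1,0,0,0)^{\top}$$

$$\text{BitShift}(2)\ \&\ U(T(3)) = (1,1,0,0,0)^{\top}\ \&\ (0,1,0,0,0)^{\top} = (0,1,0,0,0)^{\top}$$

Columns of M so far (rows 1 to 5, top to bottom): column 1 = (0,0,0,0,0), column 2 = (1,0,0,0,0), column 3 = (0,1,0,0,0).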
![Page 53: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/53.jpg)
![Page 54: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/54.jpg)
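An example, j = 8
T = x a b x a b a a x a (positions 1 to 10)
P = a b a a c (positions 1 to 5)

$$U(T(8)) = U(a) = (1,0,1,1,0)^{\top}$$

$$\text{BitShift}(7)\ \&\ U(T(8)) = (1,1,0,1,0)^{\top}\ \&\ (1,0,1,1,0)^{\top} = (1,0,0,1,0)^{\top}$$

| M | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |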
![Page 55: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/55.jpg)
For i > 1, entry M(i,j) = 1 iff
1) The first i-1 characters of P match the i-1 characters of T ending at character j-1.
2) Character P(i) ≡ T(j).
1) is true when M(i-1,j-1) = 1.
2) is true when the i’th bit of U(T(j)) = 1.
The algorithm computes the AND of these two bits.
Correctness
![Page 56: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/56.jpg)
T = x a b x a b a a x a (positions 1 to 10)
P = a b a a c

| M | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

M(4,8) = 1 because a b a a is a prefix of P of length 4 that ends at position 8 in T.
Condition 1): a b a is a prefix of length 3 that ends at position 7 in T ↔ M(3,7) = 1.
Condition 2): the fourth character of P equals the eighth character of T ↔ the fourth bit of U(T(8)) = 1.
Correctness
![Page 57: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/57.jpg)
Formally the running time is Θ(mn). However, the method is very efficient if n is the size
of a single or a few computer words.
Furthermore only two columns of M are needed at any given time. Hence, the space used by the algorithm is O(n).
How much did we pay?
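When a column fits in one machine word, the whole column update is a shift, an OR and an AND. The sketch below stores a column as a Python integer with bit i−1 standing for row i (a representation chosen here for illustration) and keeps only the current column, which mirrors the space remark above. The name `shift_and` is hypothetical.

```python
def shift_and(P: str, T: str):
    """Return (columns of M as integers, 1-based end positions of matches)."""
    n = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)      # bit i set iff P(i+1) == ch
    full = (1 << n) - 1                      # keeps only the n row bits
    col = 0                                  # column 0 is all zeros
    columns, ends = [], []
    for j, ch in enumerate(T, start=1):
        # BitShift: shift down by one row and set the first bit, then AND with U
        col = ((col << 1) | 1) & full & U.get(ch, 0)
        columns.append(col)
        if (col >> (n - 1)) & 1:             # row n set: P ends at position j
            ends.append(j)
    return columns, ends


print(shift_and("for", "california")[1])
print(shift_and("abaac", "xabxabaaxa")[0])
```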
![Page 58: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/58.jpg)
We extend the shift-and method for finding inexact occurrences of a pattern in a text.
Reminder example: T = aatatccacaa, P = atcgaa.
P appears in T with 2 mismatches starting at position 4 (T[4..9] = atccac against atcgaa); it also occurs with 4 mismatches starting at position 2 (T[2..7] = atatcc against atcgaa).
agrep: The Shift-And Method with errors
![Page 59: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/59.jpg)
Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Mk to be an n by m binary matrix, such that:
Mk(i,j) = 1 iff at least i-k of the first i characters of P match the i characters up through character j of T.
What is M0? How does Mk solve the k-mismatch problem?
agrep
![Page 60: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/60.jpg)
We compute Ml for all l=0, … , k. For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector. The j’th column of Ml is given by:
Computing Mk
![Page 61: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/61.jpg)
The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal.

$$\text{BitShift}(M^{l}(j-1))\ \text{AND}\ U(T(j))$$

Computing Mk
![Page 62: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/62.jpg)
The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches.

$$\text{BitShift}(M^{l-1}(j-1))$$

Computing Mk
![Page 63: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/63.jpg)
We compute Ml for all l = 1, …, k. For each j compute M(j), M1(j), …, Mk(j).
For all l initialize Ml(0) to the zero vector. The j’th column of Ml is given by:

$$M^{l}(j) = \big[\text{BitShift}(M^{l}(j-1))\ \text{AND}\ U(T(j))\big]\ \text{OR}\ \text{BitShift}(M^{l-1}(j-1))$$

Computing Mk
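The recurrence can be sketched with one integer per error level. As an assumption of this sketch, bit i−1 of an integer stands for row i and only the previous column of every level is kept; the name `shift_and_mismatches` is hypothetical, and this is not the agrep source code.

```python
def shift_and_mismatches(P: str, T: str, k: int):
    """Return (columns of M^k as integers, end positions with <= k mismatches)."""
    n = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    full = (1 << n) - 1
    prev = [0] * (k + 1)                 # M^l(j-1) for l = 0..k, column 0 is zero
    columns, ends = [], []
    for j, ch in enumerate(T, start=1):
        u = U.get(ch, 0)
        cur = [0] * (k + 1)
        cur[0] = ((prev[0] << 1) | 1) & full & u          # exact Shift-And
        for l in range(1, k + 1):
            extend_match = ((prev[l] << 1) | 1) & u        # same error budget, chars equal
            spend_error = (prev[l - 1] << 1) | 1           # one fewer error before, any char
            cur[l] = (extend_match | spend_error) & full
        prev = cur
        columns.append(cur[k])
        if (cur[k] >> (n - 1)) & 1:
            ends.append(j)
    return columns, ends


print(shift_and_mismatches("abaac", "xabxabaaxa", 1))
print(shift_and_mismatches("atcgaa", "aatatccacaa", 2)[1])
```

On the slide example with k = 1 the only 1 in row 5 is at column 9, which is the alignment abaax against abaac.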
![Page 64: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/64.jpg)
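T = x a b x a b a a x a (positions 1 to 10)
P = a b a a c

M0 =

| M0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

M1 =

| M1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

Example: M1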
![Page 65: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/65.jpg)
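T = x a b x a b a a x a (positions 1 to 10)
P = a b a a

| M1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

$$U(a) = (1,0,1,1,0)^{\top}$$

Example: M1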
![Page 66: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/66.jpg)
Formally the running time is Θ(kmn). Again, the method is practically efficient for small n. Still, only a constant number of columns of each Ml are needed at any given time. Hence, the space used by the algorithm is O(n) per matrix, i.e. O(kn) over all k+1 matrices.
How much did we pay?
![Page 67: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/67.jpg)
The match count problem
![Page 68: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/68.jpg)
We want to count the exact number of characters that match for each of the different alignments of P with T.
T = a a t a t c c a c a a
P = a t c g a a
Two alignments are shown: one has 4 matching characters, the other has 2.
The match-count problem
![Page 69: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/69.jpg)
We will first look at a simple algorithm which extends the techniques we’ve seen so far.
Next, we introduce a more efficient algorithm that exploits existing efficient methods to calculate the Fourier transform.
We conclude with a variation that gives good performance for unbounded alphabets.
The match-count problem
![Page 70: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/70.jpg)
We define the matrix MC to be an n by m integer-valued matrix such that:
MC(i,j) = the number of characters of P[1..i] that match T[j-i+1..j]
How does MC solve the match-count problem?
Match-count Algorithm 1
![Page 71: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/71.jpg)
Initialize column 0 of MC to all zeros. For j ≥ 1, column j is obtained by
$$MC(i,j) = \begin{cases} MC(i-1,j-1) + 1 & \text{if } P(i) = T(j) \\ MC(i-1,j-1) & \text{otherwise} \end{cases}$$
Total of Θ(nm) comparisons and (simple) additions.
Computing MC
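A minimal sketch of Algorithm 1 in Python, assuming 0-based strings and an extra all-zero row 0 so that MC(0, j) = 0. Row n of MC holds the answer: MC(n, j) is the match count of the alignment that starts at position j-n+1. The name `match_counts_naive` is illustrative.

```python
def match_counts_naive(text, pattern):
    """Return W as a list; W[a-1] is the match count of P aligned at position a of T."""
    n, m = len(pattern), len(text)
    prev = [0] * (n + 1)              # column j-1; index 0 is the all-zero row 0
    counts = []
    for j in range(1, m + 1):
        col = [0] * (n + 1)
        for i in range(1, n + 1):
            # MC(i, j) = MC(i-1, j-1) + 1 if P(i) == T(j), else MC(i-1, j-1)
            col[i] = prev[i - 1] + (1 if pattern[i - 1] == text[j - 1] else 0)
        if j >= n:
            counts.append(col[n])     # MC(n, j): alignment starting at j - n + 1
        prev = col
    return counts
```

Only two columns are held at a time, so the sketch uses O(n) space while performing Θ(nm) comparisons.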
![Page 72: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/72.jpg)
Define a vector W that counts the matching symbols; its indices are the possible alignments.
T = a b a b c a a a
P = a b c a
W(1) = 2, W(2) = 0, W(3) = 4, W(4) = 1, W(5) = 1
Match-count algorithm 2
![Page 73: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/73.jpg)
Let’s handle one symbol at a time:
T = a b a b c a a a
P = a b c a
Ta = 1 0 1 0 0 1 1 1
Pa = 1 0 0 1
Wa(1) = 1 (Pa aligned at position 1 of Ta)
Wa(3) = 2 (Pa aligned at position 3 of Ta)
Match-count algorithm 2
![Page 74: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/74.jpg)
We have W = Wa + Wb + Wc.
Or in the general case
$$W = \sum_{\alpha \in \Sigma} W_\alpha$$
Match-count algorithm 2
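The decomposition can be checked on the running example with a short Python sketch. It computes each Wα by direct dot products rather than by convolution, and the names `indicator` and `symbol_counts` are illustrative.

```python
def indicator(s, symbol):
    """0/1 vector marking the positions of `symbol` in s (e.g. Ta or Pa)."""
    return [1 if ch == symbol else 0 for ch in s]


def symbol_counts(text, pattern, symbol):
    """W_alpha as a list: entry a-1 is the count for P aligned at position a of T."""
    t, p = indicator(text, symbol), indicator(pattern, symbol)
    n, m = len(p), len(t)
    return [sum(t[a + i] * p[i] for i in range(n)) for a in range(m - n + 1)]


T, P = "ababcaaa", "abca"
Wa = symbol_counts(T, P, "a")
Wb = symbol_counts(T, P, "b")
Wc = symbol_counts(T, P, "c")
W = [x + y + z for x, y, z in zip(Wa, Wb, Wc)]
```

Here Wa, Wb and Wc add up entry by entry to the vector W = (2, 0, 4, 1, 1) of the earlier example.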
![Page 75: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/75.jpg)
We can calculate Wα using a convolution.
Let’s rephrase the problem. X = Tα padded with n zeros on the right.
Y = Pα padded with m zeros on the right.
We have two vectors X,Y of length m+n.
Match-count algorithm 2
![Page 76: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/76.jpg)
Ta = 1 0 1 0 0 1 1 1
Pa = 1 0 0 1
X = 1 0 1 0 0 1 1 1 0 0 0 0
Y = 1 0 0 1 0 0 0 0 0 0 0 0
Match-count algorithm 2
![Page 77: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/77.jpg)
In our modified representation:
$$W(i) = \sum_{j=0}^{n+m-1} X(j)\,Y(j-i+1)$$
Where the indices are taken modulo n+m.
W(1) = < 1 0 1 0 0 1 1 1 0 0 0 0,
1 0 0 1 0 0 0 0 0 0 0 0 >
W(2) = < 1 0 1 0 0 1 1 1 0 0 0 0,
0 1 0 0 1 0 0 0 0 0 0 0 >
Match-count algorithm 2
![Page 78: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/78.jpg)
In our modified representation:
$$W(i) = \sum_{j=0}^{n+m-1} X(j)\,Y(j-i+1)$$
Where the indices are taken modulo n+m.
This is the convolution of X and the reverse of Y.
Using the FFT, calculating the convolution takes time O(m log(m)).
Match-count algorithm 2
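A minimal sketch of computing Wα as the convolution of X with the reverse of Y, using a small recursive radix-2 FFT written with the standard `cmath` module. Two assumptions go beyond the slides: the vectors are padded to the next power of two at or above n+m (instead of exactly n+m) so that the radix-2 recursion applies, and results are stored in 0-based lists. The function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out  # the inverse is left unscaled; the caller divides by n


def symbol_counts_fft(text, pattern, symbol):
    """W_alpha for one symbol; entry a-1 is the count for alignment a."""
    n, m = len(pattern), len(text)
    size = 1
    while size < n + m:           # assumption: pad to a power of two >= n + m
        size *= 2
    x = [1.0 if ch == symbol else 0.0 for ch in text] + [0.0] * (size - m)
    # Reverse of Y: put P_alpha(i) at index (-i) mod size, so that the cyclic
    # convolution at index a equals sum_i X(a + i) * P_alpha(i).
    y_rev = [0.0] * size
    for i, ch in enumerate(pattern):
        if ch == symbol:
            y_rev[(-i) % size] = 1.0
    prod = [u * v for u, v in zip(fft(x), fft(y_rev))]
    conv = fft(prod, invert=True)
    return [round(conv[a].real / size) for a in range(m - n + 1)]


def match_counts_fft(text, pattern):
    """Sum W_alpha over the symbols of P (other symbols contribute nothing)."""
    total = [0] * (len(text) - len(pattern) + 1)
    for symbol in set(pattern):
        for a, c in enumerate(symbol_counts_fft(text, pattern, symbol)):
            total[a] += c
    return total
```

Rounding the real parts recovers the integer counts, since floating-point error in the transform is far below 0.5 for inputs of this size.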
![Page 79: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/79.jpg)
The total running time is O(|Σ| m log(m))
What happens if |Σ| is large?
For example when |Σ| = n, we get O(n m log(m)) which is actually worse than the naïve algorithm.
Match-count algorithm 2
![Page 80: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/80.jpg)
An idea: some symbols might appear more often than others.
Use convolutions for the frequent symbols. Use a simpler counting method for the rest.
Match-count algorithm 3
![Page 81: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/81.jpg)
Say α appears at most c times in P. Record the locations of α in P:
l1, …, lr with r ≤ c. Go over the text; when we see α at location j we
increment W(j-l1+1), …, W(j-lr+1).
Rare symbols
![Page 82: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/82.jpg)
T = a b a b c a a a…
P = a b c a c
α = c: l1 = 3, l2 = 5
j = 5 → W(5-3+1)++, W(5-5+1)++
→ W(3)++, W(1)++
Aligning P at position 3 puts P(3) = c under T(5); aligning P at position 1 puts P(5) = c under T(5).
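The sweep for rare symbols can be sketched in Python as follows. The function name `add_rare_symbol_counts` and the bounds check that skips alignments falling outside 1..m-n+1 are assumptions of this sketch. W is stored as a 0-based list, so W(a) lives at index a-1.

```python
from collections import defaultdict


def add_rare_symbol_counts(text, pattern, rare_symbols, W):
    """Add to W the matches contributed by the given rare symbols."""
    n, m = len(pattern), len(text)
    locations = defaultdict(list)          # symbol -> its 1-based locations in P
    for l, ch in enumerate(pattern, start=1):
        if ch in rare_symbols:
            locations[ch].append(l)
    for j, ch in enumerate(text, start=1):
        for l in locations.get(ch, ()):
            start = j - l + 1              # alignment that puts P(l) under T(j)
            if 1 <= start <= m - n + 1:    # assumption: ignore partial alignments
                W[start - 1] += 1
    return W
```

For T = ababcaaa, P = abcac and the rare symbol c, the single c at j = 5 increments W(3) and W(1), as in the example.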
Rare symbols
![Page 83: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/83.jpg)
We can do this for all the rare symbols in one sweep of T: for each position in T we make up to c updates in W.
Thus handling the rare symbols will cost us O(cm).
For the frequent symbols we pay one convolution per symbol. At most n/c symbols can be frequent, since each of them appears at least c times in a pattern of length n, so we pay at most O((n/c) m log(m)).
How much did we pay?
![Page 84: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/84.jpg)
We choose the c that gives us the best balance:
$$cm = \frac{n}{c}\, m \log(m) \;\Rightarrow\; c^{2} = n \log(m) \;\Rightarrow\; c = \sqrt{n \log(m)}$$
The total running time is
$$O\!\left(m \sqrt{n \log(m)}\right)$$
Determining c
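As a rough illustration of the threshold, the following Python sketch computes c and splits the symbols of P into frequent and rare ones. It assumes a base-2 logarithm (the slides do not fix the base) and treats a symbol as rare when it occurs at most c times in P. The name `split_symbols` is illustrative.

```python
import math
from collections import Counter


def split_symbols(pattern, m):
    """Return (c, frequent, rare) for a pattern and a text of length m."""
    n = len(pattern)
    c = math.sqrt(n * math.log2(m))    # assumption: log taken base 2
    freq = Counter(pattern)
    frequent = {s for s, cnt in freq.items() if cnt > c}   # one convolution each
    rare = set(freq) - frequent                            # handled in the sweep
    return c, frequent, rare


c, frequent, rare = split_symbols("aaaaaaab", 16)
```

For this pattern of length 8 and m = 16, c is the square root of 32 (about 5.66), so only the symbol a, which occurs 7 times, is handled by a convolution.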
![Page 85: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/85.jpg)
Dan Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge Univ. Press, Cambridge, 1997.
References
![Page 86: Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:](https://reader035.vdocuments.us/reader035/viewer/2022062718/56649eb35503460f94bbb20c/html5/thumbnails/86.jpg)
The end