cs5263 bioinformatics lecture 15 & 16 exact string matching algorithms
TRANSCRIPT
CS5263 Bioinformatics
Lecture 15 & 16
Exact String Matching Algorithms
Definitions
• Text: a longer string T• Pattern: a shorter string P• Exact matching: find all occurrence of P in T
abayababaxababb abayababaxababb
aba aba
T
P
length m
length n
The naïve algorithm
abayababaxababb abayababaxababb
aba aba
aba aba
aba aba
aba aba
aba aba
aba aba
aba aba
aba aba
Time complexity
• Worst case: O(mn)• Best case: O(m)
– aaaaaaaaaaaaaa vs baaaaaaa
• Average case?– Alphabet A, C, G, T– Assume both P and T are random– Equal probability– How many chars do you need to compare
before moving to the next position?
Average case time complexity
P(mismatch at 1st position): ¾P(mismatch at 2nd position): ¼ * ¾ P(mismatch at 3nd position): (¼)2 * ¾P(mismatch at kth position): (¼)k-1 * ¾Expected number of comparison per position:p = 1/4
k (1-p) p(k-1) k = (1-p) / p * k pk k = 1/(1-p) = 4/3
Average complexity: 4m/3Not as bad as you thought it might be
Biological sequences are not random
T: aaaaaaaaaaaaaaaaaaaaaaaaaP: aaaab
Plus: 4m/3 average case is still bad for long genomic sequences!
Especially if P is not in T…
Smarter algorithms:O(m + n) in worst casesub-linear in practice
String matching scenarios
• One T and one P– Search a word in a document
• One T and many P all at once– Search a set of words in a document– Spell checking
• One fixed T, many P– Search a completed genome for a short
sequence
• Two (or many) T’s for common patterns
How to speedup?
• Pre-processing T or P• Why pre-processing can save us time?
– Uncovers the structure of T or P– Determines when we can skip ahead without missing
anything– Determines when we can infer the result of character
comparisons without doing them.
ACGTAXACXTAXACGXAX
ACGTACA
Cost for exact string matching
Total cost = cost (preprocessing)
+ cost(comparison)
+ cost(output)
Constant
Minimize
Overhead
Hope: gain > overhead
Which string to preprocess?
• One T and one P– Preprocessing P?
• One T and many P all at once– Preprocessing P or T?
• One fixed T, many P (unknown)– Preprocessing T?
• Two (or many) T’s for common patterns– ???
Pattern pre-processing algs
– Karp – Rabin algorithm• Small alphabet and small pattern
– Boyer – Moore algorithm• the choice of most cases• Typically sub-linear time
– Knuth-Morris-Pratt algorithm (KMP)• grep
– Aho-Corasick algorithm• fgrep
Karp – Rabin Algorithm
• Let’s say we are dealing with binary numbersText: 01010001011001010101001
Pattern: 101100
• Convert pattern to integer101100 = 2^5 + 2^3 + 2^2 = 44
Karp – Rabin algorithm
Text: 01010001011001010101001Pattern: 101100 = 44 decimal
10111011001010101001= 2^5 + 2^3 + 2^2 + 2^1 = 4610111011001010101001= 46 * 2 – 64 + 1 = 2910111011001010101001= 29 * 2 - 0 + 1 = 5910111011001010101001= 59 * 2 - 64 + 0 = 5410111011001010101001= 54 * 2 - 64 + 0 = 44
Karp – Rabin algorithmWhat if the pattern is too long to fit into a single integer? Pattern: 101100. But our machine only has 5 bitsBasic idea: hashing. 44 % 13 = 5
10111011001010101001= 46 (% 13 = 7)10111011001010101001= 46 * 2 – 64 + 1 = 29 (% 13 = 3)10111011001010101001= 29 * 2 - 0 + 1 = 59 (% 13 = 7)10111011001010101001= 59 * 2 - 64 + 0 = 54 (% 13 = 2)10111011001010101001= 54 * 2 - 64 + 0 = 44 (% 13 = 5)
Boyer – Moore algorithm
• Three ideas:– Right-to-left comparison– Bad character rule– Good suffix rule
Boyer – Moore algorithm
• Right to left comparison
x
y
y
Skip some chars without missing any occurrence.
But how?
Bad character rule
0 1 12345678901234567T:xpbctbxabpqqaabpqP: tpabxab *^^^^What would you do now?
Bad character rule
0 1 12345678901234567T:xpbctbxabpqqaabpqP: tpabxab *^^^^P: tpabxab
Bad character rule
0 1 123456789012345678T:xpbctbxabpqqaabpqzP: tpabxab *^^^^P: tpabxab *P: tpabxab
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
tpabxab
Pre-processing:O(n)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqzP: tpabxab
*^^^^
P: tpabxab
When rightmost T(k) in P is left to i, shift pattern P to align T(k) with the rightmost T(k) in P
k
i = 3 Shift 3 – 1 = 2
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqzP: tpabxab *
P: tpabxab
When T(k) is not in P, shift left end of P to align with T(k+1)
k
i = 7 Shift 7 – 0 = 7
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P: tpabxab *^^
P: tpabxab
When rightmost T(k) in P is right to i, shift pattern P one pos
k
i = 5 5 – 6 < 0. so shift 1
Extended bad character rule
char Position in P
a 6, 3
b 7, 4
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P: tpabxab *^^
P: tpabxab
Find T(k) in P that is immediately left to i, shift P to align T(k) with that position
k
i = 5 5 – 3 = 2. so shift 2
Preprocessing still O(n)
Extended bad character rule
• Best possible: m / n comparisons
• Works better for large alphabet size
• In some cases the extended bad character rule is sufficiently good
• Worst-case: O(mn)
0 1 123456789012345678T:prstabstubabvqxrstP: qcabdabdab *^^
P: qcabdabdab
According to extended bad character rule
(weak) good suffix rule
0 1 123456789012345678T:prstabstubabvqxrstP: qcabdabdab *^^
P: qcabdabdab
(Weak) good suffix rule
tx
tyt’
tyt’
In preprocessing: For any suffix t of P, find the rightmost copy of t, t’, t ≠ t’
T
P
P
(Strong) good suffix rule
0 1 123456789012345678T:prstabstubabvqxrstP: qcabdabdab *^^
(Strong) good suffix rule
0 1 123456789012345678T:prstabstubabvqxrstP: qcabdabdab *^^
P: qcabdabdab
(Strong) good suffix rule
• Pre-processing can be done in linear time• If P in T, may take O(mn)• If P not in T, worst-case O(m+n)
tx
tyt’
tyt’
In preprocessing: For any suffix t of P, find the rightmost copy of t, t’, t ≠ t’, and the char left to t ≠ the char left to t’
T
P
P
z
z
Lessons From B-M
• Sub-linear time is possible– But we still need to read T from disk!
• Bad cases require periodicity in P or T– matching random P with T is easy!
• Large alphabets mean large shifts• Small alphabets make complicated shift
data-structures possible• B-M better for “english” and amino-acids
than for DNA.
Algorithm KMP
• Not the fastest
• Best known
• Good for multiple pattern matching and real-time matching
• Idea– Left-to-right comparison– Shift P more chars when possible
Basic idea
tt’P
t xT
y
tt’P y
z
z
In pre-processing: for any position i in P, find the longest proper suffix of P, t = P[j+1..i], such that t matches to a prefix of P, t’, and the next char of t is different from the next char of t’, i.e., P[i+1] != P[i-j+1].Sp’(i) = length(t)
Example
P: aataac
a a t a a c
Sp’(i) 0 1 0 0 2 0
aaat
aataac
Failure link
P: aataac
a a t a a c
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If a char in T fails to match at pos 6, re-compare it with the
char at pos 3
FSA
P: aataac
1 2 3 4 50a a t a a c
6
a
t
All other input goes to state 0
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If the next char in T is t, we go to state 3
Another example
P: abababc
a b a b a b c
Sp’(i) 0 0 0 0 0 4 0
abab
abababab
ababaababc
Failure link
P: abababc
a b a b a b c
Sp’(i) 0 0 0 0 0 4 0
ababaababc
If a char in T fails to match at pos 7, re-compare it with
the char at pos 5
FSA
P: abababc
1 2 3 4 5 6
Sp’(i) 0 0 0 0 0 4 0
ababaababc
If the next char in T is a, go to state 5
0a b a b a c
7b
a
All other input goes to state 0
Difference between Failure Link and FSA?
• Failure link– Preprocessing time and space are O(n),
regardless of alphabet size– Comparison time is at most 2m
• FSA– Preprocessing time and space are O(n ||)
• May be a problem for very large alphabet size
– Comparison time is always m.
Failure link
P: aataac
a a t a a c
Sp’(i) 0 1 0 0 2 0
aaat
aataac
If a char in T fails to match at pos 6, re-compare it with the
char at pos 3
Example
a a t a a c
aataac^^*
T: aacaataaaaataaccttacta
aataac.*aataac^^^^^*
aataac..*aataac.^^^^^
Each char in T may be compared multiple times. Up to n.
Time complexity: O(2m).
Comparison phase and shift phase. Comparison is bounded by m, shift is also bounded by m.
Example
T: aacaataaaaataaccttacta
Each char in T will be examined exactly once.
Therefore, exact m comparisons are needed.
Takes longer to do pre-processing.
1 2 3 4 50a a t a a c
6
a t
1201234501234560001001
How to do pre-processing?