cs5263 bioinformatics lecture 15 & 16 exact string matching algorithms

CS5263 Bioinformatics

Lecture 15 & 16

Exact String Matching Algorithms

Definitions

• Text: a longer string T
• Pattern: a shorter string P
• Exact matching: find all occurrences of P in T

T: abayababaxababb (length m)

P: aba (length n)

The naïve algorithm

T: abayababaxababb

P: aba

Align P with every position of T in turn, and compare left to right until a mismatch or a full match.
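A minimal Python sketch of the naïve algorithm, assuming 1-based reporting of match positions as on the slides. The function name `naive_search` is illustrative only.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of pattern in text."""
    m, n = len(text), len(pattern)
    hits = []
    for start in range(m - n + 1):
        k = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while k < n and text[start + k] == pattern[k]:
            k += 1
        if k == n:
            hits.append(start + 1)
    return hits


print(naive_search("abayababaxababb", "aba"))  # [1, 5, 7, 11]
```

Every alignment is tried, so nothing learned from one alignment is reused in the next.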

Time complexity

• Worst case: O(mn)
• Best case: O(m)
  – e.g., aaaaaaaaaaaaaa vs baaaaaaa
• Average case?
  – Alphabet A, C, G, T
  – Assume both P and T are random
  – Equal probability
  – How many chars do you need to compare before moving to the next position?

Average case time complexity

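P(mismatch at 1st position): $\frac{3}{4}$
P(mismatch at 2nd position): $\frac{1}{4} \cdot \frac{3}{4}$
P(mismatch at 3rd position): $\left(\frac{1}{4}\right)^{2} \cdot \frac{3}{4}$
P(mismatch at kth position): $\left(\frac{1}{4}\right)^{k-1} \cdot \frac{3}{4}$

Expected number of comparisons per position, with $p = 1/4$:

$$\sum_{k \ge 1} k\,(1-p)\,p^{k-1} = \frac{1-p}{p} \sum_{k \ge 1} k\,p^{k} = \frac{1}{1-p} = \frac{4}{3}$$

Average complexity: 4m/3
Not as bad as you thought it might be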
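The expected value above can be checked numerically. This is a small sketch that sums the series up to an arbitrary cutoff; the function name `expected_comparisons` and the cutoff `max_k` are illustrative assumptions.

```python
def expected_comparisons(p: float, max_k: int = 200) -> float:
    """Partial sum of k * (1 - p) * p**(k - 1) for k = 1..max_k."""
    return sum(k * (1 - p) * p ** (k - 1) for k in range(1, max_k + 1))


# With a 4-letter alphabet and equal probabilities, p = 1/4.
print(expected_comparisons(0.25))
```

The terms shrink geometrically, so the partial sum is already very close to 1/(1-p) = 4/3.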

Biological sequences are not random

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab

Plus: 4m/3 average case is still bad for long genomic sequences!

Especially if P is not in T…

Smarter algorithms:
• O(m + n) in the worst case
• Sub-linear in practice

String matching scenarios

• One T and one P
  – Search a word in a document
• One T and many P all at once
  – Search a set of words in a document
  – Spell checking
• One fixed T, many P
  – Search a completed genome for a short sequence
• Two (or many) T’s for common patterns

How to speed up?

• Pre-processing T or P
• Why can pre-processing save us time?
  – Uncovers the structure of T or P
  – Determines when we can skip ahead without missing anything
  – Determines when we can infer the result of character comparisons without doing them

ACGTAXACXTAXACGXAX

ACGTACA

Cost for exact string matching

Total cost = cost(preprocessing)   (overhead)
           + cost(comparison)      (minimize)
           + cost(output)          (constant)

Hope: gain > overhead

Which string to preprocess?

• One T and one P
  – Preprocessing P?
• One T and many P all at once
  – Preprocessing P or T?
• One fixed T, many P (unknown)
  – Preprocessing T?
• Two (or many) T’s for common patterns
  – ???

Pattern pre-processing algorithms

– Karp – Rabin algorithm
  • Small alphabet and small pattern
– Boyer – Moore algorithm
  • The choice in most cases
  • Typically sub-linear time
– Knuth-Morris-Pratt algorithm (KMP)
  • grep
– Aho-Corasick algorithm
  • fgrep

Karp – Rabin Algorithm

• Let’s say we are dealing with binary numbers
  Text: 01010001011001010101001
  Pattern: 101100
• Convert pattern to integer
  101100 = 2^5 + 2^3 + 2^2 = 44

Karp – Rabin algorithm

Text: 01010001011001010101001
Pattern: 101100 = 44 decimal

Sliding a 6-bit window over 10111011001010101001, one position at a time:
101110 = 2^5 + 2^3 + 2^2 + 2^1 = 46
011101 = 46 * 2 – 64 + 1 = 29
111011 = 29 * 2 – 0 + 1 = 59
110110 = 59 * 2 – 64 + 0 = 54
101100 = 54 * 2 – 64 + 0 = 44

Karp – Rabin algorithm

What if the pattern is too long to fit into a single integer?
Pattern: 101100, but our machine only has 5 bits.
Basic idea: hashing. 44 % 13 = 5

Sliding a 6-bit window over 10111011001010101001:
101110 = 46 (% 13 = 7)
011101 = 46 * 2 – 64 + 1 = 29 (% 13 = 3)
111011 = 29 * 2 – 0 + 1 = 59 (% 13 = 7)
110110 = 59 * 2 – 64 + 0 = 54 (% 13 = 2)
101100 = 54 * 2 – 64 + 0 = 44 (% 13 = 5)
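The rolling update above can be sketched in Python as follows. This assumes a binary text (characters `0` and `1`) and the modulus 13 from the slide. The function name `karp_rabin` and the explicit check after each hash hit are illustrative choices, since equal hash values do not by themselves prove a match.

```python
def karp_rabin(text: str, pattern: str, q: int = 13) -> list[int]:
    """1-based start positions of pattern in a binary text, via a rolling hash mod q."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    high = pow(2, n - 1, q)           # weight of the leading bit, reduced mod q
    target = int(pattern, 2) % q      # e.g. 101100 -> 44 -> 44 % 13 = 5
    h = int(text[:n], 2) % q          # hash of the first window
    hits = []
    for start in range(m - n + 1):
        # Equal hashes may be a collision, so confirm with a direct comparison.
        if h == target and text[start:start + n] == pattern:
            hits.append(start + 1)
        if start + n < m:
            # Drop the leading bit, shift left by one, append the next bit.
            h = ((h - int(text[start]) * high) * 2 + int(text[start + n])) % q
    return hits


print(karp_rabin("10111011001010101001", "101100"))  # [5]
```

All arithmetic stays below 2q, so the window value never has to fit in a machine word.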

Boyer – Moore algorithm

• Three ideas:
  – Right-to-left comparison
  – Bad character rule
  – Good suffix rule

Boyer – Moore algorithm

• Right to left comparison

Skip some chars without missing any occurrence.

But how?

Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
What would you do now?

Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab

Bad character rule

   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab

Basic bad character rule

P: tpabxab

char   Right-most position in P
a      6
b      7
p      2
t      1
x      5

Pre-processing: O(n)

T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab

When the rightmost occurrence of T(k) in P is to the left of the mismatch position i, shift P to align that occurrence with T(k).

i = 3, rightmost t in P is at position 1: shift 3 – 1 = 2

T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab

When T(k) does not occur in P, shift the left end of P to align with T(k+1).

i = 7, q does not occur in P (position 0): shift 7 – 0 = 7

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:          tpabxab

When the rightmost occurrence of T(k) in P is to the right of i, shift P by one position.

i = 5, rightmost a in P is at position 6: 5 – 6 < 0, so shift 1
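The three cases above reduce to one formula, shift = max(1, i – R(c)), where R(c) is the right-most position of the mismatching text character c in P (0 if absent). The sketch below applies only this basic rule, without the good suffix rule. The function names are illustrative.

```python
def rightmost_positions(pattern: str) -> dict[str, int]:
    """Map each character to its right-most 1-based position in pattern."""
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(right: dict[str, int], i: int, text_char: str) -> int:
    """Shift after a mismatch at 1-based pattern position i against text_char."""
    return max(1, i - right.get(text_char, 0))


def bad_character_search(text: str, pattern: str) -> list[int]:
    """1-based match positions using right-to-left comparison and the basic rule only."""
    m, n = len(text), len(pattern)
    right = rightmost_positions(pattern)
    hits, start = [], 0
    while start <= m - n:
        i = n
        # Compare right to left within the current alignment.
        while i >= 1 and pattern[i - 1] == text[start + i - 1]:
            i -= 1
        if i == 0:
            hits.append(start + 1)
            start += 1
        else:
            start += bad_character_shift(right, i, text[start + i - 1])
    return hits


right = rightmost_positions("tpabxab")
print(bad_character_shift(right, 3, "t"))  # 2
print(bad_character_shift(right, 7, "q"))  # 7
print(bad_character_shift(right, 5, "a"))  # 1
```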

Extended bad character rule

char Position in P

a 6, 3

b 7, 4

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab

Find the occurrence of T(k) in P that is closest to the left of i, and shift P to align that occurrence with T(k).

i = 5, nearest a to the left of i is at position 3: 5 – 3 = 2, so shift 2

Preprocessing is still O(n)
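One way to sketch the extended rule is to store, for each character, all of its positions in P from right to left, and scan that list for the first position left of i. The names `position_lists` and `extended_bad_character_shift` are illustrative.

```python
from collections import defaultdict


def position_lists(pattern: str) -> dict[str, list[int]]:
    """Map each character to its 1-based positions in pattern, right-most first."""
    positions = defaultdict(list)
    for pos in range(len(pattern), 0, -1):
        positions[pattern[pos - 1]].append(pos)
    return positions


def extended_bad_character_shift(positions: dict[str, list[int]], i: int, text_char: str) -> int:
    """Shift after a mismatch at 1-based pattern position i against text_char."""
    for pos in positions.get(text_char, []):
        if pos < i:
            return i - pos      # align the nearest occurrence left of i
    return i                    # no occurrence left of i: move P past T(k)


positions = position_lists("tpabxab")
print(positions["a"], positions["b"])                   # [6, 3] [7, 4]
print(extended_bad_character_shift(positions, 5, "a"))  # 2
```

The list scan costs time proportional to the number of characters already matched in the current alignment, so it does not change the overall bound.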

Extended bad character rule

• Best possible: m / n comparisons

• Works better for large alphabet size

• In some cases the extended bad character rule is sufficiently good

• Worst-case: O(mn)

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^

According to extended bad character rule (shift 1):
P:    qcabdabdab

(Weak) good suffix rule (shift 3):
P:      qcabdabdab

(Weak) good suffix rule

In preprocessing: for each suffix t of P, find the rightmost copy t’ of t in P that is not the suffix t itself.

(Strong) good suffix rule

P: qcabdabdab

• Pre-processing can be done in linear time
• If P is in T, searching may take O(mn)
• If P is not in T, the worst case is O(m + n)

In preprocessing: for each suffix t of P, find the rightmost copy t’ of t in P that is not the suffix t itself, such that the char to the left of t’ differs from the char to the left of t.
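To make the two definitions concrete, here is a brute-force sketch that computes the good suffix shift for a single mismatch position directly from the definition. It is not the linear-time preprocessing mentioned above. The function name, the `strong` flag, and the fallback to the longest prefix of P that is a suffix of t are assumptions added for illustration.

```python
def good_suffix_shift(pattern: str, i: int, strong: bool = True) -> int:
    """Shift after a mismatch at 1-based position i, when P[i+1..n] (= t) has matched."""
    n = len(pattern)
    t = pattern[i:]                      # matched suffix t
    if not t:
        return 1                         # nothing matched: the rule gives no information
    # Try candidate copies t' from right to left; `end` is the 1-based
    # position in P of the last character of t'.
    for end in range(n - 1, len(t) - 1, -1):
        start = end - len(t)             # 0-based index of the first char of t'
        if pattern[start:end] != t:
            continue
        if strong and start > 0 and pattern[start - 1] == pattern[i - 1]:
            continue                     # same preceding char would mismatch again
        return n - end
    # No usable copy of t: align the longest prefix of P that is a suffix of t.
    for k in range(len(t) - 1, 0, -1):
        if pattern[:k] == t[-k:]:
            return n - k
    return n


print(good_suffix_shift("qcabdabdab", 8, strong=False))  # 3
print(good_suffix_shift("qcabdabdab", 8, strong=True))   # 6
```

For P = qcabdabdab with t = ab, the copy at positions 6-7 is preceded by d, the same char that just mismatched, so the strong rule skips it and uses the copy at positions 3-4.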

Lessons From B-M

• Sub-linear time is possible
  – But we still need to read T from disk!
• Bad cases require periodicity in P or T
  – Matching random P with T is easy!
• Large alphabets mean large shifts
• Small alphabets make complicated shift data structures possible
• B-M is better for “English” and amino acids than for DNA

Algorithm KMP

• Not the fastest

• Best known

• Good for multiple pattern matching and real-time matching

• Idea
  – Left-to-right comparison
  – Shift P more chars when possible

Basic idea

[Figure: P before and after the shift, with the suffix t of the matched part aligned to the equal prefix t’ of P]

In pre-processing: for each position i in P, find the longest proper suffix t = P[j+1..i] of P[1..i] such that t matches a prefix t’ of P, and the char following t differs from the char following t’, i.e., P[i+1] != P[i-j+1].

Sp’(i) = length(t)

Example

P: aataac

i      1 2 3 4 5 6
P      a a t a a c
Sp’(i) 0 1 0 0 2 0

aataac
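The Sp’ values can be reproduced by applying the definition literally. This is a brute-force sketch for checking small examples by hand, not an efficient preprocessing method. The function name `sp_prime` is illustrative, and at i = n, where P[i+1] does not exist, it assumes that any matching prefix is accepted.

```python
def sp_prime(pattern: str) -> list[int]:
    """Return [Sp'(1), ..., Sp'(n)] computed directly from the definition."""
    n = len(pattern)
    values = []
    for i in range(1, n + 1):
        best = 0
        # Proper suffixes of P[1..i], longest first.
        for length in range(i - 1, 0, -1):
            if pattern[i - length:i] != pattern[:length]:
                continue                      # t must equal the prefix t'
            # The chars after t and t' must differ: P[i+1] vs P[length+1].
            if i == n or pattern[i] != pattern[length]:
                best = length
                break
        values.append(best)
    return values


print(sp_prime("aataac"))   # [0, 1, 0, 0, 2, 0]
print(sp_prime("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```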

Failure link

P: aataac

i      1 2 3 4 5 6
P      a a t a a c
Sp’(i) 0 1 0 0 2 0

aataac

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (since Sp’(5) = 2).

P: aataac

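0 -a-> 1 -a-> 2 -t-> 3 -a-> 4 -a-> 5 -c-> 6 (match)

Sp’(i) 0 1 0 0 2 0

aataac

After a mismatch at pos 6 (state 5): if the next char in T is t, we go to state 3.

Any other input goes to the state of the longest prefix of P that is still matched, which is state 0 in most cases.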

Another example

P: abababc

i      1 2 3 4 5 6 7
P      a b a b a b c
Sp’(i) 0 0 0 0 0 4 0

abababab

ababaababc

Failure link

P: abababc

i      1 2 3 4 5 6 7
P      a b a b a b c
Sp’(i) 0 0 0 0 0 4 0

ababaababc

If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (since Sp’(6) = 4).

P: abababc

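0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -b-> 6 -c-> 7 (match)

Sp’(i) 0 0 0 0 0 4 0

ababaababc

After a mismatch at pos 7 (state 6): if the next char in T is a, go to state 5.

Any other input goes to the state of the longest prefix of P that is still matched, which is state 0 in most cases.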

Difference between Failure Link and FSA?

• Failure link
  – Preprocessing time and space are O(n), regardless of alphabet size
  – Comparison time is at most 2m
• FSA
  – Preprocessing time and space are O(n |Σ|)
    • May be a problem for very large alphabet size
  – Comparison time is always m

Failure link

P: aataac

i      1 2 3 4 5 6
P      a a t a a c
Sp’(i) 0 1 0 0 2 0

aataac

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (since Sp’(5) = 2).

Example

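P: aataac

Legend: ^ match, * mismatch, . known to match without comparison. Some intermediate alignments are not shown.

T: aacaataaaaataaccttacta
P: aataac
   ^^*
P:  aataac
    .*
P:    aataac
      ^^^^^*
P:       aataac
         ..*
P:          aataac
            .^^^^^

Each char in T may be compared multiple times. Up to n.

Time complexity: at most 2m comparisons, i.e., O(m).

Comparison phase and shift phase. Comparison is bounded by m, shift is also bounded by m.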
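The search with failure links can be sketched as below. The preprocessing shown is one common linear-time way to get Sp’ (first the plain border lengths, then the strengthening step), offered as an assumption and not necessarily the method this lecture goes on to present. The function name `kmp_search` and the comparison counter are illustrative.

```python
def kmp_search(text: str, pattern: str) -> tuple[list[int], int]:
    """Return (1-based match positions, number of character comparisons). Needs len(pattern) >= 1."""
    n = len(pattern)
    # sp[i]: length of the longest proper suffix of P[1..i] that is a prefix of P.
    sp = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and pattern[i - 1] != pattern[k]:
            k = sp[k]
        if pattern[i - 1] == pattern[k]:
            k += 1
        sp[i] = k
    # spp[i] = Sp'(i): additionally require P[i+1] != P[Sp'(i)+1].
    spp = [0] * (n + 1)
    for i in range(1, n):
        spp[i] = sp[i] if pattern[sp[i]] != pattern[i] else spp[sp[i]]
    spp[n] = sp[n]

    hits, comparisons, matched = [], 0, 0
    for pos, ch in enumerate(text, start=1):
        while True:
            comparisons += 1
            if pattern[matched] == ch:
                matched += 1
                break
            if matched == 0:
                break
            matched = spp[matched]       # follow the failure link and re-compare
        if matched == n:
            hits.append(pos - n + 1)
            matched = spp[n]
    return hits, comparisons


hits, comparisons = kmp_search("aacaataaaaataaccttacta", "aataac")
print(hits)  # [10]
```

Each comparison either advances in T or shifts P to the right, which is where the 2m bound comes from.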

Example

T: aacaataaaaataaccttacta

Each char in T will be examined exactly once.

Therefore, exactly m comparisons are needed.

Takes longer to do pre-processing.

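0 -a-> 1 -a-> 2 -t-> 3 -a-> 4 -a-> 5 -c-> 6 (match)

State after each char of T (transitions not drawn go to the longest prefix of P still matched, e.g., state 5 on a goes to state 2):

T:     a a c a a t a a a a a t a a c c t t a c t a
State: 1 2 0 1 2 3 4 5 2 2 2 3 4 5 6 0 0 0 1 0 0 1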
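A complete automaton can be sketched by brute force: from each state and for each input char, go to the longest prefix of P that is a suffix of what has been read. This construction is far slower than O(n |Σ|) and is only meant for checking small examples. The function names and the alphabet string `"acgt"` are illustrative assumptions.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """dfa[state][char] = next state; state s means P[1..s] is currently matched."""
    n = len(pattern)
    dfa = []
    for state in range(n + 1):
        row = {}
        for c in alphabet:
            candidate = pattern[:state] + c
            # Longest prefix of P that is a suffix of candidate.
            k = min(n, len(candidate))
            while k > 0 and candidate[-k:] != pattern[:k]:
                k -= 1
            row[c] = k
        dfa.append(row)
    return dfa


def run_dfa(dfa: list[dict[str, int]], text: str) -> list[int]:
    """Return the state reached after each character of text."""
    state, trace = 0, []
    for ch in text:
        state = dfa[state][ch]           # exactly one table lookup per text char
        trace.append(state)
    return trace


dfa = build_dfa("aataac", "acgt")
trace = run_dfa(dfa, "aacaataaaaataaccttacta")
print(trace.index(6) + 1)  # 15: the match ends at position 15 of T
```

In this complete automaton, state 5 on input a goes to state 2, because the last two chars read, aa, are still a prefix of P.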

How to do pre-processing?
