CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Upload: samson-howard-riley

Post on 18-Jan-2016


Page 1: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

CS5263 Bioinformatics

Lecture 15 & 16

Exact String Matching Algorithms

Page 2: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Definitions

• Text: a longer string T (length m)
• Pattern: a shorter string P (length n)
• Exact matching: find all occurrences of P in T

T: abayababaxababb
P: aba

Page 3: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

The naïve algorithm

T: abayababaxababb
P: aba

[Figure: P is slid along T one position at a time and compared with T at every alignment.]
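The sliding comparison on this slide can be sketched as follows. This is a minimal illustration rather than code from the lecture; the function name `naive_find_all` is made up here, and positions are reported 1-based to match the slides.

```python
def naive_find_all(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of pattern in text."""
    m, n = len(text), len(pattern)  # lecture convention: |T| = m, |P| = n
    hits = []
    for start in range(m - n + 1):
        k = 0
        # Compare left to right and stop at the first mismatch.
        while k < n and text[start + k] == pattern[k]:
            k += 1
        if k == n:
            hits.append(start + 1)
    return hits


print(naive_find_all("abayababaxababb", "aba"))  # [1, 5, 7, 11]
```

Every alignment is tried, so nothing is ever skipped; that is what the later algorithms improve on.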

Page 4: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Time complexity

• Worst case: O(mn)
• Best case: O(m)
  – aaaaaaaaaaaaaa vs baaaaaaa
• Average case?
  – Alphabet A, C, G, T
  – Assume both P and T are random
  – Equal probability
  – How many chars do you need to compare before moving to the next position?

Page 5: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Average case time complexity

P(first mismatch at 1st position): ¾
P(first mismatch at 2nd position): ¼ * ¾
P(first mismatch at 3rd position): (¼)^2 * ¾
P(first mismatch at kth position): (¼)^(k-1) * ¾

Expected number of comparisons per position, with p = 1/4:

$$\sum_{k \ge 1} k\,(1-p)\,p^{k-1} = \frac{1-p}{p} \sum_{k \ge 1} k\,p^{k} = \frac{1-p}{p} \cdot \frac{p}{(1-p)^{2}} = \frac{1}{1-p} = \frac{4}{3}$$

Average complexity: 4m/3
Not as bad as you thought it might be
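The expected value can be checked numerically. The sketch below sums the series with exact fractions; it assumes, as the slide does, that the pattern is long enough for the cap at n comparisons to be ignored. The name `expected_comparisons` and the cutoff `max_k` are illustrative choices.

```python
from fractions import Fraction


def expected_comparisons(p: Fraction, max_k: int = 200) -> Fraction:
    """Truncated series: sum of k * (1 - p) * p**(k - 1) for k = 1..max_k."""
    return sum(k * (1 - p) * p ** (k - 1) for k in range(1, max_k + 1))


p = Fraction(1, 4)             # probability that two random DNA chars match
approx = expected_comparisons(p)
closed_form = 1 / (1 - p)      # the closed form 1/(1-p)
print(float(approx), float(closed_form))
```

The truncated sum agrees with 4/3 to machine precision because the terms shrink geometrically.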

Page 6: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Biological sequences are not random

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab

Plus: 4m/3 average case is still bad for long genomic sequences!

Especially if P is not in T…

Smarter algorithms:
• O(m + n) in worst case
• sub-linear in practice

Page 7: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

String matching scenarios

• One T and one P
  – Search a word in a document
• One T and many P all at once
  – Search a set of words in a document
  – Spell checking
• One fixed T, many P
  – Search a completed genome for a short sequence
• Two (or many) T’s for common patterns

Page 8: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

How to speed up?

• Pre-processing T or P
• Why can pre-processing save us time?
  – Uncovers the structure of T or P
  – Determines when we can skip ahead without missing anything
  – Determines when we can infer the result of character comparisons without doing them.

T: ACGTAXACXTAXACGXAX
P: ACGTACA

Page 9: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Cost for exact string matching

Total cost = cost(preprocessing)   [overhead]
           + cost(comparison)      [minimize]
           + cost(output)          [constant]

Hope: gain > overhead

Page 10: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Which string to preprocess?

• One T and one P
  – Preprocessing P?
• One T and many P all at once
  – Preprocessing P or T?
• One fixed T, many P (unknown)
  – Preprocessing T?
• Two (or many) T’s for common patterns
  – ???

Page 11: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Pattern pre-processing algs

– Karp – Rabin algorithm
  • Small alphabet and small pattern
– Boyer – Moore algorithm
  • The choice in most cases
  • Typically sub-linear time
– Knuth-Morris-Pratt algorithm (KMP)
  • grep
– Aho-Corasick algorithm
  • fgrep

Page 12: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Karp – Rabin Algorithm

• Let’s say we are dealing with binary numbers
  Text: 01010001011001010101001
  Pattern: 101100
• Convert pattern to integer
  101100 = 2^5 + 2^3 + 2^2 = 44

Page 13: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Karp – Rabin algorithm

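Text: 01010001011001010101001
Pattern: 101100 = 44 decimal

Sliding a 6-bit window (in brackets) over the string 10111011001010101001, one position per step. Each step doubles the previous value, subtracts 64 if the bit leaving the window is 1, and adds the bit entering it:

[101110]11001010101001 = 2^5 + 2^3 + 2^2 + 2^1 = 46
1[011101]1001010101001 = 46 * 2 – 64 + 1 = 29
10[111011]001010101001 = 29 * 2 – 0 + 1 = 59
101[110110]01010101001 = 59 * 2 – 64 + 0 = 54
1011[101100]1010101001 = 54 * 2 – 64 + 0 = 44 (equals the pattern value: match)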

Page 14: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Karp – Rabin algorithm

What if the pattern is too long to fit into a single integer?
Pattern: 101100. But our machine only has 5 bits.
Basic idea: hashing. 44 % 13 = 5

[101110]11001010101001 = 46 (% 13 = 7)
1[011101]1001010101001 = 46 * 2 – 64 + 1 = 29 (% 13 = 3)
10[111011]001010101001 = 29 * 2 – 0 + 1 = 59 (% 13 = 7)
101[110110]01010101001 = 59 * 2 – 64 + 0 = 54 (% 13 = 2)
1011[101100]1010101001 = 54 * 2 – 64 + 0 = 44 (% 13 = 5)
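The rolling update and the hashing idea can be sketched together as below. This is an illustration under stated assumptions: the input is a string of '0'/'1' characters, the modulus 13 is the one used on the slide, and the explicit string comparison after a hash hit is added here to guard against collisions (the slides do not show that step). The function name `karp_rabin_binary` is made up.

```python
def karp_rabin_binary(text: str, pattern: str, q: int = 13) -> list[int]:
    """Return 1-based start positions of pattern in a string of '0'/'1' characters."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    high = pow(2, n - 1, q)          # weight (mod q) of the bit that leaves the window
    hp = ht = 0
    for k in range(n):               # hash of P and of the first window of T
        hp = (2 * hp + int(pattern[k])) % q
        ht = (2 * ht + int(text[k])) % q
    hits = []
    for s in range(m - n + 1):
        # Equal hashes only suggest a match; verify to rule out collisions.
        if ht == hp and text[s:s + n] == pattern:
            hits.append(s + 1)
        if s + n < m:                # roll the window one position to the right
            ht = (2 * (ht - int(text[s]) * high) + int(text[s + n])) % q
    return hits


print(karp_rabin_binary("10111011001010101001", "101100"))  # [5]
```

Working modulo q at every step keeps all intermediate values small, which is the point of the "machine with only 5 bits" remark.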

Page 15: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Boyer – Moore algorithm

• Three ideas:
  – Right-to-left comparison
  – Bad character rule
  – Good suffix rule

Page 16: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Boyer – Moore algorithm

• Right to left comparison

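[Figure: P is aligned under T and compared from its right end; after a mismatch between x in T and y in P, P is shifted right by several positions.]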

Skip some chars without missing any occurrence.

But how?

Page 17: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^

What would you do now?

Page 18: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab

Page 19: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Bad character rule

   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab

Page 20: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Basic bad character rule

P: tpabxab

char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

Pre-processing: O(n)

Page 21: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Basic bad character rule

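char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab

The * marks T(k), here t at k = 5, compared against position i = 3 of P.

When the rightmost occurrence of T(k) in P is to the left of i, shift P to align T(k) with that rightmost occurrence.

i = 3, rightmost t in P is at position 1: shift 3 – 1 = 2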

Page 22: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Basic bad character rule

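char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab

The * marks T(k), here q at k = 11, compared against position i = 7 of P.

When T(k) does not occur in P, shift the left end of P to align with T(k+1).

i = 7, q is not in P (position 0): shift 7 – 0 = 7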

Page 23: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Basic bad character rule

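char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:          tpabxab

The * marks T(k), here a at k = 13, compared against position i = 5 of P.

When the rightmost occurrence of T(k) in P is to the right of i, shift P by one position.

i = 5, rightmost a in P is at position 6: 5 – 6 < 0, so shift 1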

Page 24: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Extended bad character rule

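char  Positions in P (right to left)
a     6, 3
b     7, 4
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab

The * marks T(k), here a at k = 13, compared against position i = 5 of P.

Find the occurrence of T(k) in P that is closest to the left of i, and shift P to align T(k) with that position.

i = 5, closest a to the left is at position 3: 5 – 3 = 2, so shift 2

Preprocessing still O(n)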

Page 25: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Extended bad character rule

• Best possible: m / n comparisons

• Works better for large alphabet size

• In some cases the extended bad character rule is sufficiently good

• Worst-case: O(mn)
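A matcher that uses only right-to-left comparison and the extended bad character rule might look like the sketch below. It is an illustration, not the lecture's code: the name `bad_char_search` is invented, the position lists are searched with `bisect` for brevity, and after a full match the pattern simply moves one step because the good suffix rule is not used here.

```python
from bisect import bisect_left
from collections import defaultdict


def bad_char_search(text: str, pattern: str) -> list[int]:
    """Boyer-Moore style search using only the extended bad character rule."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    positions = defaultdict(list)           # char -> increasing 1-based positions in P
    for idx, ch in enumerate(pattern, start=1):
        positions[ch].append(idx)
    hits = []
    s = 0                                   # P[1] is aligned with T[s + 1]
    while s <= m - n:
        i = n
        while i >= 1 and pattern[i - 1] == text[s + i - 1]:
            i -= 1                          # right-to-left comparison
        if i == 0:
            hits.append(s + 1)
            s += 1                          # no good suffix rule here, so move one step
        else:
            occ = positions.get(text[s + i - 1], [])
            r = bisect_left(occ, i)         # number of occurrences strictly left of i
            j = occ[r - 1] if r > 0 else 0  # closest one to the left, or 0 if none
            s += i - j                      # always at least 1
    return hits


print(bad_char_search("abayababaxababb", "aba"))  # [1, 5, 7, 11]
```

Because the shift is i minus the closest position to the left, it is never negative, which is the advantage of the extended rule over the basic one.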

Page 26: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

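   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:    qcabdabdab

According to the extended bad character rule: the mismatch is at i = 8 against T(k) = b, and the closest b to the left in P is at position 7, so shift 8 – 7 = 1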

Page 27: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

(weak) good suffix rule

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab

Page 28: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

(Weak) good suffix rule

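[Figure: T contains the char x followed by the substring t; P ends with the char y followed by t, and y mismatches x. P is shifted right so that t’, another copy of t inside P, lines up with t in T.]

In preprocessing: for any suffix t of P, find the rightmost copy t’ of t in P other than the suffix itself (t ≠ t’)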

Page 29: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

(Strong) good suffix rule

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^

Page 30: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

(Strong) good suffix rule

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab

Page 31: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

(Strong) good suffix rule

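• Pre-processing can be done in linear time
• If P in T, may take O(mn)
• If P not in T, worst-case O(m+n)

[Figure: T contains the char x followed by the substring t; P ends with the char y followed by t, and y mismatches x. P is shifted right so that t’ lines up with t in T, where the char z to the left of t’ differs from y.]

In preprocessing: for any suffix t of P, find the rightmost copy t’ of t in P (t ≠ t’) such that the char to the left of t differs from the char to the left of t’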
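The difference between the weak and strong rules can be made concrete with a brute-force shift computation. This sketch follows the definitions on the slides directly and is far from the linear-time preprocessing mentioned above; the name `good_suffix_shift` is invented, and the case where no copy t’ exists is deliberately left out (the function returns `None`).

```python
from typing import Optional


def good_suffix_shift(pattern: str, i: int, strong: bool = True) -> Optional[int]:
    """Shift for a mismatch at 1-based position i of P, after P[i+1..n] matched.

    Returns None when no qualifying copy t' exists; that case is not handled here.
    """
    n = len(pattern)
    t = pattern[i:]                      # matched suffix t = P[i+1..n]
    if not t:
        return None                      # mismatch at the last char: nothing matched yet
    # Try right ends of t' from right to left; end < n excludes the suffix t itself.
    for end in range(n - 1, len(t) - 1, -1):
        start = end - len(t)             # 0-based start of the candidate copy
        if pattern[start:end] != t:
            continue
        if strong and start > 0 and pattern[start - 1] == pattern[i - 1]:
            continue                     # same preceding char would mismatch again
        return n - end                   # aligns t' under the matched t in T
    return None


print(good_suffix_shift("qcabdabdab", 8, strong=False))  # 3
print(good_suffix_shift("qcabdabdab", 8, strong=True))   # 6
```

For P = qcabdabdab with the mismatch at position 8, the weak rule picks the copy of ab at positions 6 to 7, while the strong rule rejects it (it is preceded by d, like the suffix) and picks the copy at positions 3 to 4.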

Page 32: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Lessons From B-M

• Sub-linear time is possible
  – But we still need to read T from disk!
• Bad cases require periodicity in P or T
  – matching random P with T is easy!
• Large alphabets mean large shifts
• Small alphabets make complicated shift data-structures possible
• B-M better for “english” and amino-acids than for DNA.

Page 33: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Algorithm KMP

• Not the fastest

• Best known

• Good for multiple pattern matching and real-time matching

• Idea
  – Left-to-right comparison
  – Shift P more chars when possible

Page 34: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Basic idea

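[Figure: P is compared with T from left to right. t is the part of P matched just before a mismatch (x in T against y in P), and t’ is the prefix of P equal to t. P is shifted so that t’ lies under t; the char z that follows t’ differs from y.]

In pre-processing: for any position i in P, find the longest proper suffix t = P[j+1..i] of P[1..i] such that t matches a prefix t’ of P and the char following t differs from the char following t’, i.e., P[i+1] != P[i-j+1].

Sp’(i) = length(t)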

Page 35: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Example

P: aataac

i       1 2 3 4 5 6
P[i]    a a t a a c
Sp’(i)  0 1 0 0 2 0

[Figure: the text fragment aaat shown above the pattern aataac]
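The Sp’ values in the table can be reproduced straight from the definition. The sketch below is a cubic-time brute force meant only for checking small examples, not an efficient preprocessing method; the name `sp_prime` is invented, and at i = n (where there is no next char) it simply takes the longest border.

```python
def sp_prime(pattern: str) -> list[int]:
    """Sp'(i) for i = 1..n, computed directly from the definition (cubic time)."""
    n = len(pattern)
    values = []
    for i in range(1, n + 1):
        best = 0
        # Proper suffixes of P[1..i], longest first.
        for length in range(i - 1, 0, -1):
            if pattern[i - length:i] != pattern[:length]:
                continue
            # Next chars must differ; at i = n there is no next char to compare.
            if i == n or pattern[i] != pattern[length]:
                best = length
                break
        values.append(best)
    return values


print(sp_prime("aataac"))   # [0, 1, 0, 0, 2, 0]
print(sp_prime("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```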

Page 36: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Failure link

P: aataac

i       1 2 3 4 5 6
P[i]    a a t a a c
Sp’(i)  0 1 0 0 2 0

[Figure: the text fragment aaat shown above the pattern aataac]

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= Sp’(5) + 1)

Page 37: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

FSA

P: aataac

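[Figure: finite-state automaton with states 0 to 6. The forward edges 0→1→2→3→4→5→6 are labeled a, a, t, a, a, c, and there are additional edges labeled a and t.]

All other input goes to state 0

Sp’(i)  0 1 0 0 2 0

[Figure: the text fragment aaat shown above the pattern aataac]

If the next char in T is t, we go to state 3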

Page 38: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Another example

P: abababc

i       1 2 3 4 5 6 7
P[i]    a b a b a b c
Sp’(i)  0 0 0 0 0 4 0

[Figure: text fragments abab, abababab and ababaababc]

Page 39: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Failure link

P: abababc

i       1 2 3 4 5 6 7
P[i]    a b a b a b c
Sp’(i)  0 0 0 0 0 4 0

[Figure: text fragment ababaababc]

If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= Sp’(6) + 1)

Page 40: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

FSA

P: abababc

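[Figure: finite-state automaton with states 0 to 7. The forward edges 0→1→2→3→4→5→6→7 are labeled a, b, a, b, a, b, c, and there is an additional edge labeled a.]

Sp’(i)  0 0 0 0 0 4 0

[Figure: text fragment ababaababc]

If the next char in T is a, go to state 5

All other input goes to state 0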

Page 41: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Difference between Failure Link and FSA?

• Failure link
  – Preprocessing time and space are O(n), regardless of alphabet size
  – Comparison time is at most 2m
• FSA
  – Preprocessing time and space are O(n |Σ|)
    • May be a problem for very large alphabet size
  – Comparison time is always m.

Page 42: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Failure link

P: aataac

i       1 2 3 4 5 6
P[i]    a a t a a c
Sp’(i)  0 1 0 0 2 0

[Figure: the text fragment aaat shown above the pattern aataac]

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= Sp’(5) + 1)

Page 43: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Example

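P: aataac

Selected alignments (^ = match, * = mismatch, . = char already known to match and not re-compared):

T: aacaataaaaataaccttacta
P: aataac
   ^^*
P:  aataac
    .*
P:    aataac
      ^^^^^*
P:       aataac
         ..*
P:          aataac
            .^^^^^

Each char in T may be compared multiple times, up to n.

Time complexity: at most 2m comparisons, i.e. O(m).

Comparison phase and shift phase: comparison is bounded by m, shift is also bounded by m.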
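The failure-link search and its 2m bound can be sketched as follows. This is an illustration under assumptions: the failure table is built by brute force from the Sp’ definition to keep the block self-contained, and the names `failure_table` and `kmp_search` are invented here. The function also counts character comparisons so the bound can be observed.

```python
def failure_table(pattern: str) -> list[int]:
    """Sp'(i) for i = 1..n by brute force (stored at index i - 1)."""
    n = len(pattern)
    table = []
    for i in range(1, n + 1):
        best = 0
        for length in range(i - 1, 0, -1):
            same = pattern[i - length:i] == pattern[:length]
            if same and (i == n or pattern[i] != pattern[length]):
                best = length
                break
        table.append(best)
    return table


def kmp_search(text: str, pattern: str) -> tuple[list[int], int]:
    """Return (1-based match positions, number of character comparisons)."""
    m, n = len(text), len(pattern)
    if n == 0:
        return [], 0
    fail = failure_table(pattern)
    hits, comparisons = [], 0
    pos = j = 0                      # pos indexes T; j = chars of P matched so far
    while pos < m:
        comparisons += 1
        if text[pos] == pattern[j]:
            pos += 1
            j += 1
            if j == n:               # full match ending at T[pos]
                hits.append(pos - n + 1)
                j = fail[n - 1]
        elif j > 0:
            j = fail[j - 1]          # shift P; the same text char is re-compared
        else:
            pos += 1                 # mismatch at P[1]: move on in T
    return hits, comparisons


hits, comparisons = kmp_search("aacaataaaaataaccttacta", "aataac")
print(hits, comparisons)
```

Each loop iteration either advances in T (at most m times) or shortens the matched prefix (at most as often as it was lengthened), which is where the 2m bound comes from.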

Page 44: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Example

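T: aacaataaaaataaccttacta

Each char in T will be examined exactly once.

Therefore, exactly m comparisons are needed.

Takes longer to do pre-processing.

[Figure: finite-state automaton for P = aataac with states 0 to 6. The forward edges are labeled a, a, t, a, a, c, and there are additional edges labeled a and t.]

State after reading each char of T (state 6 means a match ends there):

T:     a a c a a t a a a a a t a a c c t t a c t a
state: 1 2 0 1 2 3 4 5 2 2 2 3 4 5 6 0 0 0 1 0 0 1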
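A complete automaton for a small pattern can be built by brute force and run over T, as in the sketch below. It assumes the pattern only uses chars from the given alphabet (so any other char resets to state 0); the names `build_dfa` and `run_dfa` and the default alphabet "acgt" are illustrative choices, and the construction shown is not the efficient one.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """dfa[state][ch] = length of the longest prefix of P that is a suffix of P[:state] + ch."""
    n = len(pattern)
    dfa = []
    for state in range(n + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            k = min(n, len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        dfa.append(row)
    return dfa


def run_dfa(text: str, pattern: str, alphabet: str = "acgt") -> tuple[list[int], list[int]]:
    """Return (state after each char of T, 1-based match positions)."""
    dfa = build_dfa(pattern, alphabet)
    state, states, hits = 0, [], []
    for pos, ch in enumerate(text, start=1):
        state = dfa[state].get(ch, 0)   # chars outside the alphabet reset to state 0
        states.append(state)
        if state == len(pattern):
            hits.append(pos - len(pattern) + 1)
    return states, hits


states, hits = run_dfa("aacaataaaaataaccttacta", "aataac")
print("".join(map(str, states)), hits)
```

Each char of T triggers exactly one table lookup, so the scan costs m steps. On the run of a's at positions 9 to 11 this automaton stays in state 2, because aa is still a matched prefix of P there.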

Page 45: CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

How to do pre-processing?