CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms

Post on 20-Dec-2015


Page 1: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms

CS 6293 Advanced Topics: Current Bioinformatics

Lecture 5

Exact String Matching Algorithms

Overview

• Sequence alignment has two sub-problems:
  – How to score an alignment with errors
  – How to find an alignment with the best score
• Today: exact string matching
  – Does not allow any errors
  – Efficiency becomes the sole consideration
    • Time and space

Why exact string matching?

• The most fundamental string comparison problem
• Often the core of more complex string comparison algorithms
  – E.g., BLAST
• Often called repeatedly by other methods
  – Usually the most time-consuming part
  – A small improvement could improve overall efficiency considerably

Definitions

• Text: a longer string T (length m)
• Pattern: a shorter string P (length n)
• Exact matching: find all occurrences of P in T

T (length m): abayababaxababb
P (length n): aba

The naïve algorithm

T: abayababaxababb
P: aba

[Figure: P is slid along T one position at a time and compared with T at every shift.]
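The sliding comparison can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `naive_search` is made up, and it reports 0-based offsets whereas the slides count positions from 1.

```python
def naive_search(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    m, n = len(text), len(pattern)
    hits = []
    for start in range(m - n + 1):      # try every alignment of P under T
        k = 0
        while k < n and text[start + k] == pattern[k]:
            k += 1                      # compare left to right until a mismatch
        if k == n:                      # all n chars matched
            hits.append(start)
    return hits


print(naive_search("abayababaxababb", "aba"))  # [0, 4, 6, 10]
```

Every alignment restarts the comparison from the first char of P, which is what the smarter algorithms avoid.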

Time complexity

• Worst case: O(mn)
• Best case: O(m)
  – e.g. aaaaaaaaaaaaaa vs. baaaaaaa
• Average case?
  – Alphabet A, C, G, T
  – Assume both P and T are random
  – Equal probability for each char
  – On average, how many chars do you need to compare before giving up?

Average case time complexity

P(mismatch at 1st position): ¾
P(mismatch at 2nd position): ¼ · ¾
P(mismatch at 3rd position): (¼)^2 · ¾
P(mismatch at kth position): (¼)^(k-1) · ¾

Expected number of comparisons per position, with p = 1/4:

$$\sum_{k \ge 1} k\,(1-p)\,p^{k-1} = \frac{1-p}{p} \sum_{k \ge 1} k\,p^{k} = \frac{1}{1-p} = \frac{4}{3}$$

Average complexity: 4m/3
Not as bad as you thought it might be

Biological sequences are not random

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab

Plus: 4m/3 average case is still bad for long genomic sequences!
Especially if this has to be done again and again

Smarter algorithms:
  – O(m + n) in the worst case
  – Sub-linear in practice

How to speed up?

• Pre-process T or P
• Why can pre-processing save us time?
  – It uncovers the structure of T or P
  – It determines when we can skip ahead without missing anything
  – It determines when we can infer the result of character comparisons without doing them

T: ACGTAXACXTAXACGXAX
P: ACGTACA

Cost for exact string matching

Total cost = cost(preprocessing)   [overhead]
           + cost(comparison)      [minimize]
           + cost(output)          [constant]

Hope: gain > overhead

String matching scenarios

• One T and one P
  – Search for a word in a document
• One T and many P all at once
  – Search for a set of words in a document
  – Spell checking (fixed P)
• One fixed T, many P
  – Search a completed genome for short sequences
• Two (or many) T's for common patterns
• Q: Which one to pre-process?
• A: Always pre-process the shorter sequence, or the one that is used repeatedly

Pre-processing algorithms

• Pattern preprocessing
  – Knuth-Morris-Pratt algorithm (KMP)
  – Aho-Corasick algorithm
    • Multiple patterns
  – Boyer-Moore algorithm (discussed only if time permits)
    • The choice for most cases
    • Typically sub-linear time
• Text preprocessing
  – Suffix tree
    • Very useful for many purposes

Algorithm KMP: Intuitive example 1

• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
• Number of comparisons saved: 6

[Figure: P = abcxabcde is aligned with a stretch abcxabc of T and mismatches at P[8]. The naïve approach would try the shifts of 1, 2 and 3 chars before reaching the shift of 4.]

Intuitive example 2

• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches.
• Number of comparisons saved: 7

[Figure: P = abcxabcde has matched abcxab in T and mismatches at P[7] = c. T[j] "should not be a c", so the shift of 4 that would align P[3] = c with T[j] cannot succeed either.]

KMP algorithm: pre-processing

• Key: the reasoning is done without even knowing what string T is.
• Only the location of the mismatch in P must be known.

[Figure: P contains a prefix t' followed by char z and a copy t = P[j..i] followed by char y; T contains t followed by char x.]

Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] such that t matches a prefix t' of P and the char following t differs from the char following t' (i.e., y ≠ z).
For each i, let sp(i) = length(t).

KMP algorithm: shift rule

[Figure: same layout as on the pre-processing slide; after the shift, the prefix t' of P lies under the copy of t in T, and z is compared with x.]

Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z.

This shift rule can be represented implicitly by creating a failure link between y and z. Meaning: when a mismatch occurs between x in T and P[i+1], resume comparison between x and P[sp(i)+1].
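The definition of sp(i) can be checked with a direct, deliberately slow Python sketch that tries every proper suffix length from longest to shortest. The function name `strong_sp` and the 0-based list layout are assumptions for illustration; the lecture's own linear-time pre-processing is a different procedure. For the last position there is no following char, so the sketch treats the "next chars differ" condition as satisfied there.

```python
def strong_sp(pattern):
    """sp[i-1] holds sp(i) for 1-based i, computed straight from the definition."""
    n = len(pattern)
    sp = [0] * n
    for i in range(1, n + 1):                   # 1-based end position i
        for length in range(i - 1, 0, -1):      # proper suffixes, longest first
            suffix = pattern[i - length:i]      # t  = P[i-length+1 .. i]
            prefix = pattern[:length]           # t' = P[1 .. length]
            # y is the char after t, z the char after t'; require y != z
            next_differs = (i == n) or (pattern[length] != pattern[i])
            if suffix == prefix and next_differs:
                sp[i - 1] = length
                break
    return sp


print(strong_sp("aataac"))   # [0, 1, 0, 0, 2, 0]
print(strong_sp("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```

This runs in cubic time in the worst case, which is fine for checking small examples by hand but not for real use.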

Failure Link Example

P: aataac

i      1 2 3 4 5 6
P[i]   a a t a a c
sp(i)  0 1 0 0 2 0

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)

Another example

P: abababc

i      1 2 3 4 5 6 7
P[i]   a b a b a b c
sp(i)  0 0 0 0 0 4 0

If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)

KMP Example using Failure Link

P: aataac
T: aacaataaaaataaccttacta

[Figure: successive alignments of P under T. Legend: ^ match, * mismatch, . implicit comparison (a char already known to match, so it is not compared again).]

Time complexity analysis:
• Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
• More careful analysis: the comparisons can be split into two phases:
  – Comparison phase: the first time a char in T is compared to P. Total is exactly m.
  – Shift phase: first comparisons made after a shift. Total is at most m.
• Time complexity: at most 2m comparisons, i.e. O(m)
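A compact Python sketch of the search with failure links follows. It is an illustration under stated assumptions: the names `failure_function` and `kmp_search` are made up, offsets are 0-based, and the failure table is the plain longest-border table rather than the stricter sp values that also require the next chars to differ. The plain table finds the same matches but may make a few extra comparisons.

```python
def failure_function(pattern):
    """fail[q] = length of the longest proper border of pattern[:q + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = fail[k - 1]             # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        fail[q] = k
    return fail


def kmp_search(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    n = len(pattern)
    if n == 0:
        return []
    fail = failure_function(pattern)
    hits, q = [], 0                     # q = number of pattern chars matched
    for k, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = fail[q - 1]             # follow the failure link, T never backs up
        if pattern[q] == ch:
            q += 1
        if q == n:                      # full match ending at text position k
            hits.append(k - n + 1)
            q = fail[n - 1]
    return hits


print(kmp_search("aacaataaaaataaccttacta", "aataac"))  # [9]
```

The text index only moves forward, which is the reason the total work stays linear in m.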

KMP algorithm using DFA (Deterministic Finite Automaton)

P: aataac

[Figure, failure link version: the chars a a t a a c in a row; if a char in T fails to match at pos 6, re-compare it with the char at pos 3.]

[Figure, DFA version: states 0 to 6 along the path a a t a a c, with extra transitions on a and t; if the next char in T is t after matching 5 chars, go to state 3.]

All other inputs go to state 0.
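One common way to build such a DFA is to track a "restart state" x, the state the automaton would be in after reading the pattern without its first char. The sketch below uses that construction; the names `build_dfa` and `run_dfa` and the dict-of-dicts layout are assumptions for illustration, not the lecture's implementation.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][char] -> next state; state q means 'q chars of P matched'."""
    chars = set(alphabet) | set(pattern)
    n = len(pattern)
    dfa = [dict.fromkeys(chars, 0) for _ in range(n + 1)]
    if n == 0:
        return dfa
    dfa[0][pattern[0]] = 1
    x = 0                                # state reached by reading pattern[1:j]
    for j in range(1, n + 1):
        dfa[j].update(dfa[x])            # on a mismatch, behave like state x
        if j < n:
            dfa[j][pattern[j]] = j + 1   # on a match, advance
            x = dfa[x][pattern[j]]
    return dfa


def run_dfa(text, pattern, alphabet):
    """Return the state after each char of text (state n = a match ends here)."""
    dfa = build_dfa(pattern, alphabet)
    state, trace = 0, []
    for ch in text:
        state = dfa[state].get(ch, 0)    # chars outside the alphabet reset to 0
        trace.append(state)
    return trace


print(build_dfa("aataac", "act")[5]["t"])  # 3
```

The table has one entry per state and alphabet char, which is where the extra pre-processing time and space come from.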

DFA Example

T:      a a c a a t a a t a a t a a c c t t a c t a
State:  1 2 0 1 2 3 4 5 3 4 5 3 4 5 6 0 0 0 1 0 0 1

Each char in T is examined exactly once.
Therefore, exactly m comparisons are made.

But pre-processing takes longer, and more space is needed to store the DFA.

Difference between Failure Link and DFA

• Failure link
  – Preprocessing time and space are O(n), regardless of alphabet size
  – Comparison time is at most 2m (at least m)
• DFA
  – Preprocessing time and space are O(n|Σ|), where Σ is the alphabet
    • May be a problem for very large alphabet sizes
    • For example, when each "char" is a big integer
    • Chinese characters
  – Comparison time is always m.

Boyer-Moore algorithm

• Often the algorithm of choice
  – One T and one P
  – We will talk about it later if time permits
  – Sub-linear in practice

The set matching problem

• Find all occurrences of a set of patterns in T
• First idea: run KMP or BM for each P
  – O(km + n)
    • k: number of patterns
    • m: length of text
    • n: total length of patterns
• Better idea: combine all patterns together and search in one run

A simpler problem: spell-checking

• A dictionary contains five words:
  – potato
  – poetry
  – pottery
  – science
  – school
• Given a document, check if any word is (not) in the dictionary
  – Words in the document are separated by special chars.
  – Relatively easy.

Keyword tree for spell checking

• O(n) time to construct. n: total length of patterns.
• Search time: O(m). m: length of text
• A common prefix only needs to be compared once.
• What if there is no space between words?

[Figure: keyword tree for potato, poetry, pottery, science and school, with leaves numbered 1 to 5.]

Example document: This version of the potato gun was inspired by the Weird Science team out of Illinois
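A keyword tree for this spell-checking setting can be sketched with nested dictionaries. The helper names, the "$" end-of-word marker (assumed not to occur in any word) and the lower-casing of the document are illustrative choices, not part of the lecture.

```python
def build_keyword_tree(words):
    """Nested dicts, one level per char; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes share nodes
        node["$"] = True
    return root


def in_dictionary(tree, word):
    """Walk down the tree one char at a time: O(len(word))."""
    node = tree
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


def known_words(document, tree):
    """Return the words of a space-separated document found in the tree."""
    return [w for w in document.lower().split() if in_dictionary(tree, w)]


tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
doc = ("This version of the potato gun was inspired by "
       "the Weird Science team out of Illinois")
print(known_words(doc, tree))  # ['potato', 'science']
```

The sketch relies on the words being separated by spaces; without separators a failed walk would have to restart at every text position.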

Aho-Corasick algorithm

• Basis of the fgrep algorithm
• Generalizes KMP
  – Using failure links
• Example: given the following 4 patterns:
  – potato
  – tattoo
  – theater
  – other

Keyword tree

[Figure: keyword tree rooted at node 0 for the patterns potato, tattoo, theater and other, with leaves numbered 1 to 4.]

Keyword tree

[Figure: the same keyword tree, searched against the text potherotathxythopotattooattoo.]

Keyword tree

[Figure: the same keyword tree, searched against the text potherotathxythopotattooattoo.]

O(mn). m: length of text. n: length of longest pattern

Keyword Tree with a failure link

[Figure: the same keyword tree with one failure link added, searched against the text potherotathxythopotattooattoo.]

Keyword Tree with all failure links

[Figure: the same keyword tree with all failure links added.]

Example

[Figure: step-by-step search of the text potherotathxythopotattooattoo in the keyword tree with failure links.]

Aho-Corasick algorithm

• O(n) preprocessing, and O(m+k) searching.
  – n: total length of patterns.
  – m: length of text
  – k: number of occurrences.
• Can create a DFA similar to the one in KMP.
  – Requires more space
  – Preprocessing time depends on alphabet size
  – Search time per char of T is constant
• Q: Where can this algorithm be used in previous topics?
• A: BLAST
  – Given a query sequence, we generate many seed sequences (k-mers)
  – Search for exact matches to these seed sequences
  – Extend exact matches into longer inexact matches
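A small Python sketch of the failure-link version is given below. It is a textbook-style construction under stated assumptions, not the lecture's code: the names `build_automaton` and `ac_search`, the list-of-dicts state layout and the 0-based offsets are illustrative.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links, built breadth-first."""
    goto, fail, out = [{}], [0], [[]]        # state 0 is the root
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())          # depth-1 nodes fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit patterns ending here
    return goto, fail, out


def ac_search(text, patterns):
    """Return (0-based start, pattern) for every occurrence, in text order."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]                      # follow failure links on a mismatch
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits


print(ac_search("potherotathxythopotattooattoo",
                ["potato", "tattoo", "theater", "other"]))
# [(1, 'other'), (18, 'tattoo')]
```

The text is scanned once for all patterns together, instead of once per pattern.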

Suffix Tree

• All algorithms we talked about so far preprocess the pattern(s)
  – Boyer-Moore: fastest in practice. O(m) worst case.
  – KMP: O(m)
  – Aho-Corasick: O(m)
• In some cases we may prefer to pre-process T
  – Fixed T, varying P
• Suffix tree: basically a keyword tree of all suffixes

Suffix tree

• T: xabxac
• Suffixes:
  1. xabxac
  2. abxac
  3. bxac
  4. xac
  5. ac
  6. c

[Figure: suffix tree of xabxac, with leaves numbered 1 to 6 by suffix.]

Naïve construction: O(m^2) using Aho-Corasick.
Smarter: O(m). Very technical. Big constant factor.

Difference from a keyword tree: create an internal node only when there is a branch

Suffix tree implementation

• Explicitly labeling sequence end
• T: xabxa$

[Figure: the suffix tree of xabxa drawn without the end marker (only leaves 1 to 3, because suffixes 4 and 5 end inside the tree) and with the end marker $ (leaves 1 to 5).]

• One-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes

Suffix tree implementation

• Implicitly labeling edges
• T: xabxa$

[Figure: the suffix tree of xabxa$ drawn twice, once with explicit substring labels on the edges and once with each label replaced by a pair of positions in T, such as 1:2, 2:2 and 3:$.]

• |Tree(T)| = O(|T| + size(edge labels))

Suffix links

• Similar to failure links in a keyword tree
• Only link internal nodes having branches

[Figure: a suffix tree fragment with edge labels over the chars a to j and x, showing suffix links between branching internal nodes.]

P: xabcff

ST Application 1: pattern matching

• Find all occurrences of P = xa in T
  – Find the node v in the ST that matches P
  – Traverse the subtree rooted at v to get the locations

T: xabxac

[Figure: suffix tree of xabxac, with leaves numbered 1 to 6.]

• O(m) to construct the ST (large constant factor)
• O(n) to find v: linear in the length of P instead of T!
• O(k) to get all leaves, where k is the number of occurrences.
• Asymptotic time is the same as KMP. ST wins if T is fixed. KMP wins otherwise.
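The two-step lookup (walk down along P, then collect the leaves below) can be tried out on an uncompressed suffix trie. This stand-in is not a real suffix tree: it takes O(m^2) time and space to build, so it only suits short strings. The helper names, the "pos" key for leaf labels and the assumption that "$" does not occur in T are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + "$"; each leaf stores its 1-based start."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:] + "$":
            node = node.setdefault(ch, {})
        node["pos"] = start + 1          # "pos" cannot clash with 1-char keys
    return root


def occurrences(trie, pattern):
    """Find node v matching the pattern, then collect all leaves below v."""
    node = trie
    for ch in pattern:                   # O(n) walk, independent of len(T)
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                         # traverse the subtree rooted at v
        cur = stack.pop()
        for key, child in cur.items():
            if key == "pos":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("xabxac")
print(occurrences(trie, "xa"))  # [1, 4]
```

Once the trie exists, any number of patterns can be looked up without touching T again.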

ST Application 2: set matching

• Find all occurrences of a set of patterns in T
  – Build a ST from T
  – Match each P to the ST

T: xabxac
P: xab

[Figure: suffix tree of xabxac, with leaves numbered 1 to 6.]

• O(m) to construct the ST (large constant factor)
• O(n) to find v: linear in the total length of the P's
• O(k) to get all leaves, where k is the number of occurrences.
• Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed. AC wins if the P's are fixed. Otherwise it depends on their relative sizes.

ST application 3: repeats finding

• A genome contains many repeated DNA sequences
• Repeat sequence length varies from 1 nucleotide to millions
  – Genes may have multiple copies (50 to 10,000)
  – Highly repetitive DNA in some non-coding regions
    • 6 to 10 bp × 100,000 to 1,000,000 times
• Problem: find all repeats that are at least k residues long and appear at least p times in the genome

Repeats finding

• Find repeats that are at least k residues long and appear at least p times in the sequence
  – Phase 1: top-down, compute the label length (L) from the root to each node
  – Phase 2: bottom-up, count the number of leaves (N) descended from each internal node

[Figure: suffix tree with each internal node annotated with (L, N).]

For each node with L >= k and N >= p, print all leaves

O(m) to traverse the tree

Maximal repeats finding

1. Right-maximal repeat
   – S[i+1..i+k] = S[j+1..j+k],
   – but S[i+k+1] != S[j+k+1]
2. Left-maximal repeat
   – S[i+1..i+k] = S[j+1..j+k],
   – but S[i] != S[j]
3. Maximal repeat
   – S[i+1..i+k] = S[j+1..j+k],
   – but S[i] != S[j], and S[i+k+1] != S[j+k+1]

Example: acatgacatt
1. cat
2. aca
3. acat

Maximal repeats finding

• Find repeats with at least 3 bases and 2 occurrences
  – Right-maximal: cat
  – Maximal: acat
  – Left-maximal: aca

Position: 1234567890
S:        acatgacatt

[Figure: suffix tree of acatgacatt$, with leaves numbered 1 to 10; edges that run to the end of the string are labeled 5:e.]

Maximal repeats finding

• How to find a maximal repeat?
  – A right-maximal repeat whose occurrences have different left chars

Position: 1234567890
S:        acatgacatt

[Figure: the same suffix tree of acatgacatt$, with each leaf annotated with the char to the left of its suffix.]

Left char = [] g c c a a

ST application 4: word enumeration

• Find all k-mers that occur at least p times
  – Compute (L, N) for each node
    • L: total label length from root to node
    • N: # leaves
  – Find nodes v with L >= k, L(parent) < k, and N >= p
  – Traverse the sub-tree rooted at v to get the locations

[Figure: suffix tree cut at depth k. Nodes above the cut have L < k; the first nodes at or below the cut have L >= k and are reported if N >= p.]

This can be used in many applications. For example, to find words that appear frequently in a genome or a document
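The (L, N) idea can be tried out on an uncompressed suffix trie, where the depth of a node is its label length L and the number of leaves below it is N. This is a sketch under stated assumptions, not a real suffix tree: the trie costs O(m^2) to build, the function names and the "pos" leaf key are made up, and "$" is assumed not to occur in the text.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + "$"; each leaf stores its 1-based start."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:] + "$":
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root


def leaf_positions(node):
    """Collect the suffix start positions of all leaves below node."""
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "pos":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def frequent_kmers(text, k, p):
    """Map each k-mer occurring at least p times to its 1-based positions."""
    result = {}
    stack = [(build_suffix_trie(text), "")]
    while stack:
        node, label = stack.pop()
        if len(label) == k:                  # first node with L >= k
            positions = leaf_positions(node) # N = number of leaves below
            if len(positions) >= p:
                result[label] = positions
            continue
        for ch, child in node.items():
            if ch not in ("pos", "$"):       # skip leaf labels and end marker
                stack.append((child, label + ch))
    return result


print(frequent_kmers("acatgacatt", 3, 2))
```

For acatgacatt with k = 3 and p = 2 this reports aca at positions 1 and 6 and cat at positions 2 and 7.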

Joint Suffix Tree (JST)

• Build a ST for two or more strings
• Two strings S1 and S2
• S* = S1 & S2
• Build a suffix tree for S* in time O(|S1| + |S2|)
• The separator will only appear in an edge ending in a leaf (why?)

Joint suffix tree example

• S1 = abcd
• S2 = abca
• S* = abcd&abca$

[Figure: suffix tree of S*. Each leaf is labeled (Seq ID, Suffix ID): 1,1 1,2 1,3 1,4 2,1 2,2 2,3 2,4. The leaf (2, 0), for the suffix that starts at the separator, is marked useless.]

To Simplify

• We don't really need to do anything, since all edge labels are implicit.
• The right-hand side is more convenient to look at

[Figure: left, the full JST of abcd&abca$; right, the same tree with the useless leaf removed and every edge label cut off at the separator &.]

Application 1 of JST

• Longest common substring between two sequences
• Using Smith-Waterman
  – Gap = mismatch = -infinity.
  – Quadratic time
• Using JST
  – Linear time
  – For each internal node v, keep a bit vector B
  – B[1] = 1 if a child of v is a suffix of S1
  – Bottom-up: find all internal nodes with B[1] = B[2] = 1 (green nodes)
  – Report a green node with the longest label
  – Can be extended to k sequences. Just use a bit vector of size k.

[Figure: simplified JST of abcd and abca, with leaves labeled 1,1 to 2,4.]
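The bit-vector idea can be sketched on an uncompressed joint suffix trie. This is only an illustration under stated assumptions: building the trie costs quadratic time, so the linear-time bound of a real JST does not apply; the bits are set top-down while the suffixes are inserted rather than in a separate bottom-up pass; and the function name and node layout are made up.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both strings."""
    root = {"ids": 0, "kids": {}}
    for bit, s in ((1, s1), (2, s2)):        # bit 1 for S1, bit 2 for S2
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"ids": 0, "kids": {}})
                node["ids"] |= bit           # this label occurs in string `bit`
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["kids"].items():
            if child["ids"] == 3:            # "green": present in S1 and S2
                stack.append((child, label + ch))
    return best


print(longest_common_substring("abcd", "abca"))  # abc
```

With k sequences the same scheme would use k bits per node, as the last bullet of the slide suggests.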

Application 2 of JST

• Given K strings, find all k-mers that appear in at least (or at most) d strings
• Exact motif finding problem

[Figure: JST cut at depth k (L < k above the cut, L >= k below). A node with L >= k combines the bit vectors of its children: B = BitOR(1010, 0011) = 1011, so cardinal(B) = 3. The figure tests cardinal(B) >= 3.]

Application 3 of JST

• Substring problem for sequence databases
  – Given: a fixed database of sequences (e.g., individual genomes)
  – Given: a short pattern (e.g., DNA signature)
  – Q: Does this DNA signature belong to any individual in the database?
    • i.e. the pattern is a substring of some sequences in the database
    • Aho-Corasick doesn't work
  – This can also be used to design signatures for individuals
• Build a JST for the database seqs
• Match P to the JST
• Find seq IDs from descendants

[Figure: simplified JST of abcd and abca, with leaves labeled 1,1 to 2,4.]

Seqs: abcd, abca
P1: cd
P2: bc

Application 4 of JST

• Detect DNA contamination
  – When we clone and sequence a genome, DNA from other sources may contaminate our sample; it should be detected and removed
  – Given: a fixed database of sequences (e.g., possible contamination sources)
  – Given: a DNA just sequenced (e.g., DNA signature)
  – Q: Does this DNA contain a long enough substring from the seqs in the database?
• Build a JST for the database seqs
• Scan T using the JST

[Figure: simplified JST of abcd and abca, with leaves labeled 1,1 to 2,4.]

Contamination sources: abcd, abca
Sequence: dbcgaabctacgtctagt

Suffix Tree Memory Footprint

• The space requirements of suffix trees can become prohibitive
  – |Tree(T)| is about 20|T| in practice
• Suffix arrays provide one solution.

Suffix Arrays

• Very space efficient (m integers)
• Pattern lookup is nearly O(n) in practice
  – O(n + log_2 m) worst case with 2m additional integers
  – Independent of alphabet size!
• Easiest to describe (and construct) using suffix trees
  – Other (slower) methods exist

Suffixes of xabxa:
  1. xabxa
  2. abxa
  3. bxa
  4. xa
  5. a

Suffix array, read off the leaves of the suffix tree of xabxa$ in lexical order:

Pos     5    2      3     4    1
Suffix  a$   abxa$  bxa$  xa$  xabxa$

Suffix array construction

• Build the suffix tree for T$
• Perform a "lexical" depth-first search of the suffix tree
  – Output the suffix label of each leaf encountered
• Therefore the suffix array can be constructed in O(m) time.

Suffix array pattern search

• If P is in T, then all the locations of P are consecutive suffixes in Pos.
• Do binary search in Pos to find P!
  – Compare P with suffix Pos(m/2)
  – If lexicographically less, P is in the first half of Pos
  – If lexicographically more, P is in the second half of Pos
  – Iterate!
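The binary search can be sketched in Python as follows. The construction here simply sorts the suffixes, which is much slower than the suffix-tree route but gives the same array; plain string comparison already places a shorter suffix before its extensions, which mirrors "$" sorting first. The function names and the 1-based positions are illustrative assumptions.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def sa_search(text, pos, pattern):
    """Return all 1-based positions of pattern, using two binary searches."""
    n = len(pattern)

    def prefix(idx):                     # first n chars of the idx-th suffix
        start = pos[idx] - 1
        return text[start:start + n]

    lo, hi = 0, len(pos)
    while lo < hi:                       # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(pos)
    while lo < hi:                       # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])         # matches are consecutive in pos


pos = suffix_array("xabxa")
print(pos)                               # [5, 2, 3, 4, 1]
print(sa_search("xabxa", pos, "xa"))     # [1, 4]
```

Each comparison costs up to n chars, so this plain version takes O(n log m) per lookup.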

Suffix array pattern search

• T: xabxa$
• P: abx

Pos     5    2      3     4    1
Suffix  a$   abxa$  bxa$  xa$  xabxa$

[Figure: binary search over Pos, with the left (L), right (R) and middle (M) pointers moving at each step.]

Suffix array binary search

• How long does it take to compare P with a suffix of T?
  – O(n) worst case!
• Binary search on Pos takes O(n log m) time
• The worst case will be rare
  – It occurs if many long prefixes of P appear in T
• In random or large-alphabet strings
  – Expect to do fewer than log m comparisons
• O(n + log m) running time when combined with an LCP table
  – suffix tree = suffix array + LCP table

Summary

• One T, one P
  – Boyer-Moore is the choice
  – KMP works but is not the best
• One T, many P
  – Aho-Corasick
  – Suffix tree (array)
• One fixed T, many varying P
  – Suffix tree (array)
• Two or more T's
  – Suffix tree, joint suffix tree

Alphabet independent

Alphabet dependent

Boyer-Moore algorithm

• Three ideas:
  – Right-to-left comparison
  – Bad character rule
  – Good suffix rule

Boyer-Moore algorithm

• Right to left comparison

[Figure: P is compared with T from right to left. On a mismatch between x in T and y in P, P is shifted to the right and comparison resumes at the right end of P.]

Skip some chars without missing any occurrence.

Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^

What would you do now?

Bad character rule

   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab

Bad character rule

   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab

Basic bad character rule

P: tpabxab

char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

Pre-processing: O(n)

Basic bad character rule

char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab

When the rightmost T(k) in P is to the left of i, shift P to align T(k) with the rightmost T(k) in P

Here T(k) = t and i = 3: shift 3 - 1 = 2

Basic bad character rule

char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab

When T(k) is not in P, shift the left end of P to align with T(k+1)

Here T(k) = q and i = 7: shift 7 - 0 = 7

Basic bad character rule

char  Right-most position in P
a     6
b     7
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:          tpabxab

When the rightmost T(k) in P is to the right of i, shift P by 1

Here T(k) = a and i = 5: 5 - 6 < 0, so shift 1

Extended bad character rule

char  Positions in P
a     6, 3
b     7, 4
p     2
t     1
x     5

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab

Find the occurrence of T(k) in P that is closest to the left of i, and shift P to align T(k) with that position

Here T(k) = a and i = 5: 5 - 3 = 2, so shift 2

Preprocessing is still O(n)

Extended bad character rule

• Best possible: m / n comparisons
• Works better for large alphabet sizes
• In some cases the extended bad character rule is sufficiently good
• Worst case: O(mn)
  – Expected time is sublinear
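A search that uses only the basic bad character rule can be sketched as below. This is an illustration under stated assumptions, not full Boyer-Moore: there is no good suffix rule, the names are made up, offsets are 0-based, and the function also returns the alignments it tried so that the shifts can be followed by hand.

```python
def bad_char_search(text, pattern):
    """Return (hits, alignments): 0-based match starts and every start tried."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return [], []
    rightmost = {ch: i for i, ch in enumerate(pattern)}  # later i overwrites
    hits, alignments = [], []
    s = 0                                    # current 0-based start of P in T
    while s <= m - n:
        alignments.append(s)
        i = n - 1
        while i >= 0 and pattern[i] == text[s + i]:
            i -= 1                           # compare right to left
        if i < 0:
            hits.append(s)
            s += 1
        else:
            # shift so the rightmost copy of the bad char lines up with it;
            # a char absent from P counts as position -1; never shift below 1
            s += max(1, i - rightmost.get(text[s + i], -1))
    return hits, alignments


print(bad_char_search("xpbctbxabpqqaabpqz", "tpabxab"))
# ([], [0, 2, 4, 11])
```

On this text only four alignments are examined, instead of the twelve a naïve scan would try.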

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:    qcabdabdab

According to the extended bad character rule

(Weak) good suffix rule

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab

(Weak) good suffix rule

[Figure: T contains char x followed by t. P ends with char y followed by t and contains an earlier copy t' of t. P is shifted so that t' lines up with t in T.]

Preprocessing: for any suffix t of P, find the rightmost copy of t, denoted by t'.
How to find t' efficiently?

(Strong) good suffix rule

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^

(Strong) good suffix rule

   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab

(Strong) good suffix rule

[Figure: as for the weak rule, but the char z to the left of t' must differ from the char y to the left of t (z ≠ y).]

In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the char to the left of t' differs from the char to the left of t.

Example preprocessing

P: qcabdabdab

Bad char rule (where to shift depends on T):

char  Positions in P
a     9, 6, 3
b     10, 7, 4
c     2
d     8, 5
q     1

Good suffix rule (does not depend on T):

P:  q c a b d a b d a b
i:  1 2 3 4 5 6 7 8 9 10
    0 0 0 0 2 0 0 2 0 0

The largest shift given by either the (extended) bad char rule or the (strong) good suffix rule is used.

Time complexity of BM algorithm

• Pre-processing can be done in linear time
• With the strong good suffix rule, the worst case is O(m) if P is not in T
  – If P is in T, the worst case could be O(mn)
  – E.g. T = m^100, P = m^10 (one char repeated 100 and 10 times)
  – unless a modification is used (Galil's rule)
• Proofs are technical. Skip.

How to actually do pre-processing?

• Similar pre-processing for KMP and B-M
  – Find matches between a suffix and a prefix
  – Both can be done in linear time
  – P is usually short, so even a more expensive pre-processing may result in an overall gain

[Figure: for KMP, a prefix t' of P and a copy t = P[j..i], followed by different chars; for B-M, a suffix t of P and an earlier copy t', preceded by different chars.]

For each i, find a j. Similar to DP. Start from i = 2

Fundamental pre-processing

• Zi: length of the longest substring starting at i that matches a prefix of P
  – i.e. t = t', x ≠ y, Zi = |t|
  – With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.

P: a a b c a a b x a a z
Z: 0 1 0 0 3 1 0 0 2 1 0

• How to compute Z-values in linear time?

[Figure: P with the prefix t' = P[1..Zi] followed by x, and t = P[i..i+Zi-1] followed by y.]

Computing Z in Linear time

[Figure: P with the previous match t = P[l..r], its copy t' at the start of P, and the current position k between l and r; position k-l+1 is the corresponding position inside t'.]

We have already computed all Z-values up to k-1 and need to compute Zk. We also know the starting and ending points of the previous match, l and r.

We know that t = t', therefore the Z-value at k-l+1 may be helpful to us.

Computing Z in Linear time

• No char inside the box is compared twice. At most one mismatch per iteration.
• Therefore, O(n).

Case 1: the previous r is smaller than k, i.e., no previous match extends beyond k. Do explicit comparison.

Case 2: Z_{k-l+1} < r-k+1. Then Zk = Z_{k-l+1}. No comparison is needed.

Case 3: Z_{k-l+1} >= r-k+1. Then Zk >= r-k+1, and explicit comparison starts from r+1.
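The three cases fit in a short Python function. This is a standard formulation given as a sketch: it is 0-based, it keeps the current box as the half-open range [l, r) instead of the closed l..r used on the slide, and the name `z_values` is made up. The first entry is set to 0 to follow the example in the lecture.

```python
def z_values(s):
    """z[k] = length of the longest substring starting at k that is a prefix of s."""
    n = len(s)
    z = [0] * n
    l = r = 0                            # s[l:r] is the rightmost match found
    for k in range(1, n):
        if k < r:                        # k lies inside the box: reuse old value
            z[k] = min(z[k - l], r - k)  # capped at the end of the box
        # explicit comparison: a no-op when z[k] stopped short of the box end
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:                 # the match extends past the box
            l, r = k, k + z[k]
    return z


print(z_values("aabcaabxaaz"))  # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

The right end r never moves backwards and every successful comparison advances it, which gives the O(n) bound.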

Z-preprocessing for B-M and KMP

• Both KMP and B-M preprocessing can be done in O(n)

[Figure: the Z-value at j covers t = P[j..i], which matches the prefix t' = P[1..Zj], with i = j + Zj - 1.]

KMP: for each j, sp'(j + Zj - 1) = Zj
B-M: use Z backwards
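The KMP half of this mapping can be sketched as follows. The code is 0-based and includes a compact Z computation only to stay self-contained; the names are made up for illustration. Scanning j from right to left lets the smallest j, and therefore the longest match, write each entry last.

```python
def z_values(s):
    """z[k] = length of the longest substring starting at k that is a prefix of s."""
    n = len(s)
    z = [0] * n
    l = r = 0
    for k in range(1, n):
        if k < r:
            z[k] = min(z[k - l], r - k)
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z


def sp_from_z(pattern):
    """sp[i] for 0-based end position i, derived from the Z-values."""
    z = z_values(pattern)
    sp = [0] * len(pattern)
    for j in range(len(pattern) - 1, 0, -1):   # right to left
        if z[j] > 0:
            sp[j + z[j] - 1] = z[j]            # the match starting at j ends here
    return sp


print(sp_from_z("aataac"))   # [0, 1, 0, 0, 2, 0]
print(sp_from_z("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```

Because a Z-match stops exactly where the next chars differ, the values obtained this way already satisfy the y ≠ z condition.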