dynamic programming: edit distance bioinformatics: issues...

Post on 07-Sep-2020

0 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 1 -

Bioinformatics:Issues and Algorithms

CSE 308-408 • Fall 2007 • Lecture 10

Dynamic Programming:Edit Distance

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 2 -

Outline

• Setting the Stage

• DNA Sequence Comparison: First Successes

• The Change Problem

• The Manhattan Tourist Problem

• The Longest Common Subsequence Problem

• Sequence Alignment

• The Edit Distance Problem

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 3 -

Sequence comparison and alignment

Question: how can we measure sequence similarity?

Consider:

AGTAGCATC

AGTGCACC

AGTAGCATC

GACACGATTversus

That the two DNA fragments on the left somehow seem more similar than the two on the right could be significant.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 4 -

Motivation

Why is this important?

• Given a new DNA sequence, one of the first things abiologist will want to do is search databases of known sequences to see if anyone has recorded something similar.(As we've seen, genetic sequences are long and the databases are enormous, so efficiency will be an issue.)

• Sequence similarity can provide clues about function.

• Many other problems from computational biology incorporate some notion of sequence similarity as a basic premise.

• Similarity can provide clues about evolutionary relationships.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 5 -

Sequence concepts

A sequence is a linear string of symbols over a finite alphabet.

Some basic sequence concepts:

(Note: string and sequence are often used synonymously.)

number of symbols in s (written |s|).length

sequence of length 0 (written ε).empty string

sequence that can be obtained from s by removing some symbols (ACG is a subsequence of TATCTG).

subsequence

if t is a subsequence of s, then s is a supersequence of t.

supersequence

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 6 -

Sequence concepts

More basic sequence concepts:

sequence of consecutive symbols appearing in s (ACG is not a substring of TATCTG, but TCT is).

substring

(Observation: every substring is a subsequence of the string in question, but not vice versa.)

if t is a substring of s, then s is a superstring of t.

superstring

set of consecutive indices of a string, e.g., [2..4] refers to 2nd through 4th symbols of string s, or s[2]s[3]s[4].

interval

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 7 -

Sequence concepts

And lastly:

substring of s of the form s[1..j] where 0 ≤ j ≤ |s|(when j = 0, prefix is the empty string).

prefix

substring of s of the form s[i..|s|] where 1 ≤ i ≤ |s| + 1(when i = |s| + 1, suffix is the empty string).

suffix

AT is a prefix (and substring and subsequence) of ATCCAG.

AG is a suffix (and substring and subsequence) of ATCCAG.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 8 -

Sequences

The concept of a sequence is extremely broad. In this course, we are concerned with genetic sequences. However, there are other important kinds of sequence data:

• ASCII text,• speech,• handwriting (pen-strokes).

Likewise, the same algorithmic techniques turn up again and again, often under different names:

• approximate string matching,• edit (or evolutionary) distance,• dynamic time warping.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 9 -

Sequence comparison: a false start

A

C

C

G

G

T

T

G

G

C

C

A

A

C

A

G

G

T

A

G

G

C

C

An obvious idea that comes to mind is to line up each symbol and count the number that don't match:

As we know, this is Hamming distance and forms the basis for most error correcting codes, as well as the motif-finding problem we saw earlier.

= 2

But it doesn't work for the kinds of sequences we care about:

Just one missing symbol at the start of the second sequence leads to a large distance.

= 6?

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 10 -

Sequence manipulation at the genetic level

http://www.accessexcellence.org/AB/GG/nhgri_PDFs/insertion.pdf

Genomes aren't static ...

... sequence comparison must account for this.http://www.accessexcellence.org/AB/GG/nhgri_PDFs/deletion.pdf

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 11 -

The human factor

A hard-to-read gel (arrow marks location where bands of similar

intensity appear in two different lanes):http://hshgp.genome.washington.edu/teacher_resources/99-studentDNASequencingModule.pdf

"...the error rate is generally less than 1% over the first 650 bases and then rises significantly over the remaining sequence.”http://genome.med.harvard.edu/dnaseq.html

In addition, errors can arise during the sequencing process:

TAATGGCG?

TAATGCGG?

TAATTGCGG?

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 12 -

Sequence alignment

As we have seen, the two sequences we wish to compare may have different lengths. As a result, we need to allow for deletions and insertions.

The notion of an alignment helps us visualize this:

Alignment permits us to incorporate “spaces” (represented by a dash) in one or both sequences to make them the same length.

A

C

C

G

G

T

T

G

G

C

C?

A

C

C

G

G

T

T

G

G

C

C

-

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 13 -

Sequence alignment

No – this one has 5! How can we find the best alignment?

A

C

C

G

G

T

T

C

G

C

C

- T

-

G

G

C

C

-

T

-

G

C

C

T

T

G

G

C

-

This alignment has 6 mismatches. Is it the best possible?

A

C

C

G

G

T

T

C

G

C

C

- T G

G

C

C T G

C

-

T

T

G

G

C

-C

- -

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 14 -

DNA sequence comparison

First successes:

• Finding sequence similarities with genes of known function is a common approach to infer function of a newly-sequenced gene.

• In 1984, Russell Doolittle and colleagues found similarities between cancer-causing gene and normal growth factor (PDGF) gene.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 15 -

Cystic Fibrosis

• Cystic fibrosis (CF) is chronic and often fatal genetic disease of body's mucus glands (abnormally high level of mucus). Primarily affects respiratory systems in children.

• Mucus is slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva.

Inheritance of Cystic Fibrosis:• In early 1980's, biologists hypothesized that CF is an

autosomal recessive disorder caused by mutations in a gene that remained unknown until 1989.

• Heterozygous carriers are asymptomatic.• Must be homozygously recessive in this gene in order to be

diagnosed with CF.http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 16 -

Cystic Fibrosis: finding the gene

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 17 -

Cystic Fibrosis

• ATP binding proteins are present on cell membrane and act as transport channel.

• In 1989, biologists found similarity between cystic fibrosis gene and ATP binding proteins.

• A plausible function for cystic fibrosis gene, given that CF involves sweet secretion with abnormally high sodium level.

Finding Similarities between Cystic Fibrosis Gene and ATP binding proteins:

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 18 -

Cystic Fibrosis: mutation analysis

If a high percentage of cystic fibrosis (CF) patients have a certain mutation in the gene, while unaffected patients don’t, that could be an indicator of a mutation that is related to CF.

Indeed, a certain mutation was found in 70% of CF patients; this is convincing evidence of a predominant genetic diagnostics marker for CF.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 19 -

Cystic Fibrosis and the CFTR protein

• CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein acts in cell membrane of epithelial cells that secrete mucus.

• These cells line airways of nose, lungs, stomach wall, etc.

• CFTR protein (1480 amino acids) regulates a chloride ion channel: adjusts “wateriness” of fluids secreted by cell.

• CF sufferers are missing a single amino acid in their CFTR.• Mucus ends up being too thick, affecting many organs.http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 20 -

Bring in the Bioinformaticians

• Gene similarities between two genes with known and unknown function alert biologists to some possibilities.

• Computing a similarity score between two genes tells how likely it is that they have similar functions.

• Dynamic programming is a technique for revealing similarities between genes.

• The Change Problem is a good problem to introduce idea of dynamic programming.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 21 -

The Change Problem

The Change Problem.Convert some amount of money M into given denominations, using fewest possible number of coins.Input: An amount of money M, and an array of d denominations, c = (c1, c2, …, cd), with (c1 > c2 > … > cd).Output: A list of d integers, i1, i2, …, id, such that

c1i1 + c2i2 + … + cdid = Mand i1 + i2 + … + id is minimal.

You hand cashier $20 for a $19.23 bill. Your change could be:

etc.

3 quarters + 2 pennies 77 pennies

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 22 -

The Change Problem: an example

Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?

1 2 3 4 5 6 7 8 9 10

1 1 1

Value

Min # of coins

1 2 3 4 5 6 7 8 9 10

1 1 1

Value

Min # of coins

Only one coin needed to make change for values 1, 3, and 5.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 23 -

The Change Problem: an example

Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?

However, two coins needed to make change for values 2, 4, 6, 8, and 10.

1 2 3 4 5 6 7 8 9 10

1 2 1 2 1 2 2 2

Value

Min # of coins

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 24 -

The Change Problem: an example

Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?

Lastly, three coins needed to make change for values 7 and 9.

1 2 3 4 5 6 7 8 9 10

1 2 1 2 1 2 3 2 3 2

Value

Min # of coins

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 25 -

The Change Problem: recurrence

minNumCoins(M) =

minNumCoins(M-1) + 1

minNumCoins(M-3) + 1

minNumCoins(M-5) + 1

min ofminNumCoins(M) =

minNumCoins(M-1) + 1

minNumCoins(M-3) + 1

minNumCoins(M-5) + 1

min of

This example is expressed by the following recurrence relation:

minNumCoins(M) =

minNumCoins(M-c1) + 1

minNumCoins(M-c2) + 1

minNumCoins(M-cd) + 1

min ofminNumCoins(M) =

minNumCoins(M-c1) + 1

minNumCoins(M-c2) + 1

minNumCoins(M-cd) + 1

min of

Given denominations c1, c2, …, cd, recurrence relation is:

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 26 -

The Change Problem: a recursive algorithm

RecursiveChange(M, c, d)if M = 0

return 0bestNumCoins ← infinityfor i ← 1 to d

if M ≥ ci

numCoins ← RecursiveChange(M – ci, c, d)if numCoins + 1 < bestNumCoins

bestNumCoins ← numCoins + 1return bestNumCoins

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 27 -

The Change Problem: a recursive algorithm

The RecursiveChange algorithm works, but it is not efficient:

• E.g., if M = 77 andc = (1,3,7), then optimal combination of coins for 70 cents is computed on 9 different occasions!

74

77

76 70

75 73 69 73 71 67 69 67 63

74 72 68

72 70 66

68 66 62

72 70 66

70 68 64

66 64 60

68 66 62

66 64 60

62 60 56

. . . . . .70 70 70 7070

• It recalculates optimal coin combination for a given amount of money repeatedly.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 28 -

The Change Problem: we can do better

Rather than re-compute a given value more than once:

• Save results of each computation for 0 to M.

• This way, we can do a reference call to find an already computed value, instead of re-computing each time.

• Running time is M x d, where M is the value of money and d is the number of denominations.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 29 -

The Change Problem: Dynamic Programming

DPChange(M, c, d)bestNumCoins0 ← 0

for m ← 1 to MbestNumCoinsm ← infinity

for i ← 1 to dif m ≥ ci

if bestNumCoinsm – ci + 1 < bestNumCoinsm

bestNumCoinsm ← bestNumCoinsm – ci + 1

return bestNumCoinsM

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 30 -

DPChange example

00

0 10 1

0 1 20 1 2

0 1 2 30 1 2 1

0 1 2 3 40 1 2 1 2

0 1 2 3 4 50 1 2 1 2 3

0 1 2 3 4 5 60 1 2 1 2 3 2

0 1 2 3 4 5 6 70 1 2 1 2 3 2 1

0 1 2 3 4 5 6 7 80 1 2 1 2 3 2 1 2

0 1 2 3 4 5 6 7 8 90 1 2 1 2 3 2 1 2 3

c = (1,3,7)M = 9

http

://w

ww

.bio

algo

rithm

s.in

fo

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 31 -

The Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid.

Sink*

*

*

**

**

* *

*

*

Source

*

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 32 -

Sink*

*

*

**

**

* *

*

*

Source

*

The Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 33 -

The Manhattan Tourist Problem (MTP)

The Manhattan Tourist Problem.Find the longest path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink.”Output: A longest path in G from “source” to “sink.”

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 34 -

MTP: an example

3 2 4

0 7 3

3 3 0

1 3 2

4

4

5

6

4

6

5

5

8

2

2

5

0 1 2 3

0

1

2

3

j coordinatei c

oord

inat

e

13

source

sink

4

3 2 4 0

1 0 2 4 3

3

1

1

2

2

2

4 19

95

15

23

0

20

3

4 http

://w

ww

.bio

algo

rithm

s.in

fo

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 35 -

MTP: greedy is not optimal

1 2 5

2 1 5

2 3 4

0 0 0

5

3

0

3

5

0

10

3

5

5

1

2promising start, but leads to bad choices!

source

sink18

22

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 36 -

MTP: a simple recursive program

MT(n, m)if n = 0 and m = 0

return 0if n > 0

x ← MT(n-1, m) + length of edge from (n-1, m) to (n, m)else

x ← 0if m > 0

y ← MT(n, m-1) + length of edge from (n, m-1) to (n, m)else

y ← 0return max(x, y) What's wrong with

this approach?

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 37 -

MTP: Dynamic Programming

• Calculate optimal path score for each vertex in graph.

• Each vertex’s score is maximum of prior vertices' scores plus weight of respective edges in between.

1

5

0 1

0

1

i

source

1

5S1,0 = 5

S0,1 = 1

j

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 38 -

1 2

5

3

0 1 2

0

1

2

source

1 3

5

8

4

S2,0 = 8

i

S1,1 = 4

S0,2 = 33

-5

j

MTP: Dynamic Programming (continued)

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 39 -

MTP: Dynamic Programming (continued)

1 2

5

3

0 1 2 3

0

1

2

3

i

source

1 3

5

8

8

4

0

5

8

103

5

-59

131-5

S3,0 = 8

S2,1 = 9

S1,2 = 13

S3,0 = 8

j

http

://w

ww

.bio

algo

rithm

s.in

fo

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 40 -

MTP: Dynamic Programming (continued)

1 2 5

-5 1 -5

-5 3

0

5

3

0

3

5

0

10

-3

-5

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

9

12

S3,1 = 9

S2,2 = 12

S1,3 = 8

j

Greedyalgorithmfails!

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 41 -

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

0

1

16 S3,3 = 16

MTP: Dynamic Programming (continued)

Done!

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 42 -

MTP: recurrence

Computing score for a point (i, j) by recurrence relation:

si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)

The running time is n x m for a n-by-m grid, where n is the number of rows and m is the number of columns.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 43 -

But Manhattan is not a perfect grid ...

B

A3

A1

A2

B

A3

A1

A2

What about diagonals?

Score at point B is given by:

sB = max of

sA1 + weight of the edge (A1, B)

sA2 + weight of the edge (A2, B)

sA3 + weight of the edge (A3, B)

sB = max of

sA1 + weight of the edge (A1, B)

sA2 + weight of the edge (A2, B)

sA3 + weight of the edge (A3, B)

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 44 -

But Manhattan is not a perfect grid ...

Computing score sx for point x is given by recurrence relation:

Predecessors(x) = set of vertices that have edges leading to x

Running time for a graph G(V, E) is O(E) since each edge is evaluated once

Note: V is set of all vertices and E is set of all edges.

sx = maxof

sy + weight of edge (y, x) where

y ∈ Predecessors(x)sx = max

ofsy + weight of edge (y, x) where

y ∈ Predecessors(x)

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 45 -

Alignment: two-row representation

Alignment : 2 * k matrix ( k > m, n )

T - C G A T-T G C - A -A

letters of v

letters of wTT

A T C T G A TT G C A T A

v :w :

m = 7 n = 6

4 matches 2 insertions 3 deletions

Given 2 DNA sequences v and w:

http://www.bioalgorithms.info

A-

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 46 -

Aligning DNA sequences

v = ATCTGATGw = TGCATAC

n = 8m = 7

CATACGTGTAGTCTAV

W

match

insertiondeletion

mismatch

indels

4132

matches mismatches deletions insertions

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 47 -

The Longest Common Subsequence Problem (LCS)

The Longest Common Subsequence Problem.Find the longest common subsequence of two sequences.

Input: Two sequences:

v = v1 v2…vm and w = w1 w2…wn

Output: A sequence of positions in v : 1 ≤ i1 < i2 < … < it ≤ m

and a sequence of positions in w : 1 ≤ j1 < j2 < … < jt ≤ n

such that symbol ik of v matches jk of w and t is maximal.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 48 -

LCS example

A T - C T G A T C- T G C T - A - C

elements of v

elements of w-A

1

2

0

1

2

2

3

3

4

3

5

4

5

5

6

6

6

7

7

8

j coords:

i coords:

Matches shown in redpositions in v:positions in w:

2 < 3 < 4 < 6 < 8

1 < 3 < 5 < 6 < 7

Every common subsequence is a path in 2-D grid.

0

0

(0,0)→ (1,0)→ (2,1)→ (2,2)→ (3,3)→ (3,4)→ (4,5)→ (5,5)→ (6,6)→ (7,6)→ (8,7)

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 49 -

LCS Dynamic Programming

Find LCS of two strings.

Input: a weighted graph G with two distinct vertices, one labeled “source” and one labeled “sink.”Output: a longest path in G from “source” to “sink.”

Solve using an MTP graph with diagonals replaced by +1 edges.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 50 -

LCS Problem as Manhattan Tourist Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 51 -

MTP Graph for LCS Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 52 -

MTP Graph for LCS Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

Diagonal edge adds extraelement to subsequence.

LCS Problem: find path with maximum diagonal edges.

Every path is common subsequence.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 53 -

Computing LCS

si, j = max

si-1, j

si, j-1

si-1, j-1 + 1 if vi = wj

Let vi = prefix of v of length i : v1 … vi

and wj = prefix of w of length j : w1 … wj

The length of LCS(vi, wj) is computed by:

i,j

i-1,j

i,j -1

i-1,j -1

1 00

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 54 -

LCS Alignments

4

3

2

1

0

43210W A T C G

ATGT

V

0 1 2 2 3 4V = A T - G T | | |W = A T C G – 0 1 2 3 4 4

Every path in grid corresponds to an alignment:

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 55 -

Edit distance

In 1966, Levenshtein introduced edit distance between two strings as minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other.

d(v, w) = MIN number of elementary operations neededto transform v → w.

Keep in mind that LCS computation uses a MAX.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 56 -

The Edit Distance Problem

The Edit Distance Problem.Given two sequences, find the optimal series of deletions, insertions, and substitutions to transform one into the other.Input: Two sequences:

v = v1 v2…vm and w = w1 w2…wn

Output: An optimal series of basic editing operations:

e1, e2, ..., et

such then, when applied to one of the sequences, say v, it is transformed into the other sequence, w.

Here “optimal” can mean any of a number of things, including “fewest” or “lowest- / highest-cost.”

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 57 -

Edit distance vs. Hamming distance

v = ATATATATw = TATATATA

Hamming distance always compares i-th letter of v with i-th letter of w

Computing Hammingdistance is trivial task.

Hamming distance: d(v, w) = 8

w = TATATATAJust one shift

Make it all line up

v = -ATATATAT

Edit distance may compare i-th letter of v with j-th letter of w

Computing edit distance is non-trivial task.

Edit distance: d(v, w) = 2

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 58 -

Edit distance vs. Hamming distance

v = ATATATATw = TATATATA

Hamming distance always compares i-th letter of v with i-th letter of w

Hamming distance: d(v, w) = 8

w = TATATATA-Just one shift

Make it all line up

v = -ATATATAT

Edit distance may compare i-th letter of v with j-th letter of w

One insertion and one deletion.

Edit distance: d(v, w) = 2

How to find which j goes with which i ???

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 59 -

Edit distance example

TGCATAT ATCCGAT in 5 steps:

TGCATAT (delete last T)TGCATA (delete last A)TGCAT (insert A at front)ATGCAT (substitute C for 3rd G)ATCCAT (insert G before last A) ATCCGAT (Done)

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 60 -

TGCATAT ATCCGAT in 5 steps:

TGCATAT (delete last T)TGCATA (delete last A)TGCAT (insert A at front)ATGCAT (substitute C for 3rd G)ATCCAT (insert G before last A) ATCCGAT (Done)

What is the edit distance? 5?

Edit distance example

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 61 -

Edit distance example

TGCATAT ATCCGAT in 4 steps:

TGCATAT (insert A at front)ATGCATAT (delete 8th T)ATGCATA (substitute G for 5th A)ATGCGTA (substitute C for 3rd G)ATCCGAT (Done)

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 62 -

TGCATAT ATCCGAT in 4 steps:

TGCATAT (insert A at front)ATGCATAT (delete 8th T)ATGCATA (substitute G for 5th A)ATGCGTA (substitute C for 3rd G)ATCCGAT (Done)

Can it be done in 3 steps???

Edit distance example

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 63 -

Alignment grid

Every alignment path is from source to sink.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 64 -

Alignment as a path in edit graph

0 1 2 2 3 4 5 6 7 70 1 2 2 3 4 5 6 7 7 A T _ G T T A T _A T _ G T T A T _ A T C G T _ A _ CA T C G T _ A _ C0 1 2 3 4 5 5 6 6 70 1 2 3 4 5 5 6 6 7

(0,0), (1,1), (2,2), (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (5,5), (6,6), (7,6), (7,7)(7,7)

Corresponding path:

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 65 -

Alignment as a path in edit graph

and represent indels in v and w with score 0.

represent matches with score 1.

The score of the alignment path is 5.

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 66 -

Alignment as a path in edit graph

Every path inedit graphorresponds toan alignment:

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 67 -

Alignment as a path in edit graph

http://www.bioalgorithms.info

Old AlignmentOld Alignment 0122345677v = AT_GTTAT_w = ATCGT_A_C 0123455667

New AlignmentNew Alignment 0122345677v = AT_GTTAT_w = ATCG_TA_C 0123445667

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 68 -

Alignment as a path in edit graph

http://www.bioalgorithms.info

0122345677v = AT_GTTAT_w = ATCGT_A_C 0123455667

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 69 -

Sequence comparison: the basic algorithm

Fortunately we don't have to. The optimal alignment can be found using dynamic programming, which we've just seen.

We can't afford to enumerate all possible alignments looking for the best one – that would be an exponential search.

Dynamic programming is based on premise of computing solutions to smaller subproblems first and then using these to solve successively larger problems until we have our answer.

(Dynamic programming was invented by Richard Bellman in the 1950's. Its application to sequence comparison came later, in the 1970's.)http://fens.sabanciuniv.edu/msie/operations_research_50_years/anniversary/or50/1526-5463-2002-50-01-0048.pdf

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 70 -

Sequence alignment ...

First some ground-rules.

-

-

Not legal:

A

-

Legal:

deletion

-

T

Legal:

insertion

C

C

Legal:

match

G

C

Legal:

mismatch

http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 71 -

Sequence comparison: the basic algorithm

So, assuming we've already computed solutions for all shorter prefixes, we can compute the alignment for v[1..i] and w[1..j].

Given two sequences v and w, consider what's required to compute optimal alignment for prefixes v[1..i] and w[1..j]. Based on our rules for alignments, there are three possible cases:

v[i]

w[j]

optimal alignmentfor v[1..i-1]

and w[1..j-1]

III

or

substitution or match

-

w[j]

optimal alignmentfor v[1..i]

and w[1..j-1]

II

insertion

v[i]

-

optimal alignmentfor v[1..i-1]and w[1..j]

I

deletion

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 72 -

Sequence comparison: the basic algorithm

Conceptually, this might look something like this:

optimal alignment at v[1..i] and w[1..j]

optimal alignment at v[1..i] and w[1..j-1]

+cost of inserting w[j]

optimal alignment at v[1..i-1] and w[1..j]

+cost of deleting v[i]

optimal alignment atv[1..i-1] and w[1..j-1]

+cost of matching v[i] and w[j]

= max

Here we assume that deletions, insertions, and mismatcheshave negative costs, while matches have positive costs.

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 73 -

Sequence comparison: the basic algorithm

This computation can be viewed as building a 2-D matrix:

s t r i n g wε

s t r

i n

g

vε 0

where indel is cost of a deletion or insertion, and sub is cost of a match/mismatch involving symbols v[i] and w[j].

d[i,j] = max

d[i-1,j] + indel

d[i,j-1] + indel

d[i-1,j-1] + sub(v[i],w[j])

cost of inserting w

cost

of d

elet

ing

v

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 74 -

Sequence comparison: the basic algorithm

Stated more generally, say that our two sequences are:v[1]v[2]v[3]...v[m] w[1]w[2]w[3]...w[n]

Where cdel, cins, and csub are the costs of a deletion, an insertion, and a substitution, respectively.

d[i,j] = max

d[i-1,j] + cdel(v[i])

d[i,j-1] + cins(w[j])

d[i-1,j-1] + csub(v[i], w[j])

1 ≤ i ≤ m, 1 ≤ j ≤ n

And:

recu

rrenc

e

d[i,0] = d[i-1,0] + cdel(v[i])d[0,j] = d[0,j-1] + cins(w[j])

Then:1 ≤ i ≤ m1 ≤ j ≤ n

initi

al

cond

ition

sd[0,0] = 0

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 75 -

Sequence comparison: the basic algorithm

Example: say that cdel = -1, cins = -1,csub = -1 if mismatch and +1 if match

ACGT

-1-2-3-4

0-10-1-2

-10-1-1-2

-2-1-1-20

-3C A T

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 76 -

Algorithm for computing edit distance

EditDistance1(v, w)for i ← 1 to n

di,0 ← di-1,0 + cdel(vi)for j ← 1 to m

d0,j ← d0,j-1 + cins(wj)for i ← 1 to n

for j ← 1 to mdi-1,j + cdel(vi)

di,j ← max di,j-1 + cins(wj)di-1,j-1 + csub(vi, wj)

return (dn,m)http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 77 -

Sequence comparison: some observations

Computation can progress in a number of ways:

or or

For sequences of length m and n,

v[1]v[2]v[3]...v[m] w[1]w[2]w[3]...w[n]

What is the computation time?

How much memory (storage) is required?

O(mn)

O(mn)

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 78 -

Sequence comparison: getting an alignment

ACGT

C A T

We started with the notion of alignment. How do we get this?

-3 -1 -1 -2-2

-2-1-1

-2

-4

0

-1

-1

-2

-1

0

-30-1 0

By keeping track of optimal decisions made during algorithm,

... and then tracing back optimal path.

G

A

C

C

T

T

A

-(May not be unique).

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 79 -

Maintaining the traceback

EditDistance1(v, w)...for i ← 1 to n

for j ← 1 to mdi-1,j + cdel(vi)

di,j ← max di,j-1 + cins(wj)di-1,j-1 + csub(vi, wj)

if di,j = di-1,j + cdel(vi)bi,j ← if di,j = di,j-1 + cins(wj)

if di,j = di-1,j-1 + csub(vi, wj)return (dn,m, b)http://www.bioalgorithms.info

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 80 -

More examples

Comparing ACGT vs. CAT:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 81 -

More examples

Comparing ACGTAT vs. AGTTTG:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 82 -

More examples

Comparing ACGTGCGCTGCTG vs. CGTCCTGCCTGC:

A

C

C

G

G

T

T

C

G

C

C

- T G

G

C

C T G

C

-

T

T

G

G

C

-C

- -

Recall the trace we saw earlier:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 83 -

Optimality of the algorithm

How do we know this agorithm finds an optimal alignment?

We won't dig into finer details now, but basically it hinges on:

• Bellman's principle of optimality for dynamic programming.• The cost functions behaving properly (i.e., they must satisfy

the triangle inequality).

This could be an interesting topic for a final paper ...

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 84 -

Wrap-up

Remember:• Come to class having done the readings.• Check Blackboard regularly for updates.

Readings for next time:• Continue reading IBA Chapter 6 (sequence comparison).

top related