
Dynamic Programming Part I: Examples

Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 1 / 77

Dynamic Programming

Recall: the Change Problem
Other problems: Manhattan Tourist Problem, LCS Problem
Finally: Sequence alignments

Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) that travels only eastward and southward and visits the largest number of attractions (*) in the Manhattan grid.

Manhattan Tourist Problem: Formulation

Goal: Find the longest path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled "source" and the other labeled "sink"
Output: A longest path in G from "source" to "sink"

MTP: An example


MTP: Greedy Algorithm Is Not Optimal


MTP: Simple Recursive Program

MT(n, m)
  if n = 0 or m = 0
    return the length of the straight path along the border from (0, 0) to (n, m)
  x ← MT(n − 1, m) + length of the edge from (n − 1, m) to (n, m)
  y ← MT(n, m − 1) + length of the edge from (n, m − 1) to (n, m)
  return max{x, y}

What's wrong with this approach?
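One way to see the problem with the plain recursion: MT(i, j) is recomputed for the same (i, j) along every route that reaches it, so the number of calls grows exponentially. The sketch below is a minimal illustration, not the lecture's code. It keeps the recursion but caches each result. The function name and the weight layout are assumptions made for this example: `down[i][j]` is the weight of the edge entering (i, j) from (i − 1, j), and `right[i][j]` is the weight of the edge entering (i, j) from (i, j − 1); row 0 of `down` and column 0 of `right` are unused.

```python
from functools import lru_cache


def manhattan_tourist_memo(down, right):
    """Longest source-to-sink path score, top-down with caching.

    down[i][j]  : weight of the edge (i-1, j) -> (i, j)  (hypothetical layout)
    right[i][j] : weight of the edge (i, j-1) -> (i, j)  (hypothetical layout)
    Both are (n+1) x (m+1) lists of lists.
    """
    n, m = len(down) - 1, len(down[0]) - 1

    @lru_cache(maxsize=None)  # each (i, j) is now solved only once
    def mt(i, j):
        if i == 0 and j == 0:
            return 0
        candidates = []
        if i > 0:
            candidates.append(mt(i - 1, j) + down[i][j])
        if j > 0:
            candidates.append(mt(i, j - 1) + right[i][j])
        return max(candidates)

    return mt(n, m)


# Small 2 x 2 example grid (weights invented for illustration).
down = [[0, 0, 0], [1, 0, 2], [4, 6, 5]]
right = [[0, 3, 2], [0, 3, 2], [0, 0, 7]]
print(manhattan_tourist_memo(down, right))  # prints 17
```

With the cache, the work drops to one evaluation per grid vertex. The recursion depth still grows with n + m, which is one reason the bottom-up formulation is usually preferred.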

MTP: Dynamic Programming

Calculate the optimal path score for each vertex in the graph.
Each vertex's score is the maximum, over its predecessor vertices, of the predecessor's score plus the weight of the edge in between.

MTP: Recurrence

Computing the score for a point (i, j) by the recurrence relation:

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\
s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j)
\end{cases}
$$

The running time is O(nm) for an n × m grid (n = number of rows, m = number of columns).
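The recurrence can be filled in bottom-up, row by row. The following is a minimal sketch under assumed conventions, not the lecture's own code: the function name is hypothetical, `down[i][j]` holds the weight of the edge from (i − 1, j) to (i, j), and `right[i][j]` holds the weight of the edge from (i, j − 1) to (i, j).

```python
def manhattan_tourist(down, right):
    """Bottom-up evaluation of s[i][j] = max(s[i-1][j] + down, s[i][j-1] + right).

    down and right are (n+1) x (m+1) lists of lists (hypothetical layout);
    down[0][*] and right[*][0] are never read.
    """
    n, m = len(down) - 1, len(down[0]) - 1
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # First column: only reachable by moving south.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i][0]
    # First row: only reachable by moving east.
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j]

    # Interior vertices: best of "from the north" and "from the west".
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i][j],
                          s[i][j - 1] + right[i][j])
    return s[n][m]
```

The two nested loops touch each vertex once, which is where the O(nm) running time comes from.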

Manhattan is not a perfect Grid

Represented as a DAG: Directed Acyclic Graph

The score at point B is given by:

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\
s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j)
\end{cases}
$$

Manhattan Is Not A Perfect Grid

Computing the score for point x is given by the recurrence relation:

$$
s_x = \max_{y \in \text{Predecessors}(x)} \{\, s_y + \text{weight of the edge } (y, x) \,\}
$$

Predecessors(x): set of vertices that have edges leading to x.

The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once.

Longest Path in DAG Problem

Goal: Find a longest path between two vertices in a weighted DAG

Input: A weighted DAG G with source and sink vertices

Output: A longest path in G from source to sink


Longest Path in DAG: Dynamic Programming

Suppose vertex v has indegree 3 and predecessors {u_1, u_2, u_3}.
The longest path to v from the source is:

$$
s_v = \max \begin{cases}
s_{u_1} + \text{weight of the edge from } u_1 \text{ to } v \\
s_{u_2} + \text{weight of the edge from } u_2 \text{ to } v \\
s_{u_3} + \text{weight of the edge from } u_3 \text{ to } v
\end{cases}
$$

In general:

$$
s_v = \max_u \{\, s_u + \text{weight of the edge from } u \text{ to } v \,\}
$$

Traveling in the Grid

The only hitch is that one must decide on the order in which to visit the vertices.
By the time vertex x is analyzed, the values s_y for all its predecessors y should already be computed; otherwise we are in trouble.
We need to traverse the vertices in some order.
Try to find such an order for a directed cycle: there is none, which is why the graph must be acyclic.

Topological ordering

2 different topological orderings of the DAG
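Processing the vertices in a topological order guarantees that every predecessor's score is ready when a vertex is reached. The sketch below is one possible way to do this in Python, using the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The function name, the dictionary-based graph encoding and the example weights are assumptions made for this illustration.

```python
from graphlib import TopologicalSorter


def longest_path(predecessors, weight, source, sink):
    """Longest source-to-sink path in a weighted DAG.

    predecessors : dict mapping vertex -> list of its predecessor vertices
    weight       : dict mapping (u, v) -> weight of the edge u -> v
    Assumes the sink is reachable from the source.
    """
    score = {source: 0}
    back = {}  # back[v] = predecessor that gave v its best score

    # static_order() yields every vertex after all of its predecessors.
    for v in TopologicalSorter(predecessors).static_order():
        for u in predecessors.get(v, ()):
            if u not in score:
                continue  # u is not reachable from the source
            candidate = score[u] + weight[(u, v)]
            if v not in score or candidate > score[v]:
                score[v] = candidate
                back[v] = u

    # Follow the back pointers from the sink to recover the path.
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return score[sink], path[::-1]


preds = {"b": ["a"], "c": ["a", "b"], "d": ["b", "c"]}
w = {("a", "b"): 2, ("a", "c"): 5, ("b", "c"): 4, ("b", "d"): 1, ("c", "d"): 3}
print(longest_path(preds, w, "a", "d"))  # prints (9, ['a', 'b', 'c', 'd'])
```

Each edge is examined once in the inner loop, in line with the O(E) bound stated for the general recurrence.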


Traversing the Manhattan Grid

3 different strategies:
a) Column by column
b) Row by row
c) Along diagonals

Pseudo-code MTP: Dynamic Programming

ManhattanTourist(w↓, w→, w↘, n, m)
  s_{0,0} ← 0
  for i ← 1 to n
    s_{i,0} ← s_{i−1,0} + w↓_{i,0}
  for j ← 1 to m
    s_{0,j} ← s_{0,j−1} + w→_{0,j}
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max of:
        s_{i−1,j} + w↓_{i,j}
        s_{i,j−1} + w→_{i,j}
        s_{i−1,j−1} + w↘_{i,j}
  return s_{n,m}

Dynamic Programming Part II: Edit Distance & Alignments

Aligning DNA Sequences


LCS Alignment without mismatches

LCS: Longest Common Subsequence

Given two sequences

v = v_1 v_2 ... v_n and w = w_1 w_2 ... w_m

a common subsequence of v and w is a sequence of positions in

v: 1 ≤ i_1 < i_2 < ... < i_t ≤ n

and a sequence of positions in

w: 1 ≤ j_1 < j_2 < ... < j_t ≤ m

such that the i_k-th letter of v equals the j_k-th letter of w for every k, 1 ≤ k ≤ t. The LCS is a common subsequence for which t is maximal.

LCS: Example

Every common subsequence is a path in a 2-D grid.

LCS Problem as Manhattan Tourist Problem


Edit Graph for LCS Problem

Every path is a common subsequence.
Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: Find a path with the maximum number of diagonal edges.

Computing LCS

Let v_i = prefix of v of length i: v_1 ... v_i
and w_j = prefix of w of length j: w_1 ... w_j

Every Path in the Grid Corresponds to an Alignment


Aligning Sequences without Insertions and Deletions: Hamming Distance

Given two DNA sequences v and w:

The Hamming distance d_H(v, w) = 8 is large, but the sequences are very similar.

Aligning Sequences with Insertions and Deletions

By shifting one sequence over one position:

The edit distance: d_L(v, w) = 2
Hamming distance neglects insertions and deletions in DNA.
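The contrast can be made concrete with a short sketch of the Hamming distance. The sequences shown on the slide are not part of this transcript, so the pair below is an assumed illustration, chosen because it reproduces the value 8 while differing only by a one-position shift. The function name is also hypothetical.

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length sequences differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs sequences of equal length")
    # Position i of v is always compared with position i of w.
    return sum(a != b for a, b in zip(v, w))


v = "ATATATAT"  # illustrative sequences, not taken from the slide
w = "TATATATA"
print(hamming_distance(v, w))  # prints 8
```

Every position mismatches, although deleting the first letter of v and appending an A turns v into w.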

Levenshtein or edit distance

Definition
The Levenshtein distance or edit distance d_L between two sequences X and Y is the minimum number of edit operations of type

Replacement,
Insertion, or
Deletion,

that one needs to transform sequence X into sequence Y:

d_L(X, Y) = min{R(X, Y) + I(X, Y) + D(X, Y)}

where R, I and D count the replacements, insertions and deletions used by a transformation, and the minimum is taken over all transformations of X into Y.

Using M for match, an edit transcript is a string over the alphabet {I, D, R, M} that describes a transformation of X to Y.

Levenshtein or edit distance

Example: Given two strings
X = YESTERDAY
Y = EASTERS

Here is a minimum edit transcript for the above example:

edit transcript = D M I M M M M R D D
X               = Y E - S T E R D A Y
Y               = - E A S T E R S - -

The edit distance d_L(X, Y) of X, Y is 5.
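The definition translates into a dynamic programming table in which cell (i, j) holds the edit distance between the first i letters of X and the first j letters of Y. The sketch below is a standard formulation with unit cost for each operation. The function name is an assumption, and this is an illustration rather than the lecture's own code.

```python
def edit_distance(x, y):
    """Levenshtein distance with unit cost for replacement, insertion, deletion."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]

    # Turning a prefix into the empty string (or back) costs its length.
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            replace_cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + replace_cost)  # match / replacement
    return d[n][m]


print(edit_distance("YESTERDAY", "EASTERS"))  # prints 5
```

The result agrees with the edit transcript above, which uses two deletions at the end, one deletion at the start, one insertion and one replacement.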

Edit Distance vs Hamming Distance

Hamming distance always compares the i-th letter of v with the i-th letter of w.
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task.

Edit distance may compare the i-th letter of v with the j-th letter of w.
Edit distance: d(v, w) = 2
Computing edit distance is a non-trivial task.

Edit Distance: Example

What is the edit distance? 5?


Edit Distance: Example

Can it be done in 3 steps?


The Alignment Grid

Every alignment path runs from source to sink.

Alignment as a path in the Edit Graph


Alignments in Edit Graph

Every path in the edit graph corresponds to an alignment.

Alignment: Dynamic Programming

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad (\searrow) \\
s_{i-1,j} & (\downarrow) \\
s_{i,j-1} & (\rightarrow)
\end{cases}
$$

Dynamic Programming Example

Initialize the 1st row and 1st column to be all zeroes.
Or, to be more precise, initialize the 0th row and 0th column to be all zeroes.

Dynamic Programming Example

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{value from NW, plus 1, if } v_i = w_j \quad (\nwarrow) \\
s_{i-1,j} & \text{value from N} \quad (\uparrow) \\
s_{i,j-1} & \text{value from W} \quad (\leftarrow)
\end{cases}
$$

Alignment: Backtracking

Arrows indicate where each score originated from: ↖ (diagonal), ↑ (above), ← (left).

Backtracking Example

Find the matches in row 2 and column 2:
i = 2: j = 2 and j = 5 are matches (T).
j = 2: i = 4, 5 and 7 are matches (T).
Since v_i = w_j, s_{i,j} = s_{i−1,j−1} + 1:

$$ s_{2,2} = [s_{1,1} = 1] + 1 $$

$$ s_{2,5} = [s_{1,4} = 1] + 1 $$

$$ s_{4,2} = [s_{3,1} = 1] + 1 $$

$$ s_{5,2} = [s_{4,1} = 1] + 1 $$

$$ s_{7,2} = [s_{6,1} = 1] + 1 $$

Dynamic Programming Example

Continuing with the dynamic programming algorithm gives this result.

Alignment: Dynamic Programming

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad (\searrow) \\
s_{i-1,j} & (\downarrow) \\
s_{i,j-1} & (\rightarrow)
\end{cases}
$$

This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero.

LCS Algorithm

LCS(v, w)
  for i ← 0 to n
    s_{i,0} ← 0
  for j ← 1 to m
    s_{0,j} ← 0
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max of:
        s_{i−1,j−1} + 1   if v_i = w_j
        s_{i−1,j}
        s_{i,j−1}
      b_{i,j} ←
        ↑ if s_{i,j} = s_{i−1,j}
        ← if s_{i,j} = s_{i,j−1}
        ↖ if s_{i,j} = s_{i−1,j−1} + 1
  return (s_{n,m}, b)

Now What?

LCS(v, w) created the alignment grid.
Now we need a way to read the best alignment of v and w.
Follow the arrows backwards from the sink.

Printing LCS: Backtracking

PrintLCS(b, v, i, j)
  if i = 0 or j = 0
    return
  if b_{i,j} = ↖
    PrintLCS(b, v, i − 1, j − 1)
    print v_i
  else
    if b_{i,j} = ↑
      PrintLCS(b, v, i − 1, j)
    else
      PrintLCS(b, v, i, j − 1)
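The table-filling step and the backtracking step can be combined into one short function. The sketch below is an illustration under assumed names, not the lecture's code. It stores the arrows as the strings "diag", "up" and "left", and walks them back iteratively instead of recursively. The test strings are read off the alignment shown in this lecture (gaps removed).

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w as a string."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # scores; row 0 and column 0 stay 0
    b = [[""] * (m + 1) for _ in range(n + 1)]  # backtracking arrows

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Start from the better of the vertical and horizontal edges.
            if s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "up"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "left"
            # A diagonal edge exists only where the letters match.
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 >= s[i][j]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"

    # Follow the arrows backwards from the sink (n, m).
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs("ATCGTAC", "ATGTTAT"))  # prints ATGTA
```

When several longest common subsequences exist, which one is returned depends on how ties between the three edges are broken; here the diagonal edge is preferred.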

Now What?

Alignment:
A T C G - T A C -
A T - G T T A - T

LCS Runtime

It takes O(nm) time to fill in the n × m dynamic programming matrix.
Why O(nm)? The pseudocode consists of a "for" loop nested inside another "for" loop, which together fill the n × m matrix.

Dynamic Programming Part II: Sequence Alignment

Outline

Dot plots
Global Alignment
Scoring Matrices
Local Alignment
Alignment with Affine Gap Penalties

Types of sequence alignments

Dot plots
Number of sequences
  pairwise: compares two sequences
  multiple: compares several sequences
Portion of aligned sequence
  global: aligns the sequences over their whole length
  local: finds subsequences with the best similarity scores
Algorithms
  Optimal methods: Needleman-Wunsch, Smith-Waterman
  Heuristics: FASTA, BLAST

Dot plot

→ The simplest way to visualize the similarity between two protein/DNA sequences is to use a similarity matrix.

Identification of insertions/deletions
Identification of direct repeats or inversions
Steps to create a dot plot:
  2D matrix
  One sequence on the top
  One sequence on the left
  For each matrix cell, compare the symbols and draw a point if they match

Dot matrix sequence comparison

A dot matrix analysis is primarily a method for comparing two sequences. An (n × m) matrix relating two sequences of length n and m respectively is produced by placing a dot at each cell for which the corresponding symbols match. Here is an example for the two sequences:

IMISSMISSISSIPPI and MYMISSISAHIPPIE:

Dot matrix sequence comparison

Definition
Let S = s_1 s_2 ... s_n and T = t_1 t_2 ... t_m be two strings of length n and m respectively. Let M be an n × m matrix. Then M is a (simple) dot plot if, for all i, j with 1 ≤ i ≤ n and 1 ≤ j ≤ m:

$$
M[i,j] = \begin{cases}
1 & \text{if } s_i = t_j \\
0 & \text{otherwise}
\end{cases}
$$

Note: The longest common substring of the two strings S and T is then the longest run of consecutive 1s along a diagonal of the matrix. However, rather than writing the digit 1 we draw a dot.
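The definition maps directly onto a nested comprehension. The sketch below is a minimal illustration with hypothetical function names; it builds the 0/1 matrix and renders it as text, using `*` for a dot and `.` for an empty cell.

```python
def dot_plot(s, t):
    """n x m matrix with M[i][j] = 1 where s[i] == t[j], else 0 (0-based indices)."""
    return [[1 if a == b else 0 for b in t] for a in s]


def render(matrix):
    """Text rendering: one line per row of the matrix."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


print(render(dot_plot("IMISSMISSISSIPPI", "MYMISSISAHIPPIE")))
```

Rows correspond to the letters of the first string and columns to the letters of the second. Building the matrix compares every pair of positions, so it takes O(nm) time and space.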

Dot matrix sequence comparison

Some of the properties of a dot plot are:
the visualization is easy to understand
it is easy to find common substrings; they appear as contiguous dots along a diagonal
it is easy to find reversed substrings
it is easy to discover displacements
it is easy to find repeats

Dot plots

Sequence lengths: n and m
Running time: O(nm)

(Figure panels: DNA, Protein)

Noise in Dot plot

Human LDL receptor compared to itself

The low density lipoprotein (LDL) receptor is a cell surface protein that plays a central role in the metabolism of cholesterol in humans and animals. Mutations affecting its structure and function give rise to one of the most prevalent human genetic diseases, familial hypercholesterolemia.

Reducing the noise

To reduce the noise, a window size w and a stringency s are used, and a dot is only drawn at point (x, y) if in the next w positions at least s characters are equal.

Example: Phage P22

(Figure panels: w = 1, s = 1; w = 11, s = 7; w = 23, s = 15)
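The window and stringency rule can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, indices are 0-based, the window starting at (x, y) covers positions x..x+w−1 and y..y+w−1, and windows that would run past the end of either sequence are simply not scored. The parameter `stringency` plays the role of s.

```python
def windowed_dot_plot(s, t, w, stringency):
    """Dot at (x, y) if at least `stringency` of the w aligned positions
    starting at s[x] and t[y] are equal."""
    n, m = len(s), len(t)
    plot = [[0] * m for _ in range(n)]
    for x in range(n - w + 1):          # only windows that fit inside s
        for y in range(m - w + 1):      # only windows that fit inside t
            matches = sum(s[x + k] == t[y + k] for k in range(w))
            if matches >= stringency:
                plot[x][y] = 1
    return plot
```

With w = 1 and stringency 1 this reduces to the simple dot plot. Larger windows suppress isolated chance matches at the cost of a factor w in running time.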

Random similarity in Dot plots

When comparing DNA, there is a 1/4 probability of random matches.
When comparing protein sequences, there is a 1/20 probability of random matches.
Hence, if coding DNA regions are analyzed: translate first, then align!
You can always go back to DNA after alignment.
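These two figures follow from a simple model that the slide does not spell out, so treat it as an assumption: letters are drawn independently and uniformly from an alphabet Σ, so each letter a has probability p_a = 1/|Σ|. Two random positions then match with probability

$$
P(\text{match}) = \sum_{a \in \Sigma} p_a^2 = |\Sigma| \cdot \frac{1}{|\Sigma|^2} = \frac{1}{|\Sigma|}
$$

which gives 1/4 for the four nucleotides and 1/20 for the twenty amino acids. Real sequences have uneven letter frequencies, so the true chance of a random match is somewhat higher than these values.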

(Figure panels: w = 1; w = 3; w = 3, stringency 2)

DNA sequence

(Figure panels: simple dot plot, w = 1; w = 23, stringency = 16)

Protein sequence

(Figure panels: simple dot plot, w = 1; w = 23, stringency = 6)

Insertion, Deletion & Inversion

Repeats

ABCDEFGEFGHIJKLMNO

(Figure panels: tandem duplication compared to the non-duplicated sequence; tandem duplication compared to itself)

Palindromic repeat (intra chain)

5’ GGCGG 3’


Limitations of dot plots

No score to quantify identical or similar strings
Runtime is quadratic; more efficient algorithms to identify identical substrings exist (e.g. based on suffix trees)