Using Dynamic Programming To Align Sequences

Cédric Notredame (27/06/22)

Post on 01-Jan-2016

Page 1: Using Dynamic Programming To Align Sequences

Cédric Notredame (19/04/23)

Using Dynamic Programming To Align Sequences

Cédric Notredame

Page 2: Using Dynamic Programming To Align Sequences

Our Scope

Understanding the DP concept

Coding a Global and a Local Algorithm

Aligning with Affine gap penalties

Saving memory

Sophisticated variants…

Page 3: Using Dynamic Programming To Align Sequences

Outline

-Coding Dynamic Programming with Non-affine Penalties

-Adding affine penalties

-Turning a global algorithm into a local Algorithm

-Using A Divide and conquer Strategy

-The repeated Matches Algorithm

-Double Dynamic Programming

-Tailoring DP to your needs:

Page 4: Using Dynamic Programming To Align Sequences

Global Alignments Without Affine Gap Penalties

Dynamic Programming

Page 5: Using Dynamic Programming To Align Sequences

How To Align Two Sequences With a Gap Penalty, a Substitution Matrix and Not Too Much Time

Dynamic Programming

Page 6: Using Dynamic Programming To Align Sequences

A bit of History…

-DP invented in the 50s by Bellman

-Programming = Tabulation

-Re-invented in 1970 by Needleman and Wunsch

-It took 10 years to find out…

Page 7: Using Dynamic Programming To Align Sequences

The Foolish Assumption

The score of each column of the alignment is independent from the rest of the alignment

It is possible to model the relationship between two sequences with:

-A substitution matrix

-A simple gap penalty

Page 8: Using Dynamic Programming To Align Sequences

The Principle of DP

If you optimally extend an optimal alignment of two sub-sequences, the result remains an optimal alignment

X-XX     ?
XXXX  +  ?

The added column (??) is one of:

X over -    Deletion
X over X    Alignment
- over X    Insertion

Page 9: Using Dynamic Programming To Align Sequences

Finding the score of i,j

-Sequence 1: [1-i]

-Sequence 2: [1-j]

-The optimal alignment of [1-i] vs [1-j] can finish in three different manners:

X     X     -
-     X     X

Page 10: Using Dynamic Programming To Align Sequences

Finding the score of i,j

Three ways to build the alignment 1…i vs 1…j:

1…i-1 vs 1…j    +  column i over -

1…i-1 vs 1…j-1  +  column i over j

1…i vs 1…j-1    +  column - over j

Page 11: Using Dynamic Programming To Align Sequences

Finding the score of i,j

In order to compute the score of 1…i vs 1…j, all we need are the scores of:

1…i-1 vs 1…j-1

1…i vs 1…j-1

1…i-1 vs 1…j

Page 12: Using Dynamic Programming To Align Sequences

Formalizing the algorithm

F(i,j) = best of:

F(i-1,j) + Gep            (1…i-1 vs 1…j, then column X over -)

F(i-1,j-1) + Mat[i,j]     (1…i-1 vs 1…j-1, then column X over X)

F(i,j-1) + Gep            (1…i vs 1…j-1, then column - over X)
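The recurrence above can be sketched in Perl as follows. This is a minimal illustration rather than the author's code: the function name `nw_fill`, the parameter names and the use of a simple match/mismatch score in place of a full substitution matrix are assumptions. The call uses the FAST/FAT example with Match=2, MisMatch=-1, Gap=-1 from the slides.

```perl
use strict;
use warnings;

# Fill the global alignment table F and return a reference to it.
# nw_fill and its parameter names are hypothetical, chosen for this sketch.
sub nw_fill {
    my ($seq_i, $seq_j, $match, $mismatch, $gep) = @_;
    my ($len_i, $len_j) = (length($seq_i), length($seq_j));
    my @F;

    # Borders: a prefix aligned against nothing costs one gap per residue.
    for my $i (0 .. $len_i) { $F[$i][0] = $i * $gep; }
    for my $j (0 .. $len_j) { $F[0][$j] = $j * $gep; }

    for my $i (1 .. $len_i) {
        for my $j (1 .. $len_j) {
            my $s = substr($seq_i, $i - 1, 1) eq substr($seq_j, $j - 1, 1)
                  ? $match : $mismatch;
            my $diag = $F[$i-1][$j-1] + $s;     # F(i-1,j-1) + Mat[i,j]
            my $up   = $F[$i-1][$j]   + $gep;   # F(i-1,j)   + Gep
            my $left = $F[$i][$j-1]   + $gep;   # F(i,j-1)   + Gep
            my $best = $diag;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $F[$i][$j] = $best;
        }
    }
    return \@F;
}

my $F = nw_fill('FAST', 'FAT', 2, -1, -1);
print "Global score: $F->[4][3]\n";   # bottom-right cell, 5 for this example
```

The score of the whole alignment is read from the bottom-right cell; recovering the alignment itself needs the trace-back described later.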

Page 13: Using Dynamic Programming To Align Sequences

Arranging Everything in a Table

        -     F     A     T
  -
  F
  A
  S
  T

The cell for 1…I vs 1…J is filled from three neighbouring cells: 1…I-1 vs 1…J-1, 1…I vs 1…J-1 and 1…I-1 vs 1…J.

Page 14: Using Dynamic Programming To Align Sequences

Taking Care of the Limits

In a Dynamic Programming strategy, the most delicate part is to take care of the limits:

-what happens when you start

-what happens when you finish

The DP strategy relies on the idea that ALL the cells in your table have the same environment…

This is NOT true of ALL the cells!

Page 15: Using Dynamic Programming To Align Sequences

Taking Care of the Limits

Match=2, MisMatch=-1, Gap=-1

        -     F     A     T
  -     0    -1    -2    -3
  F    -1
  A    -2
  S    -3
  T    -4

Each border cell aligns a prefix against gaps only, for instance F over - (-1), FA over -- (-2), FAS over --- (-3) and FAT over --- (-3).

Page 16: Using Dynamic Programming To Align Sequences

Filling Up The Matrix

Page 17: Using Dynamic Programming To Align Sequences

        -     F     A     T
  -     0    -1    -2    -3
  F    -1    +2    +1     0
  A    -2    +1    +4    +3
  S    -3     0    +3    +3
  T    -4    -1    +2    +5

Each cell keeps the best of its three candidate scores (from the diagonal, from above and from the left).

Page 18: Using Dynamic Programming To Align Sequences

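Delivering the alignment: Trace-back

The score of 1…3 Vs 1…4 (bottom-right cell) is the Optimal Aln Score

The trace-back from that cell collects the columns T over T, S over -, A over A and F over F, that is:

FAST
FA-T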

Page 19: Using Dynamic Programming To Align Sequences

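Trace-back: possible implementation

    # Walk back from the last cell to (0,0); the columns are collected last-to-first.
    while (!($i==0 && $j==0)) {
        if ($tb[$i][$j]==$sub) {        # SUBSTITUTION: one residue from each sequence
            $alnI[$aln_len]=$seqI[--$i];
            $alnJ[$aln_len]=$seqJ[--$j];
        }
        elsif ($tb[$i][$j]==$del) {     # DELETION: gap in I, residue from J
            $alnI[$aln_len]='-';
            $alnJ[$aln_len]=$seqJ[--$j];
        }
        elsif ($tb[$i][$j]==$ins) {     # INSERTION: residue from I, gap in J
            $alnI[$aln_len]=$seqI[--$i];
            $alnJ[$aln_len]='-';
        }
        $aln_len++;
    }
    # @alnI and @alnJ now hold the alignment in reverse order.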
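To see the trace-back working end to end, here is a self-contained sketch that fills the table, records a trace-back code in each cell and then walks back from the bottom-right corner. The function name `nw_align`, the codes 'S', 'U' and 'L' and the match/mismatch scoring are assumptions made for the illustration; ties are broken in favour of the diagonal move.

```perl
use strict;
use warnings;

# Global alignment with a linear gap penalty (illustrative sketch).
# Returns (score, aligned sequence I, aligned sequence J).
sub nw_align {
    my ($seq_i, $seq_j, $match, $mismatch, $gep) = @_;
    my ($len_i, $len_j) = (length($seq_i), length($seq_j));
    my (@F, @tb);   # scores, and codes: 'S' diagonal, 'U' up, 'L' left

    $F[0][0] = 0;
    for my $i (1 .. $len_i) { $F[$i][0] = $i * $gep; $tb[$i][0] = 'U'; }
    for my $j (1 .. $len_j) { $F[0][$j] = $j * $gep; $tb[0][$j] = 'L'; }

    for my $i (1 .. $len_i) {
        for my $j (1 .. $len_j) {
            my $s = substr($seq_i, $i - 1, 1) eq substr($seq_j, $j - 1, 1)
                  ? $match : $mismatch;
            my ($best, $code) = ($F[$i-1][$j-1] + $s, 'S');
            if ($F[$i-1][$j] + $gep > $best) { ($best, $code) = ($F[$i-1][$j] + $gep, 'U'); }
            if ($F[$i][$j-1] + $gep > $best) { ($best, $code) = ($F[$i][$j-1] + $gep, 'L'); }
            ($F[$i][$j], $tb[$i][$j]) = ($best, $code);
        }
    }

    # Trace back from the bottom-right corner; columns come out last-to-first.
    my ($i, $j, $aln_i, $aln_j) = ($len_i, $len_j, '', '');
    while ($i > 0 || $j > 0) {
        my $code = $tb[$i][$j];
        if ($code eq 'S') {
            $aln_i .= substr($seq_i, --$i, 1);
            $aln_j .= substr($seq_j, --$j, 1);
        } elsif ($code eq 'U') {
            $aln_i .= substr($seq_i, --$i, 1);
            $aln_j .= '-';
        } else {
            $aln_i .= '-';
            $aln_j .= substr($seq_j, --$j, 1);
        }
    }
    return ($F[$len_i][$len_j], scalar reverse($aln_i), scalar reverse($aln_j));
}

my ($score, $top, $bottom) = nw_align('FAST', 'FAT', 2, -1, -1);
print "$score\n$top\n$bottom\n";   # 5, FAST, FA-T
```

Storing one code per cell keeps the trace-back simple, at the price of a second full table in memory.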

Page 20: Using Dynamic Programming To Align Sequences

Local Alignments Without Affine Gap Penalties

Smith and Waterman

Page 21: Using Dynamic Programming To Align Sequences

Getting rid of the pieces of Junk between the interesting bits

Smith and Waterman

Page 22: Using Dynamic Programming To Align Sequences

Page 23: Using Dynamic Programming To Align Sequences

The Smith and Waterman Algorithm

F(i,j) = best of:

F(i-1,j) + Gep            (1…i-1 vs 1…j, then column X over -)

F(i-1,j-1) + Mat[i,j]     (1…i-1 vs 1…j-1, then column X over X)

F(i,j-1) + Gep            (1…i vs 1…j-1, then column - over X)

0

Page 24: Using Dynamic Programming To Align Sequences

The Smith and Waterman Algorithm

The extra 0 means:

-Ignore the rest of the matrix

-Terminate a local alignment

Page 25: Using Dynamic Programming To Align Sequences

Filling Up a SW Matrix

Page 26: Using Dynamic Programming To Align Sequences

Filling up a SW matrix: borders

  *   -   A   N   I   C   E   C   A   T
  -   0   0   0   0   0   0   0   0   0
  C   0
  A   0
  T   0
  A   0
  N   0
  D   0
  O   0
  G   0

Easy: Local alignments NEVER start/end with a gap…

Page 27: Using Dynamic Programming To Align Sequences

Filling up a SW matrix

  *   -   A   N   I   C   E   C   A   T
  -   0   0   0   0   0   0   0   0   0
  C   0   0   0   0   2   0   2   0   0
  A   0   2   0   0   0   0   0   4   0
  T   0   0   0   0   0   0   0   2   6
  A   0   2   0   0   0   0   0   0   4
  N   0   0   4   2   0   0   0   0   2
  D   0   0   2   2   0   0   0   0   0
  O   0   0   0   0   0   0   0   0   0
  G   0   0   0   0   0   0   0   0   0

Best local score: 6 (row T, column T). This is the beginning of the trace-back.

Page 28: Using Dynamic Programming To Align Sequences

Turning NW into SW

    # Assumes row 0 and column 0 of $smat have already been set to 0.
    for ($i=1; $i<=$len0; $i++) {
        for ($j=1; $j<=$len1; $j++) {
            # Score for aligning residue i with residue j
            if ($res0[0][$i-1] eq $res1[0][$j-1]) {$s=2;}
            else {$s=-1;}

            # Candidates are read from the table that is being filled
            $sub=$smat[$i-1][$j-1]+$s;
            $del=$smat[$i  ][$j-1]+$gep;
            $ins=$smat[$i-1][$j  ]+$gep;

            # Keep the best candidate, or stop with 0 if none is positive
            if ($sub>$del && $sub>$ins && $sub>0)
                {$smat[$i][$j]=$sub;$tb[$i][$j]=$subcode;}
            elsif ($del>$ins && $del>0)
                {$smat[$i][$j]=$del;$tb[$i][$j]=$delcode;}
            elsif ($ins>0)
                {$smat[$i][$j]=$ins;$tb[$i][$j]=$inscode;}
            else
                {$smat[$i][$j]=$zero;$tb[$i][$j]=$stopcode;}

            # Prepare the trace-back: remember the best scoring cell
            if ($smat[$i][$j]>$best_score) {
                $best_score=$smat[$i][$j];
                $best_i=$i;
                $best_j=$j;
            }
        }
    }
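A complete local aligner along these lines might look like the sketch below. It is an illustration under assumptions, not the slide's code: the name `sw_align`, the string codes 'S', 'U', 'L' and 'STOP', and the reuse of Match=2, MisMatch=-1, Gap=-1 are choices made here. The trace-back starts at the best scoring cell and ends at the first STOP cell or border.

```perl
use strict;
use warnings;

# Local alignment with a linear gap penalty (illustrative sketch).
# Returns (best score, aligned piece of I, aligned piece of J).
sub sw_align {
    my ($seq_i, $seq_j, $match, $mismatch, $gep) = @_;
    my ($len_i, $len_j) = (length($seq_i), length($seq_j));
    my (@F, @tb);
    my ($best_score, $best_i, $best_j) = (0, 0, 0);

    # Borders are all 0: a local alignment never starts with a gap.
    for my $i (0 .. $len_i) { $F[$i][0] = 0; }
    for my $j (0 .. $len_j) { $F[0][$j] = 0; }

    for my $i (1 .. $len_i) {
        for my $j (1 .. $len_j) {
            my $s = substr($seq_i, $i - 1, 1) eq substr($seq_j, $j - 1, 1)
                  ? $match : $mismatch;
            my ($best, $code) = (0, 'STOP');   # the extra 0 state
            my @candidates = (
                [$F[$i-1][$j-1] + $s,   'S'],
                [$F[$i-1][$j]   + $gep, 'U'],
                [$F[$i][$j-1]   + $gep, 'L'],
            );
            for my $c (@candidates) {
                ($best, $code) = @$c if $c->[0] > $best;
            }
            ($F[$i][$j], $tb[$i][$j]) = ($best, $code);
            ($best_score, $best_i, $best_j) = ($best, $i, $j) if $best > $best_score;
        }
    }

    # Trace back from the best cell until a STOP cell or a border is reached.
    my ($i, $j, $aln_i, $aln_j) = ($best_i, $best_j, '', '');
    while ($i > 0 && $j > 0 && $tb[$i][$j] ne 'STOP') {
        my $code = $tb[$i][$j];
        if ($code eq 'S') {
            $aln_i .= substr($seq_i, --$i, 1);
            $aln_j .= substr($seq_j, --$j, 1);
        } elsif ($code eq 'U') {
            $aln_i .= substr($seq_i, --$i, 1);
            $aln_j .= '-';
        } else {
            $aln_i .= '-';
            $aln_j .= substr($seq_j, --$j, 1);
        }
    }
    return ($best_score, scalar reverse($aln_i), scalar reverse($aln_j));
}

my ($score, $top, $bottom) = sw_align('CATANDOG', 'ANICECAT', 2, -1, -1);
print "$score\n$top\n$bottom\n";   # 6, CAT, CAT
```

Compared with the global version, only three things change: the borders are 0, each cell has the extra 0 option, and the trace-back starts from the best cell instead of the corner.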

Page 29: Using Dynamic Programming To Align Sequences

A few things to remember

SW only works if the substitution matrix has been normalized to give a negative score to a random alignment.

Chance should not pay when it comes to local alignments!

Page 30: Using Dynamic Programming To Align Sequences

More than One match…

-SW delivers only the best scoring Match

-If you need more than one match:

-SIM (Huang and Miller)

Or

-Waterman and Eggert (Durbin, p91)

Page 31: Using Dynamic Programming To Align Sequences

Waterman and Eggert

-Iterative algorithm:

-1-identify the best match

-2-redo SW with used pairs forbidden

-3-finish when the last interesting local alignment has been extracted

-Delivers a collection of non-overlapping local alignments

-Avoids trivial variations of the optimal.

Page 32: Using Dynamic Programming To Align Sequences

Adding Affine Gap Penalties

The Gotoh Algorithm

Page 33: Using Dynamic Programming To Align Sequences

Forcing a bit of Biology into your alignment

The Gotoh Formulation

Page 34: Using Dynamic Programming To Align Sequences

Why Affine gap Penalties are Biologically better

[Figure: Cost against gap length L for an Affine Gap Penalty, with GOP and GEP marked]

Parsimony: Evolution takes the simplest path

(So We Think…)

Cost=gop+L*gep

Or Cost=gop+(L-1)*gep
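A small numerical sketch of the two cost formulas, showing why an affine penalty favours one long gap over several short ones. The function names and the values gop=-10, gep=-1 are illustrative assumptions, not taken from the slides.

```perl
use strict;
use warnings;

# Cost = gop + L*gep: the opening is charged on top of every extension.
sub gap_cost { my ($len, $gop, $gep) = @_; return $gop + $len * $gep; }

# Cost = gop + (L-1)*gep: the opening already pays for the first gapped position.
sub gap_cost_first { my ($len, $gop, $gep) = @_; return $gop + ($len - 1) * $gep; }

my ($gop, $gep) = (-10, -1);   # hypothetical penalties
printf "one gap of length 3:    %d\n", gap_cost(3, $gop, $gep);         # -13
printf "three gaps of length 1: %d\n", 3 * gap_cost(1, $gop, $gep);     # -33
printf "second convention, L=3: %d\n", gap_cost_first(3, $gop, $gep);   # -12
```

Whichever convention is chosen, the recurrences and the border initialisation have to use the same one.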

Page 35: Using Dynamic Programming To Align Sequences

But Harder To compute…

More Than 3 Ways to extend an Alignment

X-XX     ?
XXXX  +  ?

X over -    Deletion     (Opening or Extension)
X over X    Alignment
- over X    Insertion    (Opening or Extension)

Page 36: Using Dynamic Programming To Align Sequences

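More Questions Need to be asked

For instance, what is the cost of an insertion?

To finish the alignment of 1…I vs 1…J with an insertion column (- over X):

-if the preceding column is an aligned pair (X over X): GOP

-if the preceding column is already an insertion (- over X): GEP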

Page 37: Using Dynamic Programming To Align Sequences

Solution: Maintain 3 Tables

Ix: Table that contains the score of every optimal alignment 1…i vs 1…j that finishes with an Insertion in sequence X.

Iy: Table that contains the score of every optimal alignment 1…i vs 1…j that finishes with an Insertion in sequence Y.

M: Table that contains the score of every optimal alignment 1…i vs 1…j that finishes with an alignment between sequence X and Y.

Page 38: Using Dynamic Programming To Align Sequences

The Algorithm

M(i,j) = best of:

M(i-1,j-1) + Mat(i,j)

Ix(i-1,j-1) + Mat(i,j)

Iy(i-1,j-1) + Mat(i,j)

(1…i-1 vs 1…j-1, then column X over X)

Ix(i,j) = best of:

M(i-1,j) + gop

Ix(i-1,j) + gep

(1…i-1 vs 1…j, then column X over -)

Iy(i,j) = best of:

M(i,j-1) + gop

Iy(i,j-1) + gep

(1…i vs 1…j-1, then column - over X)
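The three recurrences can be sketched as a score-only routine. Several choices here are assumptions made for the illustration: the name `gotoh_score`, a match/mismatch score in place of Mat(i,j), a large negative number standing in for "impossible" cells, and the convention that a gap of length L costs gop+(L-1)*gep, which is what the recurrences above imply when the opening column pays gop alone.

```perl
use strict;
use warnings;

# Largest value of a list.
sub best { my $m = shift; for my $v (@_) { $m = $v if $v > $m; } return $m; }

# Global alignment score with affine gaps, using three tables (illustrative sketch).
sub gotoh_score {
    my ($seq_i, $seq_j, $match, $mismatch, $gop, $gep) = @_;
    my ($len_i, $len_j) = (length($seq_i), length($seq_j));
    my $NEG = -1e9;   # stands for a state that cannot be reached
    my (@M, @Ix, @Iy);

    for my $i (0 .. $len_i) {
        for my $j (0 .. $len_j) {
            ($M[$i][$j], $Ix[$i][$j], $Iy[$i][$j]) = ($NEG, $NEG, $NEG);
        }
    }
    # Borders: only gaps are possible, opened once and then extended.
    $M[0][0] = 0;
    for my $i (1 .. $len_i) { $Ix[$i][0] = $gop + ($i - 1) * $gep; }
    for my $j (1 .. $len_j) { $Iy[0][$j] = $gop + ($j - 1) * $gep; }

    for my $i (1 .. $len_i) {
        for my $j (1 .. $len_j) {
            my $s = substr($seq_i, $i - 1, 1) eq substr($seq_j, $j - 1, 1)
                  ? $match : $mismatch;
            $M[$i][$j]  = best($M[$i-1][$j-1], $Ix[$i-1][$j-1], $Iy[$i-1][$j-1]) + $s;
            $Ix[$i][$j] = best($M[$i-1][$j] + $gop, $Ix[$i-1][$j] + $gep);
            $Iy[$i][$j] = best($M[$i][$j-1] + $gop, $Iy[$i][$j-1] + $gep);
        }
    }
    # The optimal alignment may end in any of the three states.
    return best($M[$len_i][$len_j], $Ix[$len_i][$len_j], $Iy[$len_i][$len_j]);
}

print gotoh_score('FAST', 'FAT', 2, -1, -3, -1), "\n";   # 3: FAST over FA-T
```

With gop=-3 the best alignment of FAST and FAT is still FAST over FA-T, scoring 2+2-3+2.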

Page 39: Using Dynamic Programming To Align Sequences

FAQ: Why isn’t One table Enough?

In each cell we could remember whether the optimal sub-alignment finishes with a Match or a Gap?

If best(i,j) = Ix[i,j], we have no guarantee that Ix[i,j] is a part of A[L,M], the complete optimal alignment.

The optimal alignment may go through Iy[i,j] instead, even if Ix[i,j]>Iy[i,j]

IT WOULD BE GREEDY!

Page 40: Using Dynamic Programming To Align Sequences

Trace-back?

Tables: M, Ix, Iy

Start from the BEST of M(i,j), Ix(i,j), Iy(i,j)

Page 41: Using Dynamic Programming To Align Sequences

Trace-back?

Tables: M, Ix, Iy

Navigate from one table to the next, knowing that a gap always finishes with an aligned column…

Page 42: Using Dynamic Programming To Align Sequences

Going Further ?

With the affine gap penalties, we have increased the number of possibilities when building our alignment.

Computer scientists talk of states and represent this as a Finite State Automaton (FSAs are cousins of HMMs)

Page 43: Using Dynamic Programming To Align Sequences

Going Further ?

Page 44: Using Dynamic Programming To Align Sequences

Going Further ?

In Theory, there is no Limit on the number of states one may consider when doing such a computation.

Page 45: Using Dynamic Programming To Align Sequences

Page 46: Using Dynamic Programming To Align Sequences

Going Further ?

Imagine a pairwise alignment algorithm where the gap penalty depends on the length of the gap.

Can you simplify it realistically so that it can be efficiently implemented?

Page 47: Using Dynamic Programming To Align Sequences

[Figure labels: Lx, Ly]

Page 48: Using Dynamic Programming To Align Sequences

A divide and Conquer Strategy

The Myers and Miller Strategy

Page 49: Using Dynamic Programming To Align Sequences

Remember Not To Run Out of Memory

The Myers and Miller Strategy

Page 50: Using Dynamic Programming To Align Sequences

A Score in Linear Space

You never Need More Than The Previous Row To Compute the optimal score

Page 51: Using Dynamic Programming To Align Sequences

A Score in Linear Space

R1: previous row, R2: current row

For i
    For j
        R2[j] = best of:
            R2[j-1] + gep
            R1[j-1] + mat
            R1[j] + gep
    For j, R1[j] = R2[j]
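The two-row scheme can be written out as the sketch below, where `@prev` plays the role of R1 and `@cur` the role of R2. The function name `nw_score_linear` and the match/mismatch scoring are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Global alignment score using two rows instead of the full table (illustrative sketch).
sub nw_score_linear {
    my ($seq_i, $seq_j, $match, $mismatch, $gep) = @_;
    my ($len_i, $len_j) = (length($seq_i), length($seq_j));

    # R1: border row, the empty prefix of I against every prefix of J.
    my @prev = map { $_ * $gep } 0 .. $len_j;

    for my $i (1 .. $len_i) {
        my @cur = ($i * $gep);   # R2 starts with its border cell
        for my $j (1 .. $len_j) {
            my $s = substr($seq_i, $i - 1, 1) eq substr($seq_j, $j - 1, 1)
                  ? $match : $mismatch;
            my $best = $prev[$j-1] + $s;                              # R1[j-1] + mat
            $best = $prev[$j]  + $gep if $prev[$j]  + $gep > $best;   # R1[j]   + gep
            $best = $cur[$j-1] + $gep if $cur[$j-1] + $gep > $best;   # R2[j-1] + gep
            $cur[$j] = $best;
        }
        @prev = @cur;   # R1 = R2
    }
    return $prev[$len_j];
}

print nw_score_linear('FAST', 'FAT', 2, -1, -1), "\n";   # 5, same as the full table
```

Memory now grows with the length of one sequence only, but the trace-back information is gone, which is the problem the divide and conquer strategy addresses.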

Page 52: Using Dynamic Programming To Align Sequences

A Score in Linear Space

Page 53: Using Dynamic Programming To Align Sequences

A Score in Linear Space

You never Need More Than The Previous Row To Compute the optimal score

You only need the matrix for the Trace-Back,

Or do you ????

Page 54: Using Dynamic Programming To Align Sequences

An Alignment in Linear Space

Forward Algorithm

F(i,j) = Optimal score of 0…i Vs 0…j

Backward algorithm

B(i,j) = Optimal score of M…i Vs N…j

B(i,j)+F(i,j) = Optimal score of the alignment that passes through pair i,j
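A sketch of how the forward and backward scores locate a point of the optimal alignment using linear space only. The helper names `last_row` and `best_split` are hypothetical, the backward pass is obtained by running the forward pass on the reversed sequences, and the cut is placed between residues rather than on a residue pair, which is one possible reading of the definitions above.

```perl
use strict;
use warnings;

# Forward scores of all of $seq_i against every prefix of $seq_j, in linear space.
sub last_row {
    my ($seq_i, $seq_j, $match, $mismatch, $gep) = @_;
    my @prev = map { $_ * $gep } 0 .. length($seq_j);
    for my $i (1 .. length($seq_i)) {
        my @cur = ($i * $gep);
        for my $j (1 .. length($seq_j)) {
            my $s = substr($seq_i, $i - 1, 1) eq substr($seq_j, $j - 1, 1)
                  ? $match : $mismatch;
            my $best = $prev[$j-1] + $s;
            $best = $prev[$j]  + $gep if $prev[$j]  + $gep > $best;
            $best = $cur[$j-1] + $gep if $cur[$j-1] + $gep > $best;
            $cur[$j] = $best;
        }
        @prev = @cur;
    }
    return @prev;
}

# Cut sequence I in the middle and find where the optimal alignment crosses that row.
sub best_split {
    my ($seq_i, $seq_j, $match, $mismatch, $gep) = @_;
    my $mid   = int(length($seq_i) / 2);
    my $len_j = length($seq_j);

    my @fwd = last_row(substr($seq_i, 0, $mid), $seq_j, $match, $mismatch, $gep);
    # Backward scores: the same computation on the reversed suffixes.
    my @bwd = last_row(scalar reverse(substr($seq_i, $mid)),
                       scalar reverse($seq_j), $match, $mismatch, $gep);

    my ($best, $best_j);
    for my $j (0 .. $len_j) {
        my $total = $fwd[$j] + $bwd[$len_j - $j];   # F + B at the cut (mid, j)
        ($best, $best_j) = ($total, $j) if !defined $best || $total > $best;
    }
    return ($mid, $best_j, $best);
}

my ($i, $j, $score) = best_split('FAST', 'FAT', 2, -1, -1);
print "cut at i=$i, j=$j, score $score\n";   # cut at i=2, j=2, score 5
```

The best total over the middle row equals the optimal global score, and the two halves on either side of the cut can then be treated recursively in the same way.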

Page 55: Using Dynamic Programming To Align Sequences

An Alignment in Linear Space

[Figure labels: Forward Algorithm, Backward algorithm, Optimal B(i,j)+F(i,j)]

Page 56: Using Dynamic Programming To Align Sequences

Page 57: Using Dynamic Programming To Align Sequences

An Alignment in Linear Space

Backward algorithm

Forward Algorithm

Recursive divide and conquer strategy: Myers and Miller (Durbin p35)

Page 58: Using Dynamic Programming To Align Sequences

An Alignment in Linear Space

Page 59: Using Dynamic Programming To Align Sequences

A Forward-only Strategy (Durbin, p35)

Forward Algorithm

-Keep Row M in memory

-Keep track of which Cell in Row M leads to the optimal score

-Divide on this cell

Page 60: Using Dynamic Programming To Align Sequences

Page 61: Using Dynamic Programming To Align Sequences

An interesting application: finding sub-optimal alignments

[Figure labels: Forward Algorithm, Backward algorithm]

Sum the Forward and Backward scores to identify the score of the best alignment going through cell i,j

Page 62: Using Dynamic Programming To Align Sequences

Application: Non-local models

Double Dynamic Programming

Page 63: Using Dynamic Programming To Align Sequences

Outline

The main limitation of DP: Context independent measure

Page 64: Using Dynamic Programming To Align Sequences

Double Dynamic Programming

High Level Smith and Waterman Dynamic Programming

Score = Max of:

S(i-1, j-1) + RMSd score

S(i-1, j) + gp

S(i, j-1) + gp

RMSd Score: Rigid Body Superposition where i and j are forced together

Page 65: Using Dynamic Programming To Align Sequences

Double Dynamic Programming

Page 66: Using Dynamic Programming To Align Sequences

Application: Repeats

The Durbin Algorithm

Page 67: Using Dynamic Programming To Align Sequences

Page 68: Using Dynamic Programming To Align Sequences

In The End: Wrapping it Up

Page 69: Using Dynamic Programming To Align Sequences

Dynamic Programming

Needleman and Wunsch: Delivers the best scoring global alignment

Smith and Waterman: NW with an extra state 0

Affine Gap Penalties: Making DP more realistic

Page 70: Using Dynamic Programming To Align Sequences

Dynamic Programming

Linear space: Using Divide and Conquer Strategies Not to run out of memory

Double Dynamic Programming, repeat extraction: DP can easily be adapted to a special need