alignment ii dynamic programming. 2 pair-wise sequence alignments a: c a t - t c a - c | | | | | b:...

Alignment IIDynamic Programming

Pair-wise sequence alignments

A: C A T - T C A - C

| | | | |

B: C - T C G C A G C

Idea: Display one sequence above another with spaces inserted in both to reveal similarity

Two types of alignment

S = CTGTCGCTGCACGT = TGCCGTG

CTGTCGCTGCACG---------TGC-CGTG

CTGTCG-CTGCACG

-TGC-CG-TG----

Global alignment Local alignment

Global alignment: ScoringCTGTCG-CTGCACG

-TGC-CG-TG----

Reward for matches: Mismatch penalty: Space penalty:

score(A) = w – x - y

w = #matches x = #mismatchesy = #spaces

Global alignment: Scoring

C T G T C G – C T G C - T G C – C G – T G -

-5 10 10 -2 -5 -2 -5 -5 10 10 -5

Total = 11

Reward for matches:10

Mismatch penalty: 2Space penalty: 5

Optimum Alignment

• The score of an alignment is a measure of its quality

• Optimum alignment problem: Given a pair of sequences X and Y, find an alignment (global or local) with maximum score

• The similarity between X and Y, denoted sim(X,Y), is the maximum score of an alignment of X and Y

Alignment algorithms

• Global: Needleman-Wunsch• Local: Smith-Waterman• NW and SW use dynamic

programming• Variations:

– Gap penalty functions– Scoring matrices

Global Alignment: Algorithm

1..j1..i T and S of alignment optimum of Cost),( jiC

T of jlength of Prefix

S of i length of Prefix

)1j,i(C

)j,1i(C

)T,S(w)1j,1i(C

max)j,i(Cji

j)j,0(Ci)0,i(C

Initial conditions:

Recurrence relation: For 1 i n, 1 j m:

Theorem. C(i,j) satisfies the following relationships:

Justification

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj-1 Tj

C(i-1,j-1) + w(Si,Tj)

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj —

C(i-1,j)

S1 S2 . . . Si —

T1 T2 . . . Tj-1 Tj

C(i,j-1)

Example

Case 1: Line up Si with Tj

S: C A T T C A C T: C - T T C A G

i - 1 i

S: C A T T C A - C T: C - T T C A G -

Case 2: Line up Si with spacei - 1 i

S: C A T T C A C - T: C - T T C A - G

Case 3: Line up Tj with spacei

Computation Procedure

C(n,m)

C(0,0)

C(i,j)

)1j,i(C,)j,1i(C),T,S(w)1j,1i(Cmax)j,i(C ji

C(i-1,j)C(i-1,j-1)

C(i,j-1)

λ C T C G C A G C

+10 for match, -2 for mismatch, -5 for space

0 -5 -10 -15 -20 -25 -30 -35 -40

-5 10 5 0 -5 -10 -15 -20 -25

-10 5 8 3 -2 -7 0 -5 -10

-15 0 15 10 5 0 -5 -2 -7

-20 -5 10 13 8 3 -2 -7 -4

-25 -10 5 20 15 18 13 8 3

-30 -15 0 15 18 13 28 23 18

-35 -20 -5 10 13 28 23 26 33

λ C T C G C A G C

Traceback can yield both optimum alignments

End-gap free alignment

• Gaps at the start or end of alignment are not penalized

Best global Best end-gap free

Match: +2 Mismatch and space: -1

Score = 1 Score = 9

Motivation: Shotgun assembly

• Shotgun assembly produces large set of partially overlapping subsequences from many copies of one unknown DNA sequence.

• Problem: Use the overlapping sections to ”paste” the subsequences together.

• Overlapping pairs will have low global alignment score, but high end-space free score because of overlap.

Motivation: Shotgun assembly

Algorithm

• Same as global alignment, except:– Initialize with zeros (free gaps at

start)– Locate max in the last row/column

(free gaps at end)

10 5 10 5 10 5 0 10

λ C T C G C A G C

+10 for match, -2 for mismatch, -5 for gap

0 0 0 0 0 0 0 0 0

5 8 5 8 5 20 15 10

0 15 10 5 6 15 18 13

-2 10 13 8 3 10 13 16

10 5 20 15 18 13 8 23 5 8 15 18 13 28 23 18

0 3 10 25 20 23 38 33

Local Alignment: Motivation

• Ignoring stretches of non-coding DNA:– Non-coding regions are more likely to be

subjected to mutations than coding regions.– Local alignment between two sequences is

likely to be between two exons.

• Locating protein domains:– Proteins of different kind and of different

species often exhibit local similarities– Local similarities may indicate ”functional

subunits”.

Local alignment: Example

Best local alignment:

Match: +2 Mismatch and space: -1

Score = 5

S = g g t c t g a gT = a a a c g a

g g t c t g a ga a a c – g a -

Local Alignment: Algorithm

Initialize top row and leftmost column to zero.

,]1,1[

jtisscorejiC

C [i, j] = Score of optimally aligning a suffix of s with a suffix of t.

0 0 0 0 0 0 0 0 0

0 1 0 1 0 1 0 0 1

0 0 0 0 0 0 2 0 0

0 0 1 0 0 0 0 1 0

0 0 1 0 0 0 0 0 0

0 1 0 2 0 1 0 0 1

0 0 0 0 1 0 2 0 0

0 1 0 1 0 2 0 1 1

λ C T C G C A G C

+1 for a match, -1 for a mismatch, -5 for a space

Some Results

• Most pairwise sequence alignment problems can be solved in O(mn) time.

• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88].

• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86].

Reducing space requirements

• O(mn) tables are often the limiting factor in computing large alignments

• There is a linear space technique that only doubles the time required [Hirschberg77]

0 10 5 10 5 10 5 0 10

λ C T C G C A G C

IDEA: We only need the previous row to calculate the next

0 0 0 0 0 0 0 0 0λ

0 5 8 5 8 5 20 15 10

Linear-space Alignments

mn + ½ mn + ¼ mn + 1/8 mn + 1/16 mn + … = 2 mn

Affine Gap Penalty Functions

Gap penalty = h + gk

k = length of gaph = gap opening penaltyg = gap continuation penalty

Can also be solved in O(nm) time using dynamic programming

alignment ii dynamic programming. 2 pair-wise sequence alignments a: c a t - t c a - c | | | | | b:...

c t t c

c t t c

t t c

c t c g c

c t c g c

c traceback

ctgtcgctgcacg t

t j ci

Documents

how is the amino acid sequence determined? the mrna each...

s e i s m i c - dpnet.org.np e i s m i...

a remarkable q,t-catalan sequence and q - mathematics

centrforintegrative bioinformaticsvu e [1] 06-11-2006...

scope and sequence - amazon s3 · 60 system b scope and...

a g c c g a t g t c g t c g t c t a c c a ...

testbankcollege.eutestbankcollege.eu/sample/test-bank-statistics-for... ·...

warm-up: 5 minutes 1) give the complementary dna sequence...

what comes next here is a sequence of letters: o t t f f s...

pairwisesequence...

c e n t r f o r i n t e g r a t i v e b i o i n f o r m a t...

the susa sequence-3000-2000 b. c

multiple sequence alignment benchmarking, pattern...

c e n t r f o r i n t e g r a t i v e b i o i n f o r m a t...

identification of point mutation resulting heat-labile...

richard deem, paradoxes class, march 16, 2014. nucleus a a t...

c e n t r f o r i n t e g r a t i v e b i o i n f o r m a t...

{ perturbed chebyshev polynomials { richard brak ·...

creating multiple sequence alignments - plant … · web...

contentssolution of ricci flow with surgery (rfs) is a...