pairwise alignmentstaffweb.ncnu.edu.tw/shieng/pairwise_alignment.pdf · 2003. 9. 25. · pairwise...

Pairwise Alignment

Guan-Shieng Huang

shieng@ncnu.edu.tw

Dept. of CSIE, NCNU


Approach

1. Problem definition

2. Computational method (algorithms)

3. Complexity and performance

Motivations

• Reconstructing long sequences of DNA from overlapping sequence fragments

• Determining physical and genetic maps from probe data under various experiment protocols

• Database searching

• Comparing two or more sequences for similarities

• Protein structure prediction (building profiles)

• Comparing the same gene sequenced by two different labs

Similarity & Difference

1. Common Ancestor Assumption

2. Mutation:
   (a) substitution (transition, transversion)
   (b) deletion
   (c) insertion

We use indel to refer to deletion or insertion.

What is the difference between acctga and agcta?

acctga
agctga
agct-a

Key Issues

1. notion of similarity/difference

2. the scoring system used to rank alignments

3. the algorithm used to find optimal scoringalignment

4. the statistical method used to evaluate thesignificance of an alignment score

Edit Distance

Measure similarity by

1. substitution: −1

2. indel: −2

3. match: +1

 a  c  c  t  g  a
 a  g  c  t  -  a
 1 -1  1  1 -2  1  = 1

 a  c  c  t  g  a
 a  -  g  c  t  a
 1 -2 -1 -1 -1  1  = −3

 a  c  c  t  g  a
 -  a  g  c  t  a
-2 -1 -1 -1 -1  1  = −5
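The column-by-column scoring used in these three examples can be sketched in a few lines. This is a minimal illustration, assuming an alignment is given as two equal-length strings with `-` for a space; the function name `score_alignment` and its parameter names are hypothetical, not from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the column scores of an alignment given as two gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two spaces")
        if a == "-" or b == "-":
            total += indel       # insertion or deletion
        elif a == b:
            total += match
        else:
            total += mismatch    # substitution
    return total


# The three alignments of acctga and agcta shown above.
for bottom in ("agct-a", "a-gcta", "-agcta"):
    print(bottom, score_alignment("acctga", bottom))
```

The three calls print 1, −3 and −5, matching the totals on the slide.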

x: x1x2x3 . . . xm

y: y1y2y3 . . . yn

Alphabet:

• Σ = {A, G, C, T} for DNA sequences

• Σ = {A, G, C, U} for RNA sequences

• Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} for proteins

s(a, b): the score to substitute a by b

s(a,−): delete a

s(−, b): insert b

Nomenclature

BIOLOGY            COMPUTER SCIENCE
sequence           string, word
subsequence        substring (contiguous)
N/A                subsequence
N/A                exact matching
alignment          inexact matching

Algorithm for Pairwise Alignment

To find the best alignment (with the highest score) through

• Brute-force

• Dynamic programming

Brute-force Algorithm

Try all possible alignments of x and y.

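Let F(m, n) be the number of alignments of x1 . . . xm and y1 . . . yn. Then

F(m, n) = F(m − 1, n) + F(m, n − 1) + F(m − 1, n − 1)

Dropping the last term gives a lower bound C(m, n):

C(m, n) = C(m − 1, n) + C(m, n − 1), with C(m, 0) = C(0, n) = 1

$$
\therefore\; F(m, n) \ge C(m, n) = \binom{m+n}{m}, \qquad \text{and for } m = n: \quad \binom{2n}{n} \approx \frac{2^{2n}}{\sqrt{\pi n}}.
$$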
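To get a feel for how quickly the brute-force search space grows, the two recurrences can be evaluated directly. This is a rough sketch that assumes the boundary values F(m, 0) = F(0, n) = 1 and C(m, 0) = C(0, n) = 1 (one way to align a sequence with the empty sequence); the function names are hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """F(m, n) from the three-term recurrence, assuming F(m, 0) = F(0, n) = 1."""
    if m == 0 or n == 0:
        return 1
    return (count_alignments(m - 1, n)
            + count_alignments(m, n - 1)
            + count_alignments(m - 1, n - 1))


def count_lower_bound(m, n):
    """C(m, n) from the two-term recurrence with the same boundary values.

    Under that assumption C(m, n) is the binomial coefficient (m + n choose m).
    """
    return comb(m + n, m)


for size in (2, 5, 10, 20):
    print(size, count_lower_bound(size, size), count_alignments(size, size))
```

Even for two sequences of length 20 the count is far beyond what can be enumerated, which is the motivation for dynamic programming.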

Dynamic Programming Approach

F(i, j): the score of the best alignment between x1 . . . xi and y1 . . . yj.

$$
F(i, j) = \max \begin{cases}
F(i-1, j-1) + 1, & x_i = y_j \text{ (match)} \\
F(i-1, j-1) - 1, & x_i \neq y_j \text{ (substitution)} \\
F(i-1, j) - 2, & \text{align } x_i \text{ with a gap} \\
F(i, j-1) - 2, & \text{align } y_j \text{ with a gap}
\end{cases}
$$

x1 x2 . . . xi−1  xi
y1 y2 . . . yj−1  yj     ⇒ F(i − 1, j − 1) + s(xi, yj)

x1 x2 . . . xi−1  xi
y1 y2 . . . yj    −      ⇒ F(i − 1, j) − d

x1 x2 . . . xi    −
y1 y2 . . . yj−1  yj     ⇒ F(i, j − 1) − d

Alignment Graph

[Figure: the cell F(i, j) is reached from F(i − 1, j − 1) with +s(xi, yj), from F(i − 1, j) with −d, and from F(i, j − 1) with −d.]

Initial value:

F (0, 0) = 0, F (0, j) = −jd, F (i, 0) = −id.
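The recurrence and its initial values translate almost line for line into a table-filling loop. The following is a minimal sketch using the scores from the Edit Distance slide (match +1, substitution −1, d = 2); the name `nw_matrix` and the list-of-lists layout are choices made here for illustration.

```python
def nw_matrix(x, y, match=1, mismatch=-1, d=2):
    """Fill the global-alignment matrix F, with F[i][j] for x[:i] and y[:j]."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d                      # x[:i] aligned against spaces only
    for j in range(1, n + 1):
        F[0][j] = -j * d                      # y[:j] aligned against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] - d,       # align x_i with a gap
                          F[i][j - 1] - d)       # align y_j with a gap
    return F


# Rows indexed by agcta and columns by acctga, as in the example that follows.
for row in nw_matrix("agcta", "acctga"):
    print(row)
```

Both loops together touch each of the (m + 1)(n + 1) cells once, which is where the O(mn) time bound comes from.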

Example

       -    a    c    c    t    g    a
  -    0   -2   -4   -6   -8  -10  -12
  a   -2    1   -1   -3   -5   -7   -9
  g   -4   -1    0   -2   -4   -4   -6
  c   -6   -3    0    1   -1   -3   -5
  t   -8   -5   -2   -1    2    0   -2
  a  -10   -7   -4   -3    0    1    1

Backtrace

Following the maximizing choices back from F(5, 6) = 1 to F(0, 0) in the same matrix visits the cells (5, 6), (4, 5), (4, 4), (3, 3), (2, 2), (1, 1), (0, 0), written as (row, column), and yields

a c c t g a
a g c t - a
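The backtrace can be sketched as a walk from F(m, n) back to F(0, 0) that, at each cell, re-checks which of the three predecessors produced the stored value. This is one possible implementation, not necessarily the one behind the slides; ties are broken here in the order diagonal, up, left, and `global_align` is a hypothetical name.

```python
def global_align(x, y, match=1, mismatch=-1, d=2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:      # x_i aligned with y_j
                top.append(x[i - 1])
                bottom.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:    # x_i aligned with a gap
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:                                       # y_j aligned with a gap
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("acctga", "agcta"))
```

Because the walk needs the whole matrix, this version uses O(mn) space.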

Complexity

1. time = O(mn)

2. space = O(mn) if we need to find out the optimal alignment

Space becomes the more serious problem when m and n are very large.

Linear-space Alignment Algorithm

B(i, j): the best alignment score of the suffixes xm−i+1 . . . xm and yn−j+1 . . . yn

F(i, j): forward matrix, B(i, j): backward matrix

Then

$$
F(m, n) = \max_{0 \le k \le n} \left\{ F\!\left(\tfrac{m}{2}, k\right) + B\!\left(\tfrac{m}{2}, n - k\right) \right\}.
$$

Algorithm

1. Compute F while saving the (m/2)-th row.

2. Compute B while saving the (m/2)-th row.

3. Find the column k∗ such that F(m/2, k∗) + B(m/2, n − k∗) = F(m, n).

4. Recursively partition the problem into two sub-problems:

(a) Find the path from (0, 0) to (m/2, k∗).

(b) Find the path from (m/2, k∗) to (m, n).
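The four steps can be sketched as a short recursive routine (this divide-and-conquer scheme is commonly known as Hirschberg's algorithm). This is a simplified illustration under the match/substitution/indel scores used earlier. The names `last_row` and `align_linear_space` are hypothetical, the split uses ⌊m/2⌋ and ⌈m/2⌉, and string slices are copied for brevity rather than passed as index ranges.

```python
MATCH, MISMATCH, D = 1, -1, 2   # match, substitution, indel penalty d


def last_row(x, y):
    """Return F(len(x), j) for j = 0..len(y), keeping only two rows in memory."""
    prev = [-j * D for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * D]
        for j in range(1, len(y) + 1):
            s = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + s, prev[j] - D, cur[j - 1] - D))
        prev = cur
    return prev


def align_linear_space(x, y):
    """Return one optimal global alignment of x and y as two gapped strings."""
    if not x:
        return "-" * len(y), y
    if not y:
        return x, "-" * len(x)
    if len(x) == 1:
        # Base case: pair the single letter with its best partner in y,
        # unless leaving it unpaired scores higher.
        scores = [MATCH if c == x else MISMATCH for c in y]
        k = max(range(len(y)), key=scores.__getitem__)
        if scores[k] >= -2 * D:
            return "-" * k + x + "-" * (len(y) - k - 1), y
        return x + "-" * len(y), "-" + y
    mid = len(x) // 2
    fwd = last_row(x[:mid], y)                  # step 1: middle row of F
    bwd = last_row(x[mid:][::-1], y[::-1])      # step 2: middle row of B
    n = len(y)
    k = max(range(n + 1), key=lambda j: fwd[j] + bwd[n - j])   # step 3: k*
    left = align_linear_space(x[:mid], y[:k])   # step 4(a)
    right = align_linear_space(x[mid:], y[k:])  # step 4(b)
    return left[0] + right[0], left[1] + right[1]


print(align_linear_space("agcta", "acctga"))
```

Only two rows of length n + 1 are alive at any time inside `last_row`, so the working space stays linear in the sequence lengths.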

       -    a    c    c    t    g    a
  -    0   -2   -4   -6   -8  -10  -12
  a   -2    1   -1   -3   -5   -7   -9
  g   -4   -1    0   -2   -4   -4   -6
  c   -6   -3    0    1   -1   -3   -5
  t   -8   -5   -2   -1    2    0   -2
  a  -10   -7   -4   -3    0    1    1
(F(i, j) matrix)

       -    a    g    t    c    c    a
  -    0   -2   -4   -6   -8  -10  -12
  a   -2    1   -1   -3   -5   -7   -9
  t   -4   -1    0    0   -2   -4   -6
  c   -6   -3   -2   -1    1   -1   -3
  g   -8   -5   -2   -3   -1    0   -2
  a  -10   -7   -4   -3   -3   -2    1
(B(i, j) matrix, computed on the reversed sequences)

In this example the rows index agcta, which is split after its second letter, so row 2 of F and row 3 of B are saved:

Row 2 of F:   -4   -1    0   -2   -4   -4   -6
Row 3 of B:   -6   -3   -2   -1    1   -1   -3

Find k∗ such that F(m/2, k∗) + B(m/2, n − k∗) = F(m, n).

In this case, F(m, n) = 1 and k∗ = 2.

Hence, the best alignment of (acctga, agcta) is the concatenation of the best alignments of (ac, ag) and (ctga, cta).
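As a check on the split point, the sums can be tabulated for every k, reading the g row of the F matrix (row 2) and the c row of the B matrix (row 3) from the matrices above. Only k = 2 attains F(m, n) = 1:

$$
\begin{array}{l|rrrrrrr}
k & 0 & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline
F(2, k) & -4 & -1 & 0 & -2 & -4 & -4 & -6 \\
B(3, 6 - k) & -3 & -1 & 1 & -1 & -2 & -3 & -6 \\ \hline
\text{sum} & -7 & -2 & 1 & -3 & -6 & -7 & -12
\end{array}
$$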

Analysis of Complexity

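Clearly, the required space is O(min(m, n)). For time complexity, let T(m, n) be the time bound of the algorithm. Hence, we have

$$
T(m, n) = T(\lfloor m/2 \rfloor, k) + T(\lceil m/2 \rceil, n - k) + O(mn)
$$

for some k. Ignoring the rounding and writing the O(mn) term as cmn,

$$
T(m, n) = T(m/2, k) + T(m/2, n - k) + cmn
$$

for some k. Suppose T(m, n) = αmn; then the right-hand side becomes

$$
\alpha \cdot \frac{m}{2} \cdot k + \alpha \cdot \frac{m}{2} \cdot (n - k) + cmn = \frac{\alpha mn}{2} + cmn.
$$

Let α = 2c; then it equals the left-hand side, so T(m, n) = 2cmn = O(mn).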

For more information on linear-space algorithms in pairwise alignment, see Chao, K. M., Hardison, R. C., and Miller, W. 1994. Recent developments in linear-space alignment methods: a survey. Journal of Computational Biology, 1:271–291.

Revisiting Dynamic Programming

• Principle of optimality

• Recurrence

• Bottom up

Substitution matrices

• Suppose we have two models:
  1. random model
  2. match model

• Given any two aligned sequences
  x = x1 x2 . . . xn
  y = y1 y2 . . . yn
  where xi is aligned with yi.

• In the random model R, we suppose each letter a occurs independently with some frequency qa. Hence,

$$
\Pr(x, y \mid R) = \prod_i q_{x_i} q_{y_i}.
$$

• In the match model M, letters a and b are aligned with joint probability pab. Suppose residues a and b have been derived independently from some unknown residue c. Hence,

$$
\Pr(x, y \mid M) = \prod_i p_{x_i y_i}.
$$

• Define the odds ratio as

$$
\frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} q_{y_i}}.
$$

• The log-odds ratio:

$$
S = \sum_i s(x_i, y_i) \quad \text{where} \quad s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right).
$$

• S > 0 means that x, y are more likely to be an instanceof the match model. (Maximum Likelihood)
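A small numeric sketch may help make the log-odds score concrete. The two-letter alphabet and the values of pab and qa below are invented toy numbers, not taken from any real substitution matrix, and base-2 logarithms are an assumption (the slides do not fix the base).

```python
import math


def log_odds_matrix(p, q):
    """s(a, b) = log2(p_ab / (q_a * q_b)) for every pair present in p."""
    return {(a, b): math.log2(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}


def log_odds_score(x, y, s):
    """S = sum of s(x_i, y_i) over the columns of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned position by position")
    return sum(s[a, b] for a, b in zip(x, y))


# Toy model (hypothetical numbers): identical letters pair up more often
# than chance, different letters less often.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
s = log_odds_matrix(p, q)

print(s["A", "A"], s["A", "B"])
print(log_odds_score("AAB", "AAB", s), log_odds_score("AB", "BA", s))
```

With these numbers identical pairs get a positive score and mixed pairs a negative one, so S > 0 for the first pair of sequences and S < 0 for the second.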

• BLOSUM & PAM matrices for proteins

PAM matrices

1. Dayhoff, Schwartz, Orcutt (1978)

2. The most widely used matrix is PAM250.

BLOSUM Matrices

1. Henikoff & Henikoff (1992)

2. Derived from a set of aligned, ungapped regions from protein families called the BLOCKS database.

3. BLOSUM62 is the standard for ungapped matching.

4. BLOSUM50 is better for alignment with gaps.

BLOSUM50

Pairwise Alignment Problems

1. Global alignment (Needleman & Wunsch,1970)

2. Local alignment (Smith-Waterman, 1981)

3. End-space free alignment

4. Gap penalty

The version in current use is due to Gotoh (1982).

Global Alignment

Given two sequences x and y, what is the maximum similarity between them? Find a best alignment.

Local Alignment

Given two sequences x and y, what is the maximum similarity between a subsequence of x and a subsequence of y? Find the most similar subsequences.

End-space Free Alignment

Global Alignment

$$
F(i, j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d,
\end{cases}
$$

with initial value

F (0, 0) = 0, F (0, j) = −jd, F (i, 0) = −id.

And F (m,n) is the score.

Example

Local Alignment

Motivation:

• Ignore stretches of non-coding DNA.

• Protein domains

Local Alignment

$$
F(i, j) = \max \begin{cases}
0, \\
F(i-1, j-1) + s(x_i, y_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d,
\end{cases}
$$

with initial value F(0, 0) = F(0, j) = F(i, 0) = 0. The 0 option lets an alignment start afresh at any cell. The highest value of F(i, j) over the whole matrix is the score.
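A compact sketch of local alignment along these lines (the Smith-Waterman scheme) is given below. It assumes the simple match/substitution scores from the earlier slides in place of a general s(a, b), floors every cell at 0, and traces back from the highest cell until a 0 is reached; `local_align` is a hypothetical name.

```python
def local_align(x, y, match=1, mismatch=-1, d=2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Trace back from the highest cell until the score drops to 0.
    top, bottom = [], []
    i, j = bi, bj
    while F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("ttacgtt", "ggacgcc"))
```

In this toy input only the shared stretch acg is reported; the flanking letters, which would be penalized in a global alignment, are ignored.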

Example

Ends-free Alignment

Motivation:

• shotgun sequence assembly

Ends-free Alignment

$$
F(i, j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d,
\end{cases}
$$

with initial value

F (0, 0) = F (0, j) = F (i, 0) = 0.

And the highest value of F (i, j) in the last column F (i∗, n)

or the last row F (m, j∗) is the score.
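The only changes from global alignment are the zero initial values and where the score is read off, which the following sketch illustrates. It again assumes match +1, substitution −1 and d = 2 in place of a general s(a, b); `ends_free_score` is a hypothetical name, and only the score (not the alignment) is returned.

```python
def ends_free_score(x, y, match=1, mismatch=-1, d=2):
    """Best score when spaces at either end of the alignment cost nothing."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # F(0, j) = F(i, 0) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    best_last_column = max(F[i][n] for i in range(m + 1))   # F(i*, n)
    best_last_row = max(F[m])                               # F(m, j*)
    return max(best_last_column, best_last_row)


# A suffix of the first fragment overlaps a prefix of the second, as in
# shotgun assembly; the unmatched ends are not penalized.
print(ends_free_score("aaacgt", "cgtttt"))
```

Here the overlap cgt gives a score of 3, whereas a global alignment of the same two fragments would have to pay for every unmatched end letter.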

Example

Complexity

All of the above algorithms can be implemented

in time O(mn) and in space O(m + n).

Gap Penalty

• A gap is any maximal consecutive run of spaces in an alignment.

• The length of a gap is the number of indel operations in it.

a t t c - - g a - t g g a c c
a - - c g t g a t t - - - c c

Motivation:

• Insertion or deletion of an entire sequence often occurs as a single mutation event.

• Two protein sequences might be relatively similar over several intervals.

• cDNA: the complement of mRNA

Gap Penalty Models

1. constant gap penalty model: Wg × #gaps

2. affine gap penalty model (y = ax + b): Wg × #gaps + Ws × #spaces

3. convex gap penalty model: Wg + log(q), where q is the length of the gap

4. arbitrary gap penalty model

Wg: gap-open penalty, Ws: gap-extension penalty
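The difference between counting gaps and counting spaces can be illustrated on the 15-column example alignment from the Gap Penalty slide. The sketch below only evaluates the constant and affine penalty terms for a given alignment; the function names and the values Wg = 3, Ws = 1 are arbitrary choices for illustration.

```python
import re


def gap_lengths(row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]


def constant_penalty(top, bottom, wg):
    """Wg x #gaps: every gap costs the same, whatever its length."""
    return wg * len(gap_lengths(top) + gap_lengths(bottom))


def affine_penalty(top, bottom, wg, ws):
    """Wg x #gaps + Ws x #spaces."""
    lengths = gap_lengths(top) + gap_lengths(bottom)
    return wg * len(lengths) + ws * sum(lengths)


top = "attc--ga-tggacc"
bottom = "a--cgtgatt---cc"
print(gap_lengths(top), gap_lengths(bottom))
print(constant_penalty(top, bottom, wg=3), affine_penalty(top, bottom, wg=3, ws=1))
```

The alignment contains 4 gaps made of 8 spaces, so with these weights the constant model charges 12 and the affine model 20.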

Complexity

1. constant gap penalty model: time = O(mn)

2. affine gap penalty model: time = O(mn)

3. convex gap penalty model: time = O(mn lg(m + n))

4. arbitrary gap penalty model: time = O(mn(m + n))
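The O(mn) bound for the affine model comes from keeping three scores per cell instead of one, in the style of Gotoh's algorithm. The following is a score-only sketch under the assumption that a gap of length q costs Wg + q × Ws, which agrees with Wg × #gaps + Ws × #spaces; the matrix names M, X, Y and the default weights are choices made here, not taken from the slides.

```python
NEG = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, wg=2, ws=1):
    """Best global alignment score when a gap of length q costs wg + q * ws."""
    m, n = len(x), len(y)
    # M: last column pairs x_i with y_j; X: x_i over a space; Y: y_j over a space.
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(wg + i * ws)
    for j in range(1, n + 1):
        Y[0][j] = -(wg + j * ws)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - wg - ws,   # open a new gap
                          X[i - 1][j] - ws,        # extend the current gap
                          Y[i - 1][j] - wg - ws)   # switch to a gap in the other row
            Y[i][j] = max(M[i][j - 1] - wg - ws,
                          Y[i][j - 1] - ws,
                          X[i][j - 1] - wg - ws)
    return max(M[m][n], X[m][n], Y[m][n])


print(affine_global_score("acgt", "at"))
```

For acgt against at, the best choice is a single gap of length 2 (two matches minus a penalty of 4), since splitting the two spaces into separate gaps would pay the opening cost Wg twice.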

Conclusion
