Intro to Alignment Intro to Alignment Algorithms: Algorithms:
Global and LocalGlobal and Local
Algorithmic Functions of Computational Biology
Professor Istrail
Sequence ComparisonBiomolecular sequences DNA sequences (string over 4 letter alphabet {A, C,
G, T}) RNA sequences (string over 4 letter alphabet
{ACGU}) Protein sequences (string over 20 letter alphabet
{Amino Acids})
Sequence similarity helps in the discovery of genes, and the prediction of structure and function of proteins.
Algorithmic Functions of Computational Biology
Professor Istrail
The Basic Similarity Analysis Algorithm
Global Similarity
• Scoring Schemes
• Edit Graphs
• Alignment = Path in the Edit Graph
• The Principle of Optimality
• The Dynamic Programming Algorithm
• The Traceback
Algorithmic Functions of Computational Biology –
Professor Istrail
Jupiter’s code: Alignment
MassaMaster
Mas-saMaster
Mass-aMaster
Massa-Master
Algorithmic Functions of Computational Biology
Professor Istrail
Sequence AlignmentInput: two sequences over the same alphabetOutput: an alignment of the two sequences
Example: GCGCATTTGAGCGA TGCGTTAGGGTGACCAA possible alignment: - GCGCATTTGAGCGA - -
TGCG - - TTAGGGTGACC
matchmismatch
indel
Algorithmic Functions of Computational Biology –
Professor Istrail
m
n
yyyY
xxxX
...21
...21
Consider two sequences
Over the alphabet
},,, TGCA
ji yx , belong to
Algorithmic Functions of Computational Biology –
Professor Istrail
Scoring Schemes
Unit-score A C G T
A
C
G
T
1
1
1
1-
-
0
0
00
0
0
0
0 00
0
0
00
0
0
00
0
00
Algorithmic Functions of Computational Biology –
Professor Istrail
Alignment
ACG| | |AGG
Score = (A,A) (C,G) (G,G)
+ +
= 1 + 0 + 1 = 2
Unit-cost
A|A
A is aligned with A
C|G
C is aligned with G
G is aligned with G
G|G
Algorithmic Functions of Computational Biology –
Professor Istrail
Gaps
ACATGGAATACAGGAAAT
ACAT GG - AAT ACA - GG AAAT
OPTIMALALIGNMENTS
SCORE 7 8
AAAGGGGGGAAA
SCORE 0 3
- - - AAAGGG GGGAAA - - -
“-” is the gap symbol
Algorithmic Functions of Computational Biology –
Professor Istrail
(x,y) = the score for aligning x with y
(-,y) = the score for aligning - with y
(x,-) = the score for aligning x with -
Algorithmic Functions of Computational Biology –
Professor Istrail
A-CG - GATCGTG
Alignment
Score
(A,A) + (G,G) +(C,C) +(-,T) + (-,T ) + (G,G)
THE SUM OF THE SCORES OF THE PAIRWISE ALIGNED SYMBOLS
Algorithmic Functions of Computational Biology –
Professor Istrail
Scoring Scheme
Dayhoff score
...
PTIPLSRLFDNAMLRAHRLHQSAIENQRLFNIAVSRVQHLHL
Partial alignment for Monkey and Trout somatotropin proteins
- A R N D C Q E G H I L K M F P S T W Y V
-8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8-8 3 -3 0 0 -3 -1 0 1 -3 -1 -3 -2 -2 -4 1 1 1 -7 -4 0A
RND
64
Algorithmic Functions of Computational Biology –
Professor Istrail
Scoring Functions
Scoring function = a sum of a terms each for a pair of aligned residues, and for each gap
The meaning = log of the relative likelihood that the sequences are related, compared to being unrelated
Identities and conservative substitutions are Positive terms
Non-conservative substitutions are Negative terms
Algorithmic Functions of Computational Biology –
Professor Istrail
The Edit Graph
Suppose that we want to align AGT with AT
We are going to construct a graph where alignments between the two sequences correspond to paths between the begin and and end nodes of the graph.
This is the Edit Graph
Algorithmic Functions of Computational Biology –
Professor Istrail
0 1 2 3
0
1
2
AGT has length 3AT has length 2
The Edit graph has (3+1)*(2+1) nodes
The sequence AGT
The sequence AT
Algorithmic Functions of Computational Biology –
Professor Istrail
0 1 2 30
1
2
A G T
A
T
AGT indexes the columns, and AT indexes the rows of this “table”
Algorithmic Functions of Computational Biology –
Professor Istrail
0 1 2 30
1
2
A G T
A
T
The Graph is directed. The nodes (i,j) will hold values.
Algorithmic Functions of Computational Biology –
Professor Istrail
T0 1 2 30
1
2
A G
A
T
Algorithmic Functions of Computational Biology
Professor Istrail
T
A
T
0 1 2 30
1
2
A GA-
-A
AA
-A
-A-
A
-A
G-
A-
A-
G-
G-
T-
T-
T-
-T
-T
-T
-T
AT
GT
TT
GA
TA
Directed edges get as labels pairs of aligned letters.
Algorithmic Functions of Computational Biology –
Professor Istrail
Alignment = Path in the Edit GraphT
A
T
0 1 2 30
1
2
A GA-
-A
AA
-A
-A-
A
-A
G-
A-
A-
G-
G-
T-
T-
T-
-T
-T
-T
-T
AT
GT
TT
GA
TA
AGTA-T
Every path from Begin to End corresponds to an alignment
Every alignment corresponds to a path between Begin and End
Algorithmic Functions of Computational Biology –
Professor Istrail
The Principle of Optimality
The optimal answer to a problem is expressed in terms of optimal answer for its sub-problems
Algorithmic Functions of Computational Biology –
Professor Istrail
Dynamic Programming
Part 1: Compute first the optimal alignment score
Part 2: Construct optimal alignment
We are looking for the optimal alignment = maximal score path in the Edit Graph from the Begin vertex to the End vertex
Given: Two sequences X and YFind: An optimal alignment of X with Y
Algorithmic Functions of Computational Biology –
Professor Istrail
The DP Matrix S(i,j)
0 1 2 30
1
2
A G T
A
TS(2,1)
S(1,0)
Algorithmic Functions of Computational Biology –
Professor Istrail
The DP MatrixMatrix S =[S(i,j)]
S(i,j) = The score of the maximal cost path from the Begin Vertex and the vertex (i,j)
(i,j)
(i,j-1)
(i-1,j)
(i-1,j-1)The optimal path to (i,j)must pass through one of
the vertices
(i-1,j-1)
(i-1,j)
(i,j-1)
Algorithmic Functions of Computational Biology –
Professor Istrail
Opt path
(i,j)
(i,j-1)
(i-1,j)
(i-1,j-1)
Optimal path to (i-1,j) + (- , yj)
-xi
yj-
S(i-1,j) +
(- , yj)
Algorithmic Functions of Computational Biology –
Professor Istrail
Optimal path
(i,j)
(i-1,j)
(i,j-1)
(i-1,j-1)
Optimal path to (i-1,j-1) + (xi,yj)
S(i-1,j-1) + (xi , yj)
Algorithmic Functions of Computational Biology –
Professor Istrail
Optimal path
(i,j)
(i,j-1)
(I-1,j)
(i-1,j-1)
Optimal path to (i,j-1) + (xi,-)
S(i,j-1) + (xi, -)
Algorithmic Functions of Computational Biology –
Professor Istrail
The Basic ALGORITHM
S(i,j) =
S(i-1, j-1) + (xi, yj)
S(i-1, j) + (xi, -)
S(i, j-1) + (-, yj)
MAX
Algorithmic Functions of Computational Biology –
Professor Istrail
T
A
T
0 1 2 30
1
2
A GA-
-A
AA
-A
-A-
A
-A
G-
A-
A-
G-
G-
T-
T-
T-
-T
-T
-T
-T
AT
GT
TT
GA
TA
0
0
0
0
1
1
0
1
1
0
1
2AGTA - TOptimal Alignment
Optimal Alignment and TracbackAlgorithmic Functions of Computational Biology –
Professor Istrail
S(i,j) =
S(i-1, j-1) + (xi, yj),
S(i-1, j) + (xi, -),
S(i, j-1) + (-, yj)
MAX
0, We add this
The Basic ALGORITHM: Local Similarity
Algorithmic Functions of Computational Biology –
Professor Istrail
General Scoring Schemes
1. Independence of mutations at different sites
Additive scoring scheme
2. Gaps of any length are considered one mutation
All of the efficient alignment algorithms -- employing on the dynamicprogramming method --are based fundamentally on the of the fact that the scoring function is additive.
Assumptions
Algorithmic Functions of Computational Biology –
Professor Istrail
Substitutions Matrices
m
n
yyyY
xxxX
...21
...21
ji yx ,belong to
Consider ungapped alignment of equal length sequences
Compute the probability that the two sequences are related
Compute the probability that the two sequences are not related
Compute the ratio of the two probabilities
Algorithmic Functions of Computational Biology
Professor Istrail
Random Model R
Every letter z occurs independently with probability
yjx qqRyxP i)/,(
qz
Algorithmic Functions of Computational Biology - Course 3
Professor Istrail
Match Model M
Aligned pairs of residues occur with joint probability ab
pab
jiyxpMyxP )/,(
Algorithmic Functions of Computational Biology - Course 3
Professor Istrail
)/,(
)/,(
RyxP
MyxP)/ jxixiyj qqp
),( ii yxs
)/log(),( baab pppbas
= log
=
i
where
Log-odds ratio
s(a,b) = the substitution matrixAlgorithmic Functions of Computational Biology - Course 3
Professor Istrail