LCS and Extensions to Global and Local Alignment
Dr. Nancy Warter-Perez
June 26, 2003
June 26, 2003 LCS and Extensions 2
Overview
  Recursion
    Recursive solution to hydrophobicity sliding window problem
  LCS
  Smith-Waterman Algorithm
  Extensions to LCS
    Global Alignment
    Local Alignment
    Affine Gap Penalties
  Programming Workshop 6
Project References
  http://www.sbc.su.se/~arne/kurser/swell/pairwise_alignments.html
  Computational Molecular Biology: An Algorithmic Approach, Pavel Pevzner
  Introduction to Computational Biology: Maps, Sequences, and Genomes, Michael Waterman
  Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Dan Gusfield
Recursion
  Problems can be solved iteratively or recursively
  Recursion is useful in cases where you are building upon a partial solution
  Consider the hydrophobicity problem
Main.cpp
#include <iostream>
#include <string>
using namespace std;
#include "hydro.h"

// Hydrophobicity values indexed by (letter - 'A'), covering 'A' through 'Y'
double hydro[25] = {1.8,0,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,0,-3.9,3.8,1.9,-3.5,0,-1.6,-3.5,-4.5,-0.8,-0.7,0,4.2,-0.9,0,-1.3};

int main() {
    string seq;
    int ws;
    cout << "This program will compute the hydrophobicity of a sequence of amino acids.\n";
    cout << "Please enter the sequence: " << flush;
    cin >> seq;
    // Convert lowercase letters to uppercase so they index hydro[] correctly
    for (size_t i = 0; i < seq.size(); i++)
        if ((seq[i] >= 'a') && (seq[i] <= 'z'))
            seq[i] = seq[i] - 32;
    cout << "Please enter the window size: " << flush;
    cin >> ws;
    compute_hydro(seq, ws);
    return 0;
}
Hydro.cpp
#include <iostream>
#include <string>
using namespace std;
#include "hydro.h"

void print_hydro(string seq, int ws, int i, double sum);

void compute_hydro(string seq, int ws) {
    cout << "\n\nThe hydrophobicity values are:" << endl;
    print_hydro(seq, ws, seq.size() - 1, 0);
}

// Walks from the last residue to the first, carrying the running window sum.
// Values are printed as the recursion unwinds, so they appear in sequence order.
void print_hydro(string seq, int ws, int i, double sum) {
    int n = seq.size();
    if (i == -1)
        return;
    if (i >= n - ws)    // still filling the last window (avoids reading seq[n])
        sum += hydro[seq[i] - 'A'];
    else                // slide the window: drop residue i+ws, add residue i
        sum = sum - hydro[seq[i + ws] - 'A'] + hydro[seq[i] - 'A'];
    print_hydro(seq, ws, i - 1, sum);
    if (i <= n - ws)
        cout << "Hydrophobicity value:\t" << sum / ws << endl;
}
hydro.h
#include <string>
extern double hydro[25];
void compute_hydro(std::string seq, int ws);
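For comparison with the recursive version, the same sliding-window sum can also be computed iteratively. The following is a minimal sketch and not part of the original program: the function name windowAverages and the local table kHydro (a copy of the hydro values from Main.cpp) are assumptions made for illustration, and the input is assumed to contain only uppercase letters 'A' through 'Y'.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Copy of the hydro[] table from Main.cpp, indexed by (letter - 'A').
static const double kHydro[25] = {1.8, 0, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5,
                                  0, -3.9, 3.8, 1.9, -3.5, 0, -1.6, -3.5, -4.5,
                                  -0.8, -0.7, 0, 4.2, -0.9, 0, -1.3};

// Returns the average hydrophobicity of every window of size ws,
// ordered by window start position. Hypothetical helper for illustration.
std::vector<double> windowAverages(const std::string& seq, int ws) {
    assert(ws > 0);
    std::vector<double> averages;
    const int n = static_cast<int>(seq.size());
    if (ws > n) return averages;  // no complete window fits
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += kHydro[seq[i] - 'A'];                    // residue entering the window
        if (i >= ws) sum -= kHydro[seq[i - ws] - 'A'];  // residue leaving the window
        if (i >= ws - 1) averages.push_back(sum / ws);  // window [i-ws+1, i] is full
    }
    return averages;
}
```

Both versions do a constant amount of work per residue; the recursive one carries the partial sum in its argument list, while this one keeps it in a local variable.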
Dynamic Programming
  Applied to optimization problems
  Useful when
    Problem can be recursively divided into sub-problems
    Sub-problems are not independent
Longest Common Subsequence (LCS) Problem
  Reference: Pevzner
  Can have insertions and deletions but no substitutions (no mismatches)
  Ex:
    V:   ATCTGAT
    W:   TGCATA
    LCS: TCTA
LCS Problem (cont.)
  Similarity score
  $$ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} $$
  On board example: Pevzner Fig 6.1
Indels: insertions and deletions (e.g., gaps)
  Alignment of V and W
    V = rows of similarity matrix (vertical axis)
    W = columns of similarity matrix (horizontal axis)
  Space (gap) in W (UP): insertion
  Space (gap) in V (LEFT): deletion
  Match (no mismatch in LCS) (DIAG)
LCS(V,W) Algorithm
for i = 0 to n
    s[i,0] = 0
for j = 1 to m
    s[0,j] = 0
for i = 1 to n
    for j = 1 to m
        if v[i] = w[j]
            s[i,j] = s[i-1,j-1] + 1; b[i,j] = DIAG
        else if s[i-1,j] >= s[i,j-1]
            s[i,j] = s[i-1,j]; b[i,j] = UP
        else
            s[i,j] = s[i,j-1]; b[i,j] = LEFT
Print-LCS(V,i,j)
if i = 0 or j = 0
    return
if b[i,j] = DIAG
    Print-LCS(V, i-1, j-1)
    print v[i]
else if b[i,j] = UP
    Print-LCS(V, i-1, j)
else
    Print-LCS(V, i, j-1)
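The two routines above translate fairly directly into C++. The following is a minimal sketch under a few assumptions: the names Dir, LcsTables, buildLcsTables, collectLcs and lcs are chosen for illustration, the tables are bundled in a struct and passed explicitly, and the subsequence is collected into a string instead of being printed. Strings are 0-based in C++, so v[i-1] plays the role of v[i] in the pseudocode.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum Dir { NONE, DIAG, UP, LEFT };

struct LcsTables {
    std::vector<std::vector<int>> s;  // similarity scores
    std::vector<std::vector<Dir>> b;  // backtracking pointers
};

// Fills s and b following the LCS(V,W) pseudocode.
LcsTables buildLcsTables(const std::string& v, const std::string& w) {
    const int n = static_cast<int>(v.size());
    const int m = static_cast<int>(w.size());
    LcsTables t;
    t.s.assign(n + 1, std::vector<int>(m + 1, 0));  // row 0 and column 0 stay 0
    t.b.assign(n + 1, std::vector<Dir>(m + 1, NONE));
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            if (v[i - 1] == w[j - 1]) {
                t.s[i][j] = t.s[i - 1][j - 1] + 1;
                t.b[i][j] = DIAG;
            } else if (t.s[i - 1][j] >= t.s[i][j - 1]) {
                t.s[i][j] = t.s[i - 1][j];
                t.b[i][j] = UP;
            } else {
                t.s[i][j] = t.s[i][j - 1];
                t.b[i][j] = LEFT;
            }
        }
    }
    return t;
}

// Follows the back pointers from (i, j), as in Print-LCS, appending
// matched characters to out in left-to-right order.
void collectLcs(const LcsTables& t, const std::string& v, int i, int j,
                std::string& out) {
    if (i == 0 || j == 0) return;
    if (t.b[i][j] == DIAG) {
        collectLcs(t, v, i - 1, j - 1, out);
        out += v[i - 1];
    } else if (t.b[i][j] == UP) {
        collectLcs(t, v, i - 1, j, out);
    } else {
        collectLcs(t, v, i, j - 1, out);
    }
}

std::string lcs(const std::string& v, const std::string& w) {
    const LcsTables t = buildLcsTables(v, w);
    const int n = static_cast<int>(v.size());
    const int m = static_cast<int>(w.size());
    std::string out;
    collectLcs(t, v, n, m, out);
    assert(static_cast<int>(out.size()) == t.s[n][m]);  // one character per DIAG step
    return out;
}
```

Calling lcs("ATCTGAT", "TGCATA") returns a common subsequence of length 4; which of the equally long subsequences is returned depends on the tie-breaking rule (UP is preferred over LEFT here, as in the pseudocode).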
Classic Papers
  Needleman, S.B. and Wunsch, C.D. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol., 48, pp. 443-453, 1970. (http://poweredge.stanford.edu/BioinformaticsArchive/ClassicArticlesArchive/needlemanandwunsch1970.pdf)
  Smith, T.F. and Waterman, M.S. Identification of Common Molecular Subsequences. J. Mol. Biol., 147, pp. 195-197, 1981. (http://poweredge.stanford.edu/BioinformaticsArchive/ClassicArticlesArchive/smithandwaterman1981.pdf)
  Smith, T.F. The History of the Genetic Sequence Databases. Genomics, 6, pp. 701-707, 1990. (http://poweredge.stanford.edu/BioinformaticsArchive/ClassicArticlesArchive/smith1990.pdf)
Smith-Waterman (1 of 3)
Algorithm
The two molecular sequences will be $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$. A similarity $s(a,b)$ is given between sequence elements a and b. Deletions of length k are given weight $W_k$. To find pairs of segments with high degrees of similarity, we set up a matrix H. First set
$$ H_{k0} = H_{0l} = 0 \quad \text{for } 0 \le k \le n \text{ and } 0 \le l \le m. $$
Preliminary values of H have the interpretation that $H_{ij}$ is the maximum similarity of two segments ending in $a_i$ and $b_j$, respectively. These values are obtained from the relationship
$$ H_{ij} = \max\Big\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \Big\} \qquad (1) $$
for $1 \le i \le n$ and $1 \le j \le m$.
Smith-Waterman (2 of 3)
The formula for $H_{ij}$ follows by considering the possibilities for ending the segments at any $a_i$ and $b_j$.
(1) If $a_i$ and $b_j$ are associated, the similarity is
    $H_{i-1,j-1} + s(a_i, b_j)$.
(2) If $a_i$ is at the end of a deletion of length k, the similarity is
    $H_{i-k,j} - W_k$.
(3) If $b_j$ is at the end of a deletion of length l, the similarity is
    $H_{i,j-l} - W_l$. (typo in paper)
(4) Finally, a zero is included to prevent calculated negative similarity, indicating no similarity up to $a_i$ and $b_j$.
Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is found by first locating the maximum element of H. The other matrix elements leading to this maximum value are then sequentially determined with a traceback procedure ending with an element of H equal to zero. This procedure identifies the segments as well as produces the corresponding alignment. The pair of segments with the next best similarity is found by applying the traceback procedure to the second largest element of H not associated with the first traceback.
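Relationship (1) can be evaluated directly, scanning every possible deletion length at each cell. The sketch below computes only the maximum element of H and omits the traceback. Several things are assumptions made for illustration and are not from the paper: the function name smithWatermanBest, a similarity s(a,b) that is a fixed match or mismatch score, and a deletion weight of the form W_k = gapOpen + gapExtend * k.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the maximum element of H for sequences a and b.
// Assumed scoring: s(a,b) = match or mismatch, W_k = gapOpen + gapExtend * k.
int smithWatermanBest(const std::string& a, const std::string& b,
                      int match, int mismatch, int gapOpen, int gapExtend) {
    assert(gapOpen >= 0 && gapExtend >= 0);
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    // H[k][0] = H[0][l] = 0 comes from the zero initialization.
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1, 0));
    int best = 0;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            int h = 0;  // case (4): never go below zero
            // case (1): a_i and b_j are associated
            h = std::max(h, H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch));
            // case (2): a_i ends a deletion of length k
            for (int k = 1; k <= i; ++k)
                h = std::max(h, H[i - k][j] - (gapOpen + gapExtend * k));
            // case (3): b_j ends a deletion of length l
            for (int l = 1; l <= j; ++l)
                h = std::max(h, H[i][j - l] - (gapOpen + gapExtend * l));
            H[i][j] = h;
            best = std::max(best, h);
        }
    }
    return best;
}
```

Because of the inner loops over k and l, this direct form takes time proportional to n * m * (n + m); the fixed and affine gap penalties discussed later avoid that extra factor.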
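Extend LCS to Global Alignment
  $$ s_{i,j} = \max \begin{cases} s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases} $$
  $\delta(v_i, -) = \delta(-, w_j) = -\sigma$ = fixed gap penalty
  $\delta(v_i, w_j)$ = score for match or mismatch; can be fixed, or taken from PAM or BLOSUM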
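The global recurrence can be sketched in C++ as follows. This is an illustration under stated assumptions, not a reference implementation: the function name globalScore is hypothetical, the match or mismatch score is a fixed pair of integers (a PAM or BLOSUM lookup could be substituted), and the first row and column are charged the gap penalty for each leading gap, which is the usual convention for global alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Global alignment score with a fixed gap penalty sigma (sigma >= 0).
// match and mismatch stand in for the scoring function of the recurrence.
int globalScore(const std::string& v, const std::string& w,
                int match, int mismatch, int sigma) {
    assert(sigma >= 0);
    const int n = static_cast<int>(v.size());
    const int m = static_cast<int>(w.size());
    std::vector<std::vector<int>> s(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 1; i <= n; ++i) s[i][0] = -sigma * i;  // v aligned against gaps only
    for (int j = 1; j <= m; ++j) s[0][j] = -sigma * j;  // w aligned against gaps only
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            const int up = s[i - 1][j] - sigma;    // gap in W
            const int left = s[i][j - 1] - sigma;  // gap in V
            const int diag = s[i - 1][j - 1] + (v[i - 1] == w[j - 1] ? match : mismatch);
            s[i][j] = std::max({up, left, diag});
        }
    }
    return s[n][m];  // the score of aligning all of V with all of W
}
```

With match = 1, mismatch = -1 and sigma = 1, aligning "AC" with "AG" scores 0: one match plus one mismatch, which beats any alignment that uses gaps.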
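Extend to Local Alignment
  $$ s_{i,j} = \max \begin{cases} 0 & \text{(no negative scores)} \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases} $$
  $\delta(v_i, -) = \delta(-, w_j) = -\sigma$ = fixed gap penalty
  $\delta(v_i, w_j)$ = score for match or mismatch; can be fixed, or taken from PAM or BLOSUM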
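The local version differs from the global one in two places: every cell is clamped at zero, and the answer is the largest cell anywhere in the matrix. A minimal sketch, assuming a hypothetical function name localScore and a fixed integer match or mismatch score:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Local alignment score with a fixed gap penalty sigma (sigma >= 0).
int localScore(const std::string& v, const std::string& w,
               int match, int mismatch, int sigma) {
    assert(sigma >= 0);
    const int n = static_cast<int>(v.size());
    const int m = static_cast<int>(w.size());
    // Row 0 and column 0 stay 0: a local alignment may start anywhere.
    std::vector<std::vector<int>> s(n + 1, std::vector<int>(m + 1, 0));
    int best = 0;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            const int up = s[i - 1][j] - sigma;    // gap in W
            const int left = s[i][j - 1] - sigma;  // gap in V
            const int diag = s[i - 1][j - 1] + (v[i - 1] == w[j - 1] ? match : mismatch);
            s[i][j] = std::max({0, up, left, diag});  // 0 = no negative scores
            best = std::max(best, s[i][j]);           // a local alignment may end anywhere
        }
    }
    return best;
}
```

For example, with match = 1, mismatch = -1 and sigma = 1, the sequences "TTACGTT" and "GGACGGG" score 3, corresponding to the shared segment ACG.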
Discussion on adding affine gap penalties
  Affine gap penalty
    Score for a gap of length x: $-(\rho + \sigma x)$
    where $\rho > 0$ is the insert gap penalty and $\sigma > 0$ is the extend gap penalty
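Alignment with Gap Penalties
  Can apply to global or local (w/ zero) algorithms
  $$ s^{UP}_{i,j} = \max \begin{cases} s^{UP}_{i-1,j} - \sigma \\ s_{i-1,j} - (\rho + \sigma) \end{cases} $$
  $$ s^{LEFT}_{i,j} = \max \begin{cases} s^{LEFT}_{i,j-1} - \sigma \\ s_{i,j-1} - (\rho + \sigma) \end{cases} $$
  $$ s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s^{UP}_{i,j} \\ s^{LEFT}_{i,j} \end{cases} $$
  Note: in keeping with the traversal order in Figure 6.1, the two gap matrices are labeled UP and LEFT.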
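The three recurrences can be sketched for the global case as follows. The function name affineGlobalScore, the fixed match or mismatch score, and the boundary handling are assumptions made for illustration: a leading gap of length x is charged -(rho + sigma * x), and a large negative constant stands in for minus infinity where a gap matrix has no valid predecessor.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Global alignment score with affine gaps: a gap of length x costs rho + sigma * x.
int affineGlobalScore(const std::string& v, const std::string& w,
                      int match, int mismatch, int rho, int sigma) {
    assert(rho >= 0 && sigma >= 0);
    const int n = static_cast<int>(v.size());
    const int m = static_cast<int>(w.size());
    const int NEG = -1000000000;  // stands in for minus infinity
    std::vector<std::vector<int>> s(n + 1, std::vector<int>(m + 1, NEG));
    std::vector<std::vector<int>> up = s;    // best score ending with a gap in W
    std::vector<std::vector<int>> left = s;  // best score ending with a gap in V
    s[0][0] = 0;
    for (int i = 1; i <= n; ++i) s[i][0] = up[i][0] = -(rho + sigma * i);
    for (int j = 1; j <= m; ++j) s[0][j] = left[0][j] = -(rho + sigma * j);
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            // extend an existing gap (-sigma) or open a new one (-(rho + sigma))
            up[i][j] = std::max(up[i - 1][j] - sigma, s[i - 1][j] - (rho + sigma));
            left[i][j] = std::max(left[i][j - 1] - sigma, s[i][j - 1] - (rho + sigma));
            const int diag = s[i - 1][j - 1] + (v[i - 1] == w[j - 1] ? match : mismatch);
            s[i][j] = std::max({diag, up[i][j], left[i][j]});
        }
    }
    return s[n][m];
}
```

With match = 1, mismatch = -1, rho = 2 and sigma = 1, aligning "AAAA" with "AA" scores -2: two matches (+2) and a single gap of length 2 (-(2 + 2)). Splitting the gap into two gaps of length 1 would cost 6 instead of 4, which is the effect the affine penalty is meant to capture.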
Programming Workshop 6
  Implement LCS
    LCS(V,W)
      b and s are global matrices
    Print-LCS(V,i,j)
  Write a program that uses LCS and Print-LCS. The program should prompt the user for 2 sequences and print the longest common subsequence.