Thus the path is subdivided into a set of steps. The goal is to find the optimal way for each step
In bioinformatics, a computational method called dynamic programming is used to align two protein or nucleic acid sequences.
The term dynamic programming describes the process of solving problems in which one needs to find the best decisions one after another.
Start → A → Finish
First we select the best path from Start to A, then we select the best path from A to Finish.
The choice of the best path from A to Finish is independent of the choice of path from Start to A.
Thus the path is subdivided into a set of steps. The goal is to find the optimal way for each step. Any step along the true optimal path must itself be an optimal path.
This is the main idea of the dynamic programming method.
Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found.
sequence 1:   S   D   V   -   Y
sequence 2:   S   R   V   L   Y
pair score:   2  -1   2  -2   2
Total score = sum of residue pair scores minus the gap penalty (the gap scores -2) = 3
Score of new alignment = score of previous alignment + score of new aligned pair
sequence 1:   S   D   V   -   Y   T
sequence 2:   S   R   V   L   Y   T
Score = 3 + 2 = 5
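The additive scoring rule above can be sketched in Python. The pair scores are only the values shown in this example (S/S = 2, D/R = -1, V/V = 2, Y/Y = 2, T/T = 2, gap = -2). The names `PAIR_SCORES`, `pair_score` and `alignment_score` are illustrative and do not come from any library. A gap is written as "-".

```python
# Pair scores taken from the slide example. A real aligner would read them
# from a full substitution matrix (assumption: the scores are symmetric).
PAIR_SCORES = {
    ("S", "S"): 2,
    ("D", "R"): -1,
    ("V", "V"): 2,
    ("Y", "Y"): 2,
    ("T", "T"): 2,
}
GAP_SCORE = -2  # score of a residue aligned with a gap, as in the example


def pair_score(a, b):
    """Score one aligned column (two residues, or a residue and a gap)."""
    if a == "-" or b == "-":
        return GAP_SCORE
    # Look the pair up in either order because the scores are symmetric.
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(row1, row2):
    """Sum the column scores of two equally long aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row1, row2))


previous = alignment_score("SDV-Y", "SRVLY")  # score of the shorter alignment
extended = previous + pair_score("T", "T")    # add one more aligned pair
print(previous, extended)
```

Extending the alignment by one column only needs the previous score and the score of the new pair, which is what makes the step-by-step approach possible.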
There are two sequences: A = ACGCTG and B = CATGT. What is the best alignment?
Question: Explain the cell in the first row and the first column.
– A C G...
– C A T...
QUESTION: How do we estimate the gap?
Question: How do we calculate the score of this alignment?
How do we calculate the scores?
Question:
How do we estimate the mismatch? 0, -1, 1?
Question:
How do we estimate the match? 0, 1, 2
Thus in this alignment the penalty for a gap is… and the score for a mismatch is…
Explain the score in the cell G3/C1.
Check the score for mismatch with the previous slides.
Check the score in the cell G3/A2
After filling in all of the values the score matrix is as follows:
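Filling the matrix can be sketched as follows. The slides leave the match, mismatch and gap values as questions, so the values used here (match = 1, mismatch = -1, gap = -1) are assumptions for illustration only, and the printed matrix is not claimed to be the one on the slide. Sequence 1 runs along the columns and sequence 2 along the rows. The function name `fill_score_matrix` is hypothetical.

```python
def fill_score_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return the global-alignment score matrix F.

    F[i][j] is the best score for aligning seq2[:i] with seq1[:j].
    The default scoring values are illustrative assumptions.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # diagonal: align the two residues
                F[i - 1][j] + gap,    # from above: gap in sequence 1
                F[i][j - 1] + gap,    # from the left: gap in sequence 2
            )
    return F


F = fill_score_matrix("ACGCTG", "CATGT")
for row in F:
    print(" ".join(f"{v:3d}" for v in row))
```

Each cell depends only on its three neighbors (diagonal, above, left), so the matrix can be filled row by row in a single pass.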
The next procedure is the traceback step.
The traceback step determines the actual alignment that results in the maximum score. It begins at the (N, M) position in the matrix, i.e. the position where both sequences are globally aligned.
The algorithm of the traceback:
a) The traceback begins with the last cell (G6/T5 in this case).
Traceback takes the current cell and looks at the neighbor cells that could be direct predecessors:
the neighbor to the left (gap in sequence #2),
the diagonal neighbor (match/mismatch), and
the neighbor above it (gap in sequence #1).
For the current cell there are two possible predecessors with the maximum score 3.
b) If more than one possible predecessor (left and above) with the same maximum score exists, either can be chosen.
If the diagonal neighbor has the same maximum score, the diagonal path is selected to avoid a gap.
Select the best alignment and compare it with the alignment on the next slide.
Variant 1: select the left cell as the predecessor.
…TG
…T-
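A minimal sketch of the fill and traceback steps together, under the same assumed scoring as before (match = 1, mismatch = -1, gap = -1; these are not the slide's values). Instead of comparing neighbor scores directly, this version uses the usual formulation: it checks which neighbor could actually have produced the current cell's score. Ties are broken diagonal first, then left (the Variant 1 choice), then above. The function name `global_align` is hypothetical.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the score matrix, then trace back from the last cell.

    seq1 runs along the columns and seq2 along the rows, so a move to
    the left puts a gap in seq2 and a move up puts a gap in seq1.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    out1, out2 = [], []
    i, j = rows - 1, cols - 1  # a) start in the last cell
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and seq2[i - 1] == seq1[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            # Diagonal predecessor: preferred on ties to avoid a gap.
            out1.append(seq1[j - 1])
            out2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            # Left predecessor: gap in sequence 2.
            out1.append(seq1[j - 1])
            out2.append("-")
            j -= 1
        else:
            # Predecessor above: gap in sequence 1.
            out1.append("-")
            out2.append(seq2[i - 1])
            i -= 1
    # The rows were built from the end, so reverse them.
    return "".join(reversed(out1)), "".join(reversed(out2)), F[-1][-1]


aligned1, aligned2, score = global_align("ACGCTG", "CATGT")
print(aligned1)
print(aligned2)
print("score:", score)
```

Swapping the order of the `elif` and `else` branches gives the other tie-breaking choice and can produce a different alignment with the same score.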
Question: Does your alignment coincide with this one?
Make another possible alignment (Variant 2) and then compare it with the alignment at the next slide.
Question: What are the maximum scores of these two possible alignments?
Variant 2
Multiple sequence alignment (MSA)
What to align?
Proteins
Nucleotides
Codons
When are sequences similar?
Apart from sequence similarity, it depends on:
• Nucleotide / protein
  – Nucleotide: 4 different residues → more likely to be similar by chance
  – Proteins: 20 different residues → less likely to be similar by chance
• Sequence length
  – Short sequences → similarity by chance
Similarity vs. identity
– Identity: the same residue
– Similarity: similar physicochemical characteristics (can more readily be substituted)
Algorithms - overview
• Goal: comparing conserved regions
• Two methods:
  – Global
  – Local
• Three techniques:
  – Dot plots
  – Dynamic programming
  – Word-based
Local vs. global alignments
sequence 1: ACTCCGTAGGTTGGACTCC
sequence 2: CTCTGGTAGGCTTACTCTG
Global alignment
Local alignment
Agenda
• Sum of Pairs method
• ClustalW
• Gap penalties
Sum of Pairs (SP) method
• Consider aligning the following 4 protein sequences:
S1 = AQPILLLV
S2 = ALRLL
S3 = AKILLL
S4 = CPPVLILV
• Which MSA to choose?
A Q P I L L L V
A L R - L L - -
A K - I L L L -
C P P V L I L V

A Q P I L L L V
A - - L R L L -
- A K I L L L -
C P P V L I L V
Sum of Pairs cont.
A Q P I L L L V
A L R - L L - -
A K - I L L L -
C P P V L I L V
• Assume: c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(-, -) = 0.
• Then the SP score for the 4th column of the MSA would be
SP(column4) = SP(I, -, I, V)
= c(I,-) + c(I,I) + c(I,V) + c(-,I) + c(-,V) + c(I,V)
= -2 + 1 + (-1) + (-2) + (-2) + (-1)
= -7
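The column calculation above can be checked with a short Python sketch. The cost values are the ones assumed on the slide. The function names `pair_cost` and `sp_column` are illustrative.

```python
from itertools import combinations


def pair_cost(a, b, match=1, mismatch=-1, gap=-2):
    """Cost of one pair of symbols in a column, using the slide's values."""
    if a == "-" and b == "-":
        return 0          # c(-, -) = 0
    if a == "-" or b == "-":
        return gap        # a residue against a gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum the costs over every unordered pair of symbols in the column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


msa = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
column4 = [row[3] for row in msa]  # 4th column, index 3
print(column4, sp_column(column4))
```

With 4 sequences each column contributes 6 pairs; in general a column of k sequences contributes k(k-1)/2 pairs.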
Sum of Pairs cont.
• To find SP(MSA), we find the score of each column mi and then sum all SP(mi) scores to get the score of the MSA.
• To find the optimal score using this method, we need to consider all possible MSAs.
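As a sketch, the whole-MSA score can be computed by summing the column scores, here applied to the two candidate alignments of S1 to S4 shown earlier. The cost values are the assumed ones from the column example (match = 1, mismatch = -1, gap = -2, gap against gap = 0). The function name `sp_score` is hypothetical.

```python
from itertools import combinations


def sp_score(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an MSA given as equally long strings."""
    def cost(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    # zip(*msa) yields the columns; each column adds the cost of all its pairs.
    return sum(
        cost(a, b)
        for column in zip(*msa)
        for a, b in combinations(column, 2)
    )


candidate_1 = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
candidate_2 = ["AQPILLLV", "A--LRLL-", "-AKILLL-", "CPPVLILV"]
print(sp_score(candidate_1), sp_score(candidate_2))
```

This only scores alignments that are already given. Finding the best MSA would mean scoring every possible alignment, which is why the exhaustive approach does not scale.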
The ClustalW method
• ClustalW is a progressive method for MSA.
• “Progressive” means:
  – Start: determine pairwise which sequences are most related.
  – Progressively add less related sequences or groups of sequences to the initial alignment.
ClustalW steps
[Figure: sequences S1–S4 → all pairwise alignments → similarity matrix → cluster analysis → dendrogram (leaves S1, S3, S2, S4; axis: distance). From Higgins (1991) and Thompson (1994).]
Multiple alignment step:
1. Aligning S1 and S3
2. Aligning S2 and S4
3. Aligning (S1, S3) with (S2, S4)
ClustalW Step 1
• Use a pairwise alignment method to compute all pairwise alignments amongst the sequences.
• Look at the non-gapped positions and count the number of mismatches between the two sequences, then divide this value by the number of non-gapped pairs to calculate the distance.
NKL-ON
-MLNON
distance = 1/4 = 0.25
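The distance rule can be written as a small Python function. This is a sketch of the rule exactly as stated on the slide, not of ClustalW's actual implementation. The name `pairwise_distance` is illustrative.

```python
def pairwise_distance(row1, row2):
    """Fraction of mismatches among the non-gapped columns of a pairwise alignment."""
    # Keep only the columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("the alignment has no non-gapped columns")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


print(pairwise_distance("NKL-ON", "-MLNON"))
```

In the example, four columns are non-gapped (K/M, L/L, O/O, N/N) and only K/M is a mismatch.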
ClustalW Step 1 continued
Seq.   S1     S2     S3     S4
S1     -
S2     0.17   -
S3     0.59   0.60   -
S4     0.59   0.59   0.13   -
(The matrix is symmetric, so only the lower triangle is shown.)
ClustalW Step 2
• Construct a similarity tree (guide tree).
• The root is placed at the midpoint of the longest chain of consecutive edges.
[Figure: guide tree with leaves S1–S4 and internal nodes S1S2 and S3S4]
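To illustrate how a joining order falls out of the distance matrix from Step 1, here is a simplified sketch that repeatedly merges the two closest groups using average distances. This is a stand-in chosen for brevity: ClustalW itself builds its guide tree with the neighbor-joining method, and the sketch does not place a root. The names `DIST`, `group_distance` and `join_order` are hypothetical.

```python
from itertools import combinations

# Pairwise distances from the Step 1 table.
DIST = {
    frozenset(("S1", "S2")): 0.17,
    frozenset(("S1", "S3")): 0.59,
    frozenset(("S2", "S3")): 0.60,
    frozenset(("S1", "S4")): 0.59,
    frozenset(("S2", "S4")): 0.59,
    frozenset(("S3", "S4")): 0.13,
}


def group_distance(g1, g2):
    """Average distance between all members of two groups."""
    total = sum(DIST[frozenset((a, b))] for a in g1 for b in g2)
    return total / (len(g1) * len(g2))


def join_order(names):
    """Repeatedly merge the two closest groups; return the merges in order."""
    groups = [(name,) for name in names]
    merges = []
    while len(groups) > 1:
        g1, g2 = min(combinations(groups, 2), key=lambda p: group_distance(*p))
        groups.remove(g1)
        groups.remove(g2)
        groups.append(g1 + g2)
        merges.append((g1, g2))
    return merges


for left, right in join_order(["S1", "S2", "S3", "S4"]):
    print(left, "+", right)
```

With these distances the two close pairs (S3, S4) and (S1, S2) are joined first, and the two resulting groups are joined last.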
ClustalW Step 3
• Combine alignments:
  – from the most closely related groups to the most distantly related groups
  – going from the tips of the tree to the root of the tree.
• In our example, we align:
  – S1 with S2 (grp1)
  – S3 with S4 (grp2)
  – grp1 with grp2
  – continue until the root is reached.
• Each alignment involves dynamic programming by the SP score method.
• Now the complexity has been reduced to that of a series of pairwise alignments.
The distance between sequences, measured from the guide tree, determines which matrix to use:
• 80-100% seq-id -> Blosum80
• 60-80% seq-id -> Blosum60
• 30-60% seq-id -> Blosum45
• 0-30% seq-id -> Blosum30
Gap penalties
Gap Opening Penalty (GOP)
Gap extension penalty (GEP)
GTEAIVLMANKL
G---------KL
Gap Penalty: GOP+8*GEP
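The example gap spans 9 positions, so GOP + 8*GEP reads as one opening penalty plus an extension penalty for each of the remaining 8 positions. That reading is an assumption about the convention used here. A minimal sketch follows; the GOP and GEP values are hypothetical and chosen only for illustration, and the function names are not from any library.

```python
from itertools import groupby


def gap_lengths(aligned_row):
    """Lengths of the consecutive runs of '-' in one aligned row."""
    return [len(list(run)) for char, run in groupby(aligned_row) if char == "-"]


def affine_gap_penalty(length, gop, gep):
    """One opening penalty plus an extension penalty per further position."""
    return gop + (length - 1) * gep


GOP, GEP = 10.0, 0.5  # hypothetical values, not ClustalW defaults
lengths = gap_lengths("G---------KL")
print(lengths, [affine_gap_penalty(n, GOP, GEP) for n in lengths])
```

Because extending a gap is cheaper than opening one, a single long gap is penalized less than several short gaps of the same total length.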
Modifications of gap penalty
• Gap at position
  – low GOP (residue-specific penalties)
  – gap within 8 residues? -> increase GOP (gap separation distance)
• Hydrophilic residues
  – lower GOP (hydrophilic-hydrophobic gap penalties)