multiple sequence alignments - unibo.it · • find pairwise alignment • trial multiple alignment...
TRANSCRIPT
![Page 1: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/1.jpg)
Multiple Sequence Alignments
![Page 2: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/2.jpg)
Multiple alignment
• Pairwise alignment
  – Infer biological relationships from string similarity
• Multiple alignment
  – Infer string similarity from biological relationships
![Page 3: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/3.jpg)
Biological Motivations
• One of the most essential tools in molecular biology
  – Finding highly conserved sub-regions or embedded patterns in a set of biological sequences
  – Production of a consensus sequence
  – Estimation of evolutionary distance between sequences
  – Prediction of protein secondary/tertiary structure
  – Finding conserved regions
    • Local multiple alignment reveals conserved regions
    • Conserved regions are usually key functional regions and prime targets for drug development
• Practically useful methods only since D. Sankoff (1987), based on phylogenetics
  – Before 1987, multiple alignments were constructed by hand
  – The basic problem: exact dynamic programming is impractical for more than a few sequences
![Page 4: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/4.jpg)
Alignment between globins (human beta globin, horse beta globin, human alpha globin, horse alpha globin, cyanohaemoglobin, whale myoglobin, leghaemoglobin) produced by Clustal. Boxes mark the seven alpha helices composing each globin.
![Page 5: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/5.jpg)
![Page 6: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/6.jpg)
Definition
• Given strings x1, x2, …, xk, a multiple (global) alignment is a matrix with k rows and A columns. Each row holds one sequence, possibly with gap symbols inserted, and each column contains one symbol or gap from each sequence, with at least one non-gap symbol per column.
![Page 7: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/7.jpg)
Multiple Sequence Alignment
Matrix: 3 rows, 8 columns
![Page 8: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/8.jpg)
Family representations
• Outcome of multiple alignment
• Three kinds
  – Profile representation
    • Frequencies of symbols in each column
    • Weight vector
    • Alignment to a profile
  – Consensus sequence representation
    • Steiner string
  – Signature representation
    • PROSITE, BLOCKS databases
    • Regular expression
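As a rough illustration of the profile representation, the sketch below computes per-column symbol frequencies from an alignment given as equal-length strings. The function name `column_profile` is hypothetical, and treating the gap as an ordinary symbol is an assumption made here for simplicity.

```python
from collections import Counter

def column_profile(alignment):
    """Return one dict per column mapping symbol -> relative frequency.

    Assumption: the gap character '-' is counted like any other symbol.
    """
    n_rows = len(alignment)
    width = len(alignment[0])
    assert all(len(row) == width for row in alignment), "rows must have equal length"
    profile = []
    for col in range(width):
        counts = Counter(row[col] for row in alignment)
        profile.append({sym: cnt / n_rows for sym, cnt in counts.items()})
    return profile

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
profile = column_profile(rows)
```

Each entry of `profile` can then serve as the weight vector for one column when a new sequence is aligned to the profile.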
![Page 9: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/9.jpg)
Scoring Function
• Ideally:
  – Find the alignment that maximizes the probability that the sequences evolved from a common ancestor
[Figure: evolutionary tree with nodes labeled x, y, z, w, v and an unknown node "?"]
![Page 10: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/10.jpg)
Multiple Sequence Alignment
• Mult-Seq-Align can detect similarities that cannot be detected with Pairwise-Seq-Align methods.
• Detection of family characteristics.
Three questions:
1. Scoring
2. Computation of Mult-Seq-Align.
3. Family representation.
![Page 11: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/11.jpg)
A fragment of multiple alignment of 7 kinases.
ClustalW program from SRS server.
![Page 12: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/12.jpg)
Scoring: SP (sum of pairs)
SP: the sum of the pairwise scores of all pairs of symbols in a column.
SP3(-,A,A) = s(-,A) + s(-,A) + s(A,A)
SP total score = sum of the column scores over all columns
s(-,-) = 0
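A minimal sketch of the column-wise SP computation follows. The match, mismatch and gap values are illustrative assumptions rather than values from the slides; only s(-,-) = 0 is taken from the text.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols; the numeric values are assumed for illustration."""
    if a == GAP and b == GAP:
        return 0                      # s(-,-) = 0, as stated on the slide
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum of pairwise scores over all pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_total(alignment):
    """SP score of an alignment: sum of the column scores."""
    return sum(sp_column(col) for col in zip(*alignment))
```

With these assumed values, `sp_column(("-", "A", "A"))` evaluates s(-,A) + s(-,A) + s(A,A) = -2 - 2 + 1.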
![Page 13: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/13.jpg)
Induced pairwise alignment
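Induced pairwise alignment, or projection, of a multiple alignment: the rows of two sequences taken from the multiple alignment, with the columns that contain only gaps removed.
a(S1, S2), a(S2, S3), a(S1, S3)
s(-,-) = 0
SP total score = $\sum_{i<j} \mathrm{score}[\,a(S_i, S_j)\,]$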
![Page 14: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/14.jpg)
Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

Consensus:
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find the optimal consensus string m* that maximizes
$S(m) = \sum_i s(m^*, m_i)$
where $s(m_k, m_l)$ is the score of the pairwise alignment (k, l)
![Page 15: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/15.jpg)
Optimal solution
• Multidimensional Dynamic Programming
• Generalization of pair-wise alignment
• For simplicity, assume k sequences of length n
• The dynamic programming array is a k-dimensional hyperlattice with side length n+1 (including initial gaps)
• The entry F(i1, …, ik) represents the score of the optimal alignment of s1[1..i1], …, sk[1..ik]
• Initialize values on the faces of the hyperlattice
![Page 16: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/16.jpg)
Scoring Function: Sum Of Pairs
Definition: induced pairwise alignment. A pairwise alignment induced by the multiple alignment.
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
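The projection in this example can be sketched in a few lines: take two rows of the multiple alignment and drop every column in which both rows have a gap. The helper name `induced_pairwise` is hypothetical.

```python
def induced_pairwise(alignment, i, j, gap="-"):
    """Project a multiple alignment onto rows i and j.

    Columns where both rows hold a gap carry no information for the pair
    and are removed; all other columns are kept unchanged.
    """
    kept = [(a, b) for a, b in zip(alignment[i], alignment[j])
            if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

rows = ["AC-GCGG-C",   # x
        "AC-GC-GAG",   # y
        "GCCGC-GAG"]   # z
xy = induced_pairwise(rows, 0, 1)
xz = induced_pairwise(rows, 0, 2)
yz = induced_pairwise(rows, 1, 2)
```

Summing a pairwise score over `xy`, `xz` and `yz` gives the same total as summing SP column scores, because all-gap column pairs contribute 0.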
![Page 17: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/17.jpg)
Sum Of Pairs
• The sum-of-pairs (SP) score of a multiple alignment A is the sum of the scores of all induced pairwise alignments
$S(A) = \sum_{i<j} S(A_{ij})$
where $A_{ij}$ is the induced alignment of $x_i$ and $x_j$
![Page 18: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/18.jpg)
Dynamic Programming Solution
![Page 19: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/19.jpg)
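For three sequences ending in V, S and A (N stands for the prefix before the last residue, and "N-" marks a sequence whose last residue has not been consumed yet), the optimal score is the best of seven predecessor cells:

$$
s\begin{pmatrix}NV\\NS\\NA\end{pmatrix} = \max
\begin{cases}
s\begin{pmatrix}N\\N\\N\end{pmatrix} + \delta\begin{pmatrix}V\\S\\A\end{pmatrix} \\[4pt]
s\begin{pmatrix}NV\\N\text{-}\\N\text{-}\end{pmatrix} + \delta\begin{pmatrix}-\\S\\A\end{pmatrix} \\[4pt]
s\begin{pmatrix}N\text{-}\\NS\\N\text{-}\end{pmatrix} + \delta\begin{pmatrix}V\\-\\A\end{pmatrix} \\[4pt]
s\begin{pmatrix}N\text{-}\\N\text{-}\\NA\end{pmatrix} + \delta\begin{pmatrix}V\\S\\-\end{pmatrix} \\[4pt]
s\begin{pmatrix}N\text{-}\\NS\\NA\end{pmatrix} + \delta\begin{pmatrix}V\\-\\-\end{pmatrix} \\[4pt]
s\begin{pmatrix}NV\\N\text{-}\\NA\end{pmatrix} + \delta\begin{pmatrix}-\\S\\-\end{pmatrix} \\[4pt]
s\begin{pmatrix}NV\\NS\\N\text{-}\end{pmatrix} + \delta\begin{pmatrix}-\\-\\A\end{pmatrix}
\end{cases}
$$

k = 3, so there are $2^k - 1 = 7$ cases.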
![Page 20: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/20.jpg)
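Multidimensional Dynamic Programming
• Example: in 3D (three sequences):
• 7 neighbors per cell

$$
F(i,j,k) = \max
\begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}
$$

Here $x_i$, $x_j$, $x_k$ denote the residues at positions i, j, k of the first, second and third sequence.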
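The three-sequence recurrence can be sketched directly as a triple loop over the lattice. This is an illustrative implementation under assumptions: the column score S is taken to be the SP score, and the match, mismatch and gap values are made up for the example. It returns only the optimal score, not the alignment.

```python
from itertools import combinations, product

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Assumed pairwise scores; s(-,-) = 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """SP score of one column (used here as the column score S)."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def align3_score(x, y, z):
    """Optimal SP score of a global alignment of three sequences."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 ways to advance in at least one sequence.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue      # predecessor lies outside the lattice
                    column = (x[pi] if di else GAP,
                              y[pj] if dj else GAP,
                              z[pk] if dk else GAP)
                    candidate = F[pi][pj][pk] + sp_column(column)
                    if candidate > F[i][j][k]:
                        F[i][j][k] = candidate
    return F[nx][ny][nz]
```

Boundary faces need no separate initialization in this sketch: moves that would leave the lattice are skipped, so cells on a face are filled from the lower-dimensional recurrence automatically.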
![Page 21: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/21.jpg)
Complexity
• Space complexity: $O(n^k)$ for k sequences, each of length n.
• Computing one cell: $O(2^k)$ times the cost of computing δ.
• Time complexity: $O(2^k n^k)$ times the cost of computing δ.
• Finding the optimal solution is exponential in k.
• Proven to be NP-complete for a number of cost functions.
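To get a feel for these bounds, the small helper below counts lattice cells and cell updates for given k and n. The name `dp_cost` is hypothetical, and the cost of computing δ is ignored.

```python
def dp_cost(k, n):
    """Return (number of cells, number of predecessor look-ups) for the full DP.

    The lattice has (n + 1)^k cells and each cell inspects up to 2^k - 1
    predecessors; the cost of evaluating delta is not included.
    """
    cells = (n + 1) ** k
    lookups = cells * (2 ** k - 1)
    return cells, lookups

cells_3, lookups_3 = dp_cost(3, 2)
```

For k = 6 sequences of length n = 200 the lattice already has roughly 6.6 × 10^13 cells, which is why pruning or heuristics are needed in practice.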
![Page 22: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/22.jpg)
Algorithms
• Faster dynamic programming (SP)
  – Carrillo and Lipman 88 (MSA)
  – Pruning of the hyperlattice in DP
  – Practical for about 6 sequences of length about 200
• Star alignment (SP)
• Progressive methods
  – CLUSTALW
  – PILEUP
• Iterative algorithms
• Sampling (Gibbs) based methods
• Hidden Markov Model (HMM) based methods
• Expectation Maximization algorithm
![Page 23: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/23.jpg)
Idea behind MSA algorithm
• Find pairwise alignments
• Trial multiple alignment produced by a tree, cost = d
• This provides a limit to the volume within which optimal alignments are found
• Specifics
  – Sequences x1, …, xr
  – Alignment A, cost = c(A)
  – Optimal alignment A*
  – Aij = alignment induced on xi, xj by A
  – D(xi, xj) = cost of the optimal pairwise alignment of xi, xj, so D(xi, xj) ≤ c(Aij)
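A short derivation may help show how these quantities limit the search volume. It assumes that the cost c is a sum-of-pairs cost that is being minimized, so that c(A) is the sum of the costs of the induced pairwise alignments; the slides do not spell this step out.

$$
d \;\ge\; c(A^*) \;=\; \sum_{k<l} c(A^*_{kl}) \;\ge\; c(A^*_{ij}) + \sum_{\substack{k<l \\ (k,l)\neq(i,j)}} D(x_k, x_l)
$$

$$
\Longrightarrow\quad c(A^*_{ij}) \;\le\; d - \sum_{\substack{k<l \\ (k,l)\neq(i,j)}} D(x_k, x_l)
$$

Under that assumption, any lattice cell whose projection onto the pair (i, j) cannot lie on a pairwise alignment within this bound can be skipped.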
![Page 24: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/24.jpg)
Progressive Alignment
• Multiple Alignment is NP-complete
• Most used heuristic: Progressive Alignment
Algorithm:
– Align two of the sequences xi, xj
– Fix that alignment
– Align a third sequence xk to the alignment xi,xj
– Repeat until all sequences are aligned
Running time: $O(N L^2)$
![Page 25: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/25.jpg)
Star Alignments
• Heuristic method for multiple sequence alignments
• Select a sequence c as the center of the star
• For each sequence x1, …, xk with index i ≠ c, perform a Needleman-Wunsch global alignment of xi against the center
• Aggregate the alignments with the principle “once a gap, always a gap.”
![Page 26: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/26.jpg)
Star Alignments Example
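[Figure: star tree over s1, s2, s3, s4]

x1: MPE
x2: MKE
x3: MSKE
x4: SKE

Pairwise alignments against the center MKE (x2):

MPE
| |
MKE

MSKE
  ||
-MKE

SKE
 ||
MKE

Merging them one at a time:

MPE
MKE

-MPE
-MKE
MSKE

-MPE
-MKE
MSKE
-SKE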
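The star procedure can be sketched as follows: align each sequence to the center with Needleman-Wunsch, then merge on the shared center row, inserting a gap column wherever either side has a gap in the center ("once a gap, always a gap"). The scoring values and function names are assumptions for illustration, and the gap placement this sketch produces can differ from the figure above, since that depends on scores and tie-breaking.

```python
def nw_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    sub = lambda i, j: match if a[i - 1] == b[j - 1] else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),
                          F[i - 1][j] + gap, F[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:                      # traceback from the corner
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def star_align(seqs, center_idx):
    """Merge pairwise alignments to the center: once a gap, always a gap."""
    rows = {center_idx: seqs[center_idx]}      # index -> gapped row so far
    for idx, seq in enumerate(seqs):
        if idx == center_idx:
            continue
        pair_center, pair_seq = nw_align(seqs[center_idx], seq)
        msa_center = rows[center_idx]
        merged = {key: [] for key in rows}
        new_row, p, q = [], 0, 0
        while p < len(msa_center) or q < len(pair_center):
            mc = msa_center[p] if p < len(msa_center) else None
            pc = pair_center[q] if q < len(pair_center) else None
            if mc == "-" and pc != "-":        # gap column already in the MSA
                for key in rows:
                    merged[key].append(rows[key][p])
                new_row.append("-"); p += 1
            elif pc == "-" and mc != "-":      # new gap in the center: pad the MSA
                for key in rows:
                    merged[key].append("-")
                new_row.append(pair_seq[q]); q += 1
            else:                              # same center residue, or gap on both
                for key in rows:
                    merged[key].append(rows[key][p])
                new_row.append(pair_seq[q]); p += 1; q += 1
        rows = {key: "".join(chars) for key, chars in merged.items()}
        rows[idx] = "".join(new_row)
    return [rows[i] for i in range(len(seqs))]

msa = star_align(["MPE", "MKE", "MSKE", "SKE"], center_idx=1)
```

Every row of `msa` has the same length, and removing the gaps from a row gives back the corresponding input sequence.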
![Page 27: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/27.jpg)
Choosing a center
• Try them all and pick the one with the best score
• Calculate all $O(k^2)$ pairwise alignments, and pick the sequence $x_c$ that minimizes
$\sum_{i \neq c} D(x_c, x_i)$
• $D(x_c, x_i) = c(A_{ci})$, where A is the multiple alignment
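Center selection can be sketched as below. Unit-cost edit distance stands in for the pairwise alignment cost D, which is an assumption; any pairwise cost could be substituted. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, used here as a stand-in for D(a, b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def choose_center(seqs):
    """Return (index of the center, list of distance sums).

    The center minimizes the sum of distances to all other sequences;
    D(s, s) = 0, so including the sequence itself does not change the sum.
    """
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    center = min(range(len(seqs)), key=totals.__getitem__)
    return center, totals

center, totals = choose_center(["MPE", "MKE", "MSKE", "SKE"])
```

On the four sequences of the star alignment example this picks MKE, the same center used there.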
![Page 28: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/28.jpg)
Analysis
• Assuming all sequences have length n
• $O(k^2 n^2)$ to calculate the center
• Step i takes $O((i \cdot n) \cdot n)$ time
  – two strings of length n and i·n
• $O(k^2 n^2)$ overall cost
• Produces multiple sequence alignments whose SP values are at most twice that of the optimal solutions, provided the triangle inequality holds.
![Page 29: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/29.jpg)
Aligning to family representations
• Profile
  – Apply dynamic programming
  – Score depends on the profile
• Consensus string
  – Apply dynamic programming
• Signature representations
  – Align to regular expressions / CFG / …
![Page 30: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/30.jpg)
Progressive alignment (CLUSTALW)
• CLUSTALW is the most popular multiple protein alignment program
Algorithm:
1. Find all dij: alignment distance (xi, xj)
2. Construct a tree (neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
  • sequence to sequence
  • sequence to profile
  • profile to profile
+ a large number of heuristics
![Page 31: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/31.jpg)
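[Figure: sequences S1, S2, S3, S4 → all pairwise alignments → similarity matrix → cluster analysis → dendrogram → multiple alignment]

Similarity matrix:

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| S1 |    | 4  | 9  | 4  |
| S2 |    |    | 4  | 7  |
| S3 |    |    |    | 4  |
| S4 |    |    |    |    |

Dendrogram (leaf order, with a distance axis): S1, S3, S2, S4

Multiple alignment steps:
1. Aligning S1 and S3
2. Aligning S2 and S4
3. Aligning (S1,S3) with (S2,S4)

From Higgins (1991) and Thompson (1994).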
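The merge order in this figure can be reproduced with a small clustering sketch over the similarity matrix. Note the swap: CLUSTALW builds its guide tree with neighbor joining, whereas the code below uses plain average-linkage agglomeration, chosen only because it is short. The function name `merge_order` is hypothetical.

```python
from itertools import combinations

def merge_order(names, sim):
    """Greedy average-linkage clustering on a similarity table.

    Returns the list of cluster pairs in the order they are merged,
    most similar first. This is a simplification, not neighbor joining.
    """
    def lookup(a, b):
        return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

    def average(c1, c2):
        return sum(lookup(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    order = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        order.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return order

similarity = {("S1", "S2"): 4, ("S1", "S3"): 9, ("S1", "S4"): 4,
              ("S2", "S3"): 4, ("S2", "S4"): 7, ("S3", "S4"): 4}
order = merge_order(["S1", "S2", "S3", "S4"], similarity)
```

With the values from the figure, S1 and S3 are merged first, then S2 and S4, and finally the two pairs, matching the three alignment steps listed.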
![Page 32: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/32.jpg)
Problems with Progressive Alignments
• Depends on pairwise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Care must be taken in choosing scoring matrices and penalties
![Page 33: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/33.jpg)
Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment
Algorithm:
• Find all dij: alignment dist (xi, xj)
• Construct a tree
(Neighbor-joining hierarchical clustering)
• Align nodes in order of decreasing similarity
+ a large number of heuristics
![Page 34: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/34.jpg)
Iterative Refinement
One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT
![Page 35: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/35.jpg)
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align the most similar xi, xj
2. Align the xk most similar to (xi xj)
3. Repeat step 2 until (x1 … xN) are aligned
4. For j = 1 to N, remove xj and realign it to x1 … xj-1 xj+1 … xN
5. Repeat step 4 until convergence
Note: Guaranteed to converge
![Page 36: Multiple Sequence Alignments - unibo.it · • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which](https://reader036.vdocuments.us/reader036/viewer/2022071000/5fbc1f6c8bf3a307fe0bacc4/html5/thumbnails/36.jpg)
Other methods
• MEME (Expectation Maximization)
• GibbsDNA (Gibbs Sampling)
• HMMER (Hidden Markov Model)
• Random projections
• CONSENSUS (greedy multiple alignment)
• WINNOWER (Clique finding in graphs)