Protein Multiple Sequence Alignment
Sarah Aerni
CS374
December 7, 2006
Background we’ve seen before
• Alignment of sequences allows us to examine homologous regions
  • Two proteins with regions of high sequence similarity are likely to perform the same function
  • Conserved regions point to structural similarity
Multiple Sequence Alignment
Images from STRAP
Aligned regions represent spatial similarity
Images from STRAP
Background we’ve seen before
• Evolutionary history can be inferred from similarity
  • Aligned residues should have evolved from the same ancestral residue
Recap on alignments
• Classic pairwise sequence alignment
  • Dynamic programming approaches
  • Use affine gap penalty to more accurately model evolutionary events
Sequence alignment revisited
Two sequences of length L require $O(NL^2)$ space and $O(3L^2)$ time**
Time complexity is $O(3L^2) = O(L^2)$**
**Different algorithms may alter complexity
Recap on alignments
• Multiple sequence alignment approaches
  • Why is the classic pairwise alignment not extendable to multiple sequences?
Sequence alignment revisited
• Two sequences of length L require $O(L^2)$ space and $O(L^2)$ time
• Three sequences of length L require $O(L^3)$ space and $O(L^3)$ time
Image from Durbin et al.
Sequence alignment revisited
• Four sequences?
• N sequences?
• Generally, time is $O(L^N)$
Image from Durbin et al.
Run-time for the calculations
Let’s assume we have N sequences of length L
• Time complexity is $O(L^N)$
• Assume this computation takes $10^{2N-4}$ seconds
  • 2 sequences take 1 second
  • 3 sequences take 100 seconds
• In our example they had N = 12 sequences
  • $10^{2 \cdot 12 - 4} = 10^{20}$ seconds
  • About 3 trillion years!!
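The arithmetic can be checked with a short Python sketch. It only evaluates the assumed cost model of $10^{2N-4}$ seconds; the function name `alignment_seconds` and the year-length constant are illustrative choices and not part of the original slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, roughly 3.16e7 seconds


def alignment_seconds(n_sequences: int) -> float:
    """Assumed cost model from the slide: 10^(2N-4) seconds for N sequences."""
    return 10.0 ** (2 * n_sequences - 4)


for n in (2, 3, 12):
    secs = alignment_seconds(n)
    print(f"N={n:2d}: {secs:.3g} s = {secs / SECONDS_PER_YEAR:.3g} years")
```

For N = 12 the model gives about 3.2e12 years, which matches the "3 trillion years" figure.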
Solutions
• Heuristic approaches to sequence alignment
  • Progressive multiple alignment
Progressive multiple alignment
Perform pairwise alignments for all sequences
Assume a match gives a score of 1, a mismatch is -0.25, and an indel is -0.5
Column scores: 1 -0.25 1 1 1 1
Total Score: 4.75
Progressive multiple alignment
Further pairwise alignments, under the same scoring scheme:
Total Score: 0.5
Total Score: 3.5
Total Score: 4.75
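As a minimal sketch of this all-against-all step, the following Python scores every pair of sequences with Needleman-Wunsch using the scores stated on the slide (match 1, mismatch -0.25, indel -0.5). The linear indel penalty, the function names and the three example sequences are assumptions made for illustration; the slides' own sequences were shown only as images.

```python
from itertools import combinations

MATCH, MISMATCH, INDEL = 1.0, -0.25, -0.5  # scores stated on the slide


def global_alignment_score(x: str, y: str) -> float:
    """Needleman-Wunsch score with a linear (non-affine) indel penalty."""
    prev = [j * INDEL for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * INDEL]
        for j in range(1, len(y) + 1):
            sub = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            curr.append(max(prev[j - 1] + sub,     # align x[i-1] with y[j-1]
                            prev[j] + INDEL,       # x[i-1] against a gap
                            curr[j - 1] + INDEL))  # y[j-1] against a gap
        prev = curr
    return prev[-1]


def all_pairs_scores(seqs: dict) -> dict:
    """Score every unordered pair of named sequences."""
    return {(a, b): global_alignment_score(seqs[a], seqs[b])
            for a, b in combinations(sorted(seqs), 2)}


# Hypothetical sequences, not the ones pictured on the slides.
seqs = {"s1": "ACGTAC", "s2": "AGGTAC", "s3": "ACGT"}
scores = all_pairs_scores(seqs)
print(scores)
```

Here `s1` and `s2` differ by one mismatch, so their score is five matches plus one mismatch, the same 4.75 total as in the first example.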
Progressive multiple alignment
• Create guide tree from pairwise alignments
• Use tree to build multiple sequence alignment
  • Align most similar sequences first (these give the most reliable alignments)
  • Align the profile to the next closest sequence
  • Align profiles to each other
  • Multiple sequence alignment will be at the root of the tree
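One way to picture the guide-tree step is greedy clustering on the pairwise scores: merge the two most similar clusters, then repeat until one tree remains. The sketch below assumes average linkage on similarity scores, which the slides do not specify; `build_guide_tree`, the sequence names and the score values are hypothetical.

```python
from itertools import combinations


def build_guide_tree(names, similarity):
    """Greedy average-linkage clustering; returns the tree as nested tuples.

    similarity maps frozenset({a, b}) to the pairwise alignment score.
    """
    # Each cluster is (subtree, tuple of leaf names under that subtree).
    clusters = [(name, (name,)) for name in names]

    def avg_sim(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # The most similar pair of clusters is aligned (merged) first.
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg_sim(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Hypothetical pairwise scores for four sequences.
sims = {
    frozenset({"s1", "s2"}): 4.75, frozenset({"s3", "s4"}): 3.5,
    frozenset({"s1", "s3"}): 0.5, frozenset({"s1", "s4"}): 1.0,
    frozenset({"s2", "s3"}): 0.75, frozenset({"s2", "s4"}): 0.5,
}
tree = build_guide_tree(["s1", "s2", "s3", "s4"], sims)
print(tree)
```

Reading the nested tuples from the inside out gives the alignment order: `s1` with `s2`, `s3` with `s4`, and finally the two resulting profiles with each other at the root.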
ProbCons
Attempts to identify a* for all sequence pairs
• a* is the unknown alignment that best represents the true biological alignment
1. Create posterior probability matrix
2. Compute expected accuracies
   • Determine the number of correctly aligned pairs
3. Consistency-based transformation
   • Think transitive property of equality: use a 3rd sequence to recompute the match quality score
4. Compute guide tree
   • Hierarchical clustering by expected accuracies
5. Progressive alignment using guide tree
Posterior Probabilities
Use a pair-HMM for sequence alignment to compute the probability that letters $x_i$ and $y_j$ are paired in the true alignment
Image from Do et al
Posterior Probabilities
Can be represented as 3 matrices
• Match state (can only move diagonally)
• Insertion x (only i can increase)
• Insertion y (only j can increase)
[Figure labels: transition probability, emission probability]
Posterior Probabilities
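The probability of any unique alignment a can be computed as follows
$$P(a \mid x, y) = \pi(s_1) \prod_{i=1}^{n-1} \alpha(s_i \to s_{i+1}) \prod_{i=1}^{n} \beta(o_i \mid s_i)$$
• $\pi(s)$ = probability of starting in state s
• $\alpha(s_i \to s_{i+1})$ = transition probability
• $\beta(o_i \mid s_i)$ = emission probability of $o_i$ in state $s_i$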
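The product of start, transition and emission terms can be evaluated directly for one state path. The sketch below does this for a three-state pair-HMM (M = match, X = insertion in x, Y = insertion in y). All parameter values and the DNA alphabet are toy assumptions, not the trained ProbCons parameters. Strictly speaking the product is the joint probability of the path and the sequences; dividing by the total over all paths would give the conditional.

```python
def path_probability(states, observations, start, trans, emit):
    """pi(s_1) * prod alpha(s_i -> s_{i+1}) * prod beta(o_i | s_i)."""
    p = start[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= trans[(s, s_next)]          # transition terms
    for s, o in zip(states, observations):
        p *= emit(s, o)                  # emission terms
    return p


# Toy parameters (hypothetical values chosen for illustration only).
START = {"M": 0.8, "X": 0.1, "Y": 0.1}
TRANS = {("M", "M"): 0.8, ("M", "X"): 0.1, ("M", "Y"): 0.1,
         ("X", "X"): 0.4, ("X", "M"): 0.6,
         ("Y", "Y"): 0.4, ("Y", "M"): 0.6}


def emit(state, obs):
    """Match emits a residue pair; insert states emit a single residue."""
    if state == "M":
        a, b = obs
        return 0.19 if a == b else 0.02  # 4 * 0.19 + 12 * 0.02 = 1
    return 0.25                          # uniform over A, C, G, T


# x = "AG", y = "A": A is matched with A, then G is an insertion in x.
p = path_probability(["M", "X"], [("A", "A"), "G"], START, TRANS, emit)
print(p)
```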
Posterior Probabilities
Compute the posterior probability that $x_i$ and $y_j$ are matched in a* (the “true” biological alignment)
$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$
• Many paths exist through $x_i$ and $y_j$ whose probabilities sum to 0.35
• Path $a_2$ (the most probable path) has a probability of 0.08
• The combined probability of all paths that align $x_i$ and $y_j$ makes the alignment of these two residues very likely
• Some other path $a_2$ may be the most probable path; however, no single pair in its path scores as high as $x_i$ and $y_j$
Posterior Probabilities
In the sum over alignments, the indicator $\mathbf{1}\{x_i \sim y_j \in a\}$ evaluates to 1 when $x_i$ and $y_j$ are aligned in a, and 0 otherwise
• Therefore, the probability that two residues are aligned is increased by having them appear in alignments presumed to be more probable.
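The sum over alignments can be illustrated by brute force on a toy distribution. The list below is a hypothetical set of alignments whose probabilities are chosen to mirror the numbers quoted earlier: the single most probable alignment has probability 0.08, while the alignments that pair position 2 of x with position 2 of y sum to 0.35. In practice the sum is computed by dynamic programming (forward and backward passes) rather than by enumeration.

```python
def match_posterior(alignments, i, j):
    """Sum P(a | x, y) over the alignments a in which x_i is paired with y_j."""
    return sum(prob for pairs, prob in alignments if (i, j) in pairs)


# Each alignment is (set of matched (i, j) pairs, probability). Toy values.
alignments = (
    [({(1, 1), (2, 3)}, 0.08)]                                # most probable path
    + [({(2, 2), (3, 3 + k)}, 0.07) for k in range(5)]        # paths through (2, 2)
    + [({(1, 2 + k)}, 0.057) for k in range(10)]              # remaining mass
)

best_pairs, best_prob = max(alignments, key=lambda a: a[1])
print("most probable alignment:", best_pairs, best_prob)
print("posterior of (2, 2):", match_posterior(alignments, 2, 2))
```

The pair (2, 2) is absent from the most probable alignment yet has a far higher posterior than any pair in it, which is why the method scores pairs by posterior rather than relying on the single best path.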
Maximal Expected accuracy alignment
Use simple Needleman-Wunsch algorithm
• Use the posterior probabilities as the match and mismatch scores
• Gap penalties are set to 0
• Imagine increasingly green squares represent aligned residues whose probability approaches 1 and increasingly red squares represent those whose probability approaches 0.
• We want to maximize the overall probability of the path by taking the “greenest” path.
Maximal Expected accuracy alignment
$$E\big(\mathrm{accuracy}(a, a^*) \mid x, y\big) = \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$$
• Alignment computed to maximize most probable matches
• Finding exactly correct alignment is difficult and not crucial
• Maximizes the number of correctly aligned residues
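A minimal sketch of this step: run a Needleman-Wunsch style recurrence in which the score of pairing $x_i$ with $y_j$ is its posterior probability and gaps cost nothing, then divide the best total by $\min\{|x|, |y|\}$. The function name and the 3 x 3 posterior matrix are hypothetical.

```python
def max_expected_accuracy(post):
    """post[i][j] is the posterior that x_i and y_j are matched (0-based).

    Returns (expected accuracy, list of matched (i, j) pairs).
    """
    n, m = len(post), len(post[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + post[i - 1][j - 1],  # pair x_i, y_j
                           dp[i - 1][j],                           # gap, penalty 0
                           dp[i][j - 1])                           # gap, penalty 0
    # Trace back to recover which pairs were matched.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m] / min(n, m), pairs


# Hypothetical posterior matrix for two sequences of length 3.
post = [[0.90, 0.05, 0.00],
        [0.05, 0.20, 0.60],
        [0.00, 0.10, 0.30]]
accuracy, pairs = max_expected_accuracy(post)
print(accuracy, pairs)
```

In this example the diagonal path (0.9 + 0.2 + 0.3) loses to the path that pairs the second residue of x with the third residue of y (0.9 + 0.6), which is the "greenest" path through the matrix.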
Consistency-based scheme
[Figure: sequences x, z and y, with positions $x_i$, $z_k$ and $y_j$ marked]
• Take location k in sequence z
• $z_k$ aligns with location i in sequence x
• $z_k$ aligns with location j in sequence y
• In the ProbCons consistency-based scheme, the alignment of $x_i$ to $y_j$ will receive a high score.
Probabilistic consistency transformation
Given a set of sequences, S, we can compute the following:
• Remove all values in $P_{xz}$ and $P_{zy}$ that are below a certain threshold
• Then we obtain the probability of residues $x_i$ and $y_j$ being aligned given the set of all sequences
$$P(x_i \sim y_j \in a^* \mid S) = \frac{1}{|S|} \sum_{z \in S} \sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y)$$
• $P(x_i \sim x_j \mid x)$ is 1 if $i = j$, 0 otherwise
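The inner sum over $z_k$ is a matrix product, so the transformation amounts to averaging $P_{xz} P_{zy}$ over every z in S, with the identity matrix standing in when z is x or y itself. The sketch below uses plain Python lists; the helper names, the `cutoff` value and the toy posterior matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def sparsify(m, cutoff):
    """Zero out entries below the cutoff (the thresholding step)."""
    return [[v if v >= cutoff else 0.0 for v in row] for row in m]


def consistency_transform(post, lengths, x, y, cutoff=0.01):
    """Average P_xz * P_zy over every z in S, including z = x and z = y.

    post[(a, b)] holds the |a| x |b| posterior match matrix for a != b.
    """
    def P(a, b):
        if a == b:
            return identity(lengths[a])  # a residue matches only itself
        return sparsify(post[(a, b)], cutoff)

    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:
        product = matmul(P(x, z), P(z, y))
        for i, row in enumerate(product):
            for j, value in enumerate(row):
                total[i][j] += value
    return [[v / len(lengths) for v in row] for row in total]


# Toy example: x and y agree only weakly, but both agree strongly with z.
lengths = {"x": 2, "y": 2, "z": 2}
post = {
    ("x", "y"): [[0.5, 0.1], [0.1, 0.5]],
    ("x", "z"): [[0.9, 0.0], [0.0, 0.9]],
    ("z", "y"): [[0.9, 0.0], [0.0, 0.9]],
}
new_xy = consistency_transform(post, lengths, "x", "y")
print(new_xy)
```

The diagonal entries rise from 0.5 to about 0.60 and the off-diagonal entries fall from 0.1 to about 0.07: the evidence from z reinforces the matches that are consistent across all three sequences.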
ProbCons
4. Compute guide tree
   • Hierarchical clustering by expected accuracies
5. Progressive alignment using guide tree
   • Sum-of-pairs mode following guide tree
Results
BAliBASE benchmark data
• 141 reference protein alignments
  • hand-constructed alignments
  • structural alignments
• 5 reference sets with varying degrees of similarity
Scored accuracy for 5 aligners and ProbCons
• Sum-of-pairs (SP): number of correctly aligned residue pairs divided by total number in reference set
• Column Score (CS): number of correctly aligned columns divided by total number of aligned columns in reference set
• Additional (ProbCons-ext) extends HMM to model long terminal insertions in x or y
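The SP and CS definitions above can be sketched as follows. Alignments are given as lists of equal-length gapped strings with rows in the same order, and every reference column is counted. BAliBASE itself restricts scoring to annotated core blocks, which this simplified version does not do; the function names and the example alignments are hypothetical.

```python
from itertools import combinations


def columns(alignment):
    """Each column as a tuple of residue indices per row (None where gapped)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(alignment):
    """All pairs of residues that share a column."""
    pairs = set()
    for col in columns(alignment):
        present = [(r, idx) for r, idx in enumerate(col) if idx is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, ref):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = residue_pairs(ref)
    return len(residue_pairs(test) & ref_pairs) / len(ref_pairs)


def cs_score(test, ref):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    ref_cols = columns(ref)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


# Hypothetical reference and test alignments of the same three sequences.
ref = ["AC-GT", "ACTGT", "A--GT"]
test = ["ACG-T", "ACTGT", "A-G-T"]
print(sp_score(test, ref), cs_score(test, ref))
```

In this example the test alignment recovers 8 of the 10 reference residue pairs (SP = 0.8) but only 3 of the 5 reference columns (CS = 0.6), showing that CS is the stricter measure.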
Results
ProbCons shows high column reliability in actual homology regions
• Only core blocks in BAliBASE alignment are considered “actual”
• ProbCons may be detecting true homology outside core regions
• Column reliability computed as proportion of correct pairwise matches
Results
Comparative Analysis
• ProbCons outscores all other aligners in every benchmark dataset
• Runtime is moderate compared to others
• When ProbCons-ext is the top-scoring method, ProbCons is second in most cases
  • Exception: Reference set 4, “sequences with large N/C-terminal extensions”
Problems with insertions
Insertions are penalized multiple times even though the event occurs only once!
Problems with insertions
Initial set of sequences
Sequences are organized into tree
Problems with insertions
Pairwise sequence alignments are performed between evolutionarily closest sequences
Problems with insertions
The top sequence contains an insertion and a gap penalty is incurred
Gap penalty incurred for the alignment
With two sequences it is not clear whether the gap represents an insertion or a deletion!
• This sequence may have undergone a deletion event (two Ts were removed)
• This sequence may have undergone an insertion event (two Ts were inserted)
Problems with insertions
In the most parsimonious explanation, this gap represents a deletion in the middle sequence (it is not present in any other sequence)
Will be scored as 1 pairwise match and two gaps
Problems with insertions
If we examine the sequences in a tree, we can see the deletion occurred once, and has been scored accordingly.
Problems with insertions
Deletion occurs in this branch
Problems with insertions
Let’s examine an insertion event
Insertion occurs in this branch
Problems with insertions
In the most parsimonious explanation, this gap represents an insertion in the top sequence
Will be scored as two gaps
Aligned gaps receive no match score!
Problems with insertions
The scoring issues grow with an increased number of sequences!
Problems with insertions
Although this alignment represents only one insertion event, it is penalized for n-1 gaps (where n is the number of sequences) and receives no match score
Problems with insertions
Although both alignments represent the same number of events, they will be scored differently.
Single deletion event: $n-1$ gap penalties and $\binom{4}{2}$ match scores
Single insertion event: $n-1$ gap penalties
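These counts can be reproduced by enumerating the sequence pairs within one alignment column. The sketch assumes n = 5 sequences (consistent with the $\binom{4}{2}$ above) and a sum-of-pairs view in which residue-gap pairs are penalised, residue-residue pairs are scored, and gap-gap pairs contribute nothing; the function name is hypothetical.

```python
from itertools import combinations


def column_pair_counts(column):
    """Return (residue-gap pairs, residue-residue pairs, gap-gap pairs)."""
    gap_pairs = scored_pairs = ignored_pairs = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            ignored_pairs += 1      # aligned gaps receive no score
        elif a == "-" or b == "-":
            gap_pairs += 1          # one gap penalty
        else:
            scored_pairs += 1       # one match/mismatch score
    return gap_pairs, scored_pairs, ignored_pairs


deletion_column = "TTTT-"   # one sequence lost the residue
insertion_column = "T----"  # one sequence gained the residue
print(column_pair_counts(deletion_column))
print(column_pair_counts(insertion_column))
```

Both columns incur n - 1 = 4 gap penalties, but only the deletion column also collects the 6 residue-residue scores, so two single-event histories end up with different totals.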
Other programs
• Other methods treat all gaps as deletions and use heuristics to correct for repeated gaps
  • Ex: CLUSTALW lowers gap penalties as multiple gaps build up in one region
    • First gap penalized as usual
    • Subsequent gaps exhibit lower gap penalty
• Infers long ancestral sequences
  • Every insertion is modeled as original sequence!
Löytynoja et al.
• Propose skipping inserted subsequences once they have been aligned
• Implements affine gap pairwise alignments
  • HMM with states similar to those in ProbCons
  • Introduce matrices which store previous insertions
• Use an evolutionary scoring function
  • All transition states are described by indel rates and evolutionary distances
  • Character emission is described by an “evolutionary substitution” model
  • Various models can be incorporated
Progressive alignment
• Algorithm to allow pre-existing gaps to be skipped
  • Insertion has already been penalized in the top profile
• Keep track of pointers for all previous insertions
• Allow “free ride” from previous insertion during an alignment
  • AC in both sequences aligned as usual
  • “Free ride” is given over insertion in child branch: no penalty
  • The remaining alignment is scored regularly
Addition of matrices
• Matrices are added for previous insertions as pointers to
  • insertion in current states
  • match states
• As with all recurrences the best scoring path is taken
• No additional time complexity
Score at beginning of previous insertion
Score as a regular insertion: penalty incurred
Score as a previous insertion: no additional penalty incurred
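The "free ride" idea can be sketched with a much simplified recurrence. This is not the authors' algorithm, which uses affine gaps, an HMM and extra pointer matrices. It is a linear-gap global alignment in which positions of x flagged as previously penalised insertions can be skipped at no cost. The scores, the function name `align_score` and the `prev_insertion` flags are all illustrative assumptions.

```python
MATCH, MISMATCH, GAP = 1.0, -0.25, -0.5  # illustrative scores, not from the paper


def align_score(x, y, prev_insertion=None):
    """Global alignment score of x against y with a linear gap penalty.

    prev_insertion[i] is True when position i of x is an insertion that was
    already penalised lower in the tree; skipping it again costs nothing.
    """
    if prev_insertion is None:
        prev_insertion = [False] * len(x)
    n, m = len(x), len(y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + (0.0 if prev_insertion[i - 1] else GAP)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + GAP
    for i in range(1, n + 1):
        skip_cost = 0.0 if prev_insertion[i - 1] else GAP  # the "free ride"
        for j in range(1, m + 1):
            sub = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            dp[i][j] = max(dp[i - 1][j - 1] + sub,    # match or mismatch
                           dp[i - 1][j] + skip_cost,  # skip a position of x
                           dp[i][j - 1] + GAP)        # regular gap in x
    return dp[n][m]


x, y = "ACTTGG", "ACGG"
flags = [False, False, True, True, False, False]  # TT was inserted earlier
print(align_score(x, y))         # TT is penalised again
print(align_score(x, y, flags))  # TT is skipped for free
```

With the flags set, AC is aligned as usual, the flagged TT is passed over without penalty, and GG is scored regularly, so the score rises from 3.0 to 4.0.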
Probabilistic alignment
• Goal is to determine the ancestral sequence
• Sequences consist of vectors of probabilities for all residues at each position
  • $p_a(x_i)$ = the probability that the ith position of sequence x has residue a (simply a profile)
• For the input sequences, the vectors have probability 1 assigned to the observed character at each position, and 0 to all others
A 1 0 0 0 0 0
C 0 0 0 0 0 0
G 0 1 0 0 1 0
T 0 0 1 1 0 1
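For an observed input sequence, this profile is simply a one-hot encoding. The sketch below rebuilds the table above from the sequence AGTTGT, which is read off the table's columns; the function name and the dictionary layout are illustrative choices.

```python
ALPHABET = "ACGT"


def one_hot_profile(seq, alphabet=ALPHABET):
    """Rows follow the alphabet; columns follow the sequence positions."""
    return {a: [1 if ch == a else 0 for ch in seq] for a in alphabet}


profile = one_hot_profile("AGTTGT")
for residue, row in profile.items():
    print(residue, *row)
```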
Probabilistic alignment
Example of a profile in which some positions are uncertain (columns 1 and 2 each split their probability between two residues):
A .5 0 0 0 0 0
C .5 .5 0 0 0 0
G 0 .5 0 0 1 1
T 0 0 1 1 0 0
Probabilistic scoring function
The authors describe the conditional probability that residue a is at position k in z, the ancestral sequence. The terms of the formula are:
- the substitution probability between a and b given the evolutionary distance between x and its ancestor
- the probability that the ith position of sequence x has residue b
The authors define a normalized evolutionary score for matching residues at locations $x_i$ and $y_j$; it uses the equilibrium frequency of character a.
Results
Study 20 primate mitochondrial D-loop sequences
The authors’ method produced phylogenetically consistent gaps
Regions deemed indel “hot spots” by CLUSTALW are likely artifacts of the method
CLUSTALW
Authors’ method
Results
Gaps are consistent with the phylogenetic tree
CLUSTALW artifacts? Gaps are largely inconsistent with the phylogenetic tree
References
Do, C.B., Mahabhashyam, M.S.P., Brudno, M., and Batzoglou, S. 2005. PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment. Genome Research 15: 330-340.
Löytynoja, A., and Goldman, N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. PNAS 102(30): 10557-10562.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis. Cambridge University Press, Cambridge, UK.
http://bioalgorithms.info
Gille, C., and Frommel, C. STRAP: editor for STRuctural Alignments of Proteins.
Sequence homology in a tree
• Alignments can be represented in a tree
  • Evolutionary information contained in branch lengths and tree organization
  • Special problems for sequence insertions being carried up the tree
• Trees are used to guide multiple sequence alignments; a tree may be
  • inferred by a program
  • provided as input to the program