protein multiple sequence alignment sarah aerni cs374 december 7, 2006

Protein Multiple Sequence Alignment

Sarah Aerni

CS374

December 7, 2006

Background we’ve seen before

• Alignment of sequences allows us to examine homologous regions
• Two proteins with regions of high sequence similarity are likely to perform the same function
• Conserved regions point to structural similarity

Multiple Sequence Alignment

Images from STRAP

Aligned regions represent spatial similarity

Images from STRAP

Background we’ve seen before (continued)

• Evolutionary history can be inferred from similarity
• Aligned residues should have evolved from the same ancestral residue

Recap on alignments

• Classic pairwise sequence alignment
  • Dynamic programming approaches
  • Use affine gap penalty to more accurately model evolutionary events

Sequence alignment revisited

Two sequences of length L require $O(NL^2)$ space and $O(3L^2)$ time**

**different algorithms may alter complexity

Time complexity is $O(3L^2) = O(L^2)$**
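The affine gap penalty from the recap can be illustrated with a minimal sketch. The parameter values (`gap_open`, `gap_extend`, `gap`) and function names are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
def linear_gap_cost(length, gap=-2.0):
    """Linear model: every gapped position costs the same (hypothetical value)."""
    return gap * length


def affine_gap_cost(length, gap_open=-2.0, gap_extend=-0.5):
    """Affine model: the first gapped position pays gap_open,
    each additional position pays the smaller gap_extend."""
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One indel event of length 4 versus four separate events of length 1.
one_long_gap = affine_gap_cost(4)
four_short_gaps = 4 * affine_gap_cost(1)
print(one_long_gap, four_short_gaps)
```

Under the affine model a single long gap is cheaper than several short ones, which is one way to reflect that a single evolutionary event can insert or delete several residues at once.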

Recap on alignments

• Multiple sequence alignment approaches

Why is the classic pairwise alignment not extendable to multiple sequences?

Sequence alignment revisited

• Two sequences of length L require $O(L^2)$ space and $O(L^2)$ time
• Three sequences of length L require $O(L^3)$ space and $O(L^3)$ time
• Four sequences?
• N sequences?

Image from Durbin et al.

Generally time is $O(L^N)$

Run-time for the calculations

• Let’s assume we have N sequences of length L
• Time complexity is $O(L^N)$
• Assume this computation takes $10^{2N-4}$ seconds
  • 2 sequences take 1 second
  • 3 sequences take 100 seconds
• In our example they had N=12 sequences
  • $10^{2 \cdot 12 - 4} = 10^{20}$ seconds
  • 3 trillion years!!
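The arithmetic on this slide can be checked with a short sketch. It assumes the slide's illustrative cost model of 10^(2N-4) seconds for N sequences; the function name and the year length (365.25 days) are assumptions made here.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def alignment_seconds(n_sequences):
    """Illustrative cost model from the slide: 10^(2N-4) seconds."""
    return 10 ** (2 * n_sequences - 4)


for n in (2, 3, 12):
    seconds = alignment_seconds(n)
    print(n, seconds, seconds / SECONDS_PER_YEAR)
```

For N=12 this works out to roughly 3 x 10^12 years, which is the "3 trillion years" quoted above.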

Solutions

• Heuristic approaches to sequence alignment
  • Progressive multiple alignment

Progressive multiple alignment

Perform pairwise alignments for all sequences

Assume a match gives a score of 1, a mismatch is -0.25, indel is -0.5

Column scores: 1, -0.25, 1, 1, 1, 1

Total Score: 4.75



Total Score: 0.5



Total Score: 3.5



Total Score: 4.75
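The scoring scheme above (match 1, mismatch -0.25, indel -0.5) can be sketched as follows. The aligned sequences in the example are hypothetical, since the slide's alignments were images; only the scoring values come from the slide.

```python
def alignment_score(row_a, row_b, match=1.0, mismatch=-0.25, indel=-0.5):
    """Score two rows of a pairwise alignment column by column.
    '-' marks a gap; both rows must have the same length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# Hypothetical example: five matches and one mismatch.
print(alignment_score("ACGTAC", "ATGTAC"))
# Hypothetical example: four matches and one indel.
print(alignment_score("AC-GT", "ACTGT"))
```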


Create guide tree from pairwise alignments

Use tree to build multiple sequence alignment

• Align most similar sequences first (give the most reliable alignments)
• Align the profile to the next closest sequence
• Align profiles to each other
• Multiple sequence alignment will be at the root of the tree
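One simple way to turn pairwise scores into a guide tree is greedy average-linkage clustering, sketched below. This is an illustrative stand-in, not necessarily the clustering any particular aligner uses; the sequence names and similarity values are hypothetical.

```python
from itertools import combinations


def build_guide_tree(names, similarity):
    """Repeatedly merge the two most similar clusters.
    `similarity` maps frozenset({a, b}) -> pairwise alignment score.
    Returns the tree as nested tuples."""
    # Each cluster is (subtree, list_of_member_names).
    clusters = [(name, [name]) for name in names]

    def average_similarity(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Hypothetical pairwise scores for four sequences.
scores = {
    frozenset(("s1", "s2")): 4.75,
    frozenset(("s1", "s3")): 0.5,
    frozenset(("s1", "s4")): 3.5,
    frozenset(("s2", "s3")): 0.5,
    frozenset(("s2", "s4")): 3.0,
    frozenset(("s3", "s4")): 1.0,
}
tree = build_guide_tree(["s1", "s2", "s3", "s4"], scores)
print(tree)
```

The innermost tuple holds the pair aligned first; each enclosing tuple corresponds to a later profile alignment, with the full multiple alignment at the root.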

ProbCons

• Attempts to identify a* for all sequence pairs
  • a* is the unknown alignment that best represents the true biological alignment

1. Create posterior probability matrix
2. Compute expected accuracies
   • Determine the number of correctly aligned pairs
3. Consistency-based transformation
   • Think transitive property of equality, using a 3rd sequence to recompute the match quality score
4. Compute guide tree
   • Hierarchical clustering by expected accuracies
5. Progressive alignment using guide tree

Posterior Probabilities

Use pair-HMM for sequence alignment to compute the probability that letters x_i and y_j are paired in the true alignment

Image from Do et al

Can be represented as 3 matrices
• Match state (can only move diagonally)
• Insertion x (only i can increase)
• Insertion y (only j can increase)

Figure labels: transition probability, emission probability


The probability of any unique alignment a can be computed as follows

$$P(a \mid x, y) = \pi(s_1) \prod_{i=1}^{n-1} \alpha(s_i \to s_{i+1}) \prod_{i=1}^{n} \beta(o_i \mid s_i)$$

• π(s) = probability of starting in state s
• α(s_i → s_{i+1}) = transition probability
• β(o_i | s_i) = emission probability of o_i in state s_i
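The product of start, transition, and emission probabilities can be sketched directly. All parameter values below are hypothetical and are not trained pair-HMM parameters; the state names M and X (match, insertion in x) are also assumptions made for the example.

```python
def path_probability(states, emissions, pi, alpha, beta):
    """Multiply the start probability of the first state, every
    transition probability, and every emission probability along a path."""
    prob = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= alpha[(prev, nxt)]
    for state, observed in zip(states, emissions):
        prob *= beta[state][observed]
    return prob


# Hypothetical parameters for a three-column alignment path M, M, X.
pi = {"M": 0.5}
alpha = {("M", "M"): 0.8, ("M", "X"): 0.1}
beta = {
    "M": {("A", "A"): 0.2, ("C", "G"): 0.05},
    "X": {("T", "-"): 0.25},
}
p = path_probability(
    ["M", "M", "X"],
    [("A", "A"), ("C", "G"), ("T", "-")],
    pi, alpha, beta,
)
print(p)
```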


Compute the posterior probability that x_i and y_j are matched in a* (the “true” biological alignment)

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$

• $\mathbf{1}\{x_i \sim y_j \in a\}$ evaluates to 1 when x_i and y_j are aligned in a, 0 otherwise
• Therefore, the probability two residues are aligned is increased by having them appear in alignments presumed to be more probable.

Example from the figure:
• Many paths exist through x_i and y_j whose probabilities sum to .35
• Path a2 (the most probable path) has probability of .08
• The combined probability of all paths which align x_i and y_j makes the alignment of these two residues very likely
• Some other path a2 may be the most probable path, however no single pair in its path scores as high as x_i and y_j
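The sum over alignments can be illustrated on a toy set of alignments small enough to enumerate. In practice this posterior is computed with dynamic programming over the pair-HMM rather than by enumeration; the alignments and probabilities below are hypothetical.

```python
def match_posterior(alignments, i, j):
    """Sum the probabilities of all alignments in which position i of x
    is matched to position j of y (the indicator selects those alignments)."""
    return sum(prob for pairs, prob in alignments if (i, j) in pairs)


# Hypothetical alignments: (set of matched (i, j) pairs, probability).
toy_alignments = [
    ({(0, 0), (1, 1)}, 0.4),
    ({(0, 0), (1, 2)}, 0.3),
    ({(0, 1), (1, 2)}, 0.2),
    ({(1, 0)}, 0.1),
]
print(match_posterior(toy_alignments, 0, 0))
print(match_posterior(toy_alignments, 1, 2))
```

Here the pair (1, 2) is absent from the single most probable alignment, yet its posterior is higher than that of (1, 1), which is the point the slide makes about summing over many paths.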

ProbCons









Maximal Expected accuracy alignment

• Use simple Needleman-Wunsch algorithm
  • Use the posterior probabilities as the match and mismatch scores
  • Gap penalties are set to 0

• Imagine increasingly green squares represent aligned residues whose probability approaches 1 and increasingly red approaches 0.
• We want to maximize the overall probability of the path by taking the “greenest” path.

$$E_{a^*}\big[\mathrm{accuracy}(a, a^*) \mid x, y\big] = \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$$

• Alignment computed to maximize most probable matches
• Finding exactly correct alignment is difficult and not crucial
• Maximizes the expected number of correctly aligned residues
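A minimal sketch of the maximal expected accuracy step: a Needleman-Wunsch style recurrence in which the match score is the posterior probability and gaps cost nothing. The posterior matrix is hypothetical, and the function returns only the normalized score, not the traceback.

```python
def expected_accuracy(posterior):
    """posterior[i][j] is the probability that x_i and y_j are aligned.
    Returns the best achievable sum of posteriors along a valid alignment,
    divided by min(|x|, |y|)."""
    n, m = len(posterior), len(posterior[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + posterior[i - 1][j - 1],  # align x_i with y_j
                dp[i - 1][j],                                # gap, penalty 0
                dp[i][j - 1],                                # gap, penalty 0
            )
    return dp[n][m] / min(n, m)


# Hypothetical posterior matrix for |x| = 2, |y| = 3.
toy_posterior = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.7],
]
print(expected_accuracy(toy_posterior))
```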

ProbCons









Consistency-based scheme

Figure: sequences x, z, and y, with residue z_k linked to x_i and to y_j

• Take location k in sequence z
• z_k aligns with location i in sequence x
• z_k aligns with location j in sequence y
• In the ProbCons consistency-based scheme, the alignment of x_i to y_j will receive a high score.

Probabilistic consistency transformation

Given a set of sequences, S, we can compute

$$P(x_i \sim y_j \in a^* \mid S) = \frac{1}{|S|} \sum_{z \in S} \sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y)$$

P(x_i ~ x_j | x) is 1 if i = j, 0 otherwise

Remove all values in P_xz and P_zy that are below a certain threshold

Then we obtain the probability of residues x_i and y_j being aligned given the set of all sequences
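The transformation amounts to averaging matrix products over every intermediate sequence z, with the identity matrix used when z is x or y itself. The sketch below uses plain lists; the helper names, the `threshold` default, and the posterior values are assumptions made for illustration.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def pair_matrix(post, lengths, u, v):
    """Posterior matrix for (u, v); identity when u and v are the same sequence."""
    if u == v:
        return identity(lengths[u])
    if (u, v) in post:
        return post[(u, v)]
    return transpose(post[(v, u)])


def sparsify(m, threshold):
    """Drop values below the threshold, as the slide suggests."""
    return [[p if p >= threshold else 0.0 for p in row] for row in m]


def consistency_transform(post, lengths, x, y, threshold=0.0):
    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:  # every sequence in S, including x and y
        p_xz = sparsify(pair_matrix(post, lengths, x, z), threshold)
        p_zy = sparsify(pair_matrix(post, lengths, z, y), threshold)
        for i, row in enumerate(matmul(p_xz, p_zy)):
            for j, value in enumerate(row):
                total[i][j] += value
    return [[value / len(lengths) for value in row] for row in total]


# Hypothetical posteriors for three sequences of length 2.
lengths = {"x": 2, "y": 2, "z": 2}
post = {
    ("x", "y"): [[0.5, 0.1], [0.1, 0.5]],
    ("x", "z"): [[0.9, 0.0], [0.0, 0.9]],
    ("z", "y"): [[0.9, 0.0], [0.0, 0.9]],
}
new_xy = consistency_transform(post, lengths, "x", "y")
print(new_xy)
```

In this toy case the strong x-z and z-y matches raise the diagonal x-y entries above their original 0.5, while the weak off-diagonal entries shrink.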

ProbCons

5. Progressive alignment using guide tree
   • Sum-of-pairs mode following guide tree

Results

• BAliBASE benchmark data
  • 141 reference protein alignments
    • hand-constructed alignments
    • structural alignments
  • 5 reference sets with varying degrees of similarity
• Scored accuracy for 5 aligners and ProbCons
  • Sum-of-pairs (SP): number of correctly aligned residue pairs divided by total number in reference set
  • Column Score (CS): number of correctly aligned columns divided by total number of aligned columns in reference set
• Additional (ProbCons-ext) extends HMM to model long terminal insertions in x or y
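The two accuracy measures can be sketched as follows. This is a simplified reading of the definitions above: it counts every reference column rather than only BAliBASE core blocks, and the example alignments are hypothetical.

```python
from itertools import combinations


def residue_columns(alignment):
    """Convert gapped rows into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        column = []
        for r, row in enumerate(alignment):
            if row[c] == "-":
                column.append(None)
            else:
                column.append(counters[r])
                counters[r] += 1
        columns.append(tuple(column))
    return columns


def aligned_pairs(columns):
    """All (row1, index1, row2, index2) residue pairs sharing a column."""
    pairs = set()
    for column in columns:
        for (r1, i1), (r2, i2) in combinations(enumerate(column), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs


def sp_score(test, reference):
    ref_pairs = aligned_pairs(residue_columns(reference))
    test_pairs = aligned_pairs(residue_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def column_score(test, reference):
    ref_columns = residue_columns(reference)
    test_columns = set(residue_columns(test))
    return sum(col in test_columns for col in ref_columns) / len(ref_columns)


# Hypothetical reference and test alignments of the same two sequences.
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(sp_score(test, reference), column_score(test, reference))
```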

Results

• ProbCons shows high column reliability in actual homology regions
• Only core blocks in BAliBASE alignment are considered “actual”
• ProbCons may be detecting true homology outside core regions
• Column reliability computed as proportion of correct pairwise matches

Results: Comparative Analysis

• ProbCons outscores all other aligners in every benchmark dataset
• Runtime is moderate compared to others
• When ProbCons-ext is the top-scoring method, ProbCons is second in most cases
  • Exception: Reference set 4, “sequences with large N/C-terminal extensions”

Problems with insertions

Insertions are penalized multiple times even though the event occurs only once!


Initial set of sequences

Sequences are organized into tree


Pairwise sequence alignments are performed between evolutionarily closest sequences

The top sequence contains an insertion and a gap penalty is incurred

Gap penalty incurred for the alignment

With two sequences it is not clear whether the gap represents an insertion or a deletion!

• This sequence may have undergone a deletion event (two Ts were removed)
• This sequence may have undergone an insertion event (two Ts were inserted)


In the most parsimonious explanation, this gap represents a deletion in the middle sequence (it is not present in any other sequence)

Will be scored as 1 pairwise match and two gaps


If we examine the sequences in a tree, we can see the deletion occurred once, and has been scored accordingly.


Deletion occurs in this branch


Insertion occurs in this branch

Let’s examine an insertion event


In the most parsimonious explanation, this gap represents an insertion in the top sequence

Will be scored as two gaps

Aligned gaps receive no match score!


The scoring issues grow with an increased number of sequences!


Although this alignment represents only one insertion event, it is penalized for n-1 gaps (where n is the number of sequences) and receives no match score


Although both alignments represent the same number of events, they will be scored differently.

Single deletion event: n-1 gap penalties and $\binom{4}{2}$ match scores

Single insertion event: n-1 gap penalties
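The asymmetry between the two cases can be reproduced by counting pairs within a single alignment column under sum-of-pairs scoring. The sketch assumes n = 5 sequences and a column of identical residues; the function name is hypothetical.

```python
from itertools import combinations


def column_pair_counts(column):
    """Count (matching residue pairs, residue-gap pairs) in one column.
    Gap-gap pairs contribute nothing."""
    matches = gap_pairs = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            gap_pairs += 1
        elif a == b:
            matches += 1
    return matches, gap_pairs


# Deletion in one of five sequences: four residues, one gap.
print(column_pair_counts("TTTT-"))
# Insertion in one of five sequences: one residue, four gaps.
print(column_pair_counts("T----"))
```

Both columns incur n-1 = 4 gap penalties, but only the deletion column also collects match scores, so a single insertion event ends up scoring worse than a single deletion event.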

Other programs

• Other methods treat all gaps as deletions and use heuristics to correct for repeated gaps
  • Ex: CLUSTALW lowers gap penalties as multiple gaps build up in one region
    • First gap penalized as usual
    • Subsequent gaps exhibit lower gap penalty
• Infers long ancestral sequences
  • Every insertion is modeled as original sequence!

Löytynoja et al.

• Propose skipping inserted subsequences once they have already been aligned
• Implements affine gap pairwise alignments
  • HMM with states similar to those in ProbCons
  • Introduce matrices which store previous insertions
• Use an evolutionary scoring function
  • All transition states are described by indel rates and evolutionary distances
  • Character emission is described by an “evolutionary substitution” model
  • Various models can be incorporated

Progressive alignment

Algorithm to allow pre-existing gaps to be skipped

Insertion has already been penalized in the top profile

Keep track of pointers for all previous insertions

Allow “free ride” from previous insertion during an alignment

AC in both sequences aligned as usual

“Free ride” is given over insertion in child branch-no penalty

The remaining alignment is scored regularly

Löytynoja et al.






Addition of matrices

• Matrices are added for previous insertions as pointers to
  • insertion in current states
  • match states
• As with all recurrences the best scoring path is taken
• No additional time complexity

Figure labels:
• Score at beginning of previous insertion
• Score as a regular insertion: penalty incurred
• Score as a previous insertion: no additional penalty incurred

Löytynoja et al.






Probabilistic alignment

• Goal is to determine the ancestral sequence
• Sequences consist of vectors of probabilities for all residues at each position
  • p_a(x_i) = the probability that the ith position of sequence x has residue a (simply a profile)
• For the input sequences, the vectors have probability 1 assigned to the observed character at each position, and 0 to all others

Input sequence (each observed character has probability 1):

      1    2    3    4    5    6
A     1    0    0    0    0    0
C     0    0    0    0    0    0
G     0    1    0    0    1    0
T     0    0    1    1    0    1

Example profile with fractional probabilities:

      1    2    3    4    5    6
A    .5    0    0    0    0    0
C    .5   .5    0    0    0    0
G     0   .5    0    0    1    1
T     0    0    1    1    0    0
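Building the probability vectors for an input sequence can be sketched in a few lines. The example sequence AGTTGT is read off the first matrix above; the function names and the DNA alphabet constant are assumptions made here.

```python
ALPHABET = "ACGT"


def sequence_to_profile(sequence):
    """Probability 1 for the observed character at each position, 0 otherwise."""
    return {a: [1.0 if c == a else 0.0 for c in sequence] for a in ALPHABET}


def is_valid_profile(profile):
    """Every position's probabilities should sum to 1."""
    length = len(profile[ALPHABET[0]])
    return all(
        abs(sum(profile[a][i] for a in ALPHABET) - 1.0) < 1e-9
        for i in range(length)
    )


profile = sequence_to_profile("AGTTGT")
for residue in ALPHABET:
    print(residue, profile[residue])
```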

Probabilistic scoring function

Authors describe the conditional probability that residue a is at position k in z, the ancestral sequence

• Substitution probability between a and b given the evolutionary distance between x and its ancestor
• The probability that the ith position of sequence x has residue b

The authors define a normalized evolutionary score for matching residues at location x_i and y_j

• Equilibrium frequency of character a

Results

Study 20 primate mitochondrial D-loop sequences

The authors’ method produced phylogenetically consistent gaps

Regions deemed indel “hot spots” by CLUSTALW are likely artifacts of the method

CLUSTALW

Authors’ method

Results

Gaps are consistent with the phylogenetic tree

CLUSTALW artifacts? Gaps are largely inconsistent with the phylogenetic tree

References

• Do, C.B., Mahabhashyam, M.S.P., Brudno, M., and Batzoglou, S. 2005. PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment. Genome Research 15: 330-340.
• Löytynoja, A., Goldman, N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. PNAS 102(30): 10557-10562.
• Durbin, R., Eddy, S., Krogh, A., Mitchison, G. 1998. Biological sequence analysis. Cambridge University Press, Cambridge, UK.
• http://bioalgorithms.info
• Gille, C., Frommel, C. STRAP: editor for STRuctural Alignments of Proteins

Sequence homology in a tree

• Alignments can be represented in a tree
  • Evolutionary information contained in branch lengths and tree organization
  • Special problems for sequence insertions being carried up the tree
• Trees are used to guide multiple sequence alignments
  • Trees may be inferred by a program or provided as input to the program
