Page 1: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Protein Multiple Sequence Alignment

Sarah Aerni

CS374

December 7, 2006

Page 2: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Background we’ve seen before
• Alignment of sequences allows us to examine homologous regions
  • Two proteins with regions of high sequence similarity are likely to perform the same function
  • Conserved regions point to structural similarity

Page 3: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Multiple Sequence Alignment

Images from STRAP

Page 4: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Aligned regions represent spatial similarity

Images from STRAP

Page 5: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Background we’ve seen before
• Alignment of sequences allows us to examine homologous regions
  • Two proteins with regions of high sequence similarity are likely to perform the same function
  • Conserved regions point to structural similarity
• Evolutionary history can be inferred from similarity
  • Aligned residues should have evolved from the same ancestral residue

Page 6: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Recap on alignments

Classic pairwise sequence alignment
• Dynamic programming approaches
• Use affine gap penalty to more accurately model evolutionary events

Page 7: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Sequence alignment revisited

Two sequences of length L require $O(L^2)$ space and $O(3L^2)$ time**

Time complexity is $O(3L^2) = O(L^2)$**, since the constant factor of 3 (the candidate moves examined per cell) is dropped

**different algorithms may alter complexity
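The quadratic table and the three moves per cell can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: it uses a linear gap penalty rather than the affine penalty mentioned earlier, and the function name and default scores are assumptions chosen for the example.

```python
def needleman_wunsch_score(x: str, y: str, match: float = 1.0,
                           mismatch: float = -1.0, gap: float = -1.0) -> float:
    """Global alignment score with a linear gap penalty (illustrative scores)."""
    rows, cols = len(x) + 1, len(y) + 1
    # F[i][j] = best score for aligning x[:i] with y[:j]; the table has O(L^2) cells.
    F = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            # Three candidate moves per cell: diagonal, up, left.
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Each cell takes the maximum over three predecessors, which is where the factor of 3 in the time bound comes from.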

Page 8: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Recap on alignments

Classic pairwise sequence alignment
• Dynamic programming approaches
• Use affine gap penalty to more accurately model evolutionary events

Multiple sequence alignment approaches
• Why is the classic pairwise alignment not extendable to multiple sequences?

Page 9: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Sequence alignment revisited

Two sequences of length L require $O(L^2)$ space and $O(L^2)$ time

Three sequences of length L require $O(L^3)$ space and $O(L^3)$ time

Image from Durbin et al

Page 10: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Sequence alignment revisited

Two sequences of length L require $O(L^2)$ space and $O(L^2)$ time

Three sequences of length L require $O(L^3)$ space and $O(L^3)$ time

Four sequences?

N sequences?

Generally time is $O(L^N)$

Image from Durbin et al.

Page 11: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Run-time for the calculations

Let’s assume we have N sequences of length L
• Time complexity is $O(L^N)$
• Assume this computation takes $10^{2N-4}$ seconds
  • 2 sequences take 1 second
  • 3 sequences take 100 seconds

In our example they had N=12 sequences
• $10^{2 \cdot 12 - 4} = 10^{20}$ seconds
• 3 trillion years!!
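As a quick check of the arithmetic, here is a small Python sketch. It takes the assumed cost model of $10^{2N-4}$ seconds at face value; the function name and the Julian-year constant are choices made for this illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, an assumption for the conversion


def assumed_runtime_seconds(n_sequences: int) -> float:
    """Runtime under the assumed model of 10^(2N-4) seconds for N sequences."""
    return 10.0 ** (2 * n_sequences - 4)


for n in (2, 3, 12):
    secs = assumed_runtime_seconds(n)
    print(f"N={n}: {secs:.3g} s = {secs / SECONDS_PER_YEAR:.3g} years")
```

For N=12 the conversion lands a little above 3 trillion years, in line with the figure quoted above.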

Page 12: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Solutions

• Heuristic approaches to sequence alignment
• Progressive multiple alignment

Page 13: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive multiple alignment

Perform pairwise alignments for all sequences

Assume a match gives a score of 1, a mismatch is -0.25, indel is -0.5

Per-column scores: 1, -0.25, 1, 1, 1, 1

Total Score: 4.75
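The scoring scheme can be reproduced with a short Python sketch. The two aligned strings below are hypothetical (the original sequences were in an image); they are chosen only so that the columns give five matches and one mismatch.

```python
def score_alignment(row_a: str, row_b: str, match: float = 1.0,
                    mismatch: float = -0.25, indel: float = -0.5):
    """Return per-column scores and their total for two aligned rows ('-' = gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    scores = []
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            scores.append(indel)      # insertion or deletion
        elif a == b:
            scores.append(match)
        else:
            scores.append(mismatch)
    return scores, sum(scores)


# Hypothetical pair: one mismatch in the second column, matches elsewhere.
columns, total = score_alignment("ACGTTA", "AGGTTA")
print(columns, total)
```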

Page 14: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive multiple alignment

Perform pairwise alignments for all sequences

Assume a match gives a score of 1, a mismatch is -0.25, indel is -0.5

Total Score: 0.5

Page 15: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive multiple alignment

Perform pairwise alignments for all sequences

Assume a match gives a score of 1, a mismatch is -0.25, indel is -0.5

Total Score: 3.5

Page 16: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive multiple alignment

Perform pairwise alignments for all sequences

Assume a match gives a score of 1, a mismatch is -0.25, indel is -0.5

Total Score: 4.75

Page 17: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive multiple alignment

Create guide tree from pairwise alignments

Use tree to build multiple sequence alignment

Align most similar sequences first (give the most reliable alignments)

Align the profile to the next closest sequence

Page 19: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive multiple alignment

Create guide tree from pairwise alignments

Use tree to build multiple sequence alignment

Align most similar sequences first (give the most reliable alignments)

Align the profile to the next closest sequence

Align profiles to each other

Multiple sequence alignment will be at the root of the tree
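The guide-tree step can be sketched as greedy clustering on the pairwise scores. Average linkage is an assumption here (the slides do not say which clustering rule builds the tree), and the sequence names and scores are hypothetical.

```python
from itertools import combinations


def guide_tree(names, similarity):
    """Greedy average-linkage clustering; returns the tree as nested tuples.

    similarity maps a pair of names to a pairwise alignment score
    (hypothetical input format; higher means more similar).
    """
    def sim(a, b):
        return similarity[(a, b)] if (a, b) in similarity else similarity[(b, a)]

    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        best = None
        for (i, (_, mi)), (j, (_, mj)) in combinations(enumerate(clusters), 2):
            avg = sum(sim(a, b) for a in mi for b in mj) / (len(mi) * len(mj))
            if best is None or avg > best[0]:
                best = (avg, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


names = ["s1", "s2", "s3", "s4"]
scores = {("s1", "s2"): 4.75, ("s1", "s3"): 0.5, ("s1", "s4"): 1.0,
          ("s2", "s3"): 3.5, ("s2", "s4"): 1.5, ("s3", "s4"): 1.0}
print(guide_tree(names, scores))
```

The innermost tuple is the pair aligned first; each enclosing tuple is a later profile alignment, and the outermost tuple corresponds to the root where the full multiple alignment is produced.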

Page 20: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive multiple alignment

Page 21: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

ProbCons

Attempts to identify a* for all sequence pairs
• a* is the unknown alignment that best represents the true biological alignment

1. Create posterior probability matrix
2. Compute expected accuracies
   • Determine the expected number of correctly aligned pairs
3. Consistency-based transformation
   • Think transitive property of equality, using a 3rd sequence to recompute the match quality score
4. Compute guide tree
   • Hierarchical clustering by expected accuracies
5. Progressive alignment using guide tree

Page 22: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Posterior Probabilities

Use a pair-HMM for sequence alignment to compute the probability that letters $x_i$ and $y_j$ are paired in the true alignment

Image from Do et al

Page 23: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Posterior Probabilities

Can be represented as 3 matrices
• Match state (can only move diagonally)
• Insertion in x (only i can increase)
• Insertion in y (only j can increase)

Figure labels: transition probability, emission probability

Page 24: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Posterior Probabilities

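The probability of any unique alignment a can be computed as follows (up to normalization by $P(x, y)$)

$$P(a \mid x, y) \propto \pi(s_1) \prod_{i=1}^{n-1} \alpha(s_i \to s_{i+1}) \prod_{i=1}^{n} \beta(o_i \mid s_i)$$

• $\pi(s)$ = probability of starting in state s
• $\alpha(s_i \to s_{i+1})$ = transition probability
• $\beta(o_i \mid s_i)$ = emission probability of $o_i$ in state $s_i$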
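To make the product concrete, the following Python sketch multiplies the start, transition, and emission terms along one state path. All parameter values, the two-letter alphabet, and the state names M, X, Y are hypothetical and chosen only for illustration; they are not ProbCons parameters. The result is the unnormalized (joint) probability of the path.

```python
from math import prod


def path_probability(states, emissions, start, trans, emit):
    """Start term times all transitions times all emissions along one path."""
    p = start[states[0]]
    p *= prod(trans[(s, t)] for s, t in zip(states, states[1:]))
    p *= prod(emit[s][o] for s, o in zip(states, emissions))
    return p


# Hypothetical pair-HMM over a toy alphabet {A, B}.
start = {"M": 0.8, "X": 0.1, "Y": 0.1}
trans = {("M", "M"): 0.8, ("M", "X"): 0.1, ("M", "Y"): 0.1,
         ("X", "M"): 0.6, ("X", "X"): 0.4,
         ("Y", "M"): 0.6, ("Y", "Y"): 0.4}
emit = {
    "M": {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1},
    "X": {"A": 0.5, "B": 0.5},   # residue of x aligned to a gap
    "Y": {"A": 0.5, "B": 0.5},   # residue of y aligned to a gap
}

# x = "AB", y = "A": match A with A, then emit B from x alone.
p = path_probability(["M", "X"], [("A", "A"), "B"], start, trans, emit)
print(p)
```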

Page 25: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Posterior Probabilities

Compute the posterior probability that $x_i$ and $y_j$ are matched in a* (the “true” biological alignment)

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$

• Many paths exist through $x_i$ and $y_j$, whose probabilities sum to .35
• Path a2 (the most probable path) has probability of .08
• The summed probability of all paths which align $x_i$ and $y_j$ makes the alignment of these two residues very likely
• Some other path a2 may be the most probable path; however, no single pair in its path scores as high as $x_i$ and $y_j$

Page 26: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Posterior Probabilities

Compute the posterior probability that $x_i$ and $y_j$ are matched in a* (the “true” biological alignment)

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$

$\mathbf{1}\{x_i \sim y_j \in a\}$ evaluates to 1 when $x_i$ and $y_j$ are aligned in a, 0 otherwise

• Therefore, the probability that two residues are aligned is increased by having them appear in alignments presumed to be more probable.
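The indicator sum can be illustrated on a toy distribution over alignments. The four alignments and their probabilities below are hypothetical, and a real implementation would use the forward and backward algorithms rather than enumerating alignments; this sketch only shows why a pair can be well supported even when it is absent from the single most probable alignment.

```python
def posterior_match(alignments, pair):
    """Sum P(a | x, y) over every alignment a that contains the given pair."""
    return sum(prob for prob, pairs in alignments if pair in pairs)


# Hypothetical distribution over four candidate alignments of x and y.
# Each entry is (P(a | x, y), set of matched (i, j) index pairs).
alignments = [
    (0.30, {(1, 1), (2, 3)}),          # single most probable alignment
    (0.25, {(1, 1), (2, 2), (3, 3)}),
    (0.25, {(2, 2), (3, 3)}),
    (0.20, {(2, 2)}),
]

print(posterior_match(alignments, (2, 2)))  # pair missing from the best alignment
print(posterior_match(alignments, (2, 3)))  # pair found only in the best alignment
```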

Page 28: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Maximal Expected accuracy alignment

Use simple Needleman-Wunsch algorithm
• Use the posterior probabilities as the match and mismatch scores
• Gap penalties are set to 0

• Imagine increasingly green squares represent aligned residues whose probability approaches 1 and increasingly red approaches 0.

• We want to maximize the overall probability of the path by taking the “greenest” path.

Page 29: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Maximal Expected accuracy alignment

Use simple Needleman-Wunsch algorithm
• Use the posterior probabilities as the match and mismatch scores
• Gap penalties are set to 0

$$E\big(\mathrm{accuracy}(a, a^*) \mid x, y\big) = \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$$

Alignment computed to maximize most probable matches
• Finding exactly correct alignment is difficult and not crucial
• Maximizes the number of correctly aligned residues
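A minimal Python sketch of this dynamic program follows, assuming the posterior matrix is already available. The 3x3 matrix is hypothetical and the function name is a choice made for this example.

```python
def mea_alignment(P):
    """Maximum expected accuracy alignment from posteriors P[i][j] = P(x_i ~ y_j).

    Returns (matched 0-based index pairs, expected accuracy).
    """
    n, m = len(P), len(P[0])
    # F[i][j] = best summed posterior for x[:i] versus y[:j]; gaps cost 0.
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + P[i - 1][j - 1],
                          F[i - 1][j],
                          F[i][j - 1])
    # Trace back to recover which residue pairs were matched.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if F[i][j] == F[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, F[n][m] / min(n, m)


# Hypothetical posterior matrix for two sequences of length 3.
P = [[0.9, 0.05, 0.0],
     [0.1, 0.2, 0.6],
     [0.0, 0.1, 0.3]]
pairs, accuracy = mea_alignment(P)
print(pairs, accuracy)
```

Because gaps are free, the recurrence simply collects the largest total posterior along a monotone path, which is the "greenest" path in the picture described earlier.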

Page 31: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Consistency-based scheme

[Figure: sequences x, z, and y, with residues $x_i$, $z_k$, and $y_j$ marked]

• Take location k in sequence z
• $z_k$ aligns with location i in sequence x
• $z_k$ aligns with location j in sequence y

Page 32: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Consistency-based scheme

[Figure: sequences x, z, and y, with residues $x_i$, $z_k$, and $y_j$ marked]

In the ProbCons consistency-based scheme, the alignment of $x_i$ to $y_j$ will receive a high score.

Page 33: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Probabilistic consistency transformation

Given a set of sequences, S, we can compute

$$P(x_i \sim y_j \in a^* \mid S) = \frac{1}{|S|} \sum_{z \in S} \sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y)$$

• Remove all values in $P_{xz}$ and $P_{zy}$ that are below a certain threshold
• Then we obtain the probability of residues $x_i$ and $y_j$ being aligned given the set of all sequences
• $P(x_i \sim x_j \in a^* \mid x, x)$ is 1 if i=j, 0 otherwise
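In matrix form the inner sum over $z_k$ is a product of two posterior matrices, averaged over all z in S. The sketch below is a plain-Python illustration under that reading; it omits the thresholding step, and the sequence names, lengths, and posterior values are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(post, lengths):
    """One round of the transformation; post[(x, y)] is a len(x) by len(y) matrix."""
    names = list(lengths)

    def p(a, b):
        if a == b:
            return identity(lengths[a])       # a sequence aligns to itself
        return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])

    updated = {}
    for x, y in post:
        total = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:                        # includes z = x and z = y
            step = matmul(p(x, z), p(z, y))
            total = [[t + s for t, s in zip(tr, sr)] for tr, sr in zip(total, step)]
        updated[(x, y)] = [[v / len(names) for v in row] for row in total]
    return updated


# Toy case: three sequences of length 1, so every matrix is 1 by 1.
lengths = {"x": 1, "y": 1, "z": 1}
post = {("x", "y"): [[0.2]], ("x", "z"): [[0.9]], ("z", "y"): [[0.8]]}
new_post = consistency_transform(post, lengths)
print(new_post[("x", "y")])
```

In the toy case the weak direct evidence for x and y (0.2) is raised because both residues align strongly to the same residue of z.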

Page 34: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

ProbCons

Attempts to identify a* for all sequence pairs
• a* is the unknown alignment that best represents the true biological alignment

1. Create posterior probability matrix
2. Compute expected accuracies
   • Determine the expected number of correctly aligned pairs
3. Consistency-based transformation
   • Think transitive property of equality, using a 3rd sequence to recompute the match quality score
4. Compute guide tree
   • Hierarchical clustering by expected accuracies
5. Progressive alignment using guide tree
   • Sum-of-pairs mode following guide tree

Page 35: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Results

BAliBASE benchmark data
• 141 reference protein alignments
  • hand-constructed alignments
  • structural alignments
• 5 reference sets with varying degrees of similarity

Scored accuracy for 5 aligners and ProbCons
• Sum-of-pairs (SP) – number of correctly aligned residue pairs divided by total number in reference set
• Column Score (CS) – number of correctly aligned columns divided by total number of aligned columns in reference set
• Additional (ProbCons-ext) extends HMM to model long terminal insertions in x or y
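The two accuracy measures can be sketched directly from their definitions. This is an illustrative reading, not the official BAliBASE scoring program: the function names are hypothetical, both alignments are assumed to list the same sequences in the same order, and a reference column counts as "aligned" here when it holds at least two residues.

```python
from itertools import combinations


def columns(rows):
    """Yield each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(rows)
    for col in zip(*rows):
        ids = []
        for r, ch in enumerate(col):
            if ch == "-":
                ids.append(None)
            else:
                ids.append(counters[r])
                counters[r] += 1
        yield tuple(ids)


def aligned_pairs(rows):
    """Set of (row r, residue i, row s, residue j) pairs sharing a column."""
    pairs = set()
    for col in columns(rows):
        for (r, i), (s, j) in combinations(enumerate(col), 2):
            if i is not None and j is not None:
                pairs.add((r, i, s, j))
    return pairs


def sp_score(test, ref):
    ref_pairs = aligned_pairs(ref)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


def column_score(test, ref):
    ref_cols = [c for c in columns(ref) if sum(i is not None for i in c) >= 2]
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)


# Hypothetical reference and test alignments of the same three sequences.
ref = ["ACG-T", "A-GCT", "ACGCT"]
test = ["ACG-T", "AG-CT", "ACGCT"]
print(sp_score(test, ref), column_score(test, ref))
```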

Page 36: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Results

ProbCons shows high column reliability in actual homology regions
• Only core blocks in BAliBASE alignment are considered “actual”
• ProbCons may be detecting true homology outside core regions

Column reliability computed as proportion of correct pairwise matches

Page 37: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Results

Comparative Analysis
• ProbCons outscores all other aligners in every benchmark dataset
• Runtime is moderate compared to others
• When ProbCons-ext is the top-scoring method, ProbCons is second in most cases
  • Exception: Reference set 4, “sequences with large N/C-terminal extensions”

Page 38: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

Insertions are penalized multiple times even though the event occurs only once!

Page 39: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

Initial set of sequences

Sequences are organized into tree

Page 40: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

Pairwise sequence alignments are performed between evolutionarily closest sequences

Page 41: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

The top sequence contains an insertion and a gap penalty is incurred

Gap penalty incurred for the alignment

With two sequences it is not clear whether the gap represents an insertion or a deletion!
• This sequence may have undergone a deletion event (two Ts were removed)
• This sequence may have undergone an insertion event (two Ts were inserted)

Page 42: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

In the most parsimonious explanation, this gap represents a deletion in the middle sequence (it is not present in any other sequence)

Will be scored as 1 pairwise match and two gaps

Page 43: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

If we examine the sequences in a tree, we can see the deletion occurred once, and has been scored accordingly.

Page 44: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

Deletion occurs in this branch

Page 45: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

Let’s examine an insertion event

Insertion occurs in this branch

Page 46: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

In the most parsimonious explanation, this gap represents an insertion in the top sequence

Will be scored as two gaps

Aligned gaps receive no match score!

Page 47: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

The scoring issues grow with an increased number of sequences!

Page 48: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

Although this alignment represents only one insertion event, it is penalized for n-1 gaps (where n is the number of sequences) and receives no match score

Page 49: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Problems with insertions

Although both alignments represent the same number of events, they will be scored differently.

Single deletion event: n-1 gap penalties and $\binom{4}{2}$ match scores

Single insertion event: n-1 gap penalties
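The asymmetry can be checked by counting pairwise terms in a single alignment column. The sketch below assumes sum-of-pairs scoring with n = 5 sequences and hypothetical columns; gap-to-gap pairs contribute nothing, as stated earlier.

```python
from itertools import combinations


def sp_column_counts(column):
    """Count (matches, mismatches, gap penalties) over all pairs in one column."""
    matches = mismatches = gaps = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue              # aligned gaps receive no score
        elif a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# One residue deleted in a single sequence: the other four still match each other.
print(sp_column_counts("TTTT-"))
# One residue inserted in a single sequence: nothing to match against.
print(sp_column_counts("T----"))
```

Both columns incur n-1 = 4 gap penalties, but only the deletion column also collects the six pairwise match scores.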

Page 50: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Other programs

Other methods treat all gaps as deletions and use heuristics to correct for repeated gaps
• Ex: CLUSTALW lowers gap penalties as multiple gaps build up in one region
• Infers long ancestral sequences
  • Every insertion is modeled as original sequence!

First gap penalized as usual

Subsequent gaps exhibit lower gap penalty

Page 52: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Löytynoja et al.

Propose skipping inserted subsequences after they have already been aligned
• Implements affine gap pairwise alignments
  • HMM with states similar to those in ProbCons
  • Introduce matrices which store previous insertions

Use an evolutionary scoring function
• All state transitions are described by indel rates and evolutionary distances
• Character emission is described by an “evolutionary substitution” model
• Various models can be incorporated

Page 53: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Progressive alignment

Algorithm to allow pre-existing gaps to be skipped
• Insertion has already been penalized in the top profile
• Keep track of pointers for all previous insertions
• Allow “free ride” from previous insertion during an alignment

AC in both sequences aligned as usual

“Free ride” is given over insertion in child branch: no penalty

The remaining alignment is scored regularly

Page 55: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Addition of matrices
• Matrices are added for previous insertions as pointers to
  • insertion in current states
  • match states
• As with all recurrences the best scoring path is taken
• No additional time complexity

Page 57: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Addition of matrices
• Matrices are added for previous insertions as pointers to
  • insertion in current states
  • match states
• As with all recurrences the best scoring path is taken
• No additional time complexity

Score at beginning of previous insertion
• Score as a regular insertion: penalty incurred
• Score as a previous insertion: no additional penalty incurred

Page 59: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Probabilistic alignment

Goal is to determine the ancestral sequence
• Sequences consist of vectors of probabilities for all residues at each position
  • $p_a(x_i)$ = the probability that the ith position of sequence x has residue a (simply a profile)
• For the input sequences, the vectors have probability 1 assigned to the observed character at each position, and 0 to all others

Position  1  2  3  4  5  6
A         1  0  0  0  0  0
C         0  0  0  0  0  0
G         0  1  0  0  1  0
T         0  0  1  1  0  1
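Building such a profile from an observed sequence is a one-hot encoding. In the Python sketch below the sequence AGTTGT is read off the matrix above (the residue with probability 1 in each column); the function name and the dictionary layout are choices made for this illustration.

```python
ALPHABET = "ACGT"


def one_hot_profile(seq: str):
    """Map each residue a to its row of probabilities p_a(x_i) along the sequence."""
    return {a: [1.0 if ch == a else 0.0 for ch in seq] for a in ALPHABET}


profile = one_hot_profile("AGTTGT")
for residue in ALPHABET:
    print(residue, profile[residue])
```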

Page 60: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Probabilistic alignment

Goal is to determine the ancestral sequence
• Sequences consist of vectors of probabilities for all residues at each position
  • $p_a(x_i)$ = the probability that the ith position of sequence x has residue a (simply a profile)
• For the input sequences, the vectors have probability 1 assigned to the observed character at each position, and 0 to all others

Position  1   2   3  4  5  6
A         .5  0   0  0  0  0
C         .5  .5  0  0  0  0
G         0   .5  0  0  1  1
T         0   0   1  1  0  0

Page 61: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Probabilistic scoring function

Authors describe the conditional probability that residue a is in position k of z, the ancestral sequence
• Substitution probability between a and b given the evolutionary distance between x and its ancestor
• The probability that the ith position of sequence x has residue b

The authors define a normalized evolutionary score for matching residues at locations $x_i$ and $y_j$
• Equilibrium frequency of character a

Page 62: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Results

Study 20 primate mitochondrial D-loop sequences

The authors’ method produced phylogenetically consistent gaps

Regions deemed indel “hot spots” by CLUSTALW are likely artifacts of the method

CLUSTALW

Authors’ method

Page 63: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Results

Gaps are consistent with the phylogenetic tree

Page 64: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

CLUSTALW: artifacts?

Gaps are largely inconsistent with the phylogenetic tree

Page 65: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

References

Do, C.B., Mahabhashyam, M.S.P., Brudno, M., and Batzoglou, S. 2005. PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment. Genome Research 15: 330-340.

Löytynoja, A., and Goldman, N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. PNAS 102(30): 10557-10562.

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis. Cambridge University Press, Cambridge, UK.

http://bioalgorithms.info

Gille, C., and Frommel, C. STRAP: editor for STRuctural Alignments of Proteins.

Page 66: Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006

Sequence homology in a tree

Alignments can be represented in a tree
• Evolutionary information contained in branch lengths and tree organization
• Special problems for sequence insertions being carried up the tree

Trees are used to guide multiple sequence alignments
• The tree may be inferred by a program
• or provided as input to the program