aminoacid+alignment including pam & blosum


Upload: shiv1987

Post on 18-Nov-2014


DESCRIPTION

It is a part of biostatistics & bioinformatics.

TRANSCRIPT

Page 1: Aminoacid+Alignment including PAM & BLOSUM

Amino acid substitution matrices

• Amino acids have different biochemical and physical

properties that influence their relative replaceability in

evolution

• Scoring matrices reflect:

– probabilities of mutual substitutions

– the probability of occurrence of each amino acid

• Widely used scoring matrices:

– PAM

– BLOSUM

Page 2: Aminoacid+Alignment including PAM & BLOSUM

Amino acid substitution matrices

• Certain amino acid substitutions commonly occur in

related proteins from different species.

• Because a protein still functions with these substitutions, the substituted amino acids are compatible with its structure and function.

• Knowing types of changes that are most and least

common in a large number of proteins can assist with

predicting alignments for any set of protein

sequences.

• If ancestor relationships among a group of proteins

are assessed, the most likely amino acid changes that

occurred during evolution can be predicted.

Page 3: Aminoacid+Alignment including PAM & BLOSUM

Point Accepted Mutation (PAM) Matrices

[Dayhoff substitution matrices]

The first systematic method for deriving amino acid substitution matrices was developed by Dayhoff et al. (1978), Atlas of Protein Sequence and Structure. These widely used substitution matrices are frequently called Dayhoff, MDM (Mutation Data Matrix), or PAM (Percent Accepted Mutation) matrices.

PAM approach: estimate the probability that b was substituted for a in a given measure of evolutionary distance.

KEY IDEA: trusted alignments of closely related sequences provide information about biologically permissible mutations.

Page 4: Aminoacid+Alignment including PAM & BLOSUM

Point Accepted Mutation (PAM) Matrices

[Dayhoff substitution matrices]

• This family of matrices lists the likelihood of change from one amino acid to another in homologous protein sequences during evolution.

• Each matrix gives the changes expected for a given period of evolutionary time, evidenced by decreased sequence similarity as genes that encoded the same protein diverge with increased evolutionary time.

• This leads to two possibilities:

– One matrix gives the changes expected in homologous proteins that have diverged only a small amount from each other in a relatively short period of time (about 50% similar)

– Another matrix gives the changes expected in proteins that have diverged over a much longer period, leaving only about 20% similarity.

Page 5: Aminoacid+Alignment including PAM & BLOSUM

…How PAM matrix is derived

• In deriving the PAM matrices, each change in the current amino acid at a particular site is assumed to be independent of previous mutational events at that site

• Thus, the probability of change of any amino acid ‘a’

to amino acid ‘b’ is the same, regardless of the

position of amino acid ‘a’ in a sequence.

– Based on Markov model (simple) which is

characterized by a series of changes of state in a

system such that a change from one state to

another does not depend on the previous history

of the state.

Page 6: Aminoacid+Alignment including PAM & BLOSUM

How PAM matrix is derived.. AA index

• To prepare the Dayhoff PAM matrices (Dayhoff 1978), amino acid substitutions that occurred in a group of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar.

• Because these changes were observed in closely

related proteins (>85% similar), they represented

amino acid substitutions that do not significantly

change the function of protein

• Hence they are called "accepted mutations", defined as amino acid changes accepted by natural selection

Page 7: Aminoacid+Alignment including PAM & BLOSUM

…How PAM matrix is derived

• To develop a single-letter code for the amino acids, Dr. Dayhoff attempted to make the code as easy to remember as possible. Of course, if the name of each amino acid began with a different letter, the code would be simple indeed. For 6 of the amino acids, the first letter of the name is unique, making the code simple.

• Cysteine Cys C (first letter)

• For the other amino acids, the first letter of the name is not

unique to a single amino acid, so Dr. Dayhoff assigned the

letters A, G, L, P and T to the amino acids Alanine, Glycine,

Leucine, Proline and Threonine, respectively, which occur

more frequently in proteins than do the other amino acids

having the same first letters.

Page 8: Aminoacid+Alignment including PAM & BLOSUM

…How PAM matrix is derived

• Some of the other amino acids are phonetically suggestive.

Arginine R aRginine

• For the remaining 5 amino acids, Dr. Dayhoff was reaching

somewhat to find an easy-to-remember connection between

the single letter and the amino acid.  She assigned aspartic

acid, asparagine, glutamic acid and glutamine the letters D, N,

E and Q, respectively, noting that D and N are nearer the

beginning of the alphabet than E and Q, and that Asp is

smaller than Glu, while Asn is smaller than Gln.

• By the time Dr. Dayhoff got to lysine, there were not too many

letters left, so she used the letter K, explaining that K is at

least near L in the alphabet.

Page 9: Aminoacid+Alignment including PAM & BLOSUM

…How PAM matrix is derived

Page 10: Aminoacid+Alignment including PAM & BLOSUM

First step: Pair Exchange Frequencies

• In order to identify accepted point mutations, a complete

phylogenetic tree including all ancestral sequences has to

be constructed.

• To avoid a large degree of ambiguities in this step, Dayhoff

and colleagues restricted their analysis to sequence

families with more than 85% identity.

A PAM (percent accepted mutation) is one accepted point mutation on the path between two sequences, per 100 residues.

Page 11: Aminoacid+Alignment including PAM & BLOSUM

First step: Pair Exchange Frequencies

• For each of the observed and inferred sequences, the amino acid pair exchanges are tabulated into a 20x20 matrix. It is assumed that the likelihood of an amino acid X being replaced by an amino acid Y is the same as that of Y replacing X. Hence the matrix is constructed symmetrically.

• Aij is the number of accepted mutations observed where

amino acid i replaces amino acid j.
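The tabulation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pair_exchange_counts` and the toy sequence pairs are hypothetical, and a real analysis would use ancestral sequences inferred from a phylogenetic tree rather than hand-picked pairs.

```python
from collections import Counter

def pair_exchange_counts(aligned_pairs):
    """Tabulate exchanges symmetrically, so that A[(i, j)] == A[(j, i)]."""
    counts = Counter()
    for ancestor, descendant in aligned_pairs:
        if len(ancestor) != len(descendant):
            raise ValueError("aligned sequences must have equal length")
        for x, y in zip(ancestor, descendant):
            if x != y:
                # One observed exchange is counted in both directions,
                # reflecting the symmetry assumption stated on the slide.
                counts[(x, y)] += 1
                counts[(y, x)] += 1
    return counts

# Hypothetical toy data: (ancestral sequence, descendant sequence).
toy_pairs = [("ADA", "ADB"), ("ADA", "AAA")]
A = pair_exchange_counts(toy_pairs)
print(A[("A", "B")], A[("B", "A")], A[("A", "D")])
```

Because every exchange is entered twice, the resulting table is symmetric by construction.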

Page 12: Aminoacid+Alignment including PAM & BLOSUM

Second step: Frequencies of Occurrence

• If the properties of amino acids differ and if they occur with different frequencies, all statements we can make about the average properties of sequences will depend on the frequencies of occurrence of the individual amino acids.

• These frequencies of occurrence are approximated by the frequencies of observation.

• They are the number of occurrences of a given amino acid divided by the number of amino acids observed.

Page 13: Aminoacid+Alignment including PAM & BLOSUM

Third step: Relative Mutabilities

• Relative mutabilities are evaluated by counting, in each group of related sequences, the number of changes of each amino acid and by dividing this number by a normalizing factor.

• This factor is the product of the frequency of occurrence of the amino acid in the group of sequences being analyzed and the total number of amino acid changes that occurred in that group per 100 sites

Page 14: Aminoacid+Alignment including PAM & BLOSUM

Third step: Relative Mutabilities

Aligned sequences:   A D A
                     A D B

Amino acid                                      A      B      D
Observed changes                                1      1      0
Frequency of occurrence (in total composition)  3      1      2
Relative mutability                             0.33   1      0
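The toy example above can be reproduced with a short Python sketch. It assumes the two aligned sequences are ADA and ADB and, like the slide, divides the number of changes by the raw occurrence count; the function name `relative_mutability` is hypothetical.

```python
from collections import Counter

def relative_mutability(seq_a, seq_b):
    """Changes per occurrence for each residue in two aligned sequences."""
    occurrences = Counter(seq_a + seq_b)
    changes = Counter()
    for x, y in zip(seq_a, seq_b):
        if x != y:
            # Both residues at a mismatched site are counted as changed.
            changes[x] += 1
            changes[y] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}

m = relative_mutability("ADA", "ADB")
print({aa: round(value, 2) for aa, value in m.items()})
```

Residue A changes once in three occurrences, B once in one, and D never, which matches the table.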

Page 15: Aminoacid+Alignment including PAM & BLOSUM

Amino acid frequencies (Frequency of Occurrence):

Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

The frequencies in the middle column are taken from Dayhoff (1978); the frequencies in the right column are taken from the 1991 recompilation of the mutation matrices, which represents a database of observations approximately 40 times larger than that available to Dayhoff.

Page 16: Aminoacid+Alignment including PAM & BLOSUM

Third step: Relative Mutabilities

• To obtain a complete picture of the mutational process,

the amino-acids that do not mutate are also taken into

account i.e., what is the chance, on average, that a given

amino acid will mutate at all.

• Based on the relative mutability scores of the amino acids, Asn, Ser, Asp and Glu were observed to be the most mutable amino acids, and Cys and Trp were the least mutable.

Page 17: Aminoacid+Alignment including PAM & BLOSUM

Example: Phe - Tyr

• Of 1572 observed amino acid changes, there were 260

changes between Phe and Tyr

• These numbers were multiplied by (a) mutability of

Phe & (b) the fraction of Phe to Tyr changes over all

changes of Phe to another amino acid – to obtain

mutation probability score of Phe to Tyr

• A similar score was obtained for changes of Tyr

Page 18: Aminoacid+Alignment including PAM & BLOSUM

Example: Phe - Tyr

• The resulting scores were summed up and divided by a normalizing factor such that their sum represents a probability of change of 1% → 250%

• Score for changing Phe to Tyr was 0.15

• Frequency of Phe occurrence in sequence data was 0.04

• Score for changing Tyr to Phe was 0.20

• Frequency of Tyr occurrence in sequence data was 0.03

• These changes can include both forward and reverse, i.e., Phe → Tyr as well as Tyr → Phe

Page 19: Aminoacid+Alignment including PAM & BLOSUM

Example: Phe - Tyr

• The relative frequency of change from Phe to Tyr would be 0.15/0.04 = 3.75

• Converting to a logarithm to the base 10 gives log10 3.75 = 0.57

• Multiplying by 10 to remove fractional values gives 5.7

• The relative frequency of change from Tyr to Phe would be 0.20/0.03 = 6.7; the log of this number is 0.83, which multiplied by 10 gives 8.3

• Average of 5.7 and 8.3 is 7
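The Phe/Tyr arithmetic can be checked with a minimal Python sketch. The helper name `log_odds_x10` is hypothetical; the four input numbers are the ones quoted on the slides.

```python
import math

def log_odds_x10(change_score, frequency):
    """Ten times the base-10 log of a change score over a background frequency."""
    return 10 * math.log10(change_score / frequency)

phe_to_tyr = log_odds_x10(0.15, 0.04)   # score Phe -> Tyr, frequency of Phe
tyr_to_phe = log_odds_x10(0.20, 0.03)   # score Tyr -> Phe, frequency of Tyr
symmetric_score = (phe_to_tyr + tyr_to_phe) / 2

print(round(phe_to_tyr, 1), round(tyr_to_phe, 1), round(symmetric_score))
```

Without intermediate rounding the Tyr → Phe value comes out near 8.2 rather than 8.3, because the slide rounds 0.20/0.03 to 6.7 and the logarithm to 0.83 first. The averaged score still rounds to 7.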

Page 20: Aminoacid+Alignment including PAM & BLOSUM

Formulation of PAM matrix

• The amino acid exchange counts and mutability values

were then used to generate a 20 x 20 mutation probability

matrix representing all possible amino acid changes
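One way to sketch this step in Python is shown below. It follows the commonly cited Dayhoff construction, in which the off-diagonal entry for "j replaced by i" is proportional to the mutability of j times the share of j's exchanges that go to i, and a scaling constant is chosen so that 1% of residues change overall. The three-letter alphabet, the counts, mutabilities and frequencies are all hypothetical toy values, not data from the slides.

```python
def mutation_probability_matrix(A, mutability, freq, target_change=0.01):
    """Return M[(i, j)]: probability that residue j is replaced by residue i."""
    letters = sorted(freq)
    # Scale so that the expected fraction of changed residues is target_change.
    lam = target_change / sum(freq[a] * mutability[a] for a in letters)
    M = {}
    for j in letters:
        col_total = sum(A.get((i, j), 0) for i in letters if i != j)
        off_diagonal = 0.0
        for i in letters:
            if i == j:
                continue
            share = A.get((i, j), 0) / col_total if col_total else 0.0
            M[(i, j)] = lam * mutability[j] * share
            off_diagonal += M[(i, j)]
        M[(j, j)] = 1.0 - off_diagonal   # probability that j stays unchanged
    return M

# Hypothetical toy inputs (symmetric exchange counts, mutabilities, frequencies).
A = {("A", "B"): 3, ("B", "A"): 3, ("A", "D"): 1,
     ("D", "A"): 1, ("B", "D"): 2, ("D", "B"): 2}
mutability = {"A": 0.5, "B": 1.0, "D": 0.8}
freq = {"A": 0.5, "B": 0.3, "D": 0.2}

M = mutation_probability_matrix(A, mutability, freq)
changed = sum(freq[j] * (1.0 - M[(j, j)]) for j in freq)
print(round(changed, 4))
```

Each column of the result sums to 1, and the frequency-weighted chance of any change equals the 1% target, which is the defining property of a PAM1-style matrix.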

Page 21: Aminoacid+Alignment including PAM & BLOSUM

• Amino acids are grouped according to the chemistry of the side group:

– C: sulfhydryl

– STPAG: small hydrophilic

– NDEQ: acid, acid amide and hydrophilic

– HRK: basic

– MILV: small hydrophobic

– FYW: aromatic

Sign of a matrix score:

+ Alignment more likely by ancestry than by chance

0 Ancestry and chance are equally likely

- Alignment more likely by chance than by ancestry

Page 22: Aminoacid+Alignment including PAM & BLOSUM

• Possible type of questions that can be answered are:

• “Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”

Page 23: Aminoacid+Alignment including PAM & BLOSUM

Constructing BLOSUM Matrices

Blocks Substitution Matrices

Page 24: Aminoacid+Alignment including PAM & BLOSUM

BLOSUM matrices

• Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff 1992].

• For example BLOSUM62 is derived from sequence alignments with no more than 62% identity.

Page 25: Aminoacid+Alignment including PAM & BLOSUM

BLOSUM Scoring Matrices

• BLOck SUbstitution Matrix

• Based on comparisons of blocks of sequences derived from the Blocks database

• The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment)

• BLOSUM matrices are derived from blocks whose level of sequence identity corresponds to the BLOSUM matrix number

Page 26: Aminoacid+Alignment including PAM & BLOSUM

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Conserved blocks in alignments

Page 27: Aminoacid+Alignment including PAM & BLOSUM

Collecting substitution statistics

1. Count amino acid pairs in each column; e.g., for the column AABACA:

– 6 AA pairs, 4 AB pairs, 4 AC, 1 BC, 0 BB, 0 CC.

– Total = 6+4+4+1 = 15

2. Normalize results to obtain probabilities (pX's and qXY's)

3. Compute log-odds score matrix from probabilities:

s(X,Y) = log (qXY / (pX pY))
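The pair count for the example column AABACA can be verified with a small Python sketch. The function name `column_pair_counts` is hypothetical; pairs are treated as unordered, so AB and BA are the same pair.

```python
from collections import Counter
from itertools import combinations

def column_pair_counts(column):
    """Count unordered residue pairs among all sequences in one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        counts["".join(sorted((x, y)))] += 1
    return counts

counts = column_pair_counts("AABACA")
print(dict(counts), sum(counts.values()))
```

A column with six residues always yields 6 x 5 / 2 = 15 pairs, which is the total given on the slide.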

Page 28: Aminoacid+Alignment including PAM & BLOSUM

Estimation of a BLOSUM matrix

• The BLOCKS database contains local multiple gap-free alignments of proteins.

• All pairs of amino acids in each column of each BLOCK are compared, and the observed pair frequencies are noted (e.g., A aligned with A makes up 1.5% of all pairs; A aligned with C makes up 0.01% of all pairs, etc.)

• Expected pair frequencies are computed from single amino acid frequencies (e.g., fA,C = fA x fC = 7% x 3% = 0.21%).

• For each amino acid pair the substitution scores are essentially computed as:

ID FIBRONECTIN_2; BLOCK
COG9_CANFA  GNSAGEPCVFPFIFLGKQYSTCTREGRGDGHLWCATT
COG9_RABIT  GNADGAPCHFPFTFEGRSYTACTTDGRSDGMAWCSTT
FA12_HUMAN  LTVTGEPCHFPFQYHRQLYHKCTHKGRPGPQPWCATT
HGFA_HUMAN  LTEDGRPCRFPFRYGGRMLHACTSEGSAHRKWCATTH
MANR_HUMAN  GNANGATCAFPFKFENKWYADCTSAGRSDGWLWCGTT
MPRI_MOUSE  ETDDGEPCVFPFIYKGKSYDECVLEGRAKLWCSKTAN
PB1_PIG     AITSDDKCVFPFIYKGNLYFDCTLHDSTYYWCSVTTY
SFP1_BOVIN  ELPEDEECVFPFVYRNRKHFDCTVHGSLFPWCSLDAD
SFP3_BOVIN  AETKDNKCVFPFIYGNKKYFDCTLHGSLFLWCSLDAD
SFP4_BOVIN  AVFEGPACAFPFTYKGKKYYMCTRKNSVLLWCSLDTE
SP1_HORSE   AATDYAKCAFPFVYRGQTYDRCTTDGSLFRISWCSVT
COG2_CHICK  GNSEGAPCVFPFIFLGNKYDSCTSAGRNDGKLWCAST
COG2_HUMAN  GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATT
COG2_MOUSE  GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG2_RABIT  GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATS
COG2_RAT    GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG9_BOVIN  GNADGKPCVFPFTFQGRTYSACTSDGRSDGYRWCATT
COG9_HUMAN  GNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATT
COG9_MOUSE  GNGEGKPCVFPFIFEGRSYSACTTKGRSDGYRWCATT
COG9_RAT    GNGDGKPCVFPFIFEGHSYSACTTKGRSDGYRWCATT
FINC_BOVIN  GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_HUMAN  GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_RAT    GNSNGALCHFPFLYSNRNYSDCTSEGRRDNMKWCGTT
MPRI_BOVIN  ETEDGEPCVFPFVFNGKSYEECVVESRARLWCATTAN
MPRI_HUMAN  ETDDGVPCVFPFIFNGKSYEECIIESRAKLWCSTTAD
PA2R_BOVIN  GNAHGTPCMFPFQYNQQWHHECTREGREDNLLWCATT
PA2R_RABIT  GNAHGTPCMFPFQYNHQWHHECTREGRQDDSLWCATT

\[ S_{A,C} = \log_{10} \frac{\text{Pair-freq(obs)}}{\text{Pair-freq(expected)}} = \log_{10} \frac{0.01\%}{0.21\%} = -1.3 \]
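The A/C score can be reproduced numerically. The sketch below assumes a base-10 logarithm, which is what the slide's result of -1.3 implies, and uses the slide's example frequencies; the function name `log_odds` is hypothetical.

```python
import math

def log_odds(observed, f_x, f_y):
    """Base-10 log of the observed pair frequency over the expected one."""
    expected = f_x * f_y          # expected pair frequency, as on the slide
    return math.log10(observed / expected)

# Slide values: A aligned with C in 0.01% of pairs; fA = 7%, fC = 3%.
s_ac = log_odds(observed=0.0001, f_x=0.07, f_y=0.03)
print(round(s_ac, 1))
```

A negative score means the pair is observed less often than expected by chance.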

Page 29: Aminoacid+Alignment including PAM & BLOSUM

Constructing a BLOSUM matrix

1. Counting mutations

Page 30: Aminoacid+Alignment including PAM & BLOSUM

2. Tallying mutation frequencies

Page 31: Aminoacid+Alignment including PAM & BLOSUM

3. Matrix of mutation probabilities

Page 32: Aminoacid+Alignment including PAM & BLOSUM

4. Calculate abundance of each residue (Marginal prob)

Page 33: Aminoacid+Alignment including PAM & BLOSUM

5. Obtaining a BLOSUM matrix
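The five steps can be strung together in a compact Python sketch. It is a simplified illustration, not the published procedure: it skips sequence clustering and weighting, uses a hypothetical two-column toy block, and the function name `blosum_like_scores` is an assumption. Pairs are counted as unordered, so the expected frequency of a mixed pair is 2 x pX x pY; this is equivalent to using ordered pair probabilities in the log-odds ratio.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Half-bit log-odds scores from one ungapped block of equal-length sequences."""
    pair_counts = Counter()
    for column in zip(*block):                      # 1. count pairs per column
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())               # 2. tally
    q = {pair: n / total for pair, n in pair_counts.items()}  # 3. pair probabilities
    p = Counter()                                   # 4. marginal residue probabilities
    for (x, y), q_xy in q.items():
        if x == y:
            p[x] += q_xy
        else:
            p[x] += q_xy / 2
            p[y] += q_xy / 2
    scores = {}
    for (x, y), q_xy in q.items():                  # 5. log-odds, scaled and rounded
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(q_xy / expected))
    return scores

# Hypothetical toy block: first column reads AABACA, second column is all C.
toy_block = ["AC", "AC", "BC", "AC", "CC", "AC"]
scores = blosum_like_scores(toy_block)
print(scores)
```

Pairs that never occur in the block get no entry here; a fuller implementation would need to handle zero counts, for example with pseudocounts.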

Page 34: Aminoacid+Alignment including PAM & BLOSUM

Constructing BLOSUM r

• To avoid bias in favor of a certain protein, first eliminate sequences that are more than r% identical

• The elimination is done by either

– removing sequences from the block, or

– finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster.

• BLOSUM r is the matrix built from blocks with no more than r% similarity

– E.g., BLOSUM62 is the matrix built using sequences with no more than 62% similarity.

– Note: BLOSUM 62 is the default matrix for protein BLAST
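The clustering option can be sketched as single-linkage grouping on percent identity. This is an assumption-laden illustration: the function names, the toy block and the use of ">= r" as the cut-off are hypothetical choices, and the sketch stops at forming clusters without building a representative sequence or weights.

```python
def percent_identity(a, b):
    """Percentage of identical positions in two ungapped, equal-length segments."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def cluster_by_identity(block, r):
    """Group sequences so that any pair at least r% identical shares a cluster."""
    clusters = []
    for seq in block:
        linked = [c for c in clusters
                  if any(percent_identity(seq, member) >= r for member in c)]
        # Merge every cluster the new sequence links to (single linkage).
        clusters = [c for c in clusters if not any(c is l for l in linked)]
        clusters.append([seq] + [member for c in linked for member in c])
    return clusters

# Hypothetical toy block of four aligned segments.
toy_block = ["AAAA", "AAAB", "CCCC", "CCDD"]
clusters = cluster_by_identity(toy_block, r=62)
print(clusters)
```

Here AAAA and AAAB are 75% identical and fall into one cluster, while CCCC and CCDD are only 50% identical and stay separate, so each would contribute independently to a BLOSUM62-style count.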

Page 35: Aminoacid+Alignment including PAM & BLOSUM

Obtaining BLOSUM62 Matrix

\[ S_{ij} = 2 \log_2 \frac{p_{ij}}{p_i \, p_j} \]

Page 36: Aminoacid+Alignment including PAM & BLOSUM

PAM & BLOSUM

The PAM family

• PAM matrices are based on global alignments of closely related proteins.

• The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.

• Other PAM matrices are extrapolated from PAM1.
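Extrapolation from PAM1 amounts to repeated matrix multiplication under the Markov assumption. The sketch below uses a hypothetical 3-state matrix in place of the real 20 x 20 PAM1 (each column is the original residue and sums to 1); the helper names `mat_mul` and `mat_pow` are assumptions.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

# Hypothetical toy "PAM1": entry [i][j] is the probability that j becomes i.
toy_pam1 = [
    [0.990, 0.004, 0.006],
    [0.006, 0.992, 0.004],
    [0.004, 0.004, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)
print([round(toy_pam250[i][i], 3) for i in range(3)])
```

The diagonal shrinks as the number of steps grows, since a residue is less and less likely to be unchanged, while each column still sums to 1.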

Page 37: Aminoacid+Alignment including PAM & BLOSUM

PAM & BLOSUM

The BLOSUM family

• BLOSUM matrices are based on local alignments.

• BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% identity.

• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

• BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Page 38: Aminoacid+Alignment including PAM & BLOSUM

PAM & BLOSUM

BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences. BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins. If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.

Rat versus mouse protein

Rat versus Bacterial protein