1 molecular evolution stat 246, spring 2002, week 6a
Post on 01-Jan-2016
1
Molecular Evolution
Stat 246, Spring 2002, Week 6a
2
Evolution using molecules
• Our DNA is inherited from our parents more or less unchanged.
• Molecular evolution is dominated by mutations that are neutral from the standpoint of natural selection.
• Mutations accumulate at fairly steady rates in surviving lineages.
• We can study the evolution of (macro) molecules and reconstruct the evolutionary history of organisms using their molecules.
3
Some important dates in history (billions of years ago)
Origin of the universe                 15 ± 4
Formation of the solar system          4.6
First self-replicating system          3.5 ± 0.5
Prokaryotic-eukaryotic divergence      1.8 ± 0.3
Plant-animal divergence                1.0
Invertebrate-vertebrate divergence     0.5
Mammalian radiation beginning          0.1
Source: Doolittle et al., CSH 1986.
4
5
Two important early observations
Different proteins evolve at different rates, and this seems more or less independent of the host
organism, including its generation time.
It is necessary to adjust the observed percent difference between two homologous proteins to get a distance more or less linearly related
to the time since their common ancestor.
6
[Figure, after Dickerson (1971): corrected amino acid changes per 100 residues (vertical axis, 0 to 220) against millions of years since divergence (horizontal axis, 0 to 1400), with geological periods from Huronian to Pliocene marked along the time axis. Curves: fibrinopeptides (1.1 MY), hemoglobin (5.8 MY) and cytochrome c (20.0 MY). Labelled divergence points: mammals, birds/reptiles, reptiles/fish, carp/lamprey, mammals/reptiles, vertebrates/insects, and the separation of the ancestors of plants and animals. Inset label: evolution of the globins.]
Rates of macromolecular evolution
7
Protein                               Rate (a)   Theoretical lookback time (b)
Pseudogenes                           400        45 (c)
Fibrinopeptides                       90         200 (c)
Lactalbumins                          27         670 (c)
Lysozymes                             24         750 (c)
Ribonucleases                         21         850 (c)
Hemoglobins                           12         1.5 (d)
Acid proteases                        8          2.3 (d)
Triosephosphate isomerase             3          6 (d)
Phosphoglyceraldehyde dehydrogenase   2          9 (d)
Glutamate dehydrogenase               1          18 (d)

(a) PAMs (accepted point mutations) per 100 residues per 10^8 years.
(b) Useful lookback time = 360 PAMs.
(c) Million years.
(d) Billion years.
Source: Doolittle 1986.
Different rates of change for different proteins
8
Rates of change in protein families

Protein                          Rate (a)   Protein                          Rate (a)
Fibrinopeptides                  90         Thyrotropin beta chain           7.4
Growth hormone                   37         Parathyrin                       7.3
Ig kappa chain C region          37         Parvalbumin                      7.0
Kappa casein                     33         BPTI Protease inhibitors         6.2
Ig gamma chain C region          31         Trypsin                          5.9
Lutropin beta chain              30         Melanotropin beta                5.6
Ig lambda chain C region         27         Alpha crystallin A chain         5.0
Complement C3a                   27         Endorphin                        4.8
Lactalbumin                      27         Cytochrome b5                    4.5
Epidermal growth factor          26         Insulin                          4.4
Somatotropin                     25         Calcitonin                       4.3
Pancreatic ribonuclease          21         Neurophysin 2                    3.6
Lipotropin beta                  21         Plastocyanin                     3.5
Haptoglobin alpha chain          20         Lactate dehydrogenase            3.4
Serum albumin                    19         Adenylate cyclase                3.2
Phospholipase A2                 19         Triosephosphate isomerase        2.8
Protease inhibitor PST1 type     18         Vasoactive intestinal peptide    2.6
Prolactin                        17         Corticotropin                    2.5
Pancreatic hormone               17         Glyceraldehyde 3-P DH            2.2
Carbonic anhydrase C             16         Cytochrome C                     2.2
Lutropin alpha chain             16         Plant ferredoxin                 1.9
Hemoglobin alpha chain           12         Collagen                         1.7
Hemoglobin beta chain            12         Troponin C, skeletal muscle      1.5
Lipid-binding protein A-II       10         Alpha crystallin B-chain         1.5
Gastrin                          9.8        Glucagon                         1.2
Animal lysozyme                  9.8        Glutamate DH                     0.9
Myoglobin                        8.9        Histone H2B                      0.9
Amyloid A                        8.7        Histone H2A                      0.5
Nerve growth factor              8.5        Histone H3                       0.14
Acid proteases                   8.4        Ubiquitin                        0.1
Myelin basic protein             7.4        Histone H4                       0.1

(a) Percent per 100 My. From Nei (1987) and Dayhoff et al. (1978).
9
Beta-globins (orthologues)

                 10                  20                  30                  40
M V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q  BG-human
- . . . . . . . . N . . . T . . . . . . . . . . . . . . . . . . . . . . . . . .  BG-macaque
- - M . . A . . . A . . . . F . . . . K . . . . . . . . . . . . . . . . . . . .  BG-bovine
- . . . S G G . . . . . . N . . . . . . I N . L . . . . . . . . . . . . . . . .  BG-platypus
. . . W . A . . . Q L I . G . . . . . . . A . C . A . . . A . . . I . . . . . .  BG-chicken
- . . W S E V . L H E I . T T . K S I D K H S L . A K . . A . M F I . . . . . T  BG-shark

                 50                  60                  70                  80
R F F E S F G D L S T P D A V M G N P K V K A H G K K V L G A F S D G L A H L D  BG-human
. . . . . . . . . . S . . . . . . . . . . . . . . . . . . . . . . . . . N . . .  BG-macaque
. . . . . . . . . . . A . . . . N . . . . . . . . . . . . D S . . N . M K . . .  BG-bovine
. . . . A . . . . . S A G . . . . . . . . . . . . A . . . T S . G . A . K N . .  BG-platypus
. . . A . . . N . . S . T . I L . . . M . R . . . . . . . T S . G . A V K N . .  BG-chicken
. Y . G N L K E F T A C S Y G - - - - - . . E . A . . . T . . L G V A V T . . G  BG-shark

                 90                 100                 110                 120
N L K G T F A T L S E L H C D K L H V D P E N F R L L G N V L V C V L A H H F G  BG-human
. . . . . . . Q . . . . . . . . . . . . . . . . K . . . . . . . . . . . . . . .  BG-macaque
D . . . . . . A . . . . . . . . . . . . . . . . K . . . . . . . V . . . R N . .  BG-bovine
D . . . . . . K . . . . . . . . . . . . . . . . N R . . . . . I V . . . R . . S  BG-platypus
. I . N . . S Q . . . . . . . . . . . . . . . . . . . . D I . I I . . . A . . S  BG-chicken
D V . S Q . T D . . K K . A E E . . . . V . S . K . . A K C F . V E . G I L L K  BG-shark

                130                 140
K E F T P P V Q A A Y Q K V V A G V A N A L A H K Y H  BG-human
. . . . . Q . . . . . . . . . . . . . . . . . . . . .  BG-macaque
. . . . . V L . . D F . . . . . . . . . . . . . R . .  BG-bovine
. D . S . E . . . . W . . L . S . . . H . . G . . . .  BG-platypus
. D . . . E C . . . W . . L . R V . . H . . . R . . .  BG-chicken
D K . A . Q T . . I W E . Y F G V . V D . I S K E . .  BG-shark

. means same as reference sequence
- means deletion
10
Beta-globins: uncorrected pairwise distances
DISTANCES between protein sequences
Calculated over: 1 to 147
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids

       hum   mac   bov   pla   chi   sha
hum   ----     5    16    23    31    65
mac      7  ----    17    23    30    62
bov     23    24  ----    27    37    65
pla     34    34    39  ----    29    64
chi     45    44    52    42  ----    61
sha     91    88    91    90    87  ----
11
Beta-globins: corrected pairwise distances
DISTANCES between protein sequences
Calculated over: 1 to 147
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor

       hum   mac   bov   pla   chi   sha
hum   ----     5    17    27    37   108
mac      7  ----    18    27    36   102
bov     23    24  ----    32    46   110
pla     34    34    39  ----    34   106
chi     45    44    52    42  ----    98
sha     91    88    91    90    87  ----
12
UPGMA tree
[Figure: UPGMA tree with leaves, in the order listed on the slide, BG-bovine, BG-human, BG-macaque, BG-platypus, BG-chicken and BG-shark.]
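As a rough illustration of how a UPGMA tree can be built from a distance table, here is a minimal sketch in plain Python. It assumes the Jukes-Cantor corrected beta-globin distances (per 100 amino acids) as input and plain average linkage. The function name `upgma` and the nested-tuple output are illustrative choices, not part of the lecture, and the topology it prints is what this sketch produces rather than something read off the slide's figure.

```python
from itertools import combinations

labels = ["hum", "mac", "bov", "pla", "chi", "sha"]
# Corrected distances (substitutions per 100 amino acids), above-diagonal part.
upper = {
    ("hum", "mac"): 5, ("hum", "bov"): 17, ("hum", "pla"): 27,
    ("hum", "chi"): 37, ("hum", "sha"): 108,
    ("mac", "bov"): 18, ("mac", "pla"): 27, ("mac", "chi"): 36,
    ("mac", "sha"): 102,
    ("bov", "pla"): 32, ("bov", "chi"): 46, ("bov", "sha"): 110,
    ("pla", "chi"): 34, ("pla", "sha"): 106,
    ("chi", "sha"): 98,
}


def upgma(labels, upper):
    """Return a nested-tuple tree built by average-linkage clustering."""
    # Each cluster is a pair (tree, tuple_of_member_leaves).
    clusters = [(name, (name,)) for name in labels]

    def leaf_dist(a, b):
        return upper[(a, b)] if (a, b) in upper else upper[(b, a)]

    def cluster_dist(c1, c2):
        # UPGMA distance: mean of all leaf-to-leaf distances between clusters.
        total = sum(leaf_dist(a, b) for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_dist(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


tree = upgma(labels, upper)
print(tree)
```

With these inputs the first merge is human with macaque (distance 5), and shark joins last.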
13
                 10                  20                  30                  40                  50                  60
C C G A C A G G C A C G G T G G C T C A C A C C T G T A A T C C C A G T A C T T T G G G A G G C T G A G G C G A G A G G  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . . . . . A . . . . . . . . . . . . . . . . . . . C . . . . . . . C . . . . . . . . . . . . . . . . . . . G . . . .  chimp
. . . . . . . A . . . . . . . . . . . . . . . . . . . C . . . . . . . C . . . . . . . . . . . . . . . . . . . G . . . .  bonob
. . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . C . . . . . . . . G . . . .  goril
. G . . . . . A . . . . . . . . . . . . . G . . . . . . . . . . . . . C . . . . . . . . . . . . C . . . . T . G . C . .  orang

                 70                  80                  90                 100                 110                 120
A T C A C C T G A G G T C G G G A G T T T G A G A C C A G C C T G A C C A A T A T G G A G A A A C C C C A G T T A T A C  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . .  hum-3
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  chimp
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  bonob
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . .  goril
. . . . . . . . . . . . T . . . . . . . C . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . .  orang

                130                 140                 150                 160                 170                 180
T A A A A A T A C A A A A T T A G C T G G G T G T G G T G G C G C A T G C C T G T A A T C C T A G C T A C T A G G A A G  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . .  chimp
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . .  bonob
. . . . . . . . . . . . . . . . . . . . . . . . G . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . .  goril
. . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . T . C . . . . . . . . . . G . .  orang

                190                 200                 210                 220                 230                 240
G C T G A G G C A G G A G A A T C G C T T G A A C C C G G G A G G T G G A G G T T G A G G T G A G C T G A G A T C A C G  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  chimp
. . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  bonob
. . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . A . . . . . . . . . T . . . . . . . . . . . . . . . . . A  goril
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . . . . . G . .  orang

                250                 260                 270                 280                 290                 300
C C A T T G C A C T C C A G C C T G G G C A A C A A G A G C A A A A C T C C G T C T C A A A A A A T A A A T A A A T A A  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . C . . . .  chimp
. . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . .  bonob
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . .  goril
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . T . . . . . . . . . . . . . . . . .  orang

Alu sequences (α-globin2 Alu 1, Knight et al., 1996)
14
Alu sequences: uncorrected pairwise distances
DISTANCES between nucleic acid sequences
Calculated over: 1 to 300, considering all base positions
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 bases

        hum-1  hum-2  hum-3  chimp  bonob  goril  orang
hum-1    ----      0      0      4      3      5      7
hum-2       1   ----      1      4      4      5      7
hum-3       1      2   ----      4      4      5      7
chimp      12     13     13   ----      1      5      6
bonob      10     11     11      2   ----      4      5
goril      14     15     15     14     12   ----      7
orang      20     21     21     18     16     22   ----
15
Alu sequences: corrected pairwise distances
DISTANCES between nucleic acid sequences
Calculated over: 1 to 300, considering all base positions
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 bases
Correction method: Jukes-Cantor

        hum-1  hum-2  hum-3  chimp  bonob  goril  orang
hum-1    ----      0      0      4      3      5      7
hum-2       1   ----      1      4      4      5      7
hum-3       1      2   ----      4      4      5      7
chimp      12     13     13   ----      1      5      6
bonob      10     11     11      2   ----      4      6
goril      14     15     15     14     12   ----      8
orang      20     21     21     18     16     22   ----
16
Correcting distances between DNA and protein sequences
We mentioned earlier that it is necessary to adjust observed percent differences to get a distance measure which scales linearly with time. This is because we can have multiple and back substitutions at a given position along a lineage.
All of the correction methods (with names like Jukes-Cantor, 2-parameter Kimura, etc) are justified by probabilistic arguments, whose basis is worth mastering. The same evolutionary ideas are used in scoring sequence alignments.
17
The Jukes-Cantor model of nucleotide substitution
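[Diagram: the common ancestor of human and orang, $t$ time units before human (now).]
Consider the nt at the 2nd position in α-globin2 Alu 1.
Infinitesimal matrix, with rows and columns ordered A, G, C, T:
$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$$
$\alpha$ = rate of substitution of one nt by another, assumed constant.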
18
A simple 4-state Markov chain: the Kimura 2-parameter model for nucleotide change
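[Diagram: the four nucleotides at the corners of a square, A and G on the top edge, T and C on the bottom edge.]
Transition rates: horizontal: $a$; diagonal and vertical: $b$; self: $c = -a - 2b$.
Rate matrix, with rows and columns ordered A, G, C, T:
$$\begin{pmatrix} c & a & b & b \\ a & c & b & b \\ b & b & c & a \\ b & b & a & c \end{pmatrix}$$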
19
Markov chain
State space = {A,C,G,T}.
$P(i,j)$ = pr(next state $S_j$ | current state $S_i$)
Markov assumption:
$P(i,j)$ = pr(next state $S_j$ | current state $S_i$ and any configuration of states before this)
Only the present state, not previous states, affects the probabilities of moving to next states.
20
The multiplication rule
pr(state after next is $S_k$ | current state is $S_i$)
$= \sum_j$ pr(state after next is $S_k$, next state is $S_j$ | current state is $S_i$) [addition rule]
$= \sum_j$ pr(next state is $S_j$ | current state is $S_i$) × pr(state after next is $S_k$ | current state is $S_i$, next state is $S_j$) [multiplication rule]
$= \sum_j P(i,j) P(j,k)$ [Markov assumption]
$= (i,k)$-element of $P^2$, where $P = (P(i,j))$.
More generally, pr(state $t$ steps from now is $S_k$ | current state is $S_i$) = $(i,k)$-element of $P^t$.
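The t-step rule can be checked numerically. The following is a minimal sketch in plain Python, assuming a hypothetical one-step matrix in which each specific nucleotide change has probability 0.01. That value and the helper names `mat_mul` and `mat_pow` are illustrative, not taken from the lecture.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, t):
    """Return p multiplied by itself t times (t = 0 gives the identity)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, p)
    return result


STATES = "ACGT"
EPS = 0.01  # assumed probability of each specific change in one step
P = [[1 - 3 * EPS if i == j else EPS for j in range(4)] for i in range(4)]

a, g = STATES.index("A"), STATES.index("G")
# Addition and multiplication rules: sum over the intermediate state j.
two_step_by_sum = sum(P[a][j] * P[j][g] for j in range(4))
P2 = mat_pow(P, 2)
P10 = mat_pow(P, 10)
print(two_step_by_sum, P2[a][g])  # the two numbers agree
print(sum(P10[a]))                # each row of P^t still sums to 1
```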
21
Continuous-time version
For any $s, t$ write $p_{ij}(t)$ = pr($S_j$ at time $t+s$ | $S_i$ at time $s$) for the stationary (time-homogeneous) transition probabilities.
Write $P(t) = (p_{ij}(t))$ for the matrix of the $p_{ij}(t)$'s.
Then for any $t, u$: $P(t+u) = P(t)P(u)$.
It follows that $P(t) = \exp(Qt)$, where $Q = P'(0)$ is the derivative of $P(t)$ at $t = 0$.
$Q$ is called the infinitesimal matrix of $P(t)$, and satisfies $P'(t) = QP(t) = P(t)Q$.
22
Interpretation of Q
Roughly, $q(i,j)$ is the rate of transitions of $i$ to $j$, while $q(i,i) = -\sum_{j \neq i} q(i,j)$, so each row sum is 0.
If under some initial conditions, we have a Markov chain evolving in continuous time with infinitesimal matrix $Q$, and $p_j(t)$ = pr($S_j$ at time $t$), then
$p_j(t+h) = \sum_i$ pr($S_i$ at $t$, $S_j$ at $t+h$)
$= \sum_i$ pr($S_i$ at $t$) pr($S_j$ at $t+h$ | $S_i$ at $t$)
$\approx p_j(t)(1 + h\,q(j,j)) + \sum_{i \neq j} p_i(t)\, h\, q(i,j)$
23
i.e., $h^{-1}[p_j(t+h) - p_j(t)] \approx p_j(t)\,q(j,j) + \sum_{i \neq j} p_i(t)\,q(i,j)$,
which becomes $P' = PQ = QP$ as $h \to 0$.
Important approximation:
When $t$ is small, $P(t) \approx I + Qt$.
24
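Jukes-Cantor model (1969)
$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}, \qquad P(t) = \begin{pmatrix} r & s & s & s \\ s & r & s & s \\ s & s & r & s \\ s & s & s & r \end{pmatrix}$$
$r = (1 + 3e^{-4\alpha t})/4$, $s = (1 - e^{-4\alpha t})/4$.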
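The closed form for the Jukes-Cantor $P(t)$ can be compared with a direct evaluation of $\exp(Qt)$. The sketch below is only illustrative: it assumes $\alpha = 1$ and $t = 0.1$, and approximates the matrix exponential by a truncated power series (the helper `mat_exp` is not from the lecture). It also compares the small-$t$ approximation $I + Qt$ with the exact off-diagonal entry.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=40):
    """Approximate exp(q * t) by the truncated series sum_k (q t)^k / k!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        total = [[total[i][j] + term[i][j] for j in range(n)]
                 for i in range(n)]
    return total


ALPHA = 1.0  # assumed substitution rate
T = 0.1      # assumed elapsed time
Q = [[-3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]

P = mat_exp(Q, T)
r = (1 + 3 * math.exp(-4 * ALPHA * T)) / 4  # closed-form diagonal entry
s = (1 - math.exp(-4 * ALPHA * T)) / 4      # closed-form off-diagonal entry
print(P[0][0], r)
print(P[0][1], s)

# Small-t approximation P(t) ~ I + Qt: the off-diagonal entry is about alpha*t.
small_t = 0.001
approx_off_diag = ALPHA * small_t
exact_off_diag = (1 - math.exp(-4 * ALPHA * small_t)) / 4
print(approx_off_diag, exact_off_diag)
```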
25
Let $P(t) = \exp(Qt)$. Then the A,G element of $P(t)$ is
pr(G now | A then) $= (1 - e^{-4\alpha t})/4$.
Same for all pairs of different nucleotides.
Overall rate of change $k = 3\alpha t$.
When $k = 0.01$, this is described as 1 PAM.
PAM = accepted point mutation
Put $\alpha t = 0.01/3 = 1/300$. Then the resulting $P = P(1/300)$, with time measured in units of $1/\alpha$, is called the PAM(1) matrix.
Why use PAMs?
26
Evolutionary time, PAM
Since sequences evolve at different rates, it is convenient to rescale time so that 1 PAM of evolutionary time corresponds to 1% expected substitutions.
For Jukes-Cantor, $k = 3\alpha t$ is the expected number of substitutions in $[0,t]$, so is a distance. (Show this.)
Set $3\alpha t = 1/100$, or $t = 1/(300\alpha)$, so 1 PAM = $1/(300\alpha)$ years.
27
Distance adjustment
For a pair of sequences, $k = 3\alpha t$ is desired, but not observable. Instead, pr(different) is observed. Convert pr(different) to $k$.
This is very similar to the conversion of $\theta$ = pr(recombination) to genetic distance $d$ (expected number of crossovers) using the Haldane function
$\theta = \tfrac{1}{2}(1 - e^{-2d})$,
assuming the Poisson model.
28
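[Diagram: a common ancestor with two branches, each of length $t$, leading to G in orang and C in human.]
Still 2nd position in α-globin Alu 1.
Assume that the common ancestor has A, G, C or T with probability 1/4.
Then the chance of the nt differing is
$p_{\neq} = \tfrac{3}{4}(1 - e^{-8\alpha t})$
$= \tfrac{3}{4}(1 - e^{-4k/3})$, since $k = 2 \times 3\alpha t$.
[Figure: $p_{\neq}$ as a function of $t$, with the level 3/4 marked.]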
29
If we suppose all positions behave identically and independently, and $n_{\neq}$ differ out of $n$, we can invert this, obtaining
$\hat{k} = -\tfrac{3}{4} \log\left(1 - \tfrac{4}{3}\, n_{\neq}/n\right)$.
This is the corrected or adjusted fraction of differences (under this simple model). Multiply by 100 to get PAMs.
The analogous simple model for amino acid sequences has
$\hat{k} = -\tfrac{19}{20} \log\left(1 - \tfrac{20}{19}\, n_{\neq}/n\right)$.
Multiply by 100 for PAMs.
30
Illustration
1. Human and bovine beta-globins are aligned with no deletions at 145 out of 147 sites. They differ at 23 of these sites. Thus $n_{\neq}/n = 23/145$, and the corrected distance using the Jukes-Cantor formula is (natural logs)
$-\tfrac{19}{20} \log\left(1 - \tfrac{20}{19} \times \tfrac{23}{145}\right) \approx 17.4 \times 10^{-2}$.
2. The hum-1 and gorilla sequences are aligned without gaps across all 300 bp, and differ at 14 sites. Thus $n_{\neq}/n = 14/300$, and the corrected distance using the Jukes-Cantor formula is
$-\tfrac{3}{4} \log\left(1 - \tfrac{4}{3} \times \tfrac{14}{300}\right) \approx 4.8 \times 10^{-2}$.
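Both worked examples use the same formula with a different alphabet size, which a few lines of Python make explicit. This is a sketch only: the function name `corrected_distance` and its `n_states` parameter are illustrative, and natural logs are used as in the illustration.

```python
import math


def corrected_distance(n_diff, n_sites, n_states):
    """Jukes-Cantor-type correction for an alphabet of n_states letters."""
    b = (n_states - 1) / n_states  # 3/4 for DNA, 19/20 for proteins
    p = n_diff / n_sites           # observed fraction of differences
    if p >= b:
        raise ValueError("observed difference too large to correct")
    return -b * math.log(1 - p / b)


beta_globin = corrected_distance(23, 145, n_states=20)  # human vs bovine
alu = corrected_distance(14, 300, n_states=4)           # hum-1 vs gorilla
# Multiply by 100 to express the distances in PAMs.
print(100 * beta_globin, 100 * alu)
```

The guard on `p >= b` reflects that the logarithm is undefined once the observed difference reaches the saturation level of the model.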
31
Correspondence between observed a.a. differences and the evolutionary distance (Dayhoff et al., 1978)

Observed percent difference   Evolutionary distance in PAMs
 1                              1
 5                              5
10                             11
15                             17
20                             23
25                             30
30                             38
35                             47
40                             56
45                             67
50                             80
55                             94
60                            112
65                            133
70                            159
75                            195
80                            246
85                            328
32
The stationary distribution
A probability distribution $\pi$ on {A,C,G,T} is a stationary distribution if
$\sum_i \pi(i) P(i,j) = \pi(j)$, for all $j$.
Fact: Given any initial distribution, the distribution at time $t$ converges to $\pi$ as $t \to \infty$.
For the Jukes-Cantor and Kimura models, $\pi$ is the uniform distribution. Assume the ancestor sequence is IID $\pi$.
33
Reversibility
A Markov chain is reversible if
$\pi(i) P(i,j) = P(j,i)\,\pi(j)$, for all $i,j$ (detailed balance).
Under reversibility, the human sequence can be considered the ancestor of the orang sequence and vice versa. (Proof next slide.)
Both the Jukes-Cantor and Kimura models are
reversible.
34
Proof
pr(G in orang, C in human)
$= \sum_i$ pr(ancestor base is $i$) pr(G in orang | $i$ in ancestor) pr(C in human | $i$ in ancestor)
$= \sum_i \pi(i) P(t,i,G) P(t,i,C)$
$= \sum_i \pi(G) P(t,G,i) P(t,i,C)$ (reversibility)
$= \pi(G) P(2t,G,C)$ (addition rule)
$= F(2t,G,C)$.
The matrix $F(2t)$ is the joint distribution at times 0 and $2t$. It is symmetric, and both its row sums and its column sums give the stationary distribution $\pi$.
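The identity in this proof can be checked numerically for the Kimura 2-parameter model. The sketch below assumes hypothetical rates $a = 0.4$ and $b = 0.1$, the state order A, G, C, T, the uniform stationary distribution, and a series approximation to $\exp(Qt)$. None of these numerical choices come from the lecture.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=40):
    """Approximate exp(q * t) by a truncated power series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        total = [[total[i][j] + term[i][j] for j in range(n)]
                 for i in range(n)]
    return total


ORDER = "AGCT"
a, b = 0.4, 0.1   # assumed Kimura rates
c = -a - 2 * b    # diagonal entry, so that each row sums to 0
Q = [[c, a, b, b],
     [a, c, b, b],
     [b, b, c, a],
     [b, b, a, c]]
PI = [0.25] * 4   # uniform stationary distribution
T = 0.3           # assumed time from the ancestor to each descendant

P_t = mat_exp(Q, T)
P_2t = mat_exp(Q, 2 * T)
g, c_idx = ORDER.index("G"), ORDER.index("C")

# Left-hand side: sum over the unknown ancestral base.
via_ancestor = sum(PI[i] * P_t[i][g] * P_t[i][c_idx] for i in range(4))
# Right-hand side: joint distribution F(2t) = pi(i) P(2t, i, j).
F = [[PI[i] * P_2t[i][j] for j in range(4)] for i in range(4)]
print(via_ancestor, F[g][c_idx])
```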
35
Below diagonal: BLOSUM62 substitution matrix.
Above diagonal: difference matrix obtained by subtracting the PAM 160 matrix entrywise.
From Henikoff & Henikoff 1992.
C S T P A G N D E Q H R K M I L V F Y W
0 -1 1 0 2 1 1 2 1 2 0 0 2 4 1 5 1 2 -2 5 C
2 0 -2 0 -1 0 0 0 1 0 0 0 1 0 1 -1 1 1 -1 S
C 9 2 -1 -1 -1 0 0 0 0 0 0 -1 0 -1 1 0 1 1 3 T
S -1 4 2 -2 -1 -1 0 0 -1 -1 -1 1 1 0 -1 0 0 2 1 P
T -1 1 5 2 -1 -2 -2 -1 0 0 1 1 0 0 1 0 1 1 2 A
P -3 -1 -1 7 2 0 -1 -2 0 1 1 0 0 -1 0 -1 1 2 4 G
A 0 1 0 -1 4 3 -1 -1 0 0 1 -1 0 -1 0 -1 0 0 0 N
G -3 0 -2 -2 0 6 2 -1 -1 -1 0 -1 0 0 0 0 2 1 3 D
N -3 1 0 -2 -2 0 6 1 0 0 2 2 1 -1 0 0 2 2 4 E
D -3 0 -1 -1 -2 -1 1 6 0 -2 0 1 1 -1 0 0 1 3 3 Q
E -4 0 -1 -1 -1 -2 0 2 5 2 -1 0 1 0 -1 0 1 2 2 H
Q -3 0 -1 -1 -1 -2 0 0 2 5 -1 -1 0 -1 1 0 1 3 -4 R
H -3 -1 -2 -2 -2 -2 1 -1 0 0 8 1 -2 -1 1 1 2 3 1 K
R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 -2 -1 -1 0 1 2 4 M
K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5 -1 1 0 0 1 3 I
M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 -1 0 -1 1 2 L
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 0 1 2 4 V
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 -1 -2 1 F
V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 2 Y
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 -1 W
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11
C S T P A G N D E Q H R K M I L V F Y W
36
PAM matrices
Trees are constructed for closely related (> 85% identity) protein sequences from 71 families.
Internal nodes (intermediate and youngest common ancestors) are imputed by parsimony.
The numbers of replacements on each branch are totalled across trees; this gives a 20 by 20 matrix of counts. The count matrix is symmetrized; call it $A$.
37
Transition probability from $i$ to $j$ is estimated by
$P(i,j) = A(i,j) / \sum_k A(i,k)$.
Now $P$ might or might not be PAM1.
To get the PAM1 matrix, $P$ has to be calibrated.
Let $Q = P - I$. $Q$ is very close to the underlying rate matrix, up to a scale factor. Set $P(\lambda) = I + \lambda Q$. Need to find $\lambda$ so that $P(\lambda)$ is PAM1.
Recall: 1 PAM = 1% expected number of substitutions, or approximately 1% = pr(ancestor is different from descendant).
38
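0.01 = pr(ancestor is different from descendant)
$= \sum_i \pi(i) \sum_{j \neq i} P(\lambda,i,j)$
$= \sum_i \pi(i)\,(1 - P(\lambda,i,i))$
$= -\lambda \sum_i \pi(i)\, Q(i,i)$.
So let $\lambda = -0.01 / \{\sum_i \pi(i)\, Q(i,i)\}$.
Then $P(\lambda)$ is PAM1.
PAM250 = PAM1$^{250}$, etc.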
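The calibration step can be sketched on a toy example. The code below assumes a hypothetical symmetric count matrix on a 3-letter alphabet whose diagonal holds counts of unchanged positions, and takes $\pi$ proportional to the row sums of $A$ (which is stationary for the row-normalised $P$ when $A$ is symmetric). The real PAM construction uses 20 amino acids and its own frequency estimates, so this is only an illustration of the algebra.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, t):
    """Return p multiplied by itself t times."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, p)
    return result


# Hypothetical symmetric count matrix A (not real data).
A = [[900, 30, 20],
     [30, 800, 10],
     [20, 10, 600]]
n = len(A)
row_sums = [sum(row) for row in A]
total = sum(row_sums)

P = [[A[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
pi = [rs / total for rs in row_sums]  # assumed stationary distribution

# Q = P - I, then scale so that the expected change is exactly 1%.
Q = [[P[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
lam = -0.01 / sum(pi[i] * Q[i][i] for i in range(n))
PAM1 = [[(1.0 if i == j else 0.0) + lam * Q[i][j] for j in range(n)]
        for i in range(n)]
PAM250 = mat_pow(PAM1, 250)

expected_change = sum(pi[i] * (1 - PAM1[i][i]) for i in range(n))
print(lam, expected_change)  # expected_change is 0.01 by construction
```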
39
A R N D C Q E G H I L K M F P S T W Y V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
A Ala 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18
R Arg 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1
N Asn 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1
D Asp 6 0 42 9859 0 6 53 6 4 1 0 6 0 0 1 5 3 0 0 1
C Cys 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2
Q Gln 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1
E Glu 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 1 1 2
G Gly 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5
H His 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1
I Ile 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33
L Leu 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15
K Lys 2 37 2 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1
M Met 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4
F Phe 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0
P Pro 13 5 2 1 1 8 3 2 4 1 2 2 1 1 9926 12 4 0 0 2
S Ser 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 68 5 2 2
T Thr 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9
W Trp 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0
Y Tyr 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1
V Val 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
Transition probability matrix: PAM(1) × 10^4 (after M. Dayhoff, 1978)
40
A R N D C Q E G H I L K M F P S T W Y V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
A Ala 13 6 9 9 4 8 9 12 6 8 6 7 7 4 11 11 11 2 4 9
R Arg 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4 4 3 7 2 2
N Asn 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4 5 4 2 3 3
D Asp 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4 5 5 1 2 3
C Cys 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2 3 2 1 4 2
Q Gln 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4 3 3 1 2 3
E Glu 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4 5 5 1 2 3
G Gly 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8 11 9 2 3 7
H His 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3 3 2 2 3 2
I Ile 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2 3 4 1 3 9
L Leu 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5 4 6 6 7 13
K Lys 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6 8 8 4 3 5
M Met 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1 1 1 1 1 2
F Phe 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1 2 2 4 20 3
P Pro 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20 6 5 1 2 4
S Ser 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9 10 9 4 4 6
T Thr 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6 8 11 2 3 6
W Trp 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 55 1 0
Y Tyr 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1 2 2 3 31 2
V Val 7 4 4 4 4 4 4 5 4 15 10 4 10 5 5 5 7 2 4 17
Transition probability matrix: PAM(250) × 10^2 (after M. Dayhoff, 1978)
41
BLOck SUbstitution Matrices (BLOSUM)
Built from gapless multiple alignments (blocks) of homologous protein sequences at various distances.
Basic idea: for a block, sum up matrices of counts for all pairwise comparisons, then sum across blocks and symmetrize. This symmetric count matrix is normalized so that all entries sum to 1. Call this normalized matrix F.
The row (or column) sums of $F$ give $\pi$. The score for $(i,j)$ is $\log\{F(i,j) / (\pi(i)\,\pi(j))\}$.
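The log-odds score can be illustrated on a toy symmetric count matrix. The sketch below assumes a hypothetical 3-letter alphabet and made-up counts, and uses natural logs. Published BLOSUM matrices additionally rescale and round the scores, which is not shown here.

```python
import math

ALPHABET = "AST"  # hypothetical 3-letter alphabet
# Hypothetical symmetric pair counts, already summed across blocks.
counts = [[40, 6, 4],
          [6, 30, 4],
          [4, 4, 20]]
n = len(counts)
total = sum(sum(row) for row in counts)

# Normalise so that all entries sum to 1.
F = [[counts[i][j] / total for j in range(n)] for i in range(n)]
# Background frequencies: row sums of F.
pi = [sum(row) for row in F]

# Log-odds score: observed pair frequency against independence.
score = [[math.log(F[i][j] / (pi[i] * pi[j])) for j in range(n)]
         for i in range(n)]

for i, letter in enumerate(ALPHABET):
    print(letter, [round(x, 2) for x in score[i]])
```

In this toy example, identical pairs score above zero and mismatched pairs below zero, because the diagonal counts exceed what independence would predict.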
42
Matrices at different distances
To produce matrices at different distances, thresholds, clustering and weights are used.
For example, BLOSUM80 uses the threshold 80%. In each block, any group of sequences that are more than 80% identical is clustered. The clusters are down-weighted by their size in summing the counts.
43
For example, suppose there are 3 clusters of size 1, 5 and 10 in a block of 16 sequences.
Then each count matrix from comparing a sequence in cluster 1 with a sequence in cluster 2 is divided by 1 × 5.
Similarly, for comparing cluster 2 and cluster 3, the factor is 5 × 10.
At a lower threshold, say 45%, the clusters are bigger and the factors are also larger, so similar sequences are downweighted more. This is why BLOSUM45 is better for divergent sequences than BLOSUM80.
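The down-weighting can be made concrete for one alignment column. The sketch below uses clusters of size 1, 5 and 10 as in the example, with made-up residues, and assumes that only pairs drawn from different clusters are counted. Each between-cluster pair gets weight 1/(size × size), so every pair of clusters contributes a total weight of 1.

```python
from collections import Counter
from itertools import combinations

# Hypothetical residues in one alignment column, grouped by cluster.
clusters = [
    ["A"],                      # cluster 1, size 1
    ["A", "A", "S", "A", "A"],  # cluster 2, size 5
    ["S"] * 6 + ["A"] * 4,      # cluster 3, size 10
]

weighted = Counter()
for c1, c2 in combinations(clusters, 2):
    weight = 1.0 / (len(c1) * len(c2))  # divide by the product of sizes
    for x in c1:
        for y in c2:
            pair = tuple(sorted((x, y)))  # unordered residue pair
            weighted[pair] += weight

print(dict(weighted))
print(sum(weighted.values()))  # one unit of weight per pair of clusters
```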
44
How did variations arise?
Mutation: (a) Inherent: DNA replication errors are not always corrected. (b) External: exposure to chemicals and radiation.
Selection: Deleterious mutations are removed quickly. Neutral and, rarely, advantageous mutations are tolerated and stick around.
Fixation: It takes time for a new variant to become established (reach a stable frequency) in a population.
45
Modeling DNA base substitution
Strictly speaking, only applicable to regions undergoing little selection.
Assumptions
1. Site independence.
2. Site homogeneity.
3. Markovian: given current base, future substitutions independent of past.
4. Temporal homogeneity: stationary Markov chain.
46
A pair of homologous bases
Typically, ancestor is unknown.
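[Diagram: an ancestor $T$ years back with two descendant lineages, one evolving under rate matrix $Q_h$ to the base A, the other under $Q_m$ to the base C.]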
47
More assumptions
5. $Q_h = s_h Q$ and $Q_m = s_m Q$, for some positive $s_h$, $s_m$, and a rate matrix $Q$.
6. The ancestor is sampled from the stationary distribution $\pi$ of $Q$.
7. $Q$ is reversible: $\pi(a) P(t,a,b) = P(t,b,a)\,\pi(b)$, for all $a$, $b$, and $t \geq 0$ (detailed balance).
48
New picture
[Diagram: ancestor ~ $\pi$; one branch of length $s_h T$ PAMs leads to A, the other of length $s_m T$ PAMs leads to C.]
49
Joint probability of A and C
Under the model, the joint probability is
$\sum_a \pi(a) P(s_h T,a,A) P(s_m T,a,C)$
$= \sum_a \pi(A) P(s_h T,A,a) P(s_m T,a,C)$
$= \pi(A) P(s_h T + s_m T, A, C) = F(t,A,C)$.
The matrix $F(t)$ is symmetric. It is equally valid to view A as the ancestor of C or vice versa.
$t = s_h T + s_m T$ is the "distance" between A and C.
Note: $s_h$, $s_m$ and $T$ are not identifiable.
50
Estimating distance of two sequences
Suppose two protein sequences $a_1 \ldots a_n$ and $b_1 \ldots b_n$ are separated by $t$ PAMs.
Under a reversible substitution model that is IID across sites, the likelihood of $t$ is
$L(t) = \Pr(a_1 \ldots a_n, b_1 \ldots b_n \mid \text{model})$
$= \prod_k F(t,a_k,b_k)$
$= \prod_{a,b} F(t,a,b)^{c(a,b)}$,
where $c(a,b) = \#\{k : a_k = a,\ b_k = b\}$.
Maximizing this quantity gives the maximum likelihood estimate of $t$. This generalizes the distance correction with Jukes-Cantor.
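To see in what sense maximum likelihood generalizes the Jukes-Cantor correction, the sketch below maximizes the likelihood numerically for the hum-1 versus gorilla Alu comparison (14 differences in 300 sites) and compares the result with the closed-form correction. It assumes the Jukes-Cantor model with uniform $\pi$ and distance $k$ measured in expected substitutions per site, and uses a simple golden-section search. The function names are illustrative.

```python
import math


def jc_log_likelihood(k, n_same, n_diff):
    """Log-likelihood of distance k under Jukes-Cantor with uniform pi."""
    decay = math.exp(-4.0 * k / 3.0)
    f_same = 0.25 * (0.25 + 0.75 * decay)  # F(k, a, a) = pi(a) P(k, a, a)
    f_diff = 0.25 * (0.25 - 0.25 * decay)  # F(k, a, b) for a != b
    return n_same * math.log(f_same) + n_diff * math.log(f_diff)


def golden_section_max(f, lo, hi, tol=1e-9):
    """Locate the maximum of a unimodal function f on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if f(left) > f(right):
            hi = right  # the maximum lies in [lo, right]
        else:
            lo = left   # the maximum lies in [left, hi]
    return (lo + hi) / 2.0


n_sites, n_diff = 300, 14
k_ml = golden_section_max(
    lambda k: jc_log_likelihood(k, n_sites - n_diff, n_diff), 1e-9, 5.0)
k_closed = -0.75 * math.log(1 - (4.0 / 3.0) * n_diff / n_sites)
print(k_ml, k_closed)  # the two estimates agree to numerical precision
```

For richer models such as PAM there is no closed form, but the same one-dimensional search applies with $F(t,a,b)$ taken from the model.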
51
M pairs
The $k$th pair of sequences, separated by $t_k$ PAMs, gives the count matrix $c(k)$, with entries $c(k,a,b)$.
Pr(all pairs) $= \prod_k \prod_{a,b} F(t_k,a,b)^{c(k,a,b)}$.
If the distances are about constant, then the percent identity should be about constant across pairs. Often, this is not so.
Parameters: $Q, t_1, t_2, \ldots, t_M$.
Maximum likelihood estimation by numerical methods.
52
One simple method
Start with an estimate of Q, Q0. Fixing Q0, maximize the pair-specific likelihoods as functions of the distances to get first estimates of the distances.
Keeping the distances fixed, maximize the overall likelihood as a function of Q.
Go back and forth until convergence.
53
The likelihood has exactly one maximum in t, so estimating distance is easy.
To find the MLE of Q, randomly perturb current estimate and update if the new candidate has a higher likelihood.