Current challenges for molecular phylogenetics
Barbara Holland
School of Mathematics & Physics
University of Tasmania
Mostly statistical
Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines
Phylogenetics seeks to determine these genetic relationships
Darwin’s sketch: the first phylogenetic tree?
[Portraits: Charles Darwin; Alfred Russel Wallace]
Since the publication of On the Origin of Species in 1859, people have been trying to infer the evolutionary “tree of life”.
Ernst Haeckel’s Tree of Life (1866)
Why molecular phylogeny
• Most molecules evolve independently of adaptations affecting morphology.
• It is fairly easy to find genes that are present in all species of interest, e.g., a 12S RNA molecule in mitochondria is functional over all mammals.
• Useful mathematical models of sequence evolution have been developed that underpin attempts to infer evolutionary trees
A brief and incomplete history of molecular phylogenetics
[Timeline figure spanning the 60s, 70s, 80s, 90s and 00s; labels listed in the order they were extracted from the slide]
• Antibodies
• DNA-DNA hybridisation
• Sarich, Wilson
• Assessing support: bootstrap
• Explicit models: maximum likelihood
• Systematic bias: the Felsenstein Zone
• Felsenstein
• Sequence data (amino acid, then DNA)
• Parsimony
• Distance based
• More complex models: Bayesian methods
• Various perils... anomalous gene trees, non-identifiable models
• Population processes, gene trees in species trees
• MORE sequence data
• PCR
The molecular phylogeny problem
[Figure: a tree drawn along a time axis with a sequence at each node: ACCGCTTA, ACTGCTTA, ACTGCTAA, ACCCCTTA, ACCCCATA.]
…ACCCCTTA…
…ACCCCATA…
…ACTGCTTA…
…ACTGCTAA…
We see the aligned modern-day sequences and want to recover the underlying evolutionary tree.
Sequence evolution is modelled as a Markov process
Consider a single edge in a phylogeny, i.e. evolution of a single species, and the evolution of a single DNA base amongst the possible states {A, C, G, T}.
The probability of mutating from state i to state j over a length of time t depends only on the current state i and the potential future state j, not on any of the previous history of the sequence, and can be written p_ij(t).
[Figure: one site changing among the states A, C, G, T along a time axis of length t.]
Continuous time Markov chains
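Transition matrix (rows and columns ordered A, C, G, T):
\[
M = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}
\]
Instantaneous rate matrix:
\[
Q = \begin{pmatrix}
-q_{A*} & q_{AC} & q_{AG} & q_{AT} \\
q_{CA} & -q_{C*} & q_{CG} & q_{CT} \\
q_{GA} & q_{GC} & -q_{G*} & q_{GT} \\
q_{TA} & q_{TC} & q_{TG} & -q_{T*}
\end{pmatrix},
\qquad q_{i*} = \sum_{j \neq i} q_{ij},
\]
i.e. rows sum to zero.
\[
M = \exp(Qt)
\]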
Typically we restrict to stationary, reversible models, with the stationary distribution denoted by π. So, π Q = 0, and D(π)Q is symmetric.
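A minimal sketch of M = exp(Qt), assuming the Jukes-Cantor rate matrix and a hand-rolled Taylor series with scaling and squaring rather than a library routine. The function names and the values of `alpha` and `t` are illustrative assumptions, not from the slides. The result is compared with the known Jukes-Cantor closed form.

```python
import math

def jc_rate_matrix(alpha):
    """Jukes-Cantor Q: every off-diagonal rate is alpha, so each row sums to zero."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(Q t): Taylor series on Q t / 2**squarings, then square repeatedly."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

alpha, t = 0.1, 2.0  # illustrative values
M = mat_exp(jc_rate_matrix(alpha), t)

# Closed form for Jukes-Cantor, used here only as a cross-check.
decay = math.exp(-4 * alpha * t)
p_same = 0.25 + 0.75 * decay
p_diff = 0.25 - 0.25 * decay
print(M[0][0], p_same)
print(M[0][1], p_diff)
```

Each row of M is a probability distribution over the state at time t, so the rows sum to one even though the rows of Q sum to zero.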
Models of nucleotide substitution
• Jukes Cantor (JC)
– All substitutions equally likely
– Base frequencies equal
• Kimura 2 Parameter (K2P)
– Transitions and transversions at different rates
– Base frequencies equal
• HKY model
– Transitions and transversions at different rates
– Base frequencies different
• General Time Reversible (GTR)
[Diagram, repeated on four slides with a different model highlighted each time: the bases A, G, C, T joined by six substitution rates. For JC all six rates are α; for K2P and HKY two rates are β and four are α; for GTR the six rates are α, β, γ, δ, ε, ζ.]
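A sketch of how these models can be written as one family of rate matrices, assuming the usual reversible parameterisation q_ij = s_ij π_j with symmetric exchangeabilities s_ij. The helper names and the transition/transversion ratio `kappa` are assumptions for illustration; the slides label the rates α, β, and so on.

```python
BASES = "ACGT"
PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")

def gtr_rate_matrix(exchange, pi):
    """Build Q with q_ij = s_ij * pi_j for i != j; the diagonal makes rows sum to zero."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = exchange[frozenset((a, b))] * pi[j]
        q[i][i] = -sum(q[i])
    return q

def hky_exchange(kappa):
    """Transitions (A<->G, C<->T) get rate kappa, transversions get rate 1."""
    return {frozenset(p): (kappa if p in ("AG", "CT") else 1.0) for p in PAIRS}

# HKY: unequal base frequencies (illustrative values, ordered A, C, G, T).
pi = [0.1, 0.2, 0.3, 0.4]
Q = gtr_rate_matrix(hky_exchange(2.0), pi)

# K2P is the same construction with equal base frequencies; JC also sets kappa = 1.
Q_k2p = gtr_rate_matrix(hky_exchange(2.0), [0.25] * 4)
Q_jc = gtr_rate_matrix(hky_exchange(1.0), [0.25] * 4)

# Reversibility checks: pi Q = 0 and D(pi) Q symmetric.
pi_Q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]
flux = [[pi[i] * Q[i][j] for j in range(4)] for i in range(4)]
```

Passing six distinct exchangeabilities instead of `hky_exchange(...)` gives the full GTR model.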
Models define probability distributions on site patterns
The model θ consists of: the tree topology, edge weights, Q matrix*, and root distribution π.
*More generally, this could be a set of Q matrices
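[Figure: rooted tree on taxa 1, 2, 3. The root y is joined to taxon 3 by an edge with matrix M3 and to an internal node x by an edge with matrix M12; x is joined to taxa 1 and 2 by edges with matrices M1 and M2.]
Edge weights t_1, t_2, t_3, t_{12}
\[
M_e = \exp(Q t_e)
\]
\[
p_{ijk} = \sum_{x,y} \pi(y)\, M_{12}(y,x)\, M_1(x,i)\, M_2(x,j)\, M_3(y,k)
\]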
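The site-pattern formula for the three-taxon tree can be evaluated directly. This is a sketch assuming the Jukes-Cantor model (so each edge matrix has a closed form) and made-up edge lengths; `jc_transition` and `site_pattern_prob` are hypothetical names.

```python
import math
from itertools import product

def jc_transition(t):
    """Jukes-Cantor transition matrix for an edge of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_pattern_prob(i, j, k, M1, M2, M12, M3, pi):
    """p_ijk: sum over the root state y and the internal state x."""
    return sum(pi[y] * M12[y][x] * M1[x][i] * M2[x][j] * M3[y][k]
               for x in range(4) for y in range(4))

# Illustrative edge lengths t1, t2, t12, t3 (assumptions, not from the slides).
M1, M2, M12, M3 = (jc_transition(t) for t in (0.1, 0.2, 0.05, 0.3))
pi = [0.25] * 4  # root distribution

A, C, G, T = range(4)
p_AAA = site_pattern_prob(A, A, A, M1, M2, M12, M3, pi)
p_ACG = site_pattern_prob(A, C, G, M1, M2, M12, M3, pi)

# The 64 site patterns form a probability distribution.
total = sum(site_pattern_prob(i, j, k, M1, M2, M12, M3, pi)
            for i, j, k in product(range(4), repeat=3))
print(p_AAA, p_ACG, total)
```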
Tree estimation using maximum likelihood
• For a given set of parameters θ we can calculate the probability of any particular site pattern.
• The overall probability of an alignment is then taken to be the product of the probabilities for each site (i.i.d. assumption).
• This is the likelihood function, i.e. the probability of the data given the model.
• We can then use optimisation techniques to find the model parameters (tree topology, edge lengths, parameters of the substitution model) that maximise the likelihood.
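The smallest instance of this recipe is a "tree" with two taxa and one parameter, the distance d between them. The sketch below assumes the Jukes-Cantor model, made-up counts (20 differing sites out of 100), and a golden-section search as the optimiser; all names are illustrative. The known closed-form estimate is used as a cross-check.

```python
import math

def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences at JC distance d; sites are i.i.d."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)  # pi(x) * P(x -> x)
    p_diff = 0.25 * (0.25 - 0.25 * decay)  # pi(x) * P(x -> y), y != x
    # Product over sites becomes a sum of logs.
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximise a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2

n_sites, n_diff = 100, 20  # illustrative data
d_hat = golden_section_max(lambda d: jc_log_likelihood(d, n_sites, n_diff), 1e-6, 5.0)

# Closed-form JC estimate for comparison.
d_closed = -0.75 * math.log(1.0 - 4.0 * (n_diff / n_sites) / 3.0)
print(d_hat, d_closed)
```

For real trees the same likelihood is maximised over many edge lengths and model parameters at once, and over tree topologies, which is where the computational difficulty lies.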
Extra features of sequence evolution that can be modelled
• Site to site rate variation (usually modelled by a gamma distribution)
• Invariant sites
BUT
• Some parts of reality are problematic…
– Base composition bias
– Sites that are free to vary change across the tree
– Non-independence of sites
Likelihood versus parsimony (the Felsenstein Zone)
Prior to the introduction of ML to the phylogenetics community by Joe Felsenstein, Maximum Parsimony (MP) was the most widely used method for estimating phylogenetic trees.
MP chooses the tree that requires the fewest mutations to explain the data
[Figure: alternative trees on taxa A, B, C, D for a site with bases A and G at the tips, showing the mutations each tree requires.]
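Counting the mutations a tree requires for one site can be sketched with Fitch's algorithm. The site pattern below (taxa A and B carry base A, taxa C and D carry base G) and the nested-tuple tree encoding are assumptions for illustration.

```python
def fitch(tree, states):
    """Return (possible_states, mutations) for a rooted binary tree.

    tree is a leaf name (str) or a pair (left_subtree, right_subtree);
    states maps each leaf name to its observed base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree, so no extra mutation is needed here.
        return common, left_cost + right_cost
    # The children disagree: one mutation is needed on an edge below this node.
    return left_set | right_set, left_cost + right_cost + 1

# Taxon name -> base at this site (illustrative).
site = {"A": "A", "B": "A", "C": "G", "D": "G"}

trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {name: fitch(tree, site)[1] for name, tree in trees.items()}
print(scores)
```

The parsimony score of an alignment is the sum of these per-site counts, and MP picks the tree with the smallest total.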
Likelihood versus parsimony (the Felsenstein Zone)
The MP criterion has been shown to be statistically inconsistent on some trees under the models of nucleotide substitution discussed previously. Likelihood is statistically consistent (given the correct model).
Felsenstein (1978) Hendy & Penny (1989)
Assessing confidence
• It is not just of interest to get a point estimate of the phylogenetic tree.
• We would also like some measure of confidence in our point estimate.
– Is our tree likely to change if we get more data?
– How robust is our result to sampling error?
• The bootstrap is a useful tool for answering these sorts of questions.
The bootstrap (Felsenstein 1985)
• For each bootstrap sample:
– Create a new alignment (of the same length as the original) by resampling the columns of the observed alignment with replacement
– Construct a tree for the ‘bootstrap’ alignment
• The bootstrap support for each edge is the proportion of bootstrap trees that edge appears in.
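Alignments (the header row gives the columns drawn; 1234567 is the observed alignment):
   1234567
a  ATATAAA
b  ATTATAA
c  TAAAATA
d  TATAAAT

   1224567
a  ATTTAAA
b  ATTATAA
c  TAAAATA
d  TAAAAAT

   1334567
a  AAATAAA
b  ATTATAA
c  TAAAATA
d  TTTAAAT

   1234567
a  ATATAAA
b  ATTATAA
c  TAAAATA
d  TATAAAT

   1244567
a  ATTTAAA
b  ATAATAA
c  TAAAATA
d  TAAAAAT

[Figure: four trees on taxa a, b, c, d, one per bootstrap alignment, and a summary tree whose internal edge has bootstrap support 0.75.]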
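The resampling step can be sketched as follows, using the four-taxon alignment from the slide. Tree construction is left out; the splits passed to `edge_support` are hypothetical placeholders standing in for trees inferred from four bootstrap alignments, and both function names are assumptions.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement; return (column_indices, new_alignment)."""
    n_cols = len(next(iter(alignment.values())))
    # Sorted only for readability; column order does not matter under i.i.d. sites.
    cols = sorted(rng.randrange(n_cols) for _ in range(n_cols))
    resampled = {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}
    return cols, resampled

def edge_support(bootstrap_splits, split):
    """Proportion of bootstrap trees whose set of splits contains the given split."""
    return sum(split in splits for splits in bootstrap_splits) / len(bootstrap_splits)

alignment = {"a": "ATATAAA", "b": "ATTATAA", "c": "TAAAATA", "d": "TATAAAT"}
rng = random.Random(1)
cols, boot = bootstrap_alignment(alignment, rng)
print([c + 1 for c in cols])  # 1-based column labels, as on the slide
print(boot)

# Hypothetical outcome: three bootstrap trees contain the edge ab|cd, one contains ac|bd.
ab, ac = frozenset("ab"), frozenset("ac")
support = edge_support([{ab}, {ab}, {ac}, {ab}], ab)
```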
Example where the bootstrap is useful
• Simulate data on the four taxon tree below (JC model)
• Use sequence lengths of 100, 1000, and 10000
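Tree            Sequence length
                100      1000     10000
((a,b),(c,d))   5.7%     97%      100%
((a,c),(b,d))   42.8%    <5%      0
((a,d),(b,c))   49.8%    <5%      0
[Figure: the four-taxon tree on a, b, c, d used for the simulation, with edge lengths 0.2 and 0.01.]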
Example where it is not so useful
• Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences
• Use total sequence lengths of 100, 1000, and 10000
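Tree            Total sequence length
                100      1000     10000
((a,b),(c,d))   64%      80%      98%
((a,c),(b,d))   33%      20%      <5
((a,d),(b,c))   3%       0%       <5
[Figure: the two four-taxon trees used for the simulation, each with edge lengths 0.1 and 0.05: one with leaf order a, b, c, d (55%) and one with leaf order a, c, b, d (45%).]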
Genome-scale phylogeny
• Data sets with many concatenated genes
– Rokas et al, Nature 2003 (106 genes, 8 taxa)
– Goremykin et al, MBE 2004 (61 genes, 14 taxa)
• Estimated trees have very high bootstrap support.
• BUT... trees are sensitive to: model used, method used, data-coding.
[Figure: NJ bootstrap with ML distances using a GTR + gamma model. x-axis: alpha (gamma shape parameter), 0.1 to 0.75, from skewed rates to equal rates. y-axis: bootstrap support, 0 to 100. Series: Amb+Nym, Grasses, Amb, Nym.]
Sensitivity to model choice
• Phylogenomic datasets may involve hundreds of genes for many species.
• These data sets create challenges for current phylogenetic methods, as different genes have different functions and hence evolve under different processes.
• One question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.
Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa:
• C. albicans
• S. kluyveri
• S. kudriavzevii
• S. bayanus
• S. cerevisiae
• S. paradoxus
• S. mikatae
• S. castellii
Two extremes
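• How many parameters do we need to adequately represent the branches of all (unrooted) gene trees?
– Between 13 (consensus tree; an unrooted binary tree on 8 taxa has 2 × 8 − 3 = 13 branches) and 13 × 106 = 1378 (a separate set of branch lengths for every gene)
• Too few parameters introduces bias
• Too many parameters increases the variance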
Stochastic partitioning
• Attempts to cluster genes into classes that have evolved in a similar fashion.
• Each class is allowed its own set of parameters (e.g. branch lengths or model of nucleotide substitution)
Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise parameters for each class.
3. Compute the posterior probability for each gene with the parameters from each class.
4. Move each gene into the class for which it has highest posterior probability.
5. Go to step 2; when no genes change class, STOP.
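The loop structure of these steps can be sketched on a toy problem. This is not the method used on the yeast data: as a stand-in for a tree likelihood, each gene is summarised by a pair (k, n) of differing sites out of n, each class has a single parameter p (per-site difference probability), and the classes are given equal prior weight so that the posterior is proportional to the likelihood. All names and data are assumptions.

```python
import math
import random

def log_lik(k, n, p):
    """Binomial log-likelihood (constant term dropped) of k differences in n sites."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def partition_genes(genes, n_classes, assignment=None, rng=None, max_rounds=100):
    """genes: list of (k, n) pairs. Returns (assignment, class_params)."""
    rng = rng or random.Random(0)
    # Step 1: random initial assignment, unless the caller supplies one.
    if assignment is None:
        assignment = [rng.randrange(n_classes) for _ in genes]
    assignment = list(assignment)
    params = [None] * n_classes
    for _ in range(max_rounds):
        # Step 2: optimise each class parameter (the pooled proportion is the ML estimate).
        for c in range(n_classes):
            members = [g for g, a in zip(genes, assignment) if a == c]
            if members:
                k_tot = sum(k for k, _ in members)
                n_tot = sum(n for _, n in members)
                params[c] = min(max(k_tot / n_tot, 1e-9), 1.0 - 1e-9)
            else:
                params[c] = None  # empty class: no gene can move into it
        # Steps 3 and 4: score every gene under every class, move it to the best class.
        new_assignment = []
        for k, n in genes:
            scores = [log_lik(k, n, p) if p is not None else -math.inf for p in params]
            new_assignment.append(scores.index(max(scores)))
        # Step 5: stop when no gene changes class.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, params

# Three slowly evolving and three fast evolving genes (made-up counts).
genes = [(4, 100), (5, 100), (6, 100), (28, 100), (30, 100), (33, 100)]
classes, params = partition_genes(genes, 2, rng=random.Random(1))
print(classes, params)
```

Because step 1 is random, different seeds can end in different partitions, including degenerate ones where a class empties out.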
Conclusions regarding stochastic partitioning
• Pros
– AIC/BIC gives a quantitative method to choose how many parameters are needed.
– Identifies groups of genes under similar constraints
• Cons
– Slow
– Randomized algorithm, so different starting points lead to different partitions.
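For reference, the two criteria use the standard definitions AIC = 2p − 2 ln L and BIC = p ln(n) − 2 ln L, where p is the number of free parameters and n the number of observations. The sketch below compares one class against two classes of 13 branch lengths each; the log-likelihoods and the number of sites are made-up placeholders, not results from the talk.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_lik

n_sites = 1000  # hypothetical number of alignment columns
candidates = {
    # number of classes: (hypothetical maximised log-likelihood, number of parameters)
    1: (-5200.0, 13),
    2: (-5150.0, 26),
}
for k, (log_lik, n_params) in candidates.items():
    print(k, aic(log_lik, n_params), bic(log_lik, n_params, n_sites))
```

Both criteria penalise extra classes, BIC more heavily for large n, which is how they trade off the bias of too few parameters against the variance of too many.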
Brief Tour…
• Combinatorics of tree space
• Graph Theory
• Stochastic Models, Inference & Probability Theory
• Algebraic Geometry
• Lie groups, representation theory…
Figure 2 Matsen and Steel (2007)
…the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the “correct” method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.
Identifiability
Representation theory, Lie groups, Markov invariants, closure of model classes
Jeremy Sumner, Peter Jarvis