Introduction to Phylogenetics - Barbara Holland
DESCRIPTION
Molecular sequences provide an excellent source of data for attempting to infer the evolutionary relationships between species. This talk will introduce model-based methods of phylogenetic inference (maximum likelihood and Bayesian inference) and discuss both the strengths and limitations of current approaches.
TRANSCRIPT
![Page 1: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/1.jpg)
Introduction to Phylogenetics
Barbara Holland
BioInfoSummer 2013
![Page 2: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/2.jpg)
What is phylogenetics?
Darwin’s sketch: the first phylogenetic tree?
The goal of phylogenetics is to infer evolutionary relationships between species.
This includes both information about the order of branching, e.g., did humans and chimpanzees share a common ancestor more recently than humans, chimps and gorillas did? And information about the timing of events, e.g., how long ago did humans and chimps share an ancestor?
![Page 3: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/3.jpg)
Since the publication of On the Origin of Species in 1859, people have been trying to infer the evolutionary “Tree of Life”.
Ernst Haeckel’s Tree of Life (1866)
![Page 4: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/4.jpg)
Why might we care?
• Understanding human origins
• Understanding biogeography, e.g. what’s the relative importance of dispersal versus vicariance?
• Understanding the origin of particular traits
• Understanding the processes of molecular evolution
• Learning about the tempo of evolution, e.g. was the Cambrian explosion really an explosion? Did mammals and birds wait until dinosaurs went extinct to inherit the earth, or had they already got started before the asteroid hit?
• Origin of disease, e.g. where did humans get AIDS from?
![Page 7: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/7.jpg)
Understanding the emergence of disease
http://evolution.berkeley.edu/evolibrary/news/081101_hivorigins
![Page 8: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/8.jpg)
Special reasons for Australians to care…
http://archive.peabody.yale.edu/exhibits/treeoflife/predictions.html
![Page 9: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/9.jpg)
How to “read” a tree
[Figure: example tree with the leaves Rat, Mouse, Cat and Dog]
A tree is a connected acyclic graph, composed of:
• Nodes / vertices
• Edges / arcs / branches
• Pendent (external) edges versus internal edges
In a phylogenetic tree the external (degree 1) nodes are associated with labels.
![Page 10: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/10.jpg)
Tree basics
• Trees can be rooted or unrooted
• Rooted trees are more interpretable
• But most software returns unrooted trees
[Figure: a rooted tree (root marked) and an unrooted tree on the leaves Cat, Dog, Mouse and Rat]
![Page 11: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/11.jpg)
Tree basics
• Trees can be weighted or unweighted
• Edge weights are used to represent
– the amount of genetic change along the edge
– time
[Figure: two trees on the leaves Cat, Dog, Mouse and Rat]
![Page 12: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/12.jpg)
Tree basics
• Trees can be bifurcating (binary) or multifurcating (non-binary)
• Polytomies usually represent uncertainty (a soft polytomy), but sometimes they are used to mean that everything happened at exactly the same time (a hard polytomy).
[Figure: tree on the leaves Human, Chimp, Gorilla and Monkey with a polytomy (multifurcation) marked]
![Page 13: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/13.jpg)
The same tree can look different
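[Figure: two drawings of the same tree, one with the leaves in the order A, B, C, D, E, F and one in the order A, B, D, C, F, E]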
![Page 14: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/14.jpg)
Newick format
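A computer-readable format for describing trees that uses brackets and commas.
[Figure: tree on the leaves A, B, C, D, E, F]
((A,(B,(C,D))),(E,F));
The string is built up from the outermost brackets inwards:
( , );
((A, ),(E,F));
((A,(B, )),(E,F));
Viewers: Treeview / Dendroscope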
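To make the bracket notation concrete, here is a minimal sketch of a reader for topology-only Newick strings like the one above. It is an illustration only: the function names are made up for this example, and it ignores branch lengths, internal node labels and quoting, all of which real Newick files can contain.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
            return tuple(children)
        # Otherwise read a leaf name up to the next structural character.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """List the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def canonical(tree):
    """Sort children recursively so that two drawings of one rooted tree compare equal."""
    if isinstance(tree, str):
        return tree
    children = [canonical(child) for child in tree]
    return tuple(sorted(children, key=lambda child: sorted(leaves(child))))


tree = parse_newick("((A,(B,(C,D))),(E,F));")
same_tree_drawn_differently = parse_newick("((A,(B,(D,C))),(F,E));")
print(canonical(tree) == canonical(same_tree_drawn_differently))
```

Sorting the children at every node is one simple way to see that swapping the subtrees at a node changes the picture but not the tree.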
![Page 15: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/15.jpg)
What data can we use?
• Molecular sequence data
– DNA alignments*
– Amino acid alignments
• Presence/absence data
– Gene content
– Fragment based methods (AFLP, DArT)
– SNP chips
• Genetic distances (DNA-DNA hybridisation, immunology)
• Rare traits
– Gene order
– Introns?
– SINEs and LINEs (short and long interspersed retro-transposable elements)
• Morphological data
* Alignment is an important problem in its own right
![Page 16: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/16.jpg)
The molecular phylogeny problem
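[Figure: a tree drawn against a time axis. The sequences ACCGCTTA, ACTGCTTA and ACCCCTTA label ancestral nodes, and ACTGCTAA, ACTGCTTA, ACCCCTTA and ACCCCATA label the tips.]
We see the aligned modern day sequences:
…ACCCCTTA…
…ACCCCATA…
…ACTGCTTA…
…ACTGCTAA…
And want to recover the underlying evolutionary tree.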
![Page 17: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/17.jpg)
Sometimes the data agrees
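[Figure: a tree drawn against a time axis. The sequences ACCGCTTA, ACTGCTTA and ACCCCTTA label ancestral nodes, and ACTGCTAA, ACTGCTTA, ACCCCTTA and ACCCCATA label the tips.]
Aligned modern day sequences:
ACCCCTTA
ACCCCATA
ACTGCTTA
ACTGCTAA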
![Page 18: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/18.jpg)
Sometimes not
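[Figure: a tree drawn against a time axis. The sequences ACCGCTTA, ACTGCTTA and ACCCCTTA label ancestral nodes, and ACTGCTAA, ACTGCTTC, ACCCCTTC and ACCCCATA label the tips. Two tips have independently changed from A to C at the last site.]
Aligned modern day sequences:
ACCCCTTC
ACCCCATA
ACTGCTTC
ACTGCTAA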
![Page 19: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/19.jpg)
How can we choose the best tree?
• Distance data
– Quick methods that take a greedy clustering approach (e.g. Neighbour-Joining)
– Optimality criteria that assign each tree a score (Minimum evolution, fastME)
• Character data (e.g. a sequence alignment)
– Non-parametric methods (parsimony)
– Model-based methods (maximum likelihood, Bayesian)
![Page 20: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/20.jpg)
Parsimony
To decide which tree is best we can use an optimality criterion. This means we need a way of assigning each possible tree a score.
Maximum Parsimony is one such criterion. It chooses the tree which requires the fewest substitutions to explain the data.
The Principle of Parsimony is the general scientific principle that accepts the simpler of two explanations as preferable.
![Page 21: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/21.jpg)
Worked example: four aligned sequences and two candidate trees

S1 ACCCCTTC
S2 ACCCCATA
S3 ACTGCTTC
S4 ACTGCTAA

Minimum number of substitutions needed at each site:

Tree           Site: 1 2 3 4 5 6 7 8    Total
(1,2),(3,4)          0 0 1 1 0 1 1 2    6
(1,3),(2,4)          0 0 2 2 0 1 1 1    7

[Figures: for sites 1, 3, 4, 6, 7 and 8 in turn, the two trees are drawn with the observed bases at the four leaves and the substitutions needed marked on the edges.]
According to the parsimony optimality criterion we should prefer the tree (1,2),(3,4) over the tree (1,3),(2,4) as it requires fewer mutations.
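The site-by-site counting in this example can be sketched with Fitch's algorithm. The code below is a minimal illustration, not a full parsimony program: it assumes a rooted binary tree written as nested pairs of 0-based sequence indices, and the quartets are rooted on their internal edge, which does not change the parsimony score.

```python
SEQS = ["ACCCCTTC", "ACCCCATA", "ACTGCTTC", "ACTGCTAA"]  # S1..S4 from the slides

TREE_12_34 = ((0, 1), (2, 3))  # (1,2),(3,4)
TREE_13_24 = ((0, 2), (1, 3))  # (1,3),(2,4)


def fitch(tree, column):
    """Return (possible states at this node, substitutions needed below it)."""
    if not isinstance(tree, tuple):
        return {column[tree]}, 0  # a leaf: its observed base, no cost
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one substitution is needed on an edge below this node.
    return left_states | right_states, left_cost + right_cost + 1


def site_scores(tree, seqs):
    """Parsimony cost of each alignment column on the given tree."""
    return [fitch(tree, column)[1] for column in zip(*seqs)]


for name, tree in [("(1,2),(3,4)", TREE_12_34), ("(1,3),(2,4)", TREE_13_24)]:
    scores = site_scores(tree, SEQS)
    print(name, scores, sum(scores))
# prints:
# (1,2),(3,4) [0, 0, 1, 1, 0, 1, 1, 2] 6
# (1,3),(2,4) [0, 0, 2, 2, 0, 1, 1, 1] 7
```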
![Page 28: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/28.jpg)
The “large parsimony problem”
The small parsimony problem – to find the score of a given tree - can be solved in linear time in the size of the tree (using the Fitch or Sankoff algorithm).
The large parsimony problem, finding the tree with minimum score, is known to be NP-Hard.
![Page 29: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/29.jpg)
How many trees are there?
#species    #unrooted binary tip-labelled trees
4           3
5           3*5 = 15
6           3*5*7 = 105
7           3*5*7*9 = 945
10          2,027,025
20          2.2*10^20
n           (2n-5)!!
An exact search for the best tree, where each tree is evaluated according to some optimality criterion such as parsimony, quickly becomes intractable as the number of species increases.
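The table entries follow from the double factorial (2n-5)!! = 1 x 3 x 5 x ... x (2n-5). A short sketch (the function name is just for this example) reproduces them:

```python
def num_unrooted_binary_trees(n_tips):
    """Number of unrooted binary trees on n_tips labelled tips: (2n-5)!!"""
    if n_tips < 3:
        raise ValueError("need at least three tips")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (an empty product when n = 3).
    for odd in range(3, 2 * n_tips - 4, 2):
        count *= odd
    return count


for n in (4, 5, 6, 7, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```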
![Page 30: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/30.jpg)
Counting trees
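[Figure: building trees by adding one leaf at a time. The single tree on leaves 1, 2, 3 has three edges where leaf 4 can attach, giving 1 x 3 = 3 trees. Each of those has five edges where leaf 5 can attach, giving 1 x 3 x 5 = 15 trees.]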
![Page 31: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/31.jpg)
Search strategies
• Exact search: possible for small numbers of taxa (~12 or less) only
• Branch and Bound: A smarter way of doing exact search, up to ~20 taxa
• Greedy search: sequentially build up the tree with no backtracking (e.g. Neighbour-Joining)
• Local Search – Heuristics: pick a good starting tree and use moves within a “neighbourhood” to find a better tree.
• Meta-heuristics:
– Genetic algorithms
– Simulated annealing
– The ratchet
![Page 32: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/32.jpg)
The “Felsenstein Zone”
Parsimony was once a very popular method of phylogenetic inference, but it has fallen from grace in recent years because it is statistically inconsistent for many relevant phylogenetic models.
Consistency in statistics is the property that as you get more and more data you are more and more likely to get the correct answer. In phylogenetics this means that as sequence alignments get longer you would expect to be increasingly likely to infer the tree that the sequences evolved on.
![Page 33: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/33.jpg)
Felsenstein's example
• The true tree is ((A,B),C,D)
• The only parsimony informative patterns are of the form: xxyy, xyxy, and xyyx
• Let p be the probability of mutation on a short edge and let q be the probability of mutation on a long edge
• Felsenstein showed that if p is small compared to q, then the chance of seeing the parsimony informative pattern xxyy, which “matches” (i.e. costs 1 on) the true tree, is smaller than the chance of seeing the pattern xyxy, which costs 1 on the incorrect tree ((A,C),B,D)
• In that case, as the dataset grows, parsimony becomes more and more certain to return the incorrect tree
[Figure: two trees on the leaves A, B, C, D, labelled “True tree” and “Long branch attraction tree”]
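The effect can be checked numerically with a small sketch. The set-up here is an assumption made for illustration, not taken from the slides: a two-state symmetric model, with A and C on the long edges (change probability q) and B, D and the internal edge short (change probability p). The function names are hypothetical.

```python
from itertools import product


def pattern_probability(pattern, p, q):
    """Probability of one two-state site pattern (A, B, C, D) on the tree ((A,B),(C,D)).

    Assumed layout: A and C are on long edges (q); B, D and the internal
    edge are short (p). States are coded 0 and 1.
    """
    a, b, c, d = pattern

    def edge(parent, child, change):
        # Probability of the child's state given the parent's state.
        return change if parent != child else 1.0 - change

    total = 0.0
    for u, v in product((0, 1), repeat=2):  # states at the two internal nodes
        total += (0.5 * edge(u, a, q) * edge(u, b, p) * edge(u, v, p)
                  * edge(v, c, q) * edge(v, d, p))
    return total


def informative_pattern_probs(p, q):
    """Probabilities of the three parsimony informative pattern classes."""
    # Each class holds two complementary colourings, e.g. 0011 and 1100.
    return {
        "xxyy": 2 * pattern_probability((0, 0, 1, 1), p, q),
        "xyxy": 2 * pattern_probability((0, 1, 0, 1), p, q),
        "xyyx": 2 * pattern_probability((0, 1, 1, 0), p, q),
    }


print(informative_pattern_probs(p=0.05, q=0.4))  # long edges much longer than short ones
print(informative_pattern_probs(p=0.1, q=0.1))   # all edges equal
```

With p = 0.05 and q = 0.4 the misleading pattern xyxy is the more probable of the two, so longer alignments only strengthen parsimony's preference for the wrong tree. With equal edge probabilities the pattern xxyy, which supports the true tree, is the most probable.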
![Page 34: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/34.jpg)
Inconsistency with a clock
Felsenstein’s 4 taxon example appeared in a 1978 paper “Cases in which parsimony or compatibility methods will be positively misleading.” Syst. Zool.
The “misleading zone” was conjectured to depend on the highly non-clocklike nature of Felsenstein’s tree.
In 1989 Hendy and Penny showed that clocklike trees could also suffer from the same effect.
[Figure: clocklike tree on the taxa 1, 2, 3, 4 and an outgroup]
The 4 trees ((1,out),(3,4),2) and ((2,out),(3,4),1) and ((3,out),(1,2),4) and ((4,out),(1,2),3) all have lower expected parsimony score than the true tree ((1,2),(3,4),out)
![Page 35: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/35.jpg)
Enter statistical phylogenetics
• Over the last 30 years model-based methods of phylogenetic inference have come to the fore.
• These methods assume a model of nucleotide substitution which specifies the rates at which nucleotides (A,C,G,T) mutate into other nucleotides.
• They aim to find the tree with the highest probability of generating the sequence data observed, i.e. the maximum likelihood tree
![Page 36: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/36.jpg)
Sequence evolution is modelled as a Markov process
Consider a single edge in a phylogeny, i.e. evolution of a single species, and the evolution of a single DNA base amongst the possible states {A, C, G, T}.
The probability of mutating from state i to j over a length of time t depends only on the current state i and the potential future state j, not on any of the previous history of the sequence, and can be written $p_{ij}(t)$.
[Figure: a single site changing state along a time axis of length t, with the states A, C, T and G marked]
![Page 37: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/37.jpg)
Continuous time Markov chains
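Transition matrix (rows and columns in the order A, C, G, T):

$$
M = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}
$$

Instantaneous rate matrix:

$$
Q = \begin{pmatrix}
-q_{A*} & q_{AC} & q_{AG} & q_{AT} \\
q_{CA} & -q_{C*} & q_{CG} & q_{CT} \\
q_{GA} & q_{GC} & -q_{G*} & q_{GT} \\
q_{TA} & q_{TC} & q_{TG} & -q_{T*}
\end{pmatrix},
\qquad q_{i*} = \sum_{j \neq i} q_{ij}
$$

i.e. rows sum to zero.

$$
M = \exp(Qt)
$$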
Typically we restrict to stationary, reversible models, with the stationary distribution denoted by π. So, π Q = 0, and D(π)Q is symmetric.
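As a sketch of how M = exp(Qt) can be evaluated, the code below uses a truncated Taylor series with scaling and squaring, written in plain Python so that it has no dependencies. The Jukes-Cantor rate matrix (every off-diagonal rate equal to α) is used as the test case because its transition probabilities have a known closed form. The function names and the numerical settings are choices made for this example.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    step = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes step^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling: exp(A) = exp(A/2)^2
    return result


def jukes_cantor_q(alpha):
    """Rate matrix with every off-diagonal rate equal to alpha; rows sum to zero."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


alpha, t = 0.1, 2.0
M = mat_exp(jukes_cantor_q(alpha), t)
p_same_closed_form = 0.25 + 0.75 * math.exp(-4 * alpha * t)
print(M[0][0], p_same_closed_form)
```

Each row of M sums to one, as a matrix of transition probabilities should.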
![Page 38: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/38.jpg)
Example with a simple model
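2-state symmetric model (R := AG purines, Y := CT pyrimidines), with R and Y interchanging at rate α
Each edge i has probability p_i of a mutation, determined by its length
Observed site pattern:
a …R…
b …R…
c …Y…
d …Y…
[Figure: four-taxon tree on the leaves a, b, c, d with edge mutation probabilities 0.1, 0.1, 0.2, 0.15 and 0.1]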
![Page 39: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/39.jpg)
Example with a simple model
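Observed site pattern:
a …R…
b …R…
c …Y…
d …Y…
[Figure: the four-taxon tree with leaf states R, R, Y, Y and edge mutation probabilities 0.1, 0.1, 0.2, 0.15 and 0.1. The states at the two internal nodes are unknown (? ?), so the tree is redrawn for each of the four possible assignments: RR, RY, YR and YY.]
L = 0.9*0.9*0.8*0.15*0.1
  + 0.9*0.9*0.2*0.85*0.9
  + 0.1*0.1*0.2*0.15*0.1
  + 0.1*0.1*0.8*0.85*0.9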
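The four-term sum can be reproduced by enumerating the states at the two internal nodes. This is a sketch under one assumption: the slide does not say which edge carries which probability, so the assignment below (a: 0.1, b: 0.1, internal: 0.2, c: 0.15, d: 0.1) is read off from the order of the factors in the formula for L.

```python
from itertools import product

LEAF_STATE = {"a": "R", "b": "R", "c": "Y", "d": "Y"}
# Assumed edge-to-probability assignment, inferred from the formula for L.
P_CHANGE = {"a": 0.1, "b": 0.1, "internal": 0.2, "c": 0.15, "d": 0.1}


def edge_prob(parent, child, p_change):
    """Probability of the child's state given the parent's state on one edge."""
    return p_change if parent != child else 1.0 - p_change


def likelihood_terms():
    """One term per assignment (u, v) of states to the two internal nodes."""
    terms = {}
    for u, v in product("RY", repeat=2):  # RR, RY, YR, YY
        terms[u + v] = (edge_prob(u, LEAF_STATE["a"], P_CHANGE["a"])
                        * edge_prob(u, LEAF_STATE["b"], P_CHANGE["b"])
                        * edge_prob(u, v, P_CHANGE["internal"])
                        * edge_prob(v, LEAF_STATE["c"], P_CHANGE["c"])
                        * edge_prob(v, LEAF_STATE["d"], P_CHANGE["d"]))
    return terms


terms = likelihood_terms()
for assignment, value in terms.items():
    print(assignment, round(value, 5))
print("L =", round(sum(terms.values()), 4))
```

The assignment RY, which needs a change only on the internal edge, contributes most of the total. Note that the sum as written on the slide leaves out the prior probability of the state at the starting node (1/2 under a symmetric model); including it would halve the value.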
![Page 40: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/40.jpg)
In theory you can then…
• Multiply the likelihoods over all sites to get the likelihood of a particular weighted tree (in practice you sum the log likelihoods).
• Optimise the edge lengths to find the maximum likelihood value of the tree.
• Compare the maximum likelihood score of all trees (in practice you would probably use heuristic search).
• Report the weighted tree with the highest overall likelihood.
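The first two steps can be sketched on the smallest possible case: two sequences joined by a single edge. The example sums per-site log likelihoods under the Jukes-Cantor model and then searches for the edge length that maximises the total. The choice of model, the golden-section search and the function names are all assumptions made for this illustration; the sequences are S1 and S3 from the parsimony example.

```python
import math


def jc_pair_log_likelihood(seq_x, seq_y, t):
    """Log likelihood of two aligned sequences joined by an edge of length t
    (expected substitutions per site) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    log_l = 0.0
    for x, y in zip(seq_x, seq_y):
        # Stationary frequency (1/4) times the transition probability, logged and summed.
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l


def maximise(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0


S1, S3 = "ACCCCTTC", "ACTGCTTC"
t_hat = maximise(lambda t: jc_pair_log_likelihood(S1, S3, t), 1e-6, 5.0)

# For two sequences the Jukes-Cantor optimum has a closed form to check against.
p_observed = sum(x != y for x, y in zip(S1, S3)) / len(S1)
t_closed_form = -0.75 * math.log(1.0 - 4.0 * p_observed / 3.0)
print(t_hat, t_closed_form)
```

On a real tree every edge length has to be optimised jointly, and there is no closed form to fall back on.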
![Page 41: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/41.jpg)
Computational complexity
• Maximum likelihood is computationally complex.
• Fortunately the “tiny problem”, finding the likelihood of a site on a tree with fixed branch lengths, is solvable in linear time using Felsenstein’s pruning algorithm (dynamic programming)
• However, solving the “small problem”, i.e. computing the likelihood of a particular tree, is time consuming. Unlike parsimony, here the edge lengths matter and must be optimized for each tree.
• Finding the best edge weights for a given tree uses local search (hill-climbing), and can get stuck in local optima.
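A minimal sketch of the pruning idea, applied to the two-state example tree (leaves a = R, b = R, c = Y, d = Y): each node passes up the probability of the data below it, conditional on its own state, so each edge is visited once. The tree encoding, the function names and the assignment of probabilities to edges are assumptions made for this example.

```python
def conditional_likelihoods(node):
    """Felsenstein pruning for a two-state (R/Y) symmetric model.

    A leaf is the string "R" or "Y". An internal node is a list of
    (child, change_probability) pairs. Returns {"R": ..., "Y": ...}: the
    probability of the leaves below the node given the node's state.
    """
    if isinstance(node, str):
        return {state: 1.0 if state == node else 0.0 for state in "RY"}
    result = {"R": 1.0, "Y": 1.0}
    for child, p_change in node:
        below = conditional_likelihoods(child)
        for state in "RY":
            other = "Y" if state == "R" else "R"
            result[state] *= (1.0 - p_change) * below[state] + p_change * below[other]
    return result


def site_likelihood(root, prior=None):
    """Combine the root's conditional likelihoods with a prior on its state."""
    prior = prior or {"R": 0.5, "Y": 0.5}
    cond = conditional_likelihoods(root)
    return sum(prior[state] * cond[state] for state in "RY")


# Assumed edge probabilities: a 0.1, b 0.1, internal 0.2, c 0.15, d 0.1.
NODE_CD = [("Y", 0.15), ("Y", 0.1)]
ROOT = [("R", 0.1), ("R", 0.1), (NODE_CD, 0.2)]

print(conditional_likelihoods(ROOT))
print(site_likelihood(ROOT))
```

The two conditional likelihoods at the root add up to the four-term sum from the simple-model example (0.1398). The site likelihood is half of that because this version includes the prior of 1/2 on the root state.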
![Page 42: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/42.jpg)
Models of nucleotide substitution
• Jukes Cantor (JC)
– All substitutions equally likely
– Base frequencies equal
• Kimura 2 Parameter (K2P)
– Transitions and transversions at different rates
– Base frequencies equal
• HKY model
– Transitions and transversions at different rates
– Base frequencies different
• General Time Reversible (GTR)
[Figure: the four bases A, G, C, T with all six substitution types labelled with the same rate α, as in the Jukes Cantor model]
![Page 43: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/43.jpg)
Models of nucleotide substitution
[Figure: the four bases A, G, C, T with two substitution types labelled β and four labelled α, as in a model with separate transition and transversion rates]
![Page 44: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/44.jpg)
Models of nucleotide substitution
[Figure: the four bases A, G, C, T with two substitution types labelled β and four labelled α, as in a model with separate transition and transversion rates]
![Page 45: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/45.jpg)
Models of nucleotide substitution
[Figure: the four bases A, G, C, T with the six substitution types labelled with six different rates α, β, γ, δ, ε, ζ, as in the General Time Reversible model]
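To connect these models to the rate matrix Q, here is a sketch that builds an HKY-style matrix and checks the two properties mentioned earlier: π Q = 0 (stationarity) and D(π)Q symmetric (reversibility). It uses the common parameterisation in which the rate from x to y is κ π_y for a transition and π_y for a transversion, left unscaled. The function names and the example frequencies are made up for illustration.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """Transitions stay within purines (A, G) or within pyrimidines (C, T)."""
    return (x in PURINES) == (y in PURINES)


def hky_rate_matrix(pi, kappa):
    """Unscaled HKY rate matrix as a dict keyed by (from_base, to_base).

    pi: base frequencies; kappa: transition/transversion rate ratio.
    """
    q = {}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x, y] = pi[y] * (kappa if is_transition(x, y) else 1.0)
        q[x, x] = -sum(q[x, y] for y in BASES if y != x)  # rows sum to zero
    return q


def is_stationary(pi, q, tol=1e-12):
    """Check pi Q = 0."""
    return all(abs(sum(pi[x] * q[x, y] for x in BASES)) < tol for y in BASES)


def is_reversible(pi, q, tol=1e-12):
    """Check that D(pi) Q is symmetric: pi_x q_xy = pi_y q_yx."""
    return all(abs(pi[x] * q[x, y] - pi[y] * q[y, x]) < tol
               for x in BASES for y in BASES)


pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # illustrative frequencies
q_hky = hky_rate_matrix(pi, kappa=2.0)
print(is_stationary(pi, q_hky), is_reversible(pi, q_hky))
```

Setting all four frequencies to 0.25 gives K2P, and additionally setting kappa to 1 gives Jukes Cantor, so the simpler models are special cases of the same construction.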
![Page 46: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/46.jpg)
Extra features that can be modelled
• Site to site rate variation (usually modelled by a gamma distribution)
• Invariant sites
• Different parameters for different genes/codon positions
BUT
• Some parts of reality are problematic…
– Base composition bias (LogDet)
– The set of sites that are free to vary changes across the tree
– Non independence of sites
![Page 47: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/47.jpg)
Stochastic versus Systematic error
• Rather than just getting a point estimate of the tree we are usually interested in getting some measure of confidence
• Error could arise in two ways
– Stochastic – our sequences are too short
– Systematic – our models are wrong
• There are good methods for determining whether or not stochastic error is a problem.
![Page 48: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/48.jpg)
Assessing confidence in trees
• We would like some measure of confidence in the inferred tree.
– Is the tree likely to change if we got more data, or if we had used slightly different data?
– Are some parts of the tree more robust than others?
• The bootstrap is a useful tool for answering these sorts of questions.
![Page 49: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/49.jpg)
The bootstrap
• In 1985 Felsenstein introduced the idea of the bootstrap to phylogenetics.
• For each bootstrap sample
– Create a new alignment by resampling the columns of the observed alignment (with replacement)
– Construct a tree for the ‘bootstrap’ alignment
• Can be applied to any method that starts from a sequence alignment, e.g., parsimony, likelihood, clustering methods if the distances are derived from an alignment…
• The bootstrap support for each edge is the number (or proportion) of bootstrap trees that edge appears in.
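The procedure can be sketched in a few lines. The resampling function is general, but the tree builder here is a deliberately tiny stand-in: it handles only four taxa named a, b, c, d and picks the quartet supported by the most parsimony informative sites. All names are hypothetical, and the alignment is the seven-column example used to illustrate the bootstrap in these slides.

```python
import random

ALIGNMENT = {"a": "ATATAAA", "b": "ATTATAA", "c": "TAAAATA", "d": "TATAAAT"}


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(seqs.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in seqs.items()}


def quartet_by_parsimony(seqs):
    """Toy tree builder for taxa a, b, c, d: return the pair grouped with a.

    Counts the informative patterns xxyy, xyxy and xyyx. The quartet with the
    most supporting sites has the lowest parsimony score. Ties go to the
    first pair listed.
    """
    support = {("a", "b"): 0, ("a", "c"): 0, ("a", "d"): 0}
    for a, b, c, d in zip(seqs["a"], seqs["b"], seqs["c"], seqs["d"]):
        if a == b and c == d and a != c:
            support[("a", "b")] += 1
        elif a == c and b == d and a != b:
            support[("a", "c")] += 1
        elif a == d and b == c and a != b:
            support[("a", "d")] += 1
    return max(support, key=support.get)


def bootstrap_support(seqs, build_tree, n_reps, seed=0):
    """Proportion of bootstrap replicates in which each grouping is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_reps):
        grouping = build_tree(bootstrap_alignment(seqs, rng))
        counts[grouping] = counts.get(grouping, 0) + 1
    return {grouping: n / n_reps for grouping, n in counts.items()}


print(quartet_by_parsimony(ALIGNMENT))
print(bootstrap_support(ALIGNMENT, quartet_by_parsimony, n_reps=1000))
```

Any tree-building method could be passed in place of the toy builder. The only requirement is that it returns something hashable that identifies the edge, or set of edges, whose support is being counted.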
![Page 50: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/50.jpg)
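Alignments shown on the slide (the digits above each block give the original column that each column was drawn from):

   1234567    1224567    1334567    1234567    1244567
a  ATATAAA    ATTTAAA    AAATAAA    ATATAAA    ATTTAAA
b  ATTATAA    ATTATAA    ATTATAA    ATTATAA    ATAATAA
c  TAAAATA    TAAAATA    TAAAATA    TAAAATA    TAAAATA
d  TATAAAT    TAAAAAT    TTTAAAT    TATAAAT    TAAAAAT

[Figure: four trees on the taxa a, b, c, d, and a summary tree with support 0.75 on its internal edge]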
![Page 51: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/51.jpg)
Example where the bootstrap is useful
• Simulate data on the four taxon tree below (JC model)
• Use sequence lengths of 100, 1000, and 10000
Sequence length    100      1000     10000
((a,b),(c,d))      5.7%     97%      100%
((a,c),(b,d))      42.8%    <5%      0
((a,d),(b,c))      49.8%    <5%      0
[Figure: four-taxon tree on the leaves a, b, c, d with branch lengths 0.2 and 0.01]
![Page 52: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/52.jpg)
Example where the bootstrap is not so useful
• Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences
• Use total sequence lengths of 100, 1000, and 10000
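Sequence length    100     1000     10000
((a,b),(c,d))      64%     80%      98%
((a,c),(b,d))      33%     20%      <5
((a,d),(b,c))      3%      0%       <5
[Figure: two four-taxon trees, one with leaf order a, b, c, d (55% of the data) and one with leaf order a, c, b, d (45%), each with branch lengths 0.1 and 0.05]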
![Page 53: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/53.jpg)
Mistaking precision for accuracy
106 nuclear genes: different methods provide conflicting yeast topologies, each with 100% bootstrap support
The results show the importance of systematic error
Phillips et al. (MBE, 2004)
![Page 54: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/54.jpg)
Phylogenomics
• With increasing quantities of sequence data it is common to get very well resolved trees, i.e. bootstrap values (or posterior probabilities) close to 100% (or 1)
• HOWEVER, slight changes to the model or inference method used can mean you get 100% support for different trees?!
• This suggests that it is very important to know how well our models fit our data.
![Page 55: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/55.jpg)
Testing model fit
• In phylogenetic studies we often ask questions of the form “Is model A better than model B?” (relative goodness-of-fit tests, based on likelihood ratio tests or information criteria such as the AIC or BIC)
• We less frequently ask questions of the form “Is model C a good fit to the data?” (absolute goodness-of-fit tests)
• The last question is clearly important: if our models fit poorly, our results may be subject to systematic errors
• So why is the last question less popular?
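For the relative comparisons, the information criteria are simple functions of the maximised log likelihood and the number of free parameters. The sketch below uses made-up log likelihood values purely for illustration. The parameter counts are the free substitution-model parameters only (JC 0, HKY 4, GTR 8), and taking the alignment length as the sample size for BIC is a common convention, not the only one.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion, with alignment length as sample size."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def likelihood_ratio_statistic(log_l_simple, log_l_complex):
    """Test statistic for two nested models: twice the log likelihood gain."""
    return 2 * (log_l_complex - log_l_simple)


# Hypothetical fits on the same tree and a 1000-site alignment.
FITS = {"JC": (-2500.0, 0), "HKY": (-2450.0, 4), "GTR": (-2448.0, 8)}
N_SITES = 1000

aic_scores = {name: aic(log_l, k) for name, (log_l, k) in FITS.items()}
bic_scores = {name: bic(log_l, k, N_SITES) for name, (log_l, k) in FITS.items()}
print(aic_scores)
print(bic_scores)
print(likelihood_ratio_statistic(FITS["JC"][0], FITS["HKY"][0]))
```

In this made-up case GTR has the highest likelihood, but both criteria prefer HKY because the small gain does not pay for four extra parameters. None of this says whether HKY fits the data well in absolute terms.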
![Page 56: Introduction to Phylogenetics - Barbara Holland](https://reader035.vdocuments.us/reader035/viewer/2022062319/55504e73b4c905b2788b5228/html5/thumbnails/56.jpg)
Absolute GOF tests
• Absolute tests of goodness-of-fit are fiddlier to implement than relative tests. They typically require doing many parametric simulations under the best model and then summarizing each simulated dataset by a single test statistic.
• With very large datasets you almost always will be able to reject your model.
• If the best model fails an absolute GOF test, it’s not clear what you should do next…