Phylogenetic inference
or “How to recognize a tree from quite a long way away”
Mikael Thollesson
Evolutionary Biology Centre, Uppsala University
Slides are available on the course’s web page
“Bioinformation in the cell”
DNA
RNA
mRNA
protein
polypeptide
enzyme
transcription
splicing
translation
protein folding
coenzyme activation
“Extended bioinformation”
Original sense strand
Original anti-sense strand
Original sense strand
Original anti-sense strand
New sense strand
New anti-sense strand
Phylogeny from a Bioinformatic viewpoint
A phylogeny is the history (of events) more or less exclusively shared by some kind of biological replicators
These replicators can in practice be, for example,
– Species, populations, strains
– Genomes, genes
– Populations
Phylogenies can usually be modelled as trees; “phylogeny” and “phylogenetic tree” have thus become more or less synonymous, even though they are not
The objective of phylogenetic analysis is to infer this history and its events, usually resulting in a phylogenetic hypothesis in the form of a tree (together with cosmology, the only science dealing with particular histories)
000
010
111
110
GCCACTTTCGCGATCA
GCgACTTTCGCGATtA
GCgACTagCGCGATCA
GCgACTagCGCGATCA
GCCACaTTCcCGATCA
GCCACaTTCcCGAgCA
GCgACTTTCGCGATta
GCgACTTTCcCGATtA
GCgACTTTCGCGATCA
GCCACaTTCcCGATCA
Time
?GCgACTTTCGCGATta GCgACTTTCcCGATtA
GCgACTagCGCGATCA
GCCACaTTCcCGATCA
GCCACaTTCcCGAgCA
GCgACTTTCGC--Tta
Ordering the sequences hierarchically by shared evolutionary novelties (synapomorphies) produces a phylogenetic hypothesis (tree)
We cannot distinguish between novelties and ancestral states; we just see the difference
Parallel substitutions and multiple substitutions at the same site create ambiguities about the hierarchy
We must make some a priori assumption of homology – for sequences, this is the same as doing a multiple alignment
             limbs  amnion
Lion         yes    yes
Bald eagle   yes    yes
Bullfrog     yes    no
Cod          no     no
White shark  no     no
Lion
Bald eagle
Bullfrog
Cod
White shark
Characters
Character states
Taxa, terminal units
Lineus geniculatus       TGGGCTGGGATGAAGGGAAGTATCGTGGGCCCGG
Micrura akkeshiensis     GGGGCTAGAATGAATGGGA-TAACGAGCCCCCGA
Myoisophagus versicolor  GGGGCTAGAATGAAAGAAA-GTTTGAGACCTCAT
Parvicirrus dubius       GGGACTGGAATGAAAGAAA-TTTTGAGGCCTTAA
1. Gather data from the entities whose phylogeny we are interested in
2. Select a criterion to evaluate how well each possible tree fits the observed data
[Figure: the 15 possible rooted trees for the four taxa Pd, Mv, Ma and Lg, each labelled with its score; five trees score 22, five score 26 and five score 27]
Mi. akkeshiensis
Li. geniculatus
Pa. dubius
My. versicolor
3. Find the tree that best fits the data and choose it as the preferred hypothesis
4. Evaluate the sampling variation in the data to see if you have enough support for your conclusion
95%
Why do phylogenetics? – Prediction
Prospective biomedical compounds from sponges (Porifera)
Treatment of microsporidia
Gauging biodiversity for conservation
“Taxa are not related because of similarity, but similar due to relatedness”
Why? – Sequence of evolutionary events
Why the oaks retain their leaves in contrast to other deciduous trees
Evergreens
Evolution of metabolic pathways
Tracing infection histories for viruses
Why? – (Ab)use of comparative method
Correlation between ability to fly and being black and white
Species, populations, or genes (i.e., entities corresponding to replicators) are not independent samples/observations since they have a more or less inclusively shared history
Trees and terminology
A
B
C
D
branch or edge
node or vertex
Terminal nodes (external vertices) represent taxa or genes on which we have observations
Internal vertices represent inferred splitting events (may be interpreted as ancestral species or gene copies)
Unrooted vs. rooted trees
D C A B
clade
Rooting is normally done using a designated outgroup
D D
C
BA
e1e2
e6e3
e4 e5
A
B
C
D
X is defined to be more closely related to Y than to Z if and only if X shares a (more recent) history with Y that it does not share with Z
D C A BB A C D
Relatedness
“The standard recipe” for phylogenetic inference
Collect your data
Select an optimality criterion (“Which tree is better?”)
Optional: do data transformations (“corrections”)
Select a search strategy and find the best hypothesis (according to the selected criterion) using this search method
Assess the variation in your data in some way
There are really only two big theoretical problems in phylogenetic inference…
– The criterion and calculating the score
– Finding the best tree
Step 1 – Data collection
Any observation of inherited traits is in principle useful
Primary homology assessment - from traits to characters and character states; for sequence data this corresponds to alignment
Pair-wise differences (e.g., DNA-DNA hybridization, histocompatibility) can also be used, although with a limited set of criteria
Include one or several outgroups for rooting
Step 2 – Optimality criteria, some selected
Criterion            Data                    Model based
(Maximum) parsimony  Discrete                No
Maximum likelihood   Discrete or continuous  Yes
Minimum evolution    Pairwise distances      Optional*
Assumptions shared by (almost) all optimality criteria/methods
Characters are independent (and thus the order in the data matrix does not matter)
– Special models exist for, e.g., rRNA and codons
The substitution process is homogeneous over time/in the entire tree (overall rate can vary)
– Special models do not make this assumption
Substitution rates are the same for all characters
– Rate variation among characters can be accommodated easily in most methods
Parsimony optimality criterion
Given two trees, the one requiring the lower number of character changes to explain the observed character distribution is the better
– The parsimony score for a tree is the minimum number of required changes
– This score is frequently referred to as the number of steps or the tree length
– The method can be modified using non-uniform weights
  Character weights (positional weights)
  Character state weights (transformational weights)
Parsimony – an example
acgtatgga
acgggtgca
aacggtgga
aactgtgca
[Figure: the three possible trees for the four sequences, with the state at one site (c or a) shown at each terminal]
Total tree length: 7    Total tree length: 8    Total tree length: 8
$$L(\tau) = \sum_{j=1}^{c} l_j w_j$$
$$l = \min \sum_{k=1}^{2n-3} c_{a(k),\,b(k)}$$
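The tree lengths in the four-sequence example can be checked with a short sketch. It assumes unit costs and uniform weights by default, and it uses Fitch's algorithm on a rooted version of each tree, which is a standard way to count the minimum number of changes but is not named on the slides. The taxon names `s1` to `s4` and the function names are made up for illustration.

```python
# Sequences from the parsimony example; the names s1..s4 are hypothetical labels.
SEQS = {
    "s1": "acgtatgga",
    "s2": "acgggtgca",
    "s3": "aacggtgga",
    "s4": "aactgtgca",
}

def fitch_site(tree, states):
    """Return (possible state set, minimum number of changes) for one site.

    `tree` is a nested tuple of taxon names, e.g. (("s1", "s2"), ("s3", "s4")).
    """
    if isinstance(tree, str):              # terminal node: the observed state
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                             # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, seqs, weights=None):
    """Weighted sum of per-site lengths (uniform weights unless given)."""
    n_sites = len(next(iter(seqs.values())))
    weights = weights or [1] * n_sites
    total = 0
    for j in range(n_sites):
        states = {name: seq[j] for name, seq in seqs.items()}
        total += weights[j] * fitch_site(tree, states)[1]
    return total

# The three possible groupings of four terminals.
TREES = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]

for t in TREES:
    print(t, tree_length(t, SEQS))
```

With unit costs the score does not depend on where the tree is rooted, so rooting each tree on its internal branch is only a convenience here.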
Using substitution models – Why?
Observed differences
Actual changes
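$$Q = \begin{bmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{bmatrix}$$
Example: Jukes-Cantor model
$$p_{ij} = \begin{cases} \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}, & \text{if } i = j \\ \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}, & \text{if } i \neq j \end{cases}$$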
Jukes-Cantor is the simplest model in a class of models called general time-reversible (GTR) models for DNA
GTR (most complex symmetric model) has six different rates (one for each pair of bases) and different base frequencies
A G
TC
$P(t) = e^{Qt}$
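As a check that the Jukes-Cantor transition probabilities are what the matrix exponential of Q gives, the sketch below computes exp(Qt) with a truncated Taylor series and compares it with the closed form. The series approach and the function names are choices made for this illustration (adequate only for small values of αt); a numerical library routine would normally be used instead.

```python
import math

def jc_rate_matrix(alpha):
    """Jukes-Cantor rate matrix: alpha off the diagonal, -3*alpha on it."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=30):
    """exp(Q t) as the sum of (Q t)^k / k! for k < terms (small Q t only)."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]      # current term, starts at identity
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def jc_closed_form(alpha, t):
    """Return (p_ii, p_ij) from the closed-form Jukes-Cantor expressions."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

P = mat_exp(jc_rate_matrix(1.0), 0.1)
same, diff = jc_closed_form(1.0, 0.1)
print(P[0][0], same)   # probability of no net change
print(P[0][1], diff)   # probability of one specific change
```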
Pair-wise distances – an example
acgtatggac
acgggtgcac
aacggtggac
aactgtgcac

0    0.3  0.4  0.4
0.3  0    0.3  0.3
0.4  0.3  0    0.2
0.4  0.3  0.2  0

p = 2/10 = 0.2 (p distance)
$$d_{ij} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{ij}\right)$$ – Jukes-Cantor distance
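The distance matrix above can be reproduced with a few lines. This is a minimal sketch, assuming aligned sequences of equal length without gaps; the function names are made up for illustration.

```python
import math

SEQS = ["acgtatggac", "acgggtgcac", "aacggtggac", "aactgtgcac"]

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc_distance(p):
    """Jukes-Cantor correction of a p distance (defined only for p < 0.75)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

p_matrix = [[p_distance(a, b) for b in SEQS] for a in SEQS]
jc_matrix = [[jc_distance(p) for p in row] for row in p_matrix]

for row in p_matrix:
    print(row)
for row in jc_matrix:
    print([round(d, 3) for d in row])
```

The corrected distances are always at least as large as the p distances, reflecting substitutions that are not observed directly.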
Minimum evolution optimality criterion
Starts by calculating pair-wise distances between all terminal taxa/sequences
– These calculations can incorporate explicit substitution models, e.g., Jukes-Cantor
Given two trees, the one having the lowest sum of branch lengths when fitted to the data is the better
One way to fit the data is using the constraints below, or using least squares approximation
– No branch can have negative length, e_ij ≥ 0
– The path between two terminals along the tree is at least as long as the pair-wise distance, e_ij ≥ d_ij
The score is commonly referred to as tree length (as for parsimony)
Maximum likelihood optimality criterion
Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better
– Site likelihood is the conditional probability of the data at one site (one character) given the assumed model of evolution and parameters of the model
– Data set likelihood is the product of the site likelihoods (character independence)
Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model
The model consists of
– A substitution model, e.g. Jukes-Cantor
– A tree with branch lengths
$L_H \propto P(D \mid H) = P(D \mid T, \Theta)$
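Likelihood of a one-branch tree
Taxon1 AC
Taxon2 CC
[Figure: Taxon1 (AC) and Taxon2 (CC) joined by a single branch of length t]
For Jukes-Cantor!
$$L_1 = \Pr(A)\,\Pr(A \to C \mid \alpha t) = \frac{1}{4}\,p_{AC}(t) = \frac{1}{4}\left[\frac{1}{4} - \frac{1}{4}e^{-4\alpha t}\right]$$
$$L_2 = \Pr(C)\,\Pr(C \to C \mid \alpha t) = \frac{1}{4}\,p_{CC}(t) = \frac{1}{4}\left[\frac{1}{4} + \frac{3}{4}e^{-4\alpha t}\right]$$
$L_{tot} = L_1 \cdot L_2$, or $\log L_{tot} = \log L_1 + \log L_2$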
Another one-branch tree
30 nucleotides from -globin genes of two primates on a one-edge tree
            * *
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG
There are two differences and 28 similarities
$$L = \left[\frac{1}{16}\left(1 - e^{-4\alpha t}\right)\right]^{2}\left[\frac{1}{16}\left(1 + 3e^{-4\alpha t}\right)\right]^{28}$$
[Plot: lnL (from −55.0 to −50.5) against t (from 0 to 0.1); the maximum is at t = 0.02327, lnL = −51.133956]
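The maximum of this likelihood curve can be found numerically. The sketch below assumes α = 1, so that t is measured in units of 1/α (this is what reproduces the value on the slide), and uses a golden-section search, which is simply one convenient optimiser and not necessarily what was used to draw the plot. Function names are made up for illustration.

```python
import math

def log_likelihood(t, n_diff=2, n_same=28, alpha=1.0):
    """lnL of the one-branch tree under Jukes-Cantor (alpha = 1 is an assumption)."""
    e = math.exp(-4 * alpha * t)
    return (n_diff * math.log((1 - e) / 16)
            + n_same * math.log((1 + 3 * e) / 16))

def maximise(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    phi = (math.sqrt(5) - 1) / 2
    c = hi - phi * (hi - lo)
    d = lo + phi * (hi - lo)
    while hi - lo > tol:
        if f(c) > f(d):        # maximum lies in [lo, d]
            hi = d
        else:                  # maximum lies in [c, hi]
            lo = c
        c = hi - phi * (hi - lo)
        d = lo + phi * (hi - lo)
    return (lo + hi) / 2

t_hat = maximise(log_likelihood, 1e-6, 0.1)
print(round(t_hat, 5), round(log_likelihood(t_hat), 4))
```

For this simple case the estimate also has a closed form, αt = −(1/4) ln(1 − 4p/3) with p = 2/30, which gives the same value.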
Likelihoods of a more interesting tree…
Bases at internal nodes are unknown
$$L_i = \sum_{u \in \{A,C,G,T\}} \sum_{v \in \{A,C,G,T\}} \frac{1}{4}\, p(u \to A \mid e_1)\, p(u \to A \mid e_2)\, p(u \to v \mid e_5)\, p(v \to C \mid e_3)\, p(v \to C \mid e_4)$$
A
A
C
T
e1 e3
e4e2
e5
u v
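The double sum over the unknown bases u and v can be written out directly. This is a sketch under the Jukes-Cantor model; the branch lengths in `EDGES` are hypothetical values (expressed as αt) chosen only for illustration, and the tip states are passed in as arguments.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(i, j, length):
    """Jukes-Cantor probability of going from base i to base j; length = alpha * t."""
    e = math.exp(-4 * length)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(tips, edges):
    """Likelihood of one site on the four-taxon tree.

    tips  = (x1, x2, x3, x4): x1 and x2 attach to internal node u by e1 and e2,
            x3 and x4 attach to internal node v by e3 and e4.
    edges = (e1, e2, e3, e4, e5), where e5 joins u and v.
    """
    x1, x2, x3, x4 = tips
    e1, e2, e3, e4, e5 = edges
    total = 0.0
    for u in BASES:                # sum over the unknown base at u
        for v in BASES:            # sum over the unknown base at v
            total += (0.25
                      * jc_prob(u, x1, e1) * jc_prob(u, x2, e2)
                      * jc_prob(u, v, e5)
                      * jc_prob(v, x3, e3) * jc_prob(v, x4, e4))
    return total

EDGES = (0.05, 0.05, 0.10, 0.10, 0.02)   # hypothetical branch lengths
print(site_likelihood(("A", "A", "C", "C"), EDGES))
```

A useful sanity check is that the likelihoods of all 256 possible tip patterns sum to one for any choice of branch lengths.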
Step 3 – Finding the best tree
Number of (rooted) trees for n terminals is (2n−3)·(2n−5)·(2n−7)·…·3·1
– 3 taxa -> 3 trees
– 4 taxa -> 15 trees
– 10 taxa -> 34 459 425 trees
– 25 taxa -> 1.19·10^30 trees
– 52 taxa -> 2.75·10^80 trees
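The product formula is easy to evaluate exactly with integer arithmetic. A minimal sketch (the function name is made up for illustration):

```python
def n_rooted_trees(n):
    """Number of rooted binary trees for n terminals: (2n-3)*(2n-5)*...*3*1."""
    if n < 2:
        raise ValueError("need at least two terminals")
    count = 1
    for k in range(3, 2 * n - 2, 2):   # odd factors 3, 5, ..., 2n-3
        count *= k
    return count

for n in (3, 4, 10, 25, 52):
    print(n, n_rooted_trees(n))
```

Python integers have arbitrary precision, so the counts for 25 and 52 taxa are exact rather than rounded.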
Finding the optimal tree is an NP-complete or NP-hard problem
Search strategies
– Exact
  Will find the best tree (according to the selected criterion)
  Exhaustive – up to ca 10 taxa
  Branch and bound – up to ca 15 taxa
– Heuristic
  Limits the search to a “reasonable” set of trees. May not find the optimal tree
Heuristic tree searches usually start with hill climbing (greedy algorithms) to obtain a starting tree
– Star decomposition
– Stepwise addition
and proceed with some flavour of branch swapping to improve on the starting tree and find better trees
Heuristic tree search – Star decomposition
[Figure: a star tree of the terminals A–E and the trees obtained from it by joining pairs of terminals]
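Heuristic tree search – Stepwise addition
[Figure: terminals are added one at a time to the three-taxon tree of A, B and C; the three candidate trees with D added score 831, 783 and 837, and the five candidate trees with E added score 921, 914, 915, 905 and 916]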
Heuristic tree search – Branch swapping
[Figure: rearrangements of a tree with terminals A–I by SPR and by TBR]
Step 2+3 – A dirty shortcut to get a tree…
Instead of evaluating each tree, some methods build a tree using a specific algorithm, usually from pair-wise distances
Neighbor-joining is one such method that is widely used
– NJ can roughly be viewed as a star decomposition minimizing the sum of branch lengths (evolutionary change)
What is a “good” method?
Efficiency – Time to find a/the solution
Power – Rate of convergence/how much data are needed
Consistency – Convergence to “correct” solution as data are added
Robustness – Performance when assumptions are violated
Falsifiability – Rejection of the model when it is inadequate
Performance on simulated data
[Figure: two plots of frequency of correct inference (0 to 1) against sequence length (10 to 100 000), with panel labels “All 0.50” and “0.30 and 0.05 respectively”. Methods shown: Lake's invariants; Parsimony, uniform; Parsimony, weighted; UPGMA, Kimura; NJ, Kimura; NJ, percentage; ML, Kimura]
Some pros and cons of selected methods
Pair-wise, algorithmic approach (e.g. Neighbor-joining)
+ Fast
+ Models can be used when transforming to distances
- Information is lost when transforming to pair-wise distances
- One will get a tree, but no measure of goodness to compare with other hypotheses (when using algorithmic methods like NJ)
Parsimony
+ Philosophically appealing – Occam’s razor (no unnecessary assumptions)
+ Can be applied to most kinds of data without prior knowledge
- Can be inconsistent
- Can be computationally slow
Maximum likelihood
+ Model based; enables statistical tests and handles problems with multiple substitutions
- Model based; models can be inadequate and give misleading results
- Computationally veeeeery slooooowww
Step 4 – Assessing the variation in the data
Variation cannot be assessed by repeated sampling from the statistical population – we have a unique sample
We have to rely on resampling from the data already at hand
– Jack-knife – resampling without replacement
– Bootstrap – resampling with replacement
Original data set with n characters.
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Aus   C  G  A  C  G  G  T  G  G  T  C  T  A  T  A  C  A  C  G  A
Beus  C  G  G  C  G  G  T  G  A  T  C  T  A  T  G  C  A  C  G  G
Ceus  T  G  G  C  G  G  C  G  T  C  T  C  A  T  A  C  A  A  T  A
Deus  T  A  A  C  G  A  T  G  A  C  C  C  G  A  C  T  A  T  T  G

Draw n characters randomly with replacement. Repeat m times.
      2  3 13  8  3 19 14  6 20 20  7  1  9 11 17 10  6 14  8 16
Aus   G  A  A  G  A  G  T  G  A  A  T  C  G  C  A  T  G  T  G  C
Beus  G  G  A  G  G  G  T  G  G  G  T  C  A  C  A  T  G  T  G  C
Ceus  G  G  A  G  G  T  T  G  A  A  C  T  T  T  A  C  G  T  G  C
Deus  A  A  G  G  A  T  A  A  G  G  T  T  A  C  A  C  A  A  G  T

m pseudo-replicates, each with n characters.
Original analysis, e.g. MP, ML, NJ.
Repeat original analysis on each of the pseudo-replicate data sets.
Evaluate the results from the m analyses.
[Figure: trees of Aus, Beus, Ceus and Deus from the original analysis and from the pseudo-replicate analyses; the summary tree carries a support value of 75%]
Bootstrap
Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support
Values below 0.5 are nonsense
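The resampling step of the bootstrap can be sketched as follows, using the four-taxon data set from the diagram. Only the drawing of pseudo-replicates is shown; the tree-building analysis that would be repeated on each replicate is left out. The function names and the fixed random seed are choices made for this illustration.

```python
import random

ALIGNMENT = {
    "Aus":  "CGACGGTGGTCTATACACGA",
    "Beus": "CGGCGGTGATCTATGCACGG",
    "Ceus": "TGGCGGCGTCTCATACAATA",
    "Deus": "TAACGATGACCCGACTATTG",
}

def resample(alignment, columns):
    """Build a data set from the given 1-based column numbers."""
    return {name: "".join(seq[c - 1] for c in columns)
            for name, seq in alignment.items()}

def bootstrap(alignment, m, seed=0):
    """Return m pseudo-replicates, each with n columns drawn with replacement."""
    rng = random.Random(seed)              # seeded only to make runs repeatable
    n = len(next(iter(alignment.values())))
    replicates = []
    for _ in range(m):
        columns = [rng.randint(1, n) for _ in range(n)]
        replicates.append(resample(alignment, columns))
    return replicates

# The particular draw shown in the diagram.
SLIDE_DRAW = [2, 3, 13, 8, 3, 19, 14, 6, 20, 20, 7, 1, 9, 11, 17, 10, 6, 14, 8, 16]
print(resample(ALIGNMENT, SLIDE_DRAW)["Aus"])

replicates = bootstrap(ALIGNMENT, m=100)
print(len(replicates), len(replicates[0]["Aus"]))
```

Whole columns are resampled, so the character states of all taxa at a site stay together. The bootstrap proportion for a group is then the fraction of the m replicate trees in which that group appears.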
What can go wrong?
Sampling error (i.e., due to finite data)
– Assessed by, for example, the bootstrap
Systematic error (inconsistent method)
– Tests of the adequacy of models used
– Using different methods with different properties and comparing the results
Inadequate tree search (heuristics)
Reality
– A tree may be a poor model of the real history
– Information has been lost by subsequent evolutionary changes
“Species” vs. “gene” trees
What is wrong with this tree?
[Figure: tree of Canis, Mus and Gadus with support values of 100 and 100]
Negligible (within sequence) sampling error – high bootstrap values
Tree estimated by a consistent method
Gene duplication
“Species” tree
“Gene” trees
The expected tree…
Canis Mus Gadus Gadus Mus Canis
Two copies (paralogs) present in the genomes
Paralogous
Orthologous Orthologous
Canis Gadus Mus
What we have actually studied…
To detect a paralogy problem, several different genes can be used to infer the “species” phylogeny
To conclude
Phylogenetic inference deals with historical events and information transfer – the evolutionary history
Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown
Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion. A good place to start looking for software is http://evolution.genetics.washington.edu/phylip/software.html
There are several pitfalls to avoid when doing the analyses and when interpreting them – and most of the problems are data dependent…
But… Phylogenies have great explanatory power (they are the only means we have to predict properties of organisms), and ignoring the shared histories can sometimes give completely bogus results in comparative studies