current challenges for molecular phylogenetics barbara holland school of mathematics & physics...

Current challenges for molecular phylogenetics

Barbara Holland

School of Mathematics & Physics

University of Tasmania

Mostly statistical

Charles Darwin and Alfred Russel Wallace: evolution as descent with modification, implying relationships between organisms by unbroken genetic lines

Phylogenetics seeks to determine these genetic relationships

Darwin’s sketch: the first phylogenetic tree? Charles Darwin

Alfred Russel Wallace

Since the publication of On the Origin of Species in 1859, people have been trying to infer the evolutionary “tree of life”.

Ernst Haeckel’s Tree of Life (1866)

Haeckel’s Pedigree of Man

Why molecular phylogeny

• Most molecules evolve independently of adaptations affecting morphology.

• It is fairly easy to find genes that are present in all species of interest, e.g., a 12S RNA molecule in mitochondria is functional over all mammals.

• Useful mathematical models of sequence evolution have been developed that underpin attempts to infer evolutionary trees

[Figure: golden mole, mole, and whale, followed by hedgehog, tenrec, and elephant. Which belong to Laurasiatheria and which to Afrotheria?]

A brief and incomplete history of molecular phylogenetics

[Timeline figure running from the 60s through the 70s, 80s, and 90s to the 00s. Labels on the timeline:]

• Antibodies; DNA-DNA hybridisation (Sarich, Wilson)

• Sequence data (amino acid, then DNA)

• Parsimony; distance-based methods

• PCR

• Assessing support (bootstrap); explicit models (maximum likelihood); systematic bias (the Felsenstein Zone) (Felsenstein)

• More complex models; Bayesian methods

• MORE sequence data

• Population processes, gene trees in species trees

• Various perils... anomalous gene trees, non-identifiable models

The molecular phylogeny problem

[Figure: a tree drawn against a time axis. Sequences shown further back in time: ACCGCTTA, ACTGCTTA, ACCCCTTA. Modern-day sequences at the tips: …ACCCCTTA…, …ACCCCATA…, …ACTGCTTA…, …ACTGCTAA…]

We see the aligned modern-day sequences and want to recover the underlying evolutionary tree.

Sequence evolution is modelled as a Markov process

Consider a single edge in a phylogeny, i.e. evolution of a single species, and the evolution of a single DNA base amongst the possible states {A, C, G, T}.

The probability of mutating from state i to j over a length of time t depends only on the current state i and the potential future state j, not on any of the previous history of the sequence, and can be written pij(t).

[Figure: a single site changing state among A, C, G, and T along one edge over time t]

Continuous time Markov chains

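Transition matrix (rows and columns ordered A, C, G, T):

$$M = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}$$

Instantaneous rate matrix:

$$Q = \begin{pmatrix} -q_{A*} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & -q_{C*} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & -q_{G*} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & -q_{T*} \end{pmatrix}$$

where $q_{i*} = \sum_{j \neq i} q_{ij}$, i.e. rows sum to zero.

$$M = \exp(Qt)$$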

Typically we restrict to stationary, reversible models, with the stationary distribution denoted by π. So πQ = 0, and D(π)Q is symmetric, where D(π) is the diagonal matrix with the entries of π on its diagonal.
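The relationship M = exp(Qt) can be checked numerically. The sketch below is illustrative only: the function names (`k2p_rate_matrix`, `mat_exp`) and the rate values are assumptions, and the matrix exponential uses a simple scaling-and-squaring scheme with a truncated Taylor series rather than any particular package's routine.

```python
import math

STATES = "ACGT"

def k2p_rate_matrix(alpha, beta):
    """Rate matrix with transitions (A<->G, C<->T) at rate alpha and
    transversions at rate beta (hypothetical helper for illustration)."""
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(STATES):
        for j, b in enumerate(STATES):
            if i != j:
                q[i][j] = alpha if (a, b) in transitions else beta
        q[i][i] = -sum(q[i])  # diagonal entry makes the row sum to zero
    return q

def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(Q t): Taylor series of exp(Q t / 2^s), then square s times."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, a)
        term = [[v / k for v in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Transition probabilities along an edge of length t = 0.1 (arbitrary value).
M = mat_exp(k2p_rate_matrix(2.0, 0.5), 0.1)
```

Each row of `M` should sum to one, and with all rates equal the diagonal should match the Jukes Cantor closed form 1/4 + 3/4 exp(-4αt).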

Models of nucleotide substitution

• Jukes Cantor (JC)
– All substitutions equally likely
– Base frequencies equal

• Kimura 2 Parameter (K2P)
– Transitions and transversions at different rates
– Base frequencies equal

• HKY model
– Transitions and transversions at different rates
– Base frequencies different

• General Time Reversible (GTR)

[Figure: three diagrams of substitution rates among A, G, C, and T. In the first, every rate is α; in the second, the rates are α and β; in the third, the six rates are α, β, γ, δ, ε, and ζ.]
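As a rough illustration of how these models nest, the sketch below builds an HKY-style rate matrix and checks that it is reversible. The parameterisation (a transition/transversion ratio `kappa` multiplying the target base frequency) is one common convention and is assumed here; the function names and numerical values are hypothetical.

```python
STATES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def hky_rate_matrix(kappa, pi):
    """q_ij = kappa * pi_j for transitions, pi_j for transversions (assumed convention)."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(STATES):
        for j, b in enumerate(STATES):
            if i != j:
                rate = kappa if frozenset((a, b)) in TRANSITIONS else 1.0
                q[i][j] = rate * pi[j]
        q[i][i] = -sum(q[i])  # rows sum to zero
    return q

def is_reversible(q, pi, tol=1e-12):
    """Detailed balance: pi_i q_ij == pi_j q_ji, i.e. D(pi) Q is symmetric."""
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) < tol
               for i in range(4) for j in range(4))

pi = [0.1, 0.2, 0.3, 0.4]            # unequal base frequencies (HKY)
q_hky = hky_rate_matrix(2.0, pi)
q_k2p = hky_rate_matrix(2.0, [0.25] * 4)  # equal frequencies: K2P-like
q_jc = hky_rate_matrix(1.0, [0.25] * 4)   # equal rates and frequencies: JC-like
```

Setting `kappa = 1` with equal frequencies gives a single rate for every substitution, as in JC; allowing `kappa` to vary gives K2P; allowing the frequencies to vary as well gives HKY.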

Models define probability distributions on site patterns

The model θ consists of: the tree topology, edge weights, Q matrix*, and root distribution π.

*More generally, this could be a set of Q matrices

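[Figure: rooted tree with leaves 1, 2, 3. Leaves 1 and 2 attach to internal node x by edges with transition matrices M1 and M2; x attaches to the root y by M12; leaf 3 attaches to y by M3.]

Edge weights $t_1, t_2, t_3, t_{12}$

$$M_e = \exp(Q t_e)$$

$$p_{ijk} = \sum_{x,y} M_1(x,i)\, M_2(x,j)\, M_{12}(y,x)\, M_3(y,k)\, \pi(y)$$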
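The site-pattern formula can be evaluated directly by summing over the two unobserved states x and y. The sketch below assumes the Jukes Cantor model, with edge lengths measured in expected substitutions per site so that the transition probabilities have a closed form; the edge lengths used are arbitrary illustrative values.

```python
import math
from itertools import product

def jc_matrix(t):
    """Jukes Cantor transition matrix for an edge of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def pattern_probability(i, j, k, t1, t2, t12, t3):
    """p_ijk = sum over x, y of M1(x,i) M2(x,j) M12(y,x) M3(y,k) pi(y)."""
    m1, m2, m12, m3 = (jc_matrix(t) for t in (t1, t2, t12, t3))
    pi = [0.25] * 4  # JC root distribution
    return sum(m1[x][i] * m2[x][j] * m12[y][x] * m3[y][k] * pi[y]
               for x, y in product(range(4), repeat=2))

# States are indexed 0..3 for A, C, G, T; edge lengths are hypothetical.
edges = dict(t1=0.1, t2=0.1, t12=0.05, t3=0.2)
p_constant = pattern_probability(0, 0, 0, **edges)   # pattern AAA
p_all_diff = pattern_probability(0, 1, 2, **edges)   # pattern ACG
total = sum(pattern_probability(i, j, k, **edges)
            for i, j, k in product(range(4), repeat=3))
```

Summing over all 64 patterns should give one, which is a quick sanity check that the model defines a probability distribution on site patterns.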

Tree estimation using maximum likelihood

• For a given set of parameters θ we can calculate the probability of any particular site pattern.

• The overall probability of an alignment is then taken to be the product of the probabilities for each site (i.i.d. assumption).

• This is the likelihood function, i.e. the probability of the data given the model.

• We can then use optimisation techniques to find the model parameters (tree topology, edge lengths, parameters of the substitution model) that maximise the likelihood.
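A minimal sketch of the idea on the smallest possible case: two sequences, one parameter (the edge length between them), and the Jukes Cantor model. The two sequences are taken from the earlier example tree; the golden-section search is a swapped-in generic optimiser, not the method of any particular phylogenetics package, and real analyses also search over tree topologies.

```python
import math

def jc_log_likelihood(t, n_sites, n_diffs):
    """Log-likelihood of two aligned sequences separated by edge length t
    under Jukes Cantor, treating sites as i.i.d."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e
    return ((n_sites - n_diffs) * math.log(0.25 * p_same)
            + n_diffs * math.log(0.25 * p_diff))

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximise a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d
        else:
            lo = c
    return (lo + hi) / 2.0

seq1, seq2 = "ACCCCTTA", "ACTGCTAA"
n_sites = len(seq1)
n_diffs = sum(a != b for a, b in zip(seq1, seq2))

t_hat = golden_section_max(
    lambda t: jc_log_likelihood(t, n_sites, n_diffs), 1e-6, 5.0)

# For this special case the maximum is known in closed form (the JC distance).
t_closed = -0.75 * math.log(1.0 - 4.0 * n_diffs / (3.0 * n_sites))
```

The numerical optimum `t_hat` should agree with the closed-form Jukes Cantor distance `t_closed`; for larger trees no closed form exists and numerical optimisation is the only option.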

Extra features of sequence evolution that can be modelled

• Site to site rate variation (usually modelled by a gamma distribution)

• Invariant sites

BUT

• Some parts of reality are problematic…
– Base composition bias
– Sites that are free to vary change across the tree
– Non-independence of sites

Likelihood versus parsimony (the Felsenstein Zone)

Prior to the introduction of ML to the phylogenetics community by Joe Felsenstein, Maximum Parsimony (MP) was the most widely used method for estimating phylogenetic trees.

MP chooses the tree that requires the fewest mutations to explain the data

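[Figure: two alternative trees on taxa A, B, C, and D for a single site with states A and G at the tips, showing the mutations each tree requires]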
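Counting the mutations a tree requires for one site can be done with Fitch's algorithm. The sketch below is a minimal version for binary trees written as nested tuples; the site pattern (A, A, G, G) and the two candidate trees are illustrative choices rather than a transcription of the figure.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of mutations)
    for a binary tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state, so no extra mutation is needed here.
        return common, left_cost + right_cost
    # Otherwise at least one mutation separates the two subtrees.
    return left_set | right_set, left_cost + right_cost + 1

site = {"A": "A", "B": "A", "C": "G", "D": "G"}  # one alignment column

tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))

score_ab_cd = fitch(tree_ab_cd, site)[1]
score_ac_bd = fitch(tree_ac_bd, site)[1]
```

For this column, MP prefers the tree grouping A with B because it needs fewer mutations; the overall MP score of a tree is the sum of such counts over all columns.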

Likelihood versus parsimony (the Felsenstein Zone)

The MP criterion has been shown to be statistically inconsistent on some trees under the models of nucleotide substitution discussed previously. Likelihood is statistically consistent (given the correct model).

Felsenstein (1978) Hendy & Penny (1989)
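The inconsistency can be illustrated by computing expected site-pattern frequencies exactly. The sketch below makes simplifying assumptions: a symmetric two-state model instead of the four-state models above, and hypothetical per-edge change probabilities (0.4 on the two long edges, 0.05 elsewhere). It compares the expected support for the true split ab|cd with the support for the wrong split ac|bd.

```python
from itertools import product

def pattern_prob(pattern, p_a, p_b, p_c, p_d, p_int):
    """Probability of tip states (a, b, c, d) on the tree ((a,b),(c,d)) under a
    symmetric two-state model; each p_* is the chance of a change on that edge."""
    a, b, c, d = pattern

    def edge(p, x, y):
        return p if x != y else 1.0 - p

    total = 0.0
    for u, v in product((0, 1), repeat=2):  # states at the two internal nodes
        total += (0.5 * edge(p_a, u, a) * edge(p_b, u, b) * edge(p_int, u, v)
                  * edge(p_c, v, c) * edge(p_d, v, d))
    return total

def split_support(pattern, **edges):
    """Expected frequency of a pattern plus its complement (e.g. 0011 and 1100)."""
    complement = tuple(1 - s for s in pattern)
    return pattern_prob(pattern, **edges) + pattern_prob(complement, **edges)

# Long edges to a and c, short edges elsewhere (hypothetical values).
zone = dict(p_a=0.4, p_b=0.05, p_c=0.4, p_d=0.05, p_int=0.05)
true_split_zone = split_support((0, 0, 1, 1), **zone)   # supports ab|cd
wrong_split_zone = split_support((0, 1, 0, 1), **zone)  # supports ac|bd

# All edges short: parsimony-informative sites favour the true tree.
easy = dict(p_a=0.05, p_b=0.05, p_c=0.05, p_d=0.05, p_int=0.05)
true_split_easy = split_support((0, 0, 1, 1), **easy)
wrong_split_easy = split_support((0, 1, 0, 1), **easy)
```

With the long edges, sites supporting the wrong grouping of a with c are expected to outnumber sites supporting the true tree, so adding more data makes MP more confident in the wrong answer.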

Assessing confidence

• It is not just of interest to get a point estimate of the phylogenetic tree.

• We would also like some measure of confidence in our point estimate.
– Is our tree likely to change if we get more data?
– How robust is our result to sampling error?

• The bootstrap is a useful tool for answering these sorts of questions.

The bootstrap (Felsenstein 1985)

• For each bootstrap sample:
– Create a new alignment (of the same length as the original) by resampling the columns of the observed alignment
– Construct a tree for the ‘bootstrap’ alignment

• The bootstrap support for each edge is the number (often reported as a proportion) of bootstrap trees in which that edge appears.

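Observed alignment (columns 1234567):

  1234567
a ATATAAA
b ATTATAA
c TAAAATA
d TATAAAT

Bootstrap alignments (header shows which original column each position was drawn from):

  1224567
a ATTTAAA
b ATTATAA
c TAAAATA
d TAAAAAT

  1334567
a AAATAAA
b ATTATAA
c TAAAATA
d TTTAAAT

  1234567
a ATATAAA
b ATTATAA
c TAAAATA
d TATAAAT

  1244567
a ATTTAAA
b ATAATAA
c TAAAATA
d TAAAAAT

[Figure: the four bootstrap trees on taxa a, b, c, d, and a summary tree whose internal edge has bootstrap support 0.75]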
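The resampling step can be sketched in a few lines. The alignment is the four-taxon example above; the tree-building step is deliberately left out, and the splits passed to `edge_support` are hypothetical stand-ins for the output of whatever tree method is used.

```python
import random

alignment = {"a": "ATATAAA", "b": "ATTATAA", "c": "TAAAATA", "d": "TATAAAT"}

def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement, keeping the alignment length fixed.
    The same column indices are used for every taxon."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = {name: "".join(seq[c] for c in columns)
                 for name, seq in alignment.items()}
    return resampled, columns

def edge_support(bootstrap_splits, split):
    """Proportion of bootstrap trees (each given as a set of splits) containing `split`."""
    return sum(split in splits for splits in bootstrap_splits) / len(bootstrap_splits)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
resampled, columns = bootstrap_alignment(alignment, rng)

# Hypothetical outcome: three of four bootstrap trees contain the split {a, b}.
ab, ac = frozenset("ab"), frozenset("ac")
support = edge_support([{ab}, {ab}, {ab}, {ac}], ab)
```

Resampling whole columns, rather than individual characters, preserves the site patterns that carry the phylogenetic signal.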

Example where the bootstrap is useful

• Simulate data on the four taxon tree below (JC model)

• Use sequence lengths of 100, 1000, and 10000

Sequence length    100      1000     10000
((a,b),(c,d))      5.7%     97%      100%
((a,c),(b,d))      42.8%    <5%      0
((a,d),(b,c))      49.8%    <5%      0

[Figure: the generating tree on taxa a, b, c, d, with branch lengths 0.2 and 0.01]

Example where it is not so useful

• Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences

• Use total sequence lengths of 100, 1000, and 10000

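Sequence length    100      1000     10000
((a,b),(c,d))      64%      80%      98%
((a,c),(b,d))      33%      20%      <5
((a,d),(b,c))      3%       0%       <5

[Figure: the two generating trees, each with branch lengths 0.1 and 0.05. The tree with tip order a, b, c, d generates 55% of the sites; the tree with tip order a, c, b, d generates 45%.]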

Genome-scale phylogeny

• Data sets with many concatenated genes

– Rokas et al., Nature 2003 (106 genes, 8 taxa)

– Goremykin et al., MBE 2004 (61 genes, 14 taxa)

• Estimated trees have very high bootstrap support.

• BUT... trees are sensitive to: model used, method used, data-coding.

Case study: The Amborella Wars

[Figure labels: Angiosperms; Grasses; a New Caledonian shrub]

[Figure: NJ bootstrap with ML distances using a GTR + gamma model. x-axis: alpha (gamma shape parameter), 0.1 to 0.75, running from skewed rates to equal rates. y-axis: bootstrap support, 0 to 100. Labels: Amb+Nym, Grasses, Amb, Nym.]

Sensitivity to model choice

• Phylogenomic datasets may involve hundreds of genes for many species.

• These data sets create challenges for current phylogenetic methods, as different genes have different functions and hence evolve under different processes.

• One question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.

Example

Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa: C. albicans, S. kluyveri, S. kudriavzevii, S. bayanus, S. cerevisiae, S. paradoxus, S. mikatae, and S. castellii.

Two extremes

• How many parameters do we need to adequately represent the branches of all (unrooted) gene trees? Between 13 (consensus tree) and 13 x 106 = 1378.

• Too few parameters introduces bias

• Too many parameters increases the variance
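The two extremes follow from the branch count of a tree, assuming each gene tree is a fully resolved (binary) unrooted tree. Such a tree on $n$ taxa has $2n - 3$ branches, so for the 8 yeast taxa and 106 genes:

$$2n - 3 = 2 \cdot 8 - 3 = 13, \qquad 13 \times 106 = 1378.$$

The lower figure corresponds to one shared set of branch lengths for all genes, the upper to a separate set for every gene.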

Stochastic partitioning

• Attempts to cluster genes into classes that have evolved in a similar fashion.

• Each class is allowed its own set of parameters (e.g. branch lengths or model of nucleotide substitution)

Algorithm overview

1. Randomly assign the n genes to k classes.

2. Optimise parameters for each class.

3. Compute the posterior probability for each gene with the parameters from each class.

4. Move each gene into the class for which it has highest posterior probability.

5. Go to step 2. When no genes change class, STOP.
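The loop above can be sketched with a deliberately simple stand-in model. The assumptions are loud ones: each gene is summarised by a hypothetical (differences, length) pair, each class has a single parameter (a per-site difference probability), and classes are given equal prior weight so that the highest posterior probability is the highest likelihood. In the method described here, each class would instead carry branch lengths or substitution-model parameters.

```python
import math
import random

def log_lik(diffs, length, p):
    """Binomial log-likelihood (up to a constant) of a gene under class parameter p."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return diffs * math.log(p) + (length - diffs) * math.log(1.0 - p)

def refine(genes, labels, k, max_iter=100):
    """Steps 2 to 5: alternate parameter optimisation and reassignment."""
    labels = list(labels)
    params = [0.5] * k
    for _ in range(max_iter):
        # Step 2: the maximum likelihood estimate for a class is its pooled proportion.
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:
                params[c] = sum(d for d, _ in members) / sum(n for _, n in members)
        # Steps 3 and 4: move each gene to its best-fitting class.
        new_labels = [max(range(k), key=lambda c: log_lik(d, n, params[c]))
                      for d, n in genes]
        if new_labels == labels:  # Step 5: stop when no gene changes class.
            break
        labels = new_labels
    return labels, params

def stochastic_partition(genes, k, rng):
    """Step 1: random initial assignment, then refine."""
    return refine(genes, [rng.randrange(k) for _ in genes], k)

# Hypothetical data: three slowly evolving and three fast-evolving genes.
genes = [(2, 100), (3, 100), (1, 100), (30, 100), (28, 100), (33, 100)]
labels, params = stochastic_partition(genes, 2, random.Random(1))
```

Because step 1 is random, repeated runs with different seeds can end in different partitions, so in practice several starting points would be tried.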

How many classes?

Conclusions regarding stochastic partitioning

• Pros
– AIC/BIC give a quantitative method for choosing how many parameters are needed.
– Identifies groups of genes under similar constraints

• Cons
– Slow
– Randomized algorithm, so different starting points lead to different partitions.
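For reference, the standard definitions of the two criteria, writing $\hat{L}$ for the maximised likelihood, $p$ for the number of free parameters, and $n$ for the sample size (which quantity to use for $n$ with sequence data, for example the number of alignment sites, is an assumption the analyst has to make):

$$\mathrm{AIC} = 2p - 2\ln\hat{L}, \qquad \mathrm{BIC} = p\ln n - 2\ln\hat{L}.$$

Lower values are preferred, so adding a class is only worthwhile if the gain in log-likelihood outweighs the penalty for its extra parameters.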

Brief Tour…

• Combinatorics of tree space

• Graph Theory

• Stochastic Models, Inference & Probability Theory

• Algebraic Geometry

• Lie groups, representation theory

• …

Figure 2 Matsen and Steel (2007)

…the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the “correct” method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.

Identifiability

Elizabeth Allman John Rhodes

Algebraic geometry approach

The boundary of phylogenetics and population genetics

Fisher-Wright model Phylogenetic tree

Gene trees in species phylogenies

James Degnan Noah Rosenberg

Representation theory, Lie groups, Markov invariants, closure of model classes

Jeremy Sumner Peter Jarvis

http://www.maths.utas.edu.au/phylomania/phylomania2011.htm
