
Phylogenetic inference

or “How to recognize a tree from quite a long way away”

Mikael Thollesson

Evolutionary Biology Centre, Uppsala University

Slides are available on the course’s web page

“Bioinformation in the cell”

DNA

RNA

mRNA

protein

polypeptide

enzyme

transcription, splicing

translation

protein folding

coenzyme activation

“Extended bioinformation”

Original sense strand

Original anti-sense strand

Original sense strand

Original anti-sense strand

New sense strand

New anti-sense strand

Phylogeny from a Bioinformatic viewpoint

A phylogeny is the (event) history more or less exclusively shared by some kind of biological replicators

These replicators can in practice be, for example
– Species, populations, strains
– Genomes, genes

Phylogenies can usually be modelled as trees; phylogeny and phylogenetic tree have thus become more or less synonymous, even though they are not

The objective of phylogenetic analysis is to infer this history and these events, usually resulting in a phylogenetic hypothesis in the form of a tree (together with cosmology the only science dealing with particular histories)

000

010

111

110

GCCACTTTCGCGATCA

GCgACTTTCGCGATtA

GCgACTagCGCGATCA

GCgACTagCGCGATCA

GCCACaTTCcCGATCA

GCCACaTTCcCGAgCA

GCgACTTTCGCGATta

GCgACTTTCcCGATtA

GCgACTTTCGCGATCA

GCCACaTTCcCGATCA

Time

?GCgACTTTCGCGATta GCgACTTTCcCGATtA

GCgACTagCGCGATCA

GCCACaTTCcCGATCA

GCCACaTTCcCGAgCA

GCgACTTTCGC--Tta

Ordering the sequences hierarchically after shared evolutionary novelties, synapomorphies, produces a phylogenetic hypothesis (tree)

We cannot distinguish between novelties and ancestral states; we just see the differences

Parallel substitutions and multiple substitutions at the same site create ambiguities about the hierarchy

We must make some a priori assumption of homology – for sequences, this is the same as doing a multiple alignment

              limbs  amnion
Lion          yes    yes
Bald eagle    yes    yes
Bullfrog      yes    no
Cod           no     no
White shark   no     no

Lion

Bald eagle

Bullfrog

Cod

White shark

Characters

Character states

Taxa, terminal units

Lineus geniculatus       TGGGCTGGGATGAAGGGAAGTATCGTGGGCCCGG
Micrura akkeshiensis     GGGGCTAGAATGAATGGGA-TAACGAGCCCCCGA
Myoisophagus versicolor  GGGGCTAGAATGAAAGAAA-GTTTGAGACCTCAT
Parvicirrus dubius       GGGACTGGAATGAAAGAAA-TTTTGAGGCCTTAA

1. Gather data from the entities whose phylogeny we are interested in

2. Select a criterion to evaluate how well each possible tree fits the observed data

[Figure: the 15 possible rooted trees for the four taxa (Pd, Mv, Ma, Lg), each with its score under the chosen criterion; the scores are 22, 26 or 27, five trees each.]

Mi. akkeshiensis

Li. geniculatus

Pa. dubius

My. versicolor

3. Find the tree that best fits the data and choose it to be the preferred hypothesis

4. Evaluate the sampling variation in the data to see if you have enough support for your conclusion

95%

Why do phylogenetics? – Prediction

Prospective biomedical compounds from sponges (Porifera)

Treatment of microsporidia

Gauging biodiversity for conservation

“Taxa are not related because of similarity, but similar due to relatedness”

Why? –Sequence of evolutionary events

Why the oaks retain their leaves in contrast to other deciduous trees

Evergreens

Evolution of metabolic pathways

Tracing infection histories for virus

Why? – (Ab)use of comparative method

Correlation between ability to fly and being black and white

Species, populations, or genes (i.e., entities corresponding to replicators) are not independent samples/observations since they have a more or less inclusively shared history

Trees and terminology

A

B

C

D

branch or edge

node or vertex

Terminal nodes (external vertices) represent taxa or genes on which we have observations

Internal vertices represent inferred splitting events (may be interpreted as ancestral species or gene copies)

Unrooted vs. rooted trees

D C A B

clade

Rooting is normally done using a designated outgroup

D D

C

BA

e1e2

e6e3

e4 e5

A

B

C

D

X is defined to be more closely related to Y than to Z if and only if X shares a (more recent) history with Y that it does not share with Z

D C A BB A C D

Relatedness

“The standard recipe” for phylogenetic inference

Collect your data
Select an optimality criterion (“Which tree is better”?)
Optional: do data transformations (“corrections”)
Select a search strategy and find the best hypothesis (according to selected criterion) using this search method
Assess the variation in your data in some way

There are really only two big theoretical problems in phylogenetic inference…
– The criterion and calculating the score
– Finding the best tree

Step 1 – Data collection

Any observation of inherited traits is in principle useful

Primary homology assessment - from traits to characters and character states; for sequence data this corresponds to alignment

Pair-wise differences (e.g., DNA-DNA hybridization, histocompatibility) can also be used, although with a limited set of criteria

Include one or several outgroups for rooting

Step 2 – Optimality criteria, some selected

Criterion             Data                     Model based
(Maximum) parsimony   Discrete                 No
Maximum likelihood    Discrete or continuous   Yes
Minimum evolution     Pairwise distances       Optional*

Assumptions shared by (almost) all optimality criteria/methods

Characters are independent (and thus the order in the data matrix does not matter)
– Special models for e.g., rRNA and codons

The substitution process is homogeneous over time/in the entire tree (overall rate can vary)
– Special models do not make this assumption

Substitution rates are the same for all characters
– Rate variation among characters can be accommodated easily in most methods

Parsimony optimality criterion

Given two trees, the one requiring the lowest number of character changes necessary to explain the observed character distribution is the better
– Parsimony score for a tree is the minimum number of required changes
– This score is frequently referred to as number of steps or tree length
– The method can be modified using non-uniform weights
   Character weights (positional weights)
   Character state weights (transformational weights)

Parsimony – an example

acgtatgga
acgggtgca
aacggtgga
aactgtgca

[Figure: one character (states c, c, a, a) mapped on each of the three possible trees for the four sequences.]

Total tree length: 7   Total tree length: 8   Total tree length: 8

$$L(\tau) = \sum_{j=1}^{c} l_j w_j$$

$$l_j = \min \sum_{k=1}^{2n-3} c_{a(k),b(k)}$$
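The tree lengths in the parsimony example (7, 8 and 8) can be checked with a small script. The sketch below is not from the slides: it uses Fitch's algorithm, which assumes unordered characters with equal cost for every change, and the names `s1` to `s4` are labels invented here for the four example sequences. The optional `weights` argument plays the role of the character weights in the tree-length sum.

```python
# Labels s1..s4 are hypothetical names for the four sequences on the slide.
SEQS = {
    "s1": "acgtatgga",
    "s2": "acgggtgca",
    "s3": "aacggtgga",
    "s4": "aactgtgca",
}


def fitch_site(tree, states):
    """Return (possible states, minimum changes) for one site.

    `tree` is either a leaf name or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, seqs, weights=None):
    """Sum the per-site minimum changes, optionally weighted per character."""
    n_sites = len(next(iter(seqs.values())))
    weights = weights or [1] * n_sites
    total = 0
    for j in range(n_sites):
        states = {name: seq[j] for name, seq in seqs.items()}
        _, changes = fitch_site(tree, states)
        total += weights[j] * changes
    return total


# The three possible unrooted trees for four taxa, rooted arbitrarily
# (the position of the root does not affect the Fitch count).
TREES = {
    "((s1,s2),(s3,s4))": (("s1", "s2"), ("s3", "s4")),
    "((s1,s3),(s2,s4))": (("s1", "s3"), ("s2", "s4")),
    "((s1,s4),(s2,s3))": (("s1", "s4"), ("s2", "s3")),
}
LENGTHS = {name: tree_length(tree, SEQS) for name, tree in TREES.items()}

for name, length in LENGTHS.items():
    print(name, length)
```

Under this criterion the tree grouping the first two sequences is preferred, since it needs the fewest changes.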

Using substitution models – Why?

Observed differences

Actual changes

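Example: Jukes-Cantor model

$$
Q = \begin{bmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{bmatrix}
$$

$$
p_{ij}(t) = \begin{cases}
\tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}, & \text{if } i = j \\
\tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}, & \text{if } i \neq j
\end{cases}
$$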

Jukes-Cantor is the simplest model in a class of time-reversible models for DNA, the most general of which is the general time-reversible (GTR) model

GTR (most complex symmetric model) has six different rates (one for each pair of bases) and different base frequencies

A G

TC

$$P(t) = e^{Qt}$$
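As a sanity check on the Jukes-Cantor model, the matrix exponential of Qt can be compared numerically with the closed-form transition probabilities (1/4 + 3/4·e^(−4αt) for no change, 1/4 − 1/4·e^(−4αt) for a change to a given other base). The sketch below is an illustration only: it approximates the exponential with a truncated Taylor series in plain Python, which is adequate for small values of αt but is not how phylogenetic software would normally do it. The values α = 1 and t = 0.1 are arbitrary choices.

```python
import math


def jc_rate_matrix(alpha):
    """4x4 Jukes-Cantor rate matrix: alpha off the diagonal, -3*alpha on it."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def jc_closed_form(alpha, t):
    """Closed-form Jukes-Cantor transition probabilities."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay   # probability of ending in the same base
    diff = 0.25 - 0.25 * decay   # probability of ending in one given other base
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(q, t, terms=30):
    """Approximate e^(Qt) as the sum of (Qt)^k / k! for k < terms."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term now holds (Qt)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


ALPHA, T = 1.0, 0.1  # arbitrary illustrative values
P_SERIES = expm_taylor(jc_rate_matrix(ALPHA), T)
P_CLOSED = jc_closed_form(ALPHA, T)
MAX_DIFF = max(abs(P_SERIES[i][j] - P_CLOSED[i][j])
               for i in range(4) for j in range(4))
print("largest difference:", MAX_DIFF)
```

Each row of either matrix sums to 1, as expected for transition probabilities.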

Pair-wise distances – an example

acgtatggac
acgggtgcac
aacggtggac
aactgtgcac

0    0.3  0.4  0.4
0.3  0    0.3  0.3
0.4  0.3  0    0.2
0.4  0.3  0.2  0

p = 2/10 = 0.2 (p distance)

$$d_{ij} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p_{ij}\right)$$ – Jukes-Cantor distance
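The distance matrix in this example can be recomputed directly from the four sequences. The following is a minimal sketch, with function names chosen here for illustration: it computes the p distance (proportion of differing sites) for every pair and then applies the Jukes-Cantor correction d = −3/4 · ln(1 − 4p/3).

```python
import math

SEQS = ["acgtatggac", "acgggtgcac", "aacggtggac", "aactgtgcac"]


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc_distance(p):
    """Jukes-Cantor corrected distance for an observed p distance."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


P = [[p_distance(a, b) for b in SEQS] for a in SEQS]
D = [[jc_distance(p) for p in row] for row in P]

for row in P:
    print("  ".join(f"{value:.1f}" for value in row))
for row in D:
    print("  ".join(f"{value:.4f}" for value in row))
```

The corrected distances are always at least as large as the p distances, reflecting substitutions that are hidden by multiple changes at the same site.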

Minimum evolution optimality criterion

Starts by calculating pair-wise distances between all terminal taxa/sequences

– These calculations can incorporate explicit substitution models, e.g., Jukes-Cantor

Given two trees, the one having the lowest sum of branch lengths when fitted to the data, is the better

One way to fit the data is to use the constraints below; another is a least-squares approximation
– No branch can have negative length, eij ≥ 0
– The path between two terminals along the tree is at least as long as the pair-wise distance, eij ≥ dij

The score is commonly referred to as tree length (as for parsimony)

Maximum likelihood optimality criterion

Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better
– Site likelihood is the conditional probability of the data at one site (one character) given the assumed model of evolution and parameters of the model
– Data set likelihood is the product of the site likelihoods (character independence)

Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model

The model consists of
– A substitution model, e.g. Jukes-Cantor
– A tree with branch lengths

$$L_H \propto P(D \mid H) = P(D \mid T, \Theta)$$

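Likelihood of a one-branch tree

Taxon1  AC
Taxon2  CC

[Figure: Taxon1 (AC) and Taxon2 (CC) joined by a single branch of length t.]

For Jukes-Cantor!

$$L_1 = \Pr(A)\,\Pr(A \to C \mid \alpha t) = \frac{1}{4}\, p_{AC}(t) = \frac{1}{4} \left[ \frac{1}{4} - \frac{1}{4} e^{-4\alpha t} \right]$$

$$L_2 = \Pr(C)\,\Pr(C \to C \mid \alpha t) = \frac{1}{4}\, p_{CC}(t) = \frac{1}{4} \left[ \frac{1}{4} + \frac{3}{4} e^{-4\alpha t} \right]$$

Ltot = L1·L2, or log Ltot = log L1 + log L2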

Another one-branch tree

30 nucleotides from -globin genes of two primates on a one-edge tree

            * *
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG

There are two differences and 28 similarities

$$L = \left[ \frac{1}{16} \left( 1 - e^{-4\alpha t} \right) \right]^{2} \left[ \frac{1}{16} \left( 1 + 3 e^{-4\alpha t} \right) \right]^{28}$$

[Figure: lnL plotted against t for t from 0 to 0.1; the maximum is at t = 0.02327, lnL = -51.133956.]
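The maximum reported for the gorilla/orangutan example (t = 0.02327, lnL ≈ −51.134) can be reproduced with a few lines of code. This is a sketch under one assumption not stated on the slide: α is fixed at 1, so that t is measured in units of αt. It evaluates the log-likelihood for 2 differing and 28 identical sites, finds the maximum in closed form, and cross-checks it with a simple grid search.

```python
import math


def ln_likelihood(t, n_diff=2, n_same=28, alpha=1.0):
    """Log-likelihood of a one-branch tree under Jukes-Cantor."""
    decay = math.exp(-4.0 * alpha * t)
    site_diff = (1.0 - decay) / 16.0        # 1/4 * (1/4 - 1/4 e^{-4 alpha t})
    site_same = (1.0 + 3.0 * decay) / 16.0  # 1/4 * (1/4 + 3/4 e^{-4 alpha t})
    return n_diff * math.log(site_diff) + n_same * math.log(site_same)


def ml_branch_length(n_diff=2, n_same=28, alpha=1.0):
    """Closed-form maximiser.

    Setting the derivative of the log-likelihood to zero gives
    e^{-4 alpha t} = 1 - 4p/3, where p is the observed proportion
    of differing sites.
    """
    p = n_diff / (n_diff + n_same)
    return -math.log(1.0 - 4.0 * p / 3.0) / (4.0 * alpha)


T_HAT = ml_branch_length()
LNL_HAT = ln_likelihood(T_HAT)

# Independent check: evaluate lnL on a fine grid over (0, 0.1].
GRID = [k / 100000 for k in range(1, 10001)]
T_GRID = max(GRID, key=ln_likelihood)

print(f"t = {T_HAT:.5f}, lnL = {LNL_HAT:.6f}, grid maximum at t = {T_GRID:.5f}")
```

For a tree with more than one branch no such closed form exists in general, and the branch lengths have to be optimised numerically for every candidate tree, which is a large part of why maximum likelihood is computationally demanding.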

Step 3 – Finding the best tree

Number of (rooted) trees for n terminals is (2n-3)·(2n-5)·(2n-7)…3·1

– 3 taxa -> 3 trees
– 4 taxa -> 15 trees
– 10 taxa -> 34 459 425 trees
– 25 taxa -> 1.19·10^30 trees
– 52 taxa -> 2.75·10^80 trees
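The product formula for the number of rooted trees is easy to evaluate exactly with Python's arbitrary-precision integers. The function name below is chosen here for illustration.

```python
def count_rooted_trees(n):
    """Number of rooted binary trees for n terminals: (2n-3)(2n-5)...3*1."""
    if n < 2:
        raise ValueError("need at least two terminals")
    count = 1
    for factor in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= factor
    return count


for n in (3, 4, 10, 25, 52):
    print(n, f"{count_rooted_trees(n):.3e}")
```

The count grows faster than exponentially in n, which is why exhaustive evaluation of every tree is only practical for small numbers of taxa.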

Finding the optimal tree is an NP-complete or NP-hard problem

Search strategies
– Exact: will find the best (according to selected criterion) tree
   Exhaustive – up to ca 10 taxa
   Branch and bound – up to ca 15 taxa
– Heuristic: limits the search to a “reasonable” set of trees. May not find the optimal tree

Heuristic tree searches usually start with hill climbing (greedy algorithms) to obtain a starting tree
– Star decomposition
– Stepwise addition
and proceed with some flavour of branch swapping to improve on the starting tree and find better trees

Heuristic tree search – Star decomposition

[Figure: star decomposition for taxa A–E, starting from an unresolved star tree and joining taxa step by step.]

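Heuristic tree search – Stepwise addition

[Figure: stepwise addition. Starting from the three-taxon tree (A, B, C), taxon D is added in each possible position (scores 831, 783, 837), and taxon E is then added in each possible position (scores 921, 914, 915, 905, 916).]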

Heuristic tree search – Branch swapping

[Figure: branch swapping on a tree of taxa A–I, with rearrangements labelled SPR (subtree pruning and regrafting) and TBR (tree bisection and reconnection).]

Step 2+3 – A dirty shortcut to get a tree…

Instead of evaluating each tree, some methods build a tree using a specific algorithm, usually from pair-wise distances

Neighbor-joining is one such method that is widely used
– NJ can roughly be viewed as a star decomposition minimizing the sum of branch lengths (evolutionary change)

What is a “good” method?

Efficiency – Time to find a/the solution

Power – Rate of convergence/how much data are needed

Consistency – Convergence to “correct” solution as data are added

Robustness – Performance when assumptions are violated

Falsifiability – Rejection of the model when it is inadequate

Performance on simulated data

[Figure: frequency of correct inference (0 to 1) plotted against sequence length (10 to 100 000) in two panels, labelled “All 0.50” and “0.30 and 0.05 respectively”. Methods compared: Lake's invariants; parsimony, uniform; parsimony, weighted; UPGMA, Kimura; NJ, Kimura; NJ, percentage; ML, Kimura.]

Some pros and cons of selected methods

Pair-wise, algorithmic approach (e.g. Neighbor-joining)

+ Fast

+ Models can be used when transforming to distances

- Information is lost when transforming to pair-wise distances

- One will get a tree, but no measure of goodness to compare with other hypotheses (when using algorithmic methods like NJ)

Parsimony

+ Philosophically appealing – Occam’s razor (no unnecessary assumptions)

+ Can be applied to most kinds of data without prior knowledge

- Can be inconsistent

- Can be computationally slow

Maximum likelihood

+ Model based; enables statistical tests and handles problems with multiple substitutions

- Model based; models can be inadequate and give misleading results

- Computationally veeeeery slooooowww

Step 4 – Assessing the variation in the data

Variation can not be assessed by repeated sampling from the statistical population – we have a unique sample

We have to rely on resampling from the data already at hand
– Jack-knife – resampling without replacement
– Bootstrap – resampling with replacement

       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Aus    C  G  A  C  G  G  T  G  G  T  C  T  A  T  A  C  A  C  G  A
Beus   C  G  G  C  G  G  T  G  A  T  C  T  A  T  G  C  A  C  G  G
Ceus   T  G  G  C  G  G  C  G  T  C  T  C  A  T  A  C  A  A  T  A
Deus   T  A  A  C  G  A  T  G  A  C  C  C  G  A  C  T  A  T  T  G

Original data set with n characters.

       2  3 13  8  3 19 14  6 20 20  7  1  9 11 17 10  6 14  8 16
Aus    G  A  A  G  A  G  T  G  A  A  T  C  G  C  A  T  G  T  G  C
Beus   G  G  A  G  G  G  T  G  G  G  T  C  A  C  A  T  G  T  G  C
Ceus   G  G  A  G  G  T  T  G  A  A  C  T  T  T  A  C  G  T  G  C
Deus   A  A  G  G  A  T  A  A  G  G  T  T  A  C  A  C  A  A  G  T

Draw n characters randomly with replacement. Repeat m times.

m pseudo-replicates, each with n characters.

Original analysis, e.g. MP, ML, NJ.

Repeat original analysis on each of the pseudo-replicate data sets.

Evaluate the results from the m analyses.

[Figure: trees for Aus, Beus, Ceus and Deus from the original analysis and from each pseudo-replicate; one group in the summary tree is labelled 75%.]

Bootstrap

Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support

Values below 0.5 are nonsense
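The resampling step of the bootstrap can be sketched in a few lines. The code below is an illustration, not a full bootstrap analysis: it only builds pseudo-replicate data sets by drawing columns with replacement, and leaves the tree inference on each replicate (MP, ML, NJ and so on) to whatever method was used for the original analysis. Function names and the fixed random seed are choices made here; the data and the column order of the example replicate are taken from the bootstrap slide.

```python
import random

DATA = {
    "Aus":  "CGACGGTGGTCTATACACGA",
    "Beus": "CGGCGGTGATCTATGCACGG",
    "Ceus": "TGGCGGCGTCTCATACAATA",
    "Deus": "TAACGATGACCCGACTATTG",
}


def resample_columns(data, columns):
    """Build a data set from the given 0-based column indices."""
    return {name: "".join(seq[c] for c in columns)
            for name, seq in data.items()}


def bootstrap_replicates(data, m, seed=0):
    """Return m pseudo-replicates, each with n columns drawn with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = len(next(iter(data.values())))
    replicates = []
    for _ in range(m):
        columns = [rng.randrange(n) for _ in range(n)]
        replicates.append(resample_columns(data, columns))
    return replicates


# The pseudo-replicate shown on the slide, using its (1-based) column draw.
SLIDE_COLUMNS = [2, 3, 13, 8, 3, 19, 14, 6, 20, 20,
                 7, 1, 9, 11, 17, 10, 6, 14, 8, 16]
SLIDE_REPLICATE = resample_columns(DATA, [c - 1 for c in SLIDE_COLUMNS])

REPLICATES = bootstrap_replicates(DATA, m=100)

for name, seq in SLIDE_REPLICATE.items():
    print(f"{name:5s}{seq}")
```

The bootstrap proportion for a group is then the fraction of the m replicate trees in which that group appears.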

What can go wrong?

Sampling error (i.e., due to finite data)
– Assessed by, for example, the bootstrap

Systematic error (inconsistent method)
– Tests of the adequacy of models used
– Using different methods with different properties and comparing the results

Inadequate tree search (heuristics)

Reality
– A tree may be a poor model of the real history
– Information has been lost by subsequent evolutionary changes

“Species” vs. “gene” trees

Canis Mus Gadus

What is wrong with this tree?

Negligible (within sequence) sampling error – high bootstrap values

100

100

Tree estimated by a consistent method

Gene duplication

“Species” tree

“Gene” trees

The expected tree…

Canis Mus Gadus Gadus Mus Canis

Two copies (paralogs) present in the genomes

Paralogous

Orthologous Orthologous

Canis Gadus Mus

What we have actually studied…

To detect a paralogy problem, several different genes can be used to infer the “species” phylogeny

To conclude
– Phylogenetic inference deals with historical events and information transfer – the evolutionary history
– Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown
– Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion. A good place to start looking for software is http://evolution.genetics.washington.edu/phylip/software.html
– There are several pitfalls to avoid when doing the analyses and when interpreting them – and most of the problems are data dependent…

But… Phylogenies have great explanatory power (the only means we have to predict properties of organisms), and ignoring the shared histories can sometimes give completely bogus results in comparative studies
