Intro. To Phylogenetic Analysis

Slides modified by David Ardell

From Caro-Beth Stewart, Paul Higgs, Joe Felsenstein and Mikael Thollesson

C-B Stewart, NHGRI lecture, 12/5/00

What is phylogenetic analysis and why should we perform it?

Phylogenetic analysis has two major components:

1. Phylogeny inference or “tree building”: the evolutionary relationships between genes or species

2. Character and rate analysis: mapping information onto trees


Common Phylogenetic Tree Terminology

[Figure: a rooted tree of taxa A, B, C, D and E with the following parts labelled.]

• Ancestral Node or ROOT of the Tree
• Internal Nodes (represent hypothetical ancestors of the taxa)
• Branches or Lineages
• Terminal Nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny

[Figure: tree of taxa A, B, C and D with a CLADE marked.]

X and Y are defined to be more closely related to each other than to Z if, and only if, they share a more recent common ancestor than they do with Z

[Figure: the same tree of taxa A, B, C and D drawn several times, with the branches rotated around the internal nodes.]

All of these rearrangements show the same evolutionary relationships between the taxa


Three types of trees

All show the same branching orders between taxa.

• Cladogram: groupings (the axis has no meaning)
• Phylogram: groupings + distance (branch lengths show evolutionary distance; 1, 1, 1, 6, 3 and 5 in the example)
• Ultrametric tree: groupings + time (the axis is time)

[Figure: Taxon A, Taxon B, Taxon C and Taxon D drawn as a cladogram, a phylogram and an ultrametric tree.]


Similarity vs. Evolutionary Relationship:

Since taxa evolve at different rates, your closest relative could be very different

[Figure: phylogram of Taxon A, Taxon B, Taxon C (think lamprey) and Taxon D with branch lengths 1, 1, 1, 6, 3 and 5.]

C is closer to A but more closely related to B

This is why the closest BLAST hit is not necessarily the closest relative, and why you need to make trees.
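The difference between "closest" and "most closely related" can be checked with a few lines of code. This is only a sketch on a hypothetical tree: the assignment of the branch lengths 1, 1, 1 and 6 to particular branches, and the node names `root` and `X`, are assumptions made for illustration.

```python
# Hypothetical rooted tree, written as child -> (parent, branch length).
# Which branch carries which length is an assumption for illustration.
PARENT = {
    "A": ("root", 1),
    "X": ("root", 1),   # X is the common ancestor of B and C
    "B": ("X", 6),      # B has evolved fast
    "C": ("X", 1),
}


def distances_to_ancestors(node):
    """Map node and each of its ancestors to the path length from node."""
    dist = {node: 0}
    total = 0
    while node in PARENT:
        node, length = PARENT[node]
        total += length
        dist[node] = total
    return dist


def mrca_and_distance(a, b):
    """Return the most recent common ancestor of a and b and their path distance."""
    up_a = distances_to_ancestors(a)
    up_b = distances_to_ancestors(b)
    shared = [n for n in up_a if n in up_b]
    mrca = min(shared, key=lambda n: up_a[n] + up_b[n])
    return mrca, up_a[mrca] + up_b[mrca]


print(mrca_and_distance("C", "A"))  # common ancestor is the root
print(mrca_and_distance("C", "B"))  # common ancestor is X, which is more recent
```

On this toy tree C is at distance 3 from A and 7 from B, yet C shares its most recent common ancestor with B.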

Types of Similarity

Observed similarity between two entities can be due to:

Evolutionary relationship:
• Shared ancestral characters (‘plesiomorphies’)
• Shared derived characters (‘synapomorphies’)

Homoplasy (independent evolution of the same character):
• Convergent events
• Parallel events
• Reversals

[Figure: small trees with the nucleotide states C, G and T at their nodes, illustrating these cases.]


A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:

• Which species are the closest living relatives of modern humans?

• Did the infamous Florida Dentist infect his patients with HIV?

• What were the origins of specific transposable elements?

Which species are the closest living relatives of modern humans?

[Figure: two trees of humans, chimpanzees, bonobos, gorillas and orangutans. Classical view: time axis from 15-30 MYA to 0. Molecular view: time axis from 14 MYA to 0.]

Did the Florida Dentist infect his patients with HIV?

Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:

[Figure: tree with tips for the DENTIST (two sequences), Patients A to G, and Local controls 2, 3, 9 and 35. One group of patient sequences is marked "Yes" and two groups are marked "No".]

Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

From Ou et al. (1992) and Page & Holmes (1998)


Uses of character mapping:

• Dating adaptive evolutionary events

• Ancestral reconstruction

• Testing biological hypotheses of correlated function or change

Ex: Where geographically was the common ancestor of African apes and humans?

[Figure: trees of the living species (OW Monkeys, Gibbons, Orangutans, Gorillas, Chimpanzees, Humans) and of living + fossil species (adding Proconsul, Kenyapithecus, Oreopithecus, Lufengpithecus, Dryopithecus, Ouranopithecus), drawn under Scenario A: Africa as species fountain, and Scenario B: Eurasia as ancestral homeland. Eurasia = Black, Africa = Red; dispersal events are marked.]

Scenario B requires four fewer dispersal events

Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R582-588.


Building Trees

                          COMPUTATIONAL METHOD
DATA TYPE     Clustering algorithm       Optimality criterion
Characters    -                          PARSIMONY, MAXIMUM LIKELIHOOD
Distances     UPGMA, NEIGHBOR-JOINING    MINIMUM EVOLUTION, LEAST SQUARES
Types of data:

Character-data:

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG

Distance-based data: pairwise distances (dissimilarities)

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----

Upper triangle: uncorrected “p” distance

Lower triangle (Example 2): Kimura 2-parameter distance
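Both kinds of distance can be recomputed from the character data. The sketch below is illustrative rather than a reference implementation; the function names are my own, and the Kimura 2-parameter formula used is the standard one, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), with P and Q the proportions of transitions and transversions.

```python
import math

# The five sequences from the character-data example.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(s1, s2):
    """Uncorrected distance: the proportion of sites that differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance from transition/transversion proportions."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(s1)
    Q = transversions / len(s1)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


for x, y in [("A", "B"), ("A", "C"), ("C", "D")]:
    p = p_distance(ALIGNMENT[x], ALIGNMENT[y])
    k = k2p_distance(ALIGNMENT[x], ALIGNMENT[y])
    print(x, y, round(p, 2), round(k, 2))
```

For the pairs A-B, A-C and C-D this gives 0.20/0.23, 0.50/0.87 and 0.15/0.17, matching the corresponding entries of the distance matrix. The correction grows with the observed difference, which is the point of using a model.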



Parsimony

Given two trees, the one requiring the lowest number of character changes to explain the observations is the better

– Parsimony score for a tree is the minimum number of required changes

– This score is frequently referred to as number of steps or tree length

Parsimony – an example

acgtatgga
acgggtgca
aacggtgga
aactgtgca

[Figure: the three possible trees for the four sequences, with the states (c, c, a, a) of one site shown at the tips.]

Total tree length: 7, 8 and 8 for the three trees
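The three tree lengths can be reproduced by counting changes site by site. The sketch below uses Fitch's algorithm, which is one standard way to get the minimum number of changes on a given tree; the slides do not say which counting method was used, and the tip labels `s1` to `s4` are hypothetical names for the four sequences in the order listed.

```python
# The four example sequences; the labels s1..s4 are hypothetical.
SEQS = {
    "s1": "acgtatgga",
    "s2": "acgggtgca",
    "s3": "aacggtgga",
    "s4": "aactgtgca",
}

# The three possible groupings of four taxa, as nested tuples.
# The root position is arbitrary: it does not affect the parsimony length.
TREES = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]


def fitch(tree, site):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):          # a tip: its state is observed
        return {site[tree]}, 0
    left_states, left_steps = fitch(tree[0], site)
    right_states, right_steps = fitch(tree[1], site)
    common = left_states & right_states
    if common:                         # children agree: no extra change
        return common, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1


def tree_length(tree, seqs):
    """Sum the minimum number of changes over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


for tree in TREES:
    print(tree, tree_length(tree, SEQS))
```

The tree that groups the first two sequences together has length 7 and the other two have length 8, so it is the most parsimonious of the three.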



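Using models

Observed differences

Actual changes

[Figure: the four nucleotides A, G, C and T with the substitutions between them.]

Example: Jukes-Cantor

$$
Q =
\begin{bmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{bmatrix}
$$

(rows and columns: A, C, G, T)

$$
p_{ij}(t) =
\begin{cases}
\frac{1}{4} + \frac{3}{4} e^{-4\alpha t}, & \text{if } i = j \\
\frac{1}{4} - \frac{1}{4} e^{-4\alpha t}, & \text{if } i \neq j
\end{cases}
$$

Likelihood of a one-branch tree…

30 nucleotides from -globin genes of two primates on a one-edge tree

            * *
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG

There are two differences and 28 similarities

$$
L = \left[ \frac{1}{16} \left( 1 - e^{-4\alpha t} \right) \right]^{2}
    \left[ \frac{1}{16} \left( 1 + 3 e^{-4\alpha t} \right) \right]^{28}
$$

[Figure: lnL (axis from -55.0 to -50.5) plotted against t (axis from 0 to 0.1).]

t = 0.02327, lnL = -51.133956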
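The maximum of this likelihood can be recomputed directly. The sketch below assumes α = 1, so that t is measured in units of αt; that assumption is mine, chosen because it reproduces the reported value of t. The closed-form estimate comes from setting the expected proportion of differences, 3/4 (1 - e^{-4αt}), equal to the observed proportion 2/30.

```python
import math

GORILLA = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
ORANGUTAN = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def log_likelihood(t, n_diff, n_same, alpha=1.0):
    """ln L for two sequences on a one-branch tree under Jukes-Cantor."""
    x = math.exp(-4.0 * alpha * t)
    site_diff = (1.0 - x) / 16.0        # 1/4 * p_ij for i != j
    site_same = (1.0 + 3.0 * x) / 16.0  # 1/4 * p_ii
    return n_diff * math.log(site_diff) + n_same * math.log(site_same)


def ml_branch_length(n_diff, n_same, alpha=1.0):
    """Closed-form maximum-likelihood estimate of t under Jukes-Cantor."""
    p = n_diff / (n_diff + n_same)
    return -math.log(1.0 - 4.0 * p / 3.0) / (4.0 * alpha)


n_diff = sum(a != b for a, b in zip(GORILLA, ORANGUTAN))
n_same = len(GORILLA) - n_diff

t_hat = ml_branch_length(n_diff, n_same)
lnL_hat = log_likelihood(t_hat, n_diff, n_same)

# Cross-check the closed form with a crude grid search over t in (0, 0.1].
grid = [i / 10000 for i in range(1, 1001)]
t_grid = max(grid, key=lambda t: log_likelihood(t, n_diff, n_same))

print(n_diff, n_same, round(t_hat, 5), round(lnL_hat, 4), t_grid)
```

A grid search is enough here because there is a single parameter; with many branches and a tree to search over, the same calculation has to be repeated for every candidate tree.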

A recipe for phylogenetic inference

• Collect your data
• Select an optimality criterion (“which tree is better?”, tree score)
• Optional: do data transformation (“corrections”)
• Select a search strategy to find the best tree
• Find the best hypothesis according to that criterion
• Assess the variation in your data in some way

Finding the best tree

Number of (rooted) trees
– 3 taxa -> 3 trees
– 4 taxa -> 15 trees
– 10 taxa -> 34 459 425 trees
– 25 taxa -> 1.19·10^30 trees
– 52 taxa -> 2.75·10^80 trees

Finding the optimal tree is an NP-complete problem

Search strategies
• Exact
  – Exhaustive
  – Branch and bound
• Algorithmic
  – Greedy algorithms, a.k.a. hill-climbing (including Neighbor-joining)
• Heuristic
  – Systematic: branch-swapping (NNI, SPR, TBR)
  – Stochastic: Markov Chain Monte Carlo (MCMC), genetic algorithms


“Star-Decomposition”

• Completely unresolved or "star" phylogeny
• Partially resolved phylogeny
• Fully resolved, bifurcating phylogeny

[Figure: taxa A, B, C, D and E drawn as each of the three trees; labels mark a polytomy or multifurcation and a bifurcation.]


There are three possible unrooted trees on four taxa (A, B, C, D)

Tree 1: ((A,B),(C,D))

Tree 2: ((A,C),(B,D))

Tree 3: ((A,D),(B,C))


The number of unrooted trees increases in a greater than exponential manner with number of taxa

(2N - 5)!! = # unrooted trees for N taxa
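The double factorial is easy to evaluate, which makes the growth concrete. A minimal sketch follows; the function names are my own. The rooted count uses (2N - 3)!!, which follows because each unrooted tree has 2N - 3 branches on which a root can be placed.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 (or 2)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3."""
    return double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2."""
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 6, 10, 25, 52):
    print(n, unrooted_trees(n), f"{rooted_trees(n):.3g}")
```

For four taxa this gives 3 unrooted and 15 rooted trees, and for ten taxa 34 459 425 rooted trees.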

[Figure: unrooted trees for three taxa (A to C), four taxa (A to D), five taxa (A to E) and six taxa (A to F).]

What is a “good” method?

• Efficiency – Time to find a/the solution
• Power – Rate of convergence/how much data are needed
• Consistency – Convergence to “correct” solution as data are added
• Robustness – Performance when assumptions are violated
• Falsifiability – Rejection of the model when inadequate

Performance on simulated data

[Figure: two panels plotting frequency of correct inference (0 to 1) against sequence length (10 to 100000). Methods in the first panel: Lake's invariants; Parsimony, uniform; UPGMA, Kimura; NJ, Kimura; ML, Kimura; Parsimony, weighted. Methods in the second panel: UPGMA, Kimura; NJ, percentage; Parsimony, uniform; Parsimony, weighted; NJ, Kimura; ML, Kimura. Panel labels: "All 0.50" and "0.30 and 0.05 respectively".]

+ and – of the methods

Pair-wise, NJ, distance approach

+ Fast (efficiency)

+ Models can be used to make distances (can be consistent)

– Pairwise distances throw out information (loss of power)

– One will get a tree, but no score to compare with other trees or hypotheses

Parsimony and tree-search

+ Philosophically appealing – Occam’s razor

– Can be inconsistent

– Can be computationally slow due to a huge number of possible trees

Maximum likelihood and tree-search

+ Model-based, can be consistent, powerful, gain biological info

– Model-based, bad when you have the wrong model

– Computationally very slow due to heavy calculations in determining the tree score and a huge number of possible trees

The quick and dirty, pretty good tree

• Calculate model-based pairwise distances.
• Make a Neighbor-Joining tree.
• Do a bootstrap.
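For the Neighbor-Joining step, a bare-bones version of the algorithm fits in a few lines. This is a sketch only: it returns the topology without branch lengths, and the distance matrix is a made-up additive example, computed from a hypothetical unrooted tree with branch lengths A = 10, B = 2, C = 2, D = 5 and an internal branch of 3.

```python
# Hypothetical additive distances (see the lead-in for the assumed tree).
PAIRS = {
    ("A", "B"): 12, ("A", "C"): 15, ("A", "D"): 18,
    ("B", "C"): 7, ("B", "D"): 10, ("C", "D"): 7,
}


def as_matrix(pairs):
    """Turn {(a, b): distance} into a symmetric dict-of-dicts."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


def neighbor_joining(dist):
    """Return an unrooted Newick topology (no branch lengths)."""
    d = {a: dict(row) for a, row in dist.items()}   # work on a copy
    nodes = list(d)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = f"({a},{b})"
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                # Distance from the new internal node to every other node.
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return "(" + ",".join(nodes) + ");"


print(neighbor_joining(as_matrix(PAIRS)))
```

In this example B and C are among the closest pairs by raw distance (7), yet the algorithm joins A with B, because the criterion it minimises corrects each distance for how far the two taxa are from everything else.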

A recipe for phylogenetic inference

• Collect your data
• Select an optimality criterion (“which tree is better?”)
• Optional: do data transformation (“corrections”)
• Select a search strategy to find the best tree
• Find the best hypothesis according to that criterion
• Assess the variation in your data in some way

Assessing the variation

• Jackknife – resampling without replacement
• Bootstrap – resampling with replacement

1. Resample columns from an alignment with replacement to make a simulated sample of the same size

2. Analyze this resampled dataset in the same way as you did the original sample

3. Repeat this 100+ times, making 100 bootstrap trees

4. Summarize, for example, as a majority-rule consensus tree

5. Clades found in at least 50% of the trees are shown; a clade needs 70% to be called “weakly supported”

Bootstrap

Original data set with n characters.

       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Aus    C  G  A  C  G  G  T  G  G  T  C  T  A  T  A  C  A  C  G  A
Beus   C  G  G  C  G  G  T  G  A  T  C  T  A  T  G  C  A  C  G  G
Ceus   T  G  G  C  G  G  C  G  T  C  T  C  A  T  A  C  A  A  T  A
Deus   T  A  A  C  G  A  T  G  A  C  C  C  G  A  C  T  A  T  T  G

Draw n characters randomly with replacement. Repeat m times.

       2  3 13  8  3 19 14  6 20 20  7  1  9 11 17 10  6 14  8 16
Aus    G  A  A  G  A  G  T  G  A  A  T  C  G  C  A  T  G  T  G  C
Beus   G  G  A  G  G  G  T  G  G  G  T  C  A  C  A  T  G  T  G  C
Ceus   G  G  A  G  G  T  T  G  A  A  C  T  T  T  A  C  G  T  G  C
Deus   A  A  G  G  A  T  A  A  G  G  T  T  A  C  A  C  A  A  G  T

m pseudo-replicates, each with n characters.

Original analysis, e.g. MP, ML, NJ.

Repeat original analysis on each of the pseudo-replicate data sets.

Evaluate the results from the m analyses.

[Figure: trees of Aus, Beus, Ceus and Deus from the original analysis and from the pseudo-replicates; one clade in the summary tree carries the value 75%.]

NB! The consensus tree is not a phylogenetic hypothesis, but a way to summarize other trees – in this case bootstrapped trees
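The resampling step itself is short. The sketch below reproduces the pseudo-replicate from the bootstrap example using the column numbers shown there, and also draws a fresh random replicate; the function names and the fixed random seed are my own choices, and the tree-building and consensus steps are left out.

```python
import random

# The original data set from the bootstrap example (n = 20 characters).
ALIGNMENT = {
    "Aus":  "CGACGGTGGTCTATACACGA",
    "Beus": "CGGCGGTGATCTATGCACGG",
    "Ceus": "TGGCGGCGTCTCATACAATA",
    "Deus": "TAACGATGACCCGACTATTG",
}

# Column numbers drawn in the example (1-based, as printed).
EXAMPLE_COLUMNS = [2, 3, 13, 8, 3, 19, 14, 6, 20, 20,
                   7, 1, 9, 11, 17, 10, 6, 14, 8, 16]


def draw_columns(n, rng):
    """Draw n column indices (0-based) with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def resample(alignment, columns):
    """Build a pseudo-replicate by copying whole columns, keeping rows aligned."""
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


# Reproduce the pseudo-replicate shown in the example.
example = resample(ALIGNMENT, [c - 1 for c in EXAMPLE_COLUMNS])
for name, seq in example.items():
    print(name, seq)

# Draw a new pseudo-replicate; the seed only makes the run repeatable.
rng = random.Random(1)
replicate = resample(ALIGNMENT, draw_columns(20, rng))
```

Whole columns are resampled, never single cells, so every pseudo-replicate is still an alignment of the same four taxa and can go through exactly the same analysis as the original.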


Rooting

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:

[Figure: an unrooted tree of taxa A, B, C and D with the root position marked, and the rooted tree that results.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.


Now, try it again with the root at another position:

[Figure: the same unrooted tree of taxa A, B, C and D with the root marked at another position, and the rooted tree that results.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.


An unrooted, four-taxon tree can be rooted in five different places

[Figure: the unrooted tree 1 (A and B on one side, C and D on the other) with its five branches numbered 1 to 5, and the five corresponding rooted trees: 1a (root on branch 1), 1b (branch 2), 1c (branch 3), 1d (branch 4) and 1e (branch 5).]

There are two major ways to root trees:

Outgroup rooting: Uses taxa or sequences (the “outgroup”) known to fall outside all the others (the “ingroup”). Requires prior knowledge.

Midpoint rooting: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes clock-like evolution.

[Figure: tree of taxa A, B, C and D with branch lengths 10, 2, 3, 5 and 2; one taxon is marked as the outgroup.]

d(A,D) = 10 + 3 + 5 = 18

Midpoint = 18 / 2 = 9
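The midpoint calculation can be written out as a small search over the tree. The sketch below assumes an unrooted tree in which A (10) and B (2) attach to one internal node, C (2) and D (5) attach to the other, and the internal branch has length 3. Only the path A to D (10 + 3 + 5) is given in the example, so the placement of the two branches of length 2, and the node names `X` and `Y`, are assumptions.

```python
from itertools import combinations

# Assumed unrooted tree as an edge list: (node, node, branch length).
EDGES = [("A", "X", 10), ("B", "X", 2), ("X", "Y", 3),
         ("C", "Y", 2), ("D", "Y", 5)]
TIPS = ["A", "B", "C", "D"]


def build_adjacency(edges):
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def find_path(adj, start, goal):
    """Return [(node, cumulative distance from start), ...] along the path."""
    stack = [(start, [(start, 0)])]
    while stack:
        node, trail = stack.pop()
        if node == goal:
            return trail
        visited = {name for name, _ in trail}
        for nxt, w in adj[node]:
            if nxt not in visited:
                stack.append((nxt, trail + [(nxt, trail[-1][1] + w)]))
    raise ValueError("no path between the two nodes")


def midpoint_root(edges, tips):
    """Find the most distant pair of tips and where the midpoint falls."""
    adj = build_adjacency(edges)
    pair = max(combinations(tips, 2),
               key=lambda p: find_path(adj, *p)[-1][1])
    trail = find_path(adj, *pair)
    total = trail[-1][1]
    half = total / 2
    for (u, du), (v, dv) in zip(trail, trail[1:]):
        if du <= half <= dv:
            # Root lies on branch u-v, (half - du) away from u.
            return pair, total, (u, v), half - du


print(midpoint_root(EDGES, TIPS))
```

With these assumptions the most distant pair is A and D at distance 18, and the root falls on the branch leading to A, 9 units from A.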


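[Figure: unrooted trees for four taxa (A to D), five taxa (A to E) and six taxa (A to F).]

(# unrooted trees) x (# branches on each tree) = (2N - 3)!! = # rooted trees for N taxa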

Each unrooted tree theoretically can be rooted anywhere along any of its branches

We have arrived at a tree – can we trust it as a good hypothesis of the phylogeny?

What can go wrong?

• Sampling error
  – Assessed by, for example, the bootstrap
• Too superficial tree search
  – Remember: finding the best tree is really hard
• Systematic error (inconsistent method)
  – Tests of the adequacy of models used
  – Premeditated use of different methods
• Reality
  – A tree may be a poor model of the real history
  – Information has been lost by subsequent evolutionary changes

“Species” vs. “gene” trees

What is wrong with this tree?

[Figure: tree of sequences from Canis, Mus and Gadus, with the values 100 and 100 on its branches.]

• Negligible (within sequence) sampling error
• Tree estimated by a consistent method

The expected tree…

[Figure: the “species” tree of Canis, Mus and Gadus and the “gene” trees that follow a gene duplication. Two copies (paralogs) are present in the genomes; the relationships are labelled Paralogous, Orthologous and Orthologous.]

What we have studied…

[Figure: the sequences sampled from Canis, Gadus and Mus.]

Message: specific loss patterns of paralogs can disrupt species trees if we don’t know what is a paralog and what is an ortholog

To conclude

• Phylogenetic inference deals with historical events and information transfer through time
• Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown
• Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion
• There are several pitfalls to avoid when doing the analyses and when interpreting them
• But… ignoring the shared histories can sometimes give completely bogus results in comparative studies

Phylogenetic trees diagram the evolutionary relationships between the taxa

((A,(B,C)),(D,E))

[Figure: the tree ((A,(B,C)),(D,E)) drawn with the tips Taxon A, Taxon B, Taxon C, Taxon E and Taxon D from top to bottom.]

No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
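One way to see that the drawing order carries no information is to compare the clades of two differently written versions of the same tree. The sketch below uses a tiny hand-written parser for parenthesised trees without branch lengths; the parser and the second, rotated tree string are illustrative assumptions, not part of the original slides.

```python
def parse_newick(text):
    """Parse a parenthesised tree without branch lengths into nested tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]            # a tip label

    return parse()


def clades(tree):
    """Return the set of clades (each a frozenset of tip labels) in the tree."""
    found = set()

    def tips_below(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(tips_below(child) for child in node))
        found.add(tips)
        return tips

    tips_below(tree)
    return found


original = parse_newick("((A,(B,C)),(D,E))")
rotated = parse_newick("((E,D),((C,B),A))")   # same tree, branches rotated

print(sorted("".join(sorted(c)) for c in clades(original)))
print(clades(original) == clades(rotated))
```

Both strings give the clades {B,C}, {A,B,C}, {D,E} and the full set of five taxa, so they describe the same relationships even though the tips appear in a different order.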
