Intro. To Phylogenetic Analysis
Slides modified by David Ardell
From Caro-Beth Stewart, Paul Higgs,
Joe Felsenstein and Mikael Thollesson
C-B Stewart, NHGRI lecture, 12/5/00
What is phylogenetic analysis and why should we perform it?
Phylogenetic analysis has two major components:
1. Phylogeny inference or “tree building”: inferring the evolutionary relationships between genes or species
2. Character and rate analysis: mapping information onto trees
Common Phylogenetic Tree Terminology
[Figure: a tree on taxa A, B, C, D and E with the following parts labeled.]
Terminal nodes: represent the taxa (genes, populations, species, etc.) used to infer the phylogeny
Internal nodes: represent hypothetical ancestors of the taxa
Ancestral node or root of the tree
Branches or lineages
Clade
X and Y are defined to be more closely related to each other than to Z if, and only if, they share a more recent common ancestor than they do with Z
All of these rearrangements show the same evolutionary relationships between the taxa
[Figure: several drawings of the same tree on taxa A, B, C and D (rooted tree 1a), with the taxa listed in different orders.]
Three types of trees
Cladogram: groupings only (the horizontal axis has no meaning)
Phylogram: groupings + evolutionary distance
Ultrametric tree: groupings + time
All show the same branching orders between taxa.
[Figure: the same tree on Taxon A, Taxon B, Taxon C and Taxon D drawn in each of the three styles; the phylogram carries branch lengths 1, 1, 1, 6, 3 and 5.]
Similarity vs. Evolutionary Relationship:
Since taxa evolve at different rates, your closest relative could be very different
[Figure: phylogram of Taxon A, Taxon B, Taxon C (think lamprey) and Taxon D with branch lengths 1, 1, 1, 6, 3 and 5.]
C is closer (more similar) to A, but more closely related to B
This is why the closest BLAST hit is not necessarily the closest relative, and why you need to make trees.
Types of Similarity
Observed similarity between two entities can be due to:
Evolutionary relationship:
– Shared ancestral characters (‘plesiomorphies’)
– Shared derived characters (‘synapomorphies’)
Homoplasy (independent evolution of the same character):
– Convergent events
– Parallel events
– Reversals
[Figure: small trees with character states C, G and T mapped onto their branches.]
A few examples of what can be inferred from phylogenetic trees built from DNA
or protein sequence data:
• Which species are the closest living relatives of modern humans?
• Did the infamous Florida Dentist infect his patients with HIV?
• What were the origins of specific transposable elements?
Which species are the closest living relatives of modern humans?
[Figure: two trees of Humans, Chimpanzees, Bonobos, Gorillas and Orangutans on a time axis in MYA. Classical view: axis labeled 0 and 15-30. Molecular view: axis labeled 0 and 14.]
Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:
[Figure: tree with tips DENTIST (two sequences), Patients A to G, and Local controls 2, 3, 9 and 35. One group of patients is marked “Yes” and two are marked “No”.]
Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
From Ou et al. (1992) and Page & Holmes (1998)
Uses of character mapping:
• Dating adaptive evolutionary events
• Ancestral reconstruction
• Testing biological hypotheses of correlated function or change
Ex: Where geographically was the common ancestor of African apes and humans?
Scenario A: Africa as species fountain
Scenario B: Eurasia as ancestral homeland
Scenario B requires four fewer dispersal events
[Figure: for each scenario, one tree of living species (OW Monkeys, Gibbons, Orangutans, Gorillas, Chimpanzees, Humans) and one of living + fossil species (adding Proconsul, Kenyapithecus, Oreopithecus, Lufengpithecus, Dryopithecus and Ouranopithecus). Legend: Eurasia = black, Africa = red, marked branches = dispersal.]
Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R582-588.
Building Trees
                 COMPUTATIONAL METHOD
DATA TYPE     Optimality criterion                 Clustering algorithm
Characters    PARSIMONY, MAXIMUM LIKELIHOOD        (none listed)
Distances     MINIMUM EVOLUTION, LEAST SQUARES     UPGMA, NEIGHBOR-JOINING
Types of data:
Character-data:
Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG
Distance-based data: pairwise distances (dissimilarities)
            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----
Example 1 (above the diagonal): Uncorrected “p” distance
Example 2 (below the diagonal): Kimura 2-parameter distance
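The two kinds of distance can be recomputed from the character data. The following is a minimal sketch, not the lecture's own code: the function names are made up for illustration, and the Kimura 2-parameter formula used is the standard one, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], with P and Q the fractions of sites showing transitions and transversions.

```python
import math

# Alignment copied from the character-data table on this slide.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(seq1, seq2):
    """Uncorrected p distance: the fraction of sites that differ."""
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance from transition and transversion fractions."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    n = len(seq1)
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


for first, second in [("A", "B"), ("A", "C"), ("C", "D")]:
    s1, s2 = ALIGNMENT[first], ALIGNMENT[second]
    print(first, second, round(p_distance(s1, s2), 2), round(k2p_distance(s1, s2), 2))
```

For the pairs printed, the p distances correspond to the entries above the diagonal and the Kimura distances to those below it; the corrected distance is always at least as large as the uncorrected one.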
Parsimony
Given two trees, the one that requires fewer character changes to explain the observations is the better tree
– Parsimony score for a tree is the minimum number of required changes
– This score is frequently referred to as number of steps or tree length
Parsimony – an example
acgtatgga
acgggtgca
aacggtgga
aactgtgca
[Figure: the three possible unrooted trees for these four sequences, with the states of one site (c, c, a, a) mapped onto the tips.]
Total tree length: 7    Total tree length: 8    Total tree length: 8
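The three tree lengths can be checked with Fitch's algorithm, which counts the minimum number of changes per site. This is a sketch under stated assumptions: the labels s1 to s4 are hypothetical names for the four sequences (in the order listed), and each unrooted tree is rooted on its internal branch, which does not affect the count.

```python
# The four sequences from the parsimony example; s1..s4 are hypothetical labels.
SEQUENCES = {
    "s1": "acgtatgga",
    "s2": "acgggtgca",
    "s3": "aacggtgga",
    "s4": "aactgtgca",
}


def fitch(tree, site_states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):                 # a tip: its state is observed
        return {site_states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, site_states)
    right_states, right_changes = fitch(right, site_states)
    shared = left_states & right_states
    if shared:                                # children can agree: no extra change
        return shared, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, sequences):
    """Parsimony score: minimum changes summed over all sites."""
    n_sites = len(next(iter(sequences.values())))
    total = 0
    for i in range(n_sites):
        site_states = {taxon: seq[i] for taxon, seq in sequences.items()}
        total += fitch(tree, site_states)[1]
    return total


# The three unrooted four-taxon trees, written as nested tuples.
TREES = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]
LENGTHS = [tree_length(tree, SEQUENCES) for tree in TREES]
print(LENGTHS)
```

Under parsimony the first tree, which groups the first two sequences together, is preferred because it has the shortest length.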
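Using models
[Figure labels: Observed differences; Actual changes; the four nucleotides A, G, C and T.]
Example: Jukes-Cantor
$$Q = \begin{bmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{bmatrix}$$
(rows and columns in the order A, C, G, T)
$$p_{ij}(t) = \begin{cases} \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}, & \text{if } i = j \\ \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}, & \text{if } i \neq j \end{cases}$$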
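As a quick check on the Jukes-Cantor formulas, the sketch below evaluates the two transition probabilities and the distance correction that follows from them. The function names are illustrative. The correction d = -3/4 ln(1 - 4p/3) is the standard Jukes-Cantor result: a site differs with probability p = 3/4 (1 - e^{-4αt}), and the expected number of actual changes per site is 3αt.

```python
import math


def jc_probability(same, alpha, t):
    """Jukes-Cantor probability of ending in state j after time t, starting in state i."""
    decay = math.exp(-4.0 * alpha * t)
    if same:                       # i == j
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay     # i != j


def jc_distance(p_observed):
    """Expected actual changes per site (3 * alpha * t) given the observed fraction of differences."""
    return -0.75 * math.log(1.0 - 4.0 * p_observed / 3.0)


alpha, t = 1.0, 0.1               # arbitrary illustrative values
row_sum = jc_probability(True, alpha, t) + 3 * jc_probability(False, alpha, t)
print(row_sum)                    # each row of the probability matrix should sum to 1
print(jc_distance(0.20))          # actual changes implied by 20% observed differences
```

The corrected distance exceeds the observed difference because some sites have changed more than once, and the gap widens as sequences diverge.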
Likelihood of a one-branch tree…
30 nucleotides from -globin genes of two primates on a one-edge tree
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG
The two sequences differ at sites 2 and 4 (marked * on the slide).
There are two differences and 28 similarities
$$L = \left[\tfrac{1}{16}\left(1 - e^{-4\alpha t}\right)\right]^{2} \left[\tfrac{1}{16}\left(1 + 3e^{-4\alpha t}\right)\right]^{28}$$
Maximum: t = 0.02327, lnL = -51.133956
[Figure: lnL (axis from -55.0 to -50.5) plotted against t (axis from 0 to 0.1).]
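The maximum can be reproduced numerically. A minimal sketch, assuming α = 1 so that t is measured in units of αt (the slide does not state α); the grid search and the closed-form solution are illustrative choices, not necessarily how the slide's values were obtained.

```python
import math

GORILLA = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
ORANGUTAN = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def log_likelihood(t, n_diff, n_same, alpha=1.0):
    """lnL of the one-branch tree under Jukes-Cantor (alpha = 1 is an assumption)."""
    decay = math.exp(-4.0 * alpha * t)
    site_diff = (1.0 - decay) / 16.0          # 1/4 * p_ij for a differing site
    site_same = (1.0 + 3.0 * decay) / 16.0    # 1/4 * p_ii for an identical site
    return n_diff * math.log(site_diff) + n_same * math.log(site_same)


n_diff = sum(a != b for a, b in zip(GORILLA, ORANGUTAN))
n_same = len(GORILLA) - n_diff

# Crude grid search over t in (0, 0.1], the range shown on the plot.
grid = [i / 100000 for i in range(1, 10001)]
t_best = max(grid, key=lambda t: log_likelihood(t, n_diff, n_same))

# Closed form for comparison: setting dlnL/dt = 0 gives e^{-4t} = 1 - 4p/3.
p = n_diff / (n_diff + n_same)
t_closed = -0.25 * math.log(1.0 - 4.0 * p / 3.0)

print(n_diff, n_same)
print(round(t_best, 5), round(t_closed, 5))
print(round(log_likelihood(t_closed, n_diff, n_same), 6))
```

Working with lnL rather than L avoids numerical underflow when the same product is taken over thousands of sites.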
A recipe for phylogenetic inference
• Collect your data
• Select an optimality criterion (“which tree is better?”, tree score)
• Optional: do data transformation (“corrections”)
• Select a search strategy to find the best tree
• Find the best hypothesis according to that criterion
• Assess the variation in your data in some way
Finding the best tree
Number of (rooted) trees
– 3 taxa -> 3 trees
– 4 taxa -> 15 trees
– 10 taxa -> 34 459 425 trees
– 25 taxa -> 1.19·10^30 trees
– 52 taxa -> 2.75·10^80 trees
Finding the optimal tree is an NP-complete problem
Search strategies
– Exact: exhaustive; branch and bound
– Algorithmic: greedy algorithms, a.k.a. hill-climbing (including Neighbor-joining)
– Heuristic, systematic: branch-swapping (NNI, SPR, TBR)
– Heuristic, stochastic: Markov Chain Monte Carlo (MCMC); genetic algorithms
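The tree counts listed for this slide follow from the double factorial (2N - 3)!! for rooted, bifurcating trees. A short sketch (function names are illustrative) that reproduces them:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1; taken as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_rooted_trees(n_taxa):
    """(2N - 3)!! rooted, bifurcating trees for N >= 2 taxa."""
    return double_factorial(2 * n_taxa - 3)


def n_unrooted_trees(n_taxa):
    """(2N - 5)!! unrooted, bifurcating trees for N >= 3 taxa."""
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 10, 25, 52):
    print(f"{n} taxa -> {n_rooted_trees(n):.3g} rooted trees")
```

Python integers have arbitrary precision, so the counts are exact; the format string only rounds them for display. Growth this fast is why exhaustive search is feasible for only a handful of taxa.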
“Star-Decomposition”
Completely unresolved or "star" phylogeny -> Partially resolved phylogeny -> Fully resolved, bifurcating phylogeny
[Figure: the three stages drawn for taxa A, B, C, D and E, with a polytomy (multifurcation) and a bifurcation labeled.]
There are three possible unrooted trees on four taxa (A, B, C, D)
Tree 1: ((A,B),(C,D))
Tree 2: ((A,C),(B,D))
Tree 3: ((A,D),(B,C))
The number of unrooted trees increases in a greater than exponential manner with number of taxa
(2N - 5)!! = # unrooted trees for N taxa
[Figure: unrooted trees on three, four, five and six taxa (A to F).]
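One way to see where the (2N - 5)!! count comes from is to build the trees by stepwise addition: each new taxon can be attached to any existing branch. The sketch below is illustrative only; it writes an unrooted tree on N taxa as a rooted tree on the first N - 1 taxa joined to the last taxon, which is a common bookkeeping convention rather than anything stated on the slides.

```python
def attach(tree, leaf):
    """Yield every rooted tree made by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)                      # on the branch above the current root
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in attach(left, leaf):
            yield (new_left, right)
        for new_right in attach(right, leaf):
            yield (left, new_right)


def rooted_trees(taxa):
    """All rooted, bifurcating trees on the given taxa, as nested tuples."""
    trees = [taxa[0]]
    for leaf in taxa[1:]:
        trees = [new for tree in trees for new in attach(tree, leaf)]
    return trees


def unrooted_trees(taxa):
    """Unrooted trees on N taxa: rooted trees on N - 1 taxa joined to the last taxon."""
    return [(tree, taxa[-1]) for tree in rooted_trees(taxa[:-1])]


for taxa in ("ABCD", "ABCDE", "ABCDEF"):
    print(len(taxa), "taxa:", len(unrooted_trees(taxa)), "unrooted trees")

for tree in unrooted_trees("ABCD"):
    print(tree)
```

The three trees printed for A, B, C and D are the three possible four-taxon topologies; each added taxon multiplies the count by the number of branches available, which gives the double factorial.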
What is a “good” method?
Efficiency: time to find a/the solution
Power: rate of convergence/how much data are needed
Consistency: convergence to “correct” solution as data are added
Robustness: performance when assumptions are violated
Falsifiability: rejection of the model when inadequate
Performance on simulated data
[Figure: two panels plotting frequency of correct inference (0 to 1) against sequence length (10 to 100000). Panel labels: “All 0.50” and “0.30 and 0.05 respectively”. Curves: Lake's invariants; Parsimony, uniform; Parsimony, weighted; UPGMA, Kimura; NJ, Kimura; NJ, percentage; ML, Kimura.]
+ and – of the methods
Pair-wise, NJ, distance approach
+ Fast (efficiency)
+ Models can be used to make distances (can be consistent)
– Pairwise distances throw out information (loss of power)
– One will get a tree, but no score to compare with other trees or hypotheses
Parsimony and tree-search
+ Philosophically appealing (Occam’s razor)
– Can be inconsistent
– Can be computationally slow due to a huge number of possible trees
Maximum likelihood and tree-search
+ Model-based, can be consistent, powerful, gain biological info
– Model-based, bad when you have the wrong model
– Computationally veeeeery slow due to heavy calculations in determining the tree score and a huge number of possible trees
The quick and dirty, pretty good tree
1. Calculate model-based pairwise distances
2. Make a Neighbor-Joining tree
3. Do a bootstrap
Assessing the variation
Jackknife: resampling without replacement
Bootstrap: resampling with replacement
1. Resample columns from an alignment with replacement to make a simulated sample of the same size
2. Analyze this resampled dataset in the same way as you did the original sample
3. Repeat this 100+ times, making 100+ bootstrap trees
4. Summarize, for example, as a majority-rule consensus tree
5. Clades found in more than 50% of the trees will be shown; a clade needs at least 70% to be called “weakly supported”
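Steps 4 and 5 amount to counting how often each clade appears among the bootstrap trees. The sketch below is a simplified illustration: the four bootstrap trees are made up, trees are treated as rooted nested tuples (real programs usually count bipartitions of unrooted trees), and the function names are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (frozensets of tip names) in a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found


def clade_support(trees):
    """Fraction of trees in which each clade appears."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}


# Hypothetical bootstrap trees (made up for illustration, not taken from the slides).
BOOTSTRAP_TREES = [
    (("Aus", "Beus"), ("Ceus", "Deus")),
    (("Aus", "Beus"), ("Ceus", "Deus")),
    (("Aus", "Beus"), ("Ceus", "Deus")),
    (("Aus", "Ceus"), ("Beus", "Deus")),
]
support = clade_support(BOOTSTRAP_TREES)
majority = {clade for clade, freq in support.items() if freq > 0.5}

for clade, freq in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(clade), f"{freq:.0%}")
```

Clades in `majority` are the ones a majority-rule consensus tree would display; the clade containing every taxon trivially has 100% support.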
Bootstrap
Original data set with n characters.
Site   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Aus    C  G  A  C  G  G  T  G  G  T  C  T  A  T  A  C  A  C  G  A
Beus   C  G  G  C  G  G  T  G  A  T  C  T  A  T  G  C  A  C  G  G
Ceus   T  G  G  C  G  G  C  G  T  C  T  C  A  T  A  C  A  A  T  A
Deus   T  A  A  C  G  A  T  G  A  C  C  C  G  A  C  T  A  T  T  G
Draw n characters randomly with replacement. Repeat m times.
Site   2  3 13  8  3 19 14  6 20 20  7  1  9 11 17 10  6 14  8 16
Aus    G  A  A  G  A  G  T  G  A  A  T  C  G  C  A  T  G  T  G  C
Beus   G  G  A  G  G  G  T  G  G  G  T  C  A  C  A  T  G  T  G  C
Ceus   G  G  A  G  G  T  T  G  A  A  C  T  T  T  A  C  G  T  G  C
Deus   A  A  G  G  A  T  A  A  G  G  T  T  A  C  A  C  A  A  G  T
m pseudo-replicates, each with n characters.
Original analysis, e.g. MP, ML, NJ.
Repeat original analysis on each of the pseudo-replicate data sets.
Evaluate the results from the m analyses.
[Figure: the tree of Aus, Beus, Ceus and Deus from the original analysis, the trees from the pseudo-replicates, and a summary tree with one clade labeled 75%.]
NB! The consensus tree is not a phylogenetic hypothesis, but a way to summarize other trees, in this case bootstrapped trees
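The resampling step can be sketched in a few lines. The alignment and the example draw of column numbers are copied from the slide; the function names, the seed, and the choice of 100 replicates are illustrative assumptions.

```python
import random

ALIGNMENT = {
    "Aus":  "CGACGGTGGTCTATACACGA",
    "Beus": "CGGCGGTGATCTATGCACGG",
    "Ceus": "TGGCGGCGTCTCATACAATA",
    "Deus": "TAACGATGACCCGACTATTG",
}
# Column numbers (1-based) drawn on the slide for its example pseudo-replicate.
SLIDE_DRAW = [2, 3, 13, 8, 3, 19, 14, 6, 20, 20, 7, 1, 9, 11, 17, 10, 6, 14, 8, 16]


def take_columns(alignment, columns):
    """Build a new alignment from the given 1-based column numbers."""
    return {name: "".join(seq[c - 1] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_replicate(alignment, rng):
    """Draw n columns with replacement, where n is the alignment length."""
    n = len(next(iter(alignment.values())))
    columns = [rng.randint(1, n) for _ in range(n)]
    return take_columns(alignment, columns)


slide_replicate = take_columns(ALIGNMENT, SLIDE_DRAW)
print(slide_replicate["Aus"])

rng = random.Random(0)          # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(ALIGNMENT, rng) for _ in range(100)]
print(len(replicates), "pseudo-replicates")
```

Whole columns are resampled, so every taxon receives the same sites in each pseudo-replicate; each replicate would then be analyzed with the same method as the original data.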
Rooting
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:
[Figure: an unrooted tree on taxa A, B, C and D with the root position marked, and the rooted tree that results.]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree with the root marked at another position, and the rooted tree that results.]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree can be rooted in five different places
[Figure: the unrooted tree 1, ((A,B),(C,D)), with its five branches numbered 1 to 5, and the five rooted trees 1a to 1e obtained by placing the root on each branch in turn.]
There are two major ways to root trees:
Outgroup rooting: Uses taxa or sequences (the “outgroup”) known to fall outside all the others (the “ingroup”). Requires prior knowledge.
Midpoint rooting: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes clock-like evolution.
[Figure: unrooted tree on A, B, C and D with branch lengths 10, 2, 3, 5 and 2; one taxon is marked as the outgroup.]
d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
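The midpoint calculation can be written out as a small search over tip-to-tip paths. This is a sketch under assumptions: the branch lengths are the ones on the slide, but the topology ((A,B),(C,D)), the assignment of the two length-2 branches to B and C, and the internal node names "u" and "v" are guesses made for illustration.

```python
# Branch lengths from the slide; topology and internal node names are assumed.
EDGES = [("A", "u", 10), ("B", "u", 2), ("u", "v", 3), ("C", "v", 2), ("D", "v", 5)]
TIPS = ["A", "B", "C", "D"]


def neighbors(edges):
    """Adjacency list: node -> list of (neighbor, branch length)."""
    graph = {}
    for a, b, length in edges:
        graph.setdefault(a, []).append((b, length))
        graph.setdefault(b, []).append((a, length))
    return graph


def path(graph, start, goal, seen=None):
    """Return the (node, branch length) steps from start to goal, or None."""
    if start == goal:
        return []
    seen = (seen or set()) | {start}
    for nxt, length in graph[start]:
        if nxt not in seen:
            rest = path(graph, nxt, goal, seen)
            if rest is not None:
                return [(nxt, length)] + rest
    return None


def midpoint_root(edges, tips):
    """Return (most distant pair, their distance, branch holding the root, offset along it)."""
    graph = neighbors(edges)
    pairs = [(a, b) for i, a in enumerate(tips) for b in tips[i + 1:]]
    a, b = max(pairs, key=lambda p: sum(l for _, l in path(graph, *p)))
    steps = path(graph, a, b)
    total = sum(length for _, length in steps)
    walked, here = 0, a
    for nxt, length in steps:           # walk from tip a to the halfway point
        if walked + length >= total / 2:
            return (a, b), total, (here, nxt), total / 2 - walked
        walked, here = walked + length, nxt


pair, distance, branch, offset = midpoint_root(EDGES, TIPS)
print(pair, distance, branch, offset)
```

With these lengths the root falls on the long branch leading to A, 9 units from the tip, so midpoint rooting here agrees with treating A as the outgroup.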
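Each unrooted tree theoretically can be rooted anywhere along any of its branches
# unrooted trees x # branches per tree = # rooted trees: (2N - 5)!! x (2N - 3) = (2N - 3)!! = # rooted trees for N taxa
[Figure: unrooted trees on four, five and six taxa (A to F).]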
We have arrived at a tree – can we trust it as a good hypothesis of the phylogeny?
What can go wrong?
• Sampling error
– Assessed by, for example, the bootstrap
• Too superficial tree search
– Remember: finding the best tree is really hard
• Systematic error (inconsistent method)
– Tests of the adequacy of models used
– Premeditated use of different methods
• Reality
– A tree may be a poor model of the real history
– Information has been lost by subsequent evolutionary changes
“Species” vs. “gene” trees
What is wrong with this tree?
[Figure: tree of sequences from Canis, Mus and Gadus with support values of 100.]
– Negligible (within sequence) sampling error
– Tree estimated by a consistent method
Gene duplication
[Figure: a “species” tree containing two “gene” trees that arise from a gene duplication. The expected tree has tips Canis, Mus, Gadus, Gadus, Mus, Canis: two copies (paralogs) present in the genomes. Sequences within one copy are orthologous; sequences from different copies are paralogous.]
What we have studied…
[Figure: tree with tips Canis, Gadus and Mus.]
Message: specific loss patterns of paralogs can disrupt species trees if we don’t know what is a paralog and what is an ortholog
To conclude
– Phylogenetic inference deals with historical events and information transfer through time
– Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown
– Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion
– There are several pitfalls to avoid when doing the analyses and when interpreting them
– But… ignoring the shared histories can sometimes give completely bogus results in comparative studies
Phylogenetic trees diagram the evolutionary relationships between the taxa
((A,(B,C)),(D,E))
[Figure: the tree ((A,(B,C)),(D,E)) drawn with tips Taxon A, Taxon B, Taxon C, Taxon E and Taxon D.]
No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
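The parenthetical notation can be read into a nested structure, which makes the point about drawing order easy to check: flipping the children at every node changes the order of the tips but not the set of clades. A minimal sketch, assuming names without branch lengths, quotes or spaces; the function names are illustrative.

```python
def parse_newick(text):
    """Parse a parenthetical tree string without branch lengths into nested tuples."""
    text = text.strip().rstrip(";")
    position = 0

    def node():
        nonlocal position
        if text[position] == "(":
            position += 1                    # skip "("
            children = [node()]
            while text[position] == ",":
                position += 1                # skip ","
                children.append(node())
            position += 1                    # skip ")"
            return tuple(children)
        start = position
        while position < len(text) and text[position] not in "(),":
            position += 1
        return text[start:position]

    return node()


def tips(tree):
    """List the tips of a nested-tuple tree in drawing order."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in tips(child)]


def rotate(tree):
    """Flip the order of the children at every internal node."""
    if isinstance(tree, str):
        return tree
    return tuple(rotate(child) for child in reversed(tree))


def clade_sets(tree):
    """Set of clades, each a frozenset of tips; independent of drawing order."""
    if isinstance(tree, str):
        return set()
    found = {frozenset(tips(tree))}
    for child in tree:
        found |= clade_sets(child)
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
flipped = rotate(tree)
print(tips(tree), tips(flipped))
print(clade_sets(tree) == clade_sets(flipped))
```

Comparing clade sets rather than strings is a simple way to test whether two differently drawn trees show the same relationships.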