introduction to biosystematics - zool · pdf fileintroduction to biosystematics - zool 575 2...
TRANSCRIPT
Introduction to Biosystematics - Zool 575
1
Announcements
1. Email me your list of 5 questions with correct
answers before April 12th (last lecture) - I will post
them on the website as a FAQs page
- Must have date of question (approx. OK)
- Question (proper spelling, grammar, etc.)
- Answer (if answer is ambiguous, or unknown,
do not use)
- Worth bonus points of maximum 2%
2. Final paper deadline changed to April 12th
Introduction to Biosystematics
Lecture 33 - Troubleshooting Molecular
Phylogenies
Derek S. Sikes University of Calgary Zool 575
Troubleshooting Phylogenies
Outline
1. Parts of the tree seem “wrong”
2. Branch support values are low
3. Signals from different datasets or partitions conflict
4. Rooting the tree is problematic
5. Computation time is too long
Sanderson, M.J. and Shaffer, H.B. (2002) Troubleshooting molecular phylogenetic analyses. Annual
Review of Ecology and Systematics 33: 49-72
This speaks to the importance of knowing:
1. What all past investigations have suggested
concerning the phylogeny
2. What morphology (& biogeography, etc.) suggest
concerning the phylogeny
How else would one ascertain something “seems wrong”?
Probably many errors published because this information
was not used
Parts of the tree seem “wrong”
Example - my lab:
mtDNA (COI + COII)
Species seems to be in
“wrong” place
Contamination?
Tube mix up?
Misidentification?
Copy-paste error?
How could one even
suspect without
morphology (or other
information)?
Either the results are wrong or the other information is
wrong
- Always investigate both possibilities:
- Problems with data / analysis
(comparatively common)
- Long-held traditional views on relationships, or
morphological data, etc. are sometimes incorrect
(comparatively rare)
Parts of the tree seem “wrong”
Introduction to Biosystematics - Zool 575
2
Onus is on the investigator to show results are strongly
supported and alternative explanations have been rejected
• Check for pre-processing errors:
- alignment
a) compare manual to computer alignments
b) use 2ndary structure or amino acid sequence
c) delete / ignore ambiguous regions
d) do analyses with different alignments of same
data & compare
- orthology / paralogy
a) use single-copy genes, gel-extract desired band
- lab error (tube mix up, misidentification, copy/paste)
a) use multiple samples one level below level of interest
Parts of the tree seem “wrong”
Onus is on the investigator to show results are strongly
supported and alternative explanations have been rejected
2. Sample more genes with different substitution rates:
- Compare faster with slower genes
a) deep level phylogenies - use slower genes
b) shallow level phylogenies - use faster genes
c) medium-level - use a mixture of genes of different rates
- Compare nuclear with mitochondrial (or chloroplast)
McCracken, K.G. and Soreson, M.D. (2005) Is Homoplasy or Lineage Sorting the Source of Incongruent
mtDNA and Nuclear Gene Trees in the Stiff-Tailed Ducks (Nomonyx-Oxyura)? Syst Biol 54: 35-55.
Parts of the tree seem “wrong”
3. Check if rate heterogeneity is sufficient to cause LBA
“Detecting phylogenetic problems that are in the Felsenstein
zone is a lot like detecting black holes;
such phenomena can only be inferred indirectly.”
- J. Huelsenbeck, Sys. Bio. 1998: p 526
Parts of the tree seem “wrong”
3. Check if rate heterogeneity is sufficient to cause LBA
How to tell if your analysis is being misled by Long Branch
Attraction - (Huelsenbeck 1998; Siddal & Whiting 1999)
1. Are there long branches?
2. If so, are they joined in Parsimony analyses?
3. Are they joined by Maximum Likelihood methods?
4. Do they join to different branches in each other’s absence?
5. Are the branches long enough to attract?
Parts of the tree seem “wrong”
1. Are there long branches?
0.826493
0.590521
Use ML or BI to estimate branch lengths
- not MP (which can underestimate or be
equivocal about branch lengths)
2. Are they joined in Parsimony analyses?
97
1 of 4 shortest trees L=75
With strong support …
But if MP doesn’t join - then LBA is
rejected
LBA due to model
violation
Introduction to Biosystematics - Zool 575
3
0.826493
0.590521
3. Are they joined by Maximum Likelihood methods?
If not with ML but with MP
- then possible LBA affecting MP
- or relationship is correct (Farris zone)
If with ML & MP
- then possible LBA affecting MP & ML
- or relationship is correct
- Model violation! - Make sure
model is “best-fitting”
4. Do they join to different branches in each
other’s absence?
1 of 12 shortest MP trees
Parsimony
If they join to the same place in
each other’s absence it cannot
be due to LBA
4. Do they join to different branches in each
other’s absence?
1 of 206 shortest trees
Parsimony
5. Are the branches long enough to attract?
This can be answered using simulated data
What if…
Data had evolved
following these same
branch lengths?
On a tree in which the
long branches are
separate?
5. Are the branches long enough to attract?
Methods - example:
6 trees with the long branches separate
1 tree with them together
(built from 2 max lnL trees of MCMC with each long branch alone)
100 simulated DNA datasets per tree of lengths
50, 100, 250, 500, 1000 base-pairs
7 x 5 x 100 = 3500 per model, 3 models = 10500
analyses, took 170 CPU days
5. Are the branches long enough to attract?
Methods - example:
Data generated with program Seq. Gen. v1.27
using the simplest (and least realistic) model
of evolution (Jukes Cantor)
equal base frequencies, no ASRV, equal
substitution probabilities
Morphological data simulated from JC69
datasets - removed constant characters
Introduction to Biosystematics - Zool 575
4
Recorded # times k + r
Monophyletic
Parsimony Bootstrap >83%
Bayesian PP >89%
~ 95% Probability
(Erixon et al. 2003)
Felsenstein zone Tree 5
4. Assess the extent of Base Composition Bias &
Heterogeneity - another source of phylogenetic error
- Some genes are biased in favor of AT or GC
- This compositional bias alone has explained some erroneous
results
- Some cases are so bad even the use of amino acids is not
free of the bias & yields erroneous results
Foster, P. G. & Hickey D. A. (1999) Compositional bias may affect both DNA-based and protein-based
phylogenetic reconstructions. J Mol Evol 48: 284-290.
Parts of the tree seem “wrong”
4. Assess the extent of Base Composition Bias &
Heterogeneity
Foster, P. G. & Hickey D. A. (1999) Compositional bias may affect both DNA-based and protein-based
phylogenetic reconstructions. J Mol Evol 48: 284-290.
Parts of the tree seem “wrong”
Correct tree:
ML tree - wrong due to
base composition bias
4. Assess the extent of Base Composition Bias &
Heterogeneity
- Can compare base composition across taxa in various
programs - standard to do so prior to any analysis
- Using PAUP* one can conduct a Chi-square test of
homogeneity of base frequencies across taxa
- Can also compare a newer ML model that relaxes the
assumption of homogeneity (stationarity) across taxa to one
that assumes it (eg GTR) - use the Likelihood Ratio Test to
compare (more on LRT in divergence dating lecture - see also ModelTest
documentation & Posada Chapter)
Parts of the tree seem “wrong” PAUP* - test of homogeneity
Chi-square test of homogeneity of base frequencies across taxa:
Taxon A C G T
----------------------------------------------------------
americanus2 O 535 232 214 622
E 516.51 252.92 226.80 606.76
carolinus3 O 550 225 208 637
E 521.99 255.60 229.21 613.20
concolor O 287 128 106 307
E 266.80 130.64 117.15 313.41
defodiens1 O 707 357 324 862
E 724.99 355.00 318.35 851.66
didymus O 288 123 102 315
E 266.80 130.64 117.15 313.41
encaustus O 297 118 101 310
E 266.15 130.33 116.87 312.66
guttula O 292 116 107 313
E 266.80 130.64 117.15 313.41
hispaniola1 O 685 370 347 846
E 724.34 354.69 318.06 850.91
humator2 O 611 270 252 739
E 603.19 295.36 264.87 708.58
hybridus3 O 712 365 329 841
E 724.02 354.53 317.92 850.53
interruptus O 285 120 107 317
E 267.12 130.80 117.29 313.79
----------------------------------------------------------
Chi-square = 120.233380 (df=237), P = 1.00000000
Warning: This test ignores correlation due to phylogenetic structure.
Introduction to Biosystematics - Zool 575
5
4. Assess the extent of Base Composition Bias &
Heterogeneity
If the hypothesis of homogeneity is not rejected:
- the taxa (OTUs) are said to exhibit stationarity or base
compositional equilibrium
- stationarity more likely among closely related taxa
Parts of the tree seem “wrong”
4. Assess the extent of Base Composition Bias &
Heterogeneity
If identified: may (?) be solved with
- use of transversions or amino acids instead of DNA
- use of LogDet distances (a correction for distance data
that specifically addresses base composition bias)
- but results have been mixed - some cases have not been
solved
- can be mixed with other problems, like LBA
Parts of the tree seem “wrong”
5. Test for a Covariotide / Covarion Effect
- rates of variation may vary not only among sites and among
lineages but among both simultaneously
- eg some sites are invariable in certain taxa but free to vary
in others (ie the ASRV correction which assumes the same
correction applies to all taxa may be wrong)
- Covariotide (DNA) covarion (protein) effect
- Lockhart et al. (1998) - inequality test for this
Lockhart, P.J., Steel, M.A., Barbrook, A.C., Huson, D.H., Charleston, M.A. and Howe, C.J. (1998) A covariotide model explains
apparent phylogenetic structure of oxygenic photosynthetic lineages. Molecular Biology and Evolution 15: 1183-1188.
Parts of the tree seem “wrong” Troubleshooting Phylogenies
Outline
1. Parts of the tree seem “wrong”
2. Branch support values are low
3. Signals from different datasets or partitions conflict
4. Rooting the tree is problematic
5. Computation time is too long
Sanderson, M.J. and Shaffer, H.B. (2002) Troubleshooting molecular phylogenetic analyses. Annual
Review of Ecology and Systematics 33: 49-72
1. Be aware of known BS bias
- Compare to Bayesian posterior probabilities
- Use bootstrap correction factors (not really an option - too
computationally complex - often requires squaring the
number of replicates)
2. Gather more data
- More genes (characters)
- More taxa (big literature on comparison of benefits of more
taxa versus more genes)
Branch support values are lowCummings, M.P., Otto, S. P., and
Wakely, J. (1995) Sampling
properties of DNA sequence data
in phylogenetic analysis.
Molecular Biology and Evolution
12: 814-822.
Mitochondrial genome tree
All blank nodes = 100%
Introduction to Biosystematics - Zool 575
6
Cummings, M.P., Otto, S. P., and Wakely, J. (1995) Sampling properties of DNA
sequence data in phylogenetic analysis. Molecular Biology and Evolution 12: 814-
822.
Used whole mitochondrial genomes (16,075 base pairs, 10 vertebrate species)
Found that ca. 8,000 sites were needed to get 95% accuracy
__________________________________________________________________________
Gene length MP ML NJ
__________________________________________________________________________12s rRNA 1,111 + + -
16S rRNA 1,786 - + +
ATPase6 687 - - -
ATPase8 207 - - -
COI 1,560 - + -
COII 705 - - -
COIII 785 + + -
CYTB 1,149 - + -
NADH1 981 - - -
NADH2 1,047 - - -
NADH3 350 - - -
NADH4 1,387 + - +
NADH4L 297 - - -
NADH5 1,860 - - +
NADH6 561 - - -
Identical to genome tree 3/15 5/15 3/15
___________________________________________________________________________
_
3. Check for Rogue Taxa
- One or more OTUs that are unstable can destabilize an
entire tree or portions thereof
- Can be unstable due to missing data (eg fossils) or being
long branches (high rates of change)
- In such cases the closest relative of the rogue taxon is
ambiguous
- Try consensus agreement subtrees
Branch support values are low
Long branch rogue taxon - low PP values
a “floater”
Same analysis without the rogue taxon
- PP values increase throughout tree
4. Avoid using “Fast” bootstrap methods
- PAUP* and other programs allow one to use shortcuts to
speed up bootstrapping
- These involve doing less rigorous heuristic searches -
typically using only a starting tree and no branch swapping
- Shown to lower BS values, especially in larger trees
5. Bootstrapping with ML & Bayesian - use correct models
& experiment with model complexity
Buckley, T.R. and Cunningham, C.W. (2002) The effects of nucleotide substitution model assumptions on
estimates of nonparametric bootstrap support. Molecular Biology and Evolution 19: 394-405.
Branch support values are low Troubleshooting Phylogenies
Outline
1. Parts of the tree seem “wrong”
2. Branch support values are low
3. Signals from different datasets or partitions conflict
4. Rooting the tree is problematic
5. Computation time is too long
Sanderson, M.J. and Shaffer, H.B. (2002) Troubleshooting molecular phylogenetic analyses. Annual
Review of Ecology and Systematics 33: 49-72
Introduction to Biosystematics - Zool 575
7
1. Compare signals of different partitions
- Strong but contradictory signals in different data partitions
can lead to low branch support & errors
- Most commonly used method is the ILD (Incongruence
Length Difference) test
- Compares the difference in tree lengths between a tree
based on both partitions with the sum of the tree lengths of
each partition on its own tree
Dxy = L (x+y) - (Lx + Ly)
Dxy = length difference
Signals from Different Data Sets or
Partitions conflict
- Significance is judged by creating random data partitions of
the same size from the original data set (composed of both
partitions) and comparing the Dxy of the simulated data to
the real Dxy
- If the signals are the same, there should no significant
difference
- Implemented in PAUP* - the ILD test is a common step prior
to analysis of combined data to determine if the partitions
are safe to analyze together
- Incongruence results from different signals, random error
(noise) or a combination of these
Signals from Different Data Sets or
Partitions conflict
2. Resolving conflicts among data partitions
- Conflict is frequently found as datasets grow larger
- Some (cladists, eg Kluge 1989), argue to always combine
regardless of signal conflict
- The justification being that the parsimony tree is an
explanation of all the data (and has thus been more
rigorously tested for rejection - the least falsified
hypothesis)
- Unresolved tree from combined analysis more honestly
represents the data (since we can’t know which partition
is correct)
Signals from Different Data Sets or
Partitions conflict
2. Resolving conflicts among data partitions
- Others argue to combine if the source of incongruence can
be found & removed
- Or to analyze separately
- Or to arbitrate with more data - if all partitions agree except
one, remove the troublesome partition
- See Sanderson & Shaffer 2002 for more
Signals from Different Data Sets or
Partitions conflict
Troubleshooting Phylogenies
Outline
1. Parts of the tree seem “wrong”
2. Branch support values are low
3. Signals from different datasets or partitions conflict
4. Rooting the tree is problematic
5. Computation time is too long
Sanderson, M.J. and Shaffer, H.B. (2002) Troubleshooting molecular phylogenetic analyses. Annual
Review of Ecology and Systematics 33: 49-72
- An often overlooked but critical component
- Without rooting, character evolution cannot be inferred
- Need rooted trees to infer monophyly (and thus
classification)
- Unrooted trees are found using most software - the root
branch is chosen afterwards
- Multiple outgroups can be added to an analysis to test
ingroup monophyly - the test is obviously stronger if the
outgroups are close to the ingroups
Rooting the Tree is Problematic
Introduction to Biosystematics - Zool 575
8
For example:
- If you are studying “reptiles” and use fish as an outgroup the
“reptiles” will ALWAYS be found to be monophyletic
- If you use birds as an outgroup instead the test is stronger -
if one of your outgroups is resolved as a member of your
ingroup then ingroup monophyly is rejected
Best practice, if possible:
Choose multiple outgroups as close to the ingroup as
possible
Rooting the Tree is Problematic
1. Dealing with distantly related outgroups
- Sometimes there is no outgroup close to your ingroup
- If they are very distant they may be no better than a
randomized sequence (ie all historical information is gone)
- This can lead to Long Branch Attraction in which the longest
ingroup branch joins to the root
- Test different outgroups, try them alone & in combination
- Try midpoint rooting…
Rooting the Tree is Problematic
2. What to do when no obvious outgroup is known
- When group is poorly known or at the base of the tree of life
- Try midpoint rooting - assigns the root to the midpoint of
the longest path between two terminal taxa
- Desperate method, may not identify correct root
- Compare midpoint rooting to rooting with alternate
outgroups
Rooting the Tree is Problematic Rooting the Tree is Problematic
Example: two subfamilies that are known to be sister
lineages in a family
But immediate outgroup to family is unknown
Only more distant (blue) outgroups are known
- cannot infer evolution of many characters
Troubleshooting Phylogenies
Outline
1. Parts of the tree seem “wrong”
2. Branch support values are low
3. Signals from different datasets or partitions conflict
4. Rooting the tree is problematic
5. Computation time is too long
Sanderson, M.J. and Shaffer, H.B. (2002) Troubleshooting molecular phylogenetic analyses. Annual
Review of Ecology and Systematics 33: 49-72
- Datasets are getting larger
- whole genomes are becoming available
- 500 + taxon matrices for fewer genes
- parametric (model) methods are computationally complex
- Confidence measures like bootstrapping increases the
computational time 100-fold or more
- “Besides taking increasingly long coffee breaks, several
strategies may help.”
Computation Time is Too Long
Introduction to Biosystematics - Zool 575
9
1. Use Fast(er) Algorithms
- See lecture on large dataset analyses
- heuristics modified to avoid being trapped in local optima
- newer algorithms / programs built to run faster
- run risk that accuracy / rigor is sacrificed (general tradeoff
between rigor and time)
Computation Time is Too Long
2. Reduce Complexity of Parametric Models
- Use model selection criteria to choose a model that is not
unnecessarily complex
- Use a quick method to infer a ‘good’ tree, eg MP, and
estimate ML parameters & fix these for subsequent
analysis (best to iteratively fix & search, fix again using
new tree, continue until parameters become stable)
Computation Time is Too Long
3. Sequence More Genes for the Same Taxa
- Although adding more taxa can increase accuracy it
usually slows computational time
- Adding more characters for the same taxa often reduces
the number of equally good trees & speeds up the
searches (makes the data more decisive)
- Only true if additional characters have same signal
Computation Time is Too Long
4. Estimate the “Consensus Support” tree rather than the
optimal tree
- time otherwise spent on search for optimal tree is saved
- search only for estimate of strength of signal in data
- eg Bayesian Inference MCMC
- Sanderson & Shaffer seem biased: such methods
“probably do not provide good estimates of the optimal tree”
- odd statement in light of the fact that Bayesian MCMC
typically finds same topology as ML
Computation Time is Too Long
5. Use Faster Hardware
- high performance computing - chips speeds double every
1.5 years (Moore’s Law)
- parallel processing is becoming more widely available
- parallelized software for MP, ML, and BI MCMC is
available
- all CPU nodes should be of equal speed - search will
go as slow as slowest node
Computation Time is Too Longbase composition bias
chi-square test of homogeneity of base frequencies across
taxa
stationarity / base compositional equilibrium
LogDet Distances
covariotide / covarion effect
rogue taxa
ILD test
midpoint rooting
Terms - from lecture & readings
Introduction to Biosystematics - Zool 575
10
What are four things to investigate if parts of the tree seem“wrong”?
What is the logic, when investigating possible LBA, behind testing
to see if long branches join to different places when analyzedalone?
If long branches are found to be sister taxa when analyzed using
ML and a best-fitting model, what can one say about the possibilityof LBA artifacts?
McCracken & Sorenson (2005) dealt with a problem of short
internal branches - they compared “hard” versus “soft” polytomies.Describe these two types of polytomies - which seems a better
explanation for these author!s results?
Study questions
The mitochondrial genome of vertebrates includes over 16,000
sites - based on Cummings et al (1995) approximately how many(minimum) of these sites are needed to obtain >95% confidence for
all nodes of their 10-species vertebrate phylogeny?
How might one determine if two or more data partitions are
congruent and what alternative actions might be taken (and why) ifthey are found to be incongruent?
What is the best practice for rooting a tree?
Study questions