introduction to biosystematics - zool · pdf fileintroduction to biosystematics - zool 575 2...

Introduction to Biosystematics - Zool 575

1

Announcements

1. Email me your list of 5 questions with correct

answers before April 12th (last lecture) - I will post

them on the website as a FAQs page

- Must have date of question (approx. OK)

- Question (proper spelling, grammar, etc.)

- Answer (if answer is ambiguous, or unknown,

do not use)

- Worth bonus points of maximum 2%

2. Final paper deadline changed to April 12th

Introduction to Biosystematics

Lecture 33 - Troubleshooting Molecular

Phylogenies

Derek S. Sikes University of Calgary Zool 575

Troubleshooting Phylogenies

Outline

1. Parts of the tree seem “wrong”

2. Branch support values are low

3. Signals from different datasets or partitions conflict

4. Rooting the tree is problematic

5. Computation time is too long

Sanderson, M.J. and Shaffer, H.B. (2002) Troubleshooting molecular phylogenetic analyses. Annual

Review of Ecology and Systematics 33: 49-72

This speaks to the importance of knowing:

1. What all past investigations have suggested

concerning the phylogeny

2. What morphology (& biogeography, etc.) suggest

concerning the phylogeny

How else would one ascertain something “seems wrong”?

Probably many errors published because this information

was not used

Parts of the tree seem “wrong”

Example - my lab:

mtDNA (COI + COII)

Species seems to be in

“wrong” place

Contamination?

Tube mix up?

Misidentification?

Copy-paste error?

How could one even

suspect without

morphology (or other

information)?

Either the results are wrong or the other information is

wrong

- Always investigate both possibilities:

- Problems with data / analysis

(comparatively common)

- Long-held traditional views on relationships, or

morphological data, etc. are sometimes incorrect

(comparatively rare)



2

Onus is on the investigator to show results are strongly

supported and alternative explanations have been rejected

• Check for pre-processing errors:

- alignment

a) compare manual to computer alignments

b) use 2ndary structure or amino acid sequence

c) delete / ignore ambiguous regions

d) do analyses with different alignments of same

data & compare

- orthology / paralogy

a) use single-copy genes, gel-extract desired band

- lab error (tube mix up, misidentification, copy/paste)

a) use multiple samples one level below level of interest


Onus is on the investigator to show results are strongly

supported and alternative explanations have been rejected

2. Sample more genes with different substitution rates:

- Compare faster with slower genes

a) deep level phylogenies - use slower genes

b) shallow level phylogenies - use faster genes

c) medium-level - use a mixture of genes of different rates

- Compare nuclear with mitochondrial (or chloroplast)

McCracken, K.G. and Soreson, M.D. (2005) Is Homoplasy or Lineage Sorting the Source of Incongruent

mtDNA and Nuclear Gene Trees in the Stiff-Tailed Ducks (Nomonyx-Oxyura)? Syst Biol 54: 35-55.


3. Check if rate heterogeneity is sufficient to cause LBA

“Detecting phylogenetic problems that are in the Felsenstein

zone is a lot like detecting black holes;

such phenomena can only be inferred indirectly.”

- J. Huelsenbeck, Sys. Bio. 1998: p 526


3. Check if rate heterogeneity is sufficient to cause LBA

How to tell if your analysis is being misled by Long Branch

Attraction - (Huelsenbeck 1998; Siddal & Whiting 1999)

1. Are there long branches?

2. If so, are they joined in Parsimony analyses?

3. Are they joined by Maximum Likelihood methods?

4. Do they join to different branches in each other’s absence?

5. Are the branches long enough to attract?


1. Are there long branches?

0.826493

0.590521

Use ML or BI to estimate branch lengths

- not MP (which can underestimate or be

equivocal about branch lengths)

2. Are they joined in Parsimony analyses?

97

1 of 4 shortest trees L=75

With strong support …

But if MP doesn’t join - then LBA is

rejected

LBA due to model

violation


3

0.826493

0.590521

3. Are they joined by Maximum Likelihood methods?

If not with ML but with MP

- then possible LBA affecting MP

- or relationship is correct (Farris zone)

If with ML & MP

- then possible LBA affecting MP & ML

- or relationship is correct

- Model violation! - Make sure

model is “best-fitting”

4. Do they join to different branches in each

other’s absence?

1 of 12 shortest MP trees

Parsimony

If they join to the same place in

each other’s absence it cannot

be due to LBA

4. Do they join to different branches in each

other’s absence?

1 of 206 shortest trees

Parsimony


This can be answered using simulated data

What if…

Data had evolved

following these same

branch lengths?

On a tree in which the

long branches are

separate?


Methods - example:

6 trees with the long branches separate

1 tree with them together

(built from 2 max lnL trees of MCMC with each long branch alone)

100 simulated DNA datasets per tree of lengths

50, 100, 250, 500, 1000 base-pairs

7 x 5 x 100 = 3500 per model, 3 models = 10500

analyses, took 170 CPU days


Methods - example:

Data generated with program Seq. Gen. v1.27

using the simplest (and least realistic) model

of evolution (Jukes Cantor)

equal base frequencies, no ASRV, equal

substitution probabilities

Morphological data simulated from JC69

datasets - removed constant characters


4

Recorded # times k + r

Monophyletic

Parsimony Bootstrap >83%

Bayesian PP >89%

~ 95% Probability

(Erixon et al. 2003)

Felsenstein zone Tree 5

4. Assess the extent of Base Composition Bias &

Heterogeneity - another source of phylogenetic error

- Some genes are biased in favor of AT or GC

- This compositional bias alone has explained some erroneous

results

- Some cases are so bad even the use of amino acids is not

free of the bias & yields erroneous results

Foster, P. G. & Hickey D. A. (1999) Compositional bias may affect both DNA-based and protein-based

phylogenetic reconstructions. J Mol Evol 48: 284-290.



Heterogeneity

Foster, P. G. & Hickey D. A. (1999) Compositional bias may affect both DNA-based and protein-based

phylogenetic reconstructions. J Mol Evol 48: 284-290.


Correct tree:

ML tree - wrong due to

base composition bias


Heterogeneity

- Can compare base composition across taxa in various

programs - standard to do so prior to any analysis

- Using PAUP* one can conduct a Chi-square test of

homogeneity of base frequencies across taxa

- Can also compare a newer ML model that relaxes the

assumption of homogeneity (stationarity) across taxa to one

that assumes it (eg GTR) - use the Likelihood Ratio Test to

compare (more on LRT in divergence dating lecture - see also ModelTest

documentation & Posada Chapter)

Parts of the tree seem “wrong” PAUP* - test of homogeneity

Chi-square test of homogeneity of base frequencies across taxa:

Taxon A C G T

----------------------------------------------------------

americanus2 O 535 232 214 622

E 516.51 252.92 226.80 606.76

carolinus3 O 550 225 208 637

E 521.99 255.60 229.21 613.20

concolor O 287 128 106 307

E 266.80 130.64 117.15 313.41

defodiens1 O 707 357 324 862

E 724.99 355.00 318.35 851.66

didymus O 288 123 102 315

E 266.80 130.64 117.15 313.41

encaustus O 297 118 101 310

E 266.15 130.33 116.87 312.66

guttula O 292 116 107 313

E 266.80 130.64 117.15 313.41

hispaniola1 O 685 370 347 846

E 724.34 354.69 318.06 850.91

humator2 O 611 270 252 739

E 603.19 295.36 264.87 708.58

hybridus3 O 712 365 329 841

E 724.02 354.53 317.92 850.53

interruptus O 285 120 107 317

E 267.12 130.80 117.29 313.79

----------------------------------------------------------

Chi-square = 120.233380 (df=237), P = 1.00000000

Warning: This test ignores correlation due to phylogenetic structure.


5


Heterogeneity

If the hypothesis of homogeneity is not rejected:

- the taxa (OTUs) are said to exhibit stationarity or base

compositional equilibrium

- stationarity more likely among closely related taxa



Heterogeneity

If identified: may (?) be solved with

- use of transversions or amino acids instead of DNA

- use of LogDet distances (a correction for distance data

that specifically addresses base composition bias)

- but results have been mixed - some cases have not been

solved

- can be mixed with other problems, like LBA


5. Test for a Covariotide / Covarion Effect

- rates of variation may vary not only among sites and among

lineages but among both simultaneously

- eg some sites are invariable in certain taxa but free to vary

in others (ie the ASRV correction which assumes the same

correction applies to all taxa may be wrong)

- Covariotide (DNA) covarion (protein) effect

- Lockhart et al. (1998) - inequality test for this

Lockhart, P.J., Steel, M.A., Barbrook, A.C., Huson, D.H., Charleston, M.A. and Howe, C.J. (1998) A covariotide model explains

apparent phylogenetic structure of oxygenic photosynthetic lineages. Molecular Biology and Evolution 15: 1183-1188.

Parts of the tree seem “wrong” Troubleshooting Phylogenies

Outline








1. Be aware of known BS bias

- Compare to Bayesian posterior probabilities

- Use bootstrap correction factors (not really an option - too

computationally complex - often requires squaring the

number of replicates)

2. Gather more data

- More genes (characters)

- More taxa (big literature on comparison of benefits of more

taxa versus more genes)

Branch support values are lowCummings, M.P., Otto, S. P., and

Wakely, J. (1995) Sampling

properties of DNA sequence data

in phylogenetic analysis.

Molecular Biology and Evolution

12: 814-822.

Mitochondrial genome tree

All blank nodes = 100%


6

Cummings, M.P., Otto, S. P., and Wakely, J. (1995) Sampling properties of DNA

sequence data in phylogenetic analysis. Molecular Biology and Evolution 12: 814-

822.

Used whole mitochondrial genomes (16,075 base pairs, 10 vertebrate species)

Found that ca. 8,000 sites were needed to get 95% accuracy

__________________________________________________________________________

Gene length MP ML NJ

__________________________________________________________________________12s rRNA 1,111 + + -

16S rRNA 1,786 - + +

ATPase6 687 - - -

ATPase8 207 - - -

COI 1,560 - + -

COII 705 - - -

COIII 785 + + -

CYTB 1,149 - + -

NADH1 981 - - -

NADH2 1,047 - - -

NADH3 350 - - -

NADH4 1,387 + - +

NADH4L 297 - - -

NADH5 1,860 - - +

NADH6 561 - - -

Identical to genome tree 3/15 5/15 3/15

___________________________________________________________________________

_

3. Check for Rogue Taxa

- One or more OTUs that are unstable can destabilize an

entire tree or portions thereof

- Can be unstable due to missing data (eg fossils) or being

long branches (high rates of change)

- In such cases the closest relative of the rogue taxon is

ambiguous

- Try consensus agreement subtrees

Branch support values are low

Long branch rogue taxon - low PP values

a “floater”

Same analysis without the rogue taxon

- PP values increase throughout tree

4. Avoid using “Fast” bootstrap methods

- PAUP* and other programs allow one to use shortcuts to

speed up bootstrapping

- These involve doing less rigorous heuristic searches -

typically using only a starting tree and no branch swapping

- Shown to lower BS values, especially in larger trees

5. Bootstrapping with ML & Bayesian - use correct models

& experiment with model complexity

Buckley, T.R. and Cunningham, C.W. (2002) The effects of nucleotide substitution model assumptions on

estimates of nonparametric bootstrap support. Molecular Biology and Evolution 19: 394-405.

Branch support values are low Troubleshooting Phylogenies

Outline









7

1. Compare signals of different partitions

- Strong but contradictory signals in different data partitions

can lead to low branch support & errors

- Most commonly used method is the ILD (Incongruence

Length Difference) test

- Compares the difference in tree lengths between a tree

based on both partitions with the sum of the tree lengths of

each partition on its own tree

Dxy = L (x+y) - (Lx + Ly)

Dxy = length difference

Signals from Different Data Sets or

Partitions conflict

- Significance is judged by creating random data partitions of

the same size from the original data set (composed of both

partitions) and comparing the Dxy of the simulated data to

the real Dxy

- If the signals are the same, there should no significant

difference

- Implemented in PAUP* - the ILD test is a common step prior

to analysis of combined data to determine if the partitions

are safe to analyze together

- Incongruence results from different signals, random error

(noise) or a combination of these


Partitions conflict

2. Resolving conflicts among data partitions

- Conflict is frequently found as datasets grow larger

- Some (cladists, eg Kluge 1989), argue to always combine

regardless of signal conflict

- The justification being that the parsimony tree is an

explanation of all the data (and has thus been more

rigorously tested for rejection - the least falsified

hypothesis)

- Unresolved tree from combined analysis more honestly

represents the data (since we can’t know which partition

is correct)


Partitions conflict

2. Resolving conflicts among data partitions

- Others argue to combine if the source of incongruence can

be found & removed

- Or to analyze separately

- Or to arbitrate with more data - if all partitions agree except

one, remove the troublesome partition

- See Sanderson & Shaffer 2002 for more


Partitions conflict


Outline








- An often overlooked but critical component

- Without rooting, character evolution cannot be inferred

- Need rooted trees to infer monophyly (and thus

classification)

- Unrooted trees are found using most software - the root

branch is chosen afterwards

- Multiple outgroups can be added to an analysis to test

ingroup monophyly - the test is obviously stronger if the

outgroups are close to the ingroups

Rooting the Tree is Problematic


8

For example:

- If you are studying “reptiles” and use fish as an outgroup the

“reptiles” will ALWAYS be found to be monophyletic

- If you use birds as an outgroup instead the test is stronger -

if one of your outgroups is resolved as a member of your

ingroup then ingroup monophyly is rejected

Best practice, if possible:

Choose multiple outgroups as close to the ingroup as

possible


1. Dealing with distantly related outgroups

- Sometimes there is no outgroup close to your ingroup

- If they are very distant they may be no better than a

randomized sequence (ie all historical information is gone)

- This can lead to Long Branch Attraction in which the longest

ingroup branch joins to the root

- Test different outgroups, try them alone & in combination

- Try midpoint rooting…


2. What to do when no obvious outgroup is known

- When group is poorly known or at the base of the tree of life

- Try midpoint rooting - assigns the root to the midpoint of

the longest path between two terminal taxa

- Desperate method, may not identify correct root

- Compare midpoint rooting to rooting with alternate

outgroups

Rooting the Tree is Problematic Rooting the Tree is Problematic

Example: two subfamilies that are known to be sister

lineages in a family

But immediate outgroup to family is unknown

Only more distant (blue) outgroups are known

- cannot infer evolution of many characters


Outline








- Datasets are getting larger

- whole genomes are becoming available

- 500 + taxon matrices for fewer genes

- parametric (model) methods are computationally complex

- Confidence measures like bootstrapping increases the

computational time 100-fold or more

- “Besides taking increasingly long coffee breaks, several

strategies may help.”

Computation Time is Too Long


9

1. Use Fast(er) Algorithms

- See lecture on large dataset analyses

- heuristics modified to avoid being trapped in local optima

- newer algorithms / programs built to run faster

- run risk that accuracy / rigor is sacrificed (general tradeoff

between rigor and time)


2. Reduce Complexity of Parametric Models

- Use model selection criteria to choose a model that is not

unnecessarily complex

- Use a quick method to infer a ‘good’ tree, eg MP, and

estimate ML parameters & fix these for subsequent

analysis (best to iteratively fix & search, fix again using

new tree, continue until parameters become stable)


3. Sequence More Genes for the Same Taxa

- Although adding more taxa can increase accuracy it

usually slows computational time

- Adding more characters for the same taxa often reduces

the number of equally good trees & speeds up the

searches (makes the data more decisive)

- Only true if additional characters have same signal


4. Estimate the “Consensus Support” tree rather than the

optimal tree

- time otherwise spent on search for optimal tree is saved

- search only for estimate of strength of signal in data

- eg Bayesian Inference MCMC

- Sanderson & Shaffer seem biased: such methods

“probably do not provide good estimates of the optimal tree”

- odd statement in light of the fact that Bayesian MCMC

typically finds same topology as ML


5. Use Faster Hardware

- high performance computing - chips speeds double every

1.5 years (Moore’s Law)

- parallel processing is becoming more widely available

- parallelized software for MP, ML, and BI MCMC is

available

- all CPU nodes should be of equal speed - search will

go as slow as slowest node

Computation Time is Too Longbase composition bias

chi-square test of homogeneity of base frequencies across

taxa

stationarity / base compositional equilibrium

LogDet Distances

covariotide / covarion effect

rogue taxa

ILD test

midpoint rooting

Terms - from lecture & readings


10

What are four things to investigate if parts of the tree seem“wrong”?

What is the logic, when investigating possible LBA, behind testing

to see if long branches join to different places when analyzedalone?

If long branches are found to be sister taxa when analyzed using

ML and a best-fitting model, what can one say about the possibilityof LBA artifacts?

McCracken & Sorenson (2005) dealt with a problem of short

internal branches - they compared “hard” versus “soft” polytomies.Describe these two types of polytomies - which seems a better

explanation for these author!s results?

Study questions

The mitochondrial genome of vertebrates includes over 16,000

sites - based on Cummings et al (1995) approximately how many(minimum) of these sites are needed to obtain >95% confidence for

all nodes of their 10-species vertebrate phylogeny?

How might one determine if two or more data partitions are

congruent and what alternative actions might be taken (and why) ifthey are found to be incongruent?

What is the best practice for rooting a tree?

Study questions

introduction to biosystematics - zool · pdf fileintroduction to biosystematics - zool 575 2...

Documents