Building Trees

What is a Tree?

• A tree is a visualization of a mathematical comparison of characteristics across multiple individuals or species. The entities compared can also be tissues or developmental stages, as in microarray analysis.

• The closer branches share more similarities and the more distant branches are less similar.

Phylogeny (phylo = tribe + genesis)

1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)

2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

Start with a group of species and establish relationships based on measurements.

[Figure: crocodiles, birds, lizards, snakes, rodents, primates, marsupials]

This is an example of a phylogenetic tree.

Homology & Similarity

• Homology

– Conserved sequences arising from a common ancestor

– Orthologs: homologous genes that share a common ancestor in the absence of any gene duplication (mouse and human hemoglobin)

– Paralogs: genes related through gene duplication (one gene is a copy of another, e.g. fetal and adult hemoglobin)

• Similarity

– Genes that share common sequences but are not necessarily related

Sequences As Modules

• Proteins are derived from a limited number of basic building blocks (Modules)

• Evolution has shuffled these modules giving rise to a diverse repertoire of protein sequences

• Proteins can share global relationships, or local relationships specific to a single DOMAIN

[Figure: global versus local relationships between sequences]

Sequence Domains

Modules Define Functional/Structural Domains

Defining A Sequence Family

[Figure: sequences grouped into Family A, Family B, Family C, Family D, and Family E]

Global vs. Local Alignments

• Global

– Search for alignments matching over entire sequences

• Local

– Examine regions of sequence for conserved segments

• Both consider: Matches, Mismatches, Gaps

Global Sequence Alignments

Yeast Prion-Like Proteins

How To Make A Global MSA

• On The Web

– http://pir.georgetown.edu/pirwww/search/multaln.html

• On Your Computer

– ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/

MSA Example Sequences

>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH

>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL

>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY

>MATK_HUMAN
WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY

>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY

>CRKL_HUMAN
WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY

>YES_XIPHE
WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY

>FGR_HUMAN
WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY

>SRC_RSVP
WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY

Standard FASTA Sequence Format

MSA Example Result

YES_XIPHE  WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN  WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP   WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK  WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG   WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE
Conservation: :: . : :: : * :*:* * : * : ** :

YES_XIPHE  DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN  DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP   YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN -DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK  -SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG   KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN LNGTYAIAGGRTHASPADLCHYH
Conservation: * . : .

Steps to Build Trees from MSA

1) Identify taxa to be considered

2) Choose characters (independent, “unit”)

3) Construct a character matrix with one row for each taxon

4) After performing the alignment, use a mathematical formula to describe the degree of similarity for each pair of taxa, e.g. the simple matching coefficient:

S = (# matches) / (total # of characters)

Steps to Build Trees

5) construct matrix with pairwise S values

6) use clustering technique to produce a tree (dendrogram)

• Unweighted/Equal weighting = all characters given equal consideration

– UPGMA (Unweighted Pair Group Method with Arithmetic Averaging)

– Neighbour-joining

• Unweighting is a form of weighting

Character Matrix

Taxon   1  2  3  4  5  6  7  8  9  10
A       0  1  1  0  0  0  1  1  1  0
B       0  0  0  1  1  1  0  1  1  1
C       0  0  1  0  0  1  0  0  0  1
D       1  1  0  0  0  1  1  1  1  0

S-value Matrix

Taxon   A     B     C     D
A       --    0.3   0.4   0.7
B             --    0.5   0.4
C                   --    0.3
D                         --

Building Matrices
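The S-value matrix can be recomputed from the character matrix. The sketch below is a minimal illustration of the simple matching coefficient; the names `characters`, `simple_matching`, and `s_values` are chosen here for illustration and do not come from the slides.

```python
from itertools import combinations

# Character matrix from the slide: 10 binary characters per taxon.
characters = {
    "A": [0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    "B": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
    "C": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "D": [1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
}


def simple_matching(x, y):
    """S = (# matching characters) / (total # of characters)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


# One S value for every unordered pair of taxa.
s_values = {
    (p, q): simple_matching(characters[p], characters[q])
    for p, q in combinations(characters, 2)
}

for pair, s in s_values.items():
    print(pair, s)
```

Each pair reproduces the corresponding cell of the S-value matrix (for example, A and D share 7 of 10 characters).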

Joining Clusters into a Tree

Closest: A&D = 0.7

2nd closest: B&C = 0.5

When does A&D join B&C?

[(A&B) + (A&C) + (D&B) + (D&C)] / 4 = (0.3 + 0.4 + 0.4 + 0.3) / 4 = 0.35
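The joining order can be reproduced with a small average-linkage loop. This is a sketch in the spirit of UPGMA, applied directly to the S values (similarities) rather than to distances; the function names are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

# Pairwise S values from the S-value matrix (higher = more similar).
S = {
    frozenset("AB"): 0.3, frozenset("AC"): 0.4, frozenset("AD"): 0.7,
    frozenset("BC"): 0.5, frozenset("BD"): 0.4, frozenset("CD"): 0.3,
}


def average_similarity(c1, c2):
    """Unweighted mean of S over all taxon pairs spanning two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(S[frozenset((a, b))] for a, b in pairs) / len(pairs)


def cluster(taxa):
    """Repeatedly join the two most similar clusters; return the join order."""
    clusters = [(t,) for t in taxa]
    joins = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(*pair))
        level = round(average_similarity(c1, c2), 2)
        joins.append((c1, c2, level))
        # Replace the two joined clusters by their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins


joins = cluster("ABCD")
for c1, c2, level in joins:
    print(c1, "+", c2, "at S =", level)
```

The three joins occur at S = 0.7 (A with D), 0.5 (B with C), and 0.35 (the two pairs), matching the arithmetic above.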

Problems

• Different methods or characters = different dendrograms

• If we use all possible characteristics this would be a natural classification

• The tree is an accurate phylogeny if differences in characters between taxa are proportional to the time elapsed since their common ancestor

Convergent Evolution

• Similar phenotypic response to similar ecological conditions

• Different developmental pathways

Reversal of Evolution

• An altered character reverts to the ancestral form.

• In a DNA molecule, a nucleotide position may change from a C to a T and then back to a C. This frog reverted to teeth.

Trees are hypotheses about evolutionary history

• Different methods may result in different trees.

• How to choose between the different models?

• One way is to compare different types of character data and see if the trees make sense.

Haplotype Network in 3 Elephant Species with 3 DNA sequences

Parsimonious choices reflect fewer changes

• The assumptions of parsimony

– Reversals and convergence require more changes

– Parsimonious trees represent best estimates of phylogenetic relationships

Use of DNA, RNA, or Protein

• For phylogeny, DNA can be more informative.

– The protein-coding portion of DNA has synonymous and nonsynonymous substitutions.

– Some DNA changes do not have corresponding protein changes.

– See arrows 14, 21, 25, 27, 29 in the retinol-binding protein figure.

For phylogeny, DNA can be more informative.

• If the synonymous substitution rate (dS) is greater than the nonsynonymous substitution rate (dN), the DNA sequence is under negative (purifying) selection. This limits change in the sequence.

• If dS < dN, positive selection occurs. For example, a duplicated gene may evolve rapidly to assume new functions.
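The two rules above amount to a comparison of the two rates. A minimal sketch follows; the function name and the label returned when dN equals dS are assumptions added here, and the rate values in the usage line are hypothetical.

```python
def selection_regime(dn, ds):
    """Compare nonsynonymous (dN) and synonymous (dS) substitution rates."""
    if ds > dn:
        return "negative (purifying) selection"
    if ds < dn:
        return "positive selection"
    # The slides do not cover this case; the label is an assumption.
    return "no difference between dN and dS"


# Hypothetical rates, for illustration only.
print(selection_regime(dn=0.02, ds=0.10))
```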

Models of nucleotide substitution: Transitions > Transversions

[Figure: A↔G and C↔T are transitions; A↔C, A↔T, G↔C, and G↔T are transversions]

• Some substitutions in a DNA sequence alignment can be directly observed:

– single nucleotide substitutions

– sequential substitutions

– coincidental substitutions

• Additional mutational events can be inferred by analysis of ancestral sequences. These changes include

– parallel substitutions

– convergent substitutions

– back substitutions

Advantages of DNA

• Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny.

• See Figure 11.10 (arrows 4-10 and 35-38)

• Pseudogenes (nonfunctional genes) are studied by molecular phylogeny

• Rates of transitions and transversions can be measured.

• Transitions: purine-to-purine (A to G) or pyrimidine-to-pyrimidine (C to T) substitutions

• Transversions: purine-to-pyrimidine substitutions, or the reverse
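Counting transitions and transversions between two aligned sequences can be sketched as follows. The function names and the two short example sequences are hypothetical, and the code assumes gap-free sequences of equal length over A, C, G, T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Return 'transition', 'transversion', or None when the bases match."""
    if x == y:
        return None
    pair = {x, y}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1, seq2):
    """Tally observed differences between two aligned sequences by type."""
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(seq1, seq2):
        kind = classify_substitution(x, y)
        if kind is not None:
            counts[kind] += 1
    return counts


# Hypothetical sequences: A->G is a transition, G->T is a transversion.
print(count_substitutions("ACGT", "GCTT"))
```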

Protein sequences are also used for phylogeny

• Proteins have 20 states (amino acids) instead of only four for DNA, so there is more phylogenetic information.

• Nucleotides are unordered characters: any one nucleotide can change to any other in one step.

• An ordered character must pass through one or more intermediate states before reaching the final state.

• Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.

Amino acid sequences

• From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence

• Some amino acids can replace one another with relatively little effect on the structure and function of the final protein, while other replacements can be functionally devastating

• Tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks are used: PAM and BLOSUM

Sequence-Based Comparisons

• Identify sequences within an organism that are related to each other and/or across different species

– Within: Fetal and adult hemoglobin

– Across: Human and chimpanzee hemoglobin

• Generate an evolutionary history of related genes

• Locate insertions, deletions, and substitutions that have occurred during evolution

[Figure: the words CREATE, CREASE, RELAPSE, and GREASER treated as related sequences, labeled [Ancestor] and [Progenitors]]

Letter key: (C) Cysteine, (R) Arginine, (E) Glutamate, (A) Alanine, (T) Threonine, (S) Serine, (L) Leucine, (P) Proline, (G) Glycine

Multiple Sequence Alignments

• Place residues in columns that are derived from a common ancestral residue

• Identify Matches, Mismatches, and Gaps

• MSA can reveal sequence patterns

– Demonstration of homology between >2 sequences

– Identification of functionally important sites

– Protein function prediction

– Structure prediction

Unaligned sequences: CREATE, CREASE, GREASER, RELAPSE

Aligned sequences:

        123456789
SeqA    CRE-A-TE-
SeqB    CRE-A-SE-
SeqC    -RELAPSE-
SeqD    GRE-A-SER

MSA and Tree Relationship

• “The optimal alignment of several sequences can be thought of as minimizing the number of mutational steps in an evolutionary tree for which the sequences are the leaves” (Mount, 2001)

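[Figure: the alignment of SeqA (CRE-A-TE-), SeqB (CRE-A-SE-), SeqC (-RELAPSE-), and SeqD (GRE-A-SER) placed at the leaves of a tree whose internal nodes are CREATE, CREASE, and GREASE; the branches are labeled with the mutational steps T to S, C to G, +R, +L, +P, and -G]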

Multiple Sequence Alignments

• Confirm that all sequences are homologous

• Adjust gap creation and extension penalties as needed to optimize the alignment

• Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data).

• Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)

Problems in Reconstructing Phylogeny

• Characters sometimes conflict

• It is sometimes difficult to tell homology from homoplasy

– Analogy: characters similar because of convergent evolution

– Reversal: character reverts to ancestral form

• With morphological characters, careful examination may distinguish homoplasy (analogy) from homology

• With molecular characters (DNA/protein sequences), homoplasy is sometimes impossible to distinguish from homology, and orthologs can be hard to tell apart from paralogs.

A Phylogenetic Tree

• Taxon -- Any named group of organisms – evolutionary theory not necessarily involved.

• Clade -- A monophyletic taxon (evolutionary theory utilized)

A phylogenetic tree with branch lengths

• Branch length can be significant.

• In this case it is: mouse is slightly more similar to fly than human is to fly (the sum of branches 1+2+3 is less than the sum of branches 1+2+4).

[Figure: rooted tree of taxa A, B, C, D, E with its parts labeled]

• Ancestral Node or ROOT of the Tree

• Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa)

• Branches or Lineages

• Terminal Nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny

Common Phylogenetic Tree Terminology

Phylogenetic trees diagram the evolutionary relationships between the taxa

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses

[Figure: rooted tree with tips Taxon A, Taxon B, Taxon C, Taxon D, and Taxon E]

No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.

The dimension along the branches either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).

The tree and the nested parentheses both say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
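The nested-parentheses notation can be read mechanically. The sketch below parses ((A,(B,C)),(D,E)) into nested tuples and lists the clade under each internal node; it is a minimal reader that assumes no branch lengths and no trailing semicolon, and the function names are illustrative.

```python
def parse_nested(text):
    """Parse nested parentheses such as ((A,(B,C)),(D,E)) into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos                       # a taxon name
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def clades(tree):
    """Return the set of taxa below each internal node (one set per clade)."""
    found = []

    def leaves(n):
        if isinstance(n, str):
            return {n}
        members = set().union(*(leaves(child) for child in n))
        found.append(frozenset(members))
        return members

    leaves(tree)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

B and C form a clade, as do D and E and the group A, B, C; the pair A and B alone does not appear in the list.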

[Figure: the same four-taxon tree (Taxon A to Taxon D) drawn three ways. First: horizontal axis = genetic change, with branch lengths 1, 1, 1, 6, 3, 5. Second: horizontal axis = time. Third: horizontal axis has no meaning.]

Three types of trees

Cladogram Phylogram Ultrametric tree

All show the same evolutionary relationships, or branching orders, between the taxa.

Types of trees: Cladogram

[Figure: cladogram (no time scale) of Pagurus bernhardus, Pagurus acadianus, Elassochirus tenuimanus, Labidochirus splendescens, Lithodes aequispina, Paralithodes camtschatica, Pagurus pollicaris (NE), Pagurus pollicaris (GU), Pagurus longicarpus (NE), Pagurus longicarpus (GU), Clibanarius vittatus, Coenobita sp., and Artemia salina, with internal nodes t1, t2, and t3]

• A cladogram shows relative recency of common descent.

• It does not imply that ancestors on the same line necessarily speciated at the same time.

• t1 can be before or after t2, but not before t3.

[Figure: phylogram of the same taxa, from Pagurus bernhardus to Artemia salina; scale bar = 0.05]

Phylogram (additive tree: branch lengths can be summed): shows relative recency of common descent, and branch lengths = amount of change

Types of trees: Phylogram

[Figure: ultrametric tree of the same taxa, from Pagurus bernhardus to Artemia salina; axis ticks at 0.00, 0.05, 0.10, and 0.15]

Ultrametric tree (linearized tree): amount of change can be scaled to time

Types of trees: Ultrametric

Scale = time (axis: divergence)

All tree tips are equidistant from the root

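[Figure: three trees of taxa A to E. Left: completely unresolved or "star" phylogeny. Middle: partially resolved phylogeny. Right: fully resolved, bifurcating phylogeny. A node with more than two descendant branches is a polytomy or multifurcation; a node with exactly two is a bifurcation.]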

The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees

There are three possible unrooted trees for four species (A, B, C, D)

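Tree 1: A and B on one side of the internal branch, C and D on the other

Tree 2: A and C on one side of the internal branch, B and D on the other

Tree 3: A and D on one side of the internal branch, B and C on the other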

1. The goal of phylogenetic tree building methods is to discover which of the possible unrooted trees is "correct".

2. This should be the “true” biological tree, accurately representing the evolutionary history of the species.

3. However, it is only possible to discover the computationally correct or optimal tree for the phylogenetic method of choice.

The number of unrooted trees increases in a greater than exponential manner with number of species (taxa)

(2N - 5)!! = # unrooted trees for N taxa
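The double factorial grows quickly, which a few lines of code make concrete. This sketch also counts rooted trees as (2N - 3)!!, which follows because each unrooted tree has (2N - 3) branches on which a root can be placed; the function names are illustrative.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 (taken as 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of fully resolved unrooted trees: (2N - 5)!!"""
    return double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of fully resolved rooted trees: (2N - 3)!!"""
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 6, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Four taxa give 3 unrooted and 15 rooted trees; ten taxa already give 2,027,025 unrooted and 34,459,425 rooted trees.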

[Figure: example unrooted trees for 3, 4, 5, and 6 taxa (A to F)]

Inferring evolutionary relationships between the taxa requires rooting the tree:

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:

[Figure: an unrooted tree of taxa A, B, C, D with a marked root position, and the rooted tree obtained from it]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Try it again with the root at another position

[Figure: the same unrooted tree of taxa A, B, C, D with the root at a different position, and the rooted tree obtained from it]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: the unrooted tree 1 (A and B on one side, C and D on the other) with its five branches numbered 1 to 5, and the five rooted trees 1a, 1b, 1c, 1d, and 1e obtained by placing the root on branch 1, 2, 3, 4, or 5]

These trees show five different evolutionary relationships among the taxa!

Trick Question Warning

• Sometimes two trees may look very different but, in fact, differ only in the position of the root.

• Don’t forget rotational symmetry!

All of these rearrangements show the same evolutionary relationships between the taxa

[Figure: several rotated and mirrored drawings of rooted tree 1a (taxa A, B, C, D)]

There are two major ways to root trees

By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

[Figure: tree of taxa A, B, C, D with branch lengths 10, 2, 3, 5, and 2; the outgroup is marked]

d (A,D) = 10 + 3 + 5 = 18

Midpoint = 18 / 2 = 9
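The midpoint calculation can be sketched as a walk along the path between the two most distant taxa. The function name is illustrative, and the code assumes the branch lengths are listed in order from A to D.

```python
def midpoint_on_path(branch_lengths):
    """Locate the midpoint of a path given its branch lengths in order.

    Returns (index of the branch holding the midpoint, distance into it).
    """
    half = sum(branch_lengths) / 2
    walked = 0
    for i, length in enumerate(branch_lengths):
        if walked + length >= half:
            return i, half - walked
        walked += length
    raise ValueError("path has no branches")


# Path from A to D in the example: 10 + 3 + 5 = 18, so the midpoint is at 9.
print(midpoint_on_path([10, 3, 5]))
```

With these lengths the root falls on the first branch (the one leading to A), 9 units from A.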

Rooting Using an Outgroup

• The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other.

• It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition.

• The root must be somewhere between the outgroup and the rest (either on the node or in a branch).

Automatic Rooting

• Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot)

• This normally involves assumptions… BE AWARE what those are.

[Figure: example unrooted trees for 4, 5, and 6 taxa; number of unrooted trees x number of branches = number of rooted trees]

(2N - 3)!! = # rooted trees for N taxa

Each unrooted tree theoretically can be rooted anywhere along any of its branches

Molecular phylogenetic tree building methods

Mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available, each with strengths and weaknesses.

COMPUTATIONAL METHOD by DATA TYPE

DATA TYPE     Optimality criterion                                       Clustering algorithm
Characters    PARSIMONY
Distances     MINIMUM EVOLUTION, LEAST SQUARES (FITCH & MARGOLIASH)      UPGMA, NEIGHBOR-JOINING

Types of data used in phylogenetic inference

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

            A     B     C     D     E
Species A   ----  0.20  0.50  0.45  0.40
Species B   0.23  ----  0.40  0.55  0.50
Species C   0.87  0.59  ----  0.15  0.40
Species D   0.73  1.12  0.17  ----  0.25
Species E   0.59  0.89  0.61  0.31  ----

Example 1 (above the diagonal): Uncorrected “p” distance (= observed percent sequence difference)

Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
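Both distances can be computed from the five example sequences. The sketch below uses the standard Kimura 2-parameter formula, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions; the formula is not written out on the slides, and the function names are illustrative.

```python
import math

sequences = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(s1, s2):
    """Uncorrected p distance: proportion of sites that differ."""
    return sum(1 for a, b in zip(s1, s2) if a != b) / len(s1)


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance from transition/transversion proportions."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    p = transitions / len(s1)
    q = transversions / len(s1)
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


for x, y in [("A", "B"), ("C", "D"), ("B", "D")]:
    print(x, y,
          round(p_distance(sequences[x], sequences[y]), 2),
          round(k2p_distance(sequences[x], sequences[y]), 2))
```

For A versus B this gives 0.20 and 0.23, the two values shown for that pair in the matrix; the correction grows with divergence (B versus D: 0.55 uncorrected, 1.12 corrected).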

Similarity vs. Evolutionary Relationship

Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.

Similar: having likeness or resemblance (an observation)

Related: genetically connected (an historical fact)

Two taxa can be most similar without being most closely-related:

[Figure: tree of Taxon A, Taxon B, Taxon C, and Taxon D with branch lengths 1, 1, 1, 6, 3, and 5]

C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).

Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are ‘ultrametric’).

Types of Similarity

Observed similarity between two entities can be due to:

Evolutionary relationship:

• Shared ancestral characters (‘plesiomorphies’)

• Shared derived characters (‘synapomorphies’)

Homoplasy (independent evolution of the same character):

• Convergent events (in either related or unrelated entities)

• Parallel events (in related entities)

• Reversals (in related entities)

[Figure: small trees with C, G, and T character states at the tips, illustrating each type of similarity]

METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:

Property 1: d (a, b) ≥ 0 (Non-negativity)

Property 2: d (a, b) = d (b, a) (Symmetry)

Property 3: d (a, b) = 0 if and only if a = b (Distinctness)

Property 4: d (a, c) ≤ d (a, b) + d (b, c) (Triangle inequality)

[Figure: triangle a, b, c with side lengths 6, 9, and 5]

ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:

Property 5: d (a, b) ≤ maximum [d (a, c), d (b, c)]

This implies that the two largest distances are equal, so that they define an isosceles triangle:

[Figure: isosceles triangle a, b, c with side lengths 4, 6, and 6, and the matching tree of a, b, c labeled with the values 2, 2, 2, and 4]

Similarity = Relationship if the distances are ultrametric!

If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, and thus can be used in UPGMA trees and for the most precise calculations of divergence dates.

ADDITIVE DISTANCES:

Property 6:

d (a, b) + d (c, d) ≤ maximum [d (a, c) + d (b, d), d (a, d) + d (b, c)]

For distances to fit into an evolutionary tree, they must be either metric or ultrametric, and they must be additive. Estimated distances often fall short of these criteria, and thus can fail to produce correct evolutionary trees.
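Properties 4 to 6 can be checked directly on a distance table. In this sketch the assignment of the triangle side lengths to particular pairs (a-b = 6, b-c = 5, a-c = 9) and the four-taxon distances are assumptions made for illustration; estimated (non-integer) distances would need a small tolerance in the comparisons.

```python
from itertools import permutations


def make_dist(table):
    """Build a symmetric lookup d(x, y) from a dict such as {'ab': 4}."""
    pairs = {frozenset(key): value for key, value in table.items()}
    return lambda x, y: 0 if x == y else pairs[frozenset((x, y))]


def satisfies_triangle_inequality(d, taxa):
    """Property 4 for every ordered triple of taxa."""
    return all(d(a, c) <= d(a, b) + d(b, c)
               for a, b, c in permutations(taxa, 3))


def is_ultrametric(d, taxa):
    """Property 5 for every ordered triple of taxa."""
    return all(d(a, b) <= max(d(a, c), d(b, c))
               for a, b, c in permutations(taxa, 3))


def is_additive(d, taxa):
    """Property 6 (four-point condition) for every ordered quadruple."""
    return all(d(a, b) + d(c, e) <= max(d(a, c) + d(b, e), d(a, e) + d(b, c))
               for a, b, c, e in permutations(taxa, 4))


metric_only = make_dist({"ab": 6, "bc": 5, "ac": 9})
clocklike = make_dist({"ab": 4, "ac": 6, "bc": 6})
# Hypothetical tree ((a,b),(c,e)): leaf branches 1, 2, 3, 4 and internal branch 5.
tree_like = make_dist({"ab": 3, "ce": 7, "ac": 9, "ae": 10, "bc": 10, "be": 11})

print(satisfies_triangle_inequality(metric_only, "abc"), is_ultrametric(metric_only, "abc"))
print(is_ultrametric(clocklike, "abc"))
print(is_additive(tree_like, "abce"))
```

The 6-9-5 triangle is metric but not ultrametric, the 4-6-6 triangle is ultrametric, and distances read off a tree satisfy the four-point condition.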

Tree-building methods: UPGMA

UPGMA is: unweighted pair group method using arithmetic mean

Step 1: Compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.

Step 2: Find the two proteins with the smallest pairwise distance. Cluster them. [Figure: proteins 1 and 2 join at node 6]

Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them. [Figure: proteins 4 and 5 join at node 7]

Step 4: Keep going. Cluster. [Figure: node 8 adds protein 3]

Step 5: Last cluster! This is your tree. [Figure: node 9 joins the remaining clusters and is the root]

Distance-based methods: UPGMA trees

• UPGMA is a simple approach for making trees.

• A UPGMA tree is always rooted.

• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree.

• If there are unequal substitution rates, the tree may be wrong.

• While UPGMA is simple, it is less accurate than the neighbor-joining approach

The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa.

Begin by placing all the taxa in a star-like structure.

Making trees using Neighbor-Joining

Tree-building methods: Neighbor joining

Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

Tree-building methods: Neighbor joining

Define the distance from X to Y by

dXY = 1/2(d1Y + d2Y – d12)
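The formula can be applied to a whole distance table once neighbors 1 and 2 have been joined at node X. The sketch below reads Y as each remaining OTU in turn, which is an assumption about the slide's notation; the function name and the four-taxon distances are hypothetical.

```python
def join_neighbors(dist, i, j, new):
    """Replace OTUs i and j by their new internal node.

    For every other OTU k: d(new, k) = 1/2 * (d(i, k) + d(j, k) - d(i, j)).
    Distances are stored in a dict keyed by frozenset pairs.
    """
    others = {k for pair in dist for k in pair} - {i, j}
    # Keep every distance that does not involve i or j.
    reduced = {pair: value for pair, value in dist.items() if not pair & {i, j}}
    d_ij = dist[frozenset((i, j))]
    for k in others:
        reduced[frozenset((new, k))] = 0.5 * (
            dist[frozenset((i, k))] + dist[frozenset((j, k))] - d_ij)
    return reduced


# Hypothetical tree ((1,2),(3,4)): leaf branches 2, 3, 1, 2 and internal branch 4.
dist = {
    frozenset(("1", "2")): 5, frozenset(("1", "3")): 7, frozenset(("1", "4")): 8,
    frozenset(("2", "3")): 8, frozenset(("2", "4")): 9, frozenset(("3", "4")): 3,
}
reduced = join_neighbors(dist, "1", "2", "X")
print(reduced[frozenset(("X", "3"))], reduced[frozenset(("X", "4"))])
```

On tree-like distances the update recovers the true path lengths from X: 4 + 1 = 5 to OTU 3 and 4 + 2 = 6 to OTU 4.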

Example of a neighbor-joining tree: phylogenetic analysis of 13 Retinol Binding Proteins

Tree-building methods: character based

• Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters).

• Tree-building methods based on characters include maximum parsimony and maximum likelihood.

Making trees using character-based methods

• The main idea of maximum parsimony, a character-based method, is to find the tree with the shortest branch lengths possible (the fewest changes): the most parsimonious (“simple”) tree.

• Identify informative sites. For example, constant characters are not parsimony-informative.

• Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for >12 taxa perform a heuristic search.

• Select the shortest tree (or trees).

As an example of tree-building using maximum parsimony, consider these four taxa:

AAG, AAA, GGA, AGA

How might they have evolved from a common ancestor such as AAA?

Tree-building methods: Maximum parsimony

[Figure: three candidate trees rooted at the ancestor AAA, with the number of changes marked on the branches]

Tree 1: (AAG, AAA) and (GGA, AGA): Cost = 3

Tree 2: (AAG, AGA) and (AAA, GGA): Cost = 4

Tree 3: (AAG, GGA) and (AAA, AGA): Cost = 4

In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
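The three costs can be recomputed by counting the minimum number of changes each tree needs. The sketch below uses Fitch's algorithm, a standard way to score a fixed tree that the slides do not name; the pairing of taxa in each tree is read from the figure layout and is an assumption.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, changes needed) for one column."""
    if isinstance(tree, str):             # a leaf holds its own sequence
        return {tree[site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    shared = left_states & right_states
    if shared:                            # children agree: no extra change
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree, n_sites):
    """Total number of changes over all aligned columns."""
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


trees = [
    (("AAG", "AAA"), ("GGA", "AGA")),
    (("AAG", "AGA"), ("AAA", "GGA")),
    (("AAG", "GGA"), ("AAA", "AGA")),
]
costs = [parsimony_cost(tree, 3) for tree in trees]
print(costs)
```

The first tree needs 3 changes and the other two need 4, so the first tree is the most parsimonious.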

Types of computational methods

Clustering algorithms:

• Use pairwise distances.

• Are purely algorithmic methods, in which the algorithm itself defines the tree selection criterion.

• Tend to be very fast programs that produce singular trees rooted by distance.

• No objective function to compare to other trees, even if numerous other trees could explain the data equally well.

• Warning: Finding a singular tree is not necessarily the same as finding the "true” evolutionary tree.

Optimality approaches:

• Use either character or distance data.

• First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function.

• Can identify many equally optimal trees, if such exist.

• Warning: Finding an optimal tree is not necessarily the same as finding the "true” tree.

Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building:

Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.

Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.

Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.

Computational methods for finding optimal trees:

Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:

(2N - 5)!! = # unrooted trees for N taxa

[Figure: example unrooted trees for 3, 4, 5, and 6 taxa (A to F)]

Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima

Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima

[Figure: two search landscapes, one for a search for the global minimum and one for a search for the global maximum; each marks the GLOBAL MAXIMUM and GLOBAL MINIMUM, plus a local minimum or local maximum where a search can get stuck]
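The effect of restarting a hill-climbing search can be shown on a toy landscape. In this sketch the positions stand in for candidate trees and the scores are invented for illustration; a real tree search would use tree rearrangements as neighbors and an optimality score such as likelihood.

```python
def hill_climb(score, start, neighbors):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = start
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current                # stuck: local (or global) maximum
        current = best


def best_of_restarts(score, starts, neighbors):
    """Run one climb per starting point and keep the best end point."""
    return max((hill_climb(score, s, neighbors) for s in starts), key=score)


# Toy landscape: a local maximum at position 1, the global maximum at position 5.
landscape = [1, 3, 2, 0, 4, 9, 5]


def score(i):
    return landscape[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


print(hill_climb(score, 0, neighbors))            # stops at the local maximum
print(best_of_restarts(score, [0, 3], neighbors)) # a second start finds the global one
```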

Assumptions made by phylogenetic methods:

• The sequences are correct

• The sequences are homologous

• Each position is homologous

• The sampling of taxa or genes is sufficient to resolve the problem of interest

• Sequence variation is representative of the broader group of interest

• Sequence variation contains sufficient phylogenetic signal (as opposed to noise) to resolve the problem of interest

• Each position in the sequence evolved independently

Problems with Phylogenetic Inference

1. How do we know what the potential candidate trees are?

2. How do we choose which tree is (most likely) the true tree?

3. The best tree is the one that produces consistent results.

Recipe for reconstructing a phylogeny

1. Select an optimality criterion

2. Select a search strategy

3. Use the selected search strategy to generate a series of trees, and apply the selected optimality criterion to each tree, always keeping track of the “best” tree examined thus far.

How do you know the “best” tree? Which is the “true” tree?

Search strategy: Which is the right tree?

• When m is the number of taxa, the number of possible rooted trees is:

– (2m-3)! / [2^(m-2) (m-2)!]

– For 10 taxa, the number of trees is 34,459,425

• Many trees can be discarded because they are obviously wrong

• Sometimes, there is a general or even specific grouping that can serve as a start for the tree search

• There are a number of approaches to tree searches that can be used

Evaluating the best tree

• Maximum likelihood (ML) tests the hypothesis by using a mathematical formula that

– Models the probability that a nucleotide substitution will occur

– Takes a tree with known branch lengths and asks how likely the observed DNA sequences are

– If similar trees have nearly the same likelihood as the tree with the highest likelihood, then the hypothesis is weaker.

Evaluating the best tree

• Bayesian Markov Chain Monte Carlo (BMCMC)

– Asks the probability that a particular tree is correct given data and a model of how traits change

• Distance methods

– Look at the changes in a character and convert them into a distance

– Assume a specific model of character change

– Taxa are clustered so that the more similar forms are close together.

Current strategy

• Produce a consensus tree using parsimony

• Evaluate the best tree using the statistical tests found in ML and BMCMC

• Compare the best trees from parsimony, ML, and BMCMC

• The best tree is the one that produces consistent results

Evaluating branches

• Evaluation is done statistically

• Trees based on ML and BMCMC compare trees with and without the branch

• Trees based on maximum parsimony use bootstrapping

Bootstrapping

• In bootstrapping, for example, you analyze a 300-base-pair alignment of a gene. The computer program draws 300 positions at random, with replacement, from that alignment, rebuilds the tree, and repeats this many times to determine how often the branch in question occurs among all of the trees that are generated.

• If the branch occurs under 50% of the time, there is too much uncertainty and the branch is collapsed into a polytomy or point of uncertainty.

• Bootstrap support of around 70% or more is associated with branches of the true phylogeny.
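The resampling step can be sketched in a few lines. Here `has_branch` is a hypothetical caller-supplied function that builds a tree from a resampled alignment and reports whether the branch of interest is present; the default of 100 replicates is likewise an illustrative choice.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def branch_support(alignment, has_branch, replicates=100, seed=0):
    """Fraction of pseudo-replicates whose tree contains the branch."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(replicates)
               if has_branch(bootstrap_alignment(alignment, rng)))
    return hits / replicates


# Small alignment reused from the parsimony example; every taxon gets the same columns.
alignment = {"t1": "AAG", "t2": "AAA", "t3": "GGA", "t4": "AGA"}
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

Because every replicate has the same length as the original alignment, some columns are drawn more than once and others are left out, which is what makes the support value a measure of how consistently the data favor the branch.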

Resolving conflict

• Researchers have more confidence in trees that use

– Larger data sets

– Characters that are not subject to homoplasy

– Appropriate inference methods

• Sometimes researchers have to wait for more data

Molecular clocks

• Timing and rate of evolution can be determined by looking at changing molecular traits.

• Changes in DNA sequences that are not tied to phenotypes and cannot be selected against can be tracked

• The neutral theory of molecular evolution predicts that neutral changes in DNA should accumulate at a rate equal to the mutation rate.

Caveats for molecular dating

• Mutation rates to neutral alleles vary among genes, lineages, and bases

• Third positions of codons are more likely to be neutral and to change in a clock-like fashion

• Rapid changes in allele frequencies in response to selection pressure produce unreliable clocks

• Calibration rates for a particular gene or lineage cannot be used for other groups that have different generation times and selection histories

When did humans start wearing clothes?

Pediculus corporis

Pediculus capitis

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6VRT-49BYF2J-S&_user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&_docanchor=&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=cee355b413b9d224336addf4142eb034


Kittler et al., 2003

Summary of the lousy results

• Greater diversity in African than in non-African lice

• Lice probably originated in Africa along with humans

• Clothing appeared 30,000 to 114,000 years ago

• The expansion of lice diversity represents the migration of humans out of Africa

• Clothing may have allowed the successful movement of humans out of Africa into colder climates.

Which species are the closest living relatives of modern humans?

Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.

The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.

[Figure: two rooted trees of Humans, Chimpanzees, Bonobos, Gorillas, and Orangutans. Pre-molecular view: time axis in MYA running from 15-30 to 0. Molecular view: time axis in MYA running from 14 to 0.]

Did the Florida Dentist infect his patients with HIV?

Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:

[Figure: tree whose tips are the DENTIST (two sequences), Patients A to G, and Local controls 2, 3, 9, and 35. The patients whose sequences cluster with the dentist are marked "Yes"; the others are marked "No".]

Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

From Ou et al. (1992) and Page & Holmes (1998)

What was the most likely geographical location of the common ancestor of the African apes and humans?

Key: Eurasia = Black; Africa = Red; marked branches = Dispersal

Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R582-588.

Scenario B requires four fewer dispersal events

[Figure: trees drawn under each scenario for Living Species (OW Monkeys, Gibbons, Orangutans, Gorillas, Chimpanzees, Humans) and for Living + Fossil Species (adding Proconsul, Kenyapithecus, Oreopithecus, Lufengpithecus, Dryopithecus, and Ouranopithecus)]

Scenario A: Africa as species fountain Scenario B: Eurasia as ancestral homeland

How can we choose between competing hypotheses on phylogeny of whales?

Phylogenetic Reconstruction of Whales

• Whales belong to Artiodactyla (even-toed ungulate mammals), which includes camels, pigs, hippos, cows, and deer

• Outgroup is rhinos/horses

• Difficult to place them because they lack many characters present in terrestrial mammals (e.g. hind limbs)

• Are whales sister to the entire group or to hippos?

DNA Sequence Data and Whale Evolution

• Data collected from beta-casein gene for all taxa and sequences aligned.

• Nucleotide changes between outgroup and ingroup species indicate shared derived homologies.

• Most nucleotides are identical in all taxa; these are uninformative for phylogeny.

• Some nucleotides indicate that whales belong with cows, deer, and hippos (162).

• Others indicate that whales and hippos are sister groups (166).

• Others contradict the sister group status of whale/hippo and cow/deer (177) and may indicate a reversal.

Phylogeny results should be treated as informative but not authoritative
