Building Trees
What is a Tree?
• A tree is a visualization of the mathematical analysis of a comparison of characteristics in multiple individuals or species. The multiples can also be tissues or developmental stages in the case of microarrays.
• The closer branches share more similarities and the more distant branches are less similar.
Phylogeny (phylo = tribe + genesis)
1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest
crocodiles
birds
lizards
snakes
rodents
primates
marsupials
Start with a group of species and establish relationships based on
measurements
crocodiles
birds
lizards
snakes
rodents
primates
marsupials
This is an example of a phylogenetic tree.
Homology & Similarity
• Homology
– Conserved sequences arising from a common ancestor
– Orthologs: homologous genes that share a common ancestor in the absence of any gene duplication (Mouse and Human Hemoglobin)
– Paralogs: genes related through gene duplication (one gene is a copy of another; Fetal and Adult Hemoglobin)
• Similarity
– Genes that share common sequences but are not necessarily related
Sequences As Modules
• Proteins are derived from a limited number of basic building blocks (Modules)
• Evolution has shuffled these modules giving rise to a diverse repertoire of protein sequences
• Proteins can share a global relationship or a local relationship specific to a single DOMAIN
Global
Local
Sequence Domains
Modules Define Functional/Structural Domains
Defining A Sequence Family
Family A
Family B
Family D Family E
Family C
Global vs. Local Alignments
• Global
– Search for alignments matching over entire sequences
• Local
– Examine regions of sequence for conserved segments
• Both consider: Matches, Mismatches, Gaps
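The difference between the two alignment types can be sketched with a small dynamic-programming scorer. This is a minimal illustration, not the method used by any particular tool: the scoring values (match +1, mismatch -1, gap -2) and the function name `alignment_score` are assumptions chosen for the example.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score.

    The scoring values are illustrative assumptions, not from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


print(alignment_score("CREATE", "CREASE"))                     # global score
print(alignment_score("TTTACGTTT", "GGGACGGGG", local=True))   # local score
```

With these assumed scores, the second pair shares only the short segment ACG, so its local score is positive while its global score is negative.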
Global Sequence Alignments
Yeast Prion-Like Proteins
How To Make A Global MSA
• On The Web
– http://pir.georgetown.edu/pirwww/search/multaln.html
• On Your Computer
– ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/
MSA Example Sequences
>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH
>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>MATK_HUMAN
WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY
>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
>CRKL_HUMAN
WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY
>YES_XIPHE
WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY
>FGR_HUMAN
WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY
>SRC_RSVP
WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY
Standard FASTA Sequence Format
MSA Example Result
YES_XIPHE   WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN   WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP    WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN  WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK   WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN  WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN  WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG    WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN  FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE
Consensus symbols (column positions not preserved): :: . : :: : * :*:* * : * : ** :

YES_XIPHE   DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN   DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP    YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN  -DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK   -SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN  PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN  KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG    KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN  LNGTYAIAGGRTHASPADLCHYH
Consensus symbols (column positions not preserved): * . : .
Steps to Build Trees from MSA
1) Identify taxa to be considered
2) Choose characters (independent, “unit”)
3) Construct character matrix for each taxon
4) After performing alignment, use a mathematical formula to describe the degree of similarity for each pair of taxa, e.g. the simple matching coefficient:
S = (# matches) / (total # of characters)
Steps to Build Trees
5) construct matrix with pairwise S values
6) use clustering technique to produce a tree (dendrogram)
• Unweighted/Equal weighting = all characters given equal consideration
– UPGMA (Unweighted Pair Group Method with Arithmetic Averaging)
– Neighbour-joining
• Unweighting is a form of weighting
Taxon
1 2 3 4 5 6 7 8 9 10
A 0 1 1 0 0 0 1 1 1 0
B 0 0 0 1 1 1 0 1 1 1
C 0 0 1 0 0 1 0 0 0 1
D 1 1 0 0 0 1 1 1 1 0
Character Matrix
Taxon   A    B    C    D
A       --   0.3  0.4  0.7
B            --   0.5  0.4
C                 --   0.3
D                      --
S-value Matrix
Building Matrices
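As a check on the two matrices, the S-values can be recomputed from the character matrix with the simple matching coefficient. The sketch below uses the four taxa and ten characters shown above; the function and variable names are illustrative.

```python
from itertools import combinations

# Character matrix from the slide: ten binary characters per taxon.
characters = {
    "A": [0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    "B": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
    "C": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "D": [1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
}


def simple_matching(x, y):
    """S = number of matching characters / total number of characters."""
    matches = sum(1 for p, q in zip(x, y) if p == q)
    return matches / len(x)


# Pairwise S-values, keyed by (taxon, taxon) in alphabetical order.
s_values = {
    (t1, t2): simple_matching(characters[t1], characters[t2])
    for t1, t2 in combinations(sorted(characters), 2)
}

for pair, s in s_values.items():
    print(pair, s)
```

The printed values agree with the S-value matrix: A and D are the most similar pair (0.7), followed by B and C (0.5).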
Joining Clusters into a Tree
Closest: A&D = 0.7
2nd closest: B&C = 0.5
When does A&D join B&C?
[(A&B) + (A&C) + (D&B) + (D&C)] / 4 = (0.3 + 0.4 + 0.4 + 0.3)/4 = 0.35
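The averaging step can be written as a small helper that takes the mean S-value over every cross-cluster pair. This is a sketch of the arithmetic above only; the names `sim` and `average_similarity` are made up for the example.

```python
from itertools import product

# Pairwise S-values from the S-value matrix.
s = {
    ("A", "B"): 0.3, ("A", "C"): 0.4, ("A", "D"): 0.7,
    ("B", "C"): 0.5, ("B", "D"): 0.4, ("C", "D"): 0.3,
}


def sim(x, y):
    """Look up an S-value regardless of the order of the two taxa."""
    return s[tuple(sorted((x, y)))]


def average_similarity(cluster1, cluster2):
    """Mean S-value over every pair with one taxon from each cluster."""
    pairs = list(product(cluster1, cluster2))
    return sum(sim(x, y) for x, y in pairs) / len(pairs)


# Level at which cluster A&D joins cluster B&C (about 0.35).
print(average_similarity(("A", "D"), ("B", "C")))
```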
Problems
• Different methods or characters = different dendrograms
• If we use all possible characteristics this would be a natural classification
• The tree is an accurate phylogeny if differences in characters between taxa are proportional to the time elapsed since their common ancestor
Convergent Evolution
• Similar phenotypic response to similar ecological conditions
• Different developmental pathways
Reversal of Evolution
• An altered character reverts to the ancestral form.
• In a DNA molecule, a nucleotide position may change from a C to a T and then back to a C. This frog reverted to teeth.
Trees are hypotheses about evolutionary history
• Different methods may result in different trees.
• How to choose between the different models?
• One way is to compare different types of character data and see if the trees make sense.
Haplotype Network in 3 Elephant Species with 3 DNA sequences
Parsimonious choices reflect fewer changes
• The assumptions of parsimony
– Reversals and convergence require more changes
– Parsimonious trees represent best estimates of phylogenetic relationships
Use of DNA, RNA, or Protein
• For phylogeny, DNA can be more informative.
– The protein-coding portion of DNA has synonymous and nonsynonymous substitutions.
– Some DNA changes do not have corresponding protein changes
– See arrows 14, 21, 25, 27, 29 in the retinol-binding protein figure.
For phylogeny, DNA can be more informative.
• If the synonymous substitution rate (dS) is greater than the nonsynonymous substitution rate (dN), the DNA sequence is under negative (purifying) selection.
• This limits change in the sequence.
• If dS < dN, positive selection occurs.
• For example, a duplicated gene may evolve rapidly to assume new functions.
Models of nucleotide substitution: Transitions > Transversions
[Figure: A ↔ G and C ↔ T are transitions; changes between A/G and C/T are transversions]
• Some substitutions in a DNA sequence alignment can be directly observed:
– single nucleotide substitutions
– sequential substitutions
– coincidental substitutions
• Additional mutational events can be inferred by analysis of ancestral sequences. These changes include
– parallel substitutions
– convergent substitutions
– back substitutions
Advantages of DNA
• Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny.
• See Figure 11.10 (arrows 4-10 and 35-38)
• Pseudogenes (nonfunctional genes) are studied by molecular phylogeny
• Rates of transitions and transversions can be measured.
• Transitions: purine to purine (A to G) or pyrimidine to pyrimidine (C to T) substitutions
• Transversions: purine to pyrimidine substitutions, or the reverse
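Counting transitions and transversions between two aligned DNA sequences can be sketched as follows. The two example sequences are Species A and Species B from the character-data slide later in this deck; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Label a pair of aligned nucleotides."""
    if x == y:
        return "identical"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1, seq2):
    """Return (number of transitions, number of transversions)."""
    labels = [classify_substitution(x, y) for x, y in zip(seq1, seq2)]
    return labels.count("transition"), labels.count("transversion")


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
print(count_ts_tv(species_a, species_b))
```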
Protein sequences are also used for phylogeny
• Proteins have 20 states (amino acids) instead of only four for DNA, so there is more phylogenetic information.
• Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
• An ordered character must pass through one or more intermediate states before reaching the final state.
• Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.
Amino acid sequences
• From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence
• Some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be functionally devastating
• Tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks are used: PAM and BLOSUM
Sequence-Based Comparisons
• Identify sequences within an organism that are related to each other and/or across different species
– Within: Fetal and adult hemoglobin
– Across: Human and chimpanzee hemoglobin
• Generate an evolutionary history of related genes
• Locate insertions, deletions, and substitutions that have occurred during evolution
CREATE CREASE -RELAPSE
GREASER
(C) Cysteine (R) Arginine (E) Glutamate (A) Alanine (T) Threonine (S) Serine (L) Leucine (P) Proline (G) Glycine
[Ancestor] [Progenitors]
Multiple Sequence Alignments
• Place residues in columns that are derived from a common ancestral residue
• Identify Matches, Mismatches, and Gaps
• MSA can reveal sequence patterns
– Demonstration of homology between >2 sequences
– Identification of functionally important sites
– Protein function prediction
– Structure prediction
CRE-A-TE-
CRE-A-SE-
-RELAPSE-
GRE-A-SER
CREATE
CREASE
GREASER
RELAPSE
123456789
SeqA
SeqB
SeqC
SeqD
MSA and Tree Relationship
• “The optimal alignment of several sequences can be thought of as minimizing the number of mutational steps in an evolutionary tree for which the sequences are the leaves” (Mount, 2001)
+R
CREATE
CREASE
CREATE
CRE-A-TE-
CRE-A-SE-
-RELAPSE-
GRE-A-SER
SeqA
SeqB
SeqC
SeqD
T to S
C to G GREASE
CREASE
CREATE
+L +P
-G
Multiple Sequence Alignments
• Confirm that all sequences are homologous
• Adjust gap creation and extension penalties as needed to optimize the alignment
• Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data).
• Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)
Problems in Reconstructing Phylogeny
• Characters sometimes conflict
• It is sometimes difficult to tell homology from homoplasy
– Analogy: characters similar because of convergent evolution
– Reversal: character reverts to ancestral form
• With morphological characters, careful examination may distinguish homoplasy (analogy) from homology
• With molecular characters (DNA/protein sequences), homoplasy is sometimes impossible to distinguish from homology (orthologs and paralogs).
A Phylogenetic Tree
• Taxon -- Any named group of organisms – evolutionary theory not necessarily involved.
• Clade -- A monophyletic taxon (evolutionary theory utilized)
A phylogenetic tree with branch lengths
• Branch length can be significant…
• In this case it is: mouse is slightly more similar to fly than human is to fly (the sum of branches 1+2+3 is less than the sum of 1+2+4)
Ancestral Node or ROOT of the Tree
Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa)
Branches or Lineages
Terminal Nodes
A
B
C
D
E
Represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Common Phylogenetic Tree Terminology
Phylogenetic trees diagram the evolutionary relationships between the taxa
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
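The nested-parentheses notation is easy to read by machine. The sketch below is a minimal parser that assumes plain taxon labels with no branch lengths or other annotations, and lists the clade defined by each internal node; `parse_nested`, `leaves` and `clades` are illustrative names.

```python
def parse_nested(text):
    """Parse nested parentheses such as '((A,(B,C)),(D,E))' into nested tuples."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse_node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]            # a leaf label

    return parse_node()


def leaves(node):
    """All taxon labels below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def clades(node):
    """One sorted list of taxa per internal node, root first."""
    if isinstance(node, str):
        return []
    found = [sorted(leaves(node))]
    for child in node:
        found.extend(clades(child))
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
print(tree)
print(clades(tree))
```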
Taxon A
Taxon B
Taxon C
Taxon E
Taxon D
No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
This dimension either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
Taxon A
Taxon B
Taxon C
Taxon D
1
1
1
6
3
5
genetic change
Taxon A
Taxon B
Taxon C
Taxon D
time
Taxon A
Taxon B
Taxon C
Taxon D
no meaning
Three types of trees
Cladogram Phylogram Ultrametric tree
All show the same evolutionary relationships, or branching orders, between the taxa.
Types of trees: Cladogram
Pagurus bernhardus
Pagurus acadianus
Ellasochirus tenuimanus
Labidochirus splendescens
Lithodes aequispina
Paralithodes camtschatica
Pagurus pollicaris (NE)
Pagurus pollicaris (GU)
Pagurus longicarpus (NE)
Pagurus longicarpus (GU)
Clibanarius vittatus
Coenobita sp.
Artemia salina
t1
t2
t3
Cladogram (no time scale): relative recency of common descent.
• Does not imply that ancestors on the same line necessarily speciated at the same time.
• t1 can be before or after t2 but not before t3
Pagurus bernhardus
Pagurus acadianus
Ellasochirus tenuimanus
Labidochirus splendescens
Lithodes aequispina
Paralithodes camtschatica
Pagurus pollicaris (NE)
Pagurus pollicaris (GU)
Pagurus longicarpus (NE)
Pagurus longicarpus (GU)
Clibanarius vittatus
Coenobita sp.
Artemia salina
0.05
Phylogram (additive tree: branch lengths can be summed): relative recency of common descent, and branch lengths = amount of change
Types of trees: Phylogram
Pagurus bernhardus
Pagurus acadianus
Ellasochirus tenuimanus
Labidochirus splendescens
Lithodes aequispina
Paralithodes camtschatica
Pagurus pollicaris (NE)
Pagurus pollicaris (GU)
Pagurus longicarpus (NE)
Pagurus longicarpus (GU)
Clibanarius vittatus
Coenobita sp.
Artemia salina
Scale: 0.00 0.05 0.10 0.15
Ultrametric tree (linearized tree)
Amount of change can be scaled to time
Types of trees: Ultrametric
scale = time
divergence
All tree tips are equidistant from the root
Completely unresolved or "star" phylogeny
Partially resolved phylogeny
Fully resolved, bifurcating phylogeny
A A A
B
B B
C
C
C
E
E
E
D
D D
Polytomy or multifurcation A bifurcation
The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees
There are three possible unrooted trees for four species (A, B, C, D)
A C
B D
Tree 1
A B
C D
Tree 2
A B
D C
Tree 3
1. The goal of phylogenetic tree building methods is to discover which of the possible unrooted trees is "correct".
2. This should be the “true” biological tree, accurately representing the evolutionary history of the species.
3. However, it is only possible to discover the computationally correct or optimal tree for the phylogenetic method of choice.
The number of unrooted trees increases in a greater than exponential manner with number of species (taxa)
(2N - 5)!! = # unrooted trees for N taxa
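To see how quickly this count grows, the double factorial can be evaluated directly. A small sketch with illustrative function names, valid for N of 3 or more:

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: (2N - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 5, 6, 10, 20):
    print(n, unrooted_trees(n))
```

Three, four, five and six taxa give 1, 3, 15 and 105 unrooted trees; ten taxa already give 2,027,025.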
CA
B D
A B
C
A D
B E
C
A D
B E
C
F
Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:
A
BC
Root D
A B C D
Root
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Rooted tree
Unrooted tree
Try it again with the root at another position
A
BC
Root
D
Unrooted tree
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
C D
Root
Rooted tree
A
BB
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
The unrooted tree 1:
A C
B D
Rooted tree 1d
C
D
A
B
4
Rooted tree 1c
A
B
C
D
3
Rooted tree 1e
D
C
A
B
5
Rooted tree 1b
A
B
C
D
2
Rooted tree 1a
B
A
C
D
1
These trees show five different evolutionary relationships among the taxa!
Trick Question Warning
• Sometimes two trees may look very different but, in fact, differ only in the position of the root.
• Don’t forget rotational symmetry!
All of these rearrangements show the same evolutionary relationships between the taxa
B
A
C
D
A
B
D
C
B
C
A
D
B
D
A
C
B
AC
D
Rooted tree 1a
B
A
C
D
A
B
C
D
By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
There are two major ways to root trees
A
B
C
D
10
2
3
5
2
By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
outgroup
d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
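The midpoint calculation can be sketched as a walk along the path between the two most distant taxa. The helper below is hypothetical: it takes the branch lengths along that path, in order from A to D, and reports which branch holds the midpoint and how far along it.

```python
def midpoint_on_path(branch_lengths):
    """Return (branch index, distance along that branch) for the path midpoint."""
    half = sum(branch_lengths) / 2
    walked = 0
    for index, length in enumerate(branch_lengths):
        if walked + length >= half:
            return index, half - walked
        walked += length


# Branches on the path from A to D in the slide: 10, then 3, then 5.
path_a_to_d = [10, 3, 5]
print(midpoint_on_path(path_a_to_d))
```

For this path the root falls on the first branch, 9 units from A and 1 unit before the first internal node.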
Rooting Using an Outgroup
• The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other.
• It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition.
• The root must be somewhere between the outgroup and the rest (either on the node or in a branch).
Automatic Rooting
• Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot)
• This normally involves assumptions… BE AWARE what those are.
x =
CA
B D
A D
B E
C
A D
B E
C
F
(2N - 3)!! = # rooted trees for N taxa
Each unrooted tree theoretically can be rooted anywhere along any of its branches
Molecular phylogenetic tree building methods
Mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available, each with strengths and weaknesses.
COMPUTATIONAL METHOD by DATA TYPE
Data type    Optimality criterion                                    Clustering algorithm
Characters   PARSIMONY
Distances    MINIMUM EVOLUTION, LEAST SQUARES (FITCH & MARGOLIASH)   UPGMA, NEIGHBOR-JOINING
Types of data used in phylogenetic inference
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.
            A     B     C     D     E
Species A   ----  0.20  0.50  0.45  0.40
Species B   0.23  ----  0.40  0.55  0.50
Species C   0.87  0.59  ----  0.15  0.40
Species D   0.73  1.12  0.17  ----  0.25
Species E   0.59  0.89  0.61  0.31  ----
Example 1 (above the diagonal): Uncorrected “p” distance (= observed percent sequence difference)
Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
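Both kinds of distance can be computed directly from the aligned sequences. The sketch below uses the standard Kimura 2-parameter formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function names are illustrative, and the sketch assumes gap-free sequences of equal length.

```python
from math import log, sqrt

sequences = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(s1, s2):
    """Uncorrected distance: proportion of sites that differ."""
    return sum(1 for x, y in zip(s1, s2) if x != y) / len(s1)


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance from transition and transversion proportions."""
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine-purine or pyrimidine-pyrimidine
        else:
            transversions += 1    # purine-pyrimidine
    p = transitions / len(s1)
    q = transversions / len(s1)
    # The logarithm is undefined for very divergent pairs (argument <= 0).
    return -0.5 * log((1 - 2 * p - q) * sqrt(1 - 2 * q))


print(p_distance(sequences["A"], sequences["B"]))
print(round(k2p_distance(sequences["A"], sequences["B"]), 2))
```

For Species A and B this gives 0.20 and about 0.23, matching the two entries for that pair in the matrix. The corrected distance is larger because it estimates substitutions hidden by multiple changes at the same site.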
Similarity vs. Evolutionary Relationship
Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.
Similar: having likeness or resemblance (an observation)
Related: genetically connected (an historical fact)
Two taxa can be most similar without being most closely-related:
Taxon A
Taxon B
Taxon C
Taxon D
1
1
1
6
3
5
C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).
Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are ‘ultrametric’).
Types of Similarity
Observed similarity between two entities can be due to:
Evolutionary relationship:
– Shared ancestral characters (‘plesiomorphies’)
– Shared derived characters (‘synapomorphies’)
Homoplasy (independent evolution of the same character):
– Convergent events (in either related or unrelated entities)
– Parallel events (in related entities)
– Reversals (in related entities)
CC
G
G
C
C
G
G
CG
G C
C
G
GT
METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:
Property 1: d (a, b) ≥ 0   Non-negativity
Property 2: d (a, b) = d (b, a)   Symmetry
Property 3: d (a, b) = 0 if and only if a = b   Distinctness
Property 4: d (a, c) ≤ d (a, b) + d (b, c)   Triangle inequality
a
b
c6
9
5
ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:
Property 5: d (a, b) ≤ maximum [d (a, c), d (b, c)]
If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, so the distances can be used in UPGMA trees and for the most precise calculations of divergence dates.
a b4
66
c
Similarity = Relationship if the distances are ultrametric!
a
b
c
2
22
4
This implies that the two largest distances are equal, so that they define an isosceles triangle:
ADDITIVE DISTANCES:
Property 6:
d (a, b) + d (c, d) ≤ maximum [d (a, c) + d (b, d), d (a, d) + d (b, c)]
For distances to fit into an evolutionary tree, they must be either metric or ultrametric, and they must be additive. Estimated distances often fall short of these criteria, and thus can fail to produce correct evolutionary trees.
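Properties 4 to 6 can be tested mechanically on a distance matrix. The sketch below stores distances in a dictionary keyed by unordered pairs. The example values are assumptions: the two triangles borrow the numbers 6, 9, 5 and 4, 6, 6 from the figures (the assignment of numbers to pairs is a guess), and the four-taxon distances come from a made-up tree.

```python
from itertools import permutations


def dist(d, x, y):
    """Distance lookup that is symmetric and zero for identical taxa."""
    return 0 if x == y else d[frozenset((x, y))]


def satisfies_triangle_inequality(d, taxa):
    """Property 4 for every ordered triple."""
    return all(dist(d, a, c) <= dist(d, a, b) + dist(d, b, c)
               for a, b, c in permutations(taxa, 3))


def is_ultrametric(d, taxa):
    """Property 5 for every ordered triple."""
    return all(dist(d, a, b) <= max(dist(d, a, c), dist(d, b, c))
               for a, b, c in permutations(taxa, 3))


def is_additive(d, taxa):
    """Property 6 (four-point condition) for every ordered quadruple."""
    return all(
        dist(d, a, b) + dist(d, c, e)
        <= max(dist(d, a, c) + dist(d, b, e), dist(d, a, e) + dist(d, b, c))
        for a, b, c, e in permutations(taxa, 4)
    )


def make(pairs):
    return {frozenset(pair): value for pair, value in pairs.items()}


metric_only = make({("a", "b"): 6, ("b", "c"): 5, ("a", "c"): 9})
clock_like = make({("a", "b"): 4, ("a", "c"): 6, ("b", "c"): 6})
# Hypothetical tree: a and b join with branches 1 and 2, an internal
# branch of length 3 follows, and c and d each sit on branches of length 1.
tree_like = make({("a", "b"): 3, ("a", "c"): 5, ("a", "d"): 5,
                  ("b", "c"): 6, ("b", "d"): 6, ("c", "d"): 2})

print(satisfies_triangle_inequality(metric_only, "abc"), is_ultrametric(metric_only, "abc"))
print(is_ultrametric(clock_like, "abc"))
print(is_additive(tree_like, "abcd"))
```

The first triangle is metric but not ultrametric, the second (isosceles) triangle is ultrametric, and the four-taxon distances are additive because they were read off a tree.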
Tree-building methods: UPGMA
UPGMA is: unweighted pair group method using arithmetic mean
1 2
3
4
5
Tree-building methods: UPGMA
Step 1: compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.
1 2
3
4
5
Tree-building methods: UPGMA
Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.
1 2
3
4
5
1 2
6
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.
1 2
3
4
5
1 2
6
4 5
7
Step 4: Keep going. Cluster.
1 2
3
4
5 1 2
6
4 5
7
3
8
Tree-building methods: UPGMA
Tree-building methods: UPGMA
Step 5: Last cluster! This is your tree.
1 2
3
4
5
1 2
6
4 5
7
3
8
9
Distance-based methods: UPGMA trees
• UPGMA is a simple approach for making trees.
• A UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree.
• If there are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the neighbor-joining approach
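The clustering steps described above can be sketched in a few lines. This is a minimal illustration, not a production implementation. As an assumption for the example, the distances are taken as 1 - S from the S-value matrix earlier in the deck, and each internal node is returned as a tuple (left subtree, right subtree, node height).

```python
from itertools import combinations


def upgma(taxa, distances):
    """Build a rooted tree by repeatedly joining the closest pair of clusters."""
    trees = {t: t for t in taxa}      # active cluster label -> subtree
    sizes = {t: 1 for t in taxa}      # number of leaves in each cluster
    d = {frozenset(pair): value for pair, value in distances.items()}
    next_id = 0
    while len(trees) > 1:
        # Find the two active clusters with the smallest distance.
        x, y = min(combinations(trees, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((x, y))] / 2
        new = f"node{next_id}"
        next_id += 1
        # Size-weighted mean, so every original pair of taxa counts equally.
        for z in trees:
            if z not in (x, y):
                d[frozenset((new, z))] = (
                    sizes[x] * d[frozenset((x, z))] + sizes[y] * d[frozenset((y, z))]
                ) / (sizes[x] + sizes[y])
        trees[new] = (trees[x], trees[y], height)
        sizes[new] = sizes[x] + sizes[y]
        del trees[x], trees[y]
    return next(iter(trees.values()))


# Assumed distances: 1 - S for the four taxa of the S-value matrix.
distances = {
    ("A", "B"): 0.7, ("A", "C"): 0.6, ("A", "D"): 0.3,
    ("B", "C"): 0.5, ("B", "D"): 0.6, ("C", "D"): 0.7,
}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

A and D join first, then B and C, and the two clusters meet at a distance of about 0.65, which is 1 minus the average similarity of 0.35 computed on the earlier slide.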
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa.
Begin by placing all the taxa in a star-like structure.
Making trees using Neighbor-Joining
Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.
Tree-building methods: Neighbor joining
Define the distance from X to Y by
dXY = 1/2(d1Y + d2Y – d12)
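A quick numeric check of this formula, using a made-up additive tree: OTU 1 sits 1 unit and OTU 2 sits 2 units from their shared node X, and a third taxon Y lies 4 units beyond X. The pairwise distances are then d1Y = 5, d2Y = 6 and d12 = 3. The function name is illustrative.

```python
def distance_to_new_node(d_1y, d_2y, d_12):
    """dXY = 1/2 (d1Y + d2Y - d12): distance from the new node X to taxon Y."""
    return 0.5 * (d_1y + d_2y - d_12)


# Hypothetical distances read off the made-up tree described above.
print(distance_to_new_node(d_1y=5, d_2y=6, d_12=3))
```

The result, 4.0, recovers the true X to Y distance, which is what makes the formula exact when the distances are additive.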
Example of a neighbor-joining tree: phylogenetic analysis of 13 Retinol Binding Proteins
Tree-building methods: character based
• Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters).
• Tree-building methods based on characters include maximum parsimony and maximum likelihood.
Making trees using character-based methods
• The main idea of maximum parsimony, a character-based method, is to find the tree that requires the fewest changes (the shortest total branch length): the most parsimonious (“simple”) tree.
• Identify informative sites. For example, constant characters are not parsimony-informative.
• Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for >12 taxa perform a heuristic search.
• Select the shortest tree (or trees).
As an example of tree-building using maximum parsimony, consider these four taxa:
AAG AAA GGA AGA
How might they have evolved from a common ancestor such as AAA?
Tree-building methods: Maximum parsimony
Tree 1: ((AAG, AAA), (GGA, AGA)), root AAA: Cost = 3
Tree 2: ((AAG, AGA), (AAA, GGA)), root AAA: Cost = 4
Tree 3: ((AAG, GGA), (AAA, AGA)), root AAA: Cost = 4
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
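The costs for the four example taxa (AAG, AAA, GGA, AGA) can be reproduced with Fitch's small-parsimony algorithm, a standard way to count the minimum number of changes on a given tree. The slides do not name the counting procedure, so treating it as Fitch's method is an assumption. Trees are written as nested pairs whose leaves are the sequences themselves.

```python
def fitch(tree, site):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state, so no extra change is needed.
        return shared, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree, n_sites):
    """Total number of changes summed over all alignment columns."""
    return sum(fitch(tree, site)[1] for site in range(n_sites))


# The three possible groupings of the four taxa.
candidate_trees = [
    (("AAG", "AAA"), ("GGA", "AGA")),
    (("AAG", "AGA"), ("AAA", "GGA")),
    (("AAG", "GGA"), ("AAA", "AGA")),
]
costs = [parsimony_cost(tree, 3) for tree in candidate_trees]
print(costs)
```

The first grouping needs 3 changes and the other two need 4, so the first tree is the most parsimonious.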
Types of computational methods
Clustering algorithms:
• Use pairwise distances.
• Are purely algorithmic methods, in which the algorithm itself defines the tree selection criterion.
• Tend to be very fast programs that produce singular trees rooted by distance.
• No objective function to compare to other trees, even if numerous other trees could explain the data equally well.
• Warning: Finding a singular tree is not necessarily the same as finding the “true” evolutionary tree.
Optimality approaches:
• Use either character or distance data.
• First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function.
• Can identify many equally optimal trees, if such exist.
• Warning: Finding an optimal tree is not necessarily the same as finding the “true” tree.
Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building:
Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.
Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.
Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.
Computational methods for finding optimal trees:
Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:
(2N - 5)!! = # unrooted trees for N taxa
A D
B E
C
CA
B D
A B
C
A D
B E
C
F
Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima
Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima
[Figure: two search landscapes with local and global optima; one search seeks the GLOBAL MINIMUM and can stall in a local minimum, the other seeks the GLOBAL MAXIMUM and can stall at a local maximum]
Assumptions made by phylogenetic methods:
• The sequences are correct
• The sequences are homologous
• Each position is homologous
• The sampling of taxa or genes is sufficient to resolve the problem of interest
• Sequence variation is representative of the broader group of interest
• Sequence variation contains sufficient phylogenetic signal (as opposed to noise) to resolve the problem of interest
• Each position in the sequence evolved independently
Problems with Phylogenetic Inference
1. How do we know what the potential candidate trees are?
2. How do we choose which tree is (most likely) the true tree?
3. The best tree is the one that produces consistent results.
Recipe for reconstructing a phylogeny
1. Select an optimality criterion
2. Select a search strategy
3. Use the selected search strategy to generate a series of trees, and apply the selected optimality criterion to each tree, always keeping track of the “best” tree examined thus far.
How do you know the “best” tree?
Which is the “true” tree?
Search strategy: Which is the right tree?
• When m is the number of taxa, the number of possible rooted trees is:
– (2m-3)! / [2^(m-2) (m-2)!]
– For 10 taxa, the number of trees is 34,459,425
• Many trees can be discarded because they are obviously wrong
• Sometimes, there is a general or even specific grouping that can serve as a start for the tree search
• There are a number of approaches to tree searches that can be used
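The count quoted for 10 taxa can be checked by evaluating the factorial formula, reading the denominator as 2 raised to the power (m - 2) times (m - 2)!. This reading reproduces the quoted figure, which is the number of rooted bifurcating trees. The function name is illustrative.

```python
from math import factorial


def rooted_trees(m):
    """(2m - 3)! / (2**(m - 2) * (m - 2)!) for m >= 2 taxa."""
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


for m in (3, 4, 10):
    print(m, rooted_trees(m))
```

Three taxa give 3 rooted trees, four give 15, and ten give 34,459,425.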
Evaluating the best tree
• Maximum likelihood (ML) tests the hypothesis by using a mathematical formula that
– Models the probability that a nucleotide substitution will occur
– Evaluates how likely the observed DNA sequences are, given a tree with known branch lengths
– If similar trees have nearly the same probability as the one with the highest likelihood, then the hypothesis is weaker.
Evaluating the best tree
• Bayesian Markov Chain Monte Carlo (BMCMC)
– Asks the probability that a particular tree is correct given data and a model of how traits change
• Distance methods
– Look at the changes in characters and convert them into distances
– Assume a specific model of character change
– Taxa are clustered so that the more similar forms are close together.
Current strategy
• Produce a consensus tree using parsimony
• Evaluate the best tree using statistical tests found in ML and BMCMC
• Compare the best trees using parsimony, ML and BMCMC
• The best tree is the one that produces consistent results
Evaluating branches
• Evaluation is done statistically
• Trees based on ML and BMCMC compare trees with and without the branch
• Trees based on maximum parsimony use bootstrapping
Bootstrapping
• In bootstrapping, for example, if you analyze 300 base pairs of a gene, the computer program draws 300 sites from that alignment at random, with replacement, to build each resampled data set. A tree is made from each data set, and the frequency with which the branch in question occurs across all of those trees is its bootstrap support.
• If the branch occurs under 50% of the time, there is too much uncertainty and the branch is collapsed into a polytomy or point of uncertainty.
• Bootstrap support of around 70% or more is commonly taken to indicate a branch that reflects the true phylogeny.
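The resampling step can be sketched as drawing alignment columns with replacement. In the sketch below, `has_branch` is a hypothetical callback standing in for "build a tree from this pseudo-alignment and report whether the branch of interest is present". The toy callback used here only compares raw differences, and the short sequences are the first ten sites of Species A, B and C from the character-data slide.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement, keeping the alignment length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, has_branch, n_replicates=100, seed=0):
    """Fraction of pseudo-replicates in which the branch of interest appears."""
    rng = random.Random(seed)
    hits = sum(bool(has_branch(bootstrap_alignment(alignment, rng)))
               for _ in range(n_replicates))
    return hits / n_replicates


alignment = {"A": "ATGGCTATTC", "B": "ATCGCTAGTC", "C": "TTCACTAGAC"}


def a_closer_to_b(aln):
    """Toy stand-in for tree building: is A nearer to B than to C?"""
    def differences(x, y):
        return sum(p != q for p, q in zip(aln[x], aln[y]))
    return differences("A", "B") < differences("A", "C")


print(bootstrap_support(alignment, a_closer_to_b, n_replicates=200, seed=1))
```

In practice the callback would run the same tree-building method used for the original alignment on each replicate.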
Resolving conflict
• Researchers have more confidence in trees that use
– Larger data sets
– Characters that are not subject to homoplasy
– Appropriate inference methods
• Sometimes researchers have to wait for more data
Molecular clocks
• Timing and rate of evolution can be determined by looking at changing molecular traits.
• Changes in DNA sequences that are not tied to phenotypes and cannot be selected against can be tracked
• The neutral theory of molecular evolution predicts that neutral changes in DNA should become fixed at a rate equal to the mutation rate.
Caveats for molecular dating
• Mutation rates to neutral alleles vary among genes, lineages and bases
• Third codon positions are more likely to be neutral and to change in a clock-like fashion
• Rapid changes in allele frequencies in response to selection pressure produce unreliable clocks
• Calibration rates for a particular gene or lineage cannot be used for other groups that have different generation times and selection histories
When did humans start wearing clothes?
Pediculus corporis
Pediculus capitis
Kittler et al., 2003
Summary of the lousy results
• Greater diversity in African than in non-African lice
• Lice probably originated in Africa along with humans
• Clothing appeared 30,000 to 114,000 years ago
• The expansion of lice diversity represents the migration of humans out of Africa
• Clothing may have allowed the successful movement of humans out of Africa into colder climates.
Which species are the closest living relatives of modern humans?
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.
[Figure: two trees of chimpanzees, bonobos, humans, gorillas and orangutans, each with a time axis in MYA; one axis runs from 15-30 to 0, the other from 14 to 0]
Did the Florida Dentist infect his patients with HIV?
DENTIST
DENTIST
Patient D
Patient F
Patient C
Patient A
Patient G
Patient B
Patient E
Patient A
Local control 2
Local control 3
Local control 9
Local control 35
Local control 3
Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
No
No
From Ou et al. (1992) and Page & Holmes (1998)
Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:
What was the most likely geographical location of the common ancestor of the African apes and humans?
Eurasia = Black Africa = Red
= Dispersal
Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R582-588.
Scenario B requires four fewer dispersal events
OW Monkeys
Chimpanzees
Humans
Gorillas
Orangutans
Gibbons
Chimpanzees
Humans
Gorillas
Orangutans
Gibbons
Chimpanzees
Humans
Gorillas
Orangutans
Gibbons
Chimpanzees
Humans
Gorillas
Orangutans
Gibbons
Ouranopithecus
Dryopithecus
Lufengpithecus
Living Species
Living + Fossil Species
Oreopithecus
Proconsul
OW Monkeys
OW Monkeys
Kenyapithecus
OW Monkeys
Kenyapithecus
Proconsul
Ouranopithecus
Dryopithecus
Lufengpithecus
Oreopithecus
Scenario A: Africa as species fountain Scenario B: Eurasia as ancestral homeland
How can we choose between competing hypotheses on phylogeny of whales?
Phylogenetic Reconstruction of Whales
• Whales belong to Artiodactyla (even-toed ungulate mammals), which includes camels, pigs, hippos, cows, deer
• Outgroup is rhinos/horses
• Difficult to place them because they lack many characters present in terrestrial mammals (e.g. hind limbs)
• Are whales sister to entire group or to hippos?
DNA Sequence Data and Whale Evolution
• Data collected from beta-casein gene for all taxa and sequences aligned.
• Nucleotide changes between outgroup and ingroup species indicate shared derived homologies.
• Most nucleotides are identical in all taxa; these are uninformative for phylogeny.
• Some nucleotides indicate that whales belong with cows, deer, and hippos (162).
• Others indicate that whales and hippos are sister groups (166).
• Others contradict sister group status of whale/hippo and cow/deer (177) and may indicate a reversal.
Phylogeny results should be treated as informative but not authoritative