Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Introduction The Coalescent Bayesian Inference MCMC Visualisation Tree Shape Taxa Position

Upload: australian-bioinformatics-network

Post on 10-May-2015


DESCRIPTION

Computer cycles have been getting exponentially cheaper for the past 40 years, and the cost of DNA sequencing is declining even faster. Both technological achievements are responsible for the rebirth of phylogenetics, where the digital information in genetic data is used to infer organisms' relatedness by employing powerful statistical methods. In my talk I will briefly review the history of inferring species family trees, introduce a recent Bayesian method which uses data from multiple organisms in closely related species, and show how the method's output can be visualized to better understand the results.

TRANSCRIPT

Page 1: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Page 2: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Pre-Darwin phylogenetic trees

Page 3: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Origin's sole figure

Page 4: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Cytochrome C Gene Tree (Fitch, 1967)

Page 5: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

• Processes of speciation

• Evolution of traits

• Biogeography

• Epidemiology

• Co-Evolution (host/parasite)

• Domestication

Page 6: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Selecting a “Duck”

Page 7: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Molecular Clock (early ’60s)

Page 8: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Models of Sequence Evolution

JC69 model (Jukes and Cantor, 1969)

Page 9: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Kingman Coalescent (1982)

Page 10: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Wright-Fisher Population (1931)

• The individuals were randomly sampled from a population of size N.

• The parent of any individual is chosen uniformly at random from all potential parents.

Page 11: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Coalescent

The larger the population, the longer (on average) you have to travel back in time for the common ancestor.
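The claim can be checked with a small simulation. The sketch below assumes a haploid Wright-Fisher population of constant size and traces two sampled lineages backwards until they pick the same parent; the function names and the seeding are illustrative choices, not something from the slides.

```python
import random


def generations_to_common_ancestor(n, rng):
    """Trace two lineages back in a Wright-Fisher population of size n."""
    generations = 0
    while True:
        generations += 1
        # Each lineage picks its parent uniformly at random among n parents.
        if rng.randrange(n) == rng.randrange(n):
            return generations


def mean_generations(n, replicates, seed=0):
    """Average time to the common ancestor over many independent replicates."""
    rng = random.Random(seed)
    total = sum(generations_to_common_ancestor(n, rng) for _ in range(replicates))
    return total / replicates
```

Under these assumptions the two lineages merge in any given generation with probability 1/n, so the average returned by `mean_generations` should sit close to n and grow with the population size.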

Page 12: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Coalescent for multiple individuals

The waiting time for the first common ancestor of two individuals out of m (going backwards in time) is exponential with a rate of $\binom{m}{2} / N_e$.

$N_e$ is the Wright-Fisher effective population size.
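As a rough illustration of this rate, the sketch below adds up successive exponential waiting times while the number of lineages drops from m to 1. The helper names are hypothetical, and time is measured in whatever units make the rate with k lineages equal to C(k, 2) / Ne.

```python
import random
from math import comb


def expected_tmrca(m, ne):
    """Expected time until m lineages reach a single common ancestor."""
    # The mean of an exponential waiting time is 1 / rate = ne / C(k, 2).
    return sum(ne / comb(k, 2) for k in range(2, m + 1))


def simulate_tmrca(m, ne, rng):
    """Draw one time to the most recent common ancestor of m lineages."""
    t = 0.0
    for k in range(m, 1, -1):
        # With k lineages left, the next coalescence is Exp(rate = C(k, 2) / ne).
        t += rng.expovariate(comb(k, 2) / ne)
    return t
```

For two lineages the expected time is simply Ne, which matches the earlier statement that larger populations mean a longer trip back to the common ancestor.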

Page 13: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

From Models to Inference

Page 14: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Bayes’ Theorem (a Reminder)

P(A ∧ B) = P(A)P(B |A)

Page 15: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Bayes’ Theorem (2)

P(B)P(A|B) = P(A ∧ B) = P(A)P(B |A)

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Page 16: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Bayesian Inference

Page 17: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Models (so far)

Substitution model: A stochastic process for the evolution (change) of genetic data (sequences) over time.

Clock model: How substitution rates change over time.

Coalescent model: A stochastic process for the ancestral relationship between a group of homologous sequences from several individuals.

Page 18: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Models (Math Notation)

Coalescent model: f (T |Ne)

Substitution model: f (G |T )

Where G is the gene (sequence data) and T is the ancestral relationships (tree).

Page 19: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Biological Species Concept

The conventional definition of “a species” amongst evolutionary biologists is “a group of organisms whose members interbreed among themselves, but are separated from other groups by genetically-based barriers to gene flow.”

Jerry Coyne “Why Evolution is True” blog.

Page 20: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Species “tree”

Page 21: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Gene(s) tree

Page 22: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Species Tree Ancestral Reconstruction

Page 23: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Multiple Individuals from each Species

Page 24: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Multispecies coalescent – Kingman Coalescent per Species Tree Branch

Page 25: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Multiple Independent Loci

Page 26: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Multispecies Posterior

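$$
\begin{aligned}
P(S \mid D) &= \int_g P(S, g \mid D) \\
&\propto \int_g P(D \mid S, g)\, P(S, g) \\
&= \int_g P(D \mid g)\, P(S, g) \\
&= \int_g P(D \mid g)\, P(g \mid S)\, P(S)
\end{aligned}
$$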

Page 27: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

A Complex Posterior

$$P(S \mid D) = \int_g f(S, g \mid D) = \int_g \frac{f(D \mid S, g)\, f(S, g)}{f(D)}$$

Page 28: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Problem 1: f(D)

The prior probability of obtaining data D.

$$P(S \mid D) = \int_g \frac{f(D \mid S, g)\, f(S, g)}{f(D)}$$

We don’t know the value of f(D).

$$f(D) = \int_{g, S} f(D \mid S, g)\, f(S, g)$$

However, it is a constant.
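Why a constant f(D) is harmless can be seen by comparing two states of the posterior. As a short worked step, with (S', g') standing for a second, hypothetical pair of species tree and gene trees:

$$
\frac{f(S', g' \mid D)}{f(S, g \mid D)}
= \frac{f(D \mid S', g')\, f(S', g') / f(D)}{f(D \mid S, g)\, f(S, g) / f(D)}
= \frac{f(D \mid S', g')\, f(S', g')}{f(D \mid S, g)\, f(S, g)}
$$

The unknown f(D) cancels, so any method that only needs ratios of posterior densities can work without it.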

Page 29: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Problem 2: The Whole Damn Thing

The posterior is a distribution defined by a complex multidimensional integral.

Page 30: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Enter MCMC

Markov Chain Monte Carlo (MCMC) is a class of methods for stochastically sampling from probability distributions based on constructing a Markov chain.

Page 31: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Very short history of MCMC

1953: Metropolis algorithm published in Journal of Chemical Physics (Metropolis et al.)

1970: Hastings algorithms in Biometrika (Hastings)

1974: Gibbs sampler and Hammersley-Clifford theorem paper by Besag

1980s: Image analysis and spatial statistics enjoyed MCMC algorithms, not popular with others due to the lack of computing power

1995: Reversible jump algorithm in Biometrika (Green)

groundtruth.info/AstroStat/slog/2008/mcmc-historyo

Page 32: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell

Page 33: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell (2)

Page 34: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell (3)

Page 35: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell (4)

Page 36: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell (5)

If we propose to go from B or A to either A or B with equal probability, then

[Figure: two states, A with weight 2 and B with weight 1.]

Flow from A to B is 2/3 · 1/4 = 1/6, and from B to B is 1/3 · 1/2 = 1/6, 1/3 in total.

Page 37: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell (6)

$$\text{Hastings Ratio}(x \to y) = \frac{p(y \to x)\, f(y)}{p(x \to y)\, f(x)}$$

So far we had p(y → x) = p(x → y), that is, the probability of going from x to y was equal to that of going back.

Page 38: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell (7)

If at A we always propose to go to B, but from B we go to A or B with equal probability, that is,

Table: p(x → y)

          to A    to B
from A    0       1
from B    1/2     1/2

Then

$$\mathrm{HR}(A \to B) = \frac{1/2 \cdot 1}{1 \cdot 2} = 1/4,$$

and HR(B → A) = 4.

Page 39: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

MCMC in a nutshell (8)

[Figure: two states, A with weight 2 and B with weight 1.]

Flow from B to A is 1/3 · 1/2 = 1/6 and from A to A is 2/3 · 3/4 = 1/2, 2/3 in total.
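The flow arithmetic on the last few slides can be reproduced exactly with fractions. This is a minimal sketch, assuming the usual acceptance rule (accept a proposal with probability min(1, Hastings ratio)) and weights f(A) = 2, f(B) = 1; the function names `mh_transition_matrix` and `flow_into` are made up for the illustration.

```python
from fractions import Fraction as F


def mh_transition_matrix(weights, proposal):
    """Metropolis-Hastings transition probabilities on a finite state space.

    weights[x] is the unnormalised target f(x); proposal[x][y] is p(x -> y).
    """
    n = len(weights)
    matrix = [[F(0)] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x == y or proposal[x][y] == 0:
                continue
            # Hastings ratio for the move x -> y.
            ratio = (F(proposal[y][x]) * weights[y]) / (F(proposal[x][y]) * weights[x])
            matrix[x][y] = F(proposal[x][y]) * min(F(1), ratio)
        # Rejected proposals and self-proposals keep the chain where it is.
        matrix[x][x] = 1 - sum(matrix[x])
    return matrix


def flow_into(state, weights, matrix):
    """Probability mass arriving at `state` from each state in one step."""
    total = sum(weights)
    return [F(w) / total * matrix[x][state] for x, w in enumerate(weights)]
```

With states ordered (A, B), weights `[2, 1]` and the symmetric proposal `[[F(1, 2), F(1, 2)], [F(1, 2), F(1, 2)]]`, `flow_into(1, ...)` gives 1/6 from A and 1/6 from B. With the asymmetric proposal `[[F(0), F(1)], [F(1, 2), F(1, 2)]]`, `flow_into(0, ...)` gives 1/2 from A and 1/6 from B. In both cases the totals are 1/3 and 2/3, the target proportions.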

Page 40: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Tree(s) Visualisation

Page 41: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Traditional Tree Visualization

[Figure: a traditional tree drawing with tips sum, oax, Cyan, FLSJ, pot, arib, aria, ult, couc, woll, coast, con, int and ins; numeric node labels between 0.38 and 1; scale bar 0.0090.]

Page 42: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Traditional Tree Visualization (2)

Page 43: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Species Tree with Population Sizes

Page 44: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Species Tree with Gene Trees

Page 45: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Species Tree (Densitree)

Page 46: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The “Star Tree”

Page 47: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Species Tree (Densitree)

Page 48: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Taxa Order Matters (When Drawing Multiple Trees)

[Figure: three trees with their tip orders and percentages.]

0 1 2 3 4 5 6 (73%)

0 1 5 6 2 3 4 (17%)

2 3 4 5 6 0 1 (10%)

Page 49: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Some Orders are Better than Others

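[Figure: the same three trees with their tip orders and percentages, plus a fourth tip order.]

0 1 2 3 4 5 6 (73%)

0 1 5 6 2 3 4 (17%)

2 3 4 5 6 0 1 (10%)

2 3 4 0 1 5 6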

Page 50: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Disadvantages:

• Population size changes in one branch have a visual effect on other branches.

• Fails when trying to show the whole posterior à la DensiTree.

• No obvious way to extend for trees with constant population size per branch.

Page 51: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

The Imperial AT-AT Tree

Provide some space between branches.

Page 52: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

A double Act

Target species tree (blue) and *BEAST posterior summary (orange).

Page 53: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Species Tree with Constant Population Sizes

To extend to constant branches, we need a rule to place the bottom of the branch on top of the descendant branches. We use the proportion rule.

Page 54: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Position, Position, Position

A species tree specifies heights and widths. The challenge is to pick good X-axis positions.

The star tree builds the tree from the root towards the tips. Building from the tips towards the root is simpler when drawing species trees. When building from the tips, the descendants' X-positions determine the parent position.

Page 55: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Position, Position, Position

However, there are many ways to place nodes. Here are four of them:

Descendants Mean: Halfway between direct descendants.

Tips Mean: Average of all tips in the sub-tree.

Middle: Halfway between the rightmost tip of the left sub-tree and the leftmost tip of the right sub-tree.

Balanced by Population: At the point minimizing the difference between branch bottom and top centers.
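The first three rules can be sketched in a few lines. The representation here is an assumption made for the example: a tip is its X position (a number) and an internal node is a `(left, right)` tuple. Balanced by Population is left out because it needs branch widths, which this toy representation does not carry.

```python
def tips(tree):
    """X positions of all tips below a node, left to right."""
    if isinstance(tree, tuple):
        left, right = tree
        return tips(left) + tips(right)
    return [tree]


def descendants_mean(tree):
    """Halfway between the positions of the two direct descendants."""
    if not isinstance(tree, tuple):
        return tree
    left, right = tree
    return (descendants_mean(left) + descendants_mean(right)) / 2


def tips_mean(tree):
    """Average X position of all tips in the sub-tree."""
    xs = tips(tree)
    return sum(xs) / len(xs)


def middle(tree):
    """Halfway between rightmost tip of the left and leftmost tip of the right."""
    if not isinstance(tree, tuple):
        return tree
    left, right = tree
    return (max(tips(left)) + min(tips(right))) / 2
```

For the balanced tree `((0, 1), (2, 3))` all three rules put the root at 1.5. For the unbalanced tree `(((0, 1), 2), 3)` they give 2.125, 1.5 and 2.5 respectively, so the choice of rule only starts to matter once the tree is lopsided.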

Page 56: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Node Positioning

The methods are similar for balanced trees. The difference is in the handling of unbalanced trees.

[Figure panels: D-Mean, T-Mean, Middle, Balanced]

Page 57: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Node Positioning

[Figure panels: D-Mean, Balanced]

Page 58: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Species Trees Posterior

[Figure: horizontal axis 0 to 80, vertical axis 0 to 15; tips aria, arib, woll, couc, pot, ult, coast, sum, ins, int, FLSJ, con, oax and Cyan.]

Page 59: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Gene Trees Within Species Trees: Preliminary

Next we would like to draw the gene tree within the species tree.

Hurdle 1: Obtain a suitable gene tree.

The gene tree has to be compatible with the species tree. This is not a problem when drawing a specific MCMC state, but is a problem when using summary trees.

Page 60: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Tips Positioning

Hurdle 2: Branches with non-constant width complicate positioning of tips.

[Figure: four panels, each with a vertical axis from 0.0 to 3.0.]

Page 61: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Tips Positioning (automatic)

The placement ensures that extrema points are at least ε apart (horizontally).

[Figure: horizontal axis −2 to 10, vertical axis 0.0 to 3.0.]

Page 62: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Tips Positioning (automatic)

Even with Python, it is far from trivial to implement. Remember, we want a tight fit, and placing should work for all modes of internal positioning.

Basically, we build the tree from the bottom up by joining clades. The X position of the extrema points for the clade is a linear function of the spacing between the sub-trees (where the spacing inside the two sub-trees is fixed). So each extrema point sets a lower limit on the spacing, and the largest is taken as the final separation.

The best way to pick ε for a tree still needs to be worked out.
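The "largest lower limit" step when joining two clades might look roughly like this. It is a much simplified sketch, not the actual implementation: it assumes each sub-tree's facing outline is already known at a shared set of heights, stored as a hypothetical `{height: x}` dictionary in the sub-tree's own coordinates.

```python
def min_separation(left_outline, right_outline, eps):
    """Smallest shift of the right sub-tree that keeps outlines eps apart.

    left_outline:  {height: x of the right edge of the left sub-tree}
    right_outline: {height: x of the left edge of the right sub-tree}
    """
    shared = left_outline.keys() & right_outline.keys()
    if not shared:
        raise ValueError("outlines share no heights")
    # After shifting by s the gap at height h is right[h] + s - left[h],
    # so each extremum demands s >= left[h] - right[h] + eps.
    return max(left_outline[h] - right_outline[h] + eps for h in shared)
```

For example, with `left = {0.0: 1.0, 1.0: 2.5, 2.0: 1.5}`, `right = {0.0: 0.0, 1.0: -1.0, 2.0: -0.5}` and `eps = 0.25`, the three lower limits are 1.25, 3.75 and 2.25, so the shift is 3.75 and the tightest gap, at height 1.0, is exactly ε.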

Page 63: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Drawing the Gene Tree

Hurdle 3: A suitable policy for drawing the gene tree.

We reuse the ideas of the Star Tree. Given the position of an internal node, the left/right branches are drawn as straight lines towards the “middle” of the left/right sub-trees. But we still need to handle the species transitions.

From the bottom up, we (linearly) map the lineages leaving the branch to the top of the branch. The top of the clade is then put in the middle of the mapped taxa.

Page 64: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Visual Clutter

[Figure: “Contained Tree”, with gene tips g0 to g15 inside species A to E; axis from 0.0 to 900.0.]

Page 65: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Gene Tree Inside Species Tree: As-Is

[Figure: horizontal axis 0 to 10, vertical axis 0.0 to 3.5; tip order ID2 ID3 ID4 ID0 ID1 ID5 ID6.]

Page 66: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Reducing Visual Clutter (1)

[Figure: two drawings of a tree with tips a, b and x.]

The size (number of tips) of the sub-tree (b,x) is 2, but the span (number of tips between leftmost and rightmost) is 3.

Page 67: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Reducing Visual Clutter (2)

For every sequential arrangement of the gene tree taxa we can get a rough measure of the amount of crossings,

$$\sum_{n \in \text{Internal Nodes}} \left( \mathrm{span}(n) - \mathrm{size}(n) \right)$$

size(n) is the number of taxa in the sub-tree. span(n) is the number of taxa in the group bounded by the leftmost and rightmost tips of the sub-tree. The difference is the excess taxa, the number of potential lineages that may need to cross out of the clade.

Note that valid arrangements depend on the orientation of the species tree, so optimization should be over both species ordering and gene tips arrangements compatible with that order. Since the number of arrangements is typically large, we resort to multiple tries of hill climbing. The number of tries might be fixed or bounded by time.
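A small sketch of the measure and of one hill-climbing try follows. Two things are assumptions made for the example: trees are nested tuples of tip names, and the hill climber swaps arbitrary pairs of tips, ignoring the constraint that arrangements must be compatible with the species tree.

```python
import random


def tip_sets(tree, out):
    """Append the tip set of every internal node to out; return this node's tips."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    members = frozenset().union(*(tip_sets(child, out) for child in tree))
    out.append(members)
    return members


def excess(tree, order):
    """Sum over internal nodes of span(n) - size(n) for a given tip order."""
    index = {tip: i for i, tip in enumerate(order)}
    nodes = []
    tip_sets(tree, nodes)
    total = 0
    for members in nodes:
        positions = [index[t] for t in members]
        span = max(positions) - min(positions) + 1
        total += span - len(members)
    return total


def hill_climb(tree, order, rng, steps=1000):
    """One try: keep a random swap of two tips only if it lowers the score."""
    best = list(order)
    best_score = excess(tree, best)
    for _ in range(steps):
        i, j = rng.sample(range(len(best)), 2)
        best[i], best[j] = best[j], best[i]
        score = excess(tree, best)
        if score < best_score:
            best_score = score
        else:
            best[i], best[j] = best[j], best[i]  # undo the swap
    return best, best_score
```

For the tree `("a", ("b", "x"))`, the order `["b", "a", "x"]` scores 1 (the sub-tree (b,x) has size 2 and span 3), while `["a", "b", "x"]` scores 0. A single try can stall in a local minimum on larger trees, which is why several tries from different starting orders are used.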

Page 68: Species and Gene Trees: History, Inference, and Visualization - Joseph Heled

Unresolved Conflicts

Optimized tree on the right.