the coalescent and measurably evolving populations alexei drummond department of computer science...

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

The Coalescent and Measurably Evolving Populations

Alexei Drummond

Department of Computer Science

University of Auckland, NZ

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Overview

1. Introduction to the Coalescent

2. Hepatitis C in Egypt• An example using the coalescent

3. Measurably evolving populations

4. HIV-1 evolution within and among hosts• An example using MEP concepts

5. Summary + Conclusions

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns The coalescent

• The coalescent is a model of the ancestral relationships of a small sample of individuals taken from a large background population.

• The coalescent describes a probability distribution on ancestral genealogies (trees) given a population history. – Therefore the coalescent can convert information from ancestral

genealogies into information about population history and vice versa.

• The coalescent is a model of ancestral genealogies, not sequences, and its simplest form assumes neutral evolution.

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns The history of coalescent theory

• 1930-40s: Genealogical arguments well known to Wright & Fisher

• 1964: Crow & Kimura: Infinite Allele Model

• 1966: (Hubby & Lewontin) & (Harris) make first surveys of population allele variation by protein electrophoresis

• 1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So do King & Jukes

• 1971: Kimura & Ohta proposes infinite sites model.

• 1975: Watterson makes explicit use of “The Coalescent”

• 1982: Kingman introduces “The Coalescent”.

• 1983: Hudson introduces “The Coalescent with Recombination”

• 1983: Kreitman publishes first major population sequences.

• 1987: Cann et al. traces human origin and migrations with mitochondrial DNA.

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns The history of coalescent theory

• 1988: Hughes & Nei: Genes with positive Darwinian Selection.

• 1989-90: Kaplan, Hudson, Takahata and others: Selection regimes with coalescent structure (MHC, Incompatibility alleles).

• 1991: MacDonald & Kreitman: Data with surplus of replacement interspecific substitutions.

• 1994-95: Griffiths-Tavaré + Kuhner-Yamoto-Felsenstein introduces sampling techniques to estimate parameters in population models.

• 1997-98: Krone-Neuhauser introduces Ancestral Selection Graph

• 1999: Wiuf & Donnelly uses coalescent theory to estimate age of disease allele

• 2000: Wiuf et al. introduces gene conversion into coalescent.

• 2000-: A flood of SNP data & haplotypes are on their way.

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Population processes

Genealogy

COALESCENT

THEORY

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Randomly sample individuals from population

Obtain gene sequences from sampled individuals

Reconstruct tree / trees from sequences

Infer coalescent results from tree / trees

Infer coalescent results directly from sequences

Coalescent inference

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Demographic History

• Change in population size through time

• Applications include– Estimating history of human populations– Conservation biology– Reconstructing infectious disease epidemics– Investigating viral dynamics within hosts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Idealized Wright-Fisher populations

Now

Parents

Grand parents

Haploid Diploid

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Random mating in an ideal population

•A constant population size of N individuals•Each individual in the new generation “chooses” its parent from the previous generation at random

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Genetic drift: extinction and ancestry

If you trace the ancestry of a sample of individuals back in time you inevitably reach a single most recent common ancestor.If you pick a random individual and trace their descendents forward in time, all the descendents of that individual will with high probability eventually die out.

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Dis

cret

e G

ener

atio

ns

Present

PastA sample genealogy of 3 sequences from a population (N =10).

Present

Past

A sample genealogy from an idealized Wright-Fisher population

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

The coalescent: distributions and expectations on a sample genealogy

Present

Past

t2 ~ Exp(N)

t3 ~ Exp N /3

E[t3]N /3

E[t2]N

E[tk ]2N

k(k 1)

E[troot ]2N 11

n

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

The coalescent: probability density distribution

Present

Past

P(t2 | N) 1

Nexp

t2

N

P(t3 | N) 3

Nexp

3t3

N

fG (g | N) k(k 1)

Nexp

k(k 1)tk

N

k2

n

dg

t2 ~ Exp(N)

t3 ~ Exp N /3

Kingman (1982a,b)

g Eg ,t The genealogy is an edge graph Eg and a vector of times t.

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

The coalescent: estimating population size from a sample genealogy

Present

Past

ˆ N k k(k 1)

2tk

ˆ N 3 24

ˆ N 2 7

ˆ N 1

n 1

k(k 1)

2tk

k2

n

ˆ N 15.5

t2 7

t3 8

Felsenstein (1992)

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

The coalescent: estimating population size confidence limits via ML

Maximum likelihood can be used to estimate population size by choosing a population size that maximizes the probability of the observed coalescent waiting times.

The confidence intervals are calculated from the curvature of the likelihood. For a single parameter model the 95% confidence limits are defined by the points where the log-likelihood drops 1.92 log-units below the maximum log-likelihood.

-20

-18

-16

-14

-12

-10

-8

-6

1 10 100 1000

Population size (N)

rela

tive log lik

elihood

ˆ N 15.5 (5.1, 93.1)

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Exponential growth Constant size

The coalescent: shapes of gene genealogies

The coalescent can be used to convert coalescent times into knowledge about population size and its change though time.

Th

e C

oal

esce

nt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

smallsmall NN00 large large NN00

TIMETIME

Constant population size: N(t)=N0

Th

e C

oal

esce

nt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Coalescent and serial samples

Constant population Exponential growth

Th

e C

oal

esce

nt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Uncertainty in Genealogies

Th

e C

oal

esce

nt

How similar are these two trees? Both of them are plausible given the data.

We can use MCMC to get the average result over all plausible trees,

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Coalescent Summary

• The coalescent provides a theory of how population size is related to the distribution of coalescent events in a tree.

• Big populations have old trees• Exponentially growing populations have star-like trees• Given a genealogy the most likely population size can be

estimated.• MCMC can be used to get a distribution of trees from

which a distribution of population sizes can be estimated.

Th

e C

oal

esce

nt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Markov chain Monte Carlo (MCMC)

MC

MC

• Imagine you would like to estimate two parameters (,) from some data (D).

• You want to find values of and that have high probability given the data: p(,|D)

• Say you have a likelihood function of the form: Pr{D| ,}

• Bayes rule tells us that:– p(,|D) = Pr{D| ,}p(,) / Pr{D}– So that p(,|D) Pr{D| ,}p(,)

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio


MC

MC

• p(,|D) is called the posterior probability (density) of , given D• In an ideal world we want to know the posterior density for all

possible values of ,.• Then we could pick a “credible region” in two dimensions that

contained values of , that account for the majority of the posterior probability mass.

• This credible region would serve as an estimate that includes incorporates our uncertainty and this credible set could be used to address hypotheses like: is greater than x.

• In reality we have to make due with a “sample” of the posterior - so that we evaluate p(,|D) for a finite number (say 10,000,000) pairs of ,.

• So which pairs should we choose?

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio


MC

MC

• Lets construct a random walk in 2-dimensional space• In each step of the random walk we propose to make an (unbiased)

small jump from our current position (,) to a new position (’,’) • If p(’,’|D) > p(,|D) then we make the proposed jump• However, if p(’,’|D) < p(,|D), then we make the proposed jump

with probability = p(’,’|D) / p(,|D), otherwise we stay where we are.

• It can be shown (trust me!) that if you proceed in this fashion for an infinite time then the equilibrium distribution of this random walk will be p(,|D)!

• That is, the random walk will visit a particular region [0, 1] x [0, 1] of the state space this often:

p(, | D)dd 0

1

0

1

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio


MC

MC

p(, | D) Z Pr{D |,g} f ( | g) f (,)g

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Hepatitis C Virus (HCV)

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

• Identified in 1989• 9.6kb single-stranded RNA genome • Polyprotein cleaved by proteases• No efficient tissue culture system

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns How important is HCV?

• 170m+ infected• ~80% infections are chronic• Liver cirrhosis & cancer risk• 10,000 deaths per year in

USA• No protective immunity?

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Percutaneous exposure to infected bloodPercutaneous exposure to infected blood

• Blood transfusion / blood products

• Injecting & nasal drug use

• Sexual & vertical transmission

• Unsafe injections

• Unidentified routes

HCV Transmission

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Estimating demographic history of HCV using the coalescent

• Egyptian HCV gene sequences

• n=61

• E1 gene, 411bp

• All sequence contemporaneous

• Egypt has highest prevalence of HCV worldwide (10-20%)

• But low prevalence in neighbouring states

• Why is Egypt so seriously affected?

• Parenteral antischistosomal therapy (PAT)

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Demographic model

• The coalescent can be extended to model deterministically varying populations.

• The model we used was a const-exp-const model.

• A Bayesian MCMC method was developed to sample the gene genealogy, the substitution model and demographic function simultaneously.

N(t)NC if t x

NC exp[ r(t x)] if x t y

NA if t y

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Estimated demographic history

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Averaged over all trees

Based on a single tree

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Parameter estimates

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Uncertainty in parameter estimates

Growth rate of the growth phase

Grey box is the prior

Rates at different codon positions,

All significantly different

Demographic parameters Mutational parameters

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Full Bayesian Estimation

• Marginalized over uncertainty in genealogy and mutational processes

• Yellow band represents the region over which PAT was employed in Egypt

Po

pu

lati

on

gen

etic

s o

f H

epat

itis

C in

Eg

ypt

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Measurably evolving populations (MEPs)

Present time point (n = 5)

Earlier time point (n = 5)

• MEP pathogens:– HIV – Hepatitis C– Influenza A

• MEPs from ancient DNA– Bison– Brown Bears– Adelie penguins– Anything cold and numerous

• Even over short periods (less than a year) HIV sequences can exhibit measurable evolutionary change

• Time-structure can not be ignored in our models

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Time structure in samples

timeContemporary sampleno time structure

Serial samplewith time structure

2000

1980

1990

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Ne

time

AB

CD

E

Molecular evolution and population genetics of MEPs

• Given sequence data that is time-structured estimate true values of:– substitution parameters

• Overall substitution rate and relative rates of different substitutions

– population history: N(t)– Ancestral genealogy

• Topology• Coalescent times

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Molecular evolutionary model: Felsenstein’s likelihood (1981)

b1

b2 b5

b4

b3

AAAC

GA

GCThe probability of the sequence alignment,

can be efficiently calculated given a tree and branch lengths (T), and a probabilistic model of mutation represented by an instantaneous rate matrix (Q). In phylogenetics, branch lengths are usually unconstrained.

Pr{D | T,Q}

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Combining the coalescent with Felsenstein’s likelihood

b1

b2 b5

b4

b3

AA

AC

GA

GC

t2

t4

t3

AA

GA AC GC

The “molecular clock” constraint

2n–3 branch lengths n–1 waiting times

p(N,g,Q | D)Pr{D |g,Q} fG (g | N) fN (N) fQ (Q)

The joint posterior probability of the population history (N), the genealogy (g) and the mutation matrix (Q) are estimated using Markov chain Monte Carlo (Drummond et al, Genetics, 2002)

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Full Bayesian Model

P(g, , Ne, Q | D) Z

1P(D | g, , Q)fG(g | Ne) f()fN(Ne )fQ(Q)

Probability of what we don’t know given what we do know.

Unknown normalizing constant

Likelihood function

coalescent prior

other priors

=

In the software package BEAST, MCMC integration can be used to provide a chain of samples from this density.

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Q = substitution parameters

Ne = population parameters

g = tree

= overall substitution rate

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Pt.7

Pt.9

HIV1U36148

HIV1U36015HIV1U35980

HIV1U36073

HIV1U35926

HIVU95460

Pt.2

Patient #6 fromWolinsky et al.

Pt.5

Pt.3Pt.1Pt.8

Pt.6

10%

Shankarappa et al (1999)

HIV-1 (env) evolution in nine infected individuals

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Molecular clock: HIV-1 (env) evolution in 9 individuals

0 2 4 6 8 10

Years Post Seroconversion

Viral Divergence

2%

4%

6%

8%

10%

Shankarappa et al (1999)

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns MEP Summary

• Most RNA viruses, including HCV and HIV are measurably evolving

• Most vertebrate populations that have well-preserved recent fossil records are MEPs.

• If sequence data comes from different times the time-structure can’t be ignored

• Time structure permits the direct estimation of:– substitution rate– Concerted changes in substitution rate– coalescent times in calendar units– Demographic function N(t) in calendar units

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Intermission

My brain is fried!

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

• HIV is a retrovirus.

• Within infected individuals HIV exhibits extremely high genetic variability due to:

– Error-prone reverse transcriptase (RT) that converts RNA to DNA (error rate is about one mutation per genome per replication cycle).

– DNA-dependent polymerase also error-prone

– High turnover of virus within infected individual throughout infection.

HIV virion

HIV gp120 binds to CD4 T cell surface receptors

viral core inserted into cell

replication of virus genome by reverse transcription (ssRNA to dsDNA)

integration of proviral DNA into DNA of infected cell

viral RNA transcription

regulatory genes

migration of dsDNA to nucleus

viral genomic RNA

RNA packaging and virion assembly

structural proteins and viral enzymes

translation

LTRLTR

budding of virus from cell and maturation

nucleus

host cell

What is HIV?

Po

pu

lati

on

gen

etic

s o

f H

IV

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Patient 2 (Shankarappa et al, 1999)

0

0

12 20 30 40 51 61 68 73 80 85 91 103 126

11 22 20 8 20 20 20 10 8 20 9 20 22

Time in months (post seroconversion)

Number of sequences obtained per sample

• 210 sequences collected over a period of 9.5 years• 660 nucleotides from env: C2-V5 region• Effective population size and mutation rate were co-estimated using

Bayesian MCMC.Po

pu

lati

on

gen

etic

s o

f H

IV

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

A tree sampled from the posterior distribution

Lineage A

Lineage B

‘Ladder-like’ appearance

Po

pu

lati

on

gen

etic

s o

f H

IV

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Estimated substitution rate

• Patient 2:– 0.77–1.0% per year

• BUT….Long term rates in HIV

– Korber et al:• 0.24% (0.18-0.28%) per year• Only 1/4 of the intrapatient

rate

Po

pu

lati

on

gen

etic

s o

f H

IV

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Bayesian MCMC of Shankarappa data

estimate upper limitp1 Logistic 0.0123 882 0.278 1.57% 6.68%p2 plasma Logistic 0.0166 1708 0.242 0.63% 3.34%p2 provirus Logistic 0.0090 2798 0.278 0.04% 0.18%p3 Logistic 0.0175 620 0.237 1.80% 4.97%p5 plasma Exponential 0.0223 938 0.192 11.80% 27.50%p5 provirus Logistic 0.0215 1345 0.293 8.19% 15.20%p6 Logistic 0.0195 581 0.221 0.52% 1.35%p7 Logistic 0.0085 3320 0.322 2.94% 13.60%p8 Exponential 0.0162 2309 0.455 28.70% 48.50%p9 Logistic 0.0071 2757 0.346 17.90% 44.80%p11 Logistic 0.0128 2502 0.239 0.15% 0.53%

Overall Logistic 0.0148 1796 0.282 6.75% 15.15%* At the time of last sample assuming a generation length of 2.6 days

Bottleneck (at seroconversion)

Patient

Best-fitting demographic

modelEstimated rate

(per site per year)

Effective population

size*

Rate heterogeneity

(alpha parameter)

Mea

sura

bly

evo

lvin

g p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Intra- and interpatient rate estimates (C2V3 envelope)

0.00E+00

5.00E-03

1.00E-02

1.50E-02

2.00E-02

2.50E-02

3.00E-02

0 50 100 150 200 250

Sampling interval (months)

rate

(p

er

sit

e p

er

year)

Intrapatient estimates

Interpatient estimates

BAC

p1 - p11

Po

pu

lati

on

gen

etic

s o

f H

IV

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Summary: HIV intra-patient evolution

• HIV evolutionary rates appear to be faster intra-patient then across pandemic– Different selection pressure at transmission?– Transmitted viruses undergoing less rounds of

replication?– Latent viruses?– Reversion of escape mutants?

• Effective population size is changing over time (bottleneck in envelope at least)

Po

pu

lati

on

gen

etic

s o

f H

IV

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns But how good is our best model?

• We can use standard statistical model-choice criteria to choose between different models of substitution and demography, but are any of the models we consider any good at all?

• One way to look at this is ask the following question:– Does our real data look anything like what we would

expect data from our model to look like?

• So what aspect of the data should we look at?

• And what should we expect?

Go

od

nes

s-o

f-fi

t te

sts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

We could look at branch length distributions…

E[Ln ]2Ne

E[Jn ]2Ne

1

kk1

n 1

E[troot ]2Ne 11

n

troot

Ln

Jn Ln

Go

od

nes

s-o

f-fi

t te

sts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Tree imbalance measures might also be interesting…

4 cherries 3 cherries 2 cherries

Ic 0

Ic 0.24

Ic 0.81

N 3

N 3.125

N 4.125

Go

od

nes

s-o

f-fi

t te

sts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Posterior predictive simulation

• A method of testing the goodness-of-fit of a Bayesian model.

1. Run a Bayesian MCMC analysis on the data2. Calculate the value of your favourite summary statistic, T(.)

from the data, D3. For each state in the chain

1. Simulate a synthetic dataset, Di, using the parameter values of state i.

2. Calculate T(Di) from the simulated data set.

4. Compare the T(D) value with predictive distribution of T(Di)

Po

ster

ior

pre

dic

tive

sim

ula

tio

n

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns So we need some summary statistics

• Summary statistics that can be measured directly from sequence alignment:– Mean pairwise distance

()– Tajima’s D– Fu & Li’s D– Number of segregating

sites (S)– …

• Summary statistics that can be measured directly from an genealogy:– Genealogical mean

pairwise distance ()– Genealogical Tajima’s D– Genealogical Fu & Li’s D– Tree-imbalance statistics– Age of the root– Length of the treeP

ost

erio

r p

red

icti

ve s

imu

lati

on

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Posterior predictive simulation (2)

• Testing the goodness-of-fit of the neutral coalescent model under variable demographic functions.

1. Run a Bayesian MCMC analysis on the data2. For each state in the chain

1. Simulate a coalescent genealogy (GiS) using the population parameter

values of state i.

2. Calculate T(GiS) from the ith simulated genealogy

3. Calculate T(GiP) from the ith posterior genealogy

3. Calculate the predictive probability by comparing the posterior distribution of T(.) with predictive distribution of T(.):

Po

ster

ior

pre

dic

tive

sim

ula

tio

n

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Human influenza A (HA gene) trees

Posterior genealogy Predictive simulations

State 5m

State 10m

Ne 9.12

t2 11.03 years

Ne 5.00

t2 15.29 years

Go

od

nes

s-o

f-fi

t te

sts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Human influenza A trees: Genealogical Fu & Li’s D statistic

Go

od

nes

s-o

f-fi

t te

sts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Puerto Rican Dengue-4 gene trees: multivariate summary statistics

Go

od

nes

s-o

f-fi

t te

sts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Results of test of neutrality

Table 2. The predictive probabilities (

PT) for summary statistics on each of the example

data sets are shown. Significant departures from neutrality are marked (*) and marginallysignificant departures (x < 0.05 or x > 0.95) are marked with (†). Significant departureson the best fitting model for each data set are in bold.

Predictive probabilitiesDataset Demographic

modelT troot DFL IC Cn B1

Brown bear Constant 0.739 0.815 0.863 0.693 0.163 0.103

(d-loop) Exponential growth 0.615 0.623 0.800 0.679 0.163 0.111

RSVA Constant 0.956† 0.964† 0.946 0.163 0.152 0.134

(g gene) Exponential growth 0.693 0.656 0.884 0.206 0.149 0.134

Dengue-4 Constant 0.9574† 0.9958* 0.9997* 0.562 0.608 0.427

(E gene) Exponential growth 0.745 0.809 0.9792* 0.559 0.653 0.505

Human influenza A Constant 0.9510† 0.900 0.9999* 0.0462† 0.605 0.610

(HA) Exponential growth 0.910 0.620 0.9995* 0.0866 0.575 0.677

Go

od

nes

s-o

f-fi

t te

sts

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Results for 28 HIV-1 infected individuals

Go

od

nes

s-o

f-fi

t te

sts

Fu & Li's D

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0.0 0.2 0.4 0.6 0.8 1.0

proportion of data sets

P*

Sim Constant sizeSim Exponential GrowthTargetConstant sizeExponential Growth

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Is the population size constant?

Patient 2

Po

pu

lati

on

gen

etic

s o

f H

IV

Pop size

10

100

1000

0 20 40 60 80 100 120

months (post seroconversion)

Ne /

30

mean

lower

upper

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Virus population dynamics

Measles virus

Human influenza virus

Ph

ylo

dyn

amic

s

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Dengue-4: Modeling complex demography

Hospital case data courtesy of Shannon Bennett

N(t) = N0exp(-rt): -10566.421N(t) = scaled translated case data: -10478.572

Ph

ylo

dyn

amic

s

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Population size changes

Po

pu

lati

on

siz

e ch

ang

es

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns The generalized skyline plot

• Visual framework for exploring the demographic history of sampled DNA sequences

• Input: a single estimated ancestral genealogy (a tree)• Output: nonparametric plot of the population size through time

– Groups adjacent coalescent intervals

– Converts information within these intervals to estimates of population size

Po

pu

lati

on

siz

e ch

ang

es

ˆ N k k(k 1)

2tk

ˆ N k,l k(k l)

2lti

ik l1

k

Estimate of population size from single coalescent interval

Estimate of population size from l adjacent coalescent intervals.

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Examples

I: Constant population size

N(t)=N(0)

Gen

eral

ized

Sky

line

Plo

t

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Skyline Plot

I: Constant population size

N(t)=N(0)

II: Exponential growth

N(t)=N(0)e-rt

Gen

eral

ized

Sky

line

Plo

t

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Skyline Plot

III: HIV-1 group M(tree estimated in Yusim et al (2001) Phil. Trans. Roy. Soc. Lond. B 356: 855-866)

– Black curve is a parametric estimate obtained from the same data under the “expansion model”

– Results follow accepted demographic pattern for the HIV pandemic

Gen

eral

ized

Sky

line

Plo

t

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns The Bayesian skyline plot

Estimate a demographic function that has a certain fixed number of steps (in this example 15) and then integrate over all possible positions of the break points.

Explains the Dengue data quite well (test of neutrality do not reject the data if we use the Bayesian skyline plot to describe the demographic history.

Po

pu

lati

on

siz

e ch

ang

es

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Prior/Model: population is auto-correlated through time

In addition to this model we also introduce a simple smoothing on

which representsour belief that effective population size is auto-correlated through time. The priordistribution we assume in all subsequent simulations and analyses is that, going back intime, each new population size is drawn from an exponential distribution with a meanequal to the previous population size:

j ~ Exp( j 1), 2 j m . (5)

In addition we introduce a scale-invariant prior (Jeffreys 1946) on the first element

f1(1)

1

1

to signify that our prior belief is invariant to changes in time scale. This

results in a following simple prior distribution on

:

f ()1

1

j 1 exp j / j 1 j2

m

. (6)

Po

pu

lati

on

siz

e ch

ang

es

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Validating the Bayesian skyline plot (1)

Simulated data: Constant population Simulated data: Exponential growth

Po

pu

lati

on

siz

e ch

ang

es

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Validating the Bayesian skyline plot (2)

Bayesian skyline (49 or 12 epochs)

0.00001

0.0001

0.001

0.01

0.1

1

10

100

0 0.002 0.004 0.006 0.008

Time (mutations)

Th

eta

Median (49)lower (49)upper (49)truthMedian (12)lower (12)upper (12)P

op

ula

tio

n s

ize

chan

ges

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Comparing Bayesian skyline plot of Dengue-4 with incidence data

1

10

100

1000

10000

100000

0 50 100 150 200Months (before November 1998)

Eff

ect

ive n

um

ber

of

infe

ctio

ns

Median

Upper

Lower

Smoothed translated incidence

Po

pu

lati

on

siz

e ch

ang

es

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Example of Bayesian skyline plot

(1920-1980) Anti-schistosomal needle-based treatment

Effective population size jumped from 300 to 100 to 10,000

Po

pu

lati

on

siz

e ch

ang

es

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Comparison to parametric model

Po

pu

lati

on

siz

e ch

ang

es

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

http://evolve.zoo.ox.ac.uk/BEAST

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Coalescent with population structure

Str

uct

ure

d p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Str

uct

ure

d p

op

ula

tio

ns

Population subdivision - two demes

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Stepping stone model of subdivision

Str

uct

ure

d p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

From Cavalli-Sforza,2001

Human migration

Str

uct

ure

d p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns

Africa Non-Africa

0.2 Mu

tatio

n ra

te =

2.5

Ra

te o

f co

mm

on

an

ce

stry =

1

Past

Present

Simplified model of human evolution

Str

uct

ure

d p

op

ula

tio

ns

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Why Bayesian?

• Probabilistic model-based inference– Can make simple statements about the probability of alternative

hypotheses given the data

• Markov chain Monte Carlo– Convenient computational technique– Allows for complex models: “if you can simulate you can sample”

• Incorporates prior probabilities – P(|D) P(D| )P()– Convenient means of assessing alternative sets of assumptions– Allows incorporation of independent sources of information

• Easy to include sources of uncertainty– Don’t need to assume perfect knowledge of tree (for example)– Can treat the tree and a nuisance parameter and focus on parameters

of interest (strength of selection, mutation rate, growth rate, etc)

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Conclusions & cautionary remarks

• Bayesian MCMC has advantages– a useful tool for exploring prior hypotheses– Good for assessing levels of uncertainty– Complex models can be investigated on practical datasets

• Bayesian MCMC has disadvantages– Diagnostics are difficult, and it is essentially impossible to

guarantee correctness– Model comparison can be difficult– Requires large programs that are difficult to optimize and debug.

Th

e C

oal

esce

nt

and

Mea

sura

bly

Evo

lvin

g P

op

ula

tio

ns Conclusions & cautionary remarks (2)

• Population genetics has advantages– provides a framework for objective analysis of genetic data– Allows interpretation of genetic data in terms of biological

properties of virus– Can be extended to include selection, recombination et cetera

• Population genetics has disadvantages– Models are still too simple– Assumptions are too strong– Extending to complex models that include changing selection

pressures and recombination are possible in MCMC but still very difficult!