phycas lightning talk ievobio 2011


Upload: mark-holder

Post on 10-Jul-2015


Page 1: phycas lightning talk iEvoBio 2011

Estimating marginal likelihoods for phylogenetic models in Phycas

Phycas is a software package for Bayesian phylogenetic inference (with support for ML searching planned).

Paul Lewis is the primary author. Mark Holder and Dave Swofford are co-authors.

Written in C++ and Python (using boost-python to create Python bindings to C++ code).

Compiled versions and manual: http://www.phycas.org

Source: https://github.com/mtholder/Phycas

Page 2: phycas lightning talk iEvoBio 2011

Bayesian model selection

• Use model averaging if we can “jump” between models, or
• Compare their marginal likelihoods.

The Bayes factor between two models:

$$B_{10} = \frac{\Pr(D \mid M_1)}{\Pr(D \mid M_0)}$$

$$\Pr(D \mid M_1) = \int \Pr(D \mid \theta, M_1)\,\Pr(\theta)\,d\theta$$

where θ is the set of parameters in the model.

Page 3: phycas lightning talk iEvoBio 2011

Two simple estimators of the marginal likelihood

1. mean of likelihood evaluated at parameter values randomly drawn from the prior.

2. harmonic mean of likelihood evaluated at parameter values randomly drawn from the posterior (Newton and Raftery, 1994).
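As a rough illustration of the two estimators, the sketch below applies them to a toy model that is not from the talk: a single observation y drawn from Normal(θ, σ²) with a Normal(0, τ²) prior on θ. This conjugate setup is an assumption chosen only because the exact marginal likelihood and the posterior are known in closed form. The function names `normal_pdf` and `toy_estimates` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def toy_estimates(y=0.5, sigma=1.0, tau=1.0, n_draws=20000, seed=1):
    """Return (exact, prior_mean_estimate, harmonic_mean_estimate).

    Assumed toy model (not from the talk):
    y ~ Normal(theta, sigma^2), theta ~ Normal(0, tau^2).
    """
    rng = random.Random(seed)

    # Exact marginal likelihood: y ~ Normal(0, sigma^2 + tau^2).
    exact = normal_pdf(y, 0.0, math.sqrt(sigma ** 2 + tau ** 2))

    # Estimator 1: average the likelihood over draws from the prior.
    prior_draws = [rng.gauss(0.0, tau) for _ in range(n_draws)]
    prior_mean = sum(normal_pdf(y, t, sigma) for t in prior_draws) / n_draws

    # Estimator 2: harmonic mean of the likelihood over posterior draws.
    # The posterior is Normal(post_mean, post_var) for this conjugate model.
    post_var = sigma ** 2 * tau ** 2 / (sigma ** 2 + tau ** 2)
    post_mean = y * tau ** 2 / (sigma ** 2 + tau ** 2)
    post_sd = math.sqrt(post_var)
    post_draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
    harmonic = n_draws / sum(1.0 / normal_pdf(y, t, sigma) for t in post_draws)

    return exact, prior_mean, harmonic


if __name__ == "__main__":
    exact, prior_mean, harmonic = toy_estimates()
    print("exact:", exact)
    print("prior mean estimate:", prior_mean)
    print("harmonic mean estimate:", harmonic)
```

In this particular setup the prior-mean estimator is well behaved because the prior is not much vaguer than the likelihood. The harmonic-mean estimator has infinite variance here, since the ratio of prior to likelihood is not integrable when τ ≥ σ, so its output should not be expected to settle down as `n_draws` grows.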

Page 4: phycas lightning talk iEvoBio 2011

[Figure: Sharp posterior (black) and prior (red). Horizontal axis: x; vertical axis: density.]

Page 5: phycas lightning talk iEvoBio 2011

From Dr. Radford Neal’s blog

The Harmonic Mean of the Likelihood: Worst Monte Carlo Method Ever

“The total unsuitability of the harmonic mean estimator should have been apparent within an hour of its discovery.”

Page 6: phycas lightning talk iEvoBio 2011

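Steppingstone sampling (Xie et al., 2010; Fan et al., 2010) blends two distributions:

• the posterior, Pr(D|θ,M1) Pr(θ,M1)
• a tractable reference distribution, π(θ)

$$p_\beta(\theta \mid D, M_1) = \frac{[\Pr(D \mid \theta, M_1)\,\Pr(\theta, M_1)]^{\beta}\,[\pi(\theta)]^{1-\beta}}{c_\beta}$$

where c_β is the normalizing constant of the blended distribution. Because π(θ) is a proper density,

$$c_0 = 1.0$$

$$\Pr(D \mid M_1) = \frac{c_1}{c_0} = \left(\frac{c_1}{c_{0.38}}\right)\left(\frac{c_{0.38}}{c_{0.1}}\right)\left(\frac{c_{0.1}}{c_{0.01}}\right)\left(\frac{c_{0.01}}{c_0}\right) = \left(\frac{c_1}{\cancel{c_{0.38}}}\right)\left(\frac{\cancel{c_{0.38}}}{\cancel{c_{0.1}}}\right)\left(\frac{\cancel{c_{0.1}}}{\cancel{c_{0.01}}}\right)\left(\frac{\cancel{c_{0.01}}}{c_0}\right)$$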
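The telescoping product can be sketched numerically on a toy problem. The example below is not Phycas code. It assumes a conjugate normal model (one observation y from Normal(θ, σ²), prior Normal(0, τ²)) so that every blended distribution is itself normal and can be sampled exactly, where real phylogenetic use would need MCMC. The reference distribution is assumed to be a normal centered on the posterior with 1.5 times its standard deviation, the β values are the ones shown on the slide, and `steppingstone_toy` is a hypothetical name. Each ratio c_β(k) / c_β(k−1) is estimated as the average of [Pr(D|θ) Pr(θ) / π(θ)] raised to the power β(k) − β(k−1), over draws from the blended distribution at β(k−1).

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of Normal(mean, sd^2) at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def steppingstone_toy(y=0.5, sigma=1.0, tau=1.0,
                      betas=(0.0, 0.01, 0.1, 0.38, 1.0),
                      n_draws=5000, seed=1):
    """Return (estimated, exact) log marginal likelihood for the toy model."""
    rng = random.Random(seed)

    # Closed-form posterior of the assumed conjugate normal model.
    post_var = sigma ** 2 * tau ** 2 / (sigma ** 2 + tau ** 2)
    post_mean = y * tau ** 2 / (sigma ** 2 + tau ** 2)

    # Assumed reference distribution: centered on the posterior, a bit vaguer.
    ref_mean, ref_sd = post_mean, 1.5 * math.sqrt(post_var)

    def log_q(theta):
        # Unnormalized posterior: likelihood times prior.
        return log_normal_pdf(y, theta, sigma) + log_normal_pdf(theta, 0.0, tau)

    def log_ref(theta):
        return log_normal_pdf(theta, ref_mean, ref_sd)

    log_ml = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # q^b * ref^(1-b) is normal: precisions and precision-weighted
        # means combine linearly in b.
        prec = b_prev / post_var + (1.0 - b_prev) / ref_sd ** 2
        mean = (b_prev * post_mean / post_var
                + (1.0 - b_prev) * ref_mean / ref_sd ** 2) / prec
        sd = math.sqrt(1.0 / prec)

        # Log importance weights for this stone.
        log_w = []
        for _ in range(n_draws):
            theta = rng.gauss(mean, sd)
            log_w.append((b_next - b_prev) * (log_q(theta) - log_ref(theta)))

        # Log of the mean weight, computed with the log-sum-exp trick.
        top = max(log_w)
        log_ml += top + math.log(sum(math.exp(w - top) for w in log_w) / n_draws)

    exact = log_normal_pdf(y, 0.0, math.sqrt(sigma ** 2 + tau ** 2))
    return log_ml, exact


if __name__ == "__main__":
    estimate, exact = steppingstone_toy()
    print("steppingstone log marginal likelihood:", estimate)
    print("exact log marginal likelihood:", exact)
```

Because the reference is only slightly vaguer than the posterior, the weights at each step are bounded and the product of ratios is stable, in contrast to the harmonic mean.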

Page 7: phycas lightning talk iEvoBio 2011

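$$\frac{c_1}{c_0} = \left(\frac{c_1}{c_{0.38}}\right)\left(\frac{c_{0.38}}{c_{0.1}}\right)\left(\frac{c_{0.1}}{c_{0.01}}\right)\left(\frac{c_{0.01}}{c_0}\right)$$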

Photo by Johan Nobel http://www.flickr.com/photos/43147325@N08/4326713557/ downloaded from Wikimedia

Page 8: phycas lightning talk iEvoBio 2011

Typically, steppingstone sampling uses a series of slightly vaguer distributions to estimate the ratios of normalizing constants:

[Figure: Steppingstone densities. Horizontal axis: x; vertical axis: density.]

Page 9: phycas lightning talk iEvoBio 2011

A reference distribution over tree topologies

We must be able to:

1. calculate the probability for any tree topology,

2. center the distribution on the posterior,

3. control the “vagueness” of the distribution,

4. efficiently sample trees from the distribution.

Page 10: phycas lightning talk iEvoBio 2011

Tree-Centered Independent-Split-Probability (TCISP) distribution

Argument: a tree with probabilities for each split.

Result: a probability distribution over all tree topologies.

Page 11: phycas lightning talk iEvoBio 2011

[Figure: tree on taxa A, C, D, E, F, G, H, I, J, K, L, with internal branches labeled 0.9, 0.8, 0.6, 0.5, 0.4, 0.8, 0.3, 0.9.]

Input: a focal tree to center the distribution, with split probabilities

Page 12: phycas lightning talk iEvoBio 2011

[Figure: the focal tree on taxa A, C, D, E, F, G, H, I, J, K, L, with branches colored blue or red.]

We will keep the blue branches and avoid the red ones
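One plausible reading of the keep/avoid step, based only on the name "Independent-Split-Probability", is that each split of the focal tree is kept or avoided independently with its own probability. The sketch below assumes that reading and is not taken from the Phycas source. The split labels `s1` to `s8` are hypothetical placeholders, the probabilities are the eight values shown on the focal-tree slide, and `sample_split_pattern` is a hypothetical name. Choosing a resolution that avoids the rejected splits, and counting such resolutions, is not shown.

```python
import math
import random


def sample_split_pattern(split_probs, rng):
    """Independently keep (True) or avoid (False) each split.

    Returns the pattern and the log probability of drawing it,
    assuming independent Bernoulli draws per split.
    """
    pattern = {}
    log_prob = 0.0
    for split, p in split_probs.items():
        keep = rng.random() < p
        pattern[split] = keep
        log_prob += math.log(p if keep else 1.0 - p)
    return pattern, log_prob


# Hypothetical labels; the probabilities are those shown on the slide.
SPLIT_PROBS = {
    "s1": 0.9, "s2": 0.8, "s3": 0.6, "s4": 0.5,
    "s5": 0.4, "s6": 0.8, "s7": 0.3, "s8": 0.9,
}

if __name__ == "__main__":
    pattern, log_prob = sample_split_pattern(SPLIT_PROBS, random.Random(1))
    kept = sorted(s for s, k in pattern.items() if k)
    print("kept splits:", kept)
    print("log probability of this pattern:", log_prob)
```

Raising or lowering all of the split probabilities together would be one way to control how vague the resulting distribution is, again under the independence assumption stated above.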

Page 13: phycas lightning talk iEvoBio 2011

[Figure: tree on taxa A, C, D, E, F, G, H, I, J, K, L.]

Page 14: phycas lightning talk iEvoBio 2011

[Figure: tree on taxa A, C, D, E, F, G, H, I, J, K, L.]

One of the many resolutions which avoid the red branches

Page 15: phycas lightning talk iEvoBio 2011

[Figure: two trees on taxa A, C, D, E, F, G, H, I, J, K, L.]

Page 16: phycas lightning talk iEvoBio 2011

Counting trees: Bryant and Steel (2009) provide an $O(n^5)$ algorithm for counting the number of trees that share no splits with another tree.

Multitree steppingstone:

• Works on tiny trees (≤ 6 leaves) with no tuning;
• We are working on more efficient MCMC for larger trees;
• Code on: https://github.com/mtholder/Phycas/tree/sampling_ref_dist
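For a sense of scale behind the counting problem and the "tiny trees" remark, the sketch below computes the standard count of unrooted, fully resolved topologies on n labeled leaves, (2n − 5)!!. This is not Bryant and Steel's algorithm, which counts trees sharing no splits with a given tree. It only shows how quickly the space being normalized over grows, and it assumes that "trees" here means unrooted binary topologies. The function name is hypothetical.

```python
def num_unrooted_binary_trees(n_leaves):
    """Number of unrooted binary topologies on n labeled leaves: (2n - 5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 6, 11):
        print(n, "leaves:", num_unrooted_binary_trees(n))
```

With 6 leaves there are 105 topologies, while the 11 taxa shown in the focal-tree figure already allow 34,459,425.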

Page 17: phycas lightning talk iEvoBio 2011

Conclusions

• Do not trust the harmonic mean estimator of the marginal likelihood.
• Take a look at Phycas: http://www.phycas.org (under GPLv2.0; source on GitHub).
• Watch for multitree steppingstone in a more generic, usable form soon.
• Tree-Centered Independent-Split-Probability (TCISP) distribution may be useful in other contexts: likelihood-based supertrees, or MCMC proposals.

Page 18: phycas lightning talk iEvoBio 2011

Thanks: NSF AToL and iEvoBio

See: Xie et al. (2010); Fan et al. (2010); Lartillot and Philippe (2006) for more discussion of estimating marginal likelihoods.

Page 19: phycas lightning talk iEvoBio 2011

References

Bryant, D. and Steel, M. (2009). Computing the distribution of a tree metric. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(3):420–426.

Fan, Y., Wu, R., Chen, M.-H., Kuo, L., and Lewis, P. O. (2010). Choosing among partition models in Bayesian phylogenetics. Molecular Biology and Evolution, (advance access).

Lartillot, N. and Philippe, H. (2006). Computing Bayes factors using thermodynamic integration. Systematic Biology, 55(2):195–207.

Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B (Methodological), 56(1):3–48.

Xie, W., Lewis, P. O., Fan, Y., Kuo, L., and Chen, M.-H. (2010). Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology, 60(2):150–160.