
Approximate Bayesian Computation

Sarah Filippi

Department of Statistics, University of Oxford

09/02/2016

Parameter inference in a signalling pathway

[Figure: the MAPK signalling cascade. A growth factor binds its receptor at the cell membrane and activates RAS, then RAF (MAP kinase kinase kinase), MEK (MAP kinase kinase) and ERK (MAP kinase) through distributive phosphorylation and dephosphorylation, leading to transcription, proliferation, survival and differentiation. Alongside: time-course measurements of pERK and pMEK.]

Model via differential equations: 9 species, 16 parameters, 2 initial conditions.


Notation

We have:
• an observed dataset, denoted by x∗
• a parametric model, either deterministic or stochastic. The model is the data-generating process: given a value of the parameter θ ∈ Rd, the output of the system is

X ∼ f(·|θ)

The likelihood function: θ ↦ f(x∗|θ)

"All models are wrong but some are useful." (George E. P. Box)

Aim:
• parameter inference: determine how likely each value of θ is to explain the data
• model selection: rank candidate models in terms of their ability to explain the data

Bayesian methods

In the Bayesian framework, we combine
• the information brought by the data x∗, via the likelihood f(x∗|θ)
• some a priori information, specified in a prior distribution π(θ)

This information is summarized in the posterior distribution, which is derived using Bayes' formula:

p(θ|x∗) = f(x∗|θ) π(θ) / ∫ f(x∗|θ′) π(θ′) dθ′

[Figure: likelihood × prior = posterior, each plotted as a function of θ.]
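To make the formula concrete, here is a minimal numerical sketch (not part of the slides): on a grid, the posterior is just the pointwise product of likelihood and prior, renormalised. The Gaussian model and prior are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative setup: data from a N(theta, 1) model, N(0, 2^2) prior on theta.
x_star = np.array([1.2, 0.7, 1.9, 1.1])      # observed dataset x*
theta = np.linspace(-5, 5, 1001)             # grid over the parameter
dtheta = theta[1] - theta[0]

# Likelihood f(x*|theta) on the grid (product over the observations).
likelihood = np.prod(norm.pdf(x_star[:, None], loc=theta, scale=1.0), axis=0)
prior = norm.pdf(theta, loc=0.0, scale=2.0)  # prior pi(theta)

# Bayes' formula: posterior proportional to likelihood x prior; normalise numerically.
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * dtheta)
print("posterior mean ~", (theta * posterior).sum() * dtheta)
```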

Sample from the posterior distribution

• In general, there is no closed-form solution.
• Use computer simulation and Monte Carlo techniques to sample from the posterior distribution.
• Typical likelihood-based approaches:
  ◦ Importance sampling
  ◦ Markov chain Monte Carlo (MCMC)
  ◦ Population Monte Carlo
  ◦ Sequential Monte Carlo (SMC) samplers
  ◦ ...

What if we cannot compute the likelihood?

In many applications of interest, evaluating the likelihood is either impossible or computationally too expensive:
• stochastic differential equations
• population genetics
• ...

Use so-called likelihood-free methods such as Approximate Bayesian Computation (ABC).

Approximate Bayesian Computation

• Principle:
  ◦ sample parameters θ from the prior distribution
  ◦ select the values of θ such that the simulated data are close to the observed data.

• More formally: given a small value of ε > 0,

p(θ|x∗) = f(x∗|θ) π(θ) / p(x∗) ≈ pε(θ|x∗) = ∫ f(x|θ) π(θ) 1(∆(x, x∗) ≤ ε) dx / p(x∗)
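A minimal rejection-ABC sketch of this principle (the Gaussian toy model, the mean-based distance and the tolerance are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = rng.normal(2.0, 1.0, size=50)        # pretend this is the observed data x*

def simulate(theta, rng):
    """Data-generating process f(.|theta): here, 50 draws from N(theta, 1)."""
    return rng.normal(theta, 1.0, size=50)

def distance(x, x_star):
    """Delta(x, x*): distance between summary statistics (here, sample means)."""
    return abs(x.mean() - x_star.mean())

eps = 0.1                                      # tolerance
accepted = []
while len(accepted) < 1000:
    theta = rng.uniform(-10, 10)               # sample theta from the prior
    x = simulate(theta, rng)                   # simulate data given theta
    if distance(x, x_star) <= eps:             # keep theta if simulation is close to data
        accepted.append(theta)

print("approximate posterior mean:", np.mean(accepted))
```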

Approximate Bayesian Computation

[Schematic: a parameter (θ1, θ2) is drawn and the model is run to produce a simulation x(θ); the simulation is compared to the data x∗ through d = ∆(x(θ), x∗). Reject θ if d > ε; accept θ if d ≤ ε.]

Acknowledgement to Prof. Michael Stumpf (Theoretical Systems Biology group, Imperial College London) for the slides.
Toni & Stumpf, Bioinformatics (2010)


Approximate Bayesian Computation

• Very sensitive to the choice of ε:
  ◦ ε should be close to 0 to obtain samples from a distribution close to the posterior distribution:

lim_{ε→0} pε(θ|x∗) = p(θ|x∗)

  ◦ if ε is too small, the acceptance rate is very low.
• More efficient variants of ABC:
  ◦ regression-adjusted ABC (Tallmon et al., 2004; Fagundes et al., 2007; Blum and François, 2010)
  ◦ Markov chain Monte Carlo ABC schemes (Marjoram et al., 2003; Ratmann et al., 2007)
  ◦ ABC implementing some variant of sequential importance sampling (SIS) or sequential Monte Carlo (SMC) (Sisson et al., 2007; Beaumont et al., 2009; Toni et al., 2009; Del Moral et al., 2011)

Sequential ABC – a schematic representation

Define a set of intermediate distributions πt, t = 1, ..., T, associated with a decreasing sequence of thresholds ε1 > ε2 > ... > εT:

Population 1: sampled from the prior, π(θ)
...
Population t − 1: πt−1(θ | ∆(x∗, x(θ)) < εt−1)
Population t: πt(θ | ∆(x∗, x(θ)) < εt)
...
Population T: πT(θ | ∆(x∗, x(θ)) < εT)

Acknowledgement to Prof. Michael Stumpf (Theoretical Systems Biology group, Imperial College London) for the slides.
Toni et al., J. Roy. Soc. Interface (2009); Toni & Stumpf, Bioinformatics (2010).

Sequential ABC – the approach (1)

In the rest of this lecture, we focus on the approach proposed by Beaumont et al. (2009) and Toni et al. (2009).

The algorithms are related to the population Monte Carlo algorithm (Cappé et al., 2004) and invoke importance sampling arguments:

• if a sample θ(t) = (θ(1,t), ..., θ(N,t)) is produced by simulating each θ(i,t) ∼ qit independently of one another, conditionally on the past samples,

• and if each point θ(i,t) is associated with an importance weight

ω(i,t) ∝ π(θ(i,t)) / qit(θ(i,t))

then (1/N) ∑_{i=1}^{N} ω(i,t) h(θ(i,t)) is an unbiased estimator of ∫ h(θ) π(θ) dθ.
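A quick numerical check of this identity (a sketch; the target, the proposal and the test function h are arbitrary choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Target pi = N(0, 1); proposal q = N(0, 2), which has heavier tails than pi.
N = 100_000
theta = rng.normal(0.0, 2.0, size=N)                        # theta^(i) ~ q
w = norm.pdf(theta, 0.0, 1.0) / norm.pdf(theta, 0.0, 2.0)   # w^(i) = pi/q

h = theta**2                                                # h(theta) = theta^2
print((w * h).mean())   # ~ integral of h(theta) pi(theta) dtheta = 1
```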

Sequential ABC – the approach (2)

In the sequential ABC algorithm, at each population t, we consider a sample θ(t) = (θ(1,t), ..., θ(N,t)).

Each θ(i,t) is sampled independently as follows:
• θ is sampled from the previous population {θ(i,t−1), ω(i,t−1)}1≤i≤N,
• then it is perturbed using a perturbation kernel, θ′ ∼ Kt(·|θ).

Therefore the proposal distribution is

qt(·) = ∑_{j=1}^{N} ω(j,t−1) Kt(·|θ(j,t−1))

and the importance weight associated with θ′ is

ω ∝ π(θ′) / ∑_{j=1}^{N} ω(j,t−1) Kt(θ′|θ(j,t−1)).

Sequential ABC – the algorithm

1: for t = 1, ..., T do
2:   i ← 1
3:   repeat
4:     if t = 1 then
5:       sample θ′ from π(θ)
6:     else
7:       sample θ from the previous population {θ(i,t−1), ω(i,t−1)}1≤i≤N
8:       perturb it with Kt(·|θ) to obtain θ′ such that π(θ′) > 0
9:     end if
10:    sample x from f(·|θ′)
11:    if ∆(x∗, x) ≤ εt then
12:      θ(i,t) ← θ′; i ← i + 1
13:    end if
14:  until i = N + 1
15:  calculate the weights: ω(i,1) = 1/N and, for t > 1,
       ω(i,t) ∝ π(θ(i,t)) / ∑_{j=1}^{N} ω(j,t−1) Kt(θ(i,t)|θ(j,t−1))
16: end for

Toni et al., 2009; Beaumont et al., 2009
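Putting the pieces together, a compact ABC-SMC sketch of this algorithm, under simplifying assumptions (the same toy Gaussian model as the rejection sketch above, a hand-picked threshold schedule, and a component-wise Gaussian kernel whose scale of twice the weighted variance is one common choice, not the only one):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x_star = rng.normal(2.0, 1.0, size=50)            # toy observed data x*

def prior_pdf(th):
    """pi(theta): uniform on [-10, 10]; works on scalars and arrays."""
    return np.where((th >= -10.0) & (th <= 10.0), 1.0 / 20.0, 0.0)

def simulate(th):
    """Data-generating process f(.|theta): 50 draws from N(theta, 1)."""
    return rng.normal(th, 1.0, size=50)

def distance(x):
    """Delta(x, x*): absolute difference of sample means."""
    return abs(x.mean() - x_star.mean())

N = 500
epsilons = [2.0, 1.0, 0.5, 0.2, 0.1]              # decreasing schedule eps_1 > ... > eps_T

thetas = weights = None
for t, eps in enumerate(epsilons):
    if t > 0:
        # Component-wise Gaussian kernel; twice the weighted variance as the scale.
        sigma = np.sqrt(2.0 * np.cov(thetas, aweights=weights))
    new_thetas = np.empty(N)
    i = 0
    while i < N:
        if t == 0:
            th = rng.uniform(-10.0, 10.0)         # line 5: sample from the prior
        else:
            th = rng.choice(thetas, p=weights)    # line 7: resample previous population
            th = rng.normal(th, sigma)            # line 8: perturb with K_t
            if prior_pdf(th) == 0:                # require pi(theta') > 0
                continue
        if distance(simulate(th)) <= eps:         # lines 10-13: accept if close enough
            new_thetas[i] = th
            i += 1
    if t == 0:
        new_weights = np.full(N, 1.0 / N)         # omega^(i,1) = 1/N
    else:
        # line 15: omega ~ pi / sum_j omega^(j,t-1) K_t(theta^(i,t)|theta^(j,t-1))
        denom = (weights * norm.pdf(new_thetas[:, None], thetas, sigma)).sum(axis=1)
        new_weights = prior_pdf(new_thetas) / denom
        new_weights /= new_weights.sum()
    thetas, weights = new_thetas, new_weights

print("weighted posterior mean ~", np.average(thetas, weights=weights))
```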

Impact on computational efficiency:
• perturbation kernels {Kt}t
• threshold schedule {εt}t

Choice of the perturbation kernel

Use an adaptive perturbation kernel.

• Local perturbation kernel:
  ◦ hardly moves the particles
  ◦ high acceptance rate if successive values of ε are close enough
• Widely spread kernel:
  ◦ exploration of the parameter space
  ◦ low acceptance rate

Balance between exploring the parameter space and ensuring a high acceptance rate.

Properties of optimal kernel

1. From sequential importance sampling theory: similarity between the two joint distributions of (θ(t−1), θ(t)) where θ(t−1) ∼ pεt−1 and
  ◦ θ(t) is constructed by perturbing θ(t−1) and accepting according to threshold εt, or
  ◦ θ(t) ∼ pεt

2. Computational efficiency: high acceptance rate.

3. Theoretical requirements for convergence:
  ◦ a kernel with larger support than the target distribution ⇒ guarantees asymptotic unbiasedness of the empirical mean
  ◦ a kernel that vanishes slowly enough in the tails of the target ⇒ guarantees finite variance of the estimator

Derivation of optimal kernel

Criterion 1: resemblance between the two distributions

qεt−1,εt(θ(t−1), θ(t)|x∗) = pεt−1(θ(t−1)|x∗) Kt(θ(t)|θ(t−1)) ∫ f(x|θ(t)) 1(∆(x∗, x) ≤ εt) dx / α(Kt, εt−1, εt, x∗)

and

q∗εt−1,εt(θ(t−1), θ(t)|x∗) = pεt−1(θ(t−1)|x∗) pεt(θ(t)|x∗),

measured in terms of the KL divergence (Douc et al., 2007; Cappé et al., 2008; Beaumont et al., 2009):

KL(qεt−1,εt ; q∗εt−1,εt) = −Q(Kt, εt−1, εt, x∗) + log α(Kt, εt−1, εt, x∗) + C(εt−1, εt, x∗)

where

Q(Kt, εt−1, εt, x∗) = ∫∫ pεt−1(θ(t−1)|x∗) pεt(θ(t)|x∗) log Kt(θ(t)|θ(t−1)) dθ(t−1) dθ(t)

Derivation of optimal kernel

Criterion 1: minimise the KL divergence

KL(qεt−1,εt ; q∗εt−1,εt) = −Q(Kt, εt−1, εt, x∗) + log α(Kt, εt−1, εt, x∗) + C(εt−1, εt, x∗)

Criterion 2: maximise the acceptance rate α(Kt, εt−1, εt, x∗)

Method
Select the kernel Kt which maximises Q(Kt, εt−1, εt, x∗); this is equivalent to maximising −KL(qεt−1,εt ; q∗εt−1,εt) + log α(Kt, εt−1, εt, x∗).

Remark:

Q(Kt, εt−1, εt, x∗) = ∫∫ pεt−1(θ(t−1)|x∗) pεt(θ(t)|x∗) log Kt(θ(t)|θ(t−1)) dθ(t−1) dθ(t)

can be maximised easily for some families of kernels.
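As a worked illustration of the remark (a sketch consistent with the optimality results of Filippi et al., 2013, not reproduced from the slides): for a Gaussian random-walk kernel Kt(θ′|θ) = N(θ′; θ, Σ), Q depends on Σ only through an expected log-density, and the maximiser is available in closed form.

```latex
\[
Q(\Sigma) = \mathbb{E}_{q^*}\!\bigl[\log \mathcal{N}(\theta^{(t)};\,\theta^{(t-1)},\,\Sigma)\bigr]
= -\tfrac{1}{2}\log\det(2\pi\Sigma)
  -\tfrac{1}{2}\,\mathbb{E}_{q^*}\!\bigl[\Delta\theta^{\top}\Sigma^{-1}\Delta\theta\bigr],
\qquad \Delta\theta = \theta^{(t)} - \theta^{(t-1)},
\]
\[
\text{which is maximised at}\quad
\Sigma^{\star} = \mathbb{E}_{q^*}\!\bigl[\Delta\theta\,\Delta\theta^{\top}\bigr].
\]
```

The weighted empirical version of Σ⋆ is exactly the covariance matrix used for the multivariate normal kernel on the next slide.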


Gaussian random walk kernels

• Component-wise kernel: diagonal covariance matrix Σ(t) (Beaumont et al., 2009)

• Multivariate normal kernel:

Σ(t) ≈ ∑_{i=1}^{N} ∑_{k=1}^{N0} ω(i,t−1) ω(k) (θ(k) − θ(i,t−1))(θ(k) − θ(i,t−1))^T

where {θ(k)}1≤k≤N0 = {θ(i,t−1) s.t. ∆(x∗, x(i,t−1)) ≤ εt, 1 ≤ i ≤ N}

[Figure: populations and perturbation kernels in the (θ1, θ2) plane.]

• Local multivariate normal kernels: use a different kernel for each particle θ(t−1).
  ◦ Nearest neighbours: Σ(t)_{θ(t−1),M} based on the M nearest neighbours
  ◦ Optimal local covariance matrix (OLCM):

    Σ(t)_{θ(t−1)} ≈ ∑_{k=1}^{N0} ω(k) (θ(k) − θ(t−1))(θ(k) − θ(t−1))^T

  ◦ Based on the Fisher information matrix (FIM), for models defined by ODEs or SDEs
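A sketch of these two covariance computations (the particle and weight arrays are assumed given; the θ(k) are the previous-population particles already within εt, as defined above):

```python
import numpy as np

def mvn_whole_population_cov(thetas_prev, w_prev, thetas_tilde, w_tilde):
    """Sigma(t) ~ sum_i sum_k w(i,t-1) w(k) (theta(k)-theta(i,t-1))(...)^T."""
    diff = thetas_tilde[:, None, :] - thetas_prev[None, :, :]   # (N0, N, d)
    wk_wi = w_tilde[:, None] * w_prev[None, :]                  # (N0, N)
    return np.einsum("ki,kid,kie->de", wk_wi, diff, diff)

def olcm_cov(theta, thetas_tilde, w_tilde):
    """Optimal local covariance matrix for a single particle theta(t-1)."""
    diff = thetas_tilde - theta                                 # (N0, d)
    return np.einsum("k,kd,ke->de", w_tilde, diff, diff)

# Hypothetical usage with a 2-d parameter:
rng = np.random.default_rng(3)
prev = rng.normal(size=(100, 2)); w = np.full(100, 1 / 100)
tilde = prev[:40]; wt = np.full(40, 1 / 40)                     # particles within eps_t
Sigma_t = mvn_whole_population_cov(prev, w, tilde, wt)
Sigma_local = olcm_cov(prev[0], tilde, wt)
```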

Computational efficiency of perturbation kernels

[Figure: number of simulations and running time for the Repressilator system and for the Hes1 transcriptional regulation model, comparing the uniform kernel, the component-wise normal kernel, and multivariate normal kernels based on the whole population, the M nearest neighbours, the OLCM, the FIM scaled on the whole population, and the FIM scaled on the M neighbours.]

Computational cost of perturbation kernels

• simulating the data dominates
• computational cost of implementing the perturbation kernel:

  Component-wise normal                                        O(dN²)
  Multivariate normal based on the whole previous population   O(d²N²)
  Multivariate normal based on the M nearest neighbours        O((d + M)N² + d²M²N)
  Multivariate normal with OLCM                                O(d²N²)
  Multivariate normal based on the FIM                         O(dCN + d²N²)
  (normalized with the entire population)

N = population size; d = parameter dimension

Recommendation
Use multivariate normal kernels with the OLCM:
• highest acceptance rate in our examples
• relatively easy to implement at acceptable computational cost

Choice of the threshold schedule

(Recall the sequential ABC algorithm: at each population t, a candidate θ is accepted only if ∆(x∗, x) ≤ εt.)

• final value of ε close to 0
• minimise the number of simulations, i.e.
  ◦ minimise the number of populations
  ◦ maximise the acceptance rate per population

Choice of the threshold schedule

Previous approaches:
• trial and error
• adaptive methods based on a quantile (Beaumont et al., 2009; Del Moral et al., 2008; Drovandi et al., 2011; Lenormand et al., 2012)
• ...

The quantile-based approach:

εt = α-quantile of the distances {∆(x(i,t−1), x∗)}1≤i≤N

Which value of α should we choose?

An adaptive choice of α: Sedki et al., 2013.
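The quantile rule itself is one line (a sketch; the distances are those recorded for the previous population's accepted particles, and the values of α and the synthetic distances are arbitrary examples):

```python
import numpy as np

def next_epsilon(distances_prev, alpha=0.5):
    """eps_t = alpha-quantile of {Delta(x^(i,t-1), x*)}_{1<=i<=N}."""
    return np.quantile(distances_prev, alpha)

# Hypothetical usage with synthetic distances from the previous population:
rng = np.random.default_rng(4)
distances = rng.gamma(2.0, 1.0, size=500)
print(next_epsilon(distances, alpha=0.3))
```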

Drawback of the quantile approach

[Figure: distance between g(θ) and g(θ∗) as a function of θ, with θ∗ marked near 0.05; the successive populations concentrate around a local minimum of the distance away from θ∗, so the algorithm gets stuck in a local minimum.]

• For large α, the algorithm ‘fails‘.
• The optimal (or safe) choice of α depends on the data, the model and the prior range.

The threshold-acceptance rate curve

Idea:
• use the threshold-acceptance rate curve α(ε)
• avoid regions with an excessively high acceptance rate

[Figure: distance between g(θ) and g(θ∗) as a function of θ for populations 1, 2 and 3, together with the acceptance-rate curve α(ε).]

Exploit the threshold-acceptance rate curve

[Figure: acceptance rate as a function of ε. Convex shapes are a symptom of sampling from local optima.]

Proposed method:
• set ε∗ = argmaxε ∂²αt/∂ε²
• if α(ε∗) is not too small, εt = ε∗
• otherwise, εt = argminε ∆((ε/εt−1, αt(ε)/αt(εt−1)), (0, 1))
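A sketch of this selection rule, assuming the curve αt(ε) has already been estimated on a grid; the fallback is the distance to the point (0, 1) from the slide, and the "not too small" cutoff alpha_min is an illustrative assumption:

```python
import numpy as np

def choose_epsilon(eps_grid, alpha_curve, eps_prev, alpha_min=0.05):
    """Pick eps_t from an estimated threshold-acceptance rate curve alpha_t(eps)."""
    # eps* = argmax of the second derivative of alpha_t with respect to eps.
    d2 = np.gradient(np.gradient(alpha_curve, eps_grid), eps_grid)
    eps_star = eps_grid[np.argmax(d2)]
    alpha_star = np.interp(eps_star, eps_grid, alpha_curve)
    if alpha_star >= alpha_min:                   # acceptance rate not too small
        return eps_star
    # Fallback: point of the normalised curve closest to (0, 1).
    alpha_prev = np.interp(eps_prev, eps_grid, alpha_curve)
    dist = np.hypot(eps_grid / eps_prev, alpha_curve / alpha_prev - 1.0)
    return eps_grid[np.argmin(dist)]

# Hypothetical usage with a synthetic sigmoid-shaped curve:
eps = np.linspace(0.01, 2.0, 200)
alpha = 1.0 / (1.0 + np.exp(-(eps - 1.0) * 6.0))
print(choose_epsilon(eps, alpha, eps_prev=2.0))
```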

Estimating the threshold-acceptance rate curve

• Acceptance rate:

αt(ε) = ∫∫ qt(θ) f(x|θ) 1(∆(x, x∗) ≤ ε) dx dθ

where qt(θ) = ∑_{i=1}^{N} ω(i,t−1) Kt(θ|θ(i,t−1))

• No closed-form expression, and difficult to approximate.
• For deterministic models, use the Unscented Transform (UT).

[Figure: the transform maps points from parameter space to output space.]


Adaptive method for threshold choice

At each population t:
1. generate a population of perturbed particles,
2. fit a Gaussian mixture model to the perturbed population,
3. estimate

pt(x) = ∫ qt(θ) f(x|θ) dθ

using the unscented transform independently for each component of the Gaussian mixture,
4. estimate

αt(ε) = ∫ pt(x) 1(∆(x, x∗) ≤ ε) dx

Remark
The threshold-acceptance rate curve depends on the perturbation kernel.
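For stochastic models, where the unscented-transform step does not apply directly, the same curve can instead be estimated by brute-force simulation. This is a plain Monte Carlo sketch, not the UT-based method of the slides, and it reuses the toy model from the earlier sketches:

```python
import numpy as np

def acceptance_rate_curve(sample_q, simulate, distance, eps_grid, n=2000):
    """Monte Carlo estimate of alpha_t(eps) = P(Delta(x, x*) <= eps), theta ~ q_t."""
    d = np.array([distance(simulate(sample_q())) for _ in range(n)])
    return np.array([(d <= e).mean() for e in eps_grid])

# Hypothetical usage:
rng = np.random.default_rng(5)
x_star = rng.normal(2.0, 1.0, size=50)
sample_q = lambda: rng.normal(2.0, 0.5)                # stand-in for the proposal q_t
simulate = lambda th: rng.normal(th, 1.0, size=50)     # f(.|theta)
distance = lambda x: abs(x.mean() - x_star.mean())     # Delta(x, x*)
curve = acceptance_rate_curve(sample_q, simulate, distance, np.linspace(0, 1, 50))
```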

Computational efficiency of the adaptive method

The Repressilator system:

[Figure: results for the Repressilator system.]

Computational efficiency of the adaptive method

Chemical reaction system illustrating the "local minimum problem":

[Figure: number of simulations per threshold schedule; one schedule is marked as a failure.]

Take-home messages

Approximate Bayesian Computation
To be used for parameter inference when the likelihood cannot be computed but it is possible to simulate from the model.

Sequential approaches
Computationally more efficient.

Perturbation kernel
Gaussian random walk with the optimal (local) covariance matrix:
• low computational cost
• high acceptance rate

Threshold schedule
• the quantile approach can lead to the wrong posterior
• exploit the threshold-acceptance rate curve

Filippi et al., 2013; Silk, Filippi and Stumpf, 2013.

References

Beaumont, M., Cornuet, J.-M., Marin, J.-M., and Robert, C. (2009). Adaptive approximate Bayesian computation. Biometrika.

Blum, M. and François, O. (2010). Non-linear regression models for approximate Bayesian computation. Statist. Comput.

Cappé, O., Guillin, A., Marin, J.-M., and Robert, C. (2004). Population Monte Carlo. J. Comput. Graph. Statist.

Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. JRSS Series B.

Fagundes, N., et al. (2007). Statistical evaluation of alternative models of human evolution. PNAS.

Filippi, S., Barnes, C., Cornebise, J., and Stumpf, M. (2013). On optimality of kernels for approximate Bayesian computation using sequential Monte Carlo. SAGMB.

Marjoram, P., Molitor, J., Plagnol, V., and Tavaré, S. (2003). Markov chain Monte Carlo without likelihoods. PNAS.

Ratmann, O., et al. (2007). Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum. PLoS Comput Biol.

Silk, D., Filippi, S., and Stumpf, M. (2013). Optimizing threshold-schedules for sequential approximate Bayesian computation: applications to molecular systems. SAGMB.

Sisson, S. A., Fan, Y., and Tanaka, M. (2007). Sequential Monte Carlo without likelihoods. PNAS.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A., and Stumpf, M. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface.

Other references

Blum, M. (2010). Approximate Bayesian computation: a non-parametric perspective. J. Amer. Statist. Assoc.

Fearnhead, P. and Prangle, D. (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. JRSS Series B (with discussion).

Marin, J.-M., Pudlo, P., Robert, C., and Ryder, R. (2012). Approximate Bayesian computational methods. Statistics and Computing.

Robert, C., Cornuet, J.-M., Marin, J.-M., and Pillai, N. (2011). Lack of confidence in ABC model choice. PNAS.

Rubin, D. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist.

Wilkinson, R. (2013). Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. SAGMB.