Considerate Approaches to ABC Model Selection

Michael P.H. Stumpf, Christopher Barnes, Sarah Filippi, Thomas Thorne

Theoretical Systems Biology Group

26/06/2012

Considerate Approaches to ABC Model Selection Stumpf et al. 1 of 15

Evolving Networks

[Figure: example networks for four evolution models: (a) duplication attachment; (b) duplication attachment with complementarity; (c) linear preferential attachment; (d) general scale-free.]

Inference and Model Selection

We have observed data, $D$, that was generated by some system that we seek to describe by a mathematical model. In principle we can have a model set, $\mathcal{M} = \{M_1, \ldots, M_\nu\}$, where each model $M_i$ has an associated parameter $\theta_i$.

We may know the different constituent parts of the system, $X_i$, and have measurements for some or all of them under some experimental designs, $T$.

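$$\overbrace{\Pr(M_i \mid T, D)}^{\text{Model Posterior}} = \frac{\overbrace{\Pr(D \mid M_i, T)}^{\text{Likelihood}}\;\overbrace{\pi(M_i)}^{\text{Prior}}}{\underbrace{\sum_{j=1}^{\nu} \Pr(D \mid M_j, T)\,\pi(M_j)}_{\text{Evidence}}}$$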
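As a small numerical illustration of the model posterior, the sketch below normalizes likelihood times prior over a finite model set. The function name `model_posterior` and the likelihood values are hypothetical and chosen only to make the arithmetic visible.

```python
def model_posterior(likelihoods, priors):
    """Return Pr(M_i | T, D) for each model from Pr(D | M_i, T) and pi(M_i)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # denominator: sum_j Pr(D | M_j, T) pi(M_j)
    return [j / evidence for j in joint]

# Hypothetical likelihood values for three models under a uniform model prior.
posterior = model_posterior([0.02, 0.05, 0.03], [1 / 3, 1 / 3, 1 / 3])
```

With a uniform prior the posterior is simply the normalized likelihood, so the ranking of the models follows the likelihoods.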

For complicated models and/or detailed data the likelihood evaluation can become prohibitively expensive.

Approximate Inference

We can approximate the likelihood and/or the models. The “true” model is unlikely to be in $\mathcal{M}$ anyway.

Approximate Bayesian Computation

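We can define the posterior as

$$p(\theta_i \mid x) = \frac{f(x \mid \theta_i)\,\pi(\theta_i)}{p(x)}$$

Here $f(x \mid \theta_i)$ is the likelihood, which is often hard to evaluate; consider for example

$$y = \max[0,\; y + g_1 + y \times g_2] \quad \text{with } g_1, g_2 \sim N(0, \sigma_{1/2}) \quad \text{and} \quad \frac{dy}{dt} = g(y; \theta).$$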

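But we can still simulate from the data-generating model, whence

$$p(\theta_i \mid x) = \int_X \frac{1(y = x)\, f(y \mid \theta_i)\,\pi(\theta_i)}{p(x)}\,dy \approx \int_X \frac{1(\Delta(y, x) < \varepsilon)\, f(y \mid \theta_i)\,\pi(\theta_i)}{p(x)}\,dy$$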
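The approximation above corresponds to a rejection sampler: draw a parameter from the prior, simulate data, and keep the parameter when the simulated data fall within ε of the observations. The following is a minimal sketch under assumptions that are not in the slides: a normal model with known unit variance, a uniform prior on (−4, 4), a distance Δ defined as the root-mean-square difference between sorted samples, and a deliberately loose tolerance.

```python
import math
import random

def simulate(mu, d, rng):
    """Draw d observations from the assumed data-generating model N(mu, 1)."""
    return [rng.gauss(mu, 1.0) for _ in range(d)]

def distance(y, x):
    """One possible Delta(y, x): RMS difference between the sorted samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sorted(y), sorted(x))) / len(x))

def abc_rejection(x, n_sims, eps, rng):
    accepted = []
    for _ in range(n_sims):
        mu = rng.uniform(-4.0, 4.0)    # draw theta from the prior pi(theta)
        y = simulate(mu, len(x), rng)  # simulate y ~ f(y | theta)
        if distance(y, x) < eps:       # keep theta if Delta(y, x) < eps
            accepted.append(mu)
    return accepted

rng = random.Random(1)
x_obs = simulate(1.0, 10, rng)
posterior_sample = abc_rejection(x_obs, n_sims=20000, eps=1.0, rng=rng)
```

The accepted values form a sample from the approximate posterior; shrinking `eps` tightens the approximation at the price of fewer acceptances.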

Solutions for Complex Problems (?)

Approximate (i) data, (ii) model or (iii) distance.

ABC with Summary Statistics

If the data, $D$, are very complex and detailed, direct comparison between real and simulated data becomes prohibitive. In such situations, which originally motivated ABC approaches, summary statistics of the data are compared. We then have

$$p_{S,\varepsilon}(\theta_i \mid D) \propto \int_X 1\big(\Delta(S(x), S(y_\theta)) < \varepsilon\big)\, f(y \mid \theta_i)\,\pi(\theta_i)\,dy$$

Sufficient Statistics

This only works if the statistic $S(\cdot)$ is sufficient, i.e. if for $s = S(x)$ we have

$$p(x \mid s, \theta) = p(x \mid s)$$

Sufficiency for Model Selection

If $S(\cdot)$ is sufficient for parameter estimation (in all models $i$ considered) it is not necessarily sufficient for model selection (Robert et al., PNAS (2011)).

Generate data $X \sim N(1, 1)$ and use ABC to infer $\mu$ (assuming that $\sigma^2 = 1$ is known).

[Figure: ABC posteriors for $\mu$ obtained with different summary statistics, including the min and the max; horizontal axes run from −4 to 4.]

Role of Summary Statistics

Mean (sufficient) correctly infers $\mu$.

Max/Min capture some information on $\mu$.

Var fails to capture any information on $\mu$.

We need a way of constructing sets of statistics that together are (approximately) sufficient.
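The experiment described here can be reproduced in a few lines. The sketch below is not the authors' code: the sample size, the uniform prior on (−4, 4), the tolerance and the number of simulations are all assumptions. It records the spread of the ABC sample obtained with each statistic; a statistic that carries no information on µ should return roughly the prior spread.

```python
import random
import statistics

def abc_with_summary(x, summary, eps, n_sims, rng):
    """ABC rejection keeping mu whenever |S(y) - S(x)| < eps."""
    s_obs = summary(x)
    accepted = []
    for _ in range(n_sims):
        mu = rng.uniform(-4.0, 4.0)                      # assumed prior on mu
        y = [rng.gauss(mu, 1.0) for _ in range(len(x))]  # sigma^2 = 1 is known
        if abs(summary(y) - s_obs) < eps:
            accepted.append(mu)
    return accepted

rng = random.Random(0)
x_obs = [rng.gauss(1.0, 1.0) for _ in range(50)]
summaries = {
    "mean": statistics.mean,
    "min": min,
    "max": max,
    "var": statistics.variance,
}
# Standard deviation of the accepted mu values for each summary statistic.
spread = {
    name: statistics.pstdev(abc_with_summary(x_obs, s, 0.2, 5000, rng))
    for name, s in summaries.items()
}
```

Because the sample variance is unchanged by shifting the data, acceptance under `"var"` does not depend on µ, which is why that sample stays as wide as the prior.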

A Closer Look at Summary Statistics

We interpret a summary statistic as a function,

$$S : \mathbb{R}^d \longrightarrow \mathbb{R}^w, \quad S(x) = s.$$

If $S$ is sufficient then (we include the model indicator variable in $\theta$)

$$p(\theta \mid x) = p(\theta \mid s)$$

Information Theoretical Perspective

A summary statistic is an information compression device. Now let $S$ be a set of statistics which together are sufficient. Then the mutual information satisfies

$$I(\Theta; X) = \int\!\!\int p(\theta, x) \log \frac{p(\theta, x)}{p(\theta)\,p(x)}\,d\theta\,dx = I(\Theta; S)$$

Constructing Minimally Sufficient Summary Statistics

We seek the set $U \subseteq S$ with minimal cardinality such that $I(\Theta; S) = I(\Theta; U)$.

Constructing Sufficient Statistics

Proposition

Let $X$ be a random variable generated according to $f(\cdot \mid \theta)$. Let $S$ be a summary statistic and $U$ and $T$ two subsets of $S$ such that $U = U(X)$, $T = T(X)$ and $S = S(X)$ satisfy $U \subset T \subset S$. We have

$$I(\Theta; S \mid T) = I(\Theta; S \mid U) - I(\Theta; T \mid U).$$

In order to construct a subset $T$ of $S$ such that $I(\Theta; S \mid T) = 0$, it is thus sufficient to add statistics from $S$ one by one until the condition holds. If we denote by $S_{(k)}$ the $k$th statistic to be added (with $k \le w$) we have $S_{(k)} = S_{(k)}(X)$, and then

$$I(\Theta; S \mid S_{(1)}, \ldots, S_{(k+1)}) \le I(\Theta; S \mid S_{(1)}, \ldots, S_{(k)}).$$

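$$\begin{aligned}
I(\Theta; S \mid U) &= \int\!\!\int p(\theta, S(x), U(x)) \log \frac{p(\theta, S(x) \mid U(x))}{p(\theta \mid U(x))\, p(S(x) \mid U(x))}\,dx\,d\theta \\
&= \int p(S(x))\, \mathrm{KL}\big(p(\Theta \mid S(x)) \,\|\, p(\Theta \mid U(x))\big)\,dx \\
&= E_{p(X)}\big[\mathrm{KL}\big(p(\Theta \mid S(X)) \,\|\, p(\Theta \mid U(X))\big)\big]
\end{aligned}$$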
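The identity in the proposition can be read as an instance of the chain rule for conditional mutual information. The short derivation below is a sketch added for orientation, not taken from the slides; it uses only the nesting $U \subset T \subset S$, so that the pair $(T, S)$ carries the same information as $S$ and conditioning on $T$ already fixes $U$.

```latex
\begin{aligned}
I(\Theta; S \mid U) &= I(\Theta; T, S \mid U)                        && \text{since } T \subset S \\
                    &= I(\Theta; T \mid U) + I(\Theta; S \mid T, U)  && \text{chain rule} \\
                    &= I(\Theta; T \mid U) + I(\Theta; S \mid T)     && \text{since } U \subset T
\end{aligned}
```

Rearranging gives $I(\Theta; S \mid T) = I(\Theta; S \mid U) - I(\Theta; T \mid U)$. Since conditional mutual information is non-negative, adding a statistic can only reduce, or leave unchanged, the information that remains to be captured.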

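An Impossible Algorithm

• for all subsets $u^* \subseteq s^*$, perform ABC to obtain estimates $p_\varepsilon(\Theta \mid u^*)$
• determine the set $A = \{u^* \subset s^* \text{ such that } \mathrm{KL}(p_\varepsilon(\Theta \mid s^*) \,\|\, p_\varepsilon(\Theta \mid u^*)) = 0\}$
• the desired subset is $\operatorname{argmin}_{u^* \in A} |u^*|$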

input: a sufficient set of statistics whose values on the dataset are s∗ = {s∗1, ..., s∗w}, a threshold δ
output: a subset v∗ of s∗

choose randomly u∗ in s∗
v∗ ← u∗
q∗ ← s∗ \ v∗
repeat
    repeat
        if q∗ = Ø then
            return v∗
        end if
        choose randomly u∗ in q∗
        q∗ ← q∗ \ u∗
        perform ABC to obtain pε(Θ|v∗, u∗)
    until KL(pε(Θ|v∗, u∗) || pε(Θ|v∗)) ≥ δ
    optionally: v∗ ← OrderDependency(v∗, u∗)
    v∗ ← v∗ ∪ u∗
    q∗ ← s∗ \ v∗
until q∗ = Ø
return v∗
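The greedy procedure can be sketched in Python on a toy problem. Everything specific here is an assumption made for illustration: the normal model with known variance, the three candidate statistics, the uniform prior on (−4, 4), the histogram estimator of the KL divergence, and the values of δ and ε. The optional OrderDependency step is omitted.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

STATS = {"mean": mean, "var": variance, "max": max}  # candidate set s*

def abc_posterior(names, x, eps, n_sims, rng):
    """ABC rejection sample of mu using only the statistics in `names`."""
    s_obs = {n: STATS[n](x) for n in names}
    accepted = []
    for _ in range(n_sims):
        mu = rng.uniform(-4.0, 4.0)
        y = [rng.gauss(mu, 1.0) for _ in range(len(x))]
        if all(abs(STATS[n](y) - s_obs[n]) < eps for n in names):
            accepted.append(mu)
    return accepted

def kl_hist(p_sample, q_sample, lo=-4.0, hi=4.0, bins=20, pseudo=1e-3):
    """Histogram estimate of KL(p || q); pseudo-counts avoid log(0)."""
    def hist(sample):
        counts = [pseudo] * bins
        for v in sample:
            k = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
            counts[k] += 1
        total = sum(counts)
        return [c / total for c in counts]
    p, q = hist(p_sample), hist(q_sample)
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def select_statistics(x, delta, eps, n_sims, rng):
    names = list(STATS)
    v = [rng.choice(names)]                   # choose randomly u* in s*
    post_v = abc_posterior(v, x, eps, n_sims, rng)
    while True:
        q = [n for n in names if n not in v]  # q* <- s* \ v*
        rng.shuffle(q)                        # random draws without replacement
        for u in q:
            post_vu = abc_posterior(v + [u], x, eps, n_sims, rng)
            if kl_hist(post_vu, post_v) >= delta:
                v, post_v = v + [u], post_vu  # v* <- v* U u*
                break
        else:
            return v                          # q* exhausted: no candidate adds information

rng = random.Random(3)
x_obs = [rng.gauss(1.0, 1.0) for _ in range(30)]
selected = select_statistics(x_obs, delta=0.3, eps=0.3, n_sims=4000, rng=rng)
```

The result depends on the random starting statistic and on the noise in the KL estimate, so δ has to be set above the level that sampling variation alone would produce.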

Examples: Normal Distributions

$y_1, \ldots, y_d \sim N(\mu, \sigma_1^2)$ and $y_1, \ldots, y_d \sim N(\mu, \sigma_2^2)$

[Figure: candidate summary statistics compared: mean, $S^2$, range, max, random.]

[Figure: two panels with horizontal axis "log(BF) predicted", running from −2 to 8.]
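Model selection between two such normal models can be sketched with ABC by treating the model indicator as an extra parameter and counting acceptances per model. The values σ₁ = 1 and σ₂ = 2, the uniform priors, the use of the sample mean and variance as statistics, and the tolerance are all hypothetical choices for this illustration.

```python
import random

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def abc_model_selection(x, sigmas, eps, n_sims, rng):
    """Estimate posterior model probabilities by ABC rejection over (model, mu)."""
    s_obs = (mean(x), variance(x))
    counts = [0] * len(sigmas)
    for _ in range(n_sims):
        m = rng.randrange(len(sigmas))  # uniform prior over models
        mu = rng.uniform(-4.0, 4.0)     # assumed prior on the shared mean
        y = [rng.gauss(mu, sigmas[m]) for _ in range(len(x))]
        if abs(mean(y) - s_obs[0]) < eps and abs(variance(y) - s_obs[1]) < eps:
            counts[m] += 1
    total = sum(counts)
    # Fraction of accepted simulations per model; None if nothing was accepted.
    return [c / total for c in counts] if total else None

rng = random.Random(42)
x_obs = [rng.gauss(1.0, 1.0) for _ in range(50)]  # data generated under model 1
probs = abc_model_selection(x_obs, sigmas=[1.0, 2.0], eps=0.3, n_sims=10000, rng=rng)
```

The ratio `probs[0] / probs[1]` estimates the Bayes factor under a uniform model prior, provided both models receive at least some acceptances.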

Examples: Population Genetics

[Figure: three panels over the summary statistics S1 to S11, one per data-generating model: constant population size; exponential population growth; two-island model with migration.]

[S1] Number of Segregating Sites; [S2] Number of Distinct Haplotypes; [S3] Haplotype Homozygosity; [S4] Average SNP Homozygosity; [S5] Number of occurrences of most common haplotype; [S6] Mean number of pair-wise differences between haplotypes; [S7] Number of Singleton Haplotypes; [S8] Number of Singleton SNPs; [S9] Linkage Disequilibrium.

Summary Statistic Choice

The choice of summary statistics appears to depend subtly on the true data-generating model. In light of coalescent processes this is, however, to be expected.

Examples: Random Walks

[Figure: three panels over the summary statistics S1 to S5, one per data-generating model: classical random walk; persistent random walk; biased random walk.]

[S1] Mean square displacement; [S2] Mean x and y displacement; [S3] Mean square x and y displacement; [S4] Straightness index; [S5] Eigenvalues of gyration tensor.

Parameter Sufficiency for Complex Problems

Here all statistics that have been chosen for parameter estimation are also chosen for model selection.

Conditioning on Information

[Figure: illustration involving the statistics $s_1$, $s_2$, $s_3$.]

Statistics

Sufficient: Implicates same area as full data.

Ancillary: Implicates all values of $\theta$ equally.

What is the meaning of $p(\theta \mid s_0, s_1, \ldots, s_n)$?

Let $s = (s_0, s_1, \ldots, s_n)$, and assume $I(\theta, s) < I(\theta, x)$ but $\varepsilon \to 0$. This can happen for sufficient and ancillary $s$. In the latter case we obtain

$$p(\theta \mid s) = \pi(\theta).$$

How about $p(t \mid s)$ if $s$ is not (quite) sufficient?

Model Selection vs. Model Checking

Model Selection: Several models $M \in \mathcal{M}$ are compared and one or more are chosen in light of the data: find models which are better than others.

Model Checking: The quality of a model $M_i$ is assessed against the available data: determine if a model is actually ‘good’.

Alternative Approach: ABCµ [Ratmann et al., PNAS].

Posterior Predictive Checks

We are interested in the posterior predictive distribution,

$$p(t(X) \mid s(X)) = \int p(t(X) \mid \theta)\, p(\theta \mid s(X))\,d\theta.$$

In particular we have

$$p(s(X) \mid s(X)) \neq p(s(X) \mid X)$$

unless $s(X)$ is sufficient.

ABC on Network Data

[Figure: networks for four evolution models: (e) duplication attachment; (f) duplication attachment with complementarity; (g) linear preferential attachment; (h) general scale-free.]

Summarizing Networks

• Data are noisy and incomplete.
• We can simulate models of network evolution, but likelihoods can be calculated only for very trivial models.
• There is also no sufficient statistic that would allow us to summarize networks, so ABC approaches require some thought.
• Many possible summary statistics of networks are expensive to calculate.

Full likelihood: Wiuf et al., PNAS (2006).

ABC: Ratmann et al., PLoS Comp. Biol. (2008).

ABC (better): Thorne & Stumpf, J. Roy. Soc. Interface (2012).

Stumpf & Wiuf, J. Roy. Soc. Interface (2010).

Spectral Distances

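[Figure: example graph on the nodes a, b, c, d, e with its adjacency matrix.]

$$A = \begin{pmatrix} 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} \quad \text{(rows and columns ordered } a, b, c, d, e\text{)}$$

Graph Spectra

Given a graph $G$ with nodes $N$ and edges $(i, j) \in E$ with $i, j \in N$, the adjacency matrix, $A$, of the graph is defined by

$$a_{i,j} = \begin{cases} 1 & \text{if } (i, j) \in E, \\ 0 & \text{otherwise.} \end{cases}$$

The eigenvalues, $\lambda$, of this matrix provide one way of defining the graph spectrum.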

Spectral Distances

A simple distance measure between graphs having adjacency matrices $A$ and $B$, known as the edit distance, is to count the number of edges that are not shared by both graphs,

$$D(A, B) = \sum_{i,j} (a_{i,j} - b_{i,j})^2.$$

However, for unlabelled graphs we require some mapping $h$ from $i \in N_A$ to $i' \in N_B$ that minimizes the distance

$$D(A, B) \ge D'_h(A, B) = \sum_{i,j} (a_{i,j} - b_{h(i),h(j)})^2.$$

Given a spectrum (which is relatively cheap to compute) we have

$$D'(A, B) = \sum_{l} \big(\lambda^{(\alpha)}_l - \lambda^{(\beta)}_l\big)^2.$$
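The gap between the labelled and unlabelled edit distances can be made concrete on a small graph. The sketch below uses hypothetical function names and a five-node adjacency matrix chosen for illustration. It sums over all ordered pairs (i, j), so each undirected edge is counted twice, and it finds the best mapping h by brute force over all n! permutations, which is exactly the cost that a spectral distance avoids.

```python
from itertools import permutations

def edit_distance(a, b):
    """D(A, B): squared differences summed over all ordered node pairs (i, j)."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n))

def relabel(a, h):
    """Adjacency matrix whose (i, j) entry is a[h(i)][h(j)]."""
    n = len(a)
    return [[a[h[i]][h[j]] for j in range(n)] for i in range(n)]

def unlabelled_distance(a, b):
    """Minimum of D'_h(A, B) over all node mappings h (n! candidates)."""
    return min(edit_distance(a, relabel(b, h)) for h in permutations(range(len(a))))

A = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
B = relabel(A, (4, 1, 2, 3, 0))  # the same graph with the first and last nodes swapped

d_labelled = edit_distance(A, B)
d_unlabelled = unlabelled_distance(A, B)
```

Here `B` is isomorphic to `A`, so the minimized distance vanishes while the labelled distance does not.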

Protein Interaction Network Data

Species           Proteins  Interactions  Genome size  Sampling fraction
S. cerevisiae         5035         22118         6532               0.77
D. melanogaster       7506         22871        14076               0.53
H. pylori              715          1423         1589               0.45
E. coli               1888          7008         5416               0.35
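The sampling fraction column appears to be the number of proteins in each data set divided by the genome size. That reading is inferred from the numbers rather than stated in the slides; the short check below recomputes it from the table values.

```python
# (proteins, genome size) per species, copied from the table.
datasets = {
    "S. cerevisiae": (5035, 6532),
    "D. melanogaster": (7506, 14076),
    "H. pylori": (715, 1589),
    "E. coli": (1888, 5416),
}

# Assumed definition: fraction of the genome's proteins present in the network data.
sampling_fraction = {
    name: round(proteins / genome, 2)
    for name, (proteins, genome) in datasets.items()
}
```

Under this definition all four rounded values agree with the last column of the table.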

[Figure: model support by organism (S. cerevisiae, D. melanogaster, H. pylori, E. coli) for the models DA, DAC, LPA, SF, DACL and DACR.]

Model Selection

• Inference here was based on all the data, not summary statistics.
• Duplication models receive the strongest support from the data.
• Several models receive support and no model is chosen unambiguously.

Considerate Use of ABC

• ABC is a tool for situations where conventional statistical approaches fail or are too cumbersome.
• If all the data are used then this is (relatively) unproblematic; if the data are compressed/corrupted then caution is required.
• Some of the issues arising in ABC mirror those also encountered in “conventional” statistics:

    Any Bayesian inference uses the data only via the minimal sufficient statistic. This is because the calculation of the posterior distribution involves multiplying the likelihood by the prior and normalizing. Any factor of the likelihood that is a function of y alone will disappear after normalization.

    D. Cox (2006).

• In other cases it seems prudent to accept the additional (and considerable) computational cost of constructing suitable summary statistics (such as in Barnes et al., Stat&Comp 2012).

Acknowledgements
