
Bayesian model choice (and some alternatives)

Christian P. Robert
Université Paris-Dauphine, IUF, & CREST
http://www.ceremade.dauphine.fr/~xian

November 20, 2010

Outline

Anyone not shocked by the Bayesian theory of inference has not understood it.
Senn, BA, 2008

1 Introduction
2 Tests and model choice
3 Incoherent inferences

Vocabulary and concepts

Bayesian inference is a coherent mathematical theory, but I don't trust it in scientific applications.
Gelman, BA, 2008

1 Introduction
    Models
    The Bayesian framework
    Improper prior distributions
    Noninformative prior distributions
2 Tests and model choice
3 Incoherent inferences

Parametric model

Bayesians promote the idea that a multiplicity of parameters can be handled via hierarchical, typically exchangeable, models, but it seems implausible that this could really work automatically [instead of] giving reasonable answers using minimal assumptions.
Gelman, BA, 2008

Observations x_1, ..., x_n generated from a probability distribution

$$f_i(x_i \mid \theta_i, x_1, \ldots, x_{i-1}) = f_i(x_i \mid \theta_i, x_{1:i-1}), \qquad x = (x_1, \ldots, x_n) \sim f(x \mid \theta), \quad \theta = (\theta_1, \ldots, \theta_n)$$

Associated likelihood

$$\ell(\theta \mid x) = f(x \mid \theta)$$

[inverted density & starting point]


Bayes theorem 101

Bayes theorem = inversion of probabilities

If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by

$$P(A \mid E) = \frac{P(E \mid A)\,P(A)}{P(E \mid A)\,P(A) + P(E \mid A^c)\,P(A^c)} = \frac{P(E \mid A)\,P(A)}{P(E)}$$

[Thomas Bayes (?)]

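As a quick numeric companion to the inversion formula (a minimal sketch; the event probabilities below are hypothetical, chosen only for illustration):

```python
# Bayes' inversion: P(A|E) from P(E|A), P(E|A^c) and the prior P(A).

def bayes_posterior(p_a, p_e_given_a, p_e_given_ac):
    """Return P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|A^c)P(A^c)]."""
    p_e = p_e_given_a * p_a + p_e_given_ac * (1.0 - p_a)
    return p_e_given_a * p_a / p_e

print(bayes_posterior(0.01, 0.95, 0.05))  # ~0.161: a rare A stays unlikely even after E
```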

Bayesian approach

The impact of treating x as a fixed constant is to increase statistical power as an artefact.
Templeton, Mol. Ecol., 2009

New perspective:
Uncertainty on the parameters θ of a model modeled through a probability distribution π on Θ, called the prior distribution
Inference based on the distribution of θ conditional on x, π(θ|x), called the posterior distribution

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta}\,.$$

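The posterior formula can be checked on a grid for a toy model (a sketch, assuming x ~ N(θ, 1) and a hypothetical N(0, 10) prior; none of these settings come from the slides):

```python
import numpy as np

x = 1.5
theta = np.linspace(-10.0, 10.0, 2001)
d = theta[1] - theta[0]
prior = np.exp(-theta**2 / (2 * 10.0))      # N(0, 10) prior, up to a constant
like = np.exp(-(x - theta)**2 / 2.0)        # N(theta, 1) likelihood, up to a constant
post = prior * like
post /= post.sum() * d                      # normalise by the evidence integral
print((theta * post).sum() * d)             # ~ 10x/11 = 1.3636, the conjugate answer
```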

[Nonphilosophical] justifications

Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method.
Templeton, Mol. Ecol., 2009

Semantic drift from unknown to random
Actualization of the information on θ by extracting the information on θ contained in the observation x
Allows incorporation of imperfect information in the decision process
Unique mathematical way to condition upon the observations (conditional perspective)
Unique way to give meaning to statements like P(θ > 0)


Posterior distribution

Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience.
Gelman, BA, 2008

π(θ|x) is central to Bayesian inference:
Operates conditional upon the observations
Incorporates the requirement of the Likelihood Principle
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ
Provides a complete inferential machinery


Improper distributions

If we take P(dσ) ∝ dσ as a statement that σ may have any value between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty.
Jeffreys, ToP, 1939

Necessary extension from a prior distribution to a prior σ-finite measure π such that

$$\int_\Theta \pi(\theta)\,\mathrm{d}\theta = +\infty$$

Improper prior distribution
[Weird? Inappropriate?? report!!]


Justifications

If the parameter may have any value from −∞ to +∞, its prior probability should be taken as uniformly distributed.
Jeffreys, ToP, 1939

Automated prior determination often leads to improper priors:

1 Similar performances of estimators derived from these generalized distributions
2 Improper priors as limits of proper distributions in many [mathematical] senses


Further justifications

There is no good objective principle for choosing a noninformative prior (even if that concept were mathematically defined, which it is not).
Gelman, BA, 2008

4 Robust answer against possible misspecifications of the prior
5 Frequentist justifications, such as:
    (i) minimaxity
    (ii) admissibility
    (iii) invariance (Haar measure)
6 Improper priors [much] preferred to vague proper priors like N(0, 10^6)


Validation

The mistake is to think of them as representing ignorance.
Lindley, JASA, 1990

Extension of the posterior distribution π(θ|x) associated with an improper prior π, as given by Bayes’s formula

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_\Theta f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta},$$

when

$$\int_\Theta f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta < \infty$$

Delete emotionally loaded names


Noninformative priors

...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing.
Kass and Wasserman, JASA, 1996

What if all we know is that we know “nothing”?!
In the absence of prior information, prior distributions solely derived from the sample distribution f(x|θ)
Difficulty with uniform priors, lacking invariance properties. Rather use Jeffreys’ prior.
[Jeffreys, 1939; Robert, Chopin & Rousseau, 2009]


Tests and model choice

The Jeffreys-subjective synthesis betrays a much more dangerous confusion than the Neyman-Pearson-Fisher synthesis as regards hypothesis tests.
Senn, BA, 2008

1 Introduction
2 Tests and model choice
    Bayesian tests
    Opposition to classical tests
    Model choice
    Pseudo-Bayes factors
    Compatible priors
    Variable selection
3 Incoherent inferences

Construction of Bayes tests

What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008

Definition (Test)
Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}.

Example (Normal mean)
For x ~ N(θ, 1), decide whether or not θ ≤ 0.


Decision-theoretic perspective

Loss functions [are] not relevant to statistical inference.
Gelman, BA, 2008

Theorem (Optimal Bayes decision)
Under the 0-1 loss function

$$L(\theta, d) = \begin{cases} 0 & \text{if } d = \mathbb{I}_{\Theta_0}(\theta) \\ a_0 & \text{if } d = 1 \text{ and } \theta \notin \Theta_0 \\ a_1 & \text{if } d = 0 \text{ and } \theta \in \Theta_0 \end{cases}$$

the Bayes procedure is

$$\delta^\pi(x) = \begin{cases} 1 & \text{if } \Pr^\pi(\theta \in \Theta_0 \mid x) \ge a_0/(a_0 + a_1) \\ 0 & \text{otherwise} \end{cases}$$

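For the normal-mean example the Bayes rule becomes explicit once a prior is fixed. A minimal sketch, assuming a flat prior on θ (so that θ|x ~ N(x, 1), an assumption not made on the slide); the loss weights a0, a1 are free parameters:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bayes_test(x, a0=1.0, a1=1.0):
    """Bayes rule for H0: theta <= 0 with x ~ N(theta, 1) and a flat prior,
    under which Pr(theta <= 0 | x) = Phi(-x)."""
    return 1 if phi(-x) >= a0 / (a0 + a1) else 0

print(bayes_test(0.5), bayes_test(-0.5))  # 0 then 1 with symmetric penalties
```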

A function of posterior probabilities

The method posits two or more alternative hypotheses and tests their relative fits to some observed statistics.
Templeton, Mol. Ecol., 2009

Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,

$$B_{01} = \frac{\pi(\Theta_0 \mid x)}{\pi(\Theta_0^c \mid x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \int_{\Theta_0} f(x \mid \theta)\,\pi_0(\theta)\,\mathrm{d}\theta \bigg/ \int_{\Theta_0^c} f(x \mid \theta)\,\pi_1(\theta)\,\mathrm{d}\theta$$

[Good, 1958 & Jeffreys, 1961]
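Both integrals in the definition can be estimated by plain Monte Carlo under the prior. The sketch below does this for H0 : θ ≤ 0 vs Ha : θ > 0 with x ~ N(θ, 1) and a N(0, 1) prior (all illustrative choices, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 0.8                                  # illustrative observation
theta = rng.standard_normal(100_000)     # draws from the N(0, 1) prior
like = np.exp(-(x - theta)**2 / 2.0)

# B_01 = mean of f(x|theta) over theta <= 0 divided by the mean over theta > 0,
# since pi_0 and pi_1 are the renormalised restrictions of the prior
b01 = like[theta <= 0].mean() / like[theta > 0].mean()
print(b01)   # < 1 here: a positive x favours H_a: theta > 0
```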

Self-contained concept

Having a high relative probability does not mean that a hypothesis is true or supported by the data.
Templeton, Mol. Ecol., 2009

Non-decision-theoretic:
eliminates the choice of π(Θ0)
Bayesian/marginal equivalent to the likelihood ratio

Jeffreys’ scale of evidence:
    if $\log_{10}(B_{10}^\pi)$ is between 0 and 0.5, the evidence against H0 is weak,
    if $\log_{10}(B_{10}^\pi)$ is between 0.5 and 1, the evidence is substantial,
    if $\log_{10}(B_{10}^\pi)$ is between 1 and 2, the evidence is strong, and
    if $\log_{10}(B_{10}^\pi)$ is above 2, the evidence is decisive.

A major modification

Considering whether a location parameter α is 0. The prior is uniform and we should have to take f(α) = 0 and B10 would always be infinite.
Jeffreys, ToP, 1939

When the null hypothesis is supported by a set of measure 0, π(Θ0) = 0 and thus π(Θ0|x) = 0.
[End of the story?!]

Changing the prior to fit the hypotheses

Given that some logical overlap is common when dealing with complex models, this means that much of the literature is invalid.
Templeton, Trends in Ecology and Evolution, 2010

Requirement
Define prior distributions under both assumptions,

$$\pi_0(\theta) \propto \pi(\theta)\,\mathbb{I}_{\Theta_0}(\theta), \qquad \pi_1(\theta) \propto \pi(\theta)\,\mathbb{I}_{\Theta_1}(\theta),$$

[under the standard dominating measures on Θ0 and Θ1], leading to

$$\pi(\theta) = \varrho_0\,\pi_0(\theta) + \varrho_1\,\pi_1(\theta).$$

Point null hypotheses

I have no patience for statistical methods that assign positive probability to point hypotheses of the θ = 0 type that can never actually be true.
Gelman, BA, 2008

Take $\rho_0 = \Pr^\pi(\theta = \theta_0)$ and let $g_1$ be the prior density under Ha. Then

$$\pi(\Theta_0 \mid x) = \frac{f(x \mid \theta_0)\,\rho_0}{\int f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta} = \frac{f(x \mid \theta_0)\,\rho_0}{f(x \mid \theta_0)\,\rho_0 + (1 - \rho_0)\,m_1(x)}$$

and the Bayes factor is

$$B_{01}^\pi(x) = \frac{f(x \mid \theta_0)\,\rho_0}{m_1(x)\,(1 - \rho_0)} \bigg/ \frac{\rho_0}{1 - \rho_0} = \frac{f(x \mid \theta_0)}{m_1(x)}$$


Point null hypotheses (cont’d)

Example (Normal mean)
Test of H0 : θ = 0 when x ~ N(θ, 1): we take π1 as N(0, τ²). Then

$$\frac{m_1(x)}{f(x \mid 0)} = \sqrt{\frac{\sigma^2}{\sigma^2 + \tau^2}}\,\exp\left\{\frac{\tau^2 x^2}{2\sigma^2(\sigma^2 + \tau^2)}\right\}$$

and the posterior probability of the null is

τ² \ x    0       0.68    1.28    1.96
1         0.586   0.557   0.484   0.351
10        0.768   0.729   0.612   0.366
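The table can be reproduced from the displayed ratio; the entries match σ² = 1, equal prior weights ρ0 = 1/2, and rows indexed by τ² ∈ {1, 10} (the τ² row label above is an inference from the numbers rather than from the original slide):

```python
from math import exp, sqrt

def post_null(x, tau2, sigma2=1.0, rho0=0.5):
    """pi(theta = 0 | x) for x ~ N(theta, sigma2), alternative prior N(0, tau2)."""
    m1_over_f0 = sqrt(sigma2 / (sigma2 + tau2)) * exp(
        tau2 * x * x / (2.0 * sigma2 * (sigma2 + tau2)))
    return 1.0 / (1.0 + (1.0 - rho0) / rho0 * m1_over_f0)

for tau2 in (1.0, 10.0):
    print([round(post_null(x, tau2), 3) for x in (0.0, 0.68, 1.28, 1.96)])
# [0.586, 0.557, 0.484, 0.351]
# [0.768, 0.729, 0.612, 0.366]
```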

Comparison with classical tests

The 95 percent frequentist intervals will live up to their advertised coverage claims.
Wasserman, BA, 2008

Standard/classical answer:

Definition (p-value)
The p-value p(x) associated with a test is the largest significance level for which H0 is rejected.

Problems with p-values

The use of P implies that a hypothesis that may be true may be rejected because it had not predicted observable results that have not occurred.
Jeffreys, ToP, 1939

Evaluation of the wrong quantity, namely the probability to exceed the observed quantity (wrong conditioning)
Evaluation only under the null hypothesis
Huge numerical difference with the Bayesian range of answers

Bayesian lower bounds

If the Bayes estimator has good frequency behavior then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it.
Wasserman, BA, 2008

The least favourable Bayesian answer is

$$B(x, \mathcal{G}_A) = \inf_{g \in \mathcal{G}_A} \frac{f(x \mid \theta_0)}{\int_\Theta f(x \mid \theta)\,g(\theta)\,\mathrm{d}\theta}\,,$$

i.e., if there exists an MLE $\hat\theta(x)$ for θ,

$$B(x, \mathcal{G}_A) = \frac{f(x \mid \theta_0)}{f(x \mid \hat\theta(x))}$$

Illustration

Example (Normal case)
When x ~ N(θ, 1) and H0 : θ0 = 0, the lower bounds are

$$B(x, \mathcal{G}_A) = e^{-x^2/2} \qquad\text{and}\qquad P(x, \mathcal{G}_A) = \left(1 + e^{x^2/2}\right)^{-1},$$

i.e.

p-value   0.10    0.05    0.01    0.001
P         0.205   0.128   0.035   0.004
B         0.256   0.146   0.036   0.004

[Quite different!]

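The table follows from the two closed forms, taking x as the two-sided normal cutoff of each p-value (a sketch; the first B entry comes out as 0.259 rather than 0.256, presumably because the slide rounded the cutoff to 1.65):

```python
from math import exp
from statistics import NormalDist

for p in (0.10, 0.05, 0.01, 0.001):
    x = NormalDist().inv_cdf(1.0 - p / 2.0)   # two-sided cutoff for this p-value
    B = exp(-x * x / 2.0)                     # lower bound on the Bayes factor
    P = 1.0 / (1.0 + exp(x * x / 2.0))        # lower bound on pi(H0 | x), rho0 = 1/2
    print(f"p = {p:5.3f}   P = {P:.3f}   B = {B:.3f}")
```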

Model choice and model comparison

There is no null hypothesis, which complicates the computation of sampling error.
Templeton, Mol. Ecol., 2009

Choice among models: several models are available for the same observation(s),

$$M_i : x \sim f_i(x \mid \theta_i), \qquad i \in I,$$

where I can be finite or infinite.

Bayesian resolution

The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the ”probabilities” sum to one.
Templeton, Mol. Ecol., 2009

Probabilise the entire model/parameter space:
allocate probabilities p_i to all models M_i
define priors π_i(θ_i) for each parameter space Θ_i
compute

$$\pi(M_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,\mathrm{d}\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,\mathrm{d}\theta_j}$$

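When the marginals ∫ f_i π_i are available in closed form, the posterior model probabilities are one line of arithmetic. A toy sketch with two hypothetical models for a single observation (M1: x ~ N(θ, 1) with θ ~ N(0, 1), so m1(x) = N(x; 0, 2); M2: x ~ N(0, 1) with no free parameter; all settings are illustrative):

```python
from math import exp, pi, sqrt

def norm_pdf(x, var):
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * pi * var)

x, p1, p2 = 1.8, 0.5, 0.5
m1 = norm_pdf(x, 2.0)          # marginal likelihood of M1: N(x; 0, 2)
m2 = norm_pdf(x, 1.0)          # marginal likelihood of M2: N(x; 0, 1)
print(p1 * m1 / (p1 * m1 + p2 * m2))   # pi(M1 | x), about 0.61 here
```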

Bayesian resolution (2)

The numerators are not co-measurable across hypotheses, and the denominators are sums of non-co-measurable entities. This means that it is mathematically impossible for them to be probabilities.
Templeton, Mol. Ecol., 2009

Take the largest π(M_i|x) to determine the “best” model, or use the averaged predictive

$$\sum_j \pi(M_j \mid x) \int_{\Theta_j} f_j(x' \mid \theta_j)\,\pi_j(\theta_j \mid x)\,\mathrm{d}\theta_j$$

Natural Occam’s razor

Pluralitas non est ponenda sine necessitate

Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary.
Jeffreys, ToP, 1939

The Bayesian approach naturally weights differently models with different parameter dimensions (BIC being an approximate log-Bayes factor).


A fundamental difficulty

1) ABC can and does produce results that are mathematically impossible; 2) the “posterior probabilities” of ABC cannot possibly be true probability measures; and 3) ABC is statistically incoherent.
Templeton, Trends in Ecology and Evolution, 2010

Improper priors are NOT allowed here. If

$$\int_{\Theta_1} \pi_1(\mathrm{d}\theta_1) = \infty \qquad\text{or}\qquad \int_{\Theta_2} \pi_2(\mathrm{d}\theta_2) = \infty$$

then either π1 or π2 cannot be coherently normalised, but the normalisation matters in the Bayes factor.


Normal illustration

Take x ~ N(θ, 1) and H0 : θ = 0.

Impact of the constant on the posterior probability π(θ = 0 | x):

x            0.0      1.0      1.65     1.96     2.58
π(θ) = 1     0.285    0.195    0.089    0.055    0.014
π(θ) = 10    0.0384   0.0236   0.0101   0.00581  0.00143
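With π(θ) = c on the alternative, m1(x) = ∫ f(x|θ) c dθ = c, so with equal prior weights π(θ = 0 | x) = φ(x)/(φ(x) + c), which is presumably how the table was computed (this reproduces every entry except the 0.089 at x = 1.65, which the formula puts at 0.093):

```python
from math import exp, pi, sqrt

def phi(x):
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)   # N(0, 1) density

for c in (1.0, 10.0):
    print([round(phi(x) / (phi(x) + c), 4) for x in (0.0, 1.0, 1.65, 1.96, 2.58)])
```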

Vague proper priors are NOT the solution

Taking a proper prior with a “very large” variance (e.g., in BUGS) will most often result in an undefined or ill-defined limit.

Example (Lindley’s paradox)
If testing H0 : θ = 0 when observing x ~ N(θ, 1), under a normal N(0, α) prior π1(θ),

$$B_{01}(x) \xrightarrow{\;\alpha \to \infty\;} \infty$$

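The paradox is easy to watch numerically: for x ~ N(θ, 1) and π1 = N(0, α), the Bayes factor is B01(x) = N(x; 0, 1)/N(x; 0, 1 + α), which grows like √α at any fixed x (a minimal sketch):

```python
from math import exp, sqrt

def b01(x, alpha):
    """B01 = N(x; 0, 1) / N(x; 0, 1 + alpha) for H0: theta = 0 vs theta ~ N(0, alpha)."""
    return sqrt(1.0 + alpha) * exp(-x * x * alpha / (2.0 * (1.0 + alpha)))

for alpha in (1.0, 100.0, 10_000.0, 1_000_000.0):
    print(alpha, b01(2.0, alpha))   # 0.52, 1.39, 13.5, 135.3: off to infinity
```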

Learning from the sample

It is possible for data to discriminate among a set of hypotheses without saying anything about a proposition that is common to all the alternatives considered.
Sober, Evidence and Evolution, 2008

Definition (Learning sample)
Given an improper prior π, (x1, ..., xn) is a learning sample if π(·|x1, ..., xn) is proper, and a minimal learning sample if none of its subsamples is a learning sample.

There is just enough information in a minimal learning sample to make inference about θ under the prior π.


Pseudo-Bayes factors

Idea
Use a first part $x_{[i]}$ of the data x to make the prior proper: $\pi_i$ improper but $\pi_i(\cdot \mid x_{[i]})$ proper, and

$$\frac{\int f_i(x_{[n/i]} \mid \theta_i)\,\pi_i(\theta_i \mid x_{[i]})\,\mathrm{d}\theta_i}{\int f_j(x_{[n/i]} \mid \theta_j)\,\pi_j(\theta_j \mid x_{[i]})\,\mathrm{d}\theta_j}$$

is independent of the normalizing constant.
Use the remaining part $x_{[n/i]}$ to run the test as if $\pi_j(\theta_j \mid x_{[i]})$ were the true prior.


Motivation

Provides a working principle for improper priors
Gathers enough information from the data to achieve properness
and uses this properness to run the test on the remaining data
Does not use the data x twice, as in Aitkin’s (1991, 2010) approach


Fractional Bayes factor

To test a theory, you need to test it against alternatives.
Sober, Evidence and Evolution, 2008

Idea
Use the likelihood directly to separate the training sample from the testing sample:

$$B_{12}^F = B_{12}(x) \times \frac{\int L_2^b(\theta_2)\,\pi_2(\theta_2)\,\mathrm{d}\theta_2}{\int L_1^b(\theta_1)\,\pi_1(\theta_1)\,\mathrm{d}\theta_1}$$

[O’Hagan, 1995]

The proportion b of the sample is used to gain properness.


Fractional Bayes factor (cont’d)

Example (Normal mean)

$$B_{12}^F = \frac{1}{\sqrt{b}}\, e^{n(b-1)\bar{x}_n^2/2}$$

corresponds to the exact Bayes factor for the prior $\mathcal{N}\!\left(0, \frac{1-b}{nb}\right)$:
If b is constant, the prior variance goes to 0
If b = 1/n, the prior variance stabilises around 1
If b = n^{−α}, α < 1, the prior variance goes to 0 too
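The stated equivalence can be checked numerically: the closed form above coincides with the exact Bayes factor of H0 : θ = 0 against θ ~ N(0, (1−b)/(nb)), computed from x̄ ~ N(θ, 1/n) (a sketch with simulated data; n, b and the data-generating θ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 20, 0.5
x = rng.normal(0.3, 1.0, n)
xbar = x.mean()

# closed form from the slide: B^F_12 = b^{-1/2} exp{n (b - 1) xbar^2 / 2}
bf_closed = b**-0.5 * np.exp(n * (b - 1.0) * xbar**2 / 2.0)

# exact Bayes factor for the prior N(0, (1 - b)/(n b)), using xbar ~ N(theta, 1/n)
tau2 = (1.0 - b) / (n * b)
num = np.exp(-xbar**2 * n / 2.0) * np.sqrt(n / (2 * np.pi))
den = np.exp(-xbar**2 / (2 * (1.0 / n + tau2))) / np.sqrt(2 * np.pi * (1.0 / n + tau2))
print(bf_closed, num / den)   # the two values coincide
```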

Compatibility principle

Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa.
Templeton, Mol. Ecol., 2009

Difficulty of finding priors simultaneously on a collection of models
Easier to start from a single prior on a “big” [encompassing] model and to derive the others from a coherence principle
[Dawid & Lauritzen, 2000]


An illustration for linear regression

In the case where M1 and M2 are two nested Gaussian linear regression models with Zellner’s g-priors and the same variance σ² ~ π(σ²):

M1 : $y \mid \beta_1, \sigma^2 \sim \mathcal{N}(X_1 \beta_1, \sigma^2 I_n)$ with

$$\beta_1 \mid \sigma^2 \sim \mathcal{N}\!\left(s_1, \sigma^2 n_1 (X_1^{\mathrm T} X_1)^{-1}\right)$$

where X1 is a (n × k1) matrix of rank k1 ≤ n;

M2 : $y \mid \beta_2, \sigma^2 \sim \mathcal{N}(X_2 \beta_2, \sigma^2 I_n)$ with

$$\beta_2 \mid \sigma^2 \sim \mathcal{N}\!\left(s_2, \sigma^2 n_2 (X_2^{\mathrm T} X_2)^{-1}\right),$$

where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1).

[© Marin & Robert, Bayesian Core]


Compatible g-priors

I don’t see any role for squared error loss, minimax, or the rest of what is sometimes called statistical decision theory.
Gelman, BA, 2008

Since σ² is a nuisance parameter, minimize the Kullback-Leibler divergence between both marginal distributions conditional on σ², m1(y|σ²; s1, n1) and m2(y|σ²; s2, n2), with solution

$$\beta_2 \mid X_2, \sigma^2 \sim \mathcal{N}\!\left(s_2^*, \sigma^2 n_2^* (X_2^{\mathrm T} X_2)^{-1}\right)$$

with

$$s_2^* = (X_2^{\mathrm T} X_2)^{-1} X_2^{\mathrm T} X_1 s_1, \qquad n_2^* = n_1$$


Symmetrised compatible priors

If those prior probabilities are obscure, the same will be true of the posterior probabilities.
Sober, Evidence and Evolution, 2008

Postulate: the previous principle requires embedded models (or an encompassing model) and proper priors, while being hard to implement outside exponential families.
We determine prior measures on two models M1 and M2, π1 and π2, directly by a compatibility principle.


Generalised expected posterior priors

[Perez & Berger, 2000]

EPP Principle
Starting from reference priors $\pi_1^N$ and $\pi_2^N$, substitute prior distributions π1 and π2 that solve the system of integral equations

$$\pi_1(\theta_1) = \int_{\mathcal{X}} \pi_1^N(\theta_1 \mid x)\,m_2(x)\,\mathrm{d}x$$

and

$$\pi_2(\theta_2) = \int_{\mathcal{X}} \pi_2^N(\theta_2 \mid x)\,m_1(x)\,\mathrm{d}x,$$

where x is an imaginary minimal training sample and m1, m2 are the marginals associated with π1 and π2 respectively.

Motivations

Eliminates the “imaginary observation” device and proper-isation through part of the data, by integration under the “truth”
Assumes that both models are equally valid and equipped with ideal unknown priors π_i, i = 1, 2, that yield “true” marginals balancing each model wrt the other
For a given π1, π2 is an expected posterior prior
Using both equations introduces symmetry into the game


Bayesian coherence

Logical overlap is the norm for the complex models analyzed with ABC, so many ABC posterior model probabilities published to date are wrong.
Templeton, PNAS, 2009

Theorem (True Bayes factor)
If π1 and π2 are the EPPs and if their marginals are finite, then the corresponding Bayes factor B_{1,2}(x) is either a (true) Bayes factor or a limit of (true) Bayes factors.

Obviously only interesting when both π1 and π2 are improper.


Variable selection

Regression setup where y is regressed on a set {x1, ..., xp} of p potential explanatory regressors (plus intercept)

Corresponding 2^p submodels Mγ, where γ ∈ Γ = {0, 1}^p indicates inclusion/exclusion of variables by a binary representation: e.g., γ = 101001011 means that x1, x3, x6, x8 and x9 are included.

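Decoding the binary representation is mechanical (a two-line sketch; it also confirms the inclusion list quoted above for γ = 101001011):

```python
def included(gamma):
    """Variables selected by the binary string gamma (x1 = leftmost bit)."""
    return [f"x{i + 1}" for i, bit in enumerate(gamma) if bit == "1"]

print(included("101001011"))   # ['x1', 'x3', 'x6', 'x8', 'x9']
```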

Notations

For model Mγ:
qγ variables included
t1(γ) = {t1,1(γ), ..., t1,qγ(γ)} indices of those variables, and t0(γ) indices of the variables not included

For β ∈ R^{p+1},

$$\beta_{t_1(\gamma)} = \left[\beta_0, \beta_{t_{1,1}(\gamma)}, \ldots, \beta_{t_{1,q_\gamma}(\gamma)}\right], \qquad X_{t_1(\gamma)} = \left[1_n \mid x_{t_{1,1}(\gamma)} \mid \ldots \mid x_{t_{1,q_\gamma}(\gamma)}\right].$$

Submodel Mγ is thus

$$y \mid \beta, \gamma, \sigma^2 \sim \mathcal{N}\!\left(X_{t_1(\gamma)}\,\beta_{t_1(\gamma)},\; \sigma^2 I_n\right)$$


Global and compatible priors

Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ²,

$$\beta \mid \sigma^2 \sim \mathcal{N}\!\left(\tilde\beta, c\sigma^2 (X^{\mathrm T} X)^{-1}\right),$$

and a Jeffreys prior for σ²,

$$\pi(\sigma^2) \propto \sigma^{-2}$$

Resulting compatible prior:

$$\beta_{t_1(\gamma)} \sim \mathcal{N}\!\left(\left(X_{t_1(\gamma)}^{\mathrm T} X_{t_1(\gamma)}\right)^{-1} X_{t_1(\gamma)}^{\mathrm T} X \tilde\beta,\;\; c\sigma^2 \left(X_{t_1(\gamma)}^{\mathrm T} X_{t_1(\gamma)}\right)^{-1}\right)$$


Posterior model probability

Can be obtained in closed form:

$$\pi(\gamma \mid y) \propto (c+1)^{-(q_\gamma+1)/2} \left[ y^{\mathrm T} y - \frac{c\, y^{\mathrm T} P_1 y}{c+1} + \frac{\tilde\beta^{\mathrm T} X^{\mathrm T} P_1 X \tilde\beta}{c+1} - \frac{2\, y^{\mathrm T} P_1 X \tilde\beta}{c+1} \right]^{-n/2}.$$

Conditionally on γ, the posterior distributions of β and σ² are

$$\beta_{t_1(\gamma)} \mid \sigma^2, y, \gamma \sim \mathcal{N}\!\left[ \frac{c}{c+1}\left(U_1 y + U_1 X \tilde\beta/c\right),\; \frac{\sigma^2 c}{c+1}\left(X_{t_1(\gamma)}^{\mathrm T} X_{t_1(\gamma)}\right)^{-1} \right],$$

$$\sigma^2 \mid y, \gamma \sim \mathcal{IG}\!\left[ \frac{n}{2},\; \frac{y^{\mathrm T} y}{2} - \frac{c\, y^{\mathrm T} P_1 y}{2(c+1)} + \frac{\tilde\beta^{\mathrm T} X^{\mathrm T} P_1 X \tilde\beta}{2(c+1)} - \frac{y^{\mathrm T} P_1 X \tilde\beta}{c+1} \right],$$

where (following Bayesian Core) $U_1 = (X_{t_1(\gamma)}^{\mathrm T} X_{t_1(\gamma)})^{-1} X_{t_1(\gamma)}^{\mathrm T}$ and $P_1 = X_{t_1(\gamma)} U_1$ is the orthogonal projector onto the column span of $X_{t_1(\gamma)}$.

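With β̃ = 0 the bracket reduces to y^T y − c/(c+1) y^T P1 y, so all 2^p submodels can be scored by enumeration. A sketch on simulated data (it fixes c instead of using the hierarchical prior of the next slide; the design, sample size and c are arbitrary choices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n, p, c = 30, 3, 100.0
X = rng.normal(size=(n, p))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=n)    # only x1 is active

def log_weight(gamma):
    cols = [j for j in range(p) if gamma[j]]
    Xg = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    fit = Xg @ np.linalg.lstsq(Xg, y, rcond=None)[0]   # P1 y: projection of y
    bracket = y @ y - c / (c + 1.0) * (y @ fit)
    return -(len(cols) + 1) / 2.0 * np.log(c + 1.0) - n / 2.0 * np.log(bracket)

gammas = list(product([0, 1], repeat=p))
lw = np.array([log_weight(g) for g in gammas])
w = np.exp(lw - lw.max())
w /= w.sum()
for g, wi in zip(gammas, w):
    print(g, round(wi, 4))    # the mass should concentrate on gamma = (1, 0, 0)
```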

Noninformative case

Use the same compatible informative g-prior distribution with β̃ = 0_{p+1} and a hierarchical diffuse prior distribution on c,

$$\pi(c) \propto c^{-1}\,\mathbb{I}_{\mathbb{N}^*}(c) \qquad\text{or}\qquad \pi(c) \propto c^{-1}\,\mathbb{I}_{c>0}$$

The choice of this hierarchical diffuse prior distribution on c is due to the sensitivity of the model posterior to large values of c: taking β̃ = 0_{p+1} and c large does not work.


Processionary caterpillar

Influence of some forest settlement characteristics on the development of caterpillar colonies

Response y: log-transform of the average number of caterpillar nests per tree on an area of 500 square meters (n = 33 areas)

[© Marin & Robert, Bayesian Core]


Processionary caterpillar (cont’d)

Potential explanatory variables:
x1 altitude (in meters)
x2 slope (in degrees)
x3 number of pines in the square
x4 height (in meters) of the tree at the center of the square
x5 diameter of the tree at the center of the square
x6 index of the settlement density
x7 orientation of the square (from 1 if southbound to 2 otherwise)
x8 height (in meters) of the dominant tree
x9 number of vegetation strata
x10 mix settlement index (from 1 if not mixed to 2 if mixed)

Bayesian regression output

              Estimate   BF        log10(BF)
(Intercept)    9.2714    26.334     1.4205 (***)
X1            -0.0037     7.0839    0.8502 (**)
X2            -0.0454     3.6850    0.5664 (**)
X3             0.0573     0.4356   -0.3609
X4            -1.0905     2.8314    0.4520 (*)
X5             0.1953     2.5157    0.4007 (*)
X6            -0.3008     0.3621   -0.4412
X7            -0.2002     0.3627   -0.4404
X8             0.1526     0.4589   -0.3383
X9            -1.0835     0.9069   -0.0424
X10           -0.3651     0.4132   -0.3838

Evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor

Bayesian variable selection

t1(γ)              π(γ|y, X)
0,1,2,4,5          0.0929
0,1,2,4,5,9        0.0325
0,1,2,4,5,10       0.0295
0,1,2,4,5,7        0.0231
0,1,2,4,5,8        0.0228
0,1,2,4,5,6        0.0228
0,1,2,3,4,5        0.0224
0,1,2,3,4,5,9      0.0167
0,1,2,4,5,6,9      0.0167
0,1,2,4,5,8,9      0.0137

(noninformative g-prior model choice)

Fringe alternatives

1 Introduction
2 Tests and model choice
3 Incoherent inferences
    Templeton’s debate
    Bayes/likelihood fusion

Page 109: Bayesian model choice (and some alternatives)

A revealing confusion

In statistics, coherent measures of fit of nested and overlapping composite hypotheses are technically those measures that are consistent with the constraints of formal logic. For example, the probability of the nested special case must be less than or equal to the probability of the general model within which the special case is nested. Any statistic that assigns greater probability to the special case is said to be incoherent.
Templeton, PNAS, 2009

ABC algorithm

Instead of evaluating hypotheses in terms of how probable they say the data are, we evaluate them by estimating how accurately they’ll predict new data when fitted to old — Sober, Evidence and Evolution, 2008

Algorithm 1: Likelihood-free rejection sampler

for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θi = θ′
end for

where η(y) defines a (not necessarily sufficient) statistic
[Pritchard et al., 1999]
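
A minimal runnable sketch of this sampler (an illustration, not part of the original algorithm statement): the model, prior, summary statistic η, distance ρ, and tolerance ε below are all toy choices, with yi ∼ N(θ, 1), θ ∼ N(0, 10), and η the sample mean, which happens to be sufficient here.

import numpy as np

rng = np.random.default_rng(0)

n = 20
y = rng.normal(2.0, 1.0, size=n)      # observed data (simulated for the demo)
eta = lambda u: u.mean()              # summary statistic
rho = lambda a, b: abs(a - b)         # distance between summaries
eps, N = 0.05, 1000                   # tolerance and sample size

theta_abc = np.empty(N)
for i in range(N):
    while True:
        theta = rng.normal(0.0, np.sqrt(10.0))   # theta' ~ pi(.)
        z = rng.normal(theta, 1.0, size=n)       # z ~ f(.|theta')
        if rho(eta(z), eta(y)) <= eps:           # accept when summaries match
            theta_abc[i] = theta
            break

# conjugate benchmark: theta | y ~ N(10*n*ybar/(10*n+1), 10/(10*n+1))
print(theta_abc.mean(), theta_abc.std())

Because η is sufficient in this toy model, the accepted θ’s approximate the exact posterior as ε → 0; with an insufficient η they would only approximate π(θ|η(y)).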

ABC output

The likelihood-free algorithm samples from the marginal in z of

\[
\pi_\varepsilon(\theta, z\mid y) = \frac{\pi(\theta)\, f(z\mid\theta)\, \mathbb{I}_{A_{\varepsilon,y}}(z)}{\int_{A_{\varepsilon,y}\times\Theta} \pi(\theta)\, f(z\mid\theta)\,\mathrm{d}z\,\mathrm{d}\theta}\,,
\]

where \(A_{\varepsilon,y} = \{z \in \mathcal{D} \mid \rho(\eta(z), \eta(y)) < \varepsilon\}\).

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

\[
\pi_\varepsilon(\theta\mid y) = \int \pi_\varepsilon(\theta, z\mid y)\,\mathrm{d}z \approx \pi(\theta\mid y)\,.
\]
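
One way to make the approximation explicit (a standard argument, not spelled out on the slide): integrating z out of πε(θ, z|y) replaces f(z|θ) by an acceptance probability,

\[
\pi_\varepsilon(\theta\mid y) = \frac{\pi(\theta)\, P_\theta\bigl\{\rho(\eta(z),\eta(y)) < \varepsilon\bigr\}}{\int_\Theta \pi(\theta')\, P_{\theta'}\bigl\{\rho(\eta(z),\eta(y)) < \varepsilon\bigr\}\,\mathrm{d}\theta'}
\ \longrightarrow\ \pi(\theta\mid \eta(y)) \qquad (\varepsilon \to 0)\,,
\]

under mild continuity conditions, so the small-ε limit is the posterior given η(y), which coincides with π(θ|y) exactly when η is sufficient.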

The “Great ABC controversy”

On-going controversy in phylogeographic genetics about the validity of using ABC for testing

Against: Templeton (2008, 2009, 2010a, 2010b, 2010c) argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!)

The probability of the nested special case must be less than or equal to the probability of the general model within which the special case is nested. Any statistic that assigns greater probability to the special case is incoherent. An example of incoherence is shown for the ABC method.
Templeton, PNAS, 2010

Incoherent methods, such as ABC, Bayes factor, or any simulation approach that treats all hypotheses as mutually exclusive, should never be used with logically overlapping hypotheses.
Templeton, PNAS, 2010

The central equation of ABC

\[
P(H_i \mid H, S^*) = \frac{G_i(\lVert S_i - S^*\rVert)\,\Pi_i}{\sum_{j=1}^{n} G_j(\lVert S_j - S^*\rVert)\,\Pi_j}
\]

is inherently incoherent. This fundamental equation is mathematically incorrect in every instance of overlap.
Templeton, PNAS, 2010

Replies: Fagundes et al. (2008), Beaumont et al. (2010), Berger et al. (2010), and Csilléry et al. (2010) point out that the criticisms are addressed at [Bayesian] model-based inference and have nothing to do with ABC...

ABC is a statistically valid approach, alongside other computational statistical techniques that have been successfully used to infer parameters and compare models in population genetics.
Beaumont et al., Molecular Ecology, 2010

The confusion seems to arise from misunderstanding the difference between scientific hypotheses and their mathematical representation. Bayes’ theorem shows that the simpler model can indeed have a much higher posterior probability.
Berger et al., PNAS, 2010
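
A minimal worked example of Berger et al.’s point (an added illustration, not on the slide): take x ∼ N(θ, 1) and compare the nested H1: θ = 0 with the encompassing H2: θ ∼ N(0, 1), whose marginal is N(0, 2). The Bayes factor in favour of the nested model is

\[
B_{12}(x) = \frac{\varphi(x)}{\varphi\!\bigl(x/\sqrt{2}\bigr)/\sqrt{2}}\,, \qquad B_{12}(0) = \sqrt{2} \approx 1.41\,,
\]

so with equal prior weights P(H1 | x = 0) ≈ 0.59 > 1/2: the special case legitimately receives the higher posterior probability, because the two hypotheses carry different priors on θ rather than standing in a set-inclusion relation between events.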

Aitkin’s alternative

Without a specific alternative, the best we can do is to make posterior probability statements about µ and transfer these to the posterior distribution of the likelihood ratio.
Aitkin, Statistical Inference, 2010

Proposal to examine the posterior distribution of the likelihood function: compare models via the “posterior distribution” of the likelihood ratio

L1(θ1|x)/L2(θ2|x),

with θ1 ∼ π1(θ1|x) and θ2 ∼ π2(θ2|x).

Using the data “twice”

A persistent criticism of the posterior likelihood approach has been based on the claim that these approaches are ‘using the data twice’, or are ‘violating temporal coherence’ — Aitkin, Statistical Inference, 2010

Complete separation between the two models, due to simulation under the product of the posterior distributions: this replaces standard Bayesian inference under the joint posterior of (θ1, θ2),

\[
p_1\, m_1(x)\, \pi_1(\theta_1\mid x)\, \pi_2(\theta_2) \;+\; p_2\, m_2(x)\, \pi_2(\theta_2\mid x)\, \pi_1(\theta_1)\,,
\]

by the product π1(θ1|x) π2(θ2|x) of the two posteriors.

Illustration

Comparison of a Poisson model against a negative binomial with m = 5 successes, when x = 3.
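
A minimal simulation sketch of this comparison (the priors are assumptions, since the slide does not state them: a Jeffreys-type Gamma(x + 1/2, 1) posterior for the Poisson rate λ, and a Beta(m + 1, x + 1) posterior, i.e. a uniform prior, for the negative binomial success probability p):

import numpy as np
from math import comb, factorial

rng = np.random.default_rng(1)
x, m, N = 3, 5, 100_000

lam = rng.gamma(x + 0.5, 1.0, size=N)   # lambda | x ~ Gamma(x + 1/2, 1)
p = rng.beta(m + 1, x + 1, size=N)      # p | x ~ Beta(m + 1, x + 1)

L1 = np.exp(-lam) * lam**x / factorial(x)       # Poisson likelihood at x
L2 = comb(x + m - 1, x) * p**m * (1 - p)**x     # NegBin likelihood (x failures)

LR = L1 / L2                            # draws from the "posterior" of L1/L2
print("P(L1/L2 > 1 | x) =", (LR > 1).mean())

The reported proportion is the Aitkin-style summary whose interpretation is questioned on the following slides.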

Pros ...

This quite small change to standard Bayesian analysis allows a very general approach to a wide range of apparently different inference problems; a particular advantage of the approach is that it can use the same noninformative priors — Aitkin, Statistical Inference, 2010

the approach is general and resolves the difficulties with the Bayesian processing of point null hypotheses;

the approach allows for the use of generic noninformative and improper priors;

the approach handles more naturally the “vexed question of model fit”;

the approach is “simple”.

... & cons

The p-value is equal to the posterior probability that the likelihood ratio, for null hypothesis to alternative, is greater than 1 (...) The posterior probability is p that the posterior probability of H0 is greater than 0.5.
Aitkin, Statistical Inference, 2010

the approach is not Bayesian (product of the posteriors)

the approach uses indeterminate entities (“posterior probability that the posterior probability is larger than 0.5”...)

the approach tries to get as close as possible to the p-value
