reading "bayesian measures of model complexity and fit"


Bayesian measures of model complexity and fit

by D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde, 2002

presented by Ilaria Masiani

TSI-EuroBayes student, Université Paris Dauphine

Reading seminar on Classics, October 21, 2013


Presentation of the paper

Bayesian measures of model complexity and fit, by David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin and Angelika van der Linde. Published in 2002 in J. Royal Statistical Society, series B, vol. 64, Part 4, pp. 583-639.


Outline

1 Introduction

2 Complexity of a Bayesian model

3 Forms for pD

4 Diagnostics for fit

5 Model comparison criterion

6 Examples

7 Conclusion


Introduction

Model comparison:
- measure of fit (e.g. deviance statistic)
- complexity (number of free parameters in the model)

=⇒ Trade-off between these two quantities


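Some of the usual model comparison criteria:
- Akaike information criterion: $\mathrm{AIC} = -2\log\{p(y \mid \hat\theta)\} + 2p$
- Bayesian information criterion: $\mathrm{BIC} = -2\log\{p(y \mid \hat\theta)\} + p\log(n)$

where $\hat\theta$ is the maximum likelihood estimate, p the number of free parameters and n the number of observations.

The problem: both require knowing p.

p is sometimes not clearly defined, e.g. in complex hierarchical models.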
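When p is known, both criteria are straightforward to compute. The following is a minimal sketch for an i.i.d. normal model with unknown mean and variance (p = 2) fitted by maximum likelihood; the data values and the function names are made up for illustration.

```python
import math

def normal_max_loglik(y):
    """Maximised log-likelihood of an i.i.d. normal model (MLE mean and variance)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n  # the MLE of the variance uses divisor n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def aic(loglik, p):
    """AIC = -2 log p(y | theta_hat) + 2p."""
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    """BIC = -2 log p(y | theta_hat) + p log(n)."""
    return -2 * loglik + p * math.log(n)

y = [4.9, 5.6, 5.1, 6.2, 4.4, 5.8, 5.3, 5.0, 6.1, 4.7]  # hypothetical data
ll = normal_max_loglik(y)
print("AIC:", aic(ll, p=2))
print("BIC:", bic(ll, p=2, n=len(y)))
```

The two criteria differ only in the penalty, so BIC exceeds AIC as soon as log(n) > 2.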


=⇒ This paper suggests Bayesian measures of complexity and fit that can be combined to compare complex models.

Complexity of a Bayesian model

Complexity reflects the ’difficulty in estimation’.

Measure of complexity may depend on:
- prior information
- observed data


True model

’All models are wrong, but some are useful’Box (1976)


- $p_t(Y)$: 'true' distribution of unobserved future data Y
- $\theta^t$: 'pseudotrue' parameter value
- $p(Y \mid \theta^t)$: likelihood specified by $\theta^t$


Residual information

Residual information in data y conditional on θ:

$$-2\log\{p(y \mid \theta)\}$$

up to a multiplicative constant (Kullback and Leibler, 1951).

Given an estimator $\tilde\theta(y)$ of $\theta^t$, the excess of the true over the estimated residual information is:

$$d_\Theta\{y, \theta^t, \tilde\theta(y)\} = -2\log\{p(y \mid \theta^t)\} + 2\log[p\{y \mid \tilde\theta(y)\}]$$


Bayesian measure of model complexity

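- the unknown $\theta^t$ is replaced by a random variable θ
- $d_\Theta\{y, \theta, \tilde\theta(y)\}$ is estimated by its posterior expectation w.r.t. $p(\theta \mid y)$:

$$p_D\{y, \Theta, \tilde\theta(y)\} = E_{\theta \mid y}[d_\Theta\{y, \theta, \tilde\theta(y)\}] = E_{\theta \mid y}[-2\log\{p(y \mid \theta)\}] + 2\log[p\{y \mid \tilde\theta(y)\}]$$

pD is proposed as the effective number of parameters with respect to a model with focus Θ.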


Effective number of parameters

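- typically $\tilde\theta(y) = E(\theta \mid y) = \bar\theta$
- $f(y)$: fully specified standardizing term, a function of the data alone

Definition

$$p_D = \overline{D(\theta)} - D(\bar\theta) \qquad (1)$$

where

$$D(\theta) = -2\log\{p(y \mid \theta)\} + 2\log\{f(y)\}$$

is the 'Bayesian deviance' and $\overline{D(\theta)} = E_{\theta \mid y}\{D(\theta)\}$ is its posterior mean.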
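Definition (1) translates directly into a computation on posterior draws. The sketch below is an illustration only, under assumptions not taken from the slides: i.i.d. N(θ, σ²) data with known σ, a conjugate normal prior on θ, f(y) = 1, and simulated data. Because the posterior is available in closed form, independent draws stand in for MCMC output, and the result can be checked against the exact value for this model.

```python
import math
import random

def deviance(theta, y, sigma):
    """D(theta) = -2 log p(y | theta) for i.i.d. N(theta, sigma^2) data, taking f(y) = 1."""
    n = len(y)
    rss = sum((v - theta) ** 2 for v in y)
    return n * math.log(2 * math.pi * sigma ** 2) + rss / sigma ** 2

def effective_parameters(theta_draws, y, sigma):
    """p_D = posterior mean of the deviance minus the deviance at the posterior mean."""
    d_bar = sum(deviance(t, y, sigma) for t in theta_draws) / len(theta_draws)
    theta_bar = sum(theta_draws) / len(theta_draws)
    return d_bar - deviance(theta_bar, y, sigma)

rng = random.Random(1)
sigma, prior_mean, prior_sd = 1.0, 0.0, 0.5        # hypothetical settings
y = [rng.gauss(0.3, sigma) for _ in range(20)]     # simulated data

# Conjugate posterior for theta: normal with precision n/sigma^2 + 1/prior_sd^2
post_prec = len(y) / sigma ** 2 + 1 / prior_sd ** 2
post_mean = (sum(y) / sigma ** 2 + prior_mean / prior_sd ** 2) / post_prec
draws = [rng.gauss(post_mean, post_prec ** -0.5) for _ in range(20000)]

p_d = effective_parameters(draws, y, sigma)
exact = (len(y) / sigma ** 2) / post_prec          # closed form for this model
print(round(p_d, 3), round(exact, 3))
```

In this setting the exact value is below 1: the informative prior means that the single parameter θ counts as less than one effective parameter.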


Observations on pD

1 (1) can be rewritten as $\bar D = D(\bar\theta) + p_D$ =⇒ $\bar D$ is a measure of 'adequacy'

2 pD depends on: data, choice of focus Θ, prior information, choice of $\tilde\theta(y)$ =⇒ lack of invariance to transformations

3 using $\tilde\theta(y) = E(\theta \mid y)$, $p_D \ge 0$ for any likelihood that is log-concave in θ (Jensen's inequality) =⇒ negative pDs can indicate conflict between prior and data

4 pD is easily calculated after an MCMC run

Forms for pD

pD for approximately normal likelihoods

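Negligible prior information

Assume $\theta \mid y \sim N(\hat\theta, (-L''_{\hat\theta})^{-1})$, where $\hat\theta$ is the maximum likelihood estimate and $L''_{\hat\theta}$ is the matrix of second derivatives of the log-likelihood at $\hat\theta$. Then, expanding $D(\theta)$ around $\hat\theta$:

$$D(\theta) \approx D(\hat\theta) - (\theta - \hat\theta)^T L''_{\hat\theta} (\theta - \hat\theta) \approx D(\hat\theta) + \chi^2_p$$

$$\Longrightarrow \quad p_D = E_{\theta \mid y}\{D(\theta)\} - D(\hat\theta) \approx p \qquad (2)$$

(under this approximation the posterior mean $\bar\theta$ coincides with $\hat\theta$)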
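The step from the quadratic form to p in result (2) can be spelled out as follows. This is a sketch under the stated normal approximation; the symbol $\Sigma = (-L''_{\hat\theta})^{-1}$ for the posterior covariance is introduced here only for convenience.

```latex
\begin{align*}
\theta \mid y \sim N(\hat\theta, \Sigma)
  &\;\Longrightarrow\; (\theta-\hat\theta)^T \Sigma^{-1} (\theta-\hat\theta) \sim \chi^2_p, \\
E_{\theta \mid y}\{D(\theta)\} - D(\hat\theta)
  &\approx E_{\theta \mid y}\{(\theta-\hat\theta)^T \Sigma^{-1} (\theta-\hat\theta)\}
   = \operatorname{tr}(\Sigma^{-1}\Sigma) = p .
\end{align*}
```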


pD for normal likelihoods

General hierarchical normal model (known variance)

$$y \sim N(A_1\theta, C_1)$$

$$\theta \sim N(A_2\phi, C_2)$$

Then $\theta \mid y$ is normal with mean $\bar\theta = Vb$ and covariance $V$.

$$\Longrightarrow \quad p_D = \mathrm{tr}(-L''V)$$

where $-L'' = A_1^T C_1^{-1} A_1$ is the Fisher information.

In this case, pD is invariant to affine transformations of θ.


In normal models: $\hat y = Hy$, with $H$ the hat matrix (which projects the data onto the fitted values) =⇒ $H = A_1 V A_1^T C_1^{-1}$

Then $p_D = \mathrm{tr}(H)$

tr(H) = sum of leverages (influence of each observation on its fitted value)
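To make tr(H) concrete, consider a special case in which everything is diagonal. The sketch below assumes $A_1 = I$, $C_1 = \mathrm{diag}(s_i^2)$, $C_2 = \text{prior\_var} \cdot I$ and a known prior mean, so that $V = (C_1^{-1} + C_2^{-1})^{-1}$ and H is diagonal with entries prior_var / (prior_var + s_i²). These choices, the numbers and the function name are illustrative assumptions, not taken from the slides.

```python
def shrinkage_leverages(sampling_vars, prior_var):
    """Diagonal of H for y_i ~ N(theta_i, s_i^2), theta_i ~ N(phi, prior_var), phi known."""
    return [prior_var / (prior_var + s2) for s2 in sampling_vars]

sampling_vars = [0.5, 1.0, 2.0, 4.0]    # hypothetical C1 = diag(s_i^2)
for prior_var in (0.01, 1.0, 100.0):    # hypothetical C2 = prior_var * I
    leverages = shrinkage_leverages(sampling_vars, prior_var)
    # p_D = tr(H) is the sum of the leverages
    print(prior_var, round(sum(leverages), 3))
```

A tight prior shrinks every leverage towards 0 (pD near 0), while a vague prior leaves each observation to determine its own fitted value (pD near the number of observations).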

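Conjugate normal-gamma model (unknown precision τ)

$$y \sim N(A_1\theta, \tau^{-1}C_1)$$

$$\theta \sim N(A_2\phi, \tau^{-1}C_2)$$

$$p_D = \mathrm{tr}(H) + q(\bar\theta)(\bar\tau - \hat\tau) - n\{\overline{\log(\tau)} - \log(\hat\tau)\}$$

where $q(\theta) = (y - A_1\theta)^T C_1^{-1} (y - A_1\theta)$, $\hat\tau$ is the plug-in estimate of τ, and bars denote posterior means.

It can be shown that for large n the choice of parameterization of τ will make little difference to pD.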


pD for exponential family likelihoods

One-parameter exponential family

Definition

Assume p groups of observations, where the $n_i$ observations in group i all have the same distribution. For the j-th observation in the i-th group:

$$\log\{p(y_{ij} \mid \theta_i, \phi)\} = w_i\{y_{ij}\theta_i - b(\theta_i)\}/\phi + c(y_{ij}, \phi)$$

where

$$\mu_i = E(Y_{ij} \mid \theta_i, \phi) = b'(\theta_i)$$

$$V(Y_{ij} \mid \theta_i, \phi) = b''(\theta_i)\,\phi/w_i$$

and $w_i$ is a constant.

If Θ is the focus and $\bar b_i = E_{\theta_i \mid y}\{b(\theta_i)\}$, then the contribution of the i-th group to the effective number of parameters is:

$$2 n_i w_i\{\bar b_i - b(\bar\theta_i)\}/\phi$$

=⇒ lack of invariance of pD to reparametrization
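A small numerical sketch may help to see this lack of invariance. It assumes a Poisson group ($b(\theta) = \exp(\theta)$, $w_i = \phi = 1$) whose canonical parameter has a normal posterior N(post_mean, post_var); the normal posterior, the numbers and the function names are assumptions made here for illustration. Under these assumptions the contribution with focus on θ follows the formula above, while plugging in the posterior mean of $\mu = \exp(\theta)$ instead gives post_var times the sum of the counts in the group.

```python
import math

def pd_canonical(n_i, post_mean, post_var):
    """Group contribution with focus on theta: 2 n_i {E[b(theta)] - b(theta_bar)}, b = exp."""
    b_bar = math.exp(post_mean + post_var / 2)   # E[exp(theta)] under a normal posterior
    return 2 * n_i * (b_bar - math.exp(post_mean))

def pd_mean(y_sum, post_var):
    """Same group, plugging in the posterior mean of mu = exp(theta) instead."""
    # 2 * y_sum * (log E[mu] - E[log mu]) = 2 * y_sum * post_var / 2
    return post_var * y_sum

n_i, y_sum = 5, 10                               # hypothetical group size and total count
post_mean, post_var = math.log(2.0), 0.04        # hypothetical posterior for theta_i
print(pd_canonical(n_i, post_mean, post_var), pd_mean(y_sum, post_var))
```

The two values are close but not equal, and the gap grows with the posterior variance.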


Diagnostics for fit

Sampling theory diagnostics for lack of Bayesian fit

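$E_{\theta \mid y}\{D(\theta)\} = \bar D$ is a measure of fit or 'adequacy'.

If the model is true:

$$E_Y(\bar D) = E_Y[E_{\theta \mid y}\{D(\theta)\}] = E_\theta\left(E_{Y \mid \theta}\left[-2\log\frac{p(Y \mid \theta)}{p\{Y \mid \hat\theta(Y)\}}\right]\right) \approx E_\theta[E_{Y \mid \theta}(\chi^2_p)] = E_\theta(p) = p$$

where the standardizing term is the likelihood at the maximum likelihood estimate $\hat\theta(Y)$.

For a one-parameter exponential family p = n, so $E_Y(\bar D) \approx n$.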


Model comparison criterion

Model comparison: the problem

- $Y_{rep}$ = independent replicate data set
- $L(Y, \theta)$ = loss in assigning to data Y a probability $p(Y \mid \theta)$
- $L\{y, \tilde\theta(y)\}$ = 'apparent' loss in repredicting the observed y

$$E_{Y_{rep} \mid \theta^t}[L\{Y_{rep}, \tilde\theta(y)\}] = L\{y, \tilde\theta(y)\} + c_\Theta\{y, \theta^t, \tilde\theta(y)\}$$

where $c_\Theta$ is the 'optimism' associated with the estimator $\tilde\theta(y)$ (Efron, 1986)


Assuming $L(Y, \theta) = -2\log\{p(Y \mid \theta)\}$, there are two ways to estimate $c_\Theta$:

1 Classical approach: attempts to estimate the sampling expectation of $c_\Theta$

2 Bayesian approach: direct calculation of the posterior expectation of $c_\Theta$

Classical criteria for model comparison

Expected optimism: $\pi(\theta^t) = E_{Y \mid \theta^t}[c_\Theta\{Y, \theta^t, \tilde\theta(Y)\}]$

All criteria for model comparison are based on minimizing

$$E_{Y_{rep} \mid \theta^t}[L\{Y_{rep}, \tilde\theta(y)\}] = L\{y, \tilde\theta(y)\} + \pi(\theta^t)$$

Efron (1986): for the log-loss function, $\pi_E(\theta^t) \approx 2p$. The criterion can then be read as a plug-in estimate of fit plus twice the effective number of parameters in the model.

Bayesian criteria for model comparison

AIM: identify models that best explain the observed data, but with the expectation that they minimize uncertainty about observations generated in the same way

Deviance information criterion (DIC)

Definition

$$\mathrm{DIC} = D(\bar\theta) + 2p_D = \bar D + p_D$$

- Classical estimate of fit + twice the effective number of parameters
- Also a Bayesian measure of fit, penalized by complexity pD
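Given the deviance evaluated at each posterior draw and at the posterior mean of the parameters, the definition above reduces to a few lines. The function name and the numbers below are hypothetical; the draws stand in for the output of an MCMC run.

```python
def dic_summary(deviance_draws, deviance_at_mean):
    """Return the posterior mean deviance, p_D and DIC from deviance values."""
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_mean          # effective number of parameters
    return {"Dbar": d_bar, "pD": p_d, "DIC": d_bar + p_d}

# Hypothetical deviance values at four posterior draws and at the posterior mean
summary = dic_summary([100.0, 104.0, 102.0, 106.0], deviance_at_mean=101.0)
print(summary)
```

Both forms of the definition agree here: the deviance at the posterior mean plus twice pD equals the posterior mean deviance plus pD.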

DIC and AIC

Akaike information criterion =⇒ $\mathrm{AIC} = 2p - 2\log\{p(y \mid \hat\theta)\}$, with $\hat\theta$ = MLE

From result (2): $p_D \approx p$ in models with negligible prior information =⇒ $\mathrm{DIC} \approx 2p + D(\bar\theta)$
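Written side by side, and assuming the standardizing term is $f(y) = 1$ so that $D(\theta) = -2\log\{p(y \mid \theta)\}$, the correspondence between the two criteria reads as follows (a sketch, valid only under the approximations named on the right):

```latex
\begin{align*}
\mathrm{AIC} &= D(\hat\theta) + 2p, \\
\mathrm{DIC} &= D(\bar\theta) + 2p_D \approx D(\hat\theta) + 2p = \mathrm{AIC}
  \quad\text{when } \bar\theta \approx \hat\theta \text{ and } p_D \approx p .
\end{align*}
```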

Examples

Spatial distribution of lip cancer in Scotland

Data on the rates of lip cancer in 56 districts in Scotland (Clayton and Kaldor, 1987; Breslow and Clayton, 1993)

- $y_i$: observed number of cases in county i
- $E_i$: expected number of cases in county i
- $A_i$: list, for county i, of its $n_i$ adjacent counties

$$y_i \sim \mathrm{Pois}(\exp\{\theta_i\}E_i)$$

$\exp\{\theta_i\}$: underlying true area-specific relative risk of lip cancer


Candidate models for θi

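- Model 1: $\theta_i = \alpha_0$ (pooled)
- Model 2: $\theta_i = \alpha_0 + \gamma_i$ (exchangeable random effect)
- Model 3: $\theta_i = \alpha_0 + \delta_i$ (spatial random effect)
- Model 4: $\theta_i = \alpha_0 + \gamma_i + \delta_i$ (exchangeable + spatial effects)
- Model 5: $\theta_i = \alpha_i$ (saturated)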

Priors

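- $\alpha_0$: improper uniform prior
- $\alpha_i$ (i = 1, ..., 56): normal priors with large variance
- $\gamma_i \sim N(0, \lambda_\gamma^{-1})$
- $\delta_i \mid \delta_{\setminus i} \sim N\left(\frac{1}{n_i}\sum_{j \in A_i}\delta_j,\ \frac{1}{n_i\lambda_\delta}\right)$, subject to $\sum_{i=1}^{56}\delta_i = 0$: conditional autoregressive prior (Besag, 1974)
- $\lambda_\gamma, \lambda_\delta \sim \mathrm{Gamma}(0.5, 0.0005)$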

Saturated deviance

$$D(\theta) = 2\sum_i \left[ y_i \log\{y_i/(\exp(\theta_i)E_i)\} - \{y_i - \exp(\theta_i)E_i\} \right]$$

(McCullagh and Nelder, 1989, p. 34)

obtained by taking as standardizing factor:

$$-2\log\{f(y)\} = -2\sum_i \log\{p(y_i \mid \hat\theta_i)\} = 208.0$$
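The saturated deviance above can be evaluated with a short function. This is a sketch with made-up counts, not the Scottish data; the function name is hypothetical, and the term y log(y/μ) is taken to be 0 when y = 0, which is its limiting value.

```python
import math

def poisson_saturated_deviance(y, expected, theta):
    """D(theta) = 2 * sum_i [ y_i log(y_i / mu_i) - (y_i - mu_i) ], mu_i = exp(theta_i) E_i."""
    total = 0.0
    for y_i, e_i, t_i in zip(y, expected, theta):
        mu_i = math.exp(t_i) * e_i
        # y log(y / mu) is taken as 0 when y = 0 (its limiting value)
        log_term = y_i * math.log(y_i / mu_i) if y_i > 0 else 0.0
        total += log_term - (y_i - mu_i)
    return 2 * total

y = [3, 0, 7]                       # hypothetical observed counts
expected = [2.0, 1.5, 5.0]          # hypothetical expected counts
pooled = poisson_saturated_deviance(y, expected, [0.0, 0.0, 0.0])
print(round(pooled, 3))
```

Areas whose fitted mean equals the observed count contribute nothing, which is what makes this form convenient as an absolute measure of fit.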

Results

For each model, two independent chains of MCMC (WinBUGS)for 15000 iterations each (burn-in after 5000 it.)

Deviance summaries using three alternative parameterizations(mean, canonical, median).

Deviance calculations

- $\bar D$: mean of the posterior samples of the saturated deviance
- $D(\bar\mu)$: by plugging the posterior mean of $\mu_i = \exp(\theta_i)E_i$ into the saturated deviance
- $D(\bar\theta)$: by plugging the posterior means of $\alpha_0, \alpha_i, \gamma_i, \delta_i$ into the linear predictor $\theta_i$
- $D(\mathrm{med})$: by plugging the posterior median of $\theta_i$ into the saturated deviance
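The three plug-in choices can be compared on a single area. In the sketch below, normal draws of θ stand in for MCMC output, and the count, expected count and variable names are all hypothetical; it only illustrates that the same posterior sample gives different pD values depending on which summary is plugged in.

```python
import math
import random
import statistics

def unit_deviance(y, mu):
    """Saturated Poisson deviance contribution of one area with fitted mean mu."""
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    return 2 * (log_term - (y - mu))

rng = random.Random(0)
y_i, e_i = 9, 4.0                                          # hypothetical count and expected count
theta_draws = [rng.gauss(0.7, 0.3) for _ in range(5000)]   # stand-in for MCMC output
mu_draws = [math.exp(t) * e_i for t in theta_draws]

# Posterior mean deviance, then the deviance at three different plug-in values
d_bar = statistics.fmean([unit_deviance(y_i, m) for m in mu_draws])
plug_ins = {
    "mean": statistics.fmean(mu_draws),
    "canonical": math.exp(statistics.fmean(theta_draws)) * e_i,
    "median": math.exp(statistics.median(theta_draws)) * e_i,
}
p_d = {name: d_bar - unit_deviance(y_i, mu) for name, mu in plug_ins.items()}
print({name: round(value, 3) for name, value in p_d.items()})
```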

Observations on pDs results

From result (2): pD ≈ p
- pooled model 1: pD = 1.0
- saturated model 5: pD from 52.8 to 55.9
- models 3-4 with spatial random effects: pD around 31
- model 2 with only exchangeable random effects: pD around 43

Comparison of DIC

DIC is subject to Monte Carlo sampling error (it is a function of stochastic quantities)

Either of models 3 or 4 is superior to the others

Models 2 and 5 are superior to model 1

Absolute measure of fit: compare $\bar D$ with n = 56

All models (except pooled model 1) have an adequate overall fit to the data =⇒ comparison essentially based on pDs


Six-cities study

Subset of data from the six-cities study: longitudinal study of health effects of air pollution (Fitzmaurice and Laird, 1993)

- $y_{ij}$: repeated binary measurement of the wheezing status of child i at time j (1, yes; 0, no), i = 1, ..., I, j = 1, ..., J
- I = 537 children living in Steubenville, Ohio
- J = 4 time points
- $a_{ij}$: age of child i in years at measurement point j (7, 8, 9, 10 years)
- $s_i$: smoking status of child i's mother (1, yes; 0, no)


Conditional response model

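$$Y_{ij} \sim \mathrm{Bernoulli}(p_{ij})$$

$$p_{ij} = \Pr(Y_{ij} = 1) = g^{-1}(\mu_{ij})$$

$$\mu_{ij} = \beta_0 + \beta_1 z_{ij1} + \beta_2 z_{ij2} + \beta_3 z_{ij3} + b_i$$

- $z_{ijk} = x_{ijk} - \bar x_{..k}$, k = 1, 2, 3
- $x_{ij1} = a_{ij}$, $x_{ij2} = s_i$, $x_{ij3} = a_{ij}s_i$
- $b_i$: individual-specific random effects, $b_i \sim N(0, \lambda^{-1})$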

Model choice: link function g(·)

Model 1: $g(p_{ij}) = \mathrm{logit}(p_{ij}) = \log\{p_{ij}/(1 - p_{ij})\}$

Model 2: $g(p_{ij}) = \mathrm{probit}(p_{ij}) = \Phi^{-1}(p_{ij})$

Model 3: $g(p_{ij}) = \mathrm{cloglog}(p_{ij}) = \log\{-\log(1 - p_{ij})\}$
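Since the model is written as $p_{ij} = g^{-1}(\mu_{ij})$, what is needed in practice is the inverse of each link. The sketch below gives the three inverses using only the Python standard library; the function names are chosen here for illustration.

```python
import math
from statistics import NormalDist

def inv_logit(eta):
    """Inverse of logit(p) = log{p / (1 - p)}."""
    return 1 / (1 + math.exp(-eta))

def inv_probit(eta):
    """Inverse of probit(p): the standard normal distribution function."""
    return NormalDist().cdf(eta)

def inv_cloglog(eta):
    """Inverse of cloglog(p) = log{-log(1 - p)}."""
    return 1 - math.exp(-math.exp(eta))

for eta in (-2.0, 0.0, 2.0):   # hypothetical values of the linear predictor
    print(eta, inv_logit(eta), inv_probit(eta), inv_cloglog(eta))
```

The logit and probit links are symmetric around a linear predictor of 0, where both give probability 0.5, while the complementary log-log link is not.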

Priors and deviance form

- $\beta_k$: flat priors
- $\lambda \sim \mathrm{Gamma}(0.001, 0.001)$

$$D = -2\sum_{i,j}\{y_{ij}\log(p_{ij}) + (1 - y_{ij})\log(1 - p_{ij})\}$$
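The Bernoulli deviance above is simple to evaluate once the fitted probabilities are available. A minimal sketch, with a hypothetical function name and made-up inputs flattened over the indices i and j:

```python
import math

def bernoulli_deviance(y, p):
    """D = -2 * sum{ y log(p) + (1 - y) log(1 - p) } over all binary observations."""
    return -2 * sum(
        math.log(p_ij) if y_ij == 1 else math.log(1 - p_ij)
        for y_ij, p_ij in zip(y, p)
    )

y = [1, 0, 1, 1]                 # hypothetical wheezing indicators
p = [0.5, 0.5, 0.5, 0.5]         # hypothetical fitted probabilities
print(bernoulli_deviance(y, p))
```

With every probability equal to 0.5, each observation contributes 2 log 2 regardless of its value, which gives a quick sanity check.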

Results

Gibbs sampler for 5000 iterations (burn-in after 1000 it.)

Deviance summaries for canonical and mean parameterizations.


Conclusion

- pD may not be invariant to the chosen parametrization
- Similarities to frequentist measures, but based on expectations w.r.t. parameters in place of sampling expectations
- DIC viewed as a Bayesian analogue of AIC: similar justification but wider applicability
- Involves Monte Carlo sampling and negligible analytic work

References

McCullagh, P. and Nelder, J. Generalized Linear Models. 2nd edn. London: Chapman and Hall, 1989.

Besag, J. Spatial interaction and the statistical analysis of lattice systems. J. R. Statist. Soc., series B, 36, 192-236, 1974.

Clayton, D.G. and Kaldor, J. Empirical Bayes estimates of age-standardised relative risk for use in disease mapping. Biometrics, 43, 671-681, 1987.

Efron, B. How biased is the apparent error rate of a prediction rule? J. Am. Statist. Ass., 81, 461-470, 1986.

Fitzmaurice, G. and Laird, N. A likelihood-based method for analysing longitudinal binary responses. Biometrika, 80, 141-151, 1993.

Kullback, S. and Leibler, R.A. On information and sufficiency. Ann. Math. Statist., 22, 79-86, 1951.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. Bayesian measures of model complexity and fit. J. Royal Statistical Society, series B, vol. 64, Part 4, pp. 583-639, 2002.

Thank you.

Questions?
