
Page 1: Bayesian Analysis II

Introduction to Bayesian Inference

Lecture 2: Implementation

Tom Loredo, Dept. of Astronomy, Cornell University

http://www.astro.cornell.edu/staff/loredo/bayes/

CASt Summer School — 11 June 2010

1 / 83

Page 2: Bayesian Analysis II

Lecture 2: Implementation

1 Simple examples
  Binary Outcomes
  Normal Distribution
  Poisson Distribution

2 Bayesian calculation
  Large N: Laplace approximations
  Modest-D: Cubature, Monte Carlo Integration
  High-D: Posterior Sampling

3 Probability and frequency

4 Bayesian Software

5 Closing Reflections

2 / 83

Page 3: Bayesian Analysis II

Implementation

1 Simple examples
  Binary Outcomes
  Normal Distribution
  Poisson Distribution

2 Bayesian calculation
  Large N: Laplace approximations
  Modest-D: Cubature, Monte Carlo Integration
  High-D: Posterior Sampling

3 Probability and frequency

4 Bayesian Software

5 Closing Reflections

3 / 83

Page 4: Bayesian Analysis II

Binary Outcomes:

Parameter Estimation

M = Existence of two outcomes, S and F; for each case or trial, the probability for S is α; for F it is (1 − α)

Hi = Statements about α, the probability for success on the next trial → seek p(α|D,M)

D = Sequence of results from N observed trials:

FFSSSSFSSSFS (n = 8 successes in N = 12 trials)

Likelihood:

p(D|α,M) = p(failure|α,M) × p(failure|α,M) × · · ·
         = α^n (1 − α)^(N−n)
         = L(α)

4 / 83

Page 5: Bayesian Analysis II

Prior

Starting with no information about α beyond its definition, use as an "uninformative" prior p(α|M) = 1. Justifications:

• Intuition: Don't prefer any α interval to any other of same size

• Bayes's justification: "Ignorance" means that before doing the N trials, we have no preference for how many will be successes:

P(n successes|M) = 1/(N + 1)    →    p(α|M) = 1

Consider this a convention—an assumption added to M to make the problem well posed.

5 / 83

Page 6: Bayesian Analysis II

Prior Predictive

p(D|M) = ∫ dα α^n (1 − α)^(N−n) = B(n + 1, N − n + 1) = n!(N − n)!/(N + 1)!

A Beta integral, B(a, b) ≡ ∫ dx x^(a−1)(1 − x)^(b−1) = Γ(a)Γ(b)/Γ(a + b).

6 / 83

Page 7: Bayesian Analysis II

Posterior

p(α|D,M) = [(N + 1)!/(n!(N − n)!)] α^n (1 − α)^(N−n)

A Beta distribution. Summaries:

• Best-fit: α̂ = n/N = 2/3; ⟨α⟩ = (n + 1)/(N + 2) ≈ 0.64

• Uncertainty: σ_α = √[(n + 1)(N − n + 1)/((N + 2)²(N + 3))] ≈ 0.12

Find credible regions numerically, or with the incomplete beta function.

Note that the posterior depends on the data only through n, not the N binary numbers describing the sequence.

n is a (minimal) Sufficient Statistic.
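As a quick numerical cross-check, here is a minimal Python sketch (NumPy/SciPy assumed; the variable names are mine, and an equal-tailed region is used in place of the HPD region) for the n = 8, N = 12 example:

```python
import numpy as np
from scipy import stats

n, N = 8, 12                            # successes, trials; flat prior on alpha
post = stats.beta(n + 1, N - n + 1)     # posterior is Beta(n+1, N-n+1)

print("mode :", n / N)                  # best-fit value, 2/3
print("mean :", post.mean())            # (n+1)/(N+2) ~ 0.64
print("sigma:", post.std())             # ~ 0.12

lo, hi = post.ppf([0.1585, 0.8415])     # equal-tailed 68.3% region via incomplete beta
print(f"68.3% region: [{lo:.3f}, {hi:.3f}]")
```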

7 / 83

Page 8: Bayesian Analysis II

8 / 83

Page 9: Bayesian Analysis II

Binary Outcomes: Model Comparison

Equal Probabilities?

M1: α = 1/2
M2: α ∈ [0, 1] with flat prior.

Maximum Likelihoods

M1: p(D|M1) = 1/2^N = 2.44 × 10^−4

M2: L(α̂) = (2/3)^n (1/3)^(N−n) = 4.82 × 10^−4

p(D|M1)/p(D|α̂,M2) = 0.51

Maximum likelihoods favor M2 (its best-fit α̂ = 2/3 accounts for the unequal numbers of successes and failures).

9 / 83

Page 10: Bayesian Analysis II

Bayes Factor (ratio of model likelihoods)

p(D|M1) = 1/2^N    and    p(D|M2) = n!(N − n)!/(N + 1)!

→ B12 ≡ p(D|M1)/p(D|M2) = (N + 1)!/(n!(N − n)! 2^N) = 1.57

Bayes factor (odds) favors M1 (equiprobable).

Note that for n = 6, B12 = 2.93; for this small amount of data, we can never be very sure results are equiprobable.

If n = 0, B12 ≈ 1/315; if n = 2, B12 ≈ 1/4.8; for extreme data, 12 flips can be enough to lead us to strongly suspect outcomes have different probabilities.

(Frequentist significance tests can reject null for any sample size.)
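A short Python sketch of the same calculation (SciPy assumed; logs of the factorials avoid overflow), reproducing the B12 values quoted above:

```python
import numpy as np
from scipy.special import gammaln

def log_B12(n, N):
    """log Bayes factor for M1 (alpha = 1/2) vs M2 (flat prior on alpha)."""
    log_pD_M1 = -N * np.log(2.0)                                      # 1/2^N
    log_pD_M2 = gammaln(n + 1) + gammaln(N - n + 1) - gammaln(N + 2)  # n!(N-n)!/(N+1)!
    return log_pD_M1 - log_pD_M2

for n in (0, 2, 6, 8):
    print(n, np.exp(log_B12(n, 12)))    # ~1/315, ~1/4.8, 2.93, 1.57
```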

10 / 83

Page 11: Bayesian Analysis II

Binary Outcomes: Binomial Distribution

Suppose D = n (number of heads in N trials), rather than the actual sequence. What is p(α|n,M)?

Likelihood

Let S = a sequence of flips with n heads.

p(n|α,M) = Σ_S p(S|α,M) p(n|S, α,M)
         = α^n (1 − α)^(N−n) × C_{n,N}    [summing over sequences with # successes = n]

C_{n,N} = # of sequences of length N with n heads.

→ p(n|α,M) = [N!/(n!(N − n)!)] α^n (1 − α)^(N−n)

The binomial distribution for n given α, N.

11 / 83

Page 12: Bayesian Analysis II

Posterior

p(α|n,M) = [N!/(n!(N − n)!)] α^n (1 − α)^(N−n) / p(n|M)

p(n|M) = [N!/(n!(N − n)!)] ∫ dα α^n (1 − α)^(N−n) = 1/(N + 1)

→ p(α|n,M) = [(N + 1)!/(n!(N − n)!)] α^n (1 − α)^(N−n)

Same result as when data specified the actual sequence.

12 / 83

Page 13: Bayesian Analysis II

Another Variation: Negative Binomial

Suppose D = N, the number of trials it took to obtain a predefined number of successes, n = 8. What is p(α|N,M)?

Likelihood

p(N|α,M) is the probability for n − 1 successes in N − 1 trials, times the probability that the final trial is a success:

p(N|α,M) = [(N − 1)!/((n − 1)!(N − n)!)] α^(n−1) (1 − α)^(N−n) × α
         = [(N − 1)!/((n − 1)!(N − n)!)] α^n (1 − α)^(N−n)

The negative binomial distribution for N given α, n.

13 / 83

Page 14: Bayesian Analysis II

Posterior

p(α|D,M) = C′_{n,N} α^n (1 − α)^(N−n) / p(D|M)

p(D|M) = C′_{n,N} ∫ dα α^n (1 − α)^(N−n)

→ p(α|D,M) = [(N + 1)!/(n!(N − n)!)] α^n (1 − α)^(N−n)

Same result as other cases.

14 / 83

Page 15: Bayesian Analysis II

Final Variation: “Meteorological Stopping”

Suppose D = (N, n), the number of samples and number of successes in an observing run whose total number was determined by the weather at the telescope. What is p(α|D,M′)?

(M′ adds info about weather to M.)

Likelihood

p(D|α,M′) is the binomial distribution times the probability that the weather allowed N samples, W(N):

p(D|α,M′) = W(N) [N!/(n!(N − n)!)] α^n (1 − α)^(N−n)

Let C_{n,N} = W(N) (N choose n). We get the same result as before!

15 / 83

Page 16: Bayesian Analysis II

Likelihood Principle

To define L(Hi) = p(Dobs|Hi, I), we must contemplate what other data we might have obtained. But the "real" sample space may be determined by many complicated, seemingly irrelevant factors; it may not be well-specified at all. Should this concern us?

Likelihood principle: The result of inferences depends only on how p(Dobs|Hi, I) varies w.r.t. hypotheses. We can ignore aspects of the observing/sampling procedure that do not affect this dependence.

This happens because no sums of probabilities for hypothetical data appear in Bayesian results; Bayesian calculations condition on Dobs.

This is a sensible property that frequentist methods do not share. Frequentist probabilities are "long run" rates of performance, and depend on details of the sample space that are irrelevant in a Bayesian calculation.

16 / 83

Page 17: Bayesian Analysis II

Goodness-of-fit Violates the Likelihood Principle

Theory (H0)
The number of "A" stars in a cluster should be 0.1 of the total.

Observations
5 A stars found out of 96 total stars observed.

Theorist's analysis
Calculate χ² using nA = 9.6 and nX = 86.4.

Significance level is p(> χ²|H0) = 0.12 (or 0.07 using a more rigorous binomial tail area). Theory is accepted.

17 / 83

Page 18: Bayesian Analysis II

Observer's analysis
Actual observing plan was to keep observing until 5 A stars seen!

"Random" quantity is Ntot, not nA; it should follow the negative binomial dist'n. Expect Ntot = 50 ± 21.

p(> χ²|H0) = 0.03. Theory is rejected.

Telescope technician's analysis
A storm was coming in, so the observations would have ended whether 5 A stars had been seen or not. The proper ensemble should take into account p(storm) . . .

Bayesian analysis
The Bayes factor is the same for binomial or negative binomial likelihoods, and slightly favors H0. Include p(storm) if you want—it will drop out!

18 / 83

Page 19: Bayesian Analysis II

Inference With Normals/Gaussians

Gaussian PDF

p(x|µ, σ) = [1/(σ√(2π))] e^(−(x−µ)²/2σ²)    over [−∞,∞]

Common abbreviated notation: x ∼ N(µ, σ²)

Parameters

µ = ⟨x⟩ ≡ ∫ dx x p(x|µ, σ)

σ² = ⟨(x − µ)²⟩ ≡ ∫ dx (x − µ)² p(x|µ, σ)

19 / 83

Page 20: Bayesian Analysis II

Gauss’s Observation: Sufficiency

Suppose our data consist of N measurements, di = µ + ǫi. Suppose the noise contributions are independent, and ǫi ∼ N(0, σ²).

p(D|µ, σ,M) = ∏_i p(di|µ, σ,M)
            = ∏_i p(ǫi = di − µ|µ, σ,M)
            = ∏_i [1/(σ√(2π))] exp[−(di − µ)²/2σ²]
            = [1/(σ^N (2π)^(N/2))] e^(−Q(µ)/2σ²)

20 / 83

Page 21: Bayesian Analysis II

Find dependence of Q on µ by completing the square:

Q = Σ_i (di − µ)²    [Note: Q/σ² = χ²(µ)]

  = Σ_i di² + Σ_i µ² − 2 Σ_i diµ

  = (Σ_i di²) + Nµ² − 2Nµd̄    where d̄ ≡ (1/N) Σ_i di

  = N(µ − d̄)² + (Σ_i di²) − Nd̄²

  = N(µ − d̄)² + Nr²    where r² ≡ (1/N) Σ_i (di − d̄)²

21 / 83

Page 22: Bayesian Analysis II

Likelihood depends on {di} only through d̄ and r:

L(µ, σ) = [1/(σ^N (2π)^(N/2))] exp(−Nr²/2σ²) exp(−N(µ − d̄)²/2σ²)

The sample mean and variance are sufficient statistics.

This is a miraculous compression of information—the normal dist'n is highly abnormal in this respect!

22 / 83

Page 23: Bayesian Analysis II

Estimating a Normal Mean

Problem specification
Model: di = µ + ǫi, ǫi ∼ N(0, σ²), σ is known → I = (σ,M).

Parameter space: µ; seek p(µ|D, σ,M)

Likelihood

p(D|µ, σ,M) = [1/(σ^N (2π)^(N/2))] exp(−Nr²/2σ²) exp(−N(µ − d̄)²/2σ²)
            ∝ exp(−N(µ − d̄)²/2σ²)

23 / 83

Page 24: Bayesian Analysis II

"Uninformative" prior
Translation invariance ⇒ p(µ) ∝ C, a constant.
This prior is improper unless bounded.

Prior predictive/normalization

p(D|σ,M) = ∫ dµ C exp(−N(µ − d̄)²/2σ²) = C (σ/√N) √(2π)

. . . minus a tiny bit from tails, using a proper prior.

24 / 83

Page 25: Bayesian Analysis II

Posterior

p(µ|D, σ,M) = [1/((σ/√N)√(2π))] exp(−N(µ − d̄)²/2σ²)

Posterior is N(d̄, w²), with standard deviation w = σ/√N.

68.3% HPD credible region for µ is d̄ ± σ/√N.

Note that C drops out → limit of infinite prior range is well behaved.

25 / 83

Page 26: Bayesian Analysis II

Informative Conjugate Prior
Use a normal prior, µ ∼ N(µ0, w0²).

Conjugate because the posterior is also normal.

Posterior
Normal N(µ̃, w̃²), but mean, std. deviation "shrink" towards prior.

Define B = w²/(w² + w0²), so B < 1 and B → 0 when w0 is large.

Then

µ̃ = d̄ + B · (µ0 − d̄)

w̃ = w · √(1 − B)

"Principle of stable estimation" — The prior affects estimates only when data are not informative relative to prior.
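A minimal sketch of this conjugate update in Python (NumPy assumed; the function name is mine, and the numbers are taken from the example on the following slide):

```python
import numpy as np

def conjugate_normal_update(dbar, w, mu0, w0):
    """Posterior (mean, width) for a normal mean: sampling width w = sigma/sqrt(N),
    conjugate N(mu0, w0^2) prior, using the shrinkage form above."""
    B = w**2 / (w**2 + w0**2)            # B -> 0 for a broad (uninformative) prior
    mu_post = dbar + B * (mu0 - dbar)    # posterior mean shrinks toward mu0
    w_post = w * np.sqrt(1.0 - B)        # posterior width shrinks below sigma/sqrt(N)
    return mu_post, w_post

# dbar = 3, sigma/sqrt(N) = 1, prior centered at mu0 = 10 with widths 5 and 2
for w0 in (5.0, 2.0):
    print(w0, conjugate_normal_update(3.0, 1.0, 10.0, w0))
```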

26 / 83

Page 27: Bayesian Analysis II

Conjugate normal examples:

• Data have d̄ = 3, σ/√N = 1

• Priors at µ0 = 10, with w0 = {5, 2}

[Figure: prior, likelihood, and posterior densities p(x|·) for the two prior widths.]

27 / 83

Page 28: Bayesian Analysis II

Estimating a Normal Mean: Unknown σ

Problem specification
Model: di = µ + ǫi, ǫi ∼ N(0, σ²), σ is unknown

Parameter space: (µ, σ); seek p(µ|D,M)

Likelihood

p(D|µ, σ,M) = [1/(σ^N (2π)^(N/2))] exp(−Nr²/2σ²) exp(−N(µ − d̄)²/2σ²)
            ∝ (1/σ^N) e^(−Q/2σ²)    where Q = N[r² + (µ − d̄)²]

28 / 83

Page 29: Bayesian Analysis II

Uninformative Priors
Assume priors for µ and σ are independent.

Translation invariance ⇒ p(µ) ∝ C, a constant.

Scale invariance ⇒ p(σ) ∝ 1/σ (flat in log σ).

Joint Posterior for µ, σ

p(µ, σ|D,M) ∝ (1/σ^(N+1)) e^(−Q(µ)/2σ²)

29 / 83

Page 30: Bayesian Analysis II

Marginal Posterior

p(µ|D,M) ∝ ∫ dσ (1/σ^(N+1)) e^(−Q/2σ²)

Let τ = Q/2σ², so σ = √(Q/2τ) and |dσ| = ½ τ^(−3/2) √(Q/2) dτ

⇒ p(µ|D,M) ∝ 2^(N/2) Q^(−N/2) ∫ dτ τ^(N/2 − 1) e^(−τ) ∝ Q^(−N/2)

30 / 83

Page 31: Bayesian Analysis II

Write Q = Nr² [1 + ((µ − d̄)/r)²] and normalize:

p(µ|D,M) = [(N/2 − 1)! / ((N/2 − 3/2)! √π)] (1/r) [1 + (1/N)((µ − d̄)/(r/√N))²]^(−N/2)

"Student's t distribution," with t = (µ − d̄)/(r/√N)

A "bell curve," but with power-law tails

Large N:

p(µ|D,M) ∼ e^(−N(µ−d̄)²/2r²)

This is the rigorous way to "adjust σ so χ²/dof = 1."

It doesn't just plug in a best σ; it slightly broadens the posterior to account for σ uncertainty.
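A sketch of how this marginal can be evaluated in Python (SciPy assumed): the Q^(−N/2) form above is, after a standard reparameterization, a Student's t with N − 1 degrees of freedom, location d̄, and scale r/√(N − 1). The function name and toy data are mine.

```python
import numpy as np
from scipy import stats

def mu_posterior(data):
    """Marginal posterior for mu (flat prior on mu, 1/sigma prior on sigma):
    Student's t with N-1 degrees of freedom, location dbar, scale r/sqrt(N-1)."""
    d = np.asarray(data, dtype=float)
    N = d.size
    dbar = d.mean()
    r = d.std()                          # r^2 = (1/N) * sum (d_i - dbar)^2
    return stats.t(df=N - 1, loc=dbar, scale=r / np.sqrt(N - 1))

rng = np.random.default_rng(1)
post = mu_posterior(rng.normal(5.0, 2.0, size=10))
print(post.mean(), post.interval(0.683))  # point estimate and 68.3% credible region
```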

31 / 83

Page 32: Bayesian Analysis II

Student t examples:

• p(x) ∝ [1 + x²/n]^(−(n+1)/2)

• Location = 0, scale = 1

• Degrees of freedom n = {1, 2, 3, 5, 10, ∞}

[Figure: Student's t densities p(x) for n = 1, 2, 3, 5, 10, and the normal (n → ∞) limit.]

32 / 83

Page 33: Bayesian Analysis II

Gaussian Background Subtraction

Measure background rate b = b̂ ± σ_b with source off.

Measure total rate r = r̂ ± σ_r with source on.

Infer signal source strength s, where r = s + b.

With flat priors,

p(s, b|D,M) ∝ exp[−(b − b̂)²/2σ_b²] × exp[−(s + b − r̂)²/2σ_r²]

33 / 83

Page 34: Bayesian Analysis II

Marginalize b to summarize the results for s (complete the square to isolate the b dependence; then do a simple Gaussian integral over b):

p(s|D,M) ∝ exp[−(s − ŝ)²/2σ_s²],    ŝ = r̂ − b̂,    σ_s² = σ_r² + σ_b²

⇒ Background subtraction is a special case of background marginalization; i.e., marginalization "told us" to subtract a background estimate.

Recall the standard derivation of background uncertainty via "propagation of errors" based on Taylor expansion (the statistician's Delta-method).

Marginalization provides a generalization of error propagation—without approximation!

34 / 83

Page 35: Bayesian Analysis II

Bayesian Curve Fitting & Least Squares

Setup

Data D = {di} are measurements of an underlying function f(x; θ) at N sample points {xi}. Let fi(θ) ≡ f(xi; θ):

di = fi(θ) + ǫi,    ǫi ∼ N(0, σi²)

We seek to learn θ, or to compare different functional forms (model choice, M).

Likelihood

p(D|θ,M) = ∏_{i=1}^N [1/(σi√(2π))] exp[−(1/2)((di − fi(θ))/σi)²]
         ∝ exp[−(1/2) Σ_i ((di − fi(θ))/σi)²]
         = exp[−χ²(θ)/2]

35 / 83

Page 36: Bayesian Analysis II

Bayesian Curve Fitting & Least Squares

Posterior
For prior density π(θ),

p(θ|D,M) ∝ π(θ) exp[−χ²(θ)/2]

If you have a least-squares or χ² code:

• Think of χ²(θ) as −2 log L(θ).

• Bayesian inference amounts to exploration and numerical integration of π(θ) e^(−χ²(θ)/2).
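A minimal sketch of this recipe in Python (NumPy assumed; the one-parameter straight-line model, data values, and function names are made up for illustration):

```python
import numpy as np

# Hypothetical one-parameter model: d_i = theta * x_i + noise, all sigma_i = 0.3
x = np.array([1.0, 2.0, 3.0, 4.0])
d = np.array([2.1, 3.9, 6.2, 7.8])
sigma = 0.3

def chisq(theta):
    return np.sum(((d - theta * x) / sigma) ** 2)

def log_post(theta):
    return 0.0 - 0.5 * chisq(theta)      # flat prior; chi^2 plays the role of -2 log L

# Explore and normalize on a grid (fine for a single parameter)
grid = np.linspace(1.5, 2.5, 1001)
lp = np.array([log_post(t) for t in grid])
p = np.exp(lp - lp.max())                # shift to avoid underflow
p /= np.trapz(p, grid)
print("posterior mean:", np.trapz(grid * p, grid))
```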

36 / 83

Page 37: Bayesian Analysis II

Important Case: Separable Nonlinear Models

A (linearly) separable model has parameters θ = (A, ψ):

• Linear amplitudes A = {Aα}

• Nonlinear parameters ψ

f(x; θ) is a linear superposition of M nonlinear components gα(x; ψ):

di = Σ_{α=1}^M Aα gα(xi; ψ) + ǫi,    or    d⃗ = Σ_α Aα g⃗α(ψ) + ǫ⃗

Why this is important: You can marginalize over A analytically

→ Bretthorst algorithm ("Bayesian Spectrum Analysis & Param. Est'n" 1988)

Algorithm is closely related to linear least squares/diagonalization/SVD.

37 / 83

Page 38: Bayesian Analysis II

Poisson Dist’n: Infer a Rate from Counts

Problem:
Observe n counts in T; infer rate, r

Likelihood

L(r) ≡ p(n|r,M) = [(rT)^n / n!] e^(−rT)

Prior
Two simple standard choices (or conjugate gamma dist'n):

• r known to be nonzero; it is a scale parameter:

p(r|M) = [1/ln(ru/rl)] (1/r)

• r may vanish; require p(n|M) ∼ Const:

p(r|M) = 1/ru

38 / 83

Page 39: Bayesian Analysis II

Prior predictive

p(n|M) = (1/ru)(1/n!) ∫_0^ru dr (rT)^n e^(−rT)

       = [1/(ruT)](1/n!) ∫_0^(ruT) d(rT) (rT)^n e^(−rT)

       ≈ 1/(ruT)    for ru ≫ n/T

Posterior
A gamma distribution:

p(r|n,M) = [T(rT)^n / n!] e^(−rT)

39 / 83

Page 40: Bayesian Analysis II

Gamma Distributions

A 2-parameter family of distributions over nonnegative x, with shape parameter α and scale parameter s:

pΓ(x|α, s) = [1/(sΓ(α))] (x/s)^(α−1) e^(−x/s)

Moments:

E(x) = sα,    Var(x) = s²α

Our posterior corresponds to α = n + 1, s = 1/T.

• Mode r̂ = n/T; mean ⟨r⟩ = (n + 1)/T (shift down 1 with 1/r prior)

• Std. dev'n σ_r = √(n + 1)/T; credible regions found by integrating (can use incomplete gamma function)
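A minimal Python sketch of these summaries (SciPy assumed; the counts and exposure are made-up numbers):

```python
import numpy as np
from scipy import stats

n, T = 16, 4.0                               # counts and exposure (made-up numbers)
post = stats.gamma(a=n + 1, scale=1.0 / T)   # posterior for the rate r, flat prior

print("mode :", n / T)                       # n/T
print("mean :", post.mean())                 # (n+1)/T
print("sigma:", post.std())                  # sqrt(n+1)/T
lo, hi = post.ppf([0.1585, 0.8415])          # equal-tailed region via incomplete gamma
print(f"68.3% region: [{lo:.2f}, {hi:.2f}]")
```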

40 / 83

Page 41: Bayesian Analysis II

41 / 83

Page 42: Bayesian Analysis II

The flat prior
Bayes's justification: Not that ignorance of r → p(r|I) = C

Require (discrete) predictive distribution to be flat:

p(n|I) = ∫ dr p(r|I) p(n|r, I) = C    →    p(r|I) = C

Useful conventions

• Use a flat prior for a rate that may be zero

• Use a log-flat prior (∝ 1/r) for a nonzero scale parameter

• Use proper (normalized, bounded) priors

• Plot posterior with abscissa that makes prior flat

42 / 83

Page 43: Bayesian Analysis II

The On/Off Problem

Basic problem

• Look off-source; unknown background rate b

Count Noff photons in interval Toff

• Look on-source; rate is r = s + b with unknown signal s

Count Non photons in interval Ton

• Infer s

Conventional solution

b̂ = Noff/Toff;    σ_b = √Noff/Toff

r̂ = Non/Ton;    σ_r = √Non/Ton

ŝ = r̂ − b̂;    σ_s = √(σ_r² + σ_b²)

But ŝ can be negative!

43 / 83

Page 44: Bayesian Analysis II

Examples

Spectra of X-Ray Sources

[Figures: X-ray source spectra from Bassani et al. 1989 and Di Salvo et al. 2001.]

44 / 83

Page 45: Bayesian Analysis II

Spectrum of Ultrahigh-Energy Cosmic Rays

Nagano & Watson 2000

HiRes Team 2007

[Figure: flux × E³/10²⁴ (eV² m⁻² s⁻¹ sr⁻¹) vs. log10(E) (eV) for AGASA, HiRes-1 Monocular, and HiRes-2 Monocular data.]

45 / 83


Page 47: Bayesian Analysis II

N is Never Large

Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is 'large enough,' you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc etc). N is never enough because if it were 'enough' you'd already be on to the next problem for which you need more data.

Similarly, you never have quite enough money. But that's another story.

— Andrew Gelman (blog entry, 31 July 2005)

46 / 83

Page 48: Bayesian Analysis II

Bayesian Solution to On/Off Problem

First consider off-source data; use it to estimate b:

p(b|Noff, Ioff) = Toff (bToff)^Noff e^(−bToff) / Noff!

Use this as a prior for b to analyze on-source data. For on-source analysis Iall = (Ion, Noff, Ioff):

p(s, b|Non, Iall) ∝ p(s) p(b) [(s + b)Ton]^Non e^(−(s+b)Ton)

p(s|Iall) is flat, but p(b|Iall) = p(b|Noff, Ioff), so

p(s, b|Non, Iall) ∝ (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton+Toff))

47 / 83

Page 49: Bayesian Analysis II

Now marginalize over b:

p(s|Non, Iall) = ∫ db p(s, b|Non, Iall)
              ∝ ∫ db (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton+Toff))

Expand (s + b)^Non and do the resulting Γ integrals:

p(s|Non, Iall) = Σ_{i=0}^{Non} Ci [Ton (sTon)^i e^(−sTon) / i!]

Ci ∝ (1 + Toff/Ton)^i (Non + Noff − i)!/(Non − i)!

Posterior is a weighted sum of Gamma distributions, each assigning a different number of on-source counts to the source. (Evaluate via recursive algorithm or confluent hypergeometric function.)
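A sketch of this weighted-gamma-sum posterior in Python (NumPy/SciPy assumed; the weights are computed in log space rather than by the recursive algorithm, and the counts and intervals are made-up numbers):

```python
import numpy as np
from scipy.special import gammaln
from scipy import stats

def onoff_posterior(s, N_on, T_on, N_off, T_off):
    """p(s | N_on, N_off, ...) as the weighted sum of gamma densities above;
    weights computed in log space and normalized to sum to 1."""
    i = np.arange(N_on + 1)
    log_w = (i * np.log(1.0 + T_off / T_on)
             + gammaln(N_on + N_off - i + 1) - gammaln(N_on - i + 1))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    s = np.atleast_1d(np.asarray(s, dtype=float))
    # Each term is a Gamma(i+1, scale = 1/T_on) density in s
    dens = np.array([stats.gamma.pdf(s, a=k + 1, scale=1.0 / T_on) for k in i])
    return w @ dens

s_grid = np.linspace(0.0, 30.0, 601)
p = onoff_posterior(s_grid, N_on=16, T_on=1.0, N_off=9, T_off=1.0)
print("normalization check:", np.trapz(p, s_grid))   # ~1 if the grid covers the posterior
```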

48 / 83

Page 50: Bayesian Analysis II

Example On/Off Posteriors—Short Integrations

Ton = 1

49 / 83

Page 51: Bayesian Analysis II

Example On/Off Posteriors—Long Background Integrations

Ton = 1

50 / 83

Page 52: Bayesian Analysis II

Recap of Key Findings From Examples

• Likelihood principle

• Sufficiency: Model-dependent summary of data

• Marginalization: Generalizes background subtraction, propagation of errors

• Student's t for handling σ uncertainty

• Exact treatment of Poisson background uncertainty (don't subtract!)

51 / 83

Page 53: Bayesian Analysis II

Implementation

1 Simple examples
  Binary Outcomes
  Normal Distribution
  Poisson Distribution

2 Bayesian calculation
  Large N: Laplace approximations
  Modest-D: Cubature, Monte Carlo Integration
  High-D: Posterior Sampling

3 Probability and frequency

4 Bayesian Software

5 Closing Reflections

52 / 83

Page 54: Bayesian Analysis II

Statistical Integrals

Inference with independent data

Consider N data, D = {xi}; and model M with m parameters.

Suppose L(θ) = p(x1|θ) p(x2|θ) · · · p(xN|θ).

Frequentist integrals
Find long-run properties of procedures via sample space integrals:

I(θ) = ∫ dx1 p(x1|θ) ∫ dx2 p(x2|θ) · · · ∫ dxN p(xN|θ) f(D, θ)

Rigorous analysis must explore the θ dependence; rarely done in practice.

"Plug-in" approximation: Report properties of procedure for θ = θ̂. Asymptotically accurate (for large N, expect θ̂ → θ).

"Plug-in" results are easy via Monte Carlo (due to independence).

53 / 83

Page 55: Bayesian Analysis II

Bayesian integrals

∫ d^m θ g(θ) p(θ|M) L(θ) = ∫ d^m θ g(θ) q(θ),    with q(θ) ≡ p(θ|M) L(θ)

• g(θ) = 1 → p(D|M) (norm. const., model likelihood)

• g(θ) = 'box' → credible region

• g(θ) = θ → posterior mean for θ

Such integrals are sometimes easy if analytic (especially in low dimensions), often easier than frequentist counterparts (e.g., normal credible regions, Student's t).

Asymptotic approximations: Require ingredients familiar from frequentist calculations. Bayesian calculation is not significantly harder than frequentist calculation in this limit.

Numerical calculation: For "large" m (> 4 is often enough!) the integrals are often very challenging because of structure (e.g., correlations) in parameter space. This is usually pursued without making any procedural approximations.

54 / 83

Page 56: Bayesian Analysis II

Bayesian Computation

Large sample size: Laplace approximation

• Approximate posterior as multivariate normal → det(covar) factors
• Uses ingredients available in χ²/ML fitting software (MLE, Hessian)
• Often accurate to O(1/N)

Modest-dimensional models (d ≲ 10 to 20)

• Adaptive cubature
• Monte Carlo integration (importance & stratified sampling, adaptive importance sampling, quasirandom MC)

High-dimensional models (d ≳ 5)

• Posterior sampling — create RNG that samples posterior
• MCMC is most general framework — Murali's lab

55 / 83

Page 57: Bayesian Analysis II

Laplace Approximations

Suppose the posterior has a single dominant (interior) mode at θ̂. For large N,

π(θ)L(θ) ≈ π(θ̂)L(θ̂) exp[−(1/2)(θ − θ̂)ᵀ I (θ − θ̂)]

where I = −∂² ln[π(θ)L(θ)]/∂θ² |_θ̂

        = Negative Hessian of ln[π(θ)L(θ)]

        = "Observed Fisher info. matrix" (for flat prior)

        ≈ Inverse of covariance matrix

E.g., for a 1-d Gaussian posterior, I = 1/σ_θ²
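A rough sketch of the Laplace approximation in Python (NumPy/SciPy assumed; the mode is found with a generic optimizer and the Hessian by finite differences, and the function names are mine, so treat this as illustrative rather than production code):

```python
import numpy as np
from scipy import optimize

def laplace_log_evidence(log_post, theta0, eps=1e-4):
    """Laplace approximation to ln Z = ln integral of exp(log_post(theta)),
    where log_post(theta) = ln[pi(theta) L(theta)]."""
    that = optimize.minimize(lambda t: -log_post(t), np.atleast_1d(theta0)).x
    m = that.size
    H = np.empty((m, m))
    for i in range(m):                   # central finite-difference Hessian at the mode
        for j in range(m):
            ei, ej = np.eye(m)[i] * eps, np.eye(m)[j] * eps
            H[i, j] = (log_post(that + ei + ej) - log_post(that + ei - ej)
                       - log_post(that - ei + ej) + log_post(that - ei - ej)) / (4 * eps**2)
    info = -H                            # observed information matrix
    return log_post(that) + 0.5 * m * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(info)[1]

# Check on a 2-d Gaussian with sigma1 = 1, sigma2 = 2, where ln Z = ln(2*pi*1*2) exactly
log_post = lambda t: -0.5 * (t[0]**2 / 1.0**2 + t[1]**2 / 2.0**2)
print(laplace_log_evidence(log_post, [1.0, 1.0]), np.log(2 * np.pi * 1.0 * 2.0))
```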

56 / 83

Page 58: Bayesian Analysis II

Marginal likelihoods

∫ dθ π(θ)L(θ) ≈ π(θ̂)L(θ̂) (2π)^(m/2) |I|^(−1/2)

Marginal posterior densities
Profile likelihood Lp(φ) ≡ max_η L(φ, η) = L(φ, η̂(φ))

→ p(φ|D,M) ∝∼ π(φ, η̂(φ)) Lp(φ) |I_η(φ)|^(−1/2)    with I_η(φ) = ∂_η ∂_η ln(πL)|_η̂

Posterior expectations

∫ dθ f(θ) π(θ)L(θ) ∝∼ f(θ̃)π(θ̃)L(θ̃) (2π)^(m/2) |Ĩ|^(−1/2)    where θ̃ maximizes fπL

Tierney & Kadane, "Accurate Approximations for Posterior Moments and Marginal Densities," JASA (1986)

57 / 83

Page 59: Bayesian Analysis II

Features
Uses output of common algorithms for frequentist methods (optimization, Hessian)

Uses ratios → approximation is often O(1/N) or better

Includes volume factors that are missing from common frequentist methods (better inferences!)

58 / 83

Page 60: Bayesian Analysis II

Drawbacks
Posterior must be smooth and unimodal (or have well-separated modes)

Mode must be away from boundaries (can be relaxed)

Result is parameterization-dependent—try to reparameterize to make things look as Gaussian as possible (e.g., θ → log θ to straighten curved contours)

Asymptotic approximation with no simple diagnostics (like many frequentist methods)

Empirically, it often does not work well for m ≳ 10

59 / 83

Page 61: Bayesian Analysis II

Relationship to BIC
Laplace approximation for the marginal likelihood:

Z ≡ ∫ dθ π(θ)L(θ) ≈ π(θ̂)L(θ̂) (2π)^(m/2) |I|^(−1/2) ∼ π(θ̂)L(θ̂) (2π)^(m/2) ∏_{k=1}^m σ_θk

We expect σ_θk ∼ 1/√N

Bayesian Information Criterion (BIC; aka Schwarz criterion):

−(1/2) BIC = ln L(θ̂) − (m/2) ln N

This is a very crude approximation to ln Z; it captures the asymptotic N dependence, but omits factors O(1). Can justify in some i.i.d. settings using a "unit info prior."

BIC ∼ Bayesian counterpart to adjusting χ² for d.o.f., but partly accounts for parameter space volume (consistent!)

Can be useful for identifying cases where an accurate but hard Z calculation is useful (esp. for nested models, where some missing factors cancel)
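A tiny sketch of the criterion (NumPy assumed; the log-likelihoods, parameter counts, and sample size are made-up numbers):

```python
import numpy as np

def bic(max_log_like, m, N):
    """Schwarz criterion: BIC = -2 ln L(theta_hat) + m ln N; -BIC/2 crudely approximates ln Z."""
    return -2.0 * max_log_like + m * np.log(N)

# Hypothetical nested comparison on N = 100 points; the smaller BIC is preferred
print("m = 2:", bic(-120.3, 2, 100))
print("m = 3:", bic(-118.9, 3, 100))
```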

60 / 83

Page 62: Bayesian Analysis II

Modest-D: Quadrature & Cubature

Quadrature rules for 1-D integrals (with weight function h(θ)):

∫ dθ f(θ) = ∫ dθ h(θ) [f(θ)/h(θ)] ≈ Σ_i wi f(θi) + O(n^−2) or O(n^−4)

Smoothness → fast convergence in 1-D

Curse of dimensionality: Cartesian product rules converge slowly, O(n^(−2/m)) or O(n^(−4/m)) in m-D

[Figure: quadrature node placement; credit: Wikipedia]

61 / 83

Page 63: Bayesian Analysis II

Monomial Cubature Rules

Seek rules exact for multinomials (× weight) up to fixed monomial degree with desired lattice symmetry.

Number of points required grows much more slowly with m than for Cartesian rules (but still quickly)

A 7th order rule in 2-d

62 / 83

Page 64: Bayesian Analysis II

Adaptive Cubature

• Subregion adaptive cubature: Use a pair of monomial rules (for error estim'n); recursively subdivide regions w/ large error (ADAPT, DCUHRE, BAYESPACK, CUBA). Concentrates points where most of the probability lies.

• Adaptive grid adjustment: Naylor-Smith method. Iteratively update abscissas and weights to make the (unimodal) posterior approach the weight function.

These provide diagnostics (error estimates or measures of reparameterization quality).

63 / 83

Page 65: Bayesian Analysis II

Analysis of Galaxy Polarizations

64 / 83

Page 66: Bayesian Analysis II

Monte Carlo Integration

Choose points randomly rather than deterministically; the law of large numbers says the sample average → expectation:

∫ dθ g(θ) p(θ) ≈ (1/n) Σ_{θi∼p(θ)} g(θi) + O(n^(−1/2))    [∼ O(n^−1) with quasi-MC]

Ignores smoothness → poor performance in 1-D

Avoids curse: O(n^(−1/2)) regardless of dimension

Practical problems:

• p(θ) must be a normalized density we can draw IID samples from—perhaps the prior, but. . .

• O(n^(−1/2)) multiplier (std. dev'n of g) may be large

→ hard if dimension ≳ 5–10
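A minimal sketch of plain Monte Carlo integration in Python (NumPy assumed; the Gaussian p(θ) and g(θ) = θ² are toy choices so that the exact answer, 1.25, is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior expectation of g(theta) = theta^2 under p(theta) = N(1, 0.5^2),
# a case where IID draws are easy; the exact answer is 1^2 + 0.5^2 = 1.25
n = 100_000
theta = rng.normal(1.0, 0.5, size=n)
g = theta**2

est = g.mean()
err = g.std() / np.sqrt(n)          # O(n^{-1/2}) Monte Carlo error estimate
print(f"{est:.4f} +/- {err:.4f}")
```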

65 / 83

Page 67: Bayesian Analysis II

Importance sampling

∫ dθ g(θ) q(θ) = ∫ dθ [g(θ) q(θ)/P(θ)] P(θ) ≈ (1/n) Σ_{θi∼P(θ)} g(θi) q(θi)/P(θi)

Choose P to make the variance small. (Not easy!)

[Figure: a target q*(x) and an importance sampler P*(x).]

Can be used for both model comparison (marginal likelihood calculation) and parameter estimation.

Adaptive Monte Carlo: Build the importance sampler on-the-fly (e.g., VEGAS, miser in Numerical Recipes); annealing adaptive importance sampling (Liu+ 2010). . .

66 / 83
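A minimal importance-sampling sketch in Python (NumPy/SciPy assumed; the Gaussian target and the Student's t sampler are toy choices with a known normalization, so the estimates can be checked):

```python
import numpy as np
from scipy import stats

# Unnormalized target q(theta) = prior * likelihood; a toy 1-D Gaussian bump here,
# so the true normalization Z = sqrt(2*pi) * 0.3 is known
def q(theta):
    return np.exp(-0.5 * ((theta - 2.0) / 0.3) ** 2)

P = stats.t(df=3, loc=2.0, scale=0.5)        # heavy-tailed sampler roughly matched to q

n = 50_000
theta = P.rvs(size=n, random_state=42)
w = q(theta) / P.pdf(theta)                  # importance weights

Z_hat = w.mean()                             # marginal-likelihood estimate
mean_hat = np.sum(w * theta) / np.sum(w)     # self-normalized posterior mean
print(Z_hat, np.sqrt(2 * np.pi) * 0.3, mean_hat)
```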

Page 68: Bayesian Analysis II

High-D: Posterior Sampling Via MCMC

Monte Carlo integration:

∫ dθ g(θ) p(θ) ≈ (1/n) Σ_{θi∼p(θ)} g(θi) + O(n^(−1/2))

It is very hard to build multivariate IID samplers.

Key idea — abandon IID:

• Metropolis-Hastings algorithm provides a simple recipe

• It can sample from unnormalized densities

[Figure: a Markov chain wandering over π(θ)L(θ) contours in the (θ1, θ2) plane, starting from an initial θ.]

Covered in Murali's lab!
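A minimal random-walk Metropolis sketch in Python (NumPy assumed; the correlated Gaussian "posterior", step size, and chain length are toy choices, and the function name is mine):

```python
import numpy as np

def metropolis(log_post, theta0, n_steps=50_000, step=0.5, seed=0):
    """Random-walk Metropolis: needs only the *unnormalized* log posterior."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    chain = np.empty((n_steps, theta.size))
    for t in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.size)   # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:                 # Metropolis accept/reject
            theta, lp = prop, lp_prop
        chain[t] = theta
    return chain

# Toy correlated 2-D Gaussian "posterior" (unnormalized), mean 0
icov = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
chain = metropolis(lambda th: -0.5 * th @ icov @ th, [3.0, -3.0])
print(chain[5000:].mean(axis=0))             # should be near (0, 0) after burn-in
print(np.cov(chain[5000:].T))                # should approach the target covariance
```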

67 / 83

Page 69: Bayesian Analysis II

Much More to Computational Bayes

Adaptive MCMC
• Single-chain adaptive proposals (use many past states)
• Population-based MCMC (e.g., differential evolution MCMC)

Model uncertainty
• Marginal likelihood computation: Thermodynamic integration, bridge sampling
• MCMC in model space: Reversible jump MCMC, birth/death
• Exploration of large (discrete) model spaces (e.g., variable selection): Shotgun stochastic search, Bayesian adaptive sampling

Sequential Monte Carlo
• Particle filters for dynamical models (posterior tracks a changing state)
• Adaptive importance sampling ("evolve" posterior via annealing or on-line processing)

This is just a small sampling!

68 / 83

Page 70: Bayesian Analysis II

Implementation

1 Simple examples
  Binary Outcomes
  Normal Distribution
  Poisson Distribution

2 Bayesian calculation
  Large N: Laplace approximations
  Modest-D: Cubature, Monte Carlo Integration
  High-D: Posterior Sampling

3 Probability and frequency

4 Bayesian Software

5 Closing Reflections

69 / 83

Page 71: Bayesian Analysis II

Interpreting Probability Distributions

Bayesian

Probability quantifies uncertainty in an inductive inference. p(x) describes how probability is distributed over the possible values x might have taken in the single case before us:

[Figure: P vs. x — p is distributed; x has a single, uncertain value.]

Frequentist

Probabilities are always (limiting) rates/proportions/frequencies in an ensemble. p(x) describes variability, how the values of x are distributed among the cases in the ensemble:

[Figure: P vs. x — x is distributed.]

70 / 83

Page 72: Bayesian Analysis II

Probability & Frequency

Frequencies are relevant when modeling repeated trials, or repeated sampling from a population or ensemble.

Frequencies are observables

• When available, can be used to infer probabilities for next trial

• When unavailable, can be predicted

Bayesian/Frequentist relationships

• Relationships between probability and frequency

• Long-run performance of Bayesian procedures

71 / 83

Page 73: Bayesian Analysis II

Relationships Between Probability & Frequency

Frequency from probability
Bernoulli's law of large numbers: In repeated i.i.d. trials, given P(success| . . .) = α, predict

n_success/N_total → α    as N_total → ∞

If p(x) does not change from sample to sample, it may be interpreted as a frequency distribution.

Probability from frequency
Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances" → First use of Bayes's theorem:

Probability for success in next trial of i.i.d. sequence:

E(α) → n_success/N_total    as N_total → ∞

If p(x) does not change from sample to sample, it may be estimated from a frequency distribution.

72 / 83

Page 74: Bayesian Analysis II

Subtle Relationships For Non-IID Cases

Predict frequency in dependent trials
rt = result of trial t; p(r1, r2 . . . rN|M) known; predict f:

⟨f⟩ = (1/N) Σ_t p(rt = success|M)

where p(r1|M) = Σ_{r2} · · · Σ_{rN} p(r1, r2 . . . |M)

Expected frequency of outcome in many trials = average probability for outcome across trials. But also find that σ_f needn't converge to 0.

A formalism that distinguishes p from f from the outset is particularly valuable for exploring subtle connections.

73 / 83

Page 75: Bayesian Analysis II

Frequentist Performance of Bayesian Procedures

Many results known for parametric Bayes performance:

• Estimates are consistent if the prior doesn't exclude the true value.

• Credible regions found with flat priors are typically confidence regions to O(n^(−1/2)); "reference" priors can improve their performance to O(n^−1).

• Marginal distributions have better frequentist performance than conventional methods like profile likelihood. (Bartlett correction, ancillaries are competitive but hard.)

• Bayesian model comparison is asymptotically consistent (not true of significance/NP tests, AIC).

• For separate (not nested) models, the posterior probability for the true model converges to 1 exponentially quickly.

• Misspecification: Bayes converges to the model with sampling dist'n closest to the truth via Kullback-Leibler divergence.

• Wald's complete class theorem: Optimal frequentist methods are Bayes rules (equivalent to Bayes for some prior)

• . . .

Parametric Bayesian methods are typically good frequentist methods.

(More complicated in nonparametric problems.)

74 / 83

Page 76: Bayesian Analysis II

Implementation

1 Simple examples
  Binary Outcomes
  Normal Distribution
  Poisson Distribution

2 Bayesian calculation
  Large N: Laplace approximations
  Modest-D: Cubature, Monte Carlo Integration
  High-D: Posterior Sampling

3 Probability and frequency

4 Bayesian Software

5 Closing Reflections

75 / 83

Page 77: Bayesian Analysis II

Tools for Computational Bayes

Astronomer/Physicist Tools

• BIE http://www.astro.umass.edu/∼weinberg/BIE/
  Bayesian Inference Engine: General framework for Bayesian inference, tailored to astronomical and earth-science survey data. Built-in database capability to support analysis of terabyte-scale data sets. Inference is by Bayes via MCMC.

• XSpec, CIAO/Sherpa
  Both environments have some basic Bayesian capability (including basic MCMC in XSpec)

• CosmoMC http://cosmologist.info/cosmomc/
  Parameter estimation for cosmological models using CMB, etc., via MCMC

• ExoFit http://www.homepages.ucl.ac.uk/∼ucapola/exofit.html
  Adaptive MCMC for fitting exoplanet RV data

• CDF Bayesian Limit Software http://www-cdf.fnal.gov/physics/statistics/statistics software.html
  Limits for Poisson counting processes, with background & efficiency uncertainties

• root/RooStats http://root.cern.ch/root/html/ROOFIT ROOSTATS Index.html
  Statistical tools for particle physicists; Bayesian support being incorporated

• CUBA http://www.feynarts.de/cuba/
  Multidimensional integration via adaptive cubature; adaptive importance sampling & stratification; QMC (C/C++, Fortran, and Mathematica)

• Inference Forthcoming at http://inference.astro.cornell.edu/
  Several self-contained Bayesian modules; Parametric Inference Engine

76 / 83

Page 78: Bayesian Analysis II

Python

• PyMC http://code.google.com/p/pymc/
  A framework for MCMC via Metropolis-Hastings; also implements Kalman filters and Gaussian processes. Targets biometrics, but is general.

• SimPy http://simpy.sourceforge.net/
  SimPy (rhymes with "Blimpie") is a process-oriented public-domain package for discrete-event simulation.

• RSPython http://www.omegahat.org/
  Bi-directional communication between Python and R

• MDP http://mdp-toolkit.sourceforge.net/
  Modular toolkit for Data Processing: Current emphasis is on machine learning (PCA, ICA. . . ). Modularity allows combination of algorithms and other data processing elements into "flows."

• Orange http://www.ailab.si/orange/
  Component-based data mining, with preprocessing, modeling, and exploration components. Python/GUI interfaces to C++ implementations. Some Bayesian components.

• ELEFANT http://rubis.rsise.anu.edu.au/elefant
  Machine learning library and platform providing Python interfaces to efficient, lower-level implementations. Some Bayesian components (Gaussian processes; Bayesian ICA/PCA).

77 / 83

Page 79: Bayesian Analysis II

R and S

• CRAN Bayesian task view http://cran.r-project.org/web/views/Bayesian.html
  Overview of many R packages implementing various Bayesian models and methods; pedagogical packages; packages linking R to other Bayesian software (BUGS, JAGS)

• Omega-hat http://www.omegahat.org/
  RPython, RMatlab, R-Xlisp

• BOA http://www.public-health.uiowa.edu/boa/
  Bayesian Output Analysis: Convergence diagnostics and statistical and graphical analysis of MCMC output; can read BUGS output files.

• CODA http://www.mrc-bsu.cam.ac.uk/bugs/documentation/coda03/cdaman03.html
  Convergence Diagnosis and Output Analysis: Menu-driven R/S plugins for analyzing BUGS output

78 / 83

Page 80: Bayesian Analysis II

Java

• Hydra http://research.warnes.net/projects/mcmc/hydra/
  HYDRA provides methods for implementing MCMC samplers using Metropolis, Metropolis-Hastings, Gibbs methods. In addition, it provides classes implementing several unique adaptive and multiple chain/parallel MCMC methods.

• YADAS http://www.stat.lanl.gov/yadas/home.html
  Software system for statistical analysis using MCMC, based on the multi-parameter Metropolis-Hastings algorithm (rather than parameter-at-a-time Gibbs sampling)

• Omega-hat http://www.omegahat.org/
  Java environment for statistical computing, being developed by XLisp-stat and R developers

79 / 83

Page 81: Bayesian Analysis II

C/C++/Fortran

• BayeSys 3 http://www.inference.phy.cam.ac.uk/bayesys/
  Sophisticated suite of MCMC samplers including transdimensional capability, by the author of MemSys

• fbm http://www.cs.utoronto.ca/∼radford/fbm.software.html
  Flexible Bayesian Modeling: MCMC for simple Bayes, nonparametric Bayesian regression and classification models based on neural networks and Gaussian processes, and Bayesian density estimation and clustering using mixture models and Dirichlet diffusion trees

• BayesPack, DCUHRE http://www.sci.wsu.edu/math/faculty/genz/homepage
  Adaptive quadrature, randomized quadrature, Monte Carlo integration

• BIE, CDF Bayesian limits, CUBA (see above)

80 / 83

Page 82: Bayesian Analysis II

Other Statisticians’ & Engineers’ Tools

• BUGS/WinBUGS http://www.mrc-bsu.cam.ac.uk/bugs/
  Bayesian Inference Using Gibbs Sampling: Flexible software for the Bayesian analysis of complex statistical models using MCMC

• OpenBUGS http://mathstat.helsinki.fi/openbugs/
  BUGS on Windows and Linux, and from inside R

• JAGS http://www-fis.iarc.fr/∼martyn/software/jags/
  "Just Another Gibbs Sampler;" MCMC for Bayesian hierarchical models

• XLisp-stat http://www.stat.uiowa.edu/∼luke/xls/xlsinfo/xlsinfo.html
  Lisp-based data analysis environment, with an emphasis on providing a framework for exploring the use of dynamic graphical methods

• ReBEL http://choosh.csee.ogi.edu/rebel/
  Library supporting recursive Bayesian estimation in Matlab (Kalman filter, particle filters, sequential Monte Carlo).

81 / 83

Page 83: Bayesian Analysis II

Implementation

1 Simple examples
  Binary Outcomes
  Normal Distribution
  Poisson Distribution

2 Bayesian calculation
  Large N: Laplace approximations
  Modest-D: Cubature, Monte Carlo Integration
  High-D: Posterior Sampling

3 Probability and frequency

4 Bayesian Software

5 Closing Reflections

82 / 83

Page 84: Bayesian Analysis II

Closing Reflections

Philip Dawid (2000)

What is the principal distinction between Bayesian and classical statistics? It is that Bayesian statistics is fundamentally boring. There is so little to do: just specify the model and the prior, and turn the Bayesian handle. There is no room for clever tricks or an alphabetic cornucopia of definitions and optimality criteria. I have heard people use this 'dullness' as an argument against Bayesianism. One might as well complain that Newton's dynamics, being based on three simple laws of motion and one of gravitation, is a poor substitute for the richness of Ptolemy's epicyclic system.
All my experience teaches me that it is invariably more fruitful, and leads to deeper insights and better data analyses, to explore the consequences of being a 'thoroughly boring Bayesian'.

Dennis Lindley (2000)

The philosophy places more emphasis on model construction than on formal inference. . . I do agree with Dawid that 'Bayesian statistics is fundamentally boring'. . . My only qualification would be that the theory may be boring but the applications are exciting.

83 / 83