TRANSCRIPT
Applied Bayesian Inference
Prof. Dr. Renate Meyer1,2
1 Institute for Stochastics, Karlsruhe Institute of Technology, Germany; 2 Department of Statistics, University of Auckland, New Zealand
KIT, Winter Semester 2010/2011
Prof. Dr. Renate Meyer Applied Bayesian Inference 1
1 Introduction 1.1 Course Overview
Overview: Applied Bayesian Inference A
I Bayes theorem, discrete – continuous
I Conjugate examples: Binomial, Exponential
I Introduction to R
I Simulation-based posterior computation
I Introduction to WinBUGS
I Regression, ANOVA, GLM, hierarchical models, survival analysis, state-space models for time series, copulas
I Basic model checking with WinBUGS
I Convergence diagnostics with CODA
Overview: Applied Bayesian Inference B
I Conjugate examples: Poisson, Normal, Exponential Family
I Specification of prior distributions
I Likelihood Principle
I Multivariate and hierarchical models
I Techniques for posterior computation
I Normal approximation
I Non-iterative simulation
I Markov Chain Monte Carlo
I Bayes factors, model checking and determination
I Decision-theoretic foundations of Bayesian inference
Computing
I R – mostly covered in class
I WinBUGS – completely covered in class
I Other – at your own risk
1 Introduction 1.2 Why Bayesian Inference?
Why Bayesian Inference?
Or: What is wrong with standard statistical inference?
The two mainstays of standard/classical statistical inference are
I confidence intervals and
I hypothesis tests.
Anything wrong with them?
Example: Newcomb’s Speed of Light
Example 1.1
Light travels fast, but it is not transmitted instantaneously. Light takes over a second to reach us from the moon and over 10 billion years to reach us from the most distant objects yet observed in the expanding universe. Because radio and radar also travel at the speed of light, an accurate value for that speed is important in communicating with astronauts and orbiting satellites. An accurate value for the speed of light is also important to computer designers because electrical signals travel only at light speed.

The first reasonably accurate measurements of the speed of light were made by Simon Newcomb between July and September 1882. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400 m. His first measurement was 24.828 millionths of a second.
Newcomb’s Speed of Light: CI
Let us assume that the individual measurements Xi ∼ N(µ, σ²) with known measurement variance σ² = 0.005². We want to find a 95% confidence interval for µ.
Answer: x̄ ± 1.96 σ/√n.

Because (X̄ − µ)/(σ/√n) ∼ N(0, 1):

P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) = 0.95
P(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) = 0.95
P(24.8182 < µ < 24.8378) = 0.95
This means that µ is in this interval with 95% probability. Certainly NOT!
Newcomb’s Speed of Light: CI
After collecting the data and computing the CI, this interval eithercontains the true mean or it does not. Its coverage probability is not0.95 but either 0 or 1.
Then where does our 95% confidence come from?
Let us do an experiment:
I draw 1000 samples of size 10 each from N(24.828, 0.005²)
I for each sample calculate the 95% CI
I check whether the true µ = 24.828 is inside or outside the CI
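The course carries out this experiment in R; as a rough illustration, the same simulation can be sketched in Python (the seed and loop structure are my choices; the values 24.828, 0.005 and the 95% level are from the example):

```python
import math
import random

random.seed(1)
mu, sigma, n, reps = 24.828, 0.005, 10, 1000
half_width = 1.96 * sigma / math.sqrt(n)  # 95% CI half-width, sigma known

covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    if xbar - half_width < mu < xbar + half_width:
        covered += 1

coverage = covered / reps  # long-run proportion of CIs containing the true mu
```

The observed coverage fluctuates around 0.95, exactly as the repeated-sampling argument on the next slide describes.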
Newcomb’s Speed of Light: Simulation
[Plot: running coverage of the 95% CIs over repeated samples (100% after the first few samples, 90.0% after the 10th, 94.0% after the 100th, 95.2% after the 1000th), fluctuating around the level of confidence; the true mean is 24.8.]
Figure 1: Coverage over repeated sampling.
Newcomb’s Speed of Light: CI
I 952 of the 1000 CIs include the true mean.
I 48 of the 1000 CIs do not include the true mean.
I In reality, we don't know the true mean.
I We do not sample repeatedly; we only take one sample and calculate one CI.
I Will this CI contain the true value?
I It either will or will not, but we do not know.
I We take comfort in the fact that the method works 95% of the time in the long run, i.e. the method produces a CI that contains the unknown mean 95% of the time that the method is used in the long run.
Newcomb’s Speed of Light: CI
By contrast, Bayesian confidence intervals, known as credible intervals, do not require this awkward frequentist interpretation.

One can make the more natural and direct statement concerning the probability of the unknown parameter falling in this interval.

One needs to provide additional structure to make this interpretation possible.
Newcomb’s Speed of Light: Hypothesis Test
H0 : µ ≤ µ0 (= 24.828) versus H1 : µ > µ0

I Test statistic: U = (X̄ − µ0)/(σ/√n) ∼ N(0, 1) if µ = µ0
I Small values of uobs are consistent with H0, large values favour H1
I P-value: p = P(U > uobs | µ = µ0) = 1 − Φ(uobs)
I If P-value < 0.05 (= usual type I error rate), reject H0

The P-value is the probability that H0 is true. Certainly NOT.
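As a sketch of this computation (the course uses R; this Python fragment assumes a hypothetical observed sample mean of 24.831, which is not from the slides):

```python
import math

def phi(u):
    # standard normal CDF, computed via the error function
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

mu0, sigma, n = 24.828, 0.005, 10
xbar = 24.831                              # hypothetical observed sample mean
u_obs = (xbar - mu0) / (sigma / math.sqrt(n))
p_value = 1.0 - phi(u_obs)                 # p = P(U > u_obs | mu = mu0)
reject = p_value < 0.05                    # usual type I error rate
```

With these (invented) numbers u_obs is about 1.90 and the P-value is below 0.05, so H0 would be rejected.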
Newcomb’s Speed of Light: Hypothesis Test
The P-value is the probability of observing a value of the test statistic that is more extreme than the actually observed value uobs if the null hypothesis were true (under repeated sampling).

We can do another thought experiment:
I imagine we take 1000 samples of size 10 from a normal distribution with mean µ0
I we calculate the P-value for each sample
I it will only be smaller than 0.05 in about 5% of the samples, i.e. in about 50 samples
I we take comfort in the fact that this test works 95% of the time in the long run, i.e. rejects H0 even though H0 is true in only 5% of the cases that this method is used
Newcomb’s Speed of Light: Hypothesis Test
I It can only offer evidence against the null hypothesis. A large P-value does not offer evidence that H0 is true.
I The P-value cannot be directly interpreted as "weight of evidence" but only as a long-term probability (in a hypothetical repetition of the same experiment) of obtaining data at least as unusual as what was actually observed.
I Most practitioners are tempted to say that the P-value is the probability that H0 is true.
I P-values depend not only on the observed data but also on the sampling probability of certain unobserved data points. This violates the Likelihood Principle.
I This has serious practical implications, for instance for the analysis of clinical trials, where often interim analyses and unexpected drug toxicities change the original trial design.
Newcomb’s Speed of Light: Hypothesis Test
By contrast, the Bayesian approach to hypothesis testing, due primarily to Jeffreys (1961), is much simpler and avoids the pitfalls of the traditional Neyman-Pearson-based approach.

It allows the direct calculation of the probability that a hypothesis is true and thus a direct and straightforward interpretation.

Again, as in the case of CIs, we need to add more structure to the underlying probability model.
1 Introduction 1.3 Historical Overview
Historical Overview
Figure 2: From William Jefferys’ webpage, Univ. of Texas at Austin.
Inverse Probability
I Bayes and Laplace (late 1700s) – inverse probability
I Example: Given x successes in n iid trials with success probability θ
I probability – statements about observables given assumptions about unknown parameters (deductive):
  P(9 ≤ X ≤ 12 | θ)
I inverse probability – statements about unknown parameters given observed data values (inductive):
  P(a < θ < b | X = 9)
Thomas Bayes
(b. 1702, London – d. 1761, Tunbridge Wells, Kent)
Bellhouse, D.R. (2004) The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of His Birth. Statistical Science 19(1):3-43.
Figure 3: Reverend Thomas Bayes 1702-1761.
Bayes’ Biography
Presbyterian minister and mathematician
Son of one of the first 6 Nonconformist ministers in England
Private education (by De Moivre?)
Ordained as a Nonconformist minister, he took the position of minister at the Presbyterian Chapel, Tunbridge Wells.

Educated and interested in mathematics, probability and statistics; believed to be the first to use probability inductively; defended the views and philosophy of Sir Isaac Newton against criticism by Bishop Berkeley.

Two papers published while he was still living:
I Divine Providence and Government is the Happiness of His Creatures (1731)
I An Introduction to the Doctrine of Fluxions, and a Defense of the Analyst (1736)
Bayes’ Biography
Elected Fellow of the Royal Society in 1742.

His most well-known paper was published posthumously, submitted by his friend Richard Price: "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), Philosophical Transactions of the Royal Society of London. It begins with:

Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.
Bayes’ Biography
Figure 4: Bayes’ vault at Bunhill Fields, London
18th and 19th Centuries
Bayes laid the foundations of modern Bayesian statistics.

Pierre Simon Laplace (1749-1827), French mathematician and astronomer, developed mathematical astronomy and statistics. He refined inverse probability, acknowledging Bayes' work in a monograph in 1812.

George Boole challenged inverse probability in his Laws of Thought in 1854. The Bayesian approach has been controversial ever since but was predominant in practical applications until the early 20th century because of the lack of a frequentist alternative. Inverse probability became an integral part of university statistics curricula.
20th Century
Sir R.A. Fisher (1890-1962) was a lifelong critic of inverse probability and one of the most important figures in its demise.
Figure 5: Sir Ronald A. Fisher (1890-1962).
20th Century
Fisher's (1922) paper revolutionized statistical thinking by introducing the notions of "maximum likelihood", "sufficiency", and "efficiency". His main argument was that one needed to look at the likelihood of the data given the theory, NOT the likelihood of the theory given the data. He thus advocated an "indirect" approach to statistical inference based on ideas of logic called "proof by contradiction".

His work impressed two young statisticians at University College London: J. Neyman and E. Pearson. They developed the mathematical theory of significance testing and confidence intervals, which had a huge influence on statistical applications (for good or bad).
Rise of Subjective Probability
Inverse probability ideas were studied by Keynes (1921), Borel (1921) and Ramsey (1926).

In the 1930s Harold Jeffreys engaged in a published exchange with R.A. Fisher on Fisher's fiducial argument and Jeffreys' inverse probability. Jeffreys' (1939) book "Theory of Probability" is the most cited in the current "objective Bayesian" literature.

In Italy in the 1930s, Bruno de Finetti gave a different justification for subjective probability, introducing the notion of "exchangeability".

Neo-Bayesian revival in the 1950s (Savage, Good, Lindley, ...).

The current huge popularity of Bayesian methods is due to fast computers and MCMC methods.

Syntheses of Bayesian and non-Bayesian methods? See e.g. Efron (2005), "Bayesians, frequentists, and scientists".
1 Introduction 1.4 Bayesian and Frequentist Inference
Two main approaches to statistical inference
I the Bayesian approach
- parameters are random variables
- subjective probability (for some)
I the frequentist/conventional/classical/orthodox approach
- parameters are fixed but unknown quantities
- probability as long-run relative frequency
I Some controversy in the past (and present)
I In this course: not adversarial
Motivating Example: CPCRA AIDS Trial
Carlin and Hodges (1999), Biometrics
I Compare two treatments for Mycobacterium avium complex, a disease common in late-stage HIV-infected people
I Total of 69 patients
I In 11 clinical centers
I 5 deaths in treatment group 1
I 13 deaths in treatment group 2
Primary Endpoint Data
Unit Treatm. Time   Unit Treatm. Time   Unit Treatm. Time
A    1   74+        B    2   4+         F    1   6
A    2   248        B    1   156+       F    2   16+
A    1   272+                           F    1   76
A    2   244        C    2   20+        F    2   80
D    2   20+        E    1   50+        F    2   202
D    2   64         E    2   64+        F    1   258+
D    2   88         E    2   82         F    1   268+
D    2   148+       E    1   186+       F    2   368+
D    1   162+       E    1   214+       F    1   380+
D    1   184+       E    1   214        F    1   424+
D    1   188+       E    2   228+       F    2   428+
D    1   198+       E    2   262        F    2   436+
D    1   382+
D    1   436+
G    2   32+        H    2   22+        I    2   8
G    1   64+        H    1   22+        I    2   16+
G    2   102        H    1   74+        I    2   40
G    2   162+       H    1   88+        I    1   120+
G    2   182+       H    1   148+       I    1   168+
G    1   364+       H    2   162        I    2   174+
J    1   18+        K    1   28+        I    1   268+
J    1   36+        K    1   70+        I    2   276
J    2   160+       K    2   106+       I    1   286+
J    2   254                            I    1   366
                                        I    2   396+
                                        I    2   466+
                                        I    1   468+

(A "+" indicates a censored observation.)
Data Safety and Monitoring Board
Decision based on:
I Stratified Cox proportional hazards model: relative risk r = 1.9 with 95%-CI [0.6, 5.9], P-value 0.24
I Unstratified Cox proportional hazards model: relative risk r = 3.1 with 95%-CI [1.1, 8.7], P-value 0.02

On the basis of the stratified analysis, the Board would have had to continue the trial. The P-value of the unstratified analysis was small enough to convince the Board to stop the trial.
Stratified Cox PH Model
Why does the stratified analysis fail to detect the treatment difference? The contribution of the i-th stratum to the partial likelihood is

Li(β) = ∏(k=1..di) [ exp(β′xik) / ∑(j∈Rik) exp(β′xj) ]

If the largest time in the i-th stratum is a death, then the partial likelihood derives no information from this event.

This is the case in this study: 4 deaths have the largest survival time in their stratum, and these are all in treatment group 2.
Compromise Stratified-Unstratified Analysis?
Stratified: hi(t) = h0i(t) exp(β′x)    Unstratified: hi(t) = h0(t) exp(β′x)

I unit-specific dummy variables
I frailty model
I stratum-specific baseline hazards are random draws from a certain population of hazard functions

Bayesian analysis offers a flexibility in modelling that is not possible with the frequentist approach.
We will analyze this example in a Bayesian way in Chapter 4.
Some Advantages of Bayesian Inference
I Highly nonlinear models with many parameters can be analyzed
I Offers hitherto unknown flexibility in statistical modelling
I Can handle "nuisance" parameters that pose problems for frequentist inference
I Does not rely on large-sample asymptotics, but gives valid inference also for small sample sizes
I Possibility to incorporate prior knowledge and expert judgement
I Adheres to the Likelihood Principle
1 Introduction 1.5 Discrete Version of Bayes’ Theorem
Reminder of Bayes’ Theorem: Discrete Case
Theorem 1.2
Let A1, A2, ..., An be a set of mutually exclusive and exhaustive events. Then

P(Ai|B) = P(Ai)P(B|Ai)/P(B) = P(Ai)P(B|Ai) / ∑(j=1..n) P(Aj)P(B|Aj).
Chess Example
Example 1.3
You are in a chess tournament and will play your next game against either Jun or Martha, depending on the results of some other games. Suppose your probability of beating Jun is 7/10, but of beating Martha only 2/10. You assess your probability of playing Jun as 1/4.

I How likely is it that you win your next game?

Given: P(W|J) = 7/10, P(W|M) = 2/10, P(J) = 1/4, P(M) = 3/4.

Then P(W) = P(W|J)P(J) + P(W|M)P(M) = (7/10)(1/4) + (2/10)(3/4) = 13/40 = 0.325.
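The same arithmetic can be checked in a few lines of Python (all numbers are from Example 1.3):

```python
# probabilities from Example 1.3
p_w_given_j, p_w_given_m = 7 / 10, 2 / 10   # chance of beating Jun / Martha
p_j, p_m = 1 / 4, 3 / 4                     # prior probabilities of each opponent

# law of total probability: P(W) = P(W|J)P(J) + P(W|M)P(M)
p_w = p_w_given_j * p_j + p_w_given_m * p_m  # 13/40 = 0.325
```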
Chess Example
I Now suppose that you tell me you won your next chess game. Who was your opponent?

P(J|W) = P(W|J)P(J) / [P(W|J)P(J) + P(W|M)P(M)] = 7/13
Diagnostic Testing
Example 1.4
A new home HIV test is claimed to have 95% sensitivity and 98% specificity. In a population with an HIV prevalence of 1/1000, what is the chance that someone testing positive actually has HIV?

Let A be the event that the individual is truly HIV positive and Ā the event that the individual is truly HIV negative. P(A) = 0.001.
Let B be the event that the test is positive. We want P(A|B).
"95% sensitivity" means that P(B|A) = 0.95.
"98% specificity" means that P(B̄|Ā) = 0.98, or equivalently P(B|Ā) = 0.02.
Diagnostic Testing
Now Bayes' theorem says

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ā)P(Ā)]
       = (0.95 × 0.001) / (0.95 × 0.001 + 0.02 × 0.999)
       = 0.045.

Thus, over 95% of those testing positive will, in fact, not have HIV.
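A quick Python check of this calculation (all numbers are from Example 1.4):

```python
# numbers from Example 1.4
prevalence = 0.001     # P(A): truly HIV positive
sensitivity = 0.95     # P(B | A): test positive given HIV positive
specificity = 0.98     # P(not B | not A): test negative given HIV negative

# total probability of a positive test, P(B)
p_b = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# posterior probability of HIV given a positive test, P(A | B)
p_a_given_b = sensitivity * prevalence / p_b  # about 0.045
```

The low prevalence is what drives the posterior down, despite the high sensitivity and specificity.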
The following example caused a stir in 1991 after a US columnist, who calls herself Marilyn Vos Savant, used it in her column. She gave the correct answer. A surprising number of mathematicians wrote to her saying that she was wrong.
Monty Hall Problem
Example 1.5
You are a contestant on the TV show "Let's Make a Deal" and given the choice of three doors. Two of the doors have a goat behind them and one a car. You choose a door, say door 2, but before opening the chosen door, the emcee, Monty Hall, opens a door that has a goat behind it (e.g. door 1). He gives you the option of revising your choice or sticking to your first choice. What do you do?

Since either door 2 or door 3 must hide the car, many claimed that the probability of winning had increased to 1/2.

Obviously, choose door 3. The probability of finding the prize behind either door 1 or door 3 is 2/3. As the emcee showed you that it is not behind door 1, the probability that it is behind door 3 is 2/3.
Monty Hall Problem
With Bayes' theorem:
Let Ai = "car behind door no. i", i = 1, 2, 3. These form a partition.
P(Ai) = 1/3 are the prior probabilities for i = 1, 2, 3.
Let B = "Monty Hall opens door 1 (with goat)".

P(B|A1) = 0 (likelihood of A1)
P(B|A2) = 1/2 (likelihood of A2)
P(B|A3) = 1 (likelihood of A3)

We want

P(A3|B) = P(B|A3)P(A3) / [P(B|A1)P(A1) + P(B|A2)P(A2) + P(B|A3)P(A3)]
        = (1 × 1/3) / (0 × 1/3 + 1/2 × 1/3 + 1 × 1/3)
        = 2/3.
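The 2/3 answer can also be checked by simulation; a Python sketch (the door indexing and seed are my choices):

```python
import random

random.seed(0)
trials, wins_if_switch = 10000, 0
for _ in range(trials):
    car = random.randrange(3)   # door hiding the car (0, 1, 2)
    pick = 1                    # contestant picks door 2 (index 1)
    # Monty opens a goat door that is neither the pick nor the car
    opened = random.choice([d for d in range(3) if d not in (pick, car)])
    # switching means taking the one remaining closed door
    switched = next(d for d in range(3) if d not in (pick, opened))
    if switched == car:
        wins_if_switch += 1

win_rate_switch = wins_if_switch / trials  # close to 2/3
```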
Bayes’ Theorem again
Let H1, H2, ..., Hn denote n mutually disjoint hypotheses and D the observed data. Then Bayes' theorem says:

P(Hi|D) = P(Hi)P(D|Hi) / ∑(j=1..n) P(Hj)P(D|Hj).

I P(D|Hi) are known as likelihoods, the likelihood given to Hi by D, or, as statisticians usually say, the "likelihood of Hi given D". (This notion is used extensively in frequentist statistical inference; the method of maximum likelihood means finding the hypothesis under which the observations are most likely to have occurred.)
I P(Hi) are prior probabilities.
I P(Hi|D) are posterior probabilities.
Importance of Prior Plausibility
Example 1.6
D = the event that I look through my window and see a tall, branched thing with green blobs covering its branches.

Why do I think it is a tree?

H1 = tree
H2 = man
H3 = something else

P(D|H1) is close to 1, whereas P(D|H2) is close to 0. But the likelihood is not the only consideration in this reasoning. More specifically, let H3 = cardboard replica of a tree. Then P(D|H3) is close to 1. H3 has the same likelihood as H1, but it is not a plausible hypothesis because it has a very much lower prior probability.
Importance of Prior Plausibility
P(H1) has a high prior probability.
P(H2) has a high prior probability.
P(H3) has a low prior probability.

Bayes' theorem is in complete accord with this natural reasoning. The posterior probabilities of the various hypotheses are in proportion to the products of their prior probabilities and their likelihoods:

P(Hi|D) ∝ P(Hi)P(D|Hi)

Bayes' theorem thus combines two sources of information: prior information, represented by the prior probabilities, and new information, represented by the likelihoods. Together these "add up" to the total information, represented by the posterior probabilities.
2 Bayesian Inference 2.1 Statistical Model
Notation and Definitions
Here, we only consider parametric models. We assume that the observations X1, ..., Xn have been generated from a parametrized probability distribution, i.e., Xi (1 ≤ i ≤ n) has a distribution with probability density function (pdf) f(xi|θ) on IR, such that the parameters θ = (θ1, ..., θp) are unknown and the pdf f is known. This model can then be represented more simply by X ∼ f(x|θ), where x is the vector of observations and θ the vector of parameters.

Example: Xi ∼ N(µ, σ²) iid for i = 1, ..., n. Then

f(x|µ, σ²) = ∏(i=1..n) f(xi|µ, σ²) = ∏(i=1..n) (1/(√(2π) σ)) exp(−(xi − µ)²/(2σ²))

θ = (µ, σ²)
Notation and Definitions
Definition 2.1
A parametric statistical model consists of the observation of a random variable X, distributed according to f(x|θ), where only the parameter θ is unknown and belongs to a vector space Θ ⊂ IR^p of finite dimension.

We are usually interested in questions of the form:

What is the value of θ1? −→ parameter estimation
Is θ1 larger than θ3? −→ hypothesis testing
What is the most likely value of a future event, whose distribution depends on θ? −→ prediction
2 Bayesian Inference 2.2 Likelihood-based Functions
Overview
In this section, we will introduce (or remind you of):
I likelihood function
I maximum likelihood estimation
I information criteria
I score function
I Fisher information
Likelihood Function
Definition 2.2
The likelihood function of θ is the function that associates the value f(x|θ) to each θ. This function is denoted by l(θ; x). Other common notations are lx(θ), l(θ|x) and l(θ). It is defined by

l(θ; x) = f(x|θ),  θ ∈ Θ,   (2.1)

where x is the observed value of X.

The likelihood function associates to each value of θ the probability of the observed value x of X (if X is discrete). The larger the value of l, the greater are the chances associated with the event under consideration when using that particular value of θ. Therefore, by fixing the value of x and varying θ, we observe the plausibility (or likelihood) of each value of θ. The likelihood function is of fundamental importance in many theories of statistical inference.
Maximum Likelihood Estimate
Definition 2.3
Any vector θ̂ maximizing (2.1) as a function of θ ∈ Θ, with x fixed, provides a maximum likelihood (ML) estimate of θ.

In intuitive terms, this gives the realization of θ most likely to have given rise to the current data set, an important finite-sample property.

Note that even though ∫_{IR^n} f(x|θ) dx = 1, in general ∫_Θ l(θ; x) dθ ≠ 1.
General Information Criteria
Modeling process: suppose f belongs to some family F of meaningful functional forms, but where the dimension p of the parameter may vary among members of the family. Then choose f ∈ F to maximize

GIC = General Information Criterion = log l(θ̂; x) − αp/2.

Here log l(θ̂; x) denotes the maximum of the log-likelihood function, and α/2 provides a penalty per parameter in the model.

Two choices:
I α = 2 (Akaike, 1978): AIC = Akaike Information Criterion = log l(θ̂; x) − p
I α = log(n/2π) (Schwarz, 1978): BIC = Bayesian Information Criterion = log l(θ̂; x) − (p/2) log(n/2π)
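These criteria are straightforward to code; a Python sketch of GIC with the AIC and BIC special cases as defined above (the function names are mine):

```python
import math

def gic(max_loglik, p, alpha):
    # General Information Criterion: log l(theta_hat; x) - alpha * p / 2
    return max_loglik - alpha * p / 2.0

def aic(max_loglik, p):
    # Akaike's choice alpha = 2 gives log l(theta_hat; x) - p
    return gic(max_loglik, p, alpha=2.0)

def bic(max_loglik, p, n):
    # Schwarz's choice alpha = log(n / 2 pi)
    return gic(max_loglik, p, alpha=math.log(n / (2.0 * math.pi)))
```

For example, a model with maximized log-likelihood −100 and 3 parameters has aic(-100.0, 3) = −103.0; the model with the larger criterion value is preferred.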
Binomial Example
Example 2.4
X ∼ Binomial(2, θ). Then

f(x|θ) = l(θ; x) = C(2, x) θ^x (1 − θ)^(2−x),  x = 0, 1, 2;  θ ∈ Θ = (0, 1)

and

∑_x f(x|θ) = 1

but

∫₀¹ l(θ; x) dθ = C(2, x) ∫₀¹ θ^x (1 − θ)^(2−x) dθ = C(2, x) B(x + 1, 3 − x) = 1/3 ≠ 1.
Binomial Example
Note that:
1. If x = 1 then l(θ; x = 1) = 2θ(1 − θ). The value of θ that gives the highest likelihood to x = 1, in other words the most likely value of θ, is 0.5.
2. If x = 2 then l(θ; x = 2) = θ². The most likely value of θ is 1.
3. If x = 0 then l(θ; x = 0) = (1 − θ)². The most likely value is 0.
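These three maximizers can be confirmed numerically in Python by a crude grid search (the grid resolution is arbitrary):

```python
from math import comb

def lik(theta, x, n=2):
    # l(theta; x) = C(n, x) * theta^x * (1 - theta)^(n - x)
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

# evaluate the likelihood on a grid over [0, 1] for each observed x
grid = [i / 1000 for i in range(1001)]
mle = {x: max(grid, key=lambda t: lik(t, x)) for x in (0, 1, 2)}
# mle == {0: 0.0, 1: 0.5, 2: 1.0}
```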
Binomial Example
[Plot of the likelihood functions l(θ; x = 0), l(θ; x = 1) and l(θ; x = 2) against θ ∈ [0, 1].]
Figure 6: Likelihood function for different values of x .
Geometric Example
Example 2.5
Let X1, X2, ..., Xn denote a random sample from a geometric distribution with pdf

f(Xi = xi|θ) = θ(1 − θ)^(xi−1),  xi = 1, 2, ...

a) Find the likelihood function of θ.

l(θ; x) = P(X1 = x1, X2 = x2, ..., Xn = xn|θ) = f(x1, ..., xn|θ)
        = ∏(i=1..n) f(xi|θ) = ∏(i=1..n) θ(1 − θ)^(xi−1)
        = θ^n (1 − θ)^(∑(xi−1)) = θ^n (1 − θ)^(n(x̄−1))

(This is a Beta curve as a function of θ.)
Geometric Example
b) The maximum likelihood estimate θ̂ of θ maximizes the probability of obtaining the observations actually observed. Find θ̂.
It is easier to maximize the log-likelihood:

log l(θ; x) = n log θ + n(x̄ − 1) log(1 − θ)

d/dθ log l(θ; x) = n/θ − n(x̄ − 1)/(1 − θ) = 0 ⟺ n/θ = n(x̄ − 1)/(1 − θ) ⟺ θ̂ = 1/x̄

d²/dθ² log l(θ; x) = −n/θ² − n(x̄ − 1)/(1 − θ)² < 0 for all θ

Thus θ̂ is a global maximum.
Geometric Example
c) The invariance property of maximum likelihood estimates tells us that for any function η = g(θ) of θ, η̂ = g(θ̂) is the ML estimate of g(θ). Find the ML estimate of η = θ(1 − θ) = P(X1 = 2).

η̂ = θ̂(1 − θ̂) = (1/x̄)(1 − 1/x̄).
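A quick numerical illustration in Python, using a hypothetical geometric sample (the data values below are invented for illustration):

```python
# hypothetical geometric sample (each x_i >= 1)
data = [1, 3, 2, 5, 1, 4]
xbar = sum(data) / len(data)

theta_hat = 1 / xbar                   # ML estimate: theta_hat = 1 / xbar
eta_hat = theta_hat * (1 - theta_hat)  # ML estimate of P(X1 = 2) by invariance
```

Here xbar = 16/6, so theta_hat = 6/16 = 0.375 and eta_hat = 0.375 × 0.625.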
Exponential Example
Example 2.6
Let X1, X2, ..., Xn denote a random sample from the exponential distribution with unknown location parameter θ, unknown scale parameter λ, and pdf

f(x|θ, λ) = λ exp{−λ(x − θ)},  θ < x < ∞,

where −∞ < θ < ∞ and 0 < λ < ∞. The common mean and variance of the Xi are µ = θ + λ⁻¹ and σ² = λ⁻². Find the likelihood function of θ and λ and the ML estimates of µ and σ², in situations where the observed values x1, x2, ..., xn are not all equal.
Exponential Example
The joint pdf of X1, ..., Xn is

f(x1, ..., xn|θ, λ) = ∏(i=1..n) f(xi|θ, λ) = ∏(i=1..n) λ exp{−λ(xi − θ)} I(θ ≤ xi)

Thus, the likelihood of θ and λ when x1, ..., xn are observed is

l(θ, λ; x1, ..., xn) = λ^n exp{−λ ∑(i=1..n)(xi − θ)} ∏(i=1..n) I(θ ≤ xi)
Exponential Example
Defining z = min(x1, ..., xn),

l(θ, λ; x1, ..., xn) = λ^n exp{−λn(x̄ − θ)} I(θ ≤ z)

As a function of θ,

l(θ, λ; x1, ..., xn) ∝ exp(nλθ) for θ ≤ z, and 0 otherwise.

This is maximized when θ = θ̂ = z. Now, as a function of λ, the likelihood is proportional to

g(λ) = λ^n exp{−aλ}

with a = n(x̄ − θ̂) > 0 (if x1, ..., xn are not all equal).
Exponential Example
Then

log g(λ) = n log λ − aλ.

d log g(λ)/dλ = n/λ − a = 0 ⟺ λ = λ̂ = n/a = 1/(x̄ − z).

This is a global maximum as the second derivative is always negative. By the invariance property of ML estimators:

µ̂ = θ̂ + λ̂⁻¹ = z + (x̄ − z) = x̄,
σ̂² = λ̂⁻² = (x̄ − z)².
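A short Python check of these formulas on a hypothetical sample (the data values below are invented):

```python
# hypothetical observations, not all equal
data = [1.2, 0.7, 2.5, 0.9, 1.7]

z = min(data)                 # theta_hat = smallest observation
xbar = sum(data) / len(data)
lam_hat = 1 / (xbar - z)      # lambda_hat = 1 / (xbar - z)

mu_hat = z + 1 / lam_hat      # ML estimate of mu; equals xbar by invariance
var_hat = (xbar - z) ** 2     # ML estimate of sigma^2
```

With these values xbar = 1.4 and z = 0.7, so mu_hat = 1.4 and var_hat = 0.49.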
Fisher Information
Definition 2.7
Let X be a random vector with pdf f(x|θ) depending on a one-dimensional parameter θ. The expected Fisher information measure of θ through X is defined by

I(θ) = E_{X|θ}[−∂² log f(X|θ)/∂θ²].

If θ = (θ1, ..., θp) is a vector, then the expected Fisher information matrix of θ through X is defined by

I(θ) = E_{X|θ}[−∂² log f(X|θ)/∂θ∂θ′]

with elements Iij(θ) given by

Iij(θ) = E_{X|θ}[−∂² log f(X|θ)/∂θi∂θj],  i, j = 1, ..., p.
Fisher Information
The information measure defined this way is related to the mean value of the curvature of the likelihood. The larger this curvature, the larger the information content summarized in the likelihood function, and so the larger I(θ) will be. Since the curvature is expected to be negative, the information value is taken as minus the curvature. The expectation is taken with respect to the sample distribution. The observed Fisher information corresponds to minus the second derivative of the log-likelihood:

J_X(θ) = [−∂² log f(X|θ)/∂θ∂θ′]

and is interpreted as a local measure of the information content, while its expected value, the expected Fisher information, is a global measure.
Fisher Information Example
Example 2.8
Let X ∼ N(θ, σ²) with σ² known. It is easy to get I(θ) = J_X(θ) = σ⁻², the normal precision. Verify!

log f(X|θ) = log{(1/(√(2π) σ)) e^(−(X−θ)²/(2σ²))} = const. − (X − θ)²/(2σ²)

d/dθ log f(X|θ) = (2/(2σ²))(X − θ) = (X − θ)/σ²

d²/dθ² log f(X|θ) = −1/σ²

I(θ) = E[−d²/dθ² log f(X|θ)] = E[1/σ²] = 1/σ² = J_X(θ),

i.e. the normal precision.
Fisher Information
One of the most useful properties of the Fisher information is the additivity of the information with respect to independent observations. This means that if X = (X1, . . . , Xn) are independent random variables with densities fi(x|θ), and I and Ii are the expected Fisher information measures obtained through X and Xi, respectively, then

I(θ) = Σ_{i=1}^n Ii(θ).

This states that the total information obtained from independent observations is the sum of the information of the individual observations.
Score Function
Definition 2.9
The score function of X is defined as

U(X; θ) = ∂ log f(X|θ)/∂θ.

One can show that under certain regularity conditions:

I(θ) = E_{X|θ}[U²(X; θ)].

In a large number of situations, θ̂ will, for large n, possess a distribution that is approximately multivariate normal with mean vector θ and covariance matrix I(θ)^{−1}. The vector I(θ)^{1/2}(θ̂ − θ) is said to converge in distribution, as n → ∞ with p fixed, to a standard spherical normal distribution (i.e. a multivariate normal distribution N(0, Ip) with zero mean vector and covariance matrix equal to the p × p identity matrix).
Example: Fisher Info for Binomial
Example 2.10
Let X1, . . . , Xn ∼ Binomial(1, θ). Show that the ML estimate of θ has an asymptotic N(θ, θ(1 − θ)/n) distribution.

Xi|θ iid∼ Binomial(1, θ) with E(Xi) = θ and Var(Xi) = θ(1 − θ).

l(θ; x1, . . . , xn) = ∏_{i=1}^n f(xi|θ) = ∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi} = θ^{Σxi}(1 − θ)^{n−Σxi} = θ^x(1 − θ)^{n−x},

where x = Σ_{i=1}^n xi.

log l(θ; x1, . . . , xn) = x log θ + (n − x) log(1 − θ)
(d/dθ) log l(θ; x1, . . . , xn) = x/θ − (n − x)/(1 − θ) = 0 ⟺ θ = θ̂ = x/n

U(Xi; θ) = (d/dθ) log f(Xi|θ) = Xi/θ − (1 − Xi)/(1 − θ) = (Xi − θ)/(θ(1 − θ))

U²(Xi; θ) = (Xi − θ)²/(θ²(1 − θ)²)

Ii(θ) = E[U²(Xi; θ)] = Var(Xi)/(θ²(1 − θ)²) = θ(1 − θ)/(θ²(1 − θ)²) = 1/(θ(1 − θ))

I(θ) = Σ_{i=1}^n Ii(θ) = n/(θ(1 − θ)).
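The course's computations are in R; as a quick cross-check of this result, the following Python sketch (an addition, not from the lecture) evaluates n · E[U²] exactly by summing the squared score over the two Bernoulli outcomes:

```python
# Check I(theta) = n / (theta (1 - theta)) for n iid Bernoulli(theta) draws:
# the score of one observation is U(x; theta) = (x - theta) / (theta (1 - theta)),
# and the per-observation information is E[U^2] over x in {0, 1}.

def bernoulli_info(theta, n):
    def score(x):
        return (x - theta) / (theta * (1 - theta))
    # E[U^2]: weight the two outcomes by P(X = 0) and P(X = 1)
    per_obs = (1 - theta) * score(0) ** 2 + theta * score(1) ** 2
    return n * per_obs  # additivity over independent observations

theta, n = 0.3, 25
print(bernoulli_info(theta, n), n / (theta * (1 - theta)))  # the two agree
```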
2 Bayesian Inference 2.3 Bayes’ Theorem: Continuous Case
Bayesian Statistical Model
Given data x whose distribution depends on an unknown parameter θ, we require inference about θ. (x and θ can be vectors, but for ease of notation we assume they are 1-dim.)
Definition 2.11
A Bayesian statistical model consists of a parametric statistical model (the “sampling distribution” or “likelihood”), f(x|θ), and a prior distribution on the parameters, f(θ).
Bayes’ theorem
Theorem 2.12 (Continuous version of Bayes’ theorem)
Given a Bayesian statistical model, we can update the prior pdf of θ to the posterior pdf of θ given the data x:

f(θ|x) = f(θ)f(x|θ)/f(x) = f(θ)f(x|θ) / ∫ f(θ)f(x|θ) dθ ∝ prior × likelihood
Essential Distributions
Given a complete Bayesian model, we can construct:

a) the joint distribution of (θ, X),
f(θ, x) = f(x|θ)f(θ);

b) the marginal or prior predictive distribution of X,
f(x) = ∫ f(θ, x) dθ = ∫ f(x|θ)f(θ) dθ;

c) the posterior distribution of θ,
f(θ|x) = f(θ)f(x|θ) / ∫ f(θ)f(x|θ) dθ = f(θ)f(x|θ)/f(x);

d) the posterior predictive distribution for a future obs. Y given x,
f(y|x) = ∫ f(y, θ|x) dθ = ∫ f(y|θ)f(θ|x) dθ.
Presentation of Posterior Distribution
After seeing the data x , what do we now know about the parameter θ?
- plot of the posterior density function
- summary statistics like measures of location and dispersion/precision (analogue to frequentist point estimates: e.g. posterior mean, median, mode)
- hypothesis test, e.g. H0: θ ≤ θ0:
  Pr(H0 true|x) = Pr(θ ≤ θ0|x) = ∫_{−∞}^{θ0} f(θ|x) dθ
- analogue to frequentist confidence intervals: central posterior interval and highest posterior density region.

If F(θ|x) is the posterior cdf and if F(θ1|x) = p1, F(θ2|x) = p2 > p1, then the interval (θ1, θ2] is a posterior interval of θ with coverage probability p2 − p1 (credible interval). If exactly 100(α/2)% of the posterior probability lies above and below the posterior interval, it is called a central posterior interval with coverage probability 1 − α = p2 − p1. It is sometimes desirable to find an interval/region which is as short as possible for a given coverage probability. This is called a highest posterior density (HPD) region.
3 Conjugate Distributions
Conjugate Distributions
The term conjugate refers to cases where the posterior distribution is in the same family as the prior distribution. In Bayesian probability theory, if the posterior distributions f(θ|x) are in the same family as the prior distributions f(θ) for all θ ∈ Θ, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior. The concept, as well as the term "conjugate prior", was introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory (1961).
3 Conjugate Distributions 3.1 Bernoulli Distribution – Discrete Prior
Bernoulli Trials – Discrete Prior
Assume a drug may have response rate θ of 0.2, 0.4, 0.6 or 0.8, each of equal prior probability. If we observe a single positive response (x = 1), how is our prior revised?
Likelihood:
f(x|θ) = θ^x(1 − θ)^{1−x}
f(x = 1|θ) = θ
f(x = 0|θ) = 1 − θ

Posterior:
f(θ|x) = f(x|θ)f(θ) / Σ_j f(x|θj)f(θj) ∝ f(x|θ)f(θ)
Calculating the Posterior
θ     prior f(θ)   likelihood × prior f(x=1|θ)f(θ)   posterior f(θ|x=1)
0.2   0.25         0.2 × 0.25 = 0.05                 0.10
0.4   0.25         0.4 × 0.25 = 0.10                 0.20
0.6   0.25         0.6 × 0.25 = 0.15                 0.30
0.8   0.25         0.8 × 0.25 = 0.20                 0.40
Σ     1.00         0.50                              1.00
Note: a single positive response makes it 4 times as likely that the true response rate is 80% rather than 20%.
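The table arithmetic can be reproduced in a few lines of Python (a sketch added here; the course itself uses R):

```python
# Posterior over the four candidate response rates after one success (x = 1).
thetas = [0.2, 0.4, 0.6, 0.8]
prior = [0.25] * 4
likelihood = thetas                      # f(x = 1 | theta) = theta
unnorm = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnorm)                   # prior predictive P(X = 1) = 0.5
posterior = [u / evidence for u in unnorm]
print(evidence, posterior)               # ≈ 0.5 and [0.1, 0.2, 0.3, 0.4]
```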
Prior Predictive Distribution
With a Bayesian approach, prediction is straightforward. The prior predictive distribution of X is given by:

P(X = 1) = f(x = 1) = Σ_j f(x = 1|θj)f(θj) = 0.5
P(X = 0) = f(x = 0) = 1 − f(x = 1) = 0.5

The prior predictive probability is thus a weighted average of the likelihoods under the 4 possible values of θ:

f(x) = Σ_j wj f(x|θj) with ‘prior weights’ given by wj = f(θj).

Furthermore:

f(x = 1) = Σ_j θj wj = prior mean of θ = E[θ]
Posterior Predictive Distribution
Suppose we wish to predict the outcome of a new observation z, given what we have already observed.

For discrete θ we have the posterior predictive distribution:

f(z|x) = Σ_j f(z, θj|x)

which, since z is usually conditionally independent of x given θ, is generally equal to

f(z|x) = Σ_j f(z|θj, x)f(θj|x) = Σ_j f(z|θj) wj(x)

where the wj(x) = f(θj|x) are ‘posterior weights’.
Posterior Predictive Distribution
Example: The posterior predictive probability that the next treatment is successful:

f(z = 1|x = 1) = Σ_j f(z = 1|θj)f(θj|x) = Σ_j θj f(θj|x) = posterior mean of θ
= 0.2 × 0.1 + 0.4 × 0.2 + 0.6 × 0.3 + 0.8 × 0.4 = 0.6
3 Conjugate Distributions 3.2 Binomial Distribution – Discrete Prior
Binomial response – Discrete Prior
If we observe r responses out of n patients, how is our prior revised?

Likelihood:

f(x = r|θ) = (n choose r) θ^r(1 − θ)^{n−r} ∝ θ^r(1 − θ)^{n−r}

Suppose n = 20, r = 15:

f(x = 15|θ) ∝ θ^{15}(1 − θ)^5
Binomial response – Discrete Prior
θ     prior f(θ)   likelihood × prior f(x=r|θ)f(θ) (×10^{−7})   posterior f(θ|x=r)
0.2   0.25         0.0                                          0.0
0.4   0.25         0.2                                          0.005
0.6   0.25         12.0                                         0.298
0.8   0.25         28.1                                         0.697
Σ     1.00         40.3                                         1.0
Binomial response – discrete Prior
After observing x = 15 successes, what is the posterior predictive probability of a positive response for patient No. 21?

f(z = 1|x = 15) = Σ_i f(z = 1|θi)f(θi|x = 15)
= 0.2 × 0.0 + 0.4 × 0.005 + 0.6 × 0.298 + 0.8 × 0.697
= 0.7384
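Again as an added Python sketch (the lecture's computations are in R), the table and the predictive probability follow from a few lines; the exact values agree with the rounded table entries:

```python
from math import comb

# Discrete prior on theta, binomial likelihood with n = 20, r = 15.
thetas = [0.2, 0.4, 0.6, 0.8]
prior = [0.25] * 4
n, r = 20, 15
lik = [comb(n, r) * t**r * (1 - t)**(n - r) for t in thetas]
unnorm = [l * p for l, p in zip(lik, prior)]
posterior = [u / sum(unnorm) for u in unnorm]
pred = sum(t * w for t, w in zip(thetas, posterior))  # P(z = 1 | x = 15)
print([round(w, 3) for w in posterior])  # ≈ [0.0, 0.005, 0.298, 0.697]
print(round(pred, 3))                    # ≈ 0.738 (slide's 0.7384 uses rounded terms)
```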
Summary and Terminology (Discrete Prior)
Two random variables: X (observable), θ (unobservable).

Let X|θ ∼ Binomial(n, θ) (or Xj|θ ∼ Bernoulli(θ) conditionally independent for j = 1, . . . , n), where the unknown parameter θ can attain I different values θi, with a priori probabilities f(θi), i = 1, . . . , I, respectively.

X|θ ∼ Binomial(n, θ) is called the sampling distribution.

f(θi), i = 1, . . . , I, is called the prior distribution.

The likelihood function:

f(x|θ) = (n choose x) θ^x(1 − θ)^{n−x} ∝ θ^x(1 − θ)^{n−x},  θ = θ1, . . . , θI

NOTE: This is considered as a function of θ only; x is considered fixed.
Summary and Terminology (Discrete Prior)
Prior predictive pdf of X:

f(x) = Σ_{i=1}^I f(x|θi)f(θi) for x = 0, 1, . . . , n

(mean or weighted average of f(x|θ) with weights given by the prior probabilities for θ, f(θi))

Posterior pdf of θ:

f(θi|x) = f(θi)f(x|θi) / Σ_{j=1}^I f(θj)f(x|θj) = f(θi)f(x|θi)/f(x) ∝ f(θi)f(x|θi),  i = 1, . . . , I
Summary and Terminology (Discrete Prior)
Posterior predictive pdf for another future observation Y of the Bernoulli experiment:

f(y|x) = Σ_{i=1}^I f(y|θi)f(θi|x)

(mean or weighted average of f(y|θ) with weights given by the posterior probabilities for θ, f(θi|x))

As Y can attain only the values 0, 1, this gives:

f(1|x) = Σ_{i=1}^I θi f(θi|x) = posterior mean of θ
f(0|x) = 1 − f(1|x)
3 Conjugate Distributions 3.3 Binomial Distribution – Continuous Prior
Binomial Response – Continuous Prior
Data: x successes from n independent trials
Likelihood:
f(x|θ) = (n choose x) θ^x(1 − θ)^{n−x} ∝ θ^x(1 − θ)^{n−x}

Prior: flexible ‘conjugate’ beta family

θ ∼ Beta(α, β)

f(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1}(1 − θ)^{β−1} ∝ θ^{α−1}(1 − θ)^{β−1}
Calculating Posterior
Posterior:

f(θ|x) ∝ f(x|θ)f(θ) ∝ θ^x(1 − θ)^{n−x} θ^{α−1}(1 − θ)^{β−1} = θ^{α+x−1}(1 − θ)^{β+n−x−1},

i.e. θ|x ∼ Beta(α + x, β + n − x).

Note: the Binomial and Beta distributions are conjugate distributions.
Posterior Moments
For a Beta(α, β) distribution:

mode m = (α − 1)/(α + β − 2)
mean µ = α/(α + β)
variance σ² = µ(1 − µ)/(α + β + 1) = αβ/[(α + β)²(α + β + 1)]

Suppose our prior estimate of the response rate is 0.4 with a standard deviation of 0.1. Solving µ = 0.4 and σ² = 0.1² gives α = 9.2, β = 13.8. It is convenient to think of this as equivalent to having observed 9.2 successes in α + β = 23 patients.

            prior   likelihood   posterior
successes   9.2     15           24.2
failures    13.8    5            18.8
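The moment-matching step can be checked with a short Python sketch (added here for illustration):

```python
# Recover (alpha, beta) from the stated prior mean 0.4 and sd 0.1, then
# apply the conjugate update with x = 15 successes out of n = 20.
mu, sd = 0.4, 0.1
s = mu * (1 - mu) / sd**2 - 1            # since var = mu(1 - mu)/(alpha + beta + 1)
alpha, beta = mu * s, (1 - mu) * s
print(alpha, beta)                       # ≈ 9.2 and 13.8
n, x = 20, 15
print(alpha + x, beta + n - x)           # ≈ 24.2 and 18.8
```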
Prior and Posterior Densities
Figure 7: Prior, likelihood, and posterior density of θ.
Prior and Posterior Means and Modes
Compare modes of prior, likelihood and posterior:

prior mode: 8.2/21 = 0.39
mode of likelihood: 15/20 = 0.75
posterior mode: 23.2/41 = 0.57

Compare means of prior, data and posterior:

prior mean: 9.2/23 = 0.4
data mean: 15/20 = 0.75
posterior mean: 24.2/43 = 0.56
Compromise
In general, the posterior mean is a compromise between prior mean and data mean, i.e. for some w, 0 ≤ w ≤ 1:

posterior mean = w × prior mean + (1 − w) × data mean

(x + α)/(n + α + β) = w · α/(α + β) + (1 − w) · x/n

Solving for w:

(x + α)/(n + α + β) = [(α + β)/(n + α + β)] · [α/(α + β)] + [n/(n + α + β)] · (x/n),

i.e. w = (α + β)/(n + α + β).

The prior gets weight (α + β)/(n + α + β) → 0 for n → ∞; the data get weight n/(n + α + β) → 1 for n → ∞.
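With the example's numbers, the identity can be verified directly (an added Python sketch):

```python
# Posterior mean = w * prior mean + (1 - w) * data mean,
# with w = (alpha + beta) / (n + alpha + beta).
alpha, beta, n, x = 9.2, 13.8, 20, 15
w = (alpha + beta) / (n + alpha + beta)
post_mean = (alpha + x) / (alpha + beta + n)
mix = w * alpha / (alpha + beta) + (1 - w) * x / n
print(round(w, 3), round(post_mean, 4), round(mix, 4))  # ≈ 0.535 0.5628 0.5628
```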
Compromise
"A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."
Hypothesis Test
H0 : θ > θ0 = 0.4
Calculate prior and posterior probability of H0:
P(θ > θ0) = ∫_{θ0}^1 f(θ) dθ = 1 − ∫_0^{θ0} f(θ) dθ = 1 − F_{Beta(α,β)}(θ0)

P(θ > θ0|x) = ∫_{θ0}^1 f(θ|x) dθ = 1 − ∫_0^{θ0} f(θ|x) dθ = 1 − F_{Beta(α+x, β+n−x)}(θ0)

For θ0 = 0.4, use R:

> priorprob = 1 - pbeta(0.4, 9.2, 13.8)
> priorprob
[1] 0.4886101
> postprob = 1 - pbeta(0.4, 24.2, 18.8)
> postprob
[1] 0.9842593
Analogue to Confidence Interval
Posterior Credible Interval
A 95% central posterior credible interval for θ is (θl, θu), where

0.95 = ∫_{θl}^{θu} f(θ|x) dθ,

i.e. θl and θu are the 2.5% and 97.5% quantiles of the posterior.

Use R:

> l = qbeta(0.025, 24.2, 18.8)
> l
[1] 0.4142266
> u = qbeta(0.975, 24.2, 18.8)
> u
[1] 0.7058181
Posterior Predictive Distribution
What is the posterior predictive success probability for a further, (n + 1) = 21st patient entering the trial?

P(Xn+1 = 1|x) = ∫_0^1 f(xn+1 = 1|θ) f(θ|x1, . . . , xn) dθ

= ∫_0^1 θ · [Γ(n + α + β)/(Γ(α + x)Γ(β + n − x))] θ^{α+x−1}(1 − θ)^{β+n−x−1} dθ

= [Γ(n + α + β)/(Γ(α + x)Γ(β + n − x))] ∫_0^1 θ^{α+x}(1 − θ)^{β+n−x−1} dθ

= [Γ(n + α + β)/(Γ(α + x)Γ(β + n − x))] · [Γ(α + x + 1)Γ(β + n − x)/Γ(n + α + β + 1)]

= [Γ(n + α + β)/(Γ(α + x)Γ(β + n − x))] · [(α + x)Γ(α + x)Γ(β + n − x)/((n + α + β)Γ(n + α + β))]

= (α + x)/(α + β + n) = (9.2 + 15)/(23 + 20) = 0.562797
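The closed form can be cross-checked against a direct numerical evaluation of the posterior mean (a Python sketch added here; the quadrature is a plain midpoint rule):

```python
from math import gamma

# P(X_{n+1} = 1 | x) = E[theta | x] under the Beta(24.2, 18.8) posterior.
alpha, beta, n, x = 9.2, 13.8, 20, 15
a, b = alpha + x, beta + n - x
const = gamma(a + b) / (gamma(a) * gamma(b))   # Beta normalizing constant
m = 20000
pred = sum(t * const * t**(a - 1) * (1 - t)**(b - 1)
           for t in ((i + 0.5) / m for i in range(m))) / m
print(round(pred, 4), round(a / (a + b), 4))   # both ≈ 0.5628
```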
Posterior Predictive Distribution
If N = 100 further patients enter the trial, what is the posterior predictive distribution of the number of successes? Let Y|θ ∼ Binomial(N, θ). Then for y = 0, 1, . . . , N:

f(y|x) = ∫_0^1 f(y|θ)f(θ|x) dθ

= ∫_0^1 (N choose y) θ^y(1 − θ)^{N−y} · [Γ(n + α + β)/(Γ(α + x)Γ(β + n − x))] θ^{α+x−1}(1 − θ)^{β+n−x−1} dθ

= (N choose y) [Γ(n + α + β)/(Γ(α + x)Γ(β + n − x))] ∫_0^1 θ^{y+α+x−1}(1 − θ)^{N−y+β+n−x−1} dθ

= (N choose y) [Γ(n + α + β)/(Γ(α + x)Γ(β + n − x))] · [Γ(α + x + y)Γ(β + n − x + N − y)/Γ(α + β + n + N)]

This is called a Beta-Binomial distribution.
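As an added Python sketch, the pmf can be evaluated directly from this formula; it sums to 1 and has mean N(α + x)/(α + β + n):

```python
from math import comb, gamma

def beta_binom_pmf(y, N, a, b):
    """P(Y = y) when Y | theta ~ Binomial(N, theta) and theta ~ Beta(a, b)."""
    return (comb(N, y) * gamma(a + b) / (gamma(a) * gamma(b))
            * gamma(a + y) * gamma(b + N - y) / gamma(a + b + N))

a, b, N = 24.2, 18.8, 100            # posterior parameters act as the "prior"
pmf = [beta_binom_pmf(y, N, a, b) for y in range(N + 1)]
print(round(sum(pmf), 6))                               # ≈ 1.0
print(round(sum(y * p for y, p in enumerate(pmf)), 2))  # ≈ N a/(a + b) = 56.28
```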
3 Conjugate Distributions 3.4 Exchangeability
Independence?
A common statement in statistics: assume X1, . . . , Xn are iid random variables.

In Bayesian statistics, we need to think hard about independence. Why?

- Consider two "independent" Bernoulli trials with probability of success θ.
- It is true that
  f(x1, x2|θ) = θ^{x1+x2}(1 − θ)^{2−x1−x2} ∝ f(x1|θ)f(x2|θ),
  so that X1 and X2 are independent given θ.
- But f(x1, x2) = ∫ f(x1, x2|θ)f(θ) dθ may not factor.
Marginal Bivariate Distribution
- If f(θ) = Unif(0, 1), then

  f(x1, x2) = ∫ f(x1, x2|θ)f(θ) dθ = ∫_0^1 θ^{x1+x2}(1 − θ)^{2−x1−x2} dθ = Γ(x1 + x2 + 1)Γ(3 − x1 − x2)/Γ(4).
Exchangeability
If independence is no longer the key, then what is? Exchangeability.

- Informal definition: subscripts don’t matter.
- More formally: given events A1, A2, . . . , An, we say they are exchangeable if
  P(A1, A2, . . . , Ak) = P(Ai1, Ai2, . . . , Aik)
  for every k, where i1, i2, . . . , in are permutations of the indices.
- Similarly, given random variables X1, X2, . . . , Xn, we say that they are exchangeable if
  P(X1 ≤ x1, . . . , Xk ≤ xk) = P(Xi1 ≤ xi1, . . . , Xik ≤ xik)
  for every k.
Relationship between exchangeability and independence
- rv’s that are iid given θ are exchangeable
- an infinite sequence of exchangeable rv’s can always be thought of as iid given some parameter (De Finetti’s theorem)
- note that the previous point requires an infinite sequence

What is not exchangeable?

- time series, spatial data
- these may become exchangeable if we explicitly include time in the analysis, i.e. x1, x2, . . . , xt, . . . are not exchangeable but (t1, x1), (t2, x2), . . . may be
3 Conjugate Distributions 3.5 Sequential Learning
Sequential Inference
Suppose we obtain an observation x1 and form the posterior f(θ|x1) ∝ f(x1|θ)f(θ), and then we obtain a further observation x2 which is conditionally independent of x1 given θ. The posterior based on x1 and x2 is given by:

f(θ|x1, x2) ∝ f(x2|θ, x1) × f(θ|x1) ∝ f(x2|θ) × f(θ|x1)

“Today’s posterior is tomorrow’s prior!”

The resulting posterior is the same as if we had obtained the data x1, x2 together:

f(θ|x1, x2) ∝ f(x1, x2|θ) × f(θ) ∝ f(x2|θ) × f(x1|θ) × f(θ)
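A small Python sketch (added here, with made-up data) illustrates the point for the Beta-Bernoulli model: updating one observation at a time gives the same posterior as processing the whole sample at once.

```python
# Beta-Bernoulli updates: one observation at a time vs. all at once.
def update(a, b, x):                 # today's posterior is tomorrow's prior
    return a + x, b + 1 - x

a, b = 1.0, 1.0                      # Beta(1, 1) prior
data = [1, 0, 1, 1, 0, 1]            # illustrative made-up outcomes
for x in data:
    a, b = update(a, b, x)
batch = (1.0 + sum(data), 1.0 + len(data) - sum(data))
print((a, b), batch)                 # identical: (5.0, 3.0) (5.0, 3.0)
```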
3 Conjugate Distributions 3.6 Comparing Bayesian and Frequentist Inference for Proportion
Comparing Bayesian and Frequentist Inference for Proportion
Frequentist inference is concerned with
- point estimation,
- interval estimation,
- and hypothesis testing.
Point Estimation
A single statistic is calculated from the sample data and used to estimate the unknown parameter. The statistic depends on the random sample, so it is random, and its distribution is called its sampling distribution. We call the statistic an estimator of the parameter, and the value it takes for the actual sample data an estimate. There are various frequentist approaches for finding estimators, such as

- least squares (LS),
- maximum likelihood estimation (MLE), and
- uniformly minimum variance unbiased estimation (UMVUE).

For estimating the binomial parameter θ, the LS, ML and UMVU estimator of the population proportion is the sample proportion.
Bias
From a Bayesian perspective, point estimation means summarizing the posterior distribution by a single statistic, such as the posterior mean, median or mode. Here, we will use the posterior mean as the Bayesian point estimate (it minimizes the posterior mean squared error, which gives it a decision-theoretic justification).

An estimator θ̂ is said to be unbiased if the mean of its sampling distribution is the true parameter, i.e. θ̂ is unbiased if

E[θ̂] = ∫ θ̂ f(θ̂|θ) dθ̂ = θ,

where f(θ̂|θ) is the sampling distribution of the estimator θ̂ given the parameter θ. The bias of an estimator θ̂ is

bias(θ̂) = E[θ̂] − θ.

(Bayes estimators are usually biased.)
Mean Squared Error
An estimator is said to be a minimum variance unbiased estimator if no other unbiased estimator has a smaller variance. However, it is possible that there may be a biased estimator that, on average, is closer to the true value than the unbiased estimator. We need to look at the possible trade-off between bias and variance. The (frequentist) mean squared error of an estimator θ̂ is the average squared distance of the estimator from the true value:

MS(θ̂) = E[(θ̂ − θ)²] = ∫ (θ̂ − θ)² f(θ̂|θ) dθ̂.

One can show that

MS(θ̂) = bias(θ̂)² + Var(θ̂).

Thus, it gives a better frequentist criterion for judging estimators than the bias or the variance alone.
MSE Comparison
We will now compare the mean squared error of the Bayesian and the frequentist estimator of the population proportion θ. The frequentist estimator for θ is

θ̂f = X/n,

where X, the number of successes in n trials, has the Binomial(n, θ) distribution with mean and variance given by

E(X) = nθ and Var(X) = nθ(1 − θ).

Thus,

E[θ̂f] = θ,
Var(θ̂f) = θ(1 − θ)/n,
MS(θ̂f) = 0² + θ(1 − θ)/n.
MSE Comparison
Suppose we use the posterior mean as the Bayesian estimate for θ, with the Beta(1, 1) (uniform) prior. Then

θ̂B = (1 + x)/(n + 2) = x/(n + 2) + 1/(n + 2).

Thus, the mean of its sampling distribution is

E[θ̂B] = nθ/(n + 2) + 1/(n + 2)

and the variance of its sampling distribution is

Var(θ̂B) = nθ(1 − θ)/(n + 2)².

Hence, the mean squared error is

MS(θ̂B) = (nθ/(n + 2) + 1/(n + 2) − θ)² + nθ(1 − θ)/(n + 2)²
= [(1 − 2θ)/(n + 2)]² + nθ(1 − θ)/(n + 2)².
For example, suppose θ = 0.4 and the sample size is n = 10. Then

MS(θ̂f) = 0.4 × 0.6/10 = 0.024 and MS(θ̂B) = 0.0169.

Next, suppose θ = 0.5 and n = 10. Then

MS(θ̂f) = 0.025 and MS(θ̂B) = 0.01736.
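These numbers follow directly from the two formulas (an added Python sketch):

```python
def mse_freq(theta, n):
    """MSE of the sample proportion X/n (unbiased, so just its variance)."""
    return theta * (1 - theta) / n

def mse_bayes(theta, n):
    """MSE of the posterior mean (1 + x)/(n + 2) under a Beta(1, 1) prior."""
    bias = (1 - 2 * theta) / (n + 2)
    return bias**2 + n * theta * (1 - theta) / (n + 2)**2

print(round(mse_freq(0.4, 10), 4), round(mse_bayes(0.4, 10), 4))   # 0.024 0.0169
print(round(mse_freq(0.5, 10), 5), round(mse_bayes(0.5, 10), 5))   # 0.025 0.01736
```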
Figure 8 shows the mean squared error for the Bayesian and the frequentist estimator as a function of θ. Over most (but not all) of the range, the Bayesian estimator (using the uniform prior) performs better than the frequentist estimator.
Figure 8: Mean squared error for the two estimates.
Interval Estimation
The aim is to find an interval (l, u) that has a predetermined probability of containing the parameter:

P(l ≤ θ ≤ u) = 1 − α.

In the frequentist interpretation, the parameter is fixed but unknown and, before the sample is taken, the interval endpoints are random because they depend on the data. After the sample is taken and the endpoints are calculated, there is nothing random, so the interval is called a confidence interval for the parameter. Under the frequentist paradigm, the correct interpretation of a (1 − α) × 100% confidence interval is that (1 − α) × 100% of the random intervals calculated this way will contain the true value. Often, the sampling distribution of the estimator is approximately normal or t_{n−1} distributed with mean equal to the true value.
Confidence – Credible Interval
In this case, the confidence interval has the form

estimator ± critical value × standard deviation of the estimator,

where the critical value comes from the normal or t table. For the sample proportion, an approximate (1 − α) × 100% confidence interval for θ is given by:

θ̂f ± t_{n−1}(α/2) √(θ̂f(1 − θ̂f)/n).

A Bayesian credible interval for the parameter θ, on the other hand, has the natural interpretation that we want. Because it is found from the posterior distribution of θ, it has the coverage probability we want for this specific data.
Example: Interval Estimation
Example 3.1
Out of a random sample of 100 Hamilton residents, x = 26 said they support a casino in Hamilton. Compare the frequentist 95% confidence interval with the Bayesian credible interval (using a uniform prior).

Frequentist 95% confidence interval:

0.26 ± 1.96 × √(0.26 × 0.74/100) = (0.174, 0.346)

Bayesian 95% credible interval:
prior: Beta(1, 1); posterior: Beta(1 + 26, 1 + 74) = Beta(27, 75)

> lu = qbeta(c(0.025, 0.975), 27, 75)
> lu
[1] 0.1841349 0.3540134
Hypothesis Testing
Example 3.2
Suppose we wish to determine whether a new treatment is better than the standard treatment. If so, θ, the proportion of patients who benefit from the new treatment, should be higher than θ0, the proportion who benefit from the standard treatment. It is known from historical records that θ0 = 0.6. A random group of 10 patients is given the new treatment. X, the number who benefit from the treatment, will be Binomial(10, θ). We observe that x = 8 patients benefit. This is better than we would expect if θ = 0.6. But is it sufficiently better for us to conclude that θ > 0.6 at the 5% level of significance? The following table gives the null distribution of X:

x        0      1      2      3      4      5      6      7      8      9      10
f(x|θ0)  .0001  .0016  .0106  .0425  .1115  .2007  .2508  .2150  .1209  .0403  .0060
Frequentist Test
H0: θ ≤ 0.6 vs. H1: θ > 0.6

Under H0: X|θ = 0.6 ∼ Binomial(10, 0.6)

P-value = P(X ≥ 8|H0 true) = P(X ≥ 8|θ = 0.6) = 1 − pbinom(7, 10, 0.6)
= 0.1209 + 0.0403 + 0.0060 = 0.1672 > 0.05 ⟹ do not reject H0
Bayesian Test
prior: Beta(1, 1)
data: x = 8, n − x = 2
posterior: Beta(9, 3)

P(H0|x = 8) = P(θ ≤ 0.6|x = 8) = pbeta(0.6, 9, 3) = 0.1189
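Both numbers can be reproduced without R (a Python sketch added here): the frequentist tail probability by direct binomial summation, and the Bayesian posterior probability by numerically integrating the Beta(9, 3) density (standing in for pbeta(0.6, 9, 3)).

```python
from math import comb, gamma

# Frequentist P-value: P(X >= 8 | theta = 0.6), X ~ Binomial(10, 0.6)
pval = sum(comb(10, k) * 0.6**k * 0.4**(10 - k) for k in range(8, 11))
print(round(pval, 4))                 # ≈ 0.1673 (slide sums rounded terms: 0.1672)

# Bayesian: P(theta <= 0.6 | x = 8) under Beta(9, 3), via a midpoint sum
c = gamma(12) / (gamma(9) * gamma(3))  # Beta(9, 3) normalizing constant
m = 100000
post = sum(c * t**8 * (1 - t)**2
           for t in ((i + 0.5) * 0.6 / m for i in range(m))) * 0.6 / m
print(round(post, 4))                 # ≈ 0.1189
```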
3 Conjugate Distributions 3.7 Exponential Distribution
Exponential data
The exponential distribution is commonly used to model “waiting times” and other continuous positive real-valued random variables, usually measured on a time scale. The sampling distribution of an outcome x, given parameter θ, is

f(x|θ) = θ exp(−xθ), for x > 0.

The exponential distribution is a special case of the Gamma distribution, with parameters (α, β) = (1, θ).
Gamma Prior
Let X1, . . . , Xn be iid Exponential(θ) random variables.

Likelihood:

f(x|θ) ∝ θ^n exp(−nx̄θ)

Conjugate Gamma(α, β) prior:

f(θ) = [β^α/Γ(α)] θ^{α−1} exp(−βθ)

Posterior density:

f(θ|x) ∝ θ^{n+α−1} exp(−θ(nx̄ + β)), i.e. θ|x ∼ Gamma(α + n, β + nx̄)
Exponential Example
Example 3.3
Let Yi, i = 1, . . . , n, be iid exponentially distributed.

i) Using a conjugate Gamma(α, β) prior distribution, derive the posterior mean, variance, and mode of θ. For which values of α and β does the posterior mode coincide with the ML estimate of θ?

ii) What is the posterior density of the mean φ = 1/θ? Which distribution is conjugate for φ?
Exponential Example
iii) The length of life of a light bulb manufactured by a certain process has an exponential distribution with unknown rate θ. Suppose the prior distribution for θ is a Gamma distribution with coefficient of variation 0.5. (The coefficient of variation is defined as the standard deviation divided by the mean.) A random sample of light bulbs is to be tested and the lifetime of each obtained. If the coefficient of variation of the distribution of θ is to be reduced to 0.1, how many light bulbs need to be tested?

iv) In part iii), if the coefficient of variation refers to φ instead of θ, how would your answer change?
3 Conjugate Distributions 3.8 Poisson Distribution
Poisson Data
Let X be the number of times a certain event occurs in a unit interval of time, and assume the following conditions hold:

- The events are occurring at a constant average rate of θ per unit time.
- The number of events in any one interval of time is statistically independent of the number in any other nonoverlapping interval.
- The probability of more than one event occurring in an interval of length d goes to zero as d goes to zero.

Any process producing events which satisfies the above three axioms is called a Poisson process, and X, the number of events in a unit time interval, is distributed as Poisson(θ).
Gamma Prior
Let X be a Poisson(θ) random variable, and suppose we observe X = x.

Likelihood:

f(x|θ) = θ^x e^{−θ}/x! ∝ θ^x e^{−θ}

Conjugate Gamma(α, β) prior:

f(θ) = [β^α/Γ(α)] θ^{α−1} exp(−βθ) ∝ θ^{α−1} exp(−βθ)
Calculating Posterior
Posterior density:

f(θ|x) ∝ f(θ)f(x|θ) ∝ θ^{α−1}e^{−βθ} · θ^x e^{−θ} = θ^{α+x−1}e^{−θ(β+1)},

i.e. the pdf of Gamma(α + x, β + 1).
Prior Predictive Distribution
Prior predictive distribution for X:

f(x) = ∫_0^∞ f(x|θ)f(θ) dθ

= ∫_0^∞ [θ^x e^{−θ}/x!] · [β^α/Γ(α)] θ^{α−1}e^{−βθ} dθ

= [β^α/Γ(α)] (1/x!) ∫_0^∞ θ^{α+x−1}e^{−(β+1)θ} dθ

= [β^α/Γ(α)] (1/x!) · Γ(α + x)/(β + 1)^{α+x}

= [β^α/((β + 1)^α(β + 1)^x)] · (α + x − 1)!/[(α − 1)! x!]

= (β/(β + 1))^α (1/(β + 1))^x ((α + x − 1) choose x)
Negative Binomial
i.e. X ∼ Negative-Binomial(α, β): the number of Bernoulli failures obtained before the α-th success when the success probability is p = β/(β + 1). This shows that

Neg-bin(x|α, β) = ∫ Poisson(x|θ) Gamma(θ|α, β) dθ.
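The identity can be verified numerically for illustrative parameter values (a Python sketch added here; α is taken as an integer so the binomial coefficient applies, and the integral is truncated at a point where the integrand is negligible):

```python
from math import comb, exp, factorial, gamma

def neg_bin(x, a, b):
    """Neg-bin(x | alpha, beta) as on the slide, for integer alpha."""
    return (b / (b + 1))**a * (1 / (b + 1))**x * comb(a + x - 1, x)

def prior_predictive(x, a, b, m=40000, upper=30.0):
    """Midpoint-rule approximation of the Poisson-Gamma mixture integral."""
    h = upper / m
    total = 0.0
    for i in range(m):
        t = (i + 0.5) * h
        poisson = t**x * exp(-t) / factorial(x)
        gam = b**a / gamma(a) * t**(a - 1) * exp(-b * t)
        total += poisson * gam
    return total * h

a, b = 3, 2.0                        # illustrative (hypothetical) values
for x in range(4):
    print(x, round(neg_bin(x, a, b), 6), round(prior_predictive(x, a, b), 6))
```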
Multiple Poisson Data
Now let X1, . . . , Xn be iid Poisson(θ) random variables, and suppose we observe x = (x1, . . . , xn).
Likelihood:
f(x|θ) = ∏_{i=1}^{n} f(xi|θ) = ∏_{i=1}^{n} θ^{xi} e^{−θ} / xi! = [1/∏_{i=1}^{n} xi!] θ^{Σ xi} e^{−nθ} ∝ θ^{n x̄} e^{−nθ}
Conjugate Gamma(α, β) prior:
f (θ) ∝ θα−1 exp(−βθ)
Posterior density:
f(θ|x) ∝ f(θ) f(x|θ) ∝ θ^{α−1} e^{−βθ} · θ^{n x̄} e^{−nθ} ∝ θ^{α+n x̄−1} e^{−(β+n)θ}
i.e. the pdf of a Gamma(α + n x̄, β + n) distribution.
Poisson Example
Example 3.4
Suppose that causes of death are reviewed in detail for a city in the US for a single year. It is found that 3 persons, out of a population of 200,000, died of asthma, giving a crude estimated asthma mortality rate in the city of 1.5 per 100,000 persons per year. A Poisson sampling model is often used for epidemiological data of this form. Let θ represent the true underlying long-term asthma mortality rate in the city (measured in cases per 100,000 persons per year). Reviews of asthma mortality rates around the world suggest that mortality rates above 1.5 per 100,000 people are rare in Western countries, with typical asthma mortality rates around 0.6 per 100,000.
a) Construct a conjugate prior density and derive the posterior distribution of θ.
b) What is the posterior probability that the long-term death rate from asthma in the city is more than 1.0 per 100,000 per year?
c) What is the posterior predictive distribution of a future observation Y?
d) To consider the effect of additional data, suppose that ten years of data are obtained for the city in this example, with y = 30 deaths over 10 years. Assuming the population is constant at 200,000, and assuming the outcomes in the ten years are independent with constant long-term rate θ, derive the posterior distribution of θ.
e) What is the posterior probability that the long-term death rate from asthma in the city is more than 1.0 per 100,000 per year, given the ten years of data?
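Parts a), b), d), e) can be sketched numerically. The prior below, Gamma(3, 5) (prior mean 3/5 = 0.6, matching the "typical" rate), is one plausible choice, not necessarily the lecture's; exposure is measured in units of 100,000 person-years, so one year of data on 200,000 people contributes exposure 2. Since the posterior shapes are integers, P(θ > 1) can be computed exactly via the Erlang survival function (Python sketch):

```python
import math

def p_theta_gt_1(shape, rate):
    """P(theta > 1) for Gamma(shape, rate) with integer shape:
    equals P(N < shape) for N ~ Poisson(rate)."""
    return sum(math.exp(-rate) * rate ** k / math.factorial(k)
               for k in range(shape))

a0, b0 = 3, 5.0                  # hypothetical Gamma(3, 5) prior, mean 0.6

# a)/b): x = 3 deaths in one year, exposure 2.0 -> posterior Gamma(6, 7)
a1, b1 = a0 + 3, b0 + 2.0
print(a1 / b1, round(p_theta_gt_1(a1, b1), 3))

# d)/e): y = 30 deaths in ten years, exposure 20.0 -> posterior Gamma(33, 25)
a10, b10 = a0 + 30, b0 + 20.0
print(a10 / b10, round(p_theta_gt_1(a10, b10), 3))
```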
3 Conjugate Distributions 3.9 Normal Distribution
Normal data, known variance, single data
A random variable X has a Normal distribution with mean µ and variance σ² if X has a continuous distribution with pdf
f(x) = [1/(√(2π) σ)] exp[−(1/2) ((x − µ)/σ)²]   for −∞ < x < ∞.
Normal Example
Example 3.5
According to Kennett and Ross (1983), Geochronology, London: Longmans, the first apparently reliable datings for the age of Ennerdale granophyre were obtained from the K/Ar method (which depends on observing the relative proportions of potassium-40 and argon-40 in the rock) in the 1960s and early 1970s, and these resulted in an estimate of 370 ± 20 million years. Later in the 1970s, measurements based on the Rb/Sr method (depending on the relative proportions of rubidium-87 and strontium-87) gave an age of 421 ± 8 million years. It appears that the errors marked are meant to be standard deviations, and it seems plausible that the errors are normally distributed. A scientist who had the K/Ar measurements available in the early 1970s could use them as the basis of her prior beliefs about the age of these rocks.
Normal Prior
Likelihood: X|µ ∼ N(µ, σ²), σ² known:
f(x|µ) = [1/(√(2π) σ)] exp(−(x − µ)²/(2σ²)) ∝ exp(−(x − µ)²/(2σ²))
Conjugate prior: µ ∼ N(µ0, σ0²), where µ0 and σ0² are hyperparameters:
f(µ) = [1/(√(2π) σ0)] exp(−(µ − µ0)²/(2σ0²)) ∝ exp(−(µ − µ0)²/(2σ0²))
Calculating Posterior
Posterior: µ|x ∼ N(µ1, σ1²), where
µ1 = (µ0/σ0² + x/σ²) / (1/σ0² + 1/σ²)
1/σ1² = 1/σ0² + 1/σ²
NB: posterior precision = prior precision + data precision;
posterior mean = weighted average of prior mean and observation.
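With the Example 3.5 numbers — K/Ar prior N(370, 20²), Rb/Sr observation x = 421 with σ = 8 — the precision-weighted update can be computed directly (Python sketch alongside the course's R):

```python
mu0, s0 = 370.0, 20.0   # prior: K/Ar estimate 370 +/- 20 million years
x, s = 421.0, 8.0       # observation: Rb/Sr estimate 421 +/- 8

post_prec = 1 / s0**2 + 1 / s**2          # posterior precision adds
s1_sq = 1 / post_prec                     # posterior variance
mu1 = (mu0 / s0**2 + x / s**2) * s1_sq    # precision-weighted mean
print(round(mu1, 2), round(s1_sq**0.5, 2))   # 413.97 7.43
```

The far more precise Rb/Sr observation dominates: the posterior mean sits much closer to 421 than to 370.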
In detail:
f(µ|x) ∝ f(x|µ) f(µ)
  ∝ exp(−(1/2)[(x − µ)²/σ² + (µ − µ0)²/σ0²])
  ∝ exp(−(1/(2σ²))(x² − 2xµ + µ²) − (1/(2σ0²))(µ² − 2µµ0 + µ0²))
  ∝ exp(−(1/2)[µ²(1/σ² + 1/σ0²) − 2µ(x/σ² + µ0/σ0²) + const.])
  ∝ exp(−(1/2)(1/σ² + 1/σ0²)[µ² − 2µ (x/σ² + µ0/σ0²)/(1/σ² + 1/σ0²)] + const.)
  ∝ exp(−(1/2)(1/σ² + 1/σ0²)[µ − (x/σ² + µ0/σ0²)/(1/σ² + 1/σ0²)]²)
which is the kernel of the N(µ1, σ1²) density.
Posterior Mean Expressions
Alternative expressions for the posterior mean:
µ1 = µ0 + (x − µ0) · σ0²/(σ0² + σ²)   (prior mean adjusted towards the observed value)
µ1 = x − (x − µ0) · σ²/(σ0² + σ²)   (data shrunk towards the prior mean)
Prior Predictive Distribution
Prior predictive distribution of X: X ∼ N(µ0, σ² + σ0²). Because:
f(x) = ∫ f(x|µ) f(µ) dµ
f(x, µ) = f(x|µ) f(µ) ∝ exp(−(x − µ)²/(2σ²) − (µ − µ0)²/(2σ0²))
i.e. (X, µ) have a bivariate normal distribution, so the marginal distribution of X is normal. Now:
E[X] = E[E[X|µ]] = E[µ] = µ0
Var(X) = E[Var(X|µ)] + Var(E[X|µ]) = E[σ²] + Var(µ) = σ² + σ0²
Reminder: Conditional Mean and Variance
If U and V are random variables, then
E [U] = E [E [U|V ]]
Var(U) = E [Var(U|V )] + Var(E [U|V ])
Posterior Predictive Distribution
Posterior predictive distribution of a future Y: Y|x ∼ N(µ1, σ² + σ1²). Because:
f(y|x) = ∫ f(y|µ) f(µ|x) dµ
f(y, µ|x) = f(y|µ) f(µ|x) ∝ exp(−(y − µ)²/(2σ²) − (µ − µ1)²/(2σ1²))
i.e. (Y, µ)|x have a bivariate normal distribution, so the marginal distribution of Y|x is normal. Now:
E[Y|x] = E[E[Y|µ]|x] = E[µ|x] = µ1
Var(Y|x) = E[Var(Y|µ)|x] + Var(E[Y|µ]|x) = E[σ²|x] + Var(µ|x) = σ² + σ1²
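This mixture structure is exactly how one simulates from the posterior predictive: draw µ from the posterior, then Y given µ. A Monte Carlo sketch (the values µ1 ≈ 414, σ1 ≈ 7.4 are the Example 3.5 posterior, rounded):

```python
import random

random.seed(42)
mu1, s1 = 414.0, 7.4   # posterior for Example 3.5 (rounded)
s = 8.0                # known sampling standard deviation

# mixture simulation: mu ~ N(mu1, s1^2), then Y | mu ~ N(mu, s^2)
ys = [random.gauss(random.gauss(mu1, s1), s) for _ in range(200_000)]

m = sum(ys) / len(ys)
v = sum((y - m) ** 2 for y in ys) / (len(ys) - 1)
print(round(m, 1), round(v, 1))   # mean ~ mu1, variance ~ s^2 + s1^2 = 118.76
```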
Normal Example
Now back to Example 3.5:
Single Normal observation, Normal prior
Figure 9: Conjugate Normal prior and single observation.
Normal data, known variance, multiple data
Example 3.6
What is now called the National Institute of Standards and Technology (NIST) in Washington DC conducts extremely high precision measurement of physical constants, such as the actual weight of so-called check-weights that are supposed to serve as reference standards (like the official kg). In 1962-63, for example, n = 100 weighings of a block of metal called NB10, which was supposed to weigh exactly 10g, were made under conditions as close to iid as possible. The 100 measurements x1, . . . , xn (the units are micrograms below 10g) have a mean of x̄ = 404.6 and a SD of s = 6.5.
Normal Example
weight  frequency    weight  frequency
375     1            406     12
392     1            407     8
393     1            408     5
397     1            409     5
398     2            410     4
399     7            411     1
400     4            412     3
401     12           413     1
402     8            415     1
403     6            418     1
404     9            423     1
405     5            437     1
Normal Example
Questions:
1. How much does NB10 really weigh?
2. How certain are you, given the data, that the true weight of NB10 is less than 405.25 µg below 10g?
3. What is the underlying accuracy of the NB10 measuring process?
4. How accurately can you predict the 101st measurement?
A Normal qq-plot shows that a Normal sampling distribution is appropriate. We first assume that σ² is known.
Calculating Posterior
Likelihood: Xi|µ iid∼ N(µ, σ²), i = 1, . . . , n, σ² known
Conjugate prior: µ ∼ N(µ0, σ0²), with hyperparameters µ0, σ0²
Posterior: µ|x ∼ N(µn, σn²), where
µn = (µ0/σ0² + n x̄/σ²) / (1/σ0² + n/σ²)
1/σn² = 1/σ0² + n/σ²
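For the NB10 data (n = 100, x̄ = 404.6, σ taken as 6.5), any reasonable prior gets swamped. With a hypothetical N(400, 10²) prior (an illustration, not the lecture's choice), the posterior is driven almost entirely by the data (Python sketch):

```python
mu0, s0 = 400.0, 10.0          # hypothetical prior, for illustration only
n, xbar, s = 100, 404.6, 6.5   # NB10 summaries (Example 3.6), sigma known

post_prec = 1 / s0**2 + n / s**2           # prior precision + n * data precision
sn_sq = 1 / post_prec
mu_n = (mu0 / s0**2 + n * xbar / s**2) * sn_sq
print(round(mu_n, 2), round(sn_sq, 4))
```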
Calculating Posterior
Why?
Reduction to the case of a single data point of the previous section:
If X1, . . . , Xn|µ iid∼ N(µ, σ²), the likelihood is
f(x1, . . . , xn|µ) = ∏_{i=1}^{n} f(xi|µ)
  = ∏_{i=1}^{n} [1/(√(2π) σ)] exp[−(1/2)((xi − µ)/σ)²]
  = const. × exp[−(1/2) Σ_{i=1}^{n} ((xi − µ)/σ)²]
  ∝ . . . ∝ exp[−(1/2)((x̄ − µ)/(σ/√n))²]
Calculating Posterior
∝ f(x̄|µ)
The likelihood depends on the data x1, . . . , xn only through the sufficient statistic x̄, and X̄|µ ∼ N(µ, σ²/n).
Thus, in the previous section 3.9, simply substitute σ²/n for σ² and x̄ for x.
Remarks
1. If σ0² = σ², then
   µn = (µ0 + n x̄)/(n + 1) = (µ0 + Σ xi)/(n + 1),   1/σn² = (n + 1)/σ²
   i.e. the prior has the weight of one additional observation with value µ0.
2. If n is large, the posterior is determined by x̄ and σ².
3. If σ0² → ∞ (diffuse prior) and n is fixed, then
   µ|x ∼ N(x̄, σ²/n)
   posterior mean = MLE.
4. The prior information is equivalent to σ²/σ0² additional observations all equal to µ0, since
   µn = (µ0/σ0² + n x̄/σ²) / (1/σ0² + n/σ²) = [(σ²/σ0²) µ0 + Σ xi] / [(σ²/σ0²) + n]
Back to Normal Example
Multiple Normal observations, Normal prior
Figure 10: Conjugate Normal prior and several observations.
Normal data, known variance, noninformative prior
Example 3.7
Changes in blood pressure (in mmHg) were recorded for each of 100 patients, where negative numbers are decreases while on the drug and positive numbers are increases:
+3.7, −6.7, −10.5, . . . , −16.7, −7.2
with sample mean x̄ = −7.99 and standard deviation s = 4.33.
We will assume that the change in blood pressure X has a Normal distribution with unknown mean µ and known variance σ² = 4.33².
Example
Let us assume that we don't know anything about the mean change in blood pressure induced by the new drug, and thus assume that µ can attain any real value with equal probability. This gives a flat prior distribution for µ on (−∞, ∞), i.e.
f(µ) ∝ 1.
(There is no "proper" continuous uniform distribution on (−∞, ∞), but you can think of µ as being uniform on some finite interval (−a, a) for some large a, and ignore the normalization constant, as it is not needed for the application of Bayes' theorem.)
What is the posterior distribution of µ?
Calculating Posterior
Posterior pdf:
f(µ|x) ∝ prior × likelihood
  ∝ f(µ) f(x|µ)
  ∝ 1 × exp[−(1/2)((x̄ − µ)/(σ/√n))²]
  ∝ exp[−(1/2)((µ − x̄)/(σ/√n))²]
  ∝ pdf of Normal(x̄, σ²/n)
Simple Updating Rule
If Xi iid∼ Normal(µ, σ²), i = 1, . . . , n, and a flat prior is used, then the posterior distribution of µ|x is Normal(θn, σn²) with
θn = x̄ and σn² = σ²/n.
In Example 3.7:
θn = −7.99
σn² = 4.33²/100 = 0.187489
Credible Intervals
95% posterior probability interval for µ:
µL = 2.5% quantile of N(−7.99, 0.187489)
µU = 97.5% quantile of N(−7.99, 0.187489)
In R:
> lu = qnorm(c(0.025, 0.975), -7.99, sqrt(0.187489))
> lu
[1] -8.838664 -7.141336
Hypothesis Test
Test the null hypothesis H0 : µ ≤ −7.0.
P(H0|x) = P(µ ≤ −7.0|x)
In R:
> p = pnorm(-7, -7.99, sqrt(0.187489))
> p
[1] 0.9888838
2-Parameter Normal with Conjugate Prior
prior distribution:
µ|σ² ∼ N(µ0, σ²/κ0)
σ² ∼ Inv-χ²(ν0, σ0²)
where Inv-χ²(ν0, σ0²) denotes the scaled inverse χ²-distribution with scale σ0² and ν0 degrees of freedom, i.e. the distribution of ν0σ0²/Z where Z is a χ² random variable with ν0 degrees of freedom.
Joint prior density:
f(µ, σ²) ∝ σ^{−1} (σ²)^{−(ν0/2+1)} exp(−(1/(2σ²))[ν0σ0² + κ0(µ0 − µ)²])
i.e. (µ, σ²) ∼ N-Inv-χ²(µ0, σ0²/κ0; ν0, σ0²)
joint posterior density:
f(µ, σ²|x) ∝ σ^{−1} (σ²)^{−(ν0/2+1)} exp(−(1/(2σ²))[ν0σ0² + κ0(µ0 − µ)²])
        × (σ²)^{−n/2} exp(−(1/(2σ²))[(n − 1)s² + n(x̄ − µ)²])
i.e. (µ, σ²)|x ∼ N-Inv-χ²(µn, σn²/κn; νn, σn²), where
µn = [κ0/(κ0 + n)] µ0 + [n/(κ0 + n)] x̄
κn = κ0 + n,   νn = ν0 + n
νnσn² = ν0σ0² + (n − 1)s² + [κ0 n/(κ0 + n)] (x̄ − µ0)².
conditional posterior of µ:
µ|σ², x ∼ N(µn, σ²/κn) = N( (κ0 µ0/σ² + n x̄/σ²) / (κ0/σ² + n/σ²), 1/(κ0/σ² + n/σ²) )
marginal posterior of σ2:
σ²|x ∼ Inv-χ²(νn, σn²)
marginal posterior of µ:
f(µ|x) ∝ [1 + κn(µ − µn)²/(νnσn²)]^{−(νn+1)/2}
i.e. µ|x ∼ tνn(µn, σn²/κn)
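Sampling from this joint posterior takes two lines: draw σ² as νnσn²/Z with Z ∼ χ²_{νn} (a Gamma(νn/2, scale 2) draw), then µ|σ² ∼ N(µn, σ²/κn). A Python sketch with illustrative hyperparameter values (not from the lecture):

```python
import random

random.seed(0)

def draw_mu_sigma2(mu_n, kappa_n, nu_n, s2_n):
    """One draw from N-Inv-chi^2(mu_n, s2_n/kappa_n; nu_n, s2_n)."""
    z = random.gammavariate(nu_n / 2.0, 2.0)   # chi^2_{nu_n} draw
    sigma2 = nu_n * s2_n / z                   # scaled inverse chi^2
    mu = random.gauss(mu_n, (sigma2 / kappa_n) ** 0.5)
    return mu, sigma2

# Illustrative posterior hyperparameters: mu_n=10, kappa_n=25, nu_n=20, s2_n=4
draws = [draw_mu_sigma2(10.0, 25.0, 20.0, 4.0) for _ in range(100_000)]
mean_mu = sum(m for m, _ in draws) / len(draws)
mean_s2 = sum(v for _, v in draws) / len(draws)
# E[mu] = mu_n; E[sigma^2] = nu*s2/(nu - 2) for nu > 2
print(round(mean_mu, 2), round(mean_s2, 2))
```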
3 Conjugate Distributions 3.10 Normal Linear Regression
Normal Linear Regression
This can be extended to linear regression models:
Sampling distribution:
Yi|µi, σ² ∼ N(µi, σ²), i = 1, . . . , n,
with µi = β0 + β1 xi1 + · · · + βp−1 xi,p−1 = x′i β
or in matrix notation with n × p design matrix X (with rows xi ):
Y|β, σ2 ∼ Nn(Xβ, σ2In)
where
X = [ 1  x11  x12  · · ·  x1,p−1
      1  x21  x22  · · ·  x2,p−1
      ⋮
      1  xn1  xn2  · · ·  xn,p−1 ],   β = (β0, β1, . . . , βp−1)′.
Conjugate Normal-Inverse-Gamma Prior
The multivariate normal-inverse gamma prior distribution (β, σ²) ∼ NIG(µβ, V, a, b) is conjugate and can be specified as:
β|σ² ∼ Np(µβ, σ²V) and σ² ∼ Inv-Gamma(a, b).
The posterior is NIG(β̃, Σ, ã, b̃) with
β̃ = Σ(X′y + V⁻¹µβ)
Σ = (X′X + V⁻¹)⁻¹
ã = n/2 + a
b̃ = SS/2 + b
SS = y′y − β̃′Σ⁻¹β̃ + µβ′V⁻¹µβ.
Weighted Average
β̃ can be written as a weighted average of the prior mean and the MLE, as in the univariate normal case:
β̃ = Wβ̂ + (Ip − W)µβ with W = (X′X + V⁻¹)⁻¹X′X
where β̂ = (X′X)⁻¹X′y is the MLE.
The marginal posterior distribution of β is a multivariate Student distribution. For details, see Bernardo and Smith (1994).
The marginal posterior distribution of σ² is an Inverse Gamma distribution with the parameters ã and b̃ above.
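For a single predictor, the posterior mean β̃ is a 2 × 2 computation that can be done by hand. The sketch below uses a near-diffuse prior (µβ = 0, V = 10⁸ I), so β̃ should essentially reproduce the OLS fit; the data are chosen so OLS is exact (y = 1 + 2x):

```python
# beta_tilde = (X'X + V^{-1})^{-1} (X'y + V^{-1} mu_beta), with mu_beta = 0
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]          # exactly y = 1 + 2x, so OLS gives (1, 2)
v_inv = 1e-8                  # diagonal of V^{-1} (near-diffuse prior)

n = len(xs)
sx = sum(xs); sxx = sum(x * x for x in xs)
sy = sum(ys); sxy = sum(x * y for x, y in zip(xs, ys))

# 2x2 system (X'X + V^{-1}) beta = X'y, solved by Cramer's rule
a11, a12, a22 = n + v_inv, sx, sxx + v_inv
det = a11 * a22 - a12 * a12
b0 = (a22 * sy - a12 * sxy) / det   # intercept component of beta_tilde
b1 = (a11 * sxy - a12 * sy) / det   # slope component of beta_tilde
print(round(b0, 4), round(b1, 4))   # ~1.0 and ~2.0
```

Shrinking V toward 0 (a strong prior) pulls β̃ toward µβ instead, in line with the weighted-average formula above.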
4 WinBUGS Applications 4.1 WinBUGS Handouts
WinBUGS Applications: Overview
Calculation of the posterior distribution is difficult in situations with:
I nonconjugate priors
I multiple parameters
as we need to calculate summary statistics, like mean and variance, and, in high-dimensional problems, marginal posterior distributions. All this involves integration, which has been a very big hurdle for Bayesian inference in the past.
For low parameter dimensions, say 2, 3, 4, 5, numerical integration techniques, asymptotic approximations etc. may be used, but these break down for higher dimensions.
The most successful approach, for reasons that we will discuss in the subsequent sections, is based on simulation. That means, instead of explicitly calculating the posterior and performing integrations, we generate a sample from the posterior distribution and use that sample to approximate any quantity of interest, e.g. approximate the posterior mean by the sample mean etc.
A very versatile software package to do these posterior simulations is WinBUGS, the Windows version of BUGS (Bayesian inference Using Gibbs Sampling), developed by David Spiegelhalter and colleagues at the MRC Biostatistics Unit of Cambridge University, England.
WinBUGS uses the Gibbs sampler to generate samples from the posterior distribution of the parameters of a Bayesian model. We will discuss the Gibbs sampler and other Markov chain Monte Carlo techniques in detail in Chapter 6. For now, we simply treat the simulation method used in WinBUGS as a black box, but keep in mind that the samples generated are not independent but dependent, i.e. they are samples from a Markov chain that converges towards the posterior distribution. Therefore, we can use the samples only from a point in time where convergence has set in, and need to discard the initial so-called burn-in samples.
WinBUGS Handouts
We illustrate this sampling-based approach using our familiar example of Binomial data with a conjugate prior distribution and refer to the handout
Brief Introduction to WinBUGS
Other handouts will discuss running WinBUGS in batch mode, from within R using R2WinBUGS, and how to use the R package CODA for convergence diagnostics.
Once familiar with WinBUGS, we will look at the huge range of Bayesian models, especially Bayesian hierarchical models, that can be handled with WinBUGS, and concentrate on practical implementation issues rather than theory. The underlying theory will be recouped in the subsequent chapters.
4 WinBUGS Applications 4.2 Bayesian Linear Regression
Simple Linear Regression
In regression analysis, we look at the conditional distribution of the response variable at different levels of a predictor variable.
I Response variable Y
  I also called dependent or outcome variable
  I what we want to explain or predict
  I in simple linear regression, the response variable is continuous
I Predictor variables X1, . . . , Xp
  I also called independent variables or covariates
  I in simple linear regression, the predictor variable is usually continuous
  I which variable is response and which is predictor depends on our research question
Example
Example 4.1
This example investigates the quality of the delivery system network of a softdrink company; see Example 5.1 in Ntzoufras (2009). One is interested in estimating the time each employee needs to refill an automatic vending machine owned and served by the company. For this reason, a small quality assurance study was set up by an industrial engineer of the company. The response variable is the total service time (measured in minutes) of each machine, including its stocking with beverages and any required maintenance or housekeeping. After examining the problem, the industrial engineer recommends two important variables that affect delivery time: the number of cases of stocked products and the distance walked by the employee (measured in feet). A dataset of 25 observations was finally collected.
Data: Softdrink Delivery Times
Delivery Time  Cases  Distance
16.68           7      560
11.50           3      220
12.03           3      340
14.88           4       80
13.75           6      150
18.11           7      330
 8.00           2      110
17.83           7      210
79.24          30     1460
21.50           5      605
40.33          16      688
21.00          10      215
13.50           4      255
19.75           6      462
24.00           9      448
29.00          10      776
15.35           6      200
19.00           7      132
 9.50           3       36
35.10          17      770
17.90          10      140
52.32          26      810
18.75           9      450
19.83           8      635
10.75           4      150
Model Assumptions
The explanatory variables are assumed fixed, their values denoted by xi1, . . . , xip for i = 1, . . . , n. Given the values of the explanatory variables, the observations of the response variable are assumed independent, normally distributed:
Yi|xi1, . . . , xip ∼ N(µi, σ²) with
µi = β0 + β1 xi1 + · · · + βp xip for i = 1, . . . , n
or in matrix notation:
Y|X ∼ Nn(µ, σ²I) with µ = Xβ
where σ² and β = (β0, β1, . . . , βp) are the set of regression parameters, I denotes the identity matrix, Y the vector of observations, and X = (xij) the n × (p + 1) design matrix.
Likelihood Specification in WinBUGS
Note that in WinBUGS the normal distribution is parametrized in terms of the precision τ = 1/σ². The likelihood is thus specified by:
for (i in 1:n){
  y[i] ~ dnorm(mu[i], tau)
  mu[i] <- beta0 + beta1*x1[i] + ... + betap*xp[i]
}
sigma2 <- 1/tau
sigma <- sqrt(sigma2)
Prior Specification
In normal regression models, the simplest approach is to assume that all parameters are a priori independent, i.e.
f(β, τ) = [∏_{j=0}^{p} f(βj)] f(τ)
βj ∼ N(µj, cj²) for j = 0, . . . , p
τ ∼ Gamma(a, b)
Thus the precision has prior mean E(τ) = a/b and prior variance Var(τ) = a/b². This corresponds to an Inverse Gamma prior distribution for σ² with E(σ²) = b/(a − 1) and Var(σ²) = b²/[(a − 1)²(a − 2)].
No info about βj: µj = 0 and cj² = 10000.
No info about τ: a = b = 0.001.
Prior Specification in WinBUGS
beta0 ~ dnorm(0.0, 1.0E-4)
beta1 ~ dnorm(0.0, 1.0E-4)
...
betap ~ dnorm(0.0, 1.0E-4)
tau ~ dgamma(0.001, 0.001)
Interpretation of Regression Coefficients
Each regression coefficient βj measures the effect of the explanatory variable Xj on the expected value of the response variable Y, adjusted for the remaining covariates.
Questions of interest are:
1. Is the effect of Xj important for the description of Y?
2. What is the association between Y and Xj (positive or negative)?
3. What is the magnitude of the effect of Xj on Y?
Answers:
1. Look at the posterior distribution of βj and its credible interval. Does the credible interval contain 0?
2. Calculate the posterior probabilities P(βj > 0) and P(βj < 0). In WinBUGS, use the step function
   p.betaj <- step(betaj)
   which creates a binary node p.betaj taking value 1 if βj > 0 and 0 otherwise.
3. The posterior mean/median of βj is a measure of the posterior expected change of the response variable Y if Xj increases by 1 unit and all other covariates are fixed.
Interpretation of β0
β0 measures the posterior expected value of Y if all covariates are zero. Often, zero is not in the range of the covariates, and then the interpretation of β0 is not meaningful.
Example: response: heart rate; covariate: body temperature in degrees C.
Better: center the covariates at their means, xᶜij = xij − x̄j:
µi = βᶜ0 + βᶜ1(xi1 − x̄1) + · · · + βᶜp(xip − x̄p)
βᶜ0 = expected value of Y when all covariates are equal to their means.
Centering the covariates is also advisable from a computational point of view: it decreases the posterior correlation between parameters and thus improves convergence of the Gibbs sampler. We will show this in Section 6.
Regression Example in WinBUGS
Prepare the data file by including the variable names to be used by WinBUGS at the top of each column and END at the end, and save it as a plain text file softdrinkdata.txt in your working directory.
time[] cases[] distance[]
16.68 7 560
11.5 3 220
12.03 3 340
14.88 4 80
13.75 6 150
...
35.1 17 770
17.9 10 140
52.32 26 810
18.75 9 450
19.83 8 635
10.75 4 150
END
For some odd reason (bug in WinBUGS?), make sure there is a blank line after END.
Regression Example in R
Alternatively, if we want to fit a linear model in the frequentist way in R first, to compare later on with the Bayesian results in WinBUGS, we read in the data, fit a linear model, and output a list using dput(), using the following R commands:
softdrink <- read.table(file="softdrinkdata.txt",
                        header=TRUE, sep="")
attach(softdrink)
cases_cent <- cases - mean(cases)
distance_cent <- distance - mean(distance)
summary(lm(time ~ cases_cent + distance_cent))
dput(list(time=time, cases=cases, distance=distance),
     "softdrinkdatalist.txt")
Regression Output in R
Call:
lm(formula = time ~ cases_cent + distance_cent)

Residuals:
    Min      1Q  Median      3Q     Max
-5.7880 -0.6629  0.4364  1.1566  7.4197

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.384000   0.651895  34.337  < 2e-16 ***
cases_cent     1.615907   0.170735   9.464 3.25e-09 ***
distance_cent  0.014385   0.003613   3.981 0.000631 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 22 degrees of freedom
Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
Regression Model in WinBUGS
model{
# likelihood
for (i in 1:n){
  time[i] ~ dnorm(mu[i], tau)
  mu[i] <- beta0 + beta1*(cases[i]-mean(cases[])) +
           beta2*(distance[i]-mean(distance[]))
}
# prior distributions
tau ~ dgamma(0.001,0.001)
beta0 ~ dnorm(0.0,1.0E-4)
beta1 ~ dnorm(0.0,1.0E-4)
beta2 ~ dnorm(0.0,1.0E-4)
# definition of sigma, sigma2, and sd(Y)
sigma2 <- 1/tau
sigma <- sqrt(sigma2)
# calculation of Bayesian version of R squared
R2B <- 1 - sigma2/pow(sd(time[]),2)
# posterior probabilities
p.beta0 <- step(beta0)
p.beta1 <- step(beta1)
p.beta2 <- step(beta2)
}
# inits
list(tau=1, beta0=1, beta1=0, beta2=0)
Regression Output in WinBUGS
node     mean     sd        MC error   2.5%      median   97.5%    start  sample
R2B      0.9516   0.01732   7.742E-4   0.9063    0.9551   0.9737   1001   1000
beta0    22.37    0.6681    0.02255    21.15     22.35    23.78    1001   1000
beta1    1.61     0.1851    0.005237   1.254     1.606    1.992    1001   1000
beta2    0.01447  0.003931  1.263E-4   0.006683  0.0144   0.02251  1001   1000
p.beta0  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001   1000
p.beta1  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001   1000
p.beta2  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001   1000
sigma2   11.67    4.175     0.1866     6.364     10.82    22.7     1001   1000
Bayesian Coefficient of Determination
A high value of the precision τ (low σ²) indicates that the model can accurately predict the expected value of Y. We can rescale this quantity using the sample variance s²Y of the response variable Y, giving the R²B statistic:
R²B = 1 − τ⁻¹/s²Y = 1 − σ²/s²Y.
This quantity can be interpreted as the proportional reduction of uncertainty concerning the response variable Y achieved by incorporating the explanatory variables Xj in the model.
It can be regarded as the Bayesian analog of the adjusted coefficient of determination
R²adj = 1 − σ̂²/s²Y,
where
σ̂² = [1/(n − p)] Σ_{i=1}^{n} (yi − ŷi)², with ŷi = β̂0 + Σ_{j=1}^{p} xij β̂j,
and β̂j are the maximum likelihood estimates of βj.
Missing Data
Missing data are easily incorporated in a Bayesian analysis. They are treated as unknown parameters to be estimated.
Assume, for instance, that observation 21 in the linear regression Example 4.1 was missing, i.e. time[21] for cases[21]=10 and distance[21]=140 was missing. In WinBUGS, missing values are denoted by NA in the dataset. Substituting 17.9 in the dataset by NA and running the code again, now monitoring the node time[21], we get the output
node      mean   sd     MC error  2.5%   median  97.5%  start  sample
time[21]  21.06  3.696  0.0821    13.71  21.02   28.4   1001   2000
Prediction in WinBUGS
Predicting future observations that follow the same distributional assumptions as the observed data is straightforward. In the regression context, we are interested in the posterior predictive distribution of a future observation Yn+1|y1, . . . , yn for certain values of the predictors x. Its posterior predictive pdf is
f(yn+1|y, x) = ∫ f(yn+1|β, x) f(β|y, x) dβ
or, ignoring the dependence on x:
f(yn+1|y) = ∫ f(yn+1|β) f(β|y) dβ
and we can use the mixture method (to be discussed in Chapter 7) tosimulate from this distribution. This is easily implemented in WinBUGS.
In the linear regression Example 4.1, this means defining another variable in the code with the same distribution as the original data, with the values of the predictor variables for which we want to forecast, e.g. cases=20 and distance=1000, and including this variable in the dataset with value NA:
pred.time ~ dnorm(pmu, tau)
pmu <- beta0 + beta1*(20-mean(cases[])) +
       beta2*(1000-mean(distance[]))
Running the model again and monitoring pred.time gives the posterior predictive summary:
node       mean   sd    MC error  2.5%   median  97.5%  start  sample
pred.time  48.98  3.71  0.07796   41.73  49.0    56.56  1001   2000
4 WinBUGS Applications 4.3 Model Checking
Model Assessment
Having successfully fit a model to a given dataset, the statistician must be concerned with whether the fit is adequate and whether the assumptions made by the model are justified. For example, in standard linear regression, the assumptions of normality, independence, linearity, and homogeneity of variance must all be investigated.
Several authors have suggested using the marginal distribution of the data, p(y), in this regard. Observed yi values for which p(yi) is small are "unlikely", and therefore may be considered outliers under the assumed model. Too many small values of p(yi) suggest the model itself is inadequate and should be modified or expanded.
A problem with this approach is the difficulty of defining how small is "small" and how many outliers are "too many". In addition, we have the problem of the possible impropriety of p(y) under noninformative priors. As such, we might work with the predictive distributions instead, since they will be proper whenever the posterior is.
Model Checking
Checking the validity of model assumptions:
I examination of individual observations
I comparison between two or more competitor models (later)
I global goodness-of-fit checks
Examination of Individual Observations
Consider data y1, . . . , yn and parameters θ under the assumed model. Gelfand et al. (1992) suggest a series of "checking functions". These are based on comparing a predictive distribution p(Y_i^rep) (to be made precise in the following) with the actual observed yi:

1. the residuals: yi − E[Y_i^rep]
2. the standardised residuals: (yi − E[Y_i^rep]) / √Var(Y_i^rep)
3. the chance of getting a more extreme observation: min(P(Y_i^rep < yi), P(Y_i^rep ≥ yi))
4. the chance of getting a more 'surprising' observation: P(Y_i^rep : f(Y_i^rep) ≤ f(yi))
5. the predictive ordinate of the observation: f(yi)
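As a concrete illustration, checking functions 1–3 can be computed directly from Monte Carlo draws of Y_i^rep. This Python sketch uses assumed replicate draws from a N(10, 2²) predictive distribution and a hypothetical observation y_i = 12 (illustrative values, not from the course example):

```python
import random

random.seed(1)

# Assumed posterior predictive draws of Y_i^rep for one observation; the
# N(10, 2^2) predictive and y_i = 12 are illustrative values only.
y_i = 12.0
y_rep = [random.gauss(10.0, 2.0) for _ in range(20000)]

m = sum(y_rep) / len(y_rep)
var = sum((v - m) ** 2 for v in y_rep) / len(y_rep)

resid = y_i - m                         # checking function 1: residual
s_resid = resid / var ** 0.5            # checking function 2: standardised residual
p_lower = sum(v < y_i for v in y_rep) / len(y_rep)
p_extreme = min(p_lower, 1 - p_lower)   # checking function 3: more extreme obs.

print(round(resid, 2), round(s_resid, 2), round(p_extreme, 3))
```

Here y_i sits one predictive standard deviation above the predictive mean, so the standardised residual is near 1 and the tail probability near 1 − Φ(1) ≈ 0.16.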
Separate Evaluation Data Available
Assume the data has been divided into a 'training set' z and an 'evaluation set' y. Then the posterior distribution of θ is based on z, and the predictive distribution above is given by
f(yi|z) = ∫ f(yi|z, θ) f(θ|z) dθ
As the yi's are usually conditionally independent of the zi's given θ, this becomes
f(yi|z) = ∫ f(yi|θ) f(θ|z) dθ
In WinBUGS, calculating the predictive distribution just requires defining an additional node for each Y_i^rep with the appropriate parents and monitoring the Y_i^rep's.
The observed yi can then be compared with their predictive distribution through the residuals or standardized residuals

ri = yi − E[Y_i^rep|z]  and  sri = (yi − E[Y_i^rep|z]) / √Var(Y_i^rep|z)
I Plotting these residuals versus fitted values might reveal a failure in a normality or homogeneity of variance assumption.
I Plotting them versus time could reveal a failure of independence.
I Summing their squares or absolute values could provide an overall measure of fit.
No Separate Evaluation Data Available
The above discussion assumes the existence of two independent data samples, which may well be unavailable in many problems. As such, Gelfand et al. (1992) suggested a cross-validation approach, wherein the fitted value for y_i^rep is computed conditionally on all the data except yi, namely y(i) = (y1, . . . , yi−1, yi+1, . . . , yn). That is, the i-th residual becomes

ri = yi − E[Y_i^rep|y(i)],

and the i-th standardized residual

sri = (yi − E[Y_i^rep|y(i)]) / √Var(Y_i^rep|y(i)).
Cross-Validation Approach
Note that in this cross-validatory approach we compute the posterior mean and variance with respect to the conditional predictive distribution,

p(yi|y(i)) = p(y)/p(y(i)) = ∫ p(yi|θ, y(i)) p(θ|y(i)) dθ,

which gives the likelihood of each point given the remainder of the data. The actual values of p(yi|y(i)), referred to as the conditional predictive ordinate, or CPO, can be plotted versus i as an outlier diagnostic, since data values having low CPO are poorly fit by the model.
Unfortunately, this is generally difficult to do within WinBUGS. But an approximation to the cross-validatory method is to use the methods for a separate evaluation set, replacing z by y. Hence our predictive distribution becomes the posterior predictive density without case omission

f(y_i^rep|y) = ∫ f(y_i^rep|y, θ) f(θ|y) dθ = ∫ f(y_i^rep|θ) f(θ|y) dθ
If we do wish to sample from the correct cross-validatory predictive distribution, this can be carried out using an additional importance sampling step to remove the effect of yi when repredicting Y_i^rep (Gelfand et al., 1992), although this would have to be carried out external to WinBUGS.
WinBUGS Cross-Validation
Let us implement checking functions 1 and 2 in WinBUGS for Example 4.1 using the approximate cross-validatory method. Note that

E[Y_i^rep|y] = ∫ y_i^rep f(y_i^rep|y) dy_i^rep
             = ∫ y_i^rep (∫ f(y_i^rep|θ) f(θ|y) dθ) dy_i^rep
             = ∫ (∫ y_i^rep f(y_i^rep|θ) dy_i^rep) f(θ|y) dθ
             = E[µi|y]

i.e. the posterior mean of µi = β0 + β1xi1 + β2xi2. Similarly, Var(Y_i^rep|y) is approximated by the posterior mean of the error variance 1/τ.
Thus, in WinBUGS we only need to define the following nodes:

for (i in 1:n){
  r[i] <- time[i]-mu[i]
  sr[i] <- (time[i]-mu[i])*sqrt(tau)
}
Examination of Individual Observations in WinBUGS
Monitoring the vectors r and sr, we can look at summary statistics etc. However, we get a better overview by using the comparison tool of the Inference menu and clicking on "boxplot":
Figure 11: Boxplot of standardized residuals.
Checking Function 3 in WinBUGS
To compute P(Y_i^rep < yi), we first need to obtain sample values of the random variable Y_i^rep by generating a replicate dataset time.rep[i] which depends on the current values of mu[i] and tau at each iteration. The step() function is then used to calculate the variable p.smaller[i], which takes the value 1 if time[i]-time.rep[i] ≥ 0 and zero otherwise. The posterior mean of p.smaller[i] is simply the proportion of iterations for which time.rep[i] < time[i], and P(Y_i^rep ≥ yi) = 1 − posterior mean of p.smaller[i]. The chance of observing a more extreme value for Yi is thus the minimum of these two probabilities.
node          mean   sd     MC error
p.smaller[1]  0.077  0.2666 0.005964
p.smaller[2]  0.626  0.4839 0.01051
p.smaller[3]  0.4875 0.4998 0.01109
p.smaller[4]  0.9275 0.2593 0.006629
p.smaller[5]  0.449  0.4974 0.009629
p.smaller[6]  0.459  0.4983 0.009853
p.smaller[7]  0.5915 0.4916 0.01047
p.smaller[8]  0.6325 0.4821 0.01033
p.smaller[9]  0.9555 0.2062 0.004386
p.smaller[10] 0.7575 0.4286 0.01117
p.smaller[12] 0.431  0.4952 0.009716
p.smaller[13] 0.631  0.4825 0.009968
p.smaller[14] 0.633  0.482  0.0116
p.smaller[15] 0.591  0.4916 0.009021
p.smaller[16] 0.4285 0.4949 0.012
p.smaller[17] 0.571  0.4949 0.01115
p.smaller[18] 0.8505 0.3566 0.007033
p.smaller[19] 0.712  0.4528 0.009266
p.smaller[20] 0.052  0.222  0.004984
p.smaller[21] 0.235  0.424  0.008387
p.smaller[22] 0.175  0.38   0.008043
p.smaller[23] 0.093  0.2904 0.006644
p.smaller[24] 0.09   0.2862 0.007328
p.smaller[25] 0.4685 0.499  0.01068
Checking Function 5 in WinBUGS
The CPO, checking function 5, can be explicitly calculated in WinBUGS using the relationship
1/f(yi|y(i)) = f(y(i))/f(y)
             = ∫ f(y(i)|θ) f(θ)/f(y) dθ
             = ∫ (1/f(yi|θ)) · f(y|θ) f(θ)/f(y) dθ
             = ∫ (1/f(yi|θ)) f(θ|y) dθ
             = Eθ|y[1/f(yi|θ)]
Thus, the i-th CPO can be estimated from the inverse of the sample mean of the inverse likelihood of yi for each θ generated from the full posterior distribution. I.e. a Monte Carlo estimate of CPOi is

CPÔi = ( (1/N) Σ_{n=1}^{N} 1/f(yi|θ^(n)) )^{−1}

which is the harmonic mean of the likelihood function. But note that harmonic means are notoriously unstable, so care is required regarding convergence!
In WinBUGS:

like[i] <- sqrt(tau/(2*PI))*exp(-0.5*pow(sr[i],2))
p.inv[i] <- 1/like[i]
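The harmonic-mean identity can be checked on a toy model. In this Python sketch the model is y_i ~ N(θ, 1) and the posterior draws of θ are taken as N(0.5, 0.1²) — purely illustrative assumptions, not values from the slides:

```python
import math
import random

random.seed(2)

# Toy model (assumed): y_i ~ N(theta, 1), posterior draws theta|y ~ N(0.5, 0.1^2).
y_i = 0.8
theta_draws = [random.gauss(0.5, 0.1) for _ in range(20000)]

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# CPO_i = harmonic mean of the likelihood f(y_i | theta) over posterior draws.
inv_like = [1.0 / norm_pdf(y_i, t, 1.0) for t in theta_draws]
cpo_i = 1.0 / (sum(inv_like) / len(inv_like))

print(round(cpo_i, 3))
```

Because the posterior here is tight relative to the observation noise, the estimate is stable; with a diffuse posterior, individual 1/f(yi|θ) terms can explode, which is the instability warned about above.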
node      mean    sd       MC error
p.inv[1]  34.12   27.03    0.7646
p.inv[2]  9.383   1.698    0.04268
p.inv[3]  8.959   1.627    0.03359
p.inv[4]  31.2    20.32    0.4766
p.inv[5]  8.929   1.512    0.03761
p.inv[6]  8.712   1.41     0.03228
p.inv[7]  9.184   1.669    0.042
p.inv[8]  9.37    1.669    0.0396
p.inv[9]  6273.0  154700.0 3500.0
p.inv[10] 13.03   6.565    0.1362
p.inv[11] 11.38   2.956    0.0671
p.inv[12] 9.211   1.792    0.04563
p.inv[13] 9.213   1.586    0.03934
p.inv[14] 9.338   1.699    0.0409
p.inv[15] 8.846   1.423    0.03416
p.inv[16] 9.538   2.268    0.0458
p.inv[17] 8.844   1.473    0.03562
p.inv[18] 16.44   6.838    0.1572
p.inv[19] 10.51   2.532    0.06173
p.inv[20] 53.19   49.06    1.06
p.inv[21] 13.66   7.111    0.1599
p.inv[22] 30.14   53.81    1.025
p.inv[23] 24.4    9.003    0.237
p.inv[24] 27.73   21.34    0.5858
p.inv[25] 8.82    1.473    0.03519
Global Goodness-of-fit Checks
The idea of global goodness-of-fit checks goes back to Rubin (1984). One constructs test statistics or other "discrepancy measures" D(y) that attempt to measure departures of the observed data from the assumed model (likelihood and prior distribution). For example, suppose we have fit a normal distribution to a sample of univariate data and wish to investigate the model's fit in the lower tail. We might compare the observed value of the discrepancy measure

D(y) = ymin

with its posterior predictive distribution, p(D(yrep)|y), where yrep denotes a hypothetical future value of y. If the observed value is extreme relative to this reference distribution, doubt is cast on some aspect of the model.
Posterior Predictive Model Checks
In order to be computable in the classical framework, test statistics must be functions of the observed data alone. But as pointed out by Gelman et al. (1996), basing Bayesian model checking on the posterior predictive distribution allows generalized test statistics D(y, θ) that depend on the parameters as well as the data. For example, as an omnibus goodness-of-fit measure, Gelman et al. (1996) recommend

D(y, θ) = Σ_{i=1}^{n} (yi − E[Yi|θ])² / Var(Yi|θ).

With θ varying according to its posterior distribution, we would now compare the distribution of D(y, θ) for the observed y with that of D(y∗, θ) for a future observation y∗.
A convenient summary measure of the extremeness of the former with respect to the latter is the tail area

pD = P[D(yrep, θ) > D(y, θ) | y]
   = ∫ P[D(yrep, θ) > D(y, θ) | θ] p(θ|y) dθ.

In the case where the distribution of D(y∗, θ) is free of θ, pD is exactly equal to the frequentist P-value, the probability of seeing a test statistic as extreme as the one actually observed.

As such, pD is sometimes referred to as the Bayesian P-value.
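The tail area can be estimated by simulation: for each posterior draw of θ, generate a replicate dataset and compare D(yrep, θ) with D(y, θ). A Python sketch with an assumed toy model and posterior (all values illustrative):

```python
import random

random.seed(3)

# Toy setup (assumed): data from N(0,1), model N(theta,1),
# posterior for theta taken as a tight N(0, 0.05^2).
y = [random.gauss(0.0, 1.0) for _ in range(50)]

def discrepancy(data, theta):
    # D(y, theta) = sum_i (y_i - E[Y_i|theta])^2 / Var(Y_i|theta); Var = 1 here
    return sum((yi - theta) ** 2 for yi in data)

n_draws = 2000
exceed = 0
for _ in range(n_draws):
    theta = random.gauss(0.0, 0.05)                        # posterior draw
    y_rep = [random.gauss(theta, 1.0) for _ in range(50)]  # replicate dataset
    if discrepancy(y_rep, theta) > discrepancy(y, theta):
        exceed += 1

p_value = exceed / n_draws  # Bayesian p-value; values near 0 or 1 flag misfit
print(p_value)
```

Since the data were simulated from the assumed model, the resulting p-value should typically be unremarkable (far from 0 and 1).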
Posterior Predictive Model Checks in WinBUGS
In Example 4.1, we consider 2 different statistics for D(y, θ) which may be sensitive to outlying observations in a Normal model. These are

I coefficient of skewness: E[((X − µ)/σ)³]
  a measure of asymmetry; the skewness of a Normal rv is zero
I coefficient of kurtosis: E[((X − µ)/σ)⁴]
  a measure of peakedness; the kurtosis of a Normal rv is 3
for (i in 1:n){
# residuals and moments for observed data
  r[i] <- time[i]-mu[i]
  sr[i] <- (time[i]-mu[i])*sqrt(tau)
  m3[i] <- pow(sr[i],3)
  m4[i] <- pow(sr[i],4)
# residuals and moments of replicates for Bayesian p-values
  time.rep[i] ~ dnorm(mu[i], tau)
  resid.rep[i] <- time.rep[i]-mu[i]
  sresid.rep[i] <- resid.rep[i]*sqrt(tau)
  m3.rep[i] <- pow(sresid.rep[i],3)
  m4.rep[i] <- pow(sresid.rep[i],4)
}
# Bayesian p-value:
skew.obs <- sum(m3[])/n
skew.rep <- sum(m3.rep[])/n
p.skew <- step(skew.rep-skew.obs)
kurtosis.obs <- sum(m4[])/n
kurtosis.rep <- sum(m4.rep[])/n
p.kurtosis <- step(kurtosis.rep-kurtosis.obs)
Bayesian P-values in WinBUGS
node         mean     sd     MC error
skew.obs     0.09787  0.8858 0.0185
skew.rep     -0.02244 0.7959 0.01879
p.skew       0.4685   0.499  0.01028
kurtosis.obs 3.783    2.754  0.05979
kurtosis.rep 3.045    2.023  0.04379
p.kurtosis   0.417    0.4931 0.01081
4 WinBUGS Applications 4.4 Model Comparison via DIC
Model Comparison via DIC
In general, for model comparison we need:
I a measure of fit
I a measure of complexity

e.g.

AIC = −2 log p(y|θ̂) + 2p
BIC = −2 log p(y|θ̂) + p log n
Problems with Classical Information Criteria
Problems:
I χ²-approximation for small samples
I p = no. of parameters in hierarchical models
I n = no. of observations in hierarchical models
Deviance
Suggestion by Dempster (1974):
Base model assessment on the posterior distribution of the log-likelihood of the data.

This is equivalent to the posterior distribution of the deviance:
D(θ) = −2 log p(y |θ) + 2 log p(y |θsat )
Deviance Information Criterion
Suggestion by Spiegelhalter et al. (2002):

measure of fit: D̄ = Eθ|y[D(θ)], the posterior mean of the deviance
measure of complexity: pD = D̄ − D(θ̄), the effective no. of parameters

DIC = D̄ + pD = D(θ̄) + 2pD
The model with the smallest DIC value is preferred. DIC calculation is implemented in WinBUGS.
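The DIC components can be computed directly from posterior draws of the deviance. A Python sketch for a toy normal model with known unit variance, where with a flat prior θ|y ∼ N(ȳ, 1/n) (an assumption of the sketch), so pD should come out close to 1 — one effective parameter:

```python
import math
import random

random.seed(4)

# Toy data; model y_i ~ N(theta, 1). With a flat prior, theta|y ~ N(ybar, 1/n)
# (this posterior is an assumption of the sketch, not WinBUGS output).
y = [1.2, 0.8, 1.5, 0.9, 1.1]
n = len(y)

def deviance(theta):
    # -2 log-likelihood up to an additive constant
    return sum((yi - theta) ** 2 for yi in y)

ybar = sum(y) / n
theta_draws = [random.gauss(ybar, 1.0 / math.sqrt(n)) for _ in range(50000)]

dbar = sum(deviance(t) for t in theta_draws) / len(theta_draws)  # Dbar
dhat = deviance(sum(theta_draws) / len(theta_draws))             # Dhat
p_d = dbar - dhat                                                # effective no. of parameters
dic = dbar + p_d

print(round(p_d, 2), round(dic, 2))
```

This mirrors the Dbar/Dhat/pD/DIC columns reported by the WinBUGS DIC tool on the next slide.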
DIC Example: Multiple Linear Regression
We will illustrate the use of DIC by comparing four different models for the softdrink Example 4.1:

1. Model 1: intercept only
2. Model 2: cases
3. Model 3: distance
4. Model 4: cases and distance

We run each model in WinBUGS and set the DIC tool in the Inference menu.
DIC Output
Dbar = post. mean of -2logL; Dhat = -2logL at post. mean of stochastic nodes

Model            Dbar    Dhat    pD    DIC
Intercept        209.092 207.061 2.031 211.123
Cases            143.549 140.477 3.072 146.622
Distance         170.575 167.503 3.072 173.647
Cases + Distance 131.289 127.030 4.259 135.547
4 WinBUGS Applications 4.5 Analysis of Variance
ANOVA Models
Now
I response variable Y: continuous
I explanatory variable X: discrete

X is called a factor with levels i = 1, . . . , I.

ANOVA Model:

Yij ∼ N(µi, σ²), i = 1, . . . , I, j = 1, . . . , ni

where
I Yij is the j-th observation of Y at level i of X
I µi = β0 + βi, with
  β0 the overall common mean
  βi the group-specific parameter
Parametrizations and Interpretations
We need a constraint to make the I + 1 parameters β0, β1, . . . , βI identifiable. Either:

Corner constraint:
The effect of a baseline level (or reference category) is set to 0: β1 = 0, so
µ1 = β0
µi = β0 + βi, i = 2, . . . , I

or

Sum-to-zero constraint:
Σ_{i=1}^{I} βi = 0, i.e. β1 = −Σ_{i=2}^{I} βi
β0 = (1/I) Σ_{i=1}^{I} µi, the overall mean effect
βi the deviation of each level from this overall mean effect
ANOVA in WinBUGS
Assume data are given in pairs (xi, yi), i = 1, . . . , n (n = Σi ni).

#likelihood
for (i in 1:n){
  y[i] ~ dnorm(mu[i],tau)
  mu[i] <- beta0 + beta[x[i]]
}
#corner constraint
beta[1] <- 0.0
#sum-to-zero constraint
#beta[1] <- - sum( beta[2:I] )
#prior
beta0 ~ dnorm(0.0,1.0E-4)
for (i in 2:I){
  beta[i] ~ dnorm(0.0,1.0E-4)
}
ANOVA Example
Example 4.2
McCarthy (2007) describes a dataset of weights of starlings at four different locations.
Location 1 Location 2 Location 3 Location 4
78         78         79         77
88         78         73         69
87         83         79         75
88         81         75         70
83         78         77         74
82         81         78         83
81         81         80         80
80         82         78         75
80         76         83         76
89         76         84         75
Classical ANOVA
Frequentist analysis in R:
star.df <- read.table("starlingdata.txt", header=TRUE)
attach(star.df)
loc <- factor(location)
star.aov <- aov(Y~loc)
anova(star.aov)
summary.lm(star.aov)$coef
R-Output
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value Pr(>F)
loc        3 341.90  113.97  9.0053 0.0001390 ***
Residuals 36 455.60   12.66
---
> summary.lm(star.aov)$coef
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     83.6   1.124969 74.313150 5.325939e-41
loc2            -4.2   1.590947 -2.639938 1.218170e-02
loc3            -5.0   1.590947 -3.142783 3.342926e-03
loc4            -8.2   1.590947 -5.154164 9.372412e-06
WinBUGS Code
model{
for (i in 1:40) {
  mu[i] <- beta0 + beta[location[i]]
  Y[i] ~ dnorm(mu[i], tau)
}
#prior, corner constraint
beta[1] <- 0
beta0 ~ dnorm(0.0,1.0E-4)
for (i in 2:4){
  beta[i] ~ dnorm(0.0, 1.0E-6)
}
tau ~ dgamma(0.001, 0.001) # vague prior on the precision
}
#inits
list(beta0=70, beta=c(NA, 70, 70, 70), tau=1)
#data
location[] Y[]
1 78
...
1 89
2 78
...
2 76
3 79
...
3 84
4 77
...
4 75
END
WinBUGS Results
node    mean    sd      MC error 2.5%    median  97.5%   start sample
beta[2] -4.204  1.65    0.03838  -7.302  -4.162  -0.9981 1001  2000
beta[3] -4.963  1.597   0.04041  -7.977  -4.964  -1.699  1001  2000
beta[4] -8.143  1.61    0.03213  -11.26  -8.168  -5.014  1001  2000
beta0   83.58   1.142   0.02757  81.31   83.59   85.7    1001  2000
tau     0.07878 0.01887 4.333E-4 0.04582 0.07712 0.1183  1001  2000
Using the comparison tool of the Inference menu and clicking on"boxplot" for beta:
Figure 12: Boxplot of location effects.
Model Comparison
Let us compare the fit of this one-way ANOVA model with a model that assumes no differences in the expected weights at the different locations:

for (i in 1:40) {
  Y[i] ~ dnorm(beta0, tau)
}

Model     Dbar    Dhat    pD    DIC
ANOVA     216.156 211.053 5.103 221.259
Same Mean 235.316 233.229 2.087 237.402
4 WinBUGS Applications 4.6 Generalized Linear Models
Generalized Linear Models
Generalized Linear Models (GLMs) are a generalization of the linear model for modelling random variables from the exponential family, thus including the Normal, Binomial, Poisson, Exponential and Gamma distributions. GLMs are one of the most important components of modern statistical theory, unifying the approach to statistical modelling. Details on GLMs can be found in McCullagh and Nelder (1989), Fahrmeir and Tutz (2001), and Dey, Ghosh and Mallick (2000).
3 components of a LM:
I stochastic component: Yi ∼ N(µi, σ²), i.e. E[Yi] = µi
I systematic component: µi = x′iβ (linear predictor)
I link function: g(µi) = µi, the identity

3 components of a GLM:
I stochastic component: Yi ∼ exponential family with location parameter θ, dispersion parameter φ
I systematic component: ηi = x′iβ
I link function: g(θi) = ηi
Models for Binary Response
Example 4.3
Fahrmeir and Tutz (1994) describe data provided by the Klinikum Grosshadern, Munich, on infection from births by Caesarean section. The response variable of interest is the occurrence or nonoccurrence of infection, with three dichotomous covariates: whether the Caesarean section was planned or not, whether any risk factors such as diabetes or being overweight were present or not, and whether antibiotics were given as a prophylaxis. The aim was to analyse the effects of the covariates on the risk of infection, especially whether antibiotics can decrease the risk of infection.
The binary data are summarized in the following table:
                  Caesarean planned   Not planned
                  Infection           Infection
                  yes    no           yes    no
Antibiotics
  Risk factors      1    17            11    87
  No risk factors   0     2             0     0
No antibiotics
  Risk factors     28    30            23     3
  No risk factors   8    32             0     9
Let Yi = 1 if infection occurs for the i-th patient, 0 otherwise, and let xi denote the corresponding vector of covariate values.

I Yi|xi, θi ∼ Bernoulli(θi)
I ηi = x′iβ = β0 + β1xi1 + β2xi2 + β3xi3
I link function η = g(θ), or θ = F(η) where F is a cdf
I logit model: g(θ) = log(θ/(1 − θ)), θ = e^η/(1 + e^η), the logistic cdf
I probit model: g(θ) = Φ−1(θ), θ = Φ(η), the Normal cdf
I complementary log-log model: g(θ) = log(− log(1 − θ)), θ = 1 − exp(−exp(η)), the extreme-minimal-value cdf
Interpretation of Logit Parameters
log(θ/(1 − θ)) = β0 + β1x

θ/(1 − θ) = exp(β0) exp(β1x)

Exponentials of covariate effects have a multiplicative effect on the odds/relative risk.

OR_{x,x+1} = odds(x + 1)/odds(x) = exp(β0) exp(β1(x + 1)) / (exp(β0) exp(β1x)) = exp(β1)

If x increases by 1 unit, the odds are multiplied by exp(β1). For other link functions:
I Interpret covariate effects on the linear predictor η = x′β.
I Transform this linear effect on η into a nonlinear effect on θ (with the aid of a graph of the response function θ = g−1(η)).
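The odds-ratio identity can be verified numerically; β0 and β1 below are arbitrary illustrative values, not estimates from the Caesarean example:

```python
import math

# Assumed illustrative logit coefficients (not from the Caesarean example).
beta0, beta1 = -0.8, 2.0

def odds(x):
    # theta / (1 - theta) under the logit model
    eta = beta0 + beta1 * x
    theta = math.exp(eta) / (1 + math.exp(eta))
    return theta / (1 - theta)

# A one-unit increase in x multiplies the odds by exp(beta1),
# independently of the starting value of x.
ratio = odds(1.0) / odds(0.0)
print(round(ratio, 4), round(math.exp(beta1), 4))
```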
Logit WinBUGS Code
model{
for( i in 1 : N ) {
  y[i] ~ dbern(p[i])
  logit(p[i]) <- beta0 + beta[1]*plan[i] +
                 beta[2]*factor[i] + beta[3]*antib[i]
# centered covariates
# logit(p[i]) <- beta0 + beta[1]*(plan[i]-mean(plan[])) +
#                beta[2]*(factor[i]-mean(factor[])) +
#                beta[3]*(antib[i]-mean(antib[]))
}
beta0 ~ dnorm(0.0,0.001)
for (i in 1:3){
  beta[i] ~ dnorm(0.0,0.001)
  or[i] <- exp(beta[i])
}
}
list(beta0=0,beta=c(0,0,0)) #inits
list(N=251) #data
WinBUGS Output
Figure 13: Traceplots for uncentered covariates.
Figure 14: Traceplots for centered covariates.
(Autocorrelation panels for beta0, beta[1]–beta[3]: centered covariates first, then uncentered covariates.)
Figure 15: Autocorrelation plots.
WinBUGS Output
Summary statistics for the model with uncentered covariates:

node    mean    sd      MC error 2.5%    median  97.5%
beta[1] -1.116  0.4392  0.02788  -1.993  -1.114  -0.2388
beta[2] 2.069   0.4982  0.03463  1.157   2.055   3.057
beta[3] -3.333  0.4921  0.02534  -4.346  -3.316  -2.393
beta0   -0.8242 0.5331  0.04337  -1.961  -0.8118 0.1738
or[1]   0.3604  0.1639  0.009911 0.1362  0.3282  0.7878
or[2]   8.988   4.894   0.3246   3.181   7.804   21.26
or[3]   0.04017 0.02009 0.001003 0.01295 0.03628 0.09139

None of the 95% credible intervals of the covariate effects contains 0.

Antibiotics lower the odds of infection by a factor of 0.04. When the Caesarean is planned, the odds of infection decrease by a factor of 0.36, and when risk factors are present, the odds of infection are 8.99 times higher.
Comparing Model Fits
Consider 3 different models with 3 different link functions and compare the fit with DIC.

Link    Dbar    Dhat    pD    DIC
Logit   230.621 226.588 4.033 234.654
Probit  231.221 227.041 4.180 235.400
Cloglog 228.101 224.152 3.949 232.050

The complementary log-log link seems to give a slightly better fit, but there are only minor differences in the DIC values.
Hierarchical Models
In many statistical applications, model parameters are related by the structure of the problem. For example, in a study of the effectiveness of cardiac treatments, it is assumed that patients in hospital j have survival probability θj. Estimating each of these θj separately might result in large standard errors for hospitals with few patients. It can also lead to overfitting and to models that cannot predict new data well. Assuming all survival probabilities equal a common θ will ignore potential treatment differences between hospitals and will not fit the data accurately. It might be reasonable to expect that the θj's are related and should be estimated jointly. This is achieved in a natural way by assuming that the θj's come from a common population distribution. This population distribution can depend on a further parameter.
Hierarchical model with hyperparameters:

Yij|θj ∼ f(yij|θj)
θj|φ ∼ f(θj|φ)
φ ∼ f(φ)
Hierarchical Models: Rat Tumor Example
Example 4.4
This example, in the context of drug evaluation for a possible clinical trial application, is taken from Gelman et al. (2004). A control group of 14 laboratory rats of type 'F344' is given a zero dose of a certain drug. The aim is to estimate the probability θ of developing endometrial stromal polyps (a certain tumor). The outcome is that 4 out of 14 rats developed this tumor.

1. Approach: Bayesian model with fixed prior

Y|θ ∼ Binomial(14, θ)
θ ∼ Beta(α, β)

Assume that we know from historical data the mean and sd of tumor probabilities among female lab rats of type 'F344'. We find the values of α and β of the beta distribution with this mean and sd. This yields a Beta(α + 4, β + 10) posterior distribution for θ.
Historical data: 70 previous experiments on the same type of rats

0/20 0/20 0/20  0/20  0/20 0/20  0/20  0/19  0/19  0/19
0/19 0/18 0/18  0/17  1/20 1/20  1/20  1/20  1/19  1/19
1/18 1/18 2/25  2/24  2/23 2/20  2/20  2/20  2/20  2/20
2/20 1/10 5/49  2/19  5/46 3/37  2/17  7/49  7/47  3/20
3/20 2/13 9/48  10/50 4/20 4/20  4/20  4/20  4/20  4/20
4/20 10/48 4/19 4/19  4/19 5/22  11/46 12/49 5/20  5/20
6/23 5/19 6/22  6/20  6/20 6/20  16/52 15/47 15/46 9/24

The observed sample mean and sd of the yj/nj are 0.136 and 0.103, respectively. Setting

0.136 = α/(α + β)
0.103² = αβ/((α + β)²(α + β + 1))

yields α = 1.4 and β = 8.6.
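The moment matching and the conjugate update can be reproduced in a few lines of Python; the exact solution is α ≈ 1.37, β ≈ 8.71, which the slide rounds to 1.4 and 8.6, and the resulting posterior mean matches the 0.223 quoted below:

```python
# Moment matching for the Beta prior: solve the two equations above for
# alpha and beta given the historical mean 0.136 and sd 0.103.
m, s = 0.136, 0.103

total = m * (1 - m) / s ** 2 - 1   # alpha + beta
alpha = m * total
beta = (1 - m) * total

# Conjugate update with y = 4 tumors out of n = 14 rats:
# the posterior is Beta(alpha + y, beta + n - y).
post_mean = (alpha + 4) / (alpha + beta + 14)

print(round(alpha, 2), round(beta, 2), round(post_mean, 3))
```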
Using a Beta(1.4, 8.6) prior for θ yields a Beta(5.4, 18.6) posterior distribution with posterior mean 0.223 and posterior sd 0.083, whereas 4/14 = 0.286.
Assumptions:
I θ1, . . . , θ70, θ71 can be considered a random sample from a common distribution
I no time trend

Questions:
I Can we use the same prior to make inference about the tumor probabilities in the first 70 groups?
I Is the point estimate used to derive α and β representative?
I Does it make sense to estimate α and β?
2. Approach: Hierarchical Bayesian model
In the absence of any information about the θj's (other than the data), and since no ordering or grouping of the parameters can be made, we must assume symmetry in the prior distribution of the parameters.

This means that the parameters (θ1, . . . , θJ) are modelled as exchangeable in their joint prior distribution, i.e.

f(θ1, . . . , θJ) is invariant to permutations of the indices (1, . . . , J).
Assume the simplest form of exchangeability: the θj are iid given some unknown parameter φ:

f(θ1, . . . , θJ|φ) = Π_{j=1}^{J} f(θj|φ)

By integration, the joint (unconditional or marginal) distribution is

f(θ1, . . . , θJ) = ∫ [ Π_{j=1}^{J} f(θj|φ) ] f(φ) dφ

De Finetti's theorem states that as J → ∞, any exchangeable distribution (under certain regularity conditions) can be written in the iid mixture form above.
A key part of hierarchical models: φ is unknown, has a prior distribution f(φ), and we estimate its posterior distribution after observing the data. We have a parameter vector (θ, φ) with joint prior distribution

f(θ, φ) = f(φ)f(θ|φ)

The joint posterior distribution is

f(θ, φ|y) ∝ f(y|θ, φ)f(θ, φ) = f(y|θ)f(θ|φ)f(φ)

where the last step uses the fact that y depends on φ only through θ.
Hyperprior Distribution:
If little is known about the hyperparameter φ, we can assign a diffuseprior distribution. But we always need to check whether the resultingposterior distribution is proper. In most real problems, there is sufficientsubstantial knowledge about φ to constrain φ to some finite region.
In the rat tumor Example 4.4, we reparametrize to µi = logit(θi), i.e.

θi = exp(µi) / (1 + exp(µi)),
µi ∼ N(ν, τ),

and specify the following diffuse hyperprior distribution for the mean ν and precision τ:

ν ∼ N(0, 0.001)
τ ∼ Gamma(0.001, 0.001)
WinBUGS Code: Rat Tumor Example
# rat example
model{
  for (i in 1:71){
    y[i] ~ dbin(theta[i], n[i])
    theta[i] <- exp(mu[i])/(1+exp(mu[i]))
    mu[i] ~ dnorm(nu, tau)
    r[i] <- y[i]/n[i]
  }
  nu ~ dnorm(0.0, 0.001)
  tau ~ dgamma(0.001, 0.001)
  mtheta <- exp(nu)/(1+exp(nu))
}
#inits
list(nu=0, tau=1)
WinBUGS Output: Rat Tumor Example
Based on 10,000 iterations and burn-in of 10,000:
node mean sd MC error 2.5% median 97.5%
mtheta 0.1261 0.01336 3.035E-4 0.1002 0.126 0.1526
nu -1.941 0.1224 0.002774 -2.195 -1.937 -1.715
tau 2.399 1.134 0.03409 1.052 2.184 4.891
theta[71] 0.2059 0.077 7.983E-4 0.0827 0.1965 0.3825
From the boxplot and the "model fit" plot of the θj estimates against the sample proportions rj, we see that the rates θj are shrunk from their sample point estimates rj = yj/nj toward the population distribution with mean 0.126. Experiments with fewer observations are shrunk more and have higher posterior variances. In contrast to the model with fixed prior parameters, this fully Bayesian hierarchical analysis takes the uncertainty in the hyperparameters into account.
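As a sanity check on this output: mtheta = exp(ν)/(1 + exp(ν)) is a monotone transform of ν, so its posterior median should equal the inverse logit of the posterior median of ν (posterior means need not match this way, by Jensen's inequality). A quick check, sketched in Python:

```python
import math

def inv_logit(nu):
    # logistic transform used in the model: theta = exp(nu)/(1 + exp(nu))
    return 1.0 / (1.0 + math.exp(-nu))

# posterior median of nu from the output above is -1.937; its inverse
# logit should match the reported posterior median of mtheta
print(round(inv_logit(-1.937), 3))  # -> 0.126
```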
Figure 16: Boxplots for rat tumor rates.
Figure 17: Model fit for rat tumor rates.
Hierarchical Models: Pump Failure Example
Example 4.5
George et al. (1993) discuss Bayesian analysis of hierarchical models. The example they consider relates to 10 power plant pumps. The data are given in the following table: the number of failures xi and the length of operation time ti (in thousands of hours) for each pump.
Pump ti xi
1 94.50 5
2 15.70 1
3 62.90 5
4 126.00 14
5 5.24 3
6 31.40 19
7 1.05 1
8 1.05 1
9 2.10 4
10 10.50 22
The number of failures Xi is assumed to follow a Poisson distribution:

Xi | θi ∼ Poisson(θi ti), i = 1, . . . , 10,

where θi denotes the failure rate for pump i. Assuming that the failure rates of the pumps are related, we specify a hierarchical Bayesian model with a conjugate prior distribution for the θi:

θi ∼ Gamma(α, β), i = 1, . . . , 10.

We have insufficient information about the pump failure rates to specify values for α and β, but want the data to inform us about these. We specify a hyperprior distribution using substantive knowledge:

α ∼ Exponential(1.0)
β ∼ Gamma(0.1, 1.0)
WinBUGS Code: Pump Failure Example
model{
  for (i in 1:N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] <- theta[i] * t[i]
    x[i] ~ dpois(lambda[i])
  }
  alpha ~ dexp(1)
  beta ~ dgamma(0.1, 1.0)
}
list(t=c(94.3,15.7,62.9,126,5.24,31.4,1.05,1.05,2.1,10.5),
     x=c(5,1,5,14,3,19,1,1,4,22), N=10) #data
list(alpha = 1, beta = 1) #inits
WinBUGS Output: Pump Failure Example
Based on 5,000 iterations and burn-in of 1,000:
node mean sd MC error 2.5% median 97.5%
alpha 0.6874 0.2723 0.007535 0.2806 0.6456 1.338
beta 0.9126 0.5411 0.01506 0.1771 0.8161 2.222
theta[1] 0.0599 0.02496 3.49E-4 0.02099 0.05683 0.1184
theta[2] 0.1012 0.07978 0.001012 0.00801 0.08247 0.3089
theta[3] 0.08922 0.03818 5.284E-4 0.03137 0.08349 0.1786
theta[4] 0.1148 0.03023 3.901E-4 0.06324 0.1121 0.1829
theta[5] 0.5964 0.3127 0.004145 0.1508 0.5445 1.338
theta[6] 0.6067 0.137 0.001753 0.3761 0.595 0.9082
theta[7] 0.9106 0.7541 0.01089 0.07487 0.7165 2.845
theta[8] 0.8997 0.7396 0.01236 0.07952 0.7016 2.732
theta[9] 1.599 0.7679 0.01115 0.4925 1.467 3.444
theta[10] 1.995 0.4327 0.00605 1.254 1.966 2.917
MLE: Pump Failure Example
To compare the results with maximum likelihood estimates (MLEs) for the individual pump failure rates, we calculate the (log-)likelihood:

f(xi | θi) = ((θi ti)^xi / xi!) exp(−θi ti)

log f(xi | θi) = xi log(θi ti) − θi ti − log xi!
Setting the first derivative to 0 and solving w.r.t. θi gives

θ̂i = xi / ti
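These MLEs are easy to verify directly from the data; a quick check in Python, using the operation times as they appear in the WinBUGS data list (which has t1 = 94.3, while the data table shows 94.50):

```python
# operation times (thousands of hours) and failure counts from the data list
t = [94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]
x = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]

# MLE of each failure rate: theta_hat_i = x_i / t_i
mle = [xi / ti for xi, ti in zip(x, t)]
for ti, xi, m in zip(t, x, mle):
    print(f"{ti:7.2f} {xi:3d} {m:8.4f}")
```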
MLE Comparison: Pump Failure Example
The following table compares the MLEs and Bayesian estimates:

hours failures MLE Bayesian
94.50 5 0.0530 0.0599
15.70 1 0.0637 0.1012
62.90 5 0.0795 0.08922
126.00 14 0.1111 0.1148
5.24 3 0.5725 0.5964
31.40 19 0.6051 0.6067
1.05 1 0.9524 0.9106
1.05 1 0.9524 0.8997
2.10 4 1.9048 1.599
10.50 22 2.0952 1.995
Remarks: Pump Failure Example
I Individual estimates are "shrunk" from the MLE toward a common mean.
I Individual estimates "borrow strength" from the rest of the data.
I θi's for observations with a large "sample size" (operation time) are shrunk less than θi's for other observations.
I θi's far from the common mean (0.7389) are shrunk more than those near it.
Boxplot: Pump Failure Example
Figure 18: Boxplots for pump failure rates.
Model Fit Plot: Pump Failure Example
Figure 19: Model fit for pump failure rates.
4 WinBUGS Applications 4.8 Survival Analysis
Survival Analysis
Survival analysis refers to a class of statistical models used to analyse the duration of time until an event of interest (such as death, tumor occurrence, or component failure) occurs. Time-to-event data arise in many disciplines, including medicine, biology, engineering, epidemiology and economics. Frequentist textbooks include Cox and Oakes (1984) and Klein and Moeschberger (1997); a comprehensive Bayesian perspective is given in Ibrahim, Chen and Sinha (2001).

As duration times are non-negative, only non-negative random variables can be used to model survival times.

Failure time data are often censored, i.e. incomplete, in that one knows that a patient survived the study end point, but one does not know the exact time of death.

In survival analysis, we are less interested in the mean of the distribution than in the hazard function.
Hazard Function
Let T be a continuous nonnegative random variable representing the duration time until a certain event occurs. Let f(t) denote the pdf and F(t) the cdf of T. Let S(t) = 1 − F(t) = P(T > t) be the survival function, which gives the probability of surviving beyond timepoint t.

Definition 4.6
The hazard function is defined as

h(t) = lim_{δt→0} P(t < T ≤ t + δt | T > t) / δt = f(t)/S(t) = −S′(t)/S(t)

and can be interpreted as the instantaneous death (or event) rate of an individual, given that this person has survived until time t. In particular, h(t)δt is the approximate probability of failure in [t, t + δt), given survival up to time t.
Since f(t) = −(d/dt) S(t), Definition 4.6 implies that

h(t) = −(d/dt) log S(t)    (4.1)

Integrating both sides of (4.1), and then exponentiating, yields

S(t) = exp(−∫₀ᵗ h(u) du).    (4.2)

The cumulative hazard H(t) is defined as

H(t) = ∫₀ᵗ h(u) du,

so S(t) = exp(−H(t)). Since S(∞) = 0, we have H(∞) = ∞.
Thus, the hazard function has the properties

h(t) ≥ 0 and ∫₀^∞ h(t) dt = ∞.

Finally, it follows from Definition 4.6 and (4.1) that

f(t) = h(t) exp(−∫₀ᵗ h(u) du).    (4.3)
Example: Weibull Distribution
Suppose T has pdf

f(t) = γρ t^(ρ−1) exp(−γ t^ρ) for t > 0, with ρ > 0, γ > 0, and f(t) = 0 otherwise.

This is a Weibull distribution with parameters (ρ, γ). It follows easily from the equations above that
I S(t) = exp(−γ t^ρ),
I h(t) = γρ t^(ρ−1),
I H(t) = γ t^ρ.
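These closed forms can be checked numerically against the general identities h(t) = f(t)/S(t) and S(t) = exp(−H(t)); a sketch in Python with assumed parameter values ρ = 1.5 and γ = 2:

```python
import math

rho, gamma = 1.5, 2.0  # assumed Weibull parameters, for illustration only

def f(t):   # pdf
    return gamma * rho * t ** (rho - 1) * math.exp(-gamma * t ** rho)

def S(t):   # survival function
    return math.exp(-gamma * t ** rho)

def h(t):   # hazard function
    return gamma * rho * t ** (rho - 1)

def H(t):   # cumulative hazard
    return gamma * t ** rho

t = 0.7
print(abs(h(t) - f(t) / S(t)))      # ~0: checks h(t) = f(t)/S(t)
print(abs(S(t) - math.exp(-H(t))))  # ~0: checks S(t) = exp(-H(t))
```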
Proportional Hazards Models
The hazard function depends in general on both time and a set of covariates. The proportional hazards model (Cox, 1972) separates these components by specifying that the hazard at time t for an individual with covariate vector x is given by

h(t, x) = h0(t) exp{G(x, β)},

where h0(t) is called the baseline hazard function and β is a vector of regression coefficients. The second term is written in exponential form because it must be positive.

The ratio of hazards for two individuals is constant over time. Often, the effect of the covariates is assumed to be multiplicative, leading to the hazard function

h(t, x) = h0(t) exp(x′β),

where η = x′β is called the linear predictor. Thus the ratio of hazards for two individuals depends on the difference between their linear predictors at any time.
Partial Likelihood
Cox’s version (Cox, 1975) of the proportional hazards model is semiparametric, as the baseline hazard function h0(t) is not modeled as a parametric function of t.

Assumptions:
I n individuals; d have distinct event times, n − d have right-censored survival times
I no ties; ordered event times y(1), . . . , y(d)
I Rj = set of individuals who are at risk at time y(j), the jth risk set

Then the partial likelihood is:

PL(β) = ∏_{j=1}^{d} [ exp(x′(j)β) / Σ_{l∈Rj} exp(x′l β) ]    (4.4)

The partial MLE of β can be obtained by maximizing (4.4) w.r.t. β.
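For a small dataset the partial likelihood can be evaluated directly; a minimal sketch in Python for a single covariate, assuming no ties (the function name is illustrative, and the event indicator here is 1 for an observed event, the opposite coding to ν on the following slides):

```python
import math

def cox_partial_loglik(beta, times, events, x):
    # times: observed times; events: 1 = event observed, 0 = right-censored;
    # x: scalar covariate per individual
    ll = 0.0
    for j, tj in enumerate(times):
        if not events[j]:
            continue                      # censored subjects only enter risk sets
        # risk set R_j: everyone still under observation just before t_(j)
        denom = sum(math.exp(beta * x[l]) for l in range(len(times)) if times[l] >= tj)
        ll += beta * x[j] - math.log(denom)
    return ll

# at beta = 0, each event contributes -log|R_j|: here -log(3) - log(2) = -log(6)
print(cox_partial_loglik(0.0, [1.0, 2.0, 3.0], [1, 1, 0], [0.5, -0.3, 1.2]))
```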
Likelihood under Censoring
Survival data are often right-censored. An observation is said to be right-censored at c if its exact value is not known, only that it is greater than c.

Let n be the number of subjects, where individual i has survival time ti and fixed censoring time ci. The ti are iid with pdf f(t). The exact survival time ti of an individual is observed only if ti ≤ ci. The data can be represented by n pairs of random variables (yi, νi), where

yi = min(ti, ci)

and

νi = 0 if ti ≤ ci, and νi = 1 if ti > ci.
The likelihood function for (β, h0(t)) for right-censored data is:

L(β, h0(t) | D) ∝ ∏_{i=1}^{n} f(yi)^(1−νi) S(yi)^(νi)
 ∝ ∏_{i=1}^{n} h(yi)^(1−νi) S(yi)^(1−νi) S(yi)^(νi)
 ∝ ∏_{i=1}^{n} h(yi)^(1−νi) S(yi)
 ∝ ∏_{i=1}^{n} h(yi)^(1−νi) exp{−H(yi)}
 ∝ ∏_{i=1}^{n} [h0(yi) exp(ηi)]^(1−νi) exp{−exp(ηi) H0(yi)},

where the data D = (n, y, X, ν).
If we assume a parametric model for the baseline hazard, e.g. Weibull(α, 1), and define γi = exp(ηi), then the likelihood above is that of independent censored Weibull(α, γi) distributions.
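The censored-data likelihood above is easy to evaluate for the Weibull(ρ, γi) case; a sketch in Python, using the slides' convention ν = 0 for an observed death and ν = 1 for a right-censored time:

```python
import math

def cens_weibull_loglik(rho, gammas, y, nu):
    # y: observed times; nu: 0 = death observed, 1 = right-censored
    ll = 0.0
    for gi, yi, ni in zip(gammas, y, nu):
        log_S = -gi * yi ** rho                     # log S(y) = -gamma * y^rho
        if ni == 0:
            # observed event contributes log f(y) = log(gamma*rho) + (rho-1)*log(y) + log S(y)
            ll += math.log(gi * rho) + (rho - 1) * math.log(yi) + log_S
        else:
            ll += log_S                             # censored: contributes log S(y)
    return ll

# rho = 1 reduces to censored exponential data
print(cens_weibull_loglik(1.0, [0.5, 0.5], [2.0, 3.0], [0, 1]))  # -> log(0.5) - 2.5
```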
Censoring in WinBUGS
In WinBUGS, right censoring can be implemented using the construct I(a,) (and I(,b) and I(a,b) for left and interval censoring, respectively).
Two variables are required to define the survival times:
I the actual survival times t[i], taking NA values for censored observations, and
I the censoring times t.cen[i], which take the value 0 when actual survival times (deaths) are observed.
For example, the likelihood of a Weibull(ρ, γ) distribution with right-censored data can be expressed as

t[i] ~ dweib(rho,gamma)I(t.cen[i],)
Mice Example in WinBUGS
We will now look at the mice example in WinBUGS Examples Volume 1.
MAC AIDS Trial
Here we come back to the analysis of the controlled clinical AIDS trial discussed in the introduction. Our data arise from a clinical trial comparing two treatments for Mycobacterium avium complex (MAC), a disease common in late-stage HIV-infected persons.

11 clinical centers (units) have enrolled a total of 69 patients in the trial, of which 18 have died. The data have been analysed in Carlin and Hodges (1999) and Cai and Meyer (2011).
I For j = 1, . . . , ni and i = 1, . . . , k let
tij = time to death or censoring,
xij = treatment indicator for subject j in stratum i.
I The next page gives survival times (in half-days) from the MAC treatment trial, where "+" indicates a censored observation.
Primary Endpoint Data
Unit Treatm. Time Unit Treatm. Time Unit Treatm. TimeA 1 74+ B 2 4+ F 1 6A 2 248 B 1 156+ F 2 16+A 1 272+ F 1 76A 2 244 C 2 20+ F 2 80D 2 20+ E 1 50+ F 2 202D 2 64 E 2 64+ F 1 258+D 2 88 E 2 82 F 1 268+D 2 148+ E 1 186+ F 2 368+D 1 162+ E 1 214+ F 1 380+D 1 184+ E 1 214 F 1 424+D 1 188+ E 2 228+ F 2 428+D 1 198+ E 2 262 F 2 436+D 1 382+D 1 436+G 2 32+ H 2 22+ I 2 8G 1 64+ H 1 22+ I 2 16+G 2 102 H 1 74+ I 2 40G 2 162+ H 1 88+ I 1 120+G 2 182+ H 1 148+ I 1 168+G 1 364+ H 2 162 I 2 174+J 1 18+ K 1 28+ I 1 268+J 1 36+ K 1 70+ I 2 276J 2 160+ K 2 106+ I 1 286+J 2 254 I 1 366
I 2 396+I 2 466+I 1 468+
Proportional Hazards Model
With proportional hazards and a Weibull baseline hazard, stratum i's hazard is

h(tij) = h0i(tij) exp(β0 + β1 xij)
       = ρi tij^(ρi − 1) exp(β0 + β1 xij),

where ρi > 0 and β = (β0, β1).

The ρi allow differing baseline hazards, which are increasing if ρi > 1 and decreasing if ρi < 1. As the strata may be similar, we model the shape parameters as exchangeable, i.e.

ρi iid∼ Gamma(α, α).

Thus the mean of the ρi is one, corresponding to a constant baseline hazard, and their variance is 1/α. We put a proper but low-information Gamma(3.0, 0.1) prior on α, reflecting a prior guess for the standard deviation of ρi of 30^(−1/2) ≈ 0.18, allowing a fairly broad region of values centered around one.
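The stated prior guess can be reproduced directly: the Gamma(3.0, 0.1) hyperprior gives α a prior mean of 3.0/0.1 = 30, and since Var(ρi | α) = 1/α, the implied standard deviation of ρi is about 30^(−1/2):

```python
import math

a, b = 3.0, 0.1                        # Gamma(shape, rate) hyperprior on alpha
alpha_mean = a / b                     # prior mean of alpha = 30
sd_rho = math.sqrt(1.0 / alpha_mean)   # implied prior guess for sd(rho_i)
print(alpha_mean, round(sd_rho, 2))    # -> 30.0 0.18
```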
As in the mice example,

µij = exp(β0 + β1 xij),

so that Tij ∼ Weibull(ρi, µij).
Weibull Prop. Hazards: WinBUGS Code
model{
  for (i in 1:69) {
    t[i] ~ dweib(rho[unit[i]], mu[i]) I(t.cen[i], )
    mu[i] <- exp(beta0 + beta1*x[i])
  }
  for (k in 1:11) {
    rho[k] ~ dgamma(alpha, alpha)
  }
  alpha ~ dgamma(3.0, 0.1)
  beta0 ~ dnorm(0.0, 0.001)
  beta1 ~ dnorm(0.0, 0.001)
  r <- exp(2.0*beta1)
}
WinBUGS Output
Based on 10,000 iterations and burn-in of 5,000:
node mean sd MC error 2.5% median 97.5%
alpha 48.45 20.12 0.3892 18.47 45.61 95.32
beta0 -6.788 0.4114 0.01758 -7.626 -6.78 -6.006
beta1 0.5973 0.2805 0.009956 0.06683 0.5894 1.189
r 3.887 2.515 0.08594 1.143 3.251 10.78
rho[1] 1.028 0.1078 0.002538 0.8111 1.029 1.237
rho[2] 0.9848 0.1456 0.003415 0.704 0.9794 1.289
rho[3] 0.972 0.1414 0.002471 0.7016 0.9696 1.255
rho[4] 0.999 0.1108 0.004363 0.7739 1.0 1.214
rho[5] 1.066 0.1024 0.002894 0.8667 1.064 1.273
rho[6] 0.9642 0.08855 0.002924 0.7894 0.9654 1.133
rho[7] 0.9724 0.1169 0.00354 0.748 0.9709 1.204
rho[8] 1.038 0.1273 0.003974 0.7931 1.038 1.296
rho[9] 0.9756 0.09325 0.003106 0.7885 0.9763 1.158
rho[10] 1.008 0.12 0.002795 0.7667 1.006 1.248
rho[11] 0.9616 0.1386 0.003722 0.6873 0.96 1.242
I Units A, E, and H have increasing baseline hazard functions (posterior mean of ρi > 1).
I All other units have constant or decreasing baseline hazard functions (ρi ≤ 1).
I There is a significant treatment effect:
the 95% CI for β1 does not include 0,
the 95% CI for r does not include 1.
I The posterior median of the relative risk (3.25) is close to the frequentist estimate r̂ = 3.1 for the unstratified Cox proportional hazards model (cf. Introduction).
4 WinBUGS Applications 4.9 State-Space Modelling of Time Series
State-Space Modelling of Time Series
State-space models are among the most powerful tools for dynamic modeling and forecasting of time series and longitudinal data. Overviews can be found in Fahrmeir and Tutz (1994) and Kuensch (2001).

Observation equation:

yt = ht(θt) + vt

gives the conditional distribution of the observations yt at time t given the latent states θt; vt is an error term, e.g. N(0, σ2).

State equation:

θt = gt(θt−1) + ut

gives the Markovian transition from state θt−1 to θt, where ut denotes an error term. The ability to include knowledge of the system behaviour in the statistical model is largely what makes state-space modeling so attractive for biologists, economists, engineers and physicists.
ML estimation of the unknown parameters and latent states is difficult.

The Kalman filter is applicable only for linear Gaussian state-space models.

For nonlinear, non-normal state-space models, the likelihood function is intractable; for these, Carlin et al. (1992) suggested the Gibbs sampler for posterior computation.

In the sequel, we will look at examples of state-space models implemented in WinBUGS.
Fisheries Stock Assessment: Data
The data available for stock assessment purposes quite often consist of a time series of annual catches Ct, t = 1, . . . , N, and relative abundance indices It, t = 1, . . . , N, such as research survey catch rates or catch-per-unit-effort (CPUE) indices from commercial fisheries.

For example, the next table gives an historical dataset of catch-effort data of South Atlantic albacore tuna (Thunnus alalunga) from 1967 to 1989. Catch is in thousands of tons and CPUE in kg/100 hooks.
Yellowfin tuna data from Pella and Tomlinson (1969)

Year (t) Catch (Ct) CPUE (It)
1967 15.9 61.89
1968 25.7 78.98
1969 28.5 55.59
1970 23.7 44.61
1971 25.0 56.89
... ... ...
1987 37.5 23.36
1988 25.9 22.36
1989 25.3 21.91
Fisheries Stock Assessment: Objectives
Age-composition data are not available for this stock. This dataset has previously been analysed by Polacheck et al. (1993).

Objectives: estimation of
I the size of the stock at the end of 1989,
I the maximum surplus production (MSP),
I the biomass at which MSP occurs (BMSP),
I the optimal effort (EMSP), the level of commercial fishing effort required to harvest MSP when the stock is at BMSP.

When only catch-effort data are available, biomass dynamics models are the primary assessment tools for many fisheries (Hilborn and Walters 1992).
Fisheries Stock Assessment: Biomass Dynamics
Biomass Dynamics Model

new biomass = old biomass + growth + recruitment − natural mortality − catch

The biomass dynamics equations can be written in the form

Bt = Bt−1 + g(Bt−1) − Ct−1,

where Bt, Ct, and g(Bt) denote the biomass at the start of year t, the catch during year t, and the surplus production function, respectively. g(0) = g(K) = 0, where K is the carrying capacity (the level of the stock biomass at equilibrium prior to commencement of the fishery).
Fisheries Stock Assessment: Surplus Production Model
The Schaefer (1954) form of the surplus production function is

g(Bt−1) = r Bt−1 (1 − Bt−1/K).

Substituting this into the biomass dynamics equation gives a parsimonious model describing the annual biomass transitions with just two parameters, the intrinsic growth rate r and the carrying capacity K:

Bt = Bt−1 + r Bt−1 (1 − Bt−1/K) − Ct−1.    (4.5)
Note that the annual catch is treated as a fixed constant.
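The deterministic dynamics in (4.5) are straightforward to iterate; a sketch in Python, with assumed parameter values r = 0.3 and K = 270 (roughly the posterior estimates reported later) and the first few annual catches:

```python
def project_biomass(r, K, B1, catches):
    # iterate B_t = B_{t-1} + r*B_{t-1}*(1 - B_{t-1}/K) - C_{t-1}
    B = [B1]
    for C in catches:
        Bprev = B[-1]
        B.append(Bprev + r * Bprev * (1 - Bprev / K) - C)
    return B

r, K = 0.3, 270.0   # assumed values, for illustration only
B = project_biomass(r, K, K, [15.9, 25.7, 28.5, 23.7, 25.0])
print([round(b, 1) for b in B])

# Schaefer reference points
print(r * K / 4)    # MSP   -> 20.25
print(K / 2)        # B_MSP -> 135.0
```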
Fisheries Stock Assessment: Relative Abundance Index
A common, though simplifying, assumption is that the relative abundance index is directly proportional to the biomass, i.e.

It = q Bt    (4.6)

with catchability parameter q.

For the Schaefer surplus production model, the maximum surplus production MSP = rK/4 occurs at BMSP = K/2. When the biomass indices are CPUEs from commercial fishing, the equation above gives MSP/EMSP = qK/2, and thereby the optimal effort is EMSP = r/(2q).
Fisheries Stock Assessment: Process and Observation Error
Polacheck et al. (1993) compare three commonly used statistical techniques for fitting the model defined by equations (4.5) and (4.6): process error models, observation error models, and equilibrium models. None of these is capable of incorporating the uncertainty present in both equations:
I natural variability underlying the annual biomass dynamics transitions (process error), and
I uncertainty in the observed abundance indices due to measurement and sampling error (observation error).
Fisheries Stock Assessment: State-Space Model
This is possible, however, using a state-space model.

Equations (4.5) and (4.6) are the deterministic versions of the stochastic state and observation equations.

We assume log-normal error structures, and use the reparametrization Pt = Bt/K, expressing the annual biomass as a proportion of the carrying capacity as in Millar and Meyer (2000), to speed mixing (i.e. sampling over the support of the posterior distribution) of the Gibbs sampler.
Fisheries Stock Assessment: State-Space Model
State equations:

P1 | σ2 = e^(u1),
Pt | Pt−1, K, r, σ2 = (Pt−1 + r Pt−1 (1 − Pt−1) − Ct−1/K) · e^(ut), t = 2, . . . , N.    (4.7)

Observation equations:

It | Pt, q, τ2 = q K Pt · e^(vt), t = 1, . . . , N,    (4.8)

where the ut are iid normal with mean 0 and variance σ2, and the vt are iid normal with mean 0 and variance τ2.
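Forward simulation from the state and observation equations (4.7) and (4.8) is a useful way to check one's understanding of the model; a sketch in Python with assumed parameter values:

```python
import math
import random

random.seed(2)

def simulate(K, r, q, sigma2, tau2, catches):
    # forward-simulate states P_t = B_t/K and indices I_t under (4.7)-(4.8)
    P = [math.exp(random.gauss(0.0, math.sqrt(sigma2)))]       # P_1 = e^{u_1}
    for C in catches:
        mean = P[-1] + r * P[-1] * (1 - P[-1]) - C / K
        # guard against negative means, as the WinBUGS code does with max(..., 0.01)
        P.append(max(mean, 0.01) * math.exp(random.gauss(0.0, math.sqrt(sigma2))))
    # observation: I_t = q*K*P_t * e^{v_t}
    I = [q * K * p * math.exp(random.gauss(0.0, math.sqrt(tau2))) for p in P]
    return P, I

P, I = simulate(K=270.0, r=0.3, q=0.25, sigma2=0.003, tau2=0.012,
                catches=[15.9, 25.7, 28.5, 23.7])
print(len(P), len(I))  # -> 5 5
```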
Fisheries Stock Assessment: Posterior Distribution
A fully Bayesian model consists of the joint prior distribution of all unobservables, here the five parameters K, r, q, σ2, τ2 and the unknown states P1, . . . , PN, and the joint distribution of the observables, here the relative abundance indices I1, . . . , IN.

We assume that the parameters K, r, q, σ2, τ2 are independent a priori. By successive application of Bayes' theorem and the conditional independence of subsequent states, the joint prior density is given by

p(K, r, q, σ2, τ2, P1, . . . , PN) = p(K) p(r) p(q) p(σ2) p(τ2) p(P1|σ2) ∏_{t=2}^{N} p(Pt | Pt−1, K, r, σ2).
Fisheries Stock Assessment: Prior Specification
A noninformative prior is chosen for q. Prior distributions for K, r, σ2, τ2 are specified using biological knowledge and inferences from related species and stocks, as discussed in Millar and Meyer (2000):

K ∼ lognormal(µK = 5.04, σK = 0.5162),
r ∼ lognormal(µr = −1.38, σr = 0.51),
p(q) ∝ 1/q,
σ2 ∼ inverse-gamma(3.79, 0.0102),
τ2 ∼ inverse-gamma(1.71, 0.0086).
Fisheries Stock Assessment: Likelihood
Because of the conditional independence of the relative abundance indices given the unobserved states, the sampling distribution is

p(I1, . . . , IN | K, r, q, σ2, τ2, P1, . . . , PN) = ∏_{t=1}^{N} p(It | Pt, q, τ2).    (4.10)

Then, by Bayes' theorem, the joint posterior distribution of the unobservables given the data is

p(K, r, q, σ2, τ2, P1, . . . , PN | I1, . . . , IN) ∝ p(K) p(r) p(q) p(σ2) p(τ2) p(P1|σ2) ∏_{t=2}^{N} p(Pt | Pt−1, K, r, σ2) ∏_{t=1}^{N} p(It | Pt, q, τ2).    (4.11)
Fisheries Stock Assessment: WinBUGS Code
model {
  # lognormal prior on K
  K ~ dlnorm(5.042905,3.7603664)I(10,1000)
  # lognormal prior on r
  r ~ dlnorm(-1.151293,1.239084233)I(0.005,1.0)
  # instead of the improper prior (prop. to 1/q), use a just proper IG
  iq ~ dgamma(0.001,0.001)I(0.5,200)
  q <- 1/iq
  # inverse gamma on sigma2 via gamma on isigma2
  isigma2 ~ dgamma(a0,b0)
  sigma2 <- 1/isigma2
  # inverse gamma on tau2 via gamma on itau2
  itau2 ~ dgamma(c0,d0)
  tau2 <- 1/itau2
  Pmean[1] <- 0
  P[1] ~ dlnorm(Pmean[1],isigma2)I(0.05,1.6)
  for (i in 2:N) {
    Pmean[i] <- log(max(P[i-1] + r*P[i-1]*(1-P[i-1]) - C[i-1]/K, 0.01))
    P[i] ~ dlnorm(Pmean[i],isigma2)I(0.05,1.5)
  }
  for (i in 1:N) {
    Imean[i] <- log(q*K*P[i])
    I[i] ~ dlnorm(Imean[i],itau2)
  }
  P24 ~ dlnorm(Pmean24, isigma2)I(0.05,1.5)
  Pmean24 <- log(max(P[23] + r*P[23]*(1-P[23]) - C[23]/K, 0.01))
  MSP <- r*K/4
  B_MSP <- K/2
  E_MSP <- r/(2*q)
}
Fisheries Stock Assessment: DAG
Figure 20: Representation of surplus production model as DAG.
Fisheries Stock Assessment: WinBUGS Output
Based on 100,000 iterations and burn-in of 100,000:
node mean sd MC error 2.5% median 97.5%
BMSP 135.5 32.44 1.272 87.2 130.2 212.1
EMSP 0.6154 0.09112 0.001935 0.4346 0.6148 0.8002
K 271.0 64.88 2.544 174.4 260.4 424.2
MSP 19.52 2.537 0.05968 13.9 19.76 23.94
P[1] 1.018 0.05427 8.062E-4 0.919 1.016 1.133
P[2] 0.9944 0.07386 0.001368 0.8737 0.986 1.164
P[3] 0.8772 0.06548 0.001485 0.7616 0.8726 1.019
P[4] 0.7825 0.06205 0.001524 0.6711 0.779 0.9144
P[21] 0.4175 0.03452 8.162E-4 0.3545 0.4156 0.491
P[22] 0.353 0.03519 9.208E-4 0.292 0.35 0.4296
P[23] 0.3271 0.03964 0.00103 0.2573 0.3241 0.4123
P24 0.2964 0.04939 0.001221 0.2093 0.2926 0.4028
q 0.2486 0.06136 0.002411 0.1449 0.244 0.3777
r 0.3088 0.09576 0.003559 0.1416 0.3031 0.5104
sigma2 0.003105 0.001912 2.22E-5 0.001132 0.00261 0.008057
tau2 0.01225 0.004516 2.778E-5 0.005832 0.01145 0.02327
Example: Stochastic Volatility in Financial Time Series
The stochastic volatility (SV) model introduced by Tauchen and Pitts (1983) is used to describe financial time series. It offers an alternative to the ARCH-type models of Engle (1982) for the well-documented time-varying volatility exhibited in many financial time series.

The SV model provides a more realistic and flexible modeling of financial time series than the ARCH-type models, since it essentially involves two noise processes, one for the observations and one for the latent volatilities.

The so-called observation errors account for the variability due to measurement and sampling errors, whereas the process errors assess variation in the underlying volatility dynamics.
Classical parameter estimation for SV models is difficult due to the intractable form of the likelihood function. Recently, a variety of frequentist estimation methods have been proposed for the SV model, including the Generalized Method of Moments (Melino and Turnbull, 1990; Sorenson, 2000), Quasi-Maximum Likelihood (Harvey et al., 1994), the Efficient Method of Moments (Gallant et al., 1997), Simulated Maximum Likelihood (Danielsson, 1994; Sandmann and Koopman, 1998), and approximate Maximum Likelihood (Fridman and Harris, 1998).

Bayesian MCMC procedures for the SV model have been suggested by Jacquier et al. (1994), Shephard and Pitt (1997), Kim et al. (1998) and Meyer and Yu (2000). Here we demonstrate the implementation of the Gibbs sampler in WinBUGS.
Stochastic Volatility: Data
The data consist of a time series of daily Pound/Dollar exchange rates {xt} from 01/10/81 to 28/6/85. The series of interest are the daily mean-corrected returns {yt}, given by the transformation

yt = log xt − log xt−1 − (1/n) Σ_{i=1}^{n} (log xi − log xi−1), t = 1, . . . , n.

returns.dat
-0.320221363079782
1.46071929942995
-0.408629619810947
1.06096027386685
1.71288920763163
0.404314365893326
-0.905699012715806
...
2.22371628398118
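The transformation from exchange rates xt to mean-corrected returns yt can be sketched in Python:

```python
import math

def mean_corrected_returns(x):
    # log-differences of the rates, then subtract their mean
    dlog = [math.log(b) - math.log(a) for a, b in zip(x, x[1:])]
    m = sum(dlog) / len(dlog)
    return [d - m for d in dlog]

# toy rates with constant growth: mean-corrected returns are (numerically) zero
print(mean_corrected_returns([1.0, 2.0, 4.0, 8.0]))
```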
Stochastic Volatility: State-Space Model
The SV model used for analyzing these data can be written in the form of a nonlinear state-space model:

Observation equations:

yt | θt = exp(θt/2) ut, ut iid∼ N(0, 1), t = 1, . . . , n.    (4.12)

State equations:

θt | θt−1, µ, φ, τ2 = µ + φ(θt−1 − µ) + vt, vt iid∼ N(0, τ2), t = 1, . . . , n,    (4.13)

with θ0 ∼ N(µ, τ2).
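Simulating data from (4.12) and (4.13) is straightforward; a sketch in Python with assumed parameter values:

```python
import math
import random

random.seed(3)

def simulate_sv(mu, phi, tau2, n):
    # forward-simulate log-volatilities theta_t and returns y_t from (4.12)-(4.13)
    tau = math.sqrt(tau2)
    theta = random.gauss(mu, tau)             # theta_0 ~ N(mu, tau^2)
    y = []
    for _ in range(n):
        theta = mu + phi * (theta - mu) + random.gauss(0.0, tau)
        y.append(math.exp(theta / 2.0) * random.gauss(0.0, 1.0))
    return y

y = simulate_sv(mu=-0.5, phi=0.97, tau2=0.02, n=200)
print(len(y))  # -> 200
```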
Stochastic Volatility: Parameters
I θt determines the amount of volatility on day t,
I the value of φ, −1 < φ < 1, measures the autocorrelation present in the logged squared data; thus φ can be interpreted as the persistence in the volatility,
I the constant scaling factor β = exp(µ/2) can be interpreted as the modal volatility, and
I τ as the volatility of the log-volatilities.
Stochastic Volatility: Prior Specification
By successive conditioning, the joint prior density is

p(µ, φ, τ², θ_0, θ_1, . . . , θ_n) = p(µ, φ, τ²) p(θ_0 | µ, τ²) ∏_{t=1}^{n} p(θ_t | θ_{t−1}, µ, φ, τ²).   (4.14)
I We employ a slightly informative prior for µ, µ ∼ N(0, 10).
I We set φ = 2φ* − 1 and specify a Beta(α, β) prior for φ* with α = 20 and β = 1.5, which gives a prior mean for φ of 0.86.
I A conjugate inverse-gamma prior is chosen for τ², i.e. τ² ∼ IG(2.5, 0.025), which gives a prior mean of 0.0167 and a prior standard deviation of 0.0236.
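These prior moments can be verified by direct calculation; a short Python check of the Beta mean for φ and the inverse-gamma mean and standard deviation for τ² (standard moment formulas, no sampling):

```python
import math

# phi = 2*phi* - 1 with phi* ~ Beta(20, 1.5); E[phi*] = a/(a+b)
a, b = 20.0, 1.5
phi_star_mean = a / (a + b)
phi_mean = 2 * phi_star_mean - 1
print(round(phi_mean, 2))  # 0.86

# tau^2 ~ IG(2.5, 0.025): mean = b/(a-1), var = b^2 / ((a-1)^2 (a-2))
a_ig, b_ig = 2.5, 0.025
ig_mean = b_ig / (a_ig - 1)
ig_sd = math.sqrt(b_ig**2 / ((a_ig - 1)**2 * (a_ig - 2)))
print(round(ig_mean, 4), round(ig_sd, 4))  # 0.0167 0.0236
```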
Stochastic Volatility: Likelihood
The likelihood p(y_1, . . . , y_n | µ, φ, τ², θ_0, . . . , θ_n) is specified by the observation equations (4.12) and the conditional independence assumption:

p(y_1, . . . , y_n | µ, φ, τ², θ_0, . . . , θ_n) = ∏_{t=1}^{n} p(y_t | θ_t).   (4.15)
Stochastic Volatility: Posterior Distribution
Then, by Bayes' theorem, the joint posterior distribution of the unobservables given the data is proportional to the prior times the likelihood, i.e.

p(µ, φ, τ², θ_0, . . . , θ_n | y_1, . . . , y_n) ∝ p(µ) p(φ) p(τ²) p(θ_0 | µ, τ²) ∏_{t=1}^{n} p(θ_t | θ_{t−1}, µ, φ, τ²) × ∏_{t=1}^{n} p(y_t | θ_t).   (4.16)
Stochastic Volatility: DAG
[DAG: a plate for(t IN 1:n) contains the nodes thmean[t], theta[t], isigma2[t] and y[t], with theta[t-1] and thmean[t+1] linking consecutive time slices; the top-level nodes mu, phi and itau2 point into the plate.]
Figure 21: Representation of the stochastic volatility model as a DAG.
Stochastic Volatility: DAG
The solid arrows indicate that, given its parent nodes, each node v is independent of all other nodes except descendants of v.

For instance, if on day t we know the volatility on day t − 1 and the values of the parameters µ, φ, and τ², then our belief about the volatility θ_t on day t is independent of the volatilities on previous days 1 to t − 2 and of the data of all other days except the current return y_t.
Stochastic Volatility: WinBUGS Output
Based on 10,000 iterations and burn-in of 10,000 (insufficient):
node  mean     sd       MC error  2.5%     median   97.5%
beta  0.7163   0.1244   0.00958   0.5554   0.6925   1.005
mu   -0.6927   0.3074   0.02252  -1.176   -0.735    0.01074
phi   0.9805   0.01081  8.306E-4  0.9552   0.9823   0.9962
tau   0.1493   0.03052  0.002965  0.1033   0.1435   0.2196
Stochastic Volatility: Final Remarks
This example clearly shows the limitations of the WinBUGS software. Generating 1000 iterations takes several seconds. Due to the high posterior correlation between the parameters, convergence is VERY slow, and a huge number of MCMC iterations is required to achieve convergence; this takes almost prohibitively long. More efficient samplers than the single-update Gibbs sampler can be constructed, for instance by so-called blocking of parameters, i.e. updating a whole parameter vector in a single Gibbs step. An alternative is a Metropolis-Hastings algorithm with a multivariate proposal distribution.
4 WinBUGS Applications 4.10 Copulas
Copulas
The study of copulas and their applications in statistics is a rather modern phenomenon: although the concept goes back to Sklar (1959), interest in copulas has been growing mainly over the last 15 years.
What are copulas?
The word copula is a Latin noun that means "a link, tie, bond".
In statistics, copulas are functions that join or "couple" multivariate distribution functions to their one-dimensional marginal distribution functions.

Or: copulas are multivariate distribution functions whose one-dimensional margins are uniform on the interval (0,1).

An extensive theoretical discussion of copulas can be found in Nelsen (2006).
Applications of Copulas
Copulas are used to
I study scale-free measures of dependence,
I construct families of bivariate/multivariate distributions (as alternatives to the multivariate normal, which does not provide an adequate approximation to many datasets, e.g. lifetime random variables and long-tailed claim variables).

Main applications:
I in financial risk assessment and actuarial analysis – some believe the methodology of applying the Gaussian copula to credit derivatives to be one of the reasons behind the global financial crisis of 2008-2009,
I in engineering, for reliability studies,
I in biostatistics/epidemiology, to model joint survival times of groups of individuals, e.g. husband and wife, twins, father and son, etc.
Definition of a Copula
Definition 4.7
A copula C(u_1, . . . , u_d) is a multivariate distribution function on the unit hypercube [0, 1]^d with univariate marginal distributions that are all uniform on the interval [0, 1], i.e.

C(u_1, . . . , u_d) = P(U_1 ≤ u_1, . . . , U_d ≤ u_d)

where U_i ∼ Uniform(0, 1) for i = 1, . . . , d.

For ease of notation, we assume from now on that d = 2.
Sklar’s Theorem (1959)
Theorem 4.8
Let F be a joint distribution function with margins F_1 and F_2. Then there exists a copula C such that for all x_1, x_2 ∈ ℝ

F(x_1, x_2) = C(F_1(x_1), F_2(x_2)).   (4.17)

If F_1 and F_2 are continuous, then C is unique. Conversely, if C is a copula and F_1 and F_2 are distribution functions, then the function F defined by (4.17) is a joint distribution function with margins F_1 and F_2.
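Sklar's theorem can be probed numerically. A small Python sketch builds a joint distribution function from a Clayton copula and Exponential margins (all parameter values arbitrary), then checks the uniform-margins property C(u, 1) = u by letting x_2 grow large:

```python
import math

def F_exp(x, rate):
    # Exponential(rate) distribution function
    return 1.0 - math.exp(-rate * x)

def C_clayton(u, v, alpha):
    # Clayton copula (alpha > 0)
    return (u**(-alpha) + v**(-alpha) - 1.0) ** (-1.0 / alpha)

# A joint df with Exponential margins coupled by a Clayton copula, as in (4.17):
# F(x1, x2) = C(F1(x1), F2(x2))
alpha, rate1, rate2 = 8.0, 2.0, 0.5
F = lambda x1, x2: C_clayton(F_exp(x1, rate1), F_exp(x2, rate2), alpha)

# Letting x2 -> infinity sends F2(x2) -> 1, and C(u, 1) = u recovers the margin
x1 = 1.3
print(F(x1, 1e9), F_exp(x1, rate1))  # the two values agree
```

The same check with x_1 large recovers F_2, confirming that F has the prescribed margins.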
Copula Density
By differentiation, it is easy to show that the density function of a bivariate distribution F(x_1, x_2) = C(F_1(x_1), F_2(x_2)) with marginal densities f_1 and f_2 is given by

f(x_1, x_2) = c(F_1(x_1), F_2(x_2)) f_1(x_1) f_2(x_2)   (4.18)

where c denotes the copula density of C, i.e.

c(u_1, u_2) = ∂²C(u_1, u_2) / (∂u_1 ∂u_2).
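The mixed partial derivative defining c can be checked numerically. For the Clayton copula the density has the standard closed form c(u, v) = (1 + α)(uv)^{−α−1}(u^{−α} + v^{−α} − 1)^{−1/α−2} for α > 0; a Python sketch compares it with a central finite-difference approximation of ∂²C/∂u∂v (the evaluation point is arbitrary):

```python
def C(u, v, alpha):
    # Clayton copula, alpha > 0
    return (u**(-alpha) + v**(-alpha) - 1.0) ** (-1.0 / alpha)

def c_exact(u, v, alpha):
    # Closed-form Clayton copula density
    return ((1.0 + alpha) * (u * v) ** (-alpha - 1.0)
            * (u**(-alpha) + v**(-alpha) - 1.0) ** (-1.0 / alpha - 2.0))

def c_numeric(u, v, alpha, h=1e-4):
    # Central finite difference for the mixed partial d^2 C / (du dv)
    return (C(u + h, v + h, alpha) - C(u + h, v - h, alpha)
            - C(u - h, v + h, alpha) + C(u - h, v - h, alpha)) / (4 * h * h)

u, v, alpha = 0.3, 0.6, 2.0
print(c_exact(u, v, alpha), c_numeric(u, v, alpha))
```

The finite-difference value should match the closed form to several decimal places.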
Some Copula Families
Clayton Copula

C(u, v) = (max(u^{−α} + v^{−α} − 1, 0))^{−1/α},   α ∈ [−1, ∞)\{0}

Frank Copula

C(u, v) = −(1/α) log(1 + (e^{−αu} − 1)(e^{−αv} − 1) / (e^{−α} − 1)),   α ∈ (−∞, ∞)\{0}

Gumbel Copula

C(u, v) = exp(−((−log u)^α + (−log v)^α)^{1/α}),   α ∈ [1, ∞)

Gaussian Copula

C(u, v) = Φ_ρ(Φ^{−1}(u), Φ^{−1}(v))

where Φ_ρ is the standard bivariate normal distribution function with correlation ρ, and Φ is the standard normal distribution function.
Dependence Measure: Concordance
Informally, a pair of rv's are concordant if "large" values of one tend to be associated with "large" values of the other, and "small" values of one with "small" values of the other.

Two observations (x_i, y_i) and (x_j, y_j) of a random vector (X, Y) are concordant (discordant) if
I x_i < x_j and y_i < y_j, or x_i > x_j and y_i > y_j
  (x_i < x_j and y_i > y_j, or x_i > x_j and y_i < y_j),
I or equivalently: (x_i − x_j)(y_i − y_j) > 0   ((x_i − x_j)(y_i − y_j) < 0).
Dependence Measure: Kendall's tau
The sample version of Kendall's tau is defined in terms of concordance as follows: let (x_i, y_i), i = 1, . . . , n, denote a random sample of n observations of (X, Y). There are (n choose 2) distinct pairs (x_i, y_i) and (x_j, y_j) of observations in the sample, and each pair is either concordant or discordant. Let c denote the number of concordant pairs and d the number of discordant pairs. Then Kendall's tau is defined as

τ = (c − d)/(c + d) = (c − d) / (n choose 2).

The population version of Kendall's tau is defined as the probability of concordance minus the probability of discordance,

τ = P[(X_1 − X_2)(Y_1 − Y_2) > 0] − P[(X_1 − X_2)(Y_1 − Y_2) < 0],

where (X_1, Y_1) and (X_2, Y_2) are independent copies of (X, Y).
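The sample definition can be coded directly by counting concordant and discordant pairs; a minimal Python sketch on a small made-up sample (the O(n²) loop is for illustration only):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    # tau = (c - d) / (c + d), assuming no ties in the data
    c = d = 0
    for (i, j) in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            c += 1  # concordant pair
        elif s < 0:
            d += 1  # discordant pair
    return (c - d) / (c + d)

x = [1.2, 2.5, 0.7, 3.1, 2.0]
y = [0.9, 2.1, 1.3, 2.8, 1.7]
print(kendall_tau(x, y))  # 0.8
```

For these five points c = 9 and d = 1, so τ = (9 − 1)/10 = 0.8.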
Relationship: Kendall’s tau and copula parameter
We have the following functional relationships between Kendall's tau and the parameters of the copula families above:

Clayton:   τ = 1 − 2/(2 + α) = α/(α + 2)
Frank:     τ = 1 − (4/α)(1 − (1/α) ∫_0^α t/(e^t − 1) dt)
Gumbel:    τ = 1 − α^{−1}
Gauss:     τ = (2/π) arcsin(α)
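For the Clayton family the relation inverts to α = 2τ/(1 − τ), which is exactly the conversion used in the simulation study that follows; a quick Python round-trip check:

```python
def tau_clayton(alpha):
    # Clayton: tau = 1 - 2/(2 + alpha)
    return 1.0 - 2.0 / (2.0 + alpha)

def alpha_clayton(tau):
    # Inverse mapping: alpha = 2*tau / (1 - tau)
    return 2.0 * tau / (1.0 - tau)

tau = 0.8
alpha = alpha_clayton(tau)
print(round(alpha, 6))               # 8.0
print(round(tau_clayton(alpha), 6))  # 0.8 -- the round trip recovers tau
```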
Parameter Estimation
Flexible multivariate distributions can be constructed from pre-specified, discrete and/or continuous marginal distributions and a copula function that represents the desired dependence structure. The joint distribution is usually estimated by a standard two-step procedure:
I the marginals are approximated by their empirical distributions, or the parameters of the marginals are estimated via ML,
I the parameters in the copula function are estimated by maximum likelihood, conditional on the parameter estimates from the first step.
Here, we propose instead to estimate all parameters of the marginal distributions and the copula jointly, using a Bayesian approach implemented in WinBUGS, as in Kelly (2007).
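For contrast with the joint Bayesian fit, the classical two-step procedure can be sketched in a few lines of Python on made-up data: Exponential margins fitted by ML (rate = 1/mean), then the Clayton copula log-likelihood maximized over α by a crude grid search. The grid stands in for a proper optimizer, and the common-shock data are only roughly exponential in the margins; the point is the mechanics, not the fit:

```python
import math
import random

random.seed(42)

# Step 0: made-up positively dependent failure times (not the lecture's data)
n = 400
z = [random.expovariate(1.0) for _ in range(n)]
x = [zi + random.expovariate(1.0) for zi in z]
y = [zi + random.expovariate(1.0) for zi in z]

# Step 1: ML estimates of the Exponential margins (rate = 1/mean), then PIT
lam1 = 1.0 / (sum(x) / n)
lam2 = 1.0 / (sum(y) / n)
u = [1.0 - math.exp(-lam1 * xi) for xi in x]
v = [1.0 - math.exp(-lam2 * yi) for yi in y]

# Step 2: profile the Clayton copula log-likelihood over alpha
def clayton_loglik(alpha):
    ll = 0.0
    for ui, vi in zip(u, v):
        s = ui**(-alpha) + vi**(-alpha) - 1.0
        ll += (math.log(1.0 + alpha) - (1.0 / alpha + 2.0) * math.log(s)
               - (alpha + 1.0) * (math.log(ui) + math.log(vi)))
    return ll

grid = [k / 10.0 for k in range(1, 101)]  # alpha in (0, 10]
alpha_hat = max(grid, key=clayton_loglik)
print(lam1, lam2, alpha_hat)
```

The drawback motivating the joint approach is visible here: uncertainty in the step-1 estimates is ignored in step 2.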
Simulation Study
We use the copula package in R to simulate N = 500 bivariate failure times from a Clayton copula with Exponential(λ_i) marginal distributions and a Kendall's tau value of 0.8 (as a measure of the association between the failure times). The rates for the marginal distributions are λ_1 = λ_2 = 0.0001.

We use R2WinBUGS to sample from the posterior distribution of the unknown parameters. We use an approximately Jeffreys prior for the rates of the Exponential distributions, λ_i ∼ Gamma(0.001, 0.001), and we assume a Uniform(0, 100) prior for α (based, for instance, on a priori information that the association between failure times is positive and will not exceed 0.98).

To specify the likelihood, we first need to calculate the density of the multivariate distribution using (4.18). Exercise!
Simulation Study: R2WinBUGS Code
library(copula)
library(R2WinBUGS)
p <- 2      # copula dimension
tau <- 0.8  # value of Kendall's tau
alpha <- 2*tau/(1-tau)  # relationship between tau and alpha
c.clayton <- archmCopula(family="clayton", dim=p, param=alpha)

# Marginals are exponential with rates lambda1 and lambda2
lambda1 <- 0.0001
lambda2 <- 0.0001
distr.clayton <- mvdc(c.clayton, margins=rep("exp",p),
    paramMargins = list(list(rate=lambda1), list(rate=lambda2)))

# Draw a random sample of size N
N <- 500
w <- rmvdc(distr.clayton, N)
Simulation Study
[Scatterplot: w[, 2] plotted against w[, 1]; both axes run from 0 to about 70,000.]
Figure 22: Scatterplot of 500 simulated values from the Clayton copula with Exp(0.0001) marginals.
Implementation in WinBUGS: Zeros Trick
If we want to implement parameter estimation of this copula model in WinBUGS, we face a problem, as copula distributions are not included in the list of standard distributions implemented in WinBUGS.
Fortunately, we can use the so-called zeros trick to specify a new sampling distribution. An observation y_i with new sampling distribution f(y_i | θ) contributes a likelihood term L(i) = f(y_i | θ). Let l(i) = log L(i); then the model likelihood can be written as

f(y_1, . . . , y_n | θ) = ∏_{i=1}^{n} f(y_i | θ) = ∏_{i=1}^{n} e^{l(i)} = ∏_{i=1}^{n} [(−l(i))^0 / 0!] e^{−(−l(i))},

i.e. the product of densities of Poisson random variables with mean −l(i), with all observations equal to zero.
Implementation in WinBUGS: Zeros Trick
To ensure that the Poisson means are all positive, we may have to add a positive constant C to each −l(i). This is equivalent to multiplying the likelihood by the constant factor e^{−nC}. With this approach, the original likelihood can be written as the product of Poisson likelihoods with observations all equal to zero:

f(y | θ) = ∏_{i=1}^{n} [(−l(i) + C)^0 / 0!] e^{−(−l(i)+C)} = ∏_{i=1}^{n} f_Poisson(0 | −l(i) + C)
Generic WinBUGS code:
C <- 10000
for (i in 1:n) {
    zeros[i] <- 0
    zeros[i] ~ dpois(zeros.mean[i])
    zeros.mean[i] <- -l[i] + C
    l[i] <- ...  # expression of log-likelihood for obs. i
}
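The identity behind the zeros trick, namely that a Poisson density evaluated at 0 with mean −l(i) + C equals e^{l(i)} e^{−C}, can be confirmed directly; a tiny Python check with a hypothetical value of l(i):

```python
import math

def poisson_pmf_at_zero(mean):
    # f_Poisson(0 | mean) = mean^0 / 0! * exp(-mean) = exp(-mean)
    return math.exp(-mean)

l_i = -3.7  # hypothetical log-likelihood contribution of observation i
C = 10.0    # constant keeping the Poisson mean positive

lhs = poisson_pmf_at_zero(-l_i + C)
rhs = math.exp(l_i) * math.exp(-C)  # original likelihood term times e^{-C}
print(lhs, rhs)  # agree up to floating-point error
```

Since e^{−nC} does not depend on the parameters, the posterior is unchanged by the added constant.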
Implementation in WinBUGS: Ones Trick
As an alternative to the zeros trick, the Bernoulli distribution can be used. The likelihood can be written as

f(y_1, . . . , y_n | θ) = ∏_{i=1}^{n} (e^{l(i)})^1 (1 − e^{l(i)})^0 = ∏_{i=1}^{n} f_Bernoulli(1 | e^{l(i)}),

i.e. the product of Bernoulli densities with success probability e^{l(i)}, with all observations equal to 1.
Implementation in WinBUGS: Ones Trick
To ensure that the success probability is less than 1, we multiply each likelihood term by e^{−C}, where C is a large positive constant. Then the joint likelihood becomes:

f(y | θ) = ∏_{i=1}^{n} (e^{l(i)−C})^1 (1 − e^{l(i)−C})^0 = ∏_{i=1}^{n} f_Bernoulli(1 | e^{l(i)−C})
Generic WinBUGS code:
C <- 100
for (i in 1:n) {
    ones[i] <- 1
    ones[i] ~ dbern(ones.p[i])
    ones.p[i] <- exp(l[i] - C)
    l[i] <- ...  # expression of log-likelihood for obs. i
}
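The analogous identity for the ones trick, f_Bernoulli(1 | p) = p with p = e^{l(i)−C}, is equally easy to check in Python (again with a hypothetical l(i)):

```python
import math

def bernoulli_pmf_at_one(p):
    # f_Bernoulli(1 | p) = p^1 * (1 - p)^0 = p
    return p

l_i = -3.7  # hypothetical log-likelihood contribution
C = 10.0    # large constant keeping p = exp(l_i - C) below 1

p = math.exp(l_i - C)
print(p, bernoulli_pmf_at_one(p))  # a valid probability, equal to e^{l_i - C}
```

As with the zeros trick, the factor e^{−nC} is constant in the parameters and leaves the posterior unchanged.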
Simulation Study: R2WinBUGS Code
# Call WinBUGS
data = list(N=500, x=w[,1], y=w[,2])
inits = list(list(lambda1=0.001, lambda2=0.002, alpha=5))
parameters = c("lambda1", "lambda2", "alpha")
clayton.sim <- bugs(data, inits, parameters.to.save=parameters,
    model.file="model_clayton.odc", n.chains=1,
    n.iter=2000, n.burnin=1000, working.directory=getwd())

This performs 2000 iterations of the Gibbs sampler with a burn-in period of 1000 and monitors the values of the three model parameters. The WinBUGS code in model_clayton.odc is:
Simulation Study: WinBUGS Code
model {
    lambda1 ~ dgamma(0.001,0.001)  # approximately Jeffreys' prior
    lambda2 ~ dgamma(0.001,0.001)  # approximately Jeffreys' prior
    alpha ~ dunif(0,100)           # Uniform prior on alpha
    # likelihood specification using the zeros trick
    C <- 10000
    for (i in 1:N) {
        zeros[i] <- 0
        zeros[i] ~ dpois(mu[i])
        mu[i] <- -l[i] + C
        u[i] <- 1 - exp(-lambda1*x[i])
        v[i] <- 1 - exp(-lambda2*y[i])
        l[i] <- log((1+alpha)
            * pow(pow(u[i],-alpha) + pow(v[i],-alpha) - 1, -1/alpha - 2)
            * pow(u[i],-alpha-1) * pow(v[i],-alpha-1)
            * lambda1*exp(-lambda1*x[i])
            * lambda2*exp(-lambda2*y[i]))
    }
}
Simulation Study: WinBUGS Output
Based on 1,000 iterations and burn-in of 1,000:
node      mean      sd        MC error  2.5%      median    97.5%
alpha     8.001     0.3863    0.02022   7.279     8.007     8.789
deviance  1.002E+7  2.507     0.1517    1.002E+7  1.002E+7  1.002E+7
lambda1   9.434E-5  3.815E-6  4.306E-7  8.75E-5   9.401E-5  1.018E-4
lambda2   9.415E-5  3.813E-6  4.298E-7  8.723E-5  9.383E-5  1.017E-4
5 References
References
I Albert, J.H. (2007), Bayesian Computation with R, Springer, New York.
I Aitkin, M. (1997), The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood, Statistics and Computing 7, 253-272.
I Aitkin, M. (2010), Statistical Inference: An Integrated Bayesian/Likelihood Approach, Chapman & Hall, Cambridge, UK.
I Bellhouse, D.R. (2004), The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of his Birth, Statistical Science 19, 3-43.
I Berger, J.O. and Wolpert, R.L. (1988), The Likelihood Principle, Hayward, CA.
I Bernardo, J. and Smith, A. (1994), Bayesian Theory, Wiley, Chichester, UK.
I Bolstad, W.M. (2004), Introduction to Bayesian Statistics, John Wiley & Sons.
I Borel, E. (1921), La Théorie du jeu et les équations intégrales à noyau symétrique, Comptes Rendus de l'Académie des Sciences 173, 1304-1308.
I Cai, B. and Meyer, R. (2011), Bayesian semiparametric modeling of survival data based on mixtures of B-spline distributions, Computational Statistics and Data Analysis, to appear.
I Carlin, B.P., Polson, N.G., and Stoffer, D.S. (1992), A Monte Carlo approach to nonnormal and nonlinear state-space modeling, J. Amer. Statist. Assoc. 87, 493-500.
I Carlin, B.P. and Louis, Th.A. (2008), Bayesian Methods for Data Analysis, Chapman & Hall.
I Carlin, B.P. and Hodges, J.S. (1999), Hierarchical Proportional Hazards Regression Models for Highly Stratified Data, Biometrics 55, 1162-1170.
I Cox, D.R. (1972), Regression models and life tables, Journal of the Royal Statistical Society B 34, 187-220.
I Cox, D.R. (1975), Partial Likelihood, Biometrika 62, 269-276.
I Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman & Hall, London.
I Dempster, A.P. (1974), The direct use of likelihood for significance testing, in (Barndorff-Nielsen et al., eds.) Proc. of the Conference on the Foundational Questions of Statistical Inference, 335-352. Reprinted in Statistics and Computing 7, 247-252 (1997).
I Dey, D., Ghosh, S. and Mallick, B. (2000), Generalized Linear Models: A Bayesian Perspective, Marcel Dekker, New York.
I Efron, B. (2005), Bayesians, Frequentists, and Scientists, Journal of the American Statistical Association 100.
I Fahrmeir, L. and Tutz, G. (2001), Multivariate Statistical Modelling Based on Generalized Linear Models, Springer Series in Statistics, Springer Verlag, New York.
I Fisher, R.A. (1922), On the interpretation of chi-square from contingency tables and the calculation of p, Journal of the Royal Statistical Society B 85, 87-94.
I Gelfand, A., Dey, D., Chang, H. (1992), Model determination using predictive distributions with implementation via sampling-based methods, in (Bernardo et al., eds.) Bayesian Statistics 4, Oxford University Press, 407-425.
I Gelman, A., Carlin, J., Stern, H., Rubin, D. (2004), Bayesian Data Analysis, Texts in Statistical Science, 2nd ed., Chapman & Hall, London.
I Gelman, A. and Meng, X.L. (1996), Model checking and model improvement, in (Gilks et al., eds.) Markov Chain Monte Carlo in Practice, Chapman & Hall, UK, 189-201.
I Geman, S. and Geman, D. (1984), Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.
I George, E.I., Makov, U.E. and Smith, A.F.M. (1993), Conjugate Likelihood Distributions, Scandinavian Journal of Statistics 20, 147-156.
I Gilks, W., Richardson, S. and Spiegelhalter, D. (1996), Markov Chain Monte Carlo in Practice, Chapman & Hall, Cambridge, UK.
I Ibrahim, J.G., Chen, M-H., Sinha, D. (2001), Bayesian Survival Analysis, Springer, New York.
I Jeffreys, H. (1939), Theory of Probability, Oxford University Press, Oxford.
I Jeffreys, H. (1961), Theory of Probability, 3rd edition, Oxford University Press, Oxford.
I Kelly, D.L. (2007), Using Copulas to Model Dependence in Simulation Risk Assessment, Proceedings of the International Mechanical Engineering Congress and Exposition, IMECE2007-41284.
I Keynes, J.M. (1922), A Treatise on Probability, Volume 8, St Martin's.
I Klein, J.P. and Moeschberger, M.L. (1997), Survival Analysis, Springer, New York.
I Kuensch, H.R. (2001), State space and hidden Markov models, in: Barndorff-Nielsen et al. (eds.), Complex Stochastic Systems, Chapman & Hall, London, 109-174.
I Lawless, J.F. (1982), Statistical Models and Methods for Lifetime Data, Wiley, New York.
I McCullagh, P. and Nelder, J. (1989), Generalized Linear Models, Chapman & Hall, Cambridge, UK.
I McCarthy, M.A. (2007), Bayesian Methods for Ecology, Cambridge University Press.
I Meyer, R. and Yu, J. (2000), BUGS for a Bayesian analysis of stochastic volatility models, Econometrics Journal 3, 198-215.
I Millar, R.B. and Meyer, R. (2000), State-Space Modeling of Non-Linear Fisheries Biomass Dynamics Using the Gibbs Sampler, Applied Statistics 49, 327-342.
I Nelsen, R.B. (2006), An Introduction to Copulas, Springer, New York.
I Ntzoufras, I. (2009), Bayesian Modeling Using WinBUGS, John Wiley & Sons, Inc.
I Raiffa, H. and Schlaifer, R., Applied Statistical Decision Theory, MIT Press, Cambridge, MA.
I Ramsey, F.P. (1926), Truth and Probability, published in 1931 in The Foundations of Mathematics and Other Logical Essays, Ch. VII, 156-198.
I Rubin, D.B. (1984), Bayesianly justifiable and relevant frequency calculations for the applied statistician, Annals of Statistics 12, 1151-1172.
I Sklar, A. (1959), Fonctions de répartition à n dimensions et leurs marges, Publ. Inst. Statist. Univ. Paris 8, 229-231.
I Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A. (2002), Bayesian measures of model complexity and model fit, Journal of the Royal Statistical Society B 64, 583-639.