comp6053 lecture: bayes' theorem and bayesian...

FEEG6017 lecture: Bayes' theorem and Bayesian estimation Dr Brendan Neville [email protected]


Post on 25-Jun-2018







Null hypothesis significance

testing: what are we doing?

• When we report on a p-value, we're

describing the probability of observing our

data (or more extreme data) given the

assumption that the null hypothesis is true.

• The convention is that if this probability is

low enough, we decide to reject the null

hypothesis and tentatively adopt the

alternative hypothesis.

Null hypothesis significance

testing: what are we doing?

• The difficulty in interpreting p-values is that

it's tempting to see them as somehow

describing the probability of the model being true.


• (A natural confusion: the appropriateness of

our model is what we really care about.)

• But p-values don't do this: they describe the

probability of the data given a very boring

model, the null hypothesis.

• P(Data|Null)
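To make P(Data|Null) concrete, here is a minimal Python sketch using a made-up example that is not from the lecture: 60 heads in 100 coin flips, with "the coin is fair" as the null hypothesis. The function name and the numbers are illustrative assumptions.

```python
from math import comb


def one_sided_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """P(at least `heads` heads in `flips` flips | null hypothesis is true)."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Hypothetical data: 60 heads in 100 flips; the null says the coin is fair.
p = one_sided_p_value(60, 100)
print(f"P(data or more extreme | null) = {p:.4f}")
```

The printed number describes the data under the null model. It is not the probability that the coin is fair.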

The null hypothesis: over-rated?

• In fact we're often not much interested in the

null hypothesis: in almost all realistic cases

it's not true.

• We could demonstrate that by collecting

more data, and/or measuring more precisely.

• The null hypothesis has had a starring role in

the history of statistics simply because it is

mathematically convenient.

Integrating new evidence

• It's a familiar scientific activity to report on

the p-value of some analysis and, if it's low

enough, to publish the findings.

• But how should we integrate new evidence

over time?

Integrating new evidence

• Suppose many scientists investigate some


• Most find that an effect exists, e.g., a

positive relationship between basketball skill

and height.

• Some analyses have very low p-values,

others are marginal, still others are non-significant.


Integrating new evidence

• How should we rationally combine the

conclusions of these studies?

• In fact in the real world of publication, there's

reason to be concerned that we don't do this

at all rationally...

• Some fields tend to use significance levels of

p=0.01 or p=0.05 as a threshold for publication.


Dangers of publication bias

• There's also an interest in publishing "new"

and "exciting" results.

• John Ioannidis's paper "Why most published

research findings are false" points out that

NHST combined with publication bias is a

recipe for disaster.

• Half-life of knowledge.

• Nevertheless: most of us have a pragmatic

sense of evidence somehow accumulating

for (or against) a theory over time.

Frequentist thinking

• There are several schools of thought on probability.


• One is the "frequentist" view, which says

probabilities can only refer to the objective

long-run frequency of an event occurring in a

well-defined sample space.

• Particular events either happen or they don't.

Probability and belief

• Frequentists disapprove of using

probabilities to refer to subjective belief.

• For example: "I am 90% sure that he is the

one who stole the coffee money." Are we

happy with this kind of talk?

• Bayesian thinking: if we allow probabilities to

refer to subjective belief, this turns out to

help with the integration of new information.

A medical example

• You're a doctor. A

patient comes in, asks

for an HIV test.

• You get some clinical

details from them.

• HIV is rare in patients

with this behavioural

profile: about 1 case

in 1000.

A medical example

• You think they're probably just being

paranoid, but you take a blood sample


• You send it off for analysis. The lab test is quite accurate:


o 99% sensitivity: probability of getting a positive result

if the patient has HIV is 0.99.

o 99% specificity: probability of getting a negative

result if the patient doesn't have HIV is also 0.99.

A medical example

• To your surprise, the test comes back positive.


• The patient is understandably dismayed, and

asks "Could it be a mistake?". What is the

probability that the patient has HIV?

• If you haven't seen this kind of problem

before, take a minute to think about your answer.


A medical example

• The real answer is about 0.0902, or 9.02%.

• Only around 15% of doctors get this right.

• Many respondents focus on the 99%

sensitivity of the test, and believe that the

patient is 99% likely to have HIV given the

positive result.

• They're neglecting the background or base

rate of HIV prevalence.

Test example with frequencies

• Doctors (and others) do a better job on the

problem if it is framed differently.

• Consider a population of 100,000 people

who each decide to have an HIV test.

• HIV is rare in this population: 100 people

have it, and 99,900 people do not.

Test example with frequencies

• Of the 100 people with HIV, the test will

accurately detect HIV in 99 of them, and 1

person will get a false negative result.

• Of the 99,900 people without HIV, 99% of

them (98,901) will get a negative result. The

remainder (999) will get a false positive result.


• There will be 1098 positive results in total:

99 are true, and 999 are false.

Test example with frequencies

• The probability of actually having HIV after

getting a positive test result is therefore

99 / 1098 = 0.0902.

• (Do you agree that this version makes the

problem easier?)

• The logic expressed here is Bayes' theorem.
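The frequency version of the problem can be checked by simply counting people. This is a minimal Python sketch using the numbers from the slides; the variable names are my own.

```python
population = 100_000
with_hiv = population // 1000            # base rate of 1 in 1000 -> 100 people
without_hiv = population - with_hiv      # 99,900 people

true_positives = with_hiv * 99 // 100    # 99% sensitivity -> 99 people
false_positives = without_hiv // 100     # 1% of healthy people -> 999 people

all_positives = true_positives + false_positives
p_hiv_given_positive = true_positives / all_positives
print(f"{true_positives} / {all_positives} = {p_hiv_given_positive:.4f}")
# prints: 99 / 1098 = 0.0902
```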

Bayes' theorem: formula

$$P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D \mid H) \cdot P(H) + P(D \mid H^C) \cdot P(H^C)}$$

• Let's say H represents having HIV, and D

represents the positive test result.

• We're trying to calculate "the probability of H

given that D has been observed".
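As a rough sketch, the formula translates directly into a small Python function. The function and argument names are assumptions of mine; the inputs are the lecture's values (prior 0.001, sensitivity 0.99, and a false-positive rate of 0.01, the complement of the 0.99 specificity).

```python
def posterior(prior: float, p_d_given_h: float, p_d_given_not_h: float) -> float:
    """P(H | D) for a hypothesis H and its complement."""
    numerator = p_d_given_h * prior                        # P(D|H) * P(H)
    denominator = numerator + p_d_given_not_h * (1 - prior)  # add P(D|H^C) * P(H^C)
    return numerator / denominator


p_h_given_d = posterior(prior=0.001, p_d_given_h=0.99, p_d_given_not_h=0.01)
print(round(p_h_given_d, 4))  # 0.0902
```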

Bayes' theorem: numerator

$$P(D \mid H) \cdot P(H)$$

• What's p(D|H)? That's the probability of

seeing a positive result if you really have

HIV, i.e., the sensitivity of the test, 0.99.

• What's p(H)? This is the base rate or prior

probability that the person has HIV. In this

case, 0.001.

• Note that Bayesian thinking demands that

we have some prior opinion on p(H).

Bayes' theorem: numerator

• So the numerator is:

$$P(D \mid H) \cdot P(H) = 0.99 \times 0.001 = 0.00099$$

• This is the probability of any one person both

having HIV and also returning a positive

result on the test.

Bayes' theorem: denominator

$$P(D \mid H) \cdot P(H) + P(D \mid H^C) \cdot P(H^C)$$

• The first component of the denominator

simply repeats the numerator.

• This is "how often does someone have HIV

and then get a positive test result".

• The second component looks at the other

way you can get a positive test result, i.e.,

via a false positive.

Bayes' theorem: denominator

• We need to know P(D|H^C), i.e., the

probability of seeing a positive test result if

you don't have HIV. That's 0.01, the

complement of the specificity.

• We also want to know P(H^C), the prior

probability of not having HIV. That's 0.999.

• Multiply them to get the overall rate of false

positives: 0.01 x 0.999 = 0.00999.

Bayes' theorem: denominator

• We add these two components to find out

how often positive test results will be seen

for any reason:

$$P(D \mid H) \cdot P(H) + P(D \mid H^C) \cdot P(H^C) = 0.00099 + 0.00999 = 0.01098$$

• So about 1% of the time we'll see positive

test results.

• How often will those be true positives? In

other words, what's p(H|D)?

Bayes' theorem:

putting it all together

• The probability of seeing a true positive result (0.00099)

... divided by the overall probability of seeing

a positive test result of either sort (0.01098)

... gives the probability of actually having HIV

given the observation of a positive test result (0.0902).

As a Tree

[Figure: tree diagram of the four joint probabilities: $P(D \mid H) \cdot P(H)$, $P(D^C \mid H) \cdot P(H)$, $P(D \mid H^C) \cdot P(H^C)$ and $P(D^C \mid H^C) \cdot P(H^C)$.]

As a Graph

[Figure: the same four joint probabilities drawn as a graph.]

Bayes' theorem: formula

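$$P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D \mid H) \cdot P(H) + P(D \mid H^C) \cdot P(H^C)}$$

$$P(H \mid D) = \frac{0.99 \times 0.001}{(0.99 \times 0.001) + (0.01 \times 0.999)}$$

$$P(H \mid D) = \frac{0.00099}{0.01098} = 0.09016$$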

Bayes' theorem

$$P(\text{Theory} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Theory}) \cdot P(\text{Theory})}{P(\text{Data})}$$


• A simpler version of the formula.

• The denominator simplifies to the overall

probability of seeing the observed data.
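One way to read this simpler form: P(Data) is the sum of P(Data|Theory) · P(Theory) over every competing theory. The sketch below assumes the theories are mutually exclusive and exhaustive; the function name and dictionary layout are my own choices, not something from the lecture.

```python
def bayes_update(priors: dict, likelihoods: dict) -> dict:
    """Return P(theory | data) for each theory in `priors`."""
    # P(Data): total probability of the data across all theories.
    p_data = sum(likelihoods[theory] * priors[theory] for theory in priors)
    return {
        theory: likelihoods[theory] * priors[theory] / p_data
        for theory in priors
    }


priors = {"HIV": 0.001, "no HIV": 0.999}
likelihoods = {"HIV": 0.99, "no HIV": 0.01}  # P(positive test | theory)
posteriors = bayes_update(priors, likelihoods)
print({theory: round(p, 4) for theory, p in posteriors.items()})
# {'HIV': 0.0902, 'no HIV': 0.9098}
```

With only two theories, H and its complement, this reduces to the longer formula used for the HIV test.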

Rethinking the test example

• Let's view the doctor as a scientist who is

collecting data. He starts with an estimated

prior probability of 0.001 that the patient has HIV.


• He makes a measurement (i.e., the HIV test).


• The measurement forces a change in the

estimated probability that the patient has HIV.


Rethinking the test example

• Bayes' theorem spells out the rational way

for the doctor to update his prior probability

for HIV in the light of the new evidence.

• In the jargon, this gives us a new posterior

probability, i.e., an estimate after the new

information has been taken into account.

• And in fact the estimated probability has

jumped hugely, from 0.001 to 0.09.
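How far the estimate moves depends heavily on where it starts. The sketch below repeats the same update for three priors. Only 0.001 comes from the lecture; the other two base rates are hypothetical values added for comparison.

```python
def posterior(prior: float, sensitivity: float = 0.99, specificity: float = 0.99) -> float:
    """P(HIV | positive test) for a given prior probability of HIV."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)


# 0.001 is the lecture's base rate; 0.01 and 0.1 are hypothetical.
for prior in (0.001, 0.01, 0.1):
    print(f"prior {prior:<5} -> posterior {posterior(prior):.3f}")
```

With the same test, a prior of 0.01 gives a posterior of about 0.5, and a prior of 0.1 gives about 0.917.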

Another example: finding the mole

• This is a story in the style of John le Carré,

intended to make clear the link between Bayes'

theorem and the revision or updating of

scientific theories.

• So: you're the head of MI6. You're pretty

sure there's a "mole" in your organization.

Another example: finding the mole

• You've narrowed

it down to five

suspects: Alan,

Bob, Chris,

Dave, and Ed.

Finding the mole

• You have all five arrested and begin to

interrogate them.

• You know from previous experience with

interrogations that there are five behaviours

to be expected in any given session: normal

behaviour, nervousness, anger at the

accusation, making a mistake in one's story,

and a desperate exhausted confession.

Finding the mole

• However, none of these five behaviours will

completely settle the question.

• Both moles and loyal operatives will exhibit

any of these, even confession.

• However, you know from experience that

moles and loyal operatives will exhibit the

five behaviours at different rates.

Behaviours of loyal operatives

Behaviours of moles

Prior probabilities

• Perhaps you have no idea who the mole is,

but are convinced that it must be one of the five.


• The probability-of-being-the-mole is 0.2 for

each person in this case. This is called a

uniform prior.

Iterative Bayesian reasoning

• We begin the interrogation sessions.

• After each session, we update our prior

probability estimate of each person being the

mole using Bayes' theorem.

• We then return to the questioning, but

today's posterior becomes tomorrow's prior.
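The update loop can be sketched in a few lines of Python. Two things here are assumptions rather than the lecture's own material. First, the behaviour rates below are illustrative values of mine: the lecture's behaviour charts are not reproduced in this text, and these rates were chosen because they give results consistent with the session tables. Second, each suspect is updated independently as "mole" versus "loyal", which fits the later remark that the probabilities need not add to 1.0. This is not the lecture's own Python program.

```python
# ASSUMED behaviour rates (illustrative; not taken from the lecture's charts).
P_GIVEN_LOYAL = {"Normal": 0.70, "Nervous": 0.10, "Angry": 0.05, "Mistake": 0.10, "Confess": 0.05}
P_GIVEN_MOLE = {"Normal": 0.55, "Nervous": 0.15, "Angry": 0.10, "Mistake": 0.10, "Confess": 0.10}


def update(prior: float, behaviour: str) -> float:
    """Bayes' theorem for one suspect: P(mole | observed behaviour)."""
    mole = P_GIVEN_MOLE[behaviour] * prior
    loyal = P_GIVEN_LOYAL[behaviour] * (1 - prior)
    return mole / (mole + loyal)


# Uniform prior of 0.2 for each suspect.
beliefs = {name: 0.2 for name in ("Alan", "Bob", "Chris", "Dave", "Ed")}

# Behaviours observed in the first two sessions.
sessions = [
    {"Alan": "Normal", "Bob": "Normal", "Chris": "Confess", "Dave": "Normal", "Ed": "Normal"},
    {"Alan": "Normal", "Bob": "Mistake", "Chris": "Nervous", "Dave": "Normal", "Ed": "Confess"},
]

for number, observed in enumerate(sessions, start=1):
    # Today's posterior becomes tomorrow's prior.
    beliefs = {name: update(beliefs[name], observed[name]) for name in beliefs}
    print(number, {name: round(p, 3) for name, p in beliefs.items()})
```

Under these assumed rates, a "Mistake" is equally likely for moles and loyal operatives, so it leaves the estimate unchanged.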

Iterative Bayesian reasoning (uniform priors)

Session 1

Suspect  Prior  Behaviour  Posterior
Alan     0.2    Normal     0.164
Bob      0.2    Normal     0.164
Chris    0.2    Confess    0.333
Dave     0.2    Normal     0.164
Ed       0.2    Normal     0.164

Session 2

Suspect  Prior  Behaviour  Posterior
Alan     0.164  Normal     0.134
Bob      0.164  Mistake    0.164
Chris    0.333  Nervous    0.429
Dave     0.164  Normal     0.134
Ed       0.164  Confess    0.282

Iterative Bayesian reasoning

Session 27

Suspect  Prior  Behaviour  Posterior
Alan     0.222  Confess    0.363
Bob      0.018  Nervous    0.026
Chris    0.195  Angry      0.326
Dave     0.009  Mistake    0.009
Ed       0.26   Normal     0.216

Session 28

Suspect  Prior  Behaviour  Posterior
Alan     0.363  Angry      0.533
Bob      0.026  Normal     0.021
Chris    0.326  Normal     0.275
Dave     0.009  Normal     0.007
Ed       0.216  Angry      0.356

Iterative Bayesian reasoning

Session 150

Suspect  Prior  Behaviour  Posterior
Alan     0.0    Normal     0.0
Bob      0.0    Normal     0.0
Chris    0.999  Confess    1.0
Dave     0.001  Confess    0.001
Ed       1.0    Normal     1.0

• The truth is there are two moles: Chris and

Ed. After enough sessions, our probability

estimates reflect this.

Prior probabilities

• But perhaps you are not so agnostic:

o Alan is your oldest friend; you can't believe it could

be him. You assign him a prior of 0.001.

o Bob seems unlikely, but you never know: 0.1.

o Chris, well, you never liked his face: 0.5.

o Dave has been taking a lot of mysterious holidays to

Moscow: 0.75.

o Ed: surely not? 0.05.

• The probabilities don't add to 1.0 as there

just might be more than one of them!
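The effect of starting from different priors can be explored with a small simulation. Everything below is a sketch under stated assumptions: the behaviour rates are illustrative values of mine (the lecture's charts are not reproduced here), the behaviours are drawn at random rather than taken from the lecture's sessions, and 2000 sessions are used instead of 150 so that the outcome does not hinge on one lucky or unlucky run. The identities of the moles (Chris and Ed) and both sets of priors are taken from the slides.

```python
import random

BEHAVIOURS = ["Normal", "Nervous", "Angry", "Mistake", "Confess"]
# ASSUMED behaviour rates (illustrative; not taken from the lecture's charts).
P_GIVEN_LOYAL = dict(zip(BEHAVIOURS, [0.70, 0.10, 0.05, 0.10, 0.05]))
P_GIVEN_MOLE = dict(zip(BEHAVIOURS, [0.55, 0.15, 0.10, 0.10, 0.10]))
# Ground truth from the lecture: Chris and Ed are the moles.
IS_MOLE = {"Alan": False, "Bob": False, "Chris": True, "Dave": False, "Ed": True}


def update(prior: float, behaviour: str) -> float:
    """P(mole | observed behaviour) for one suspect."""
    mole = P_GIVEN_MOLE[behaviour] * prior
    loyal = P_GIVEN_LOYAL[behaviour] * (1 - prior)
    return mole / (mole + loyal)


def interrogate(priors: dict, sessions: int, seed: int) -> dict:
    """Simulate `sessions` rounds of questioning and return the final beliefs."""
    rng = random.Random(seed)  # same seed -> same sequence of behaviours
    beliefs = dict(priors)
    for _ in range(sessions):
        for name in beliefs:
            rates = P_GIVEN_MOLE if IS_MOLE[name] else P_GIVEN_LOYAL
            weights = [rates[b] for b in BEHAVIOURS]
            behaviour = rng.choices(BEHAVIOURS, weights=weights)[0]
            beliefs[name] = update(beliefs[name], behaviour)
    return beliefs


uniform = {name: 0.2 for name in IS_MOLE}
opinionated = {"Alan": 0.001, "Bob": 0.1, "Chris": 0.5, "Dave": 0.75, "Ed": 0.05}

for label, priors in (("uniform", uniform), ("opinionated", opinionated)):
    final = interrogate(priors, sessions=2000, seed=1)
    print(label, {name: round(p, 3) for name, p in final.items()})
```

Because both runs use the same seed, they see the same evidence and differ only in their priors. In this sketch, both sets of priors end up pointing at the same two suspects.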

Starting with non-uniform priors

Session 1

Suspect  Prior  Behaviour  Posterior
Alan     0.001  Normal     0.001
Bob      0.1    Normal     0.08
Chris    0.5    Confess    0.667
Dave     0.75   Normal     0.702
Ed       0.05   Normal     0.04

Session 2

Suspect  Prior  Behaviour  Posterior
Alan     0.001  Normal     0.001
Bob      0.08   Mistake    0.08
Chris    0.667  Nervous    0.75
Dave     0.702  Normal     0.649
Ed       0.04   Confess    0.076

Starting with non-uniform priors

Session 150

Suspect  Prior  Behaviour  Posterior
Alan     0.0    Normal     0.0
Bob      0.0    Normal     0.0
Chris    1.0    Confess    1.0
Dave     0.009  Confess    0.017
Ed       1.0    Normal     1.0

• We get to the same estimates eventually.

We started out badly wrong about Dave and

Ed, but with enough data, our priors don't matter.


Bayesian statistics?

• Explicitly Bayesian statistical procedures

exist, in which new data is used to update


o These were not really practical before the era of

computational statistical tools.

o Often used in machine learning or artificial

intelligence: e.g., how should a robot use sensory

input to update its estimate of where the target is?

Bayesian statistics?

• Bayes-inspired procedures exist, e.g., the

Bayesian Information Criterion; similar to the

AIC measure.

• Bayes as a mindset.

o Comparing models and searching for the current

best one is a better statistical practice than repeated

use of NHST.

o Preferred model will change as more data comes in.
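For reference, the two criteria mentioned above are usually defined as follows. These are the standard textbook definitions, not something given in the lecture: k is the number of fitted parameters, n the number of observations, and L-hat the maximised likelihood of the model. Lower values indicate the preferred model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

The two differ only in how they penalise extra parameters: once n exceeds about 7 (so that ln n > 2), BIC penalises each parameter more heavily than AIC.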

Additional material

• Great online intro to Bayesian thinking.

• Python program used to produce the "find

the mole" example.