Why I am a Bayesian (and why you should become one, too)
or Classical statistics considered harmful

Kevin Murphy, UBC CS & Stats
9 February 2005


Page 1

Why I am a Bayesian (and why you should become one, too)

or Classical statistics considered harmful

Kevin Murphy, UBC CS & Stats

9 February 2005

Page 2

Where does the title come from?

• “Why I am not a Bayesian”, Glymour, 1981

• “Why Glymour is a Bayesian”, Rosenkrantz, 1983

• “Why isn’t everyone a Bayesian?”, Efron, 1986

• “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001

Many other such philosophical essays…

Page 3

Frequentist vs Bayesian

Frequentist:

• Prob = objective relative frequencies

• Params are fixed unknown constants, so cannot write e.g. P(θ = 0.5|D)

• Estimators should be good when averaged across many trials

Bayesian:

• Prob = degrees of belief (uncertainty)

• Can write P(anything|D)

• Estimators should be good for the available data

Source: “All of statistics”, Larry Wasserman

Page 4

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?

Page 5

Coin flipping

HHTHT
HHHHH

What process produced these sequences?

The following slides are from Tenenbaum & Griffiths

Page 6

Hypotheses in coin flipping

• Fair coin, P(H) = 0.5

• Coin with P(H) = p

• Markov model

• Hidden Markov model

• ...

Describe processes by which D could be generated

D = HHTHT

statistical models

Page 7

Hypotheses in coin flipping

• Fair coin, P(H) = 0.5

• Coin with P(H) = p

• Markov model

• Hidden Markov model

• ...

Describe processes by which D could be generated

generative models

D = HHTHT

Page 8

Representing generative models

• Graphical model notation– Pearl (1988), Jordan (1998)

• Variables are nodes, edges indicate dependency

• Directed edges show causal process of data generation

(Diagrams: the data D = HHTHT as nodes d1…d5; four independent nodes d1…d4 for the fair coin, P(H) = 0.5; a chain d1 → d2 → d3 → d4 for the Markov model)

Page 9

Models with latent structure

• Not all nodes in a graphical model need to be observed

• Some variables reflect latent structure, used in generating D but unobserved

(Diagrams: the data D = HHTHT as nodes d1…d5; a hidden Markov model with latent states s1…s4 generating d1…d4; the model P(H) = p, with a latent parameter node p generating d1…d4)

How do we select the “best” model?

Page 10

Bayes’ rule

P(h|d) = P(d|h) P(h) / Σ_{h′ in H} P(d|h′) P(h′)

Posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses H
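As a concrete illustration, here is a minimal Python sketch of this rule for a finite hypothesis space; the hypothesis names and the particular numbers are illustrative, not from the slides.

def posterior(priors, likelihoods):
    # P(h|d) = P(d|h) P(h) / sum_h' P(d|h') P(h')
    unnorm = {h: likelihoods[h] * priors[h] for h in priors}
    z = sum(unnorm.values())              # sum over the space of hypotheses
    return {h: v / z for h, v in unnorm.items()}

# Illustrative: a fair coin vs a two-headed coin, after seeing D = HHTHT
priors = {"fair": 0.999, "two-headed": 0.001}
likelihoods = {"fair": 0.5 ** 5, "two-headed": 0.0}   # P(D|h)
print(posterior(priors, likelihoods))     # all posterior mass ends up on "fair"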

Page 11

The origin of Bayes’ rule

• A simple consequence of using probability to represent degrees of belief

• For any two random variables:

P(A & B) = P(A) P(B|A)
P(A & B) = P(B) P(A|B)

so P(B) P(A|B) = P(A) P(B|A)

and hence P(A|B) = P(A) P(B|A) / P(B)

Page 12

• Good statistics– consistency, and worst-case error bounds.

• Cox Axioms– necessary to cohere with common sense

• “Dutch Book” + Survival of the Fittest – if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.

• Provides a theory of incremental learning – a common currency for combining prior knowledge and the lessons of experience.

Why represent degrees of belief with probabilities?

Page 13

Hypotheses in Bayesian inference

• Hypotheses H refer to processes that could have generated the data D

• Bayesian inference provides a distribution over these hypotheses, given D

• P(D|H) is the probability of D being generated by the process identified by H

• Hypotheses H are mutually exclusive: only one process could have generated D

Page 14

Coin flipping

• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0

• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p

Page 15

Coin flipping

• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0

• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p

Page 16

Comparing two simple hypotheses

• Contrast simple hypotheses:
– H1: “fair coin”, P(H) = 0.5
– H2: “always heads”, P(H) = 1.0

• Bayes’ rule:

• With two hypotheses, use odds form

P(H|D) = P(H) P(D|H) / P(D)

Page 17

Bayes’ rule in odds form

P(H1|D) / P(H2|D) = [ P(D|H1) / P(D|H2) ] × [ P(H1) / P(H2) ]

posterior odds = Bayes factor (likelihood ratio) × prior odds

Page 18

Data = HHTHT

D: HHTHT; H1: “fair coin”, H2: “always heads”

P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 0, P(H2) = 1/1000

P(H1|D) / P(H2|D) = [P(D|H1)/P(D|H2)] × [P(H1)/P(H2)] = ∞

Page 19

Data = HHHHH

D: HHHHH; H1: “fair coin”, H2: “always heads”

P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000

P(H1|D) / P(H2|D) = [P(D|H1)/P(D|H2)] × [P(H1)/P(H2)] ≈ 30

Page 20

Data = HHHHHHHHHH

D: HHHHHHHHHH; H1: “fair coin”, H2: “always heads”

P(D|H1) = 1/2^10, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000

P(H1|D) / P(H2|D) = [P(D|H1)/P(D|H2)] × [P(H1)/P(H2)] ≈ 1
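The three calculations on the last few slides can be reproduced in a few lines of Python; this is only a sketch, with the priors 999/1000 and 1/1000 and the sequences taken from the slides.

prior_odds = 0.999 / 0.001                 # P(H1)/P(H2) = 999

def likelihood(seq, p_heads):
    # P(seq | coin with the given P(H))
    out = 1.0
    for c in seq:
        out *= p_heads if c == "H" else 1.0 - p_heads
    return out

for seq in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    l1, l2 = likelihood(seq, 0.5), likelihood(seq, 1.0)   # fair vs always-heads
    bayes_factor = l1 / l2 if l2 > 0 else float("inf")
    print(seq, bayes_factor * prior_odds)   # posterior odds: inf, ~31, ~0.98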

Page 21

Coin flipping

• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0

• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p

Page 22

Comparing simple and complex hypotheses

• Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

(Diagrams: independent nodes d1…d4 for the fair coin, P(H) = 0.5, vs. nodes d1…d4 generated from a shared parameter node p for the model P(H) = p)

Page 23

• P(H) = p is more complex than P(H) = 0.5 in two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

Comparing simple and complex hypotheses

Page 24

Comparing simple and complex hypotheses


Page 25

Comparing simple and complex hypotheses

(Plot: probability of the sequence HHHHH as a function of p; the best fit is p = 1.0)

Page 26

Comparing simple and complex hypotheses

(Plot: probability of the sequence HHTHT as a function of p; the best fit is p = 0.6)

Page 27

• P(H) = p is more complex than P(H) = 0.5 in two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

• How can we deal with this?
– frequentist: hypothesis testing
– information theorist: minimum description length
– Bayesian: just use probability theory!

Comparing simple and complex hypotheses

Page 28

P(H1|D) / P(H2|D) = [ P(D|H1) / P(D|H2) ] × [ P(H1) / P(H2) ]

Computing P(D|H1) is easy:

P(D|H1) = 1/2^N

Compute P(D|H2) by averaging over p:

P(D|H2) = ∫ P(D|p) P(p) dp, integrating p from 0 to 1

Comparing simple and complex hypotheses

Page 29

P(H1|D) / P(H2|D) = [ P(D|H1) / P(D|H2) ] × [ P(H1) / P(H2) ]

Computing P(D|H1) is easy:

P(D|H1) = 1/2^N

Compute P(D|H2) by averaging over p:

P(D|H2) = ∫ P(D|p) P(p) dp, integrating p from 0 to 1

Comparing simple and complex hypotheses

In P(D|H2) = ∫ P(D|p) P(p) dp, P(D|p) is the likelihood, P(p) is the prior, and the result is the marginal likelihood.
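A Python sketch of the two quantities for D = HHTHT; the uniform prior P(p) = 1 used in the average is an assumption here (the Beta prior is only introduced on the following slides).

from scipy.integrate import quad

NH, NT = 3, 2                                   # D = HHTHT
p_d_h1 = 0.5 ** (NH + NT)                       # P(D|H1): fair coin

# P(D|H2) = integral over p of P(D|p) P(p), with P(p) = 1 on [0,1] (assumed)
p_d_h2, _ = quad(lambda p: p**NH * (1 - p)**NT, 0.0, 1.0)

print(p_d_h1, p_d_h2)                           # 0.03125 vs ~0.0167: H1 wins here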

Page 30

Likelihood and prior

• Likelihood:

P(D|p) = p^NH (1-p)^NT

– NH: number of heads
– NT: number of tails

• Prior:

P(p) ∝ p^(FH-1) (1-p)^(FT-1) ?

Page 31

A simple method of specifying priors

• Imagine some fictitious trials, reflecting a set of previous experiences– strategy often used with neural networks

• e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair

• In fact, this is a sensible statistical idea...

Page 32

Likelihood and prior

• Likelihood:

P(D|p) = p^NH (1-p)^NT

– NH: number of heads
– NT: number of tails

• Prior:

P(p) ∝ p^(FH-1) (1-p)^(FT-1)

– FH: fictitious observations of heads
– FT: fictitious observations of tails

This is the Beta(FH, FT) distribution; FH and FT act as pseudo-counts.

Page 33

Posterior ∝ prior × likelihood

• Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)

• Likelihood: P(D|p) = p^NH (1-p)^NT

• Posterior: P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1), which has the same form as the prior!
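A minimal sketch of this conjugate update using scipy; the particular pseudo-counts FH = FT = 3 and the data counts for HHTHT are illustrative.

from scipy.stats import beta

FH, FT = 3, 3                    # fictitious (pseudo-count) heads and tails
NH, NT = 3, 2                    # observed counts for D = HHTHT

prior = beta(FH, FT)
post = beta(FH + NH, FT + NT)    # posterior is again a Beta: same form as the prior

print(prior.mean(), post.mean()) # 0.5 -> 6/11 ~ 0.545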

Page 34

Conjugate priors

• Exist for many standard distributions– formula for exponential family conjugacy

• Define prior in terms of fictitious observations

• Beta is conjugate to Bernoulli (coin-flipping)

(Plots: Beta(FH, FT) densities for FH = FT = 1, FH = FT = 3, and FH = FT = 1000)

Page 35

Normalizing constants

• Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT)

• Normalizing constant for the Beta distribution: B(FH, FT) = Γ(FH) Γ(FT) / Γ(FH + FT)

• Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH + FH, NT + FT)

• Hence the marginal likelihood is P(D) = B(NH + FH, NT + FT) / B(FH, FT)
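These closed forms can be checked numerically; a sketch, where scipy.special.betaln computes log B(·,·) and the counts are illustrative.

import numpy as np
from scipy.integrate import quad
from scipy.special import betaln

FH, FT = 3, 3                    # prior pseudo-counts (illustrative)
NH, NT = 3, 2                    # observed counts (illustrative)

# Closed form: P(D) = B(NH+FH, NT+FT) / B(FH, FT)
closed = np.exp(betaln(NH + FH, NT + FT) - betaln(FH, FT))

# Direct integration of likelihood x prior
prior_pdf = lambda p: p**(FH - 1) * (1 - p)**(FT - 1) / np.exp(betaln(FH, FT))
direct, _ = quad(lambda p: p**NH * (1 - p)**NT * prior_pdf(p), 0.0, 1.0)

print(closed, direct)            # the two agree (~0.024 here)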

Page 36

P(H1|D) / P(H2|D) = [ P(D|H1) / P(D|H2) ] × [ P(H1) / P(H2) ]

Computing P(D|H1), the likelihood for H1, is easy:

P(D|H1) = 1/2^N

Compute P(D|H2), the marginal likelihood (“evidence”) for H2, by averaging over p:

P(D|H2) = ∫ P(D|p) P(p) dp, integrating p from 0 to 1

Comparing simple and complex hypotheses

Page 37

Marginal likelihood for H1 and H2


Marginal likelihood is an average over all values of p

Page 38

Sensitivity to hyper-parameters

Page 39

• Simple and complex hypotheses can be compared directly using Bayes’ rule– requires summing over latent variables

• Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor”

• Maximum likelihood cannot be used for model selection (always prefers hypothesis with largest number of parameters)

Bayesian model selection
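A sketch of the contrast between maximum likelihood and the marginal likelihood for the coin example; the uniform prior on p under H2 is an assumption.

from scipy.integrate import quad

def compare(NH, NT):
    N = NH + NT
    lik_h1 = 0.5 ** N                                       # fair coin
    max_lik_h2 = (NH / N) ** NH * (NT / N) ** NT            # best-fitting p under H2
    marg_h2, _ = quad(lambda p: p**NH * (1 - p)**NT, 0, 1)  # uniform prior (assumed)
    print(NH, NT, lik_h1, max_lik_h2, marg_h2)

compare(3, 2)    # HHTHT: max likelihood favours H2, marginal likelihood favours H1
compare(10, 0)   # HHHHHHHHHH: both favour H2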

Page 40

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?

Page 41

Example: Belgian euro-coins

• A Belgian euro spun N=250 times came up heads X=140.

• “It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in Guardian, 2002)

Source: Mackay exercise 3.15

Page 42

Classical hypothesis testing

• Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)

• For the classical analysis we don’t need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5

• Need a decision rule that maps data D to accept/reject of H0.

• Define a scalar measure of deviance d(D) from the null hypothesis, e.g. NH or χ²

Page 43

P-values

• Define the p-value of a threshold t as pval(t) = P( d(D) ≥ t | D generated from H0 )

• Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0

Page 44

P-values

• Define the p-value of a threshold t as pval(t) = P( d(D) ≥ t | D generated from H0 )

• Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0

• Usually choose the threshold so that the false rejection rate of H0 is below the significance level α = 0.05

Page 45

P-values

• Define the p-value of a threshold t as pval(t) = P( d(D) ≥ t | D generated from H0 )

• Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0

• Usually choose the threshold so that the false rejection rate of H0 is below the significance level α = 0.05

• Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞

Page 46

P-value for euro coins

• N = 250 trials, X=140 heads

• P-value is “less than 7%”

• If N=250 and X=141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5%.

• This does not mean P(H0|D)=0.07!

% Two-sided p-value in MATLAB: P(X >= 140) + P(X <= 110) for N = 250 spins
pval = (1 - binocdf(139, 250, 0.5)) + binocdf(110, 250, 0.5)
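The same calculation can be cross-checked in Python (a sketch; binom.sf(k, n, p) is P(X > k)); it reproduces the numbers quoted above, roughly 0.066 for X = 140 (the “less than 7%”) and 0.0497 for X = 141.

from scipy.stats import binom

n = 250
for x in (140, 141):
    pval = binom.sf(x - 1, n, 0.5) + binom.cdf(n - x, n, 0.5)   # P(X >= x) + P(X <= n-x)
    print(x, pval)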

Page 47

Bayesian analysis of euro-coin

• Assume P(H0)=P(H1)=0.5

• Assume P(p) = Beta(p | α, β) under H1

• Setting α = β = 1 yields a uniform (non-informative) prior.

Page 48

Bayesian analysis of euro-coin

• With α = β = 1, the Bayes factor B = P(D|H1)/P(D|H0) comes out slightly below 1, so H0 (unbiased) is (slightly) more probable than H1 (biased).

• By varying α and β over a large range, the best we can do is make B = 1.9, which does not strongly support the biased-coin hypothesis.

• Other priors yield similar results.

• The Bayesian analysis contradicts the classical analysis.
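A sketch of the Bayes factor computation behind this slide, using sequence likelihoods and a Beta(α, β) prior on p under H1; the specific α values in the loop are arbitrary.

import numpy as np
from scipy.special import betaln

N, X = 250, 140                  # spins and heads, from the slide

def bayes_factor(a, b):
    # B = P(D|H1) / P(D|H0), with p ~ Beta(a, b) under H1
    log_p_d_h0 = N * np.log(0.5)
    log_p_d_h1 = betaln(X + a, N - X + b) - betaln(a, b)
    return np.exp(log_p_d_h1 - log_p_d_h0)

for a in (1.0, 10.0, 100.0, 1000.0):
    print(a, bayes_factor(a, a))
# With a = b = 1 the factor comes out a little below 1 (slightly favouring H0);
# the slide reports that no choice of prior pushes it much above about 1.9.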

Page 49

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?

Page 50

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?
– Violates likelihood principle
– Violates stopping rule principle
– Violates common sense

Page 51

The likelihood principle

• In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are under each hypothesis; do not ask questions about data that we might have observed but did not, such as the probability of results more extreme than those actually seen.

• This principle can be proved from two simpler principles called conditionality and sufficiency.

Page 52

Frequentist statistics violates the likelihood principle

• “The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” – Jeffreys, 1961

Page 53

Another example

• Suppose X ~ N(θ, σ²); we observe x = 3

• Compare H0: θ = 0 with H1: θ > 0

• P-value = P(X ≥ 3 | H0) ≈ 0.001, so reject H0

• Bayesian approach: update P(θ|X) using conjugate analysis; compute the Bayes factor to compare H0 and H1

Page 54

When are P-values valid?

• Suppose X ~ N(θ, σ²); we observe X = x.

• One-sided hypothesis test: H0: θ ≤ θ0 vs H1: θ > θ0

• If P(θ) ∝ 1, then P(θ|x) = N(θ; x, σ²), so P(H0|x) = P(θ ≤ θ0 | x) = Φ((θ0 - x)/σ)

• The p-value is the same in this case, since the Gaussian is symmetric in its arguments
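A sketch of this agreement for the numbers on the previous slide (θ0 = 0, observed x = 3), with the flat prior P(θ) ∝ 1; σ = 1 is assumed here, consistent with the quoted p-value of about 0.001.

from scipy.stats import norm

x, sigma, theta0 = 3.0, 1.0, 0.0

pval = norm.sf(x, loc=theta0, scale=sigma)        # classical: P(X >= x | theta0)
post_h0 = norm.cdf(theta0, loc=x, scale=sigma)    # Bayesian: P(theta <= theta0 | x), flat prior

print(pval, post_h0)     # both ~0.0013: identical in this one-sided Gaussian case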

Page 55

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?
– Violates likelihood principle
– Violates stopping rule principle
– Violates common sense

Page 56

Stopping rule principle

• Inferences you make should depend only on the observed data, not on the reasons why the data were collected.

• If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.

• Follows from likelihood principle.

Page 57

Frequentist statistics violates stopping rule principle

• Observe D = HHHTHHHHTHHT. Is there evidence of bias (P(T) > P(H))?

• Let X = 3 heads be the observed random variable and N = 12 trials be a fixed constant. Define H0: P(H) = 0.5. Then, at the 5% level, there is no significant evidence of bias: the one-sided p-value P(X ≤ 3 | N = 12, H0) is about 0.073.

Page 58

Frequentist statistics violates stopping rule principle

• Suppose instead the data were generated by tossing the coin until we got X = 3 heads.

• Now X = 3 heads is a fixed constant and N = 12 is the random variable, and there is significant evidence of bias: P(N ≥ 12 | H0) is about 0.033 < 0.05.

• Under this stopping rule the first n-1 trials contain x-1 heads and the last trial is always a head, which gives the negative binomial sampling distribution.
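A sketch of the two p-values, taking X = 3 and N = 12 from the slide; binom.cdf gives the lower-tail probabilities needed for both sampling schemes.

from scipy.stats import binom

X, N = 3, 12

# Binomial sampling: N fixed, X random. One-sided p-value for "too few heads":
p_binomial = binom.cdf(X, N, 0.5)            # P(X <= 3 | n = 12) ~ 0.073, not significant

# Negative binomial sampling: toss until X = 3 heads, so N is random.
# P(N >= 12) = P(at most 2 heads in the first 11 tosses)
p_negbinomial = binom.cdf(X - 1, N - 1, 0.5) # ~ 0.033, significant at the 5% level

print(p_binomial, p_negbinomial)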

Page 59

Ignoring stopping criterion can mislead classical estimators

• Let Xi ~ Bernoulli(θ)

• Maximum likelihood estimator: θ̂ = (number of heads) / (number of tosses)

• The MLE is unbiased: E[θ̂] = θ for a fixed number of tosses

• Now toss a coin; if it comes up heads, stop, else toss a second coin. The possible outcomes are P(H) = θ, P(TH) = θ(1-θ), P(TT) = (1-θ)².

• Under this stopping rule the MLE is biased (see the numerical check below)!

• Many classical rules for assessing significance when complex stopping rules are used.
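A numerical check of the bias under this stopping rule (a sketch; θ = 0.3 is an arbitrary choice). The expected value of the MLE works out to θ(3 - θ)/2, which is 0.405 rather than 0.3 here.

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
estimates = []
for _ in range(100_000):
    flips = [rng.random() < theta]              # first toss
    if not flips[0]:
        flips.append(rng.random() < theta)      # second toss only after a tail
    estimates.append(sum(flips) / len(flips))   # MLE: observed fraction of heads

print(np.mean(estimates), "vs true theta =", theta)   # ~0.405 vs 0.3: biased upward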

Page 60

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?
– Violates likelihood principle
– Violates stopping rule principle
– Violates common sense

Page 61

Confidence intervals

• An interval (θmin(D), θmax(D)) is a 95% CI for θ if θ lies inside this interval 95% of the time across repeated draws D ~ P(·|θ)

• This does not mean P(θ ∈ CI | D) = 0.95!

MacKay, sec. 37.3

Page 62

Example

• Draw 2 integers x1, x2 independently from P(x|θ): x = θ with probability 1/2, x = θ + 1 with probability 1/2

• If θ = 39, we would expect the pairs (39,39), (39,40), (40,39), (40,40), each with probability 1/4

Page 63

Example

• If θ = 39, we would expect the pairs (39,39), (39,40), (40,39), (40,40), each with probability 1/4

• Define the confidence interval as the single point (θmin(D), θmax(D)) = (min(x1,x2), min(x1,x2))

• e.g. (x1,x2) = (40,39) gives CI = (39,39)

• 75% of the time, this interval will contain the true θ

Page 64

CIs violate common sense

• If θ = 39, we would expect the pairs (39,39), (39,40), (40,39), (40,40), each with probability 1/4

• If (x1,x2) = (39,39), then CI = (39,39) at level 75%. But clearly P(θ = 39|D) = P(θ = 38|D) = 0.5

• If (x1,x2) = (39,40), then CI = (39,39), but clearly P(θ = 39|D) = 1.0
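A simulation of this example (a sketch, using the sampling model implied above: each xi equals θ or θ + 1 with probability 1/2, and the reported interval is the single point min(x1, x2)).

import numpy as np

rng = np.random.default_rng(0)
theta, trials, hits = 39, 100_000, 0
for _ in range(trials):
    x1 = theta + rng.integers(0, 2)
    x2 = theta + rng.integers(0, 2)
    ci = min(x1, x2)                 # the "interval" (ci, ci)
    hits += (ci == theta)

print(hits / trials)   # ~0.75: correct 75% coverage, even though the posterior
                       # given (39, 40) puts all its mass on theta = 39, and the
                       # posterior given (39, 39) is split 50/50 between 38 and 39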

Page 65

What’s wrong with the classical approach?

• Violates likelihood principle

• Violates stopping rule principle

• Violates common sense

Page 66

What’s right about the Bayesian approach?

• Simple and natural

• Optimal mechanism for reasoning under uncertainty

• Generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false

• Supports interesting (human-like) kinds of learning

Page 68

Bayesian humor

• “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”