Some preliminary Q&A

What is the philosophical difference between classical (“frequentist”) and Bayesian statistics?

To a frequentist, unknown model parameters are fixed and unknown, and only estimable by replications of data from some experiment.

A Bayesian thinks of parameters as random, and thus having distributions (just like the data). We can thus think about unknowns for which no reliable frequentist experiment exists, e.g.

θ = proportion of US men with untreated atrial fibrillation

Chapter 1: Approaches for Statistical Inference – p. 1/19
Some preliminary Q&A

How does it work? A Bayesian writes down a prior guess for θ, p(θ), then combines this with the information that the data X provide to obtain the posterior distribution of θ, p(θ|X). All statistical inferences (point and interval estimates, hypothesis tests) then follow as appropriate summaries of the posterior.

Note that

posterior information ≥ prior information ≥ 0,

with the second “≥” replaced by “=” only if the prior is noninformative (which is often uniform, or “flat”).

Chapter 1: Approaches for Statistical Inference – p. 2/19
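The prior-to-posterior machinery can be sketched numerically. Here is a minimal illustration (not from the text): an assumed Beta(2, 2) prior for θ combined with a binomial likelihood for x = 7 successes in n = 10 trials, with the posterior computed on a grid and checked against the conjugate Beta(a + x, b + n − x) answer.

```python
# Sketch of "posterior ∝ likelihood × prior" on a grid.
# The Beta(2, 2) prior and the data x = 7, n = 10 are illustrative choices.
from math import comb

def posterior_on_grid(x, n, a, b, m=1001):
    """Normalized posterior for theta on an evenly spaced grid in [0, 1]."""
    grid = [i / (m - 1) for i in range(m)]
    # prior density (up to a constant) times the binomial likelihood
    post = [t ** (a - 1) * (1 - t) ** (b - 1) * comb(n, x) * t ** x * (1 - t) ** (n - x)
            for t in grid]
    z = sum(post)                       # normalize numerically
    return grid, [p / z for p in post]

grid, post = posterior_on_grid(x=7, n=10, a=2, b=2)
post_mean = sum(t * p for t, p in zip(grid, post))
# conjugacy check: the posterior is Beta(2 + 7, 2 + 3), with mean 9/14
print(post_mean)  # ≈ 0.643
```

All the point and interval summaries mentioned above are then just functionals of `post`.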
Some preliminary Q&A

Is the classical approach “wrong”? While a “hardcore” Bayesian might say so, it is probably more accurate to think of classical methods as merely “limited in scope”!

The Bayesian approach expands the class of models we can fit to our data, enabling us to handle

repeated measures
unbalanced or missing data
nonhomogeneous variances
multivariate data

– and many other settings that are awkward or infeasible from a classical point of view. The approach also eases the interpretation of and learning from those models once fit.

Chapter 1: Approaches for Statistical Inference – p. 3/19
Simple example of Bayesian thinking

From Business Week, online edition, July 31, 2001: “Economists might note, to take a simple example, that American turkey consumption tends to increase in November. A Bayesian would clarify this by observing that Thanksgiving occurs in this month.”

Data: plot of turkey consumption by month

Prior:
location of Thanksgiving in the calendar
knowledge of Americans’ Thanksgiving eating habits

Posterior: understanding of the pattern in the data!

Chapter 1: Approaches for Statistical Inference – p. 4/19
Bayes means revision of estimates

Humans tend to be Bayesian in the sense that most revise their opinions about uncertain quantities as more data accumulate. For example:

Suppose you are about to make your first submission to a particular academic journal

You assess your chances of your paper being accepted (you have an opinion, but the “true” probability is unknown)

You submit your article and it is accepted!

Question: What is your revised opinion regarding the acceptance probability for papers like yours?

If you said anything other than “1”, you are a Bayesian!

Chapter 1: Approaches for Statistical Inference – p. 5/19
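This kind of revision can be made concrete with a toy calculation (an illustration, not from the text): encode your opinion as a Beta(a, b) distribution, where a = 2, b = 3 is an arbitrary choice giving prior mean 0.4, and update on one accepted paper.

```python
# Opinion revision as a Beta update; the prior pseudo-counts are assumed.
a, b = 2, 3                      # prior "pseudo-counts" (illustrative choice)
prior_mean = a / (a + b)         # prior acceptance probability: 0.4
a += 1                           # one accepted paper observed
posterior_mean = a / (a + b)     # revised opinion: 3/6 = 0.5
print(prior_mean, posterior_mean)
```

The revised estimate rises toward the data but does not jump to 1, exactly the behavior the slide calls Bayesian.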
Bayes can account for structure

County-level breast cancer rates per 10,000 women:

 79  87  83  80  78
 90  89  92  99  95
 96 100   ⋆ 110 115
101 109 105 108 112
 96 104  92 101  96

With no direct data for ⋆, what estimate would you use?

Is 200 reasonable?

Probably not: all the other rates are around 100

Perhaps use the average of the “neighboring” values (again, near 100)

Chapter 1: Approaches for Statistical Inference – p. 6/19
Accounting for structure (cont’d)

Now assume that data become available for county ⋆: 100 women at risk, 2 cancer cases. Thus

rate = (2/100) × 10,000 = 200

Would you use this value as the estimate?

Probably not: the sample size is very small, so this estimate will be unreliable. How about a compromise between 200 and the rates in the neighboring counties?

Now repeat this thought experiment if the county ⋆ data were 20/1000, 200/10000, ...

Bayes and empirical Bayes methods can incorporate the structure in the data, weight the data and prior information appropriately, and allow the data to dominate as the sample size becomes large.

Chapter 1: Approaches for Statistical Inference – p. 7/19
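A hedged sketch of the compromise idea (not the book's actual model): weight the county's crude rate against the neighborhood rate of about 100 by a factor n/(n + m), where the "prior sample size" m is an assumed tuning value. The crude rate is 200 in every scenario from the thought experiment, but the weight, and hence the estimate, changes with the county's own sample size.

```python
# Simple precision-weighted compromise between the crude county rate and
# a neighborhood rate; prior_rate = 100 and prior_n = 1000 are assumptions.
def shrunk_rate(cases, women, prior_rate=100.0, prior_n=1000):
    crude = cases / women * 10_000          # rate per 10,000 women
    w = women / (women + prior_n)           # data weight grows with sample size
    return w * crude + (1 - w) * prior_rate

for cases, women in [(2, 100), (20, 1000), (200, 10_000)]:
    print(cases, women, round(shrunk_rate(cases, women), 1))
# → 109.1, 150.0, 190.9: the estimate starts near the neighbors' 100 and
#   lets the data dominate (approach 200) as the sample size grows
```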
Motivating Example

From Berger and Berry (1988, Amer. Scientist): Consider a clinical trial to study the effectiveness of Vitamin C in treating the common cold.

Observations are matched pairs of subjects (twins?), half randomized (in “double blind” fashion) to vitamin C, half to placebo. We count how many pairs had C giving superior relief after 48 hours.

Chapter 1: Approaches for Statistical Inference – p. 8/19
Two Designs

Design #1: Sample n = 17 pairs, and test

H0 : P(C better) = 1/2  vs.  HA : P(C better) ≠ 1/2

Suppose we observe x = 13 preferences for C. Then

p-value = P(X ≥ 13 or X ≤ 4) = .049

So if α = .05, stop and reject H0.

Design #2: Sample n1 = 17 pairs. Then:

if x1 ≥ 13 or x1 ≤ 4, stop.
otherwise, sample an additional n2 = 27 pairs.

Reject H0 if X1 + X2 ≥ 29 or X1 + X2 ≤ 15.

Chapter 1: Approaches for Statistical Inference – p. 9/19
Two Designs (cont’d)

We choose this second stage since under H0,
P(X1 + X2 ≥ 29 or X1 + X2 ≤ 15) = .049
– the same as Stage 1!

Suppose we again observe X1 = 13. Now:

p-value = P(X1 ≥ 13 or X1 ≤ 4)
        + P(X1 + X2 ≥ 29 and 4 < X1 < 13)
        + P(X1 + X2 ≤ 15 and 4 < X1 < 13)
        = .085 ← no longer significant at α = .05!

Yet the observed data were exactly the same; all we did was contemplate a second stage (no effect on data), and it changed our answer!

Chapter 1: Approaches for Statistical Inference – p. 10/19
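The two calculations can be reproduced with exact binomial tail sums; this is a sketch using only the design constants stated on the slides.

```python
# Same data (x1 = 13 of 17), two designs, two p-values.
from math import comb

def binom_pmf(k, n, p=0.5):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Design #1: X ~ Bin(17, 1/2), two-sided tail at x = 13
p1 = sum(binom_pmf(k, 17) for k in range(13, 18)) + \
     sum(binom_pmf(k, 17) for k in range(0, 5))

# Design #2: add the contemplated second stage of n2 = 27 pairs,
# summing over the continuation region 4 < X1 < 13
p2 = p1
for x1 in range(5, 13):
    tail2 = sum(binom_pmf(k, 27) for k in range(29 - x1, 28)) + \
            sum(binom_pmf(k, 27) for k in range(0, 16 - x1))
    p2 += binom_pmf(x1, 17) * tail2

print(round(p1, 3), round(p2, 3))  # .049 vs ≈ .085: same data, different design
```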
Additional Q & A

Q: What if we kept adding stages?
A: p-value → 1, even though x1 is still 13!

Q: So are p-values really “objective evidence”?
A: No, since extra information (like the design) is critical!

Q: What about unforeseen events? Example: the first 5 patients develop an allergic reaction to the treatment – the trial is stopped by clinicians. Can a frequentist analyze these data?
A: No: this aspect of the design wasn’t anticipated, so p-values are not computable!

Q: Can a Bayesian?
A: (Obviously) Yes – as we shall see.....

Chapter 1: Approaches for Statistical Inference – p. 11/19
Bayes/frequentist controversy

Frequentist outlook: Suppose we have k unknown quantities θ = (θ1, θ2, . . . , θk), and data X = (X1, ..., Xn) with a distribution that depends on θ, say p(x|θ).

The frequentist selects a loss function, L(θ, a), and develops procedures δ(X) which perform well with respect to frequentist risk,

R(θ, δ) = E_{X|θ}[L(θ, δ(X))]

A good frequentist procedure is one that enjoys low loss over repeated data sampling regardless of the true value of θ.

Chapter 1: Approaches for Statistical Inference – p. 12/19
Bayes/frequentist controversy

So are there any such procedures? Sure:

Example: Suppose Xi|θ ~ iid N(θ, σ^2), i = 1, . . . , n. The standard 95% frequentist confidence interval (CI) is δ(x) = (x̄ ± 1.96 s/√n).

If we choose the loss function

L(θ, δ) = 0 if θ ∈ δ(x),
          1 if θ ∉ δ(x),   then

R[(θ, σ), δ] = E_{x|θ,σ}[L(θ, δ(x))] = P_{θ,σ}(θ ∉ δ(x)) = .05

On average in repeated use, δ fails only 5% of the time.

Has some appeal, since it works for all θ and σ!

Chapter 1: Approaches for Statistical Inference – p. 13/19
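A quick Monte Carlo check of this risk claim (an illustration; the particular θ, σ, and n are arbitrary choices): over repeated samples, the interval x̄ ± 1.96 s/√n misses θ roughly 5% of the time.

```python
# Simulate repeated sampling and count how often the 95% CI misses theta.
import random
import statistics

random.seed(1)
theta, sigma, n, reps = 3.0, 2.0, 50, 20_000  # arbitrary illustrative values
misses = 0
for _ in range(reps):
    x = [random.gauss(theta, sigma) for _ in range(n)]
    xbar, s = statistics.fmean(x), statistics.stdev(x)
    half = 1.96 * s / n ** 0.5
    if not (xbar - half <= theta <= xbar + half):
        misses += 1
print(misses / reps)  # close to the nominal .05 (slightly above, since we
                      # use z = 1.96 with the estimated s rather than a t cutoff)
```

Changing `theta` and `sigma` leaves the miss rate essentially unchanged, which is the "works for all θ and σ" point.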
Bayes/frequentist controversy

Many frequentist analyses are based on the likelihood function, which is just p(x|θ) viewed as a function of θ:

L(θ; x) ≡ p(x|θ)

Larger p(x|θ) corresponds to values of θ which are more “likely”. This leads naturally to...

The Likelihood Principle: In making inferences or decisions about θ after x is observed, all relevant experimental information is contained in the likelihood function for the observed x. Furthermore, two likelihood functions contain the same information about θ if they are proportional to each other as functions of θ.

Seems completely reasonable (and theoretically justifiable) – yet frequentist analyses may violate it!...

Chapter 1: Approaches for Statistical Inference – p. 14/19
Binomial vs. Negative Binomial

Example due to Pratt (comment on Birnbaum, 1962 JASA): Suppose 12 independent coin tosses: 9H, 3T. Test:

H0 : θ = 1/2  vs.  HA : θ > 1/2

Two possibilities for f(x|θ):

Binomial: n = 12 tosses (fixed beforehand) ⇒ X = #H ~ Bin(12, θ)

⇒ L1(θ) = p1(x|θ) = (n choose x) θ^x (1 − θ)^{n−x} = (12 choose 9) θ^9 (1 − θ)^3.

Negative Binomial: Flip until we get r = 3 tails ⇒ X ~ NB(3, θ)

⇒ L2(θ) = p2(x|θ) = (r + x − 1 choose x) θ^x (1 − θ)^r = (11 choose 9) θ^9 (1 − θ)^3.

Adopt the rejection region, “Reject H0 if X ≥ c.”

Chapter 1: Approaches for Statistical Inference – p. 15/19
Binomial vs. Negative Binomial

p-values:

α1 = P_{θ=1/2}(X ≥ 9) = Σ_{j=9..12} (12 choose j) θ^j (1 − θ)^{12−j} = .075

α2 = P_{θ=1/2}(X ≥ 9) = Σ_{j=9..∞} (2 + j choose j) θ^j (1 − θ)^3 = .0325

So at α = .05, two different decisions! Violates the Likelihood Principle, since L1(θ) ∝ L2(θ)!!

What happened? Besides the observed x = 9, we also took into account the “more extreme” X ≥ 10.

Jeffreys (1961): “...a hypothesis which may be true may be rejected because it has not predicted observable results which have not occurred.”

In our example, the probability of the unpredicted and nonoccurring set X ≥ 10 has been used as evidence against H0!

Chapter 1: Approaches for Statistical Inference – p. 16/19
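The two tail probabilities can be computed exactly with the constants above (a sketch). The likelihoods are proportional, yet the tails, which sum over unobserved outcomes, land on opposite sides of α = .05.

```python
# Same data (9H, 3T), two stopping rules, two tail probabilities.
from math import comb

theta = 0.5
# Binomial design: X ~ Bin(12, 1/2), tail X >= 9
a1 = sum(comb(12, j) * theta**j * (1 - theta)**(12 - j) for j in range(9, 13))
# Negative binomial design: X ~ NB(3, 1/2); tail X >= 9 via the complement
a2 = 1 - sum(comb(2 + j, j) * theta**j * (1 - theta)**3 for j in range(0, 9))
print(a1, a2)  # a1 exceeds .05, a2 does not: two different decisions
```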
Conditional (Bayesian) Perspective

Always condition on data which have actually occurred; the long-run performance of a procedure is of (at most) secondary interest. Fix a prior distribution p(θ), and use Bayes’ Theorem (1763):

p(θ|x) ∝ p(x|θ) p(θ)

(“posterior ∝ likelihood × prior”)

Indeed, it often turns out that using the Bayesian formalism with relatively vague priors produces procedures which perform well using traditional frequentist criteria (e.g., low mean squared error over repeated sampling)! – several examples in Chapter 4 of the C&L text!

Chapter 1: Approaches for Statistical Inference – p. 17/19
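A consequence worth seeing numerically: since the posterior depends on the data only through the likelihood, the Bayesian analysis of the coin-toss example is the same under both designs. A minimal sketch, assuming a flat Beta(1, 1) prior (an illustrative choice): the 9H/3T data give identical posteriors whether the likelihood is binomial or negative binomial, because the two differ only by a constant that normalization removes.

```python
# The posterior is unchanged by the stopping rule when likelihoods are
# proportional; the flat prior here is an assumed choice.
from math import comb

m = 10_001
grid = [i / (m - 1) for i in range(m)]

def normalized(lik):
    z = sum(lik)
    return [v / z for v in lik]

post_binom  = normalized([comb(12, 9) * t**9 * (1 - t)**3 for t in grid])
post_negbin = normalized([comb(11, 9) * t**9 * (1 - t)**3 for t in grid])

# the two normalized posteriors agree pointwise (up to rounding error)
assert max(abs(a - b) for a, b in zip(post_binom, post_negbin)) < 1e-12
mean = sum(t * p for t, p in zip(grid, post_binom))
print(mean)  # posterior is Beta(10, 4): mean 10/14 ≈ 0.714
```

So Bayesian inference obeys the Likelihood Principle automatically, which is exactly the conditioning advocated on this slide.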
Frequentist criticisms

[from B. Efron (1986, Amer. Statist.), “Why isn’t everyone a Bayesian?”]

Shouldn’t condition on x (but this renews conflict with the Likelihood Principle)

Not easy/automatic (but computing keeps improving: MCMC methods, WinBUGS software....)

How to pick the prior p(θ): two experimenters could get different answers with the same data! How to control the influence of the prior? How to get objective results (say, for a court case, scientific report,....)?

⇒ Clearly this final criticism is the most serious!

Chapter 1: Approaches for Statistical Inference – p. 18/19
Bayesian Advantages in Inference

Ability to formally incorporate prior information

The reason for stopping experimentation does not affect the inference

Answers are more easily interpretable by nonspecialists (e.g. confidence intervals)

All analyses follow directly from the posterior; no separate theories of estimation, testing, multiple comparisons, etc. are needed

Any question can be directly answered (bioequivalence, multiple comparisons/hypotheses, ...)

Inferences are conditional on the actual data

Bayes procedures possess many optimality properties (e.g. consistent, impose parsimony in model choice, define the class of optimal frequentist procedures, ...)

Chapter 1: Approaches for Statistical Inference – p. 19/19