probability/statistics lecture notes 4: hypothesis testing

PHIL 6334 - Probability/Statistics Lecture Notes 4:

Hypothesis Testing

Aris Spanos [Spring 2014]

1 Introduction

After estimation (point and interval) we continue the discus-

sion of frequentist statistical inference by focusing on hy-

pothesis testing. The discussion will be confined to the

basics of frequentist testing by presenting a bare bones nar-

rative with very few nuances.

As in the case of estimation, the cornerstone of testing is

the notion of a statistical model denoted by:

Mθ(x)={(x;θ) θ∈Θ} x∈X:=R

where (x;θ) x∈X is the (joint) distribution of the samplethat encapsulates the probabilistic structure of the sample

X:=(1 ) Like estimation, hypothesis testing can be

best understood in terms mappings between two sets:

X — sample space: the set of all possible values of the sample X

Θ — parameter space: the set of all possible values of θ

where, as in the case of estimation, inferences are framed in

terms of the unknown parameter(s) θ

The first important difference between estimation and test-

ing is the nature of the reasoning underlying the inference

procedures. The reasoning in estimation is factual, evalua-

tion of error probabilities is under the True State of Nature

1

(θ=θ∗), where the statement ‘θ∗denotes the true value of θ’is a shorthand for saying that ‘data x0 constitute a realiza-

tion of the sample X with distribution (x;θ∗)’. In testing,the reasoning is hypothetical: evaluation of the relevant error

probabilities is under various hypothetical scenarios revolving

around different values of θ inΘ associated with the null (0)

and alternative (1) hypotheses.

The main objective in statistical testing is to use the

sample information x∈X as summarized by (x;θ) in con-junction with data x0:=(1 2 ), to narrow down the

set of possible values of the unknown parameter θ∈Θ using

hypothetical values of θ. Ideally, the narrowing down reduces

Mθ(x) to a single pointM∗(x)={(x;θ∗)} x∈X. That is,hypothesis testing is all about learning fromdata x0 about the

‘true’ statistical Data Generating Mechanism (DGM)M∗(x).Instead of asking the data to pinpoint the value θ∗ in testingone assumes a specific value, say θ=θ0 and poses the question

to data data x0 whether θ0 is ‘close enough’ to θ∗ or not. In

general, hypothetical reasoning enables statistical testing to

pose sharper questions to the data and elicit more informative

answers.

Section 2 provides a bare bones account of Fisher’s signif-

icance testing by focusing on the simple Normal model. Sec-

tion 3 gives a basic account of the Neyman-Pearson (N-P)

approach to testing, paying particular attention to the elabo-

ration of the initial Fisher set up. Section 4 raises briefly some

of the foundational problems bedeviling both approaches to

testing since the 1930s.

2

2 Fisher’s significance testing

Example 1. Let us assume that the appropriate statisti-

cal model for data x0 is the simple (one parameter) Normal

model, where is known (table 1).

Table 1 - Simple Normal (one parameter)Model

Statistical GM: =+ ∈N={1 2 }[1] Normality: v N( ) ∈R[2] Constant mean: ()=

[3] Constant variance: ()=2 [known]

⎫⎬⎭ ∈N.

[4] Independence: { ∈N} independent process

Note that in this case X:=R and Θ=(−∞∞)A typical Fisher type null hypothesis takes the form:

0 : =0 (1)

where the particular value 0 is given. The question posed

by the null hypothesis in (1) is the extent to which data x0accords with the pre-specified value 0 i.e. Θ=(−∞∞):=Ris narrowed down to a single value.

Fisher used common sense to construct a test statistic

that aims to evaluate the accordance between 0 and data x0

based on the distance between a good estimator of and 0; a

good estimator is the next best thing to knowing As in the

case of a Confidence Interval, a ‘good’ (optimal) test begins

with a ‘good’ (optimal) estimator.

In lecture notes 3 it was shown that =1

P=1 pro-

vides a best estimator of because it has a sampling distrib-

3

ution:

=∗v N(∗

2

) (2)

and enjoys the optimal properties of unbiasedness (()=∗)

full efficiency ( ()=(∗)) and strong consistency

(P( lim→∞

=∗)), where ∗ denotes the ‘true’ value of what-

ever that value happens to be.

Remember that an optimal (1 − ) Confidence interval is

based on a standardized version of (2), defined in terms of the

pivotal quantity:√(−∗)

=∗v N(0 1) (3)

The question posed by 0 in (1) amounts to asking the

data x0whether the distance (∗ − 0) is ‘large enough’

to indicate discordance with 0 or not. In light of the fact

that ∗ is unknown, it makes intuitive sense to use the bestestimator of ∗ to define a test statistic in terms of the differ-ence ( − 0) which after standardization takes the form:

(X)=√(−0)

In view of the fact that is known, (X) constitutes a sta-

tistic: a function that involves no unknown parameters of

the form () : X → R. The question now is, ‘how does onedefine whether this distance is large enough?’ Since (X) is

a random variable (being a function of the sample X), that

question can only be answered in terms of its sampling distri-

bution. But what is it?

R.A. Fisher realized that one cannot use factual reasoning

4

to derive it because that will give rise to:

(X)=√(−0)

=∗v N(0 1) 0=√(∗−0)

(4)

which is non-operational because 0 involves the unknown ∗

Upon reflection, it became clear to Fisher that if the evalua-

tion was hypothetical, under 0: =0 then:

(X)=√(−0)

0v N(0 1) (5)

which is operational! Fisher (1956): “In general, tests of sig-

nificance are based on hypothetical probabilities calculated from

their null hypotheses.” (p. 47)

Fisher used the fact that the sampling distribution of (X)

in (5) is completely known to define a measure of discordance

between 0 and ∗ in terms of the p-value:

P((X) (x0); 0 true)=(x0) (6)

viewing it as an indicator of discordance between data x0 and

0; the bigger the observed test statistic (x0) the smaller the

p-value.

Definition. The p-value is the probability of getting an

outcome x∈X such that (x) is more discordant with0 than

(x0), when 0 is true.

Note: It is crucial to avoid the highly misleading notation,

P( (X) (x0) | 0)=(x0)

of the vertical line (|) instead of a semi-colon (;), since condi-tioning on0 makes no sense in frequentist inference; is not

a random variable. Hence, the p-value is NOT a conditional

probability of any type!

5

When the p-value is smaller than a certain threshold, say

= 05, it suggests that data x0 indicate a significant discor-

dance with 0. What about when the p-value is greater than

the selected threshold? Fisher was a strict falsificationism

and rejected any interpretation of that as indicating accor-

dance with 0.

Numerical examples

(i) Consider the case where =1 =100 0=12 and data x0gave rise to =122 The observed value of the test statistic

is (x0)=√100(122− 12)=20 which yields:P((X) (x0); 0 true)=023

This p-value indicates a clear discordance with 0: 0=12

(ii) Assuming the same values as above except =121

yields (x0)=√100(121−12)=1 which gives rise to a p-value:P((X) (x0); 0 true)=159

In this case Fisher would say that data x0 (=121) indi-

cate no discordance with the null, but that should NOT be

interpreted as indicating accordance with0! He was a strict

falsificationist.

The main components of a Fisher test (table 2).

Table 2 - Fisher test: main components

(i) a statistical model: Mθ(x)={(x;θ) θ∈Θ} x∈X,(ii) a null hypothesis 0

(iii) a test statistic (X)

(iv) the distribution of (X) under 0 [ascertainable],

(v) the -value P((X) (x0);0 true)=(x0)

(vi) a threshold below which the -value is significant

6

3 Neyman-Pearson (N-P) testing

Neyman and Pearson (1933) pointed out two crucial weak-

nesses in Fisher’s approach to testing:

(i) the arbitrariness of the p-value evaluation, and

(ii) the ad hoc choice of a test statistic.

Indeed, they viewed their primary contribution to statisti-

cal testing to be one of proposing a theory of optimal test-

ing — analogous to Fisher’s optimal estimation — by improving

upon two interrelated aspects of his significance testing. The

first was to replace the p-value with pre-data error proba-

bilities, type I and II, in terms of which one can define the

optimality of tests. The second was to justify the choice of a

test statistic itself in terms of the same notion of optimality.

The key to their approach to testing is given by the notion of

an alternative hypothesis.

3.1 Archetypal Neyman-Pearson (N-P) hypotheses

In the case where the sets Θ0 or Θ1 contain a single point, say

Θ={} = 0 1 that determines (x; 0) completely (no

unknowns), the hypothesis is said to be simple, otherwise it

is composite.

For a particular null hypothesis 0 in the context of a sta-

tistical model:

Mθ(x)={(x; ) ∈Θ} x∈X:=R

the default alternative hypothesis, denoted by 1, is

always defined as the complement of the null with respect

to the particular parameter space. That is, the null and the

7

alternative hypotheses should constitute a partition of the pa-

rameter space. The archetypal way to specify the null and

alternative hypotheses for N-P testing is:

0: ∈Θ0 vs. 1: ∈Θ1 (7)

where Θ0 and Θ1 constitute a partition of the parameter

space Θ; Θ0∩Θ1=∅ Θ0∪Θ1=Θ This is because for statisticalpurposes the whole of the parameter space is relevant, despite

the fact that only a few value of the unknown parameter are

often of interest for substantive purposes.

Example 1 (continued). In the case of (1) the alternative

is the rest of the parameter space:

0 : =0 vs. 1 : 6= 0 (8)

Similarly, the one-sided hypotheses:

0 : ≤0 vs. 1 : 0 (9)

0 : ≥0 vs. 1 : 0 (10)

constitute proper partitions of the parameter space.

N-P testing, in effect, partitions Θ into Θ0 and Θ1, and

poses the question whether data x0 accord better with

one or the other subset. It is often insufficiently appreci-

ated that this question pertains directly to the ‘true’ statistical

DGMM∗(x)={∗(x;θ∗)} x∈X, where ∗(x;θ∗) denotes thedistribution of the sample evaluated at θ=θ∗ This is because(7) can be written in the equivalent form:

0: ∗(x;θ∗)∈M0(x) vs. 1:

∗(x;θ∗)∈M1(x)

M0(x)={(x;θ) θ∈Θ0} M1(x)={(x;θ) θ∈Θ1} x∈X

8

To answer that question the N-P approach uses a test sta-

tistic (X) which maps the partition Θ0 and Θ1 into a cor-

responding partition of the sample space X, say 0 and 1

where 0 ∩ 1=∅ 0 ∪ 1=X known as acceptance and

rejection regions, respectively:

X=

½0 ↔ Θ01 ↔ Θ1

¾=Θ

Fig. 1 places N-P testing in a broader context where P(x)denotes the set of all possible models that could have given

rise to data x0

P ( )x

H0

H1

Fig. 1: Neyman-Pearson testing

In particular, N-P reduces hypothesis testing to a decision to

accept or reject the null hypothesis (0) according to where

data x0 happen to belong:

if x0 ∈ 0 accept 0

if x0 ∈ 1 reject 0

However, in view of the probabilistic nature of the underlying

statistical model each of these decisions carries with it the

possibility of error. In particular, viewing this choice as a

9

decision, there are two types of error one can perpetrate:

DecisionÂTSN 0 is true 0 is false

Accept 0 X type II error

Reject 0 type I error Xtype I error: reject 0 when true,

type II error: accept 0 when false.

The key to applying the N-P approach to testing is to be

able to evaluate these two types of error probabilities using

hypothetical reasoning associated with0 being true or false:

type I error probability: P(x0∈1; 0 true)=

type II error probability: P(x0∈0; 0 false)=

This depends crucially on knowing the sampling distribution

of the test statistic under these two scenarios. It is very im-

portant to emphasize that there is nothing conditional about

these error probabilities; they are evaluated under different

scenarios.

Example 1 (continued). For simplicity, let us consider the

hypotheses of interest:

0 : =0 vs. 1 : 0 (11)

in the context of the simple Normal model (table 1). It turns

out that the optimal N-P test for (11) concides with that for

the hypotheses:

0 : ≤0 vs. 1 : 0 (12)

which constitute a partition of the relevant parameter space.

Provisionally, let us adopt Fisher’s test statistic

(X)=√(−0)

as the appropriate distance function.

10

In light of the fact that departures from the null are asso-

ciated with large values of (X) an obvious rejection region

is:1()={x : (x) } (13)

where 0 is the threshold rejection value. Given that

the sampling distribution of (X) under 0 is given in (5),

and is completely known, one can evaluate the type I error

probability for different rejection values using:

P((X) ; 0 true)=

where is the type I error; 0 1. Similarly, an obvious

acceptance region is:

0()={x : (x) ≤ }since small values of (x) indicate accordance with 0

To evaluate the type II error probability one needs

to know the sampling distribution of (X) when 0 is false.

However, since 0 is false refers to 1 : 0 this evalua-

tion will involve all values of greater than 0 (i.e. 10) :

(1)=P((X) ≤ ; 0 false)=P((X) ≤ ; =1) ∀(10)The relevant sampling distribution takes the form:

(X)=√(−0)

=1v N( 1) where =√(1−0)

for all 10

(14)

That is, under 1 the sampling distribution of (X) is Nor-

mal, but its mean is non-zero and changes with 1 or equiv-

alently,

1(X)=√(−1)

=1v N(0 1) where (X)=1(X)+√(1−0)

(15)

11

A closer look at the above two types of error probabilities

reveals that both of them are evaluated from a certain thresh-

old It turns out that as increases the type I error [re-

jecting 0 when true] probability decreases because we make

it more and more difficult to reject the null. In contrast, the

type II error [accepting0 when false] probability increases as

increases because we make it easier to accept; the reverse

holds when decreases. Hence, there is a trade-off between

the two types of error probabilities as the threshold changes,

whichmeans that one cannot keep both error probabilities low

at the same time. Very low error probabilities are desirable

because that will render the test more effective in the sense

that it will make fewer errors (more reliable).

How can one address this trade-off?

Neyman and Pearson (1929) suggested that a natural way

to address this trade-off is as follows:

(a) Specify the null and alternative hypothesis in such a

way so as to render the type I error the more serious of the

two potential errors.

12

(b) Fix the type I error probability (significance level)

to a small number:

P((X) ; 0 true)=

say =.05 or =.01, where the choice of depend on the

particular circumstances.

(c) For a given choose the optimal test {(X) 1()}that minimizes the type II error probability for all values

10

The last step is often replaced with an equivalent step:

(c)* for a given choose a test {(X) 1()} that maxi-mizes the power of the test in question for all values 10 :

(1)=P((X) ; 1(1) true) for all 1 0 (16)

UniformlyMost Powerful (UMP). A test :={(X) 1()}is said to beUMP if it has higher power than any other -level

test e for all values ∈Θ1, i.e. (;) ≥ (; e) for all ∈Θ1

That is, a UMP N-P test is one whose power curve dominates

that of every other possible test in the sense that for all 10its power is greater than or equal to that of the other tests.

standard Normal tables

One-sided values Two-sided values

=100 : =128 =100 : =1645

=050 : =1645 =050 : =196

=025 : =196 =025 : =200

=010 : =233 =010 : =258

13

Example 1 (continued). In the above explication of the

basic components of a N-P test we have taken the Fisher test

statistic as given, but it’s not obvious that it gives rise to

a UMP test in the case of the hypotheses in (11). Does it?

Before we answer that question let us consider the problem of

evaluating the power of the test defined by {(X) 1()} for=025 and different values of 1 12

Consider the case where =1 =100 0=12 and data x0gave rise to =1227 The observed value of the test statistic

is (x0)=√100(1227−12)=27 which in view of the fact that

=196 for =025 it leads to a rejection of 0: 0=12 The

power of this test is defined by:

(1)=P((X) 196; 1(1) true) for all 1 12

In light of the fact that for the relevant sampling distri-

bution for the evaluation of (1) (14), the use the N(0 1)

tables requires one to split the test statistic into:√(−0)

=√(−1)

+ 1 1=

√(1−0)

to evaluate the power for different values of 1 use (15):

(1)=P³1(X) −

√(1−0)

´ for all 1 0 (17)

14

Table 3 - Power of {(X) 1(025)}=1−0 1=

√(1−0)

: (1)=P(

√(−1)

−1+;1)

=0 =0 (121)=P( 196) = 025

=05 =5 (121)=P( 196− 5) = 072

=1 =1 (121)=P( 196− 1) = 169

=2 =2 (122)=P( 196− 2) = 516

=3 =3 (123)=P( 196− 3) = 851

=4 =4 (124)=P( 196− 4) = 979

=5 =5 (125)=P( 196− 5) = 999

where is a generic standard Normal r.v., i.e. v N(0 1)The power of this test is typical of an optimal (UMP)

test since (1) increases with the non-zeromean 1=√(1−0)

:

(a) the power increases with the sample size

(b) the power increases with the discrepancy =(1−0)(c) the power decreases with

The features (a)-(b) are often used to decide pre-data on

how large should be to detect departures (1−0) of interestas part of the pre-data design of the study.

Returning to the original intentions by Neyman and Pear-

son (1933) to improve upon Fisher’s significance testing, we

can see that by bringing into the set up the notion of an alter-

native hypothesis defined as the compliment to the null, they

proposed to:

(i) replace the post-data p-value with the null, with the

pre-data type I and II error probabilities, and

(ii) define an optimal test in terms of notion of an signif-

icance level UMP test.

15

The notion of optimality renders the choice of the test sta-

tistic and the associated rejection region a matter of math-

ematical optimization, replacing Fisher’s intuition what test

statistic makes sense to use in different cases. It turned out

that in most cases Fisher’s initial intuition coincided with the

notion of an optimal test.

Example 1 (continued). The question that one might

naturally raise at this stage is whether the same test statistic

(X)=√(−0)

can be used to specify a UMP for testing the

two-sided hypotheses:

0 : =0 vs. 1 : 6=0 (18)

The rejection region in this case should naturally allow for

discrepancies on either side of 0 and would take the form:

∗1 ()={x : |(x)| 2} (19)

It turns out that the test defined by {(X) ∗1 ()} is notUMP, but it is UMP Unbiased; see Spanos (1999).

The main components of a N-P test are given in table 4.

Table 4 - N-P test main components

(i) a statistical model: Mθ(x)={(x;θ) θ∈Θ} x∈X,(ii) a null (0) and an alternative (1) hypothesis withinMθ(x)

(iii) a test statistic (X)

(iv) the distribution of (X) under 0 [ascertainable],

(v) the significance level (or size)

(vi) the rejection region 1()={x : (x) }(vii) the distribution of (X) under 1 [ascertainable].

Remarks. (i) It is very important to remember that a N-P

test is not just a test statistic! It is at least a pair :={(X) 1()}16

(ii) The optimality of an N-P test is inextricably bound up

with the optimality of the estimator the test statistic is based

on. Hence, it is no accident that most optimal N-P tests

are based on consistent, fully efficient and sufficient estimators.

For instance, if one were to replace in (X)=√(−0)

with the unbiased estimator b2=(1+)2 vN( 2

2) the

resulting test based:

(X)=√2(b2−0)

and 1()={x: (x) }

will not be optimal because its power is much lower than that

of . Worse, ̌ is an inconsistent test: its power does

not approach one as →∞ for all discrepancies (1 − 0)

Consistency is considered a mininal property as in the case of

optimal estimators.

(iii) Also, by changing the rejection region one can render

an optimal N-P test useless! For instance, replacing the rejec-

tion region of {(X) 1()} with: 1()={x: (x) }the resulting test T:={(X) 1()} is practically uselessbecause it is a biased test, i.e. its power is less than the sig-

nificance level, ((X) (x0);=1) ≤ , for all 1 0,

and it decreases as the discrepancy increases.

17

4 The fallacies of acceptance and rejection

It turned out that despite these obvious technical develop-

ments in statistical testing, both the Fisher and N-P ap-

proaches suffered from serious foundational problems due to

the lack of a coherent philosophy of testing. In particular, nei-

ther account gave a satisfactory answer to the basic question

(see Mayo, 1996):

when do data x0 provide evidence for (or against)

a hypothesis or a claim ?

Fisher’s notion of the p-value as a measure of discordance,

gave rise to several foundational questions that remained largely

unanswered since the 1930s:

(a) how does one interpret a small p-value as it pertains to

evidence against 0?

(b) how does one interpret a large p-value as it pertains to

evidence for 0?

(c) since the p-value depends on the sample size how

does one interpret a small

p-value when is very large?

Similarly, the N-P approach to hypothesis testing gave rise

to several foundational questions of its own:

(d) does one interpret accepting 0 as evidence for 0? If

not, why?

(e) does one interpret rejecting 0 as evidence for 1? If

not, why?

These questions are associated with two fundamental fal-

lacies that have bedeviled frequentist testing since the 1930s:

18

¥ (I) The fallacy of acceptance: (mis)-interpreting accept

0 [no evidence against 0] as evidence for 0; e.g. the test

had low power to detect existing discrepancy,

¥ (II) the fallacy of rejection: (mis)-interpreting reject 0

[evidence against 0] as evidence for a particular1; e.g. the

test had very high power to detect even tiny discrepancies.

The best example of the fallacy of acceptance is when 0

is false but the test applied did not have any power to detect

the particular discrepancy present. N-P testing is clearly vul-

nerable to this fallacy, especially when the power of the test

is not evaluated.

Example 1 (continued). In the case where = 025

=1 =100 0=12 and data x0 gave rise to =1218 The

observed value of the test statistic is (x0)=√100(1218 −

12)=18 which leads to the acceptance of 0: 0=12 since

= 196 If, however, the sustantive discrepancy of interest

in this case is (1−0)=2 then according to table 3, thepower of the test {(X) 1()} is only (122)=516 which

is rather low, and thus the test might not be able to detect

such a discrepancy even if it’s present.

A good example of the fallacy of rejection is conflating sta-

tistical with substantive significance. It could easily be that

the test has very high power (e.g. when the sample size is

very large) and the detected discrepancy is substantively tiny

but statistically significant due to the over-sensitivity of the

test in question. Both Fisher and N-P testing are vulnerable

to this fallacy.

The question that naturally arises at this stage is whether

19

there is a way to address the fallacies of acceptance and re-

jection and the measleading interpretations associated with

observed CIs, and at the same time provide a coherent in-

terpretation to statistical testing that offers an inferential in-

terpretation that pertains to substantive hypotheses of inter-

est? Error statistics provides such a coherent intepretation

of testing and enables one to address these fallacies using a

post-data evaluation of the accept/reject decisions using se-

vere testing reasoning; see Mayo (1996), Mayo and Spanos

(2006).

The error statistical approach, viewed narrowly at the sta-

tistical level, blends in the Fisher and Neyman-Pearson (N-P)

testing perspectives to weave a coherent frequentist inductive

reasoning anchored firmly on error probabilities, both pre and

post data. The key to this coalescing is provided by recog-

nizing that Fisher’s p-value reasoning is based on a post-data

error probability, and Neyman-Pearson’s type I and II errors

reasoning is based on pre-data error probabilities, and they

play complementary, not contradictory, roles.

5 Summary and conclusions

In frequentist inference learning from data x0 about the

stochastic phenomenon of interest is accomplished by applying

optimal inference procedures with ascertainable error probabil-

ities in the context of a statistical model:

Mθ(x)={(x;θ) θ∈Θ} x∈R (20)

Hypothesis testing gives rise to learning from data x0 by

partitioningMθ(x) into two subsets framed in terms of the

20

parameter(s):

0: ∈Θ0 vs. 0: ∈Θ1 (21)

and use x0 pose questions using hypothetical reasoning to learn

about M∗(x)={(x;θ∗)} x∈X, where ∗(x)=(x;θ∗) de-notes the ‘true’ distribution of the sample. A more perceptive

way to specify (21) is:

0

∈Θ0z }| {∗(x)∈M0(x)={(x;θ) θ∈Θ0} vs. 1:

∈Θ1z }| {∗(x)∈M1(x)={(x;θ) θ∈Θ1}

This notation elucidates the basic question posed in hy-

pothesis testing:

IAssuming that the trueM∗(x) lies within the boundariesofMθ(x) can x0 be used to narrow it down to a smaller subset

M0(x)?

A test :={(X) 1()} is defined in terms of a test sta-tistic ((X)) and a rejection region (1()), and its optimality

is calibrated in terms of the relevant error probabilities:

type I: P(x0∈1; 0() true) ≤ () for ∈Θ0type II: P(x0∈0; 1(1) true)=(1) for 1∈Θ1

These error probabilities specify how often these procedures

lead to erroneous inferences. For a given significance level

the optimal N-P test is the one whose pre-data capacity

(power):

(1)=P(x0∈1; =1) for all 1∈Θ1is maximum for all 1∈Θ1As they stand, neither the p-value nor the accept/reject0

rules provide an evidential interpretation pertaining to ‘when

data x0 provide evidence for (or against) a hypothesis (or

claim) ’.

21

probability/statistics lecture notes 4: hypothesis testing

Education