probability/statistics lecture notes 4: hypothesis testing
DESCRIPTION
A. Spanos (PHIL 6334) "Probability/Statistics Lecture Notes 4: Hypothesis Testing"TRANSCRIPT
PHIL 6334 - Probability/Statistics Lecture Notes 4:
Hypothesis Testing
Aris Spanos [Spring 2014]
1 Introduction
After estimation (point and interval) we continue the discus-
sion of frequentist statistical inference by focusing on hy-
pothesis testing. The discussion will be confined to the
basics of frequentist testing by presenting a bare bones nar-
rative with very few nuances.
As in the case of estimation, the cornerstone of testing is
the notion of a statistical model denoted by:
Mθ(x)={(x;θ) θ∈Θ} x∈X:=R
where (x;θ) x∈X is the (joint) distribution of the samplethat encapsulates the probabilistic structure of the sample
X:=(1 ) Like estimation, hypothesis testing can be
best understood in terms mappings between two sets:
X — sample space: the set of all possible values of the sample X
Θ — parameter space: the set of all possible values of θ
where, as in the case of estimation, inferences are framed in
terms of the unknown parameter(s) θ
The first important difference between estimation and test-
ing is the nature of the reasoning underlying the inference
procedures. The reasoning in estimation is factual, evalua-
tion of error probabilities is under the True State of Nature
1
(θ=θ∗), where the statement ‘θ∗denotes the true value of θ’is a shorthand for saying that ‘data x0 constitute a realiza-
tion of the sample X with distribution (x;θ∗)’. In testing,the reasoning is hypothetical: evaluation of the relevant error
probabilities is under various hypothetical scenarios revolving
around different values of θ inΘ associated with the null (0)
and alternative (1) hypotheses.
The main objective in statistical testing is to use the
sample information x∈X as summarized by (x;θ) in con-junction with data x0:=(1 2 ), to narrow down the
set of possible values of the unknown parameter θ∈Θ using
hypothetical values of θ. Ideally, the narrowing down reduces
Mθ(x) to a single pointM∗(x)={(x;θ∗)} x∈X. That is,hypothesis testing is all about learning fromdata x0 about the
‘true’ statistical Data Generating Mechanism (DGM)M∗(x).Instead of asking the data to pinpoint the value θ∗ in testingone assumes a specific value, say θ=θ0 and poses the question
to data data x0 whether θ0 is ‘close enough’ to θ∗ or not. In
general, hypothetical reasoning enables statistical testing to
pose sharper questions to the data and elicit more informative
answers.
Section 2 provides a bare bones account of Fisher’s signif-
icance testing by focusing on the simple Normal model. Sec-
tion 3 gives a basic account of the Neyman-Pearson (N-P)
approach to testing, paying particular attention to the elabo-
ration of the initial Fisher set up. Section 4 raises briefly some
of the foundational problems bedeviling both approaches to
testing since the 1930s.
2
2 Fisher’s significance testing
Example 1. Let us assume that the appropriate statisti-
cal model for data x0 is the simple (one parameter) Normal
model, where is known (table 1).
Table 1 - Simple Normal (one parameter)Model
Statistical GM: =+ ∈N={1 2 }[1] Normality: v N( ) ∈R[2] Constant mean: ()=
[3] Constant variance: ()=2 [known]
⎫⎬⎭ ∈N.
[4] Independence: { ∈N} independent process
Note that in this case X:=R and Θ=(−∞∞)A typical Fisher type null hypothesis takes the form:
0 : =0 (1)
where the particular value 0 is given. The question posed
by the null hypothesis in (1) is the extent to which data x0accords with the pre-specified value 0 i.e. Θ=(−∞∞):=Ris narrowed down to a single value.
Fisher used common sense to construct a test statistic
that aims to evaluate the accordance between 0 and data x0
based on the distance between a good estimator of and 0; a
good estimator is the next best thing to knowing As in the
case of a Confidence Interval, a ‘good’ (optimal) test begins
with a ‘good’ (optimal) estimator.
In lecture notes 3 it was shown that =1
P=1 pro-
vides a best estimator of because it has a sampling distrib-
3
ution:
=∗v N(∗
2
) (2)
and enjoys the optimal properties of unbiasedness (()=∗)
full efficiency ( ()=(∗)) and strong consistency
(P( lim→∞
=∗)), where ∗ denotes the ‘true’ value of what-
ever that value happens to be.
Remember that an optimal (1 − ) Confidence interval is
based on a standardized version of (2), defined in terms of the
pivotal quantity:√(−∗)
=∗v N(0 1) (3)
The question posed by 0 in (1) amounts to asking the
data x0whether the distance (∗ − 0) is ‘large enough’
to indicate discordance with 0 or not. In light of the fact
that ∗ is unknown, it makes intuitive sense to use the bestestimator of ∗ to define a test statistic in terms of the differ-ence ( − 0) which after standardization takes the form:
(X)=√(−0)
In view of the fact that is known, (X) constitutes a sta-
tistic: a function that involves no unknown parameters of
the form () : X → R. The question now is, ‘how does onedefine whether this distance is large enough?’ Since (X) is
a random variable (being a function of the sample X), that
question can only be answered in terms of its sampling distri-
bution. But what is it?
R.A. Fisher realized that one cannot use factual reasoning
4
to derive it because that will give rise to:
(X)=√(−0)
=∗v N(0 1) 0=√(∗−0)
(4)
which is non-operational because 0 involves the unknown ∗
Upon reflection, it became clear to Fisher that if the evalua-
tion was hypothetical, under 0: =0 then:
(X)=√(−0)
0v N(0 1) (5)
which is operational! Fisher (1956): “In general, tests of sig-
nificance are based on hypothetical probabilities calculated from
their null hypotheses.” (p. 47)
Fisher used the fact that the sampling distribution of (X)
in (5) is completely known to define a measure of discordance
between 0 and ∗ in terms of the p-value:
P((X) (x0); 0 true)=(x0) (6)
viewing it as an indicator of discordance between data x0 and
0; the bigger the observed test statistic (x0) the smaller the
p-value.
Definition. The p-value is the probability of getting an
outcome x∈X such that (x) is more discordant with0 than
(x0), when 0 is true.
Note: It is crucial to avoid the highly misleading notation,
P( (X) (x0) | 0)=(x0)
of the vertical line (|) instead of a semi-colon (;), since condi-tioning on0 makes no sense in frequentist inference; is not
a random variable. Hence, the p-value is NOT a conditional
probability of any type!
5
When the p-value is smaller than a certain threshold, say
= 05, it suggests that data x0 indicate a significant discor-
dance with 0. What about when the p-value is greater than
the selected threshold? Fisher was a strict falsificationism
and rejected any interpretation of that as indicating accor-
dance with 0.
Numerical examples
(i) Consider the case where =1 =100 0=12 and data x0gave rise to =122 The observed value of the test statistic
is (x0)=√100(122− 12)=20 which yields:P((X) (x0); 0 true)=023
This p-value indicates a clear discordance with 0: 0=12
(ii) Assuming the same values as above except =121
yields (x0)=√100(121−12)=1 which gives rise to a p-value:P((X) (x0); 0 true)=159
In this case Fisher would say that data x0 (=121) indi-
cate no discordance with the null, but that should NOT be
interpreted as indicating accordance with0! He was a strict
falsificationist.
The main components of a Fisher test (table 2).
Table 2 - Fisher test: main components
(i) a statistical model: Mθ(x)={(x;θ) θ∈Θ} x∈X,(ii) a null hypothesis 0
(iii) a test statistic (X)
(iv) the distribution of (X) under 0 [ascertainable],
(v) the -value P((X) (x0);0 true)=(x0)
(vi) a threshold below which the -value is significant
6
3 Neyman-Pearson (N-P) testing
Neyman and Pearson (1933) pointed out two crucial weak-
nesses in Fisher’s approach to testing:
(i) the arbitrariness of the p-value evaluation, and
(ii) the ad hoc choice of a test statistic.
Indeed, they viewed their primary contribution to statisti-
cal testing to be one of proposing a theory of optimal test-
ing — analogous to Fisher’s optimal estimation — by improving
upon two interrelated aspects of his significance testing. The
first was to replace the p-value with pre-data error proba-
bilities, type I and II, in terms of which one can define the
optimality of tests. The second was to justify the choice of a
test statistic itself in terms of the same notion of optimality.
The key to their approach to testing is given by the notion of
an alternative hypothesis.
3.1 Archetypal Neyman-Pearson (N-P) hypotheses
In the case where the sets Θ0 or Θ1 contain a single point, say
Θ={} = 0 1 that determines (x; 0) completely (no
unknowns), the hypothesis is said to be simple, otherwise it
is composite.
For a particular null hypothesis 0 in the context of a sta-
tistical model:
Mθ(x)={(x; ) ∈Θ} x∈X:=R
the default alternative hypothesis, denoted by 1, is
always defined as the complement of the null with respect
to the particular parameter space. That is, the null and the
7
alternative hypotheses should constitute a partition of the pa-
rameter space. The archetypal way to specify the null and
alternative hypotheses for N-P testing is:
0: ∈Θ0 vs. 1: ∈Θ1 (7)
where Θ0 and Θ1 constitute a partition of the parameter
space Θ; Θ0∩Θ1=∅ Θ0∪Θ1=Θ This is because for statisticalpurposes the whole of the parameter space is relevant, despite
the fact that only a few value of the unknown parameter are
often of interest for substantive purposes.
Example 1 (continued). In the case of (1) the alternative
is the rest of the parameter space:
0 : =0 vs. 1 : 6= 0 (8)
Similarly, the one-sided hypotheses:
0 : ≤0 vs. 1 : 0 (9)
0 : ≥0 vs. 1 : 0 (10)
constitute proper partitions of the parameter space.
N-P testing, in effect, partitions Θ into Θ0 and Θ1, and
poses the question whether data x0 accord better with
one or the other subset. It is often insufficiently appreci-
ated that this question pertains directly to the ‘true’ statistical
DGMM∗(x)={∗(x;θ∗)} x∈X, where ∗(x;θ∗) denotes thedistribution of the sample evaluated at θ=θ∗ This is because(7) can be written in the equivalent form:
0: ∗(x;θ∗)∈M0(x) vs. 1:
∗(x;θ∗)∈M1(x)
M0(x)={(x;θ) θ∈Θ0} M1(x)={(x;θ) θ∈Θ1} x∈X
8
To answer that question the N-P approach uses a test sta-
tistic (X) which maps the partition Θ0 and Θ1 into a cor-
responding partition of the sample space X, say 0 and 1
where 0 ∩ 1=∅ 0 ∪ 1=X known as acceptance and
rejection regions, respectively:
X=
½0 ↔ Θ01 ↔ Θ1
¾=Θ
Fig. 1 places N-P testing in a broader context where P(x)denotes the set of all possible models that could have given
rise to data x0
P ( )x
H0
H1
Fig. 1: Neyman-Pearson testing
In particular, N-P reduces hypothesis testing to a decision to
accept or reject the null hypothesis (0) according to where
data x0 happen to belong:
if x0 ∈ 0 accept 0
if x0 ∈ 1 reject 0
However, in view of the probabilistic nature of the underlying
statistical model each of these decisions carries with it the
possibility of error. In particular, viewing this choice as a
9
decision, there are two types of error one can perpetrate:
DecisionÂTSN 0 is true 0 is false
Accept 0 X type II error
Reject 0 type I error Xtype I error: reject 0 when true,
type II error: accept 0 when false.
The key to applying the N-P approach to testing is to be
able to evaluate these two types of error probabilities using
hypothetical reasoning associated with0 being true or false:
type I error probability: P(x0∈1; 0 true)=
type II error probability: P(x0∈0; 0 false)=
This depends crucially on knowing the sampling distribution
of the test statistic under these two scenarios. It is very im-
portant to emphasize that there is nothing conditional about
these error probabilities; they are evaluated under different
scenarios.
Example 1 (continued). For simplicity, let us consider the
hypotheses of interest:
0 : =0 vs. 1 : 0 (11)
in the context of the simple Normal model (table 1). It turns
out that the optimal N-P test for (11) concides with that for
the hypotheses:
0 : ≤0 vs. 1 : 0 (12)
which constitute a partition of the relevant parameter space.
Provisionally, let us adopt Fisher’s test statistic
(X)=√(−0)
as the appropriate distance function.
10
In light of the fact that departures from the null are asso-
ciated with large values of (X) an obvious rejection region
is:1()={x : (x) } (13)
where 0 is the threshold rejection value. Given that
the sampling distribution of (X) under 0 is given in (5),
and is completely known, one can evaluate the type I error
probability for different rejection values using:
P((X) ; 0 true)=
where is the type I error; 0 1. Similarly, an obvious
acceptance region is:
0()={x : (x) ≤ }since small values of (x) indicate accordance with 0
To evaluate the type II error probability one needs
to know the sampling distribution of (X) when 0 is false.
However, since 0 is false refers to 1 : 0 this evalua-
tion will involve all values of greater than 0 (i.e. 10) :
(1)=P((X) ≤ ; 0 false)=P((X) ≤ ; =1) ∀(10)The relevant sampling distribution takes the form:
(X)=√(−0)
=1v N( 1) where =√(1−0)
for all 10
(14)
That is, under 1 the sampling distribution of (X) is Nor-
mal, but its mean is non-zero and changes with 1 or equiv-
alently,
1(X)=√(−1)
=1v N(0 1) where (X)=1(X)+√(1−0)
(15)
11
A closer look at the above two types of error probabilities
reveals that both of them are evaluated from a certain thresh-
old It turns out that as increases the type I error [re-
jecting 0 when true] probability decreases because we make
it more and more difficult to reject the null. In contrast, the
type II error [accepting0 when false] probability increases as
increases because we make it easier to accept; the reverse
holds when decreases. Hence, there is a trade-off between
the two types of error probabilities as the threshold changes,
whichmeans that one cannot keep both error probabilities low
at the same time. Very low error probabilities are desirable
because that will render the test more effective in the sense
that it will make fewer errors (more reliable).
How can one address this trade-off?
Neyman and Pearson (1929) suggested that a natural way
to address this trade-off is as follows:
(a) Specify the null and alternative hypothesis in such a
way so as to render the type I error the more serious of the
two potential errors.
12
(b) Fix the type I error probability (significance level)
to a small number:
P((X) ; 0 true)=
say =.05 or =.01, where the choice of depend on the
particular circumstances.
(c) For a given choose the optimal test {(X) 1()}that minimizes the type II error probability for all values
10
The last step is often replaced with an equivalent step:
(c)* for a given choose a test {(X) 1()} that maxi-mizes the power of the test in question for all values 10 :
(1)=P((X) ; 1(1) true) for all 1 0 (16)
UniformlyMost Powerful (UMP). A test :={(X) 1()}is said to beUMP if it has higher power than any other -level
test e for all values ∈Θ1, i.e. (;) ≥ (; e) for all ∈Θ1
That is, a UMP N-P test is one whose power curve dominates
that of every other possible test in the sense that for all 10its power is greater than or equal to that of the other tests.
standard Normal tables
One-sided values Two-sided values
=100 : =128 =100 : =1645
=050 : =1645 =050 : =196
=025 : =196 =025 : =200
=010 : =233 =010 : =258
13
Example 1 (continued). In the above explication of the
basic components of a N-P test we have taken the Fisher test
statistic as given, but it’s not obvious that it gives rise to
a UMP test in the case of the hypotheses in (11). Does it?
Before we answer that question let us consider the problem of
evaluating the power of the test defined by {(X) 1()} for=025 and different values of 1 12
Consider the case where =1 =100 0=12 and data x0gave rise to =1227 The observed value of the test statistic
is (x0)=√100(1227−12)=27 which in view of the fact that
=196 for =025 it leads to a rejection of 0: 0=12 The
power of this test is defined by:
(1)=P((X) 196; 1(1) true) for all 1 12
In light of the fact that for the relevant sampling distri-
bution for the evaluation of (1) (14), the use the N(0 1)
tables requires one to split the test statistic into:√(−0)
=√(−1)
+ 1 1=
√(1−0)
to evaluate the power for different values of 1 use (15):
(1)=P³1(X) −
√(1−0)
´ for all 1 0 (17)
14
Table 3 - Power of {(X) 1(025)}=1−0 1=
√(1−0)
: (1)=P(
√(−1)
−1+;1)
=0 =0 (121)=P( 196) = 025
=05 =5 (121)=P( 196− 5) = 072
=1 =1 (121)=P( 196− 1) = 169
=2 =2 (122)=P( 196− 2) = 516
=3 =3 (123)=P( 196− 3) = 851
=4 =4 (124)=P( 196− 4) = 979
=5 =5 (125)=P( 196− 5) = 999
where is a generic standard Normal r.v., i.e. v N(0 1)The power of this test is typical of an optimal (UMP)
test since (1) increases with the non-zeromean 1=√(1−0)
:
(a) the power increases with the sample size
(b) the power increases with the discrepancy =(1−0)(c) the power decreases with
The features (a)-(b) are often used to decide pre-data on
how large should be to detect departures (1−0) of interestas part of the pre-data design of the study.
Returning to the original intentions by Neyman and Pear-
son (1933) to improve upon Fisher’s significance testing, we
can see that by bringing into the set up the notion of an alter-
native hypothesis defined as the compliment to the null, they
proposed to:
(i) replace the post-data p-value with the null, with the
pre-data type I and II error probabilities, and
(ii) define an optimal test in terms of notion of an signif-
icance level UMP test.
15
The notion of optimality renders the choice of the test sta-
tistic and the associated rejection region a matter of math-
ematical optimization, replacing Fisher’s intuition what test
statistic makes sense to use in different cases. It turned out
that in most cases Fisher’s initial intuition coincided with the
notion of an optimal test.
Example 1 (continued). The question that one might
naturally raise at this stage is whether the same test statistic
(X)=√(−0)
can be used to specify a UMP for testing the
two-sided hypotheses:
0 : =0 vs. 1 : 6=0 (18)
The rejection region in this case should naturally allow for
discrepancies on either side of 0 and would take the form:
∗1 ()={x : |(x)| 2} (19)
It turns out that the test defined by {(X) ∗1 ()} is notUMP, but it is UMP Unbiased; see Spanos (1999).
The main components of a N-P test are given in table 4.
Table 4 - N-P test main components
(i) a statistical model: Mθ(x)={(x;θ) θ∈Θ} x∈X,(ii) a null (0) and an alternative (1) hypothesis withinMθ(x)
(iii) a test statistic (X)
(iv) the distribution of (X) under 0 [ascertainable],
(v) the significance level (or size)
(vi) the rejection region 1()={x : (x) }(vii) the distribution of (X) under 1 [ascertainable].
Remarks. (i) It is very important to remember that a N-P
test is not just a test statistic! It is at least a pair :={(X) 1()}16
(ii) The optimality of an N-P test is inextricably bound up
with the optimality of the estimator the test statistic is based
on. Hence, it is no accident that most optimal N-P tests
are based on consistent, fully efficient and sufficient estimators.
For instance, if one were to replace in (X)=√(−0)
with the unbiased estimator b2=(1+)2 vN( 2
2) the
resulting test based:
(X)=√2(b2−0)
and 1()={x: (x) }
will not be optimal because its power is much lower than that
of . Worse, ̌ is an inconsistent test: its power does
not approach one as →∞ for all discrepancies (1 − 0)
Consistency is considered a mininal property as in the case of
optimal estimators.
(iii) Also, by changing the rejection region one can render
an optimal N-P test useless! For instance, replacing the rejec-
tion region of {(X) 1()} with: 1()={x: (x) }the resulting test T:={(X) 1()} is practically uselessbecause it is a biased test, i.e. its power is less than the sig-
nificance level, ((X) (x0);=1) ≤ , for all 1 0,
and it decreases as the discrepancy increases.
17
4 The fallacies of acceptance and rejection
It turned out that despite these obvious technical develop-
ments in statistical testing, both the Fisher and N-P ap-
proaches suffered from serious foundational problems due to
the lack of a coherent philosophy of testing. In particular, nei-
ther account gave a satisfactory answer to the basic question
(see Mayo, 1996):
when do data x0 provide evidence for (or against)
a hypothesis or a claim ?
Fisher’s notion of the p-value as a measure of discordance,
gave rise to several foundational questions that remained largely
unanswered since the 1930s:
(a) how does one interpret a small p-value as it pertains to
evidence against 0?
(b) how does one interpret a large p-value as it pertains to
evidence for 0?
(c) since the p-value depends on the sample size how
does one interpret a small
p-value when is very large?
Similarly, the N-P approach to hypothesis testing gave rise
to several foundational questions of its own:
(d) does one interpret accepting 0 as evidence for 0? If
not, why?
(e) does one interpret rejecting 0 as evidence for 1? If
not, why?
These questions are associated with two fundamental fal-
lacies that have bedeviled frequentist testing since the 1930s:
18
¥ (I) The fallacy of acceptance: (mis)-interpreting accept
0 [no evidence against 0] as evidence for 0; e.g. the test
had low power to detect existing discrepancy,
¥ (II) the fallacy of rejection: (mis)-interpreting reject 0
[evidence against 0] as evidence for a particular1; e.g. the
test had very high power to detect even tiny discrepancies.
The best example of the fallacy of acceptance is when 0
is false but the test applied did not have any power to detect
the particular discrepancy present. N-P testing is clearly vul-
nerable to this fallacy, especially when the power of the test
is not evaluated.
Example 1 (continued). In the case where = 025
=1 =100 0=12 and data x0 gave rise to =1218 The
observed value of the test statistic is (x0)=√100(1218 −
12)=18 which leads to the acceptance of 0: 0=12 since
= 196 If, however, the sustantive discrepancy of interest
in this case is (1−0)=2 then according to table 3, thepower of the test {(X) 1()} is only (122)=516 which
is rather low, and thus the test might not be able to detect
such a discrepancy even if it’s present.
A good example of the fallacy of rejection is conflating sta-
tistical with substantive significance. It could easily be that
the test has very high power (e.g. when the sample size is
very large) and the detected discrepancy is substantively tiny
but statistically significant due to the over-sensitivity of the
test in question. Both Fisher and N-P testing are vulnerable
to this fallacy.
The question that naturally arises at this stage is whether
19
there is a way to address the fallacies of acceptance and re-
jection and the measleading interpretations associated with
observed CIs, and at the same time provide a coherent in-
terpretation to statistical testing that offers an inferential in-
terpretation that pertains to substantive hypotheses of inter-
est? Error statistics provides such a coherent intepretation
of testing and enables one to address these fallacies using a
post-data evaluation of the accept/reject decisions using se-
vere testing reasoning; see Mayo (1996), Mayo and Spanos
(2006).
The error statistical approach, viewed narrowly at the sta-
tistical level, blends in the Fisher and Neyman-Pearson (N-P)
testing perspectives to weave a coherent frequentist inductive
reasoning anchored firmly on error probabilities, both pre and
post data. The key to this coalescing is provided by recog-
nizing that Fisher’s p-value reasoning is based on a post-data
error probability, and Neyman-Pearson’s type I and II errors
reasoning is based on pre-data error probabilities, and they
play complementary, not contradictory, roles.
5 Summary and conclusions
In frequentist inference learning from data x0 about the
stochastic phenomenon of interest is accomplished by applying
optimal inference procedures with ascertainable error probabil-
ities in the context of a statistical model:
Mθ(x)={(x;θ) θ∈Θ} x∈R (20)
Hypothesis testing gives rise to learning from data x0 by
partitioningMθ(x) into two subsets framed in terms of the
20
parameter(s):
0: ∈Θ0 vs. 0: ∈Θ1 (21)
and use x0 pose questions using hypothetical reasoning to learn
about M∗(x)={(x;θ∗)} x∈X, where ∗(x)=(x;θ∗) de-notes the ‘true’ distribution of the sample. A more perceptive
way to specify (21) is:
0
∈Θ0z }| {∗(x)∈M0(x)={(x;θ) θ∈Θ0} vs. 1:
∈Θ1z }| {∗(x)∈M1(x)={(x;θ) θ∈Θ1}
This notation elucidates the basic question posed in hy-
pothesis testing:
IAssuming that the trueM∗(x) lies within the boundariesofMθ(x) can x0 be used to narrow it down to a smaller subset
M0(x)?
A test :={(X) 1()} is defined in terms of a test sta-tistic ((X)) and a rejection region (1()), and its optimality
is calibrated in terms of the relevant error probabilities:
type I: P(x0∈1; 0() true) ≤ () for ∈Θ0type II: P(x0∈0; 1(1) true)=(1) for 1∈Θ1
These error probabilities specify how often these procedures
lead to erroneous inferences. For a given significance level
the optimal N-P test is the one whose pre-data capacity
(power):
(1)=P(x0∈1; =1) for all 1∈Θ1is maximum for all 1∈Θ1As they stand, neither the p-value nor the accept/reject0
rules provide an evidential interpretation pertaining to ‘when
data x0 provide evidence for (or against) a hypothesis (or
claim) ’.
21