Hypothesis Test
An Old Example Again
How about this? Students from University A rush to Fort Lauderdale.
We know the average SAT score of students from this university is 600, with a standard deviation of 15. We sampled 100 college students in Fort Lauderdale and
found their average SAT score is 620. Are they from University A?
Does the sample belong to a known population or to a totally different population? Two possibilities:
Totally different groups of people, or the same group of people before and after a particular treatment
Another Example: Does A Treatment Have An Effect?
Hypothesis Test What is hypothesis testing? Use sample statistics to evaluate a hypothesis
about a population Steps
1. State the hypotheses about the value of the population mean
2. Set the criteria for a decision
3. Collect data and compute sample statistics
4. Make a decision based on the criteria and statistics
The logic of Hypothesis Testing
[Figure: a population with mean µ and standard deviation σ, compared against a sample mean M]
Example A study on the effect of mild electrical brain
stimulation on mathematics skills IV: electric current DV: scores on a standard math test Construct validity Stimulation: electric current Math skills: test score
Step 1: Hypotheses Two hypotheses are needed here The null hypothesis The IV has no effect on the DV
The sample mean is “equal” to the population mean H0: µ with stimulation = 80
The alternative hypothesis The IV has an effect on the DV.
The sample mean is not “equal” to the population mean H1: µ with stimulation ≠ 80
The H0 and H1 are mutually exclusive and exhaustive Only one can be true One must be true
Step 2: Set the Criteria When we have a sample mean different from
the population mean, we often ask: how likely is the difference due to random error
rather than a systematic effect? For hypothesis testing, this question becomes: how likely are we to get this particular sample mean
if H0 is true? We need criteria for decision making.
The Alpha Level A value to separate the high-probability
samples from the low-probability samples Also called the level of significance Defines the very unlikely sample outcomes if
the null hypothesis is true Must be small: 5%, 1%, 0.1%
Its implication: the probability of making a mistaken claim is less than
the alpha level
Critical Region The region composed of extreme sample
values A value falling into the critical region is very
unlikely to occur if H0 is true If a sample mean falls into the region It is unlikely the sample is from the population The sample may come from a different population
with a different population mean The null hypothesis should be rejected
In Practice … We use z-scores to specify
the boundaries that define the critical region
We need the unit normal table to locate the z-scores
One tail or two tails? Depends on your null
hypothesis
For this example, two tails: z-score: +/- 1.96
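As a sketch, the ±1.96 boundary can be recovered from the alpha level with scipy's normal quantile function (scipy is an assumed dependency, not part of the slides):

```python
# Two-tailed critical z: split alpha between the two tails and look up
# the quantile of the standard normal distribution.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # upper boundary; the lower one is -z_crit
print(round(z_crit, 2))            # -> 1.96
```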
Step 3: Collect Data and Compute Statistics Data can be collected in different ways Experiments Surveys
Statistics computation Find the mean Find the corresponding z-score Our interest: sample means The standard error should be used
Step 4: Make a Decision Two possible outcomes Reject the null hypothesis Fail to reject the null hypothesis
Rejecting the null hypothesis The z-score for the sample mean is beyond the z-scores defining the critical region A big discrepancy between the sample and the null hypothesis
Unlikely to happen if the null hypothesis is true The alternative hypothesis is true The treatment has an impact
M = 92, σ = 20, sample size = 25: σ_M = ? z = ?
Step 4: Make a Decision (Cont.) Failing to reject the null hypothesis The sample does not fall in the critical region You cannot reject the null hypothesis
This does not mean the null hypothesis is true It may be false, but the study fails to prove it
M = 84, σ = 20, sample size = 25: z = ?
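Both worked examples above can be filled in with a minimal sketch, using only the slides' numbers (µ = 80 from the stimulation study):

```python
import math

mu, sigma, n = 80, 20, 25            # population parameters and sample size
sigma_m = sigma / math.sqrt(n)       # standard error of the mean: 20/5 = 4
for m in (92, 84):
    z = (m - mu) / sigma_m
    decision = "reject H0" if abs(z) > 1.96 else "fail to reject H0"
    print(m, z, decision)
# -> 92 3.0 reject H0
# -> 84 1.0 fail to reject H0
```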
Two outcomes: You have enough evidence to show the treatment has
an impact The evidence you gathered is not convincing; you cannot
prove the null hypothesis is wrong. All you can say is that your data fail to show the treatment has an impact. You cannot say the treatment has no effect.
Analogy: Hypothesis Test as Jury Trial
Null hypothesis The defendant is innocent until proven guilty.
H0: Defendant = not guilty H1: Defendant = guilty
Alpha level The jury must be convinced beyond a reasonable doubt before they believe
that the defendant is guilty The probability to wrongly convict the defendant
Critical region Sufficient evidence
Sample data Evidence presented by the prosecutors
Decision Reject the null hypothesis: the defendant is guilty
Sufficient evidence beyond a reasonable doubt (If this decision is wrong, the defendant was wrongfully convicted: the evidence was misleading)
Fail to reject the null hypothesis Fail to find the defendant guilty based on the given evidence The defendant could be guilty, but the evidence is not strong enough
Uncertainty and Errors Hypothesis testing is an
inferential process Its conclusion could be
correct or incorrect. Four different outcomes
Which Error is More Dangerous?
Two Types of Errors Type I errors Reject the null hypothesis that is actually true The treatment is claimed to have an effect
although it actually does not. Type II errors Fail to reject the null hypothesis that is actually
false The treatment indeed has an effect, but the study
fails to find it.
The Danger of Type I Errors Rejecting the null hypothesis is very tempting: it means a scientific discovery.
But a false discovery could be fatal. Research as theory building Research is built upon previous results and
findings, with the assumption that those results and findings are true. Standing upon the shoulders of giants
Type I errors may jeopardize the whole enterprise of scientific research What if Newton’s laws were wrong? What if the belief in the link between high
cholesterol and heart disease is wrong?
How to Prevent or Minimize Type I Errors
Revisit the critical region
The role of the alpha value To minimize the chance of Type I errors occurring A measure of the probability of a Type I error
Selecting the Alpha Level Sets the boundary of the critical region Measures the Type I error rate
A small alpha level Minimizes Type I errors But also demands more evidence from research,
or even makes it impossible to reject the null hypothesis
Trade-off Three common values: 5%, 1%, and 0.1%
Type II Errors Less severe Failing to find something significant Personally: you miss the chance to make a big
breakthrough Generally: science slows down a little bit
Example Population without stimulation µ = 80, σ = 20
Sample n = 25, M = 87
Solve It State the hypotheses Set the alpha value and the critical region Compute the sample statistics Make the decision Report the result The stimulation has a significant effect on math
skills (z = 1.75, p < .05) Significant: statistically significant You can reject the null hypothesis p-value: the probability of obtaining a result at least this extreme if H0 is true Usually reported in the form p < alpha
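The reported z can be checked in a few lines (scipy assumed available; note that 1.75 clears the one-tailed .05 cutoff of 1.645, which is what the reported p < .05 relies on):

```python
import math
from scipy.stats import norm

mu, sigma, n, m = 80, 20, 25, 87
sigma_m = sigma / math.sqrt(n)       # 4.0
z = (m - mu) / sigma_m               # (87 - 80) / 4 = 1.75
p_one_tailed = 1 - norm.cdf(z)       # probability of a z this large under H0
print(round(z, 2), round(p_one_tailed, 3))  # -> 1.75 0.04
```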
Weakest Link(s) in This Process What potential problems do you see that can
jeopardize the conclusion from the research?
Some Underlying Assumptions for Hypothesis Tests Random sampling Independent observations Observed data from subjects should be
independent The standard deviation is the same for the
sample and the population The treatment only affects the sample data by
adding a constant to each score The normal sampling distribution can be used to
analyze the problem Normally distributed scores, or a fairly large sample size
Recent Discussions on the p-value Statisticians Found One Thing They Can Agree On: It’s Time To Stop Misusing P-Values
http://fivethirtyeight.com/features/statisticians-found-one-thing-they-can-agree-on-its-time-to-stop-misusing-p-values/
Sifting the evidence—what's wrong with significance tests?
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1119478/
Effect Size and Power Revisit the formula We use the standard error to measure the distance
between a sample mean and the population mean Standard error Determined by sample size
A very large sample size makes a very small standard error Increases the chance that a sample mean falls into
the critical region Increasing the sample size does not necessarily bring a sample mean closer to
the population mean, but it makes it more likely that the null hypothesis is rejected
How big is a significant effect really?
Example µ = 50, σ = 10, M = 51
when n = 25: σ_M = ? z = ? Conclusion?
when n = 400: σ_M = ? z = ? Conclusion?
So, if the sample size is large enough, even a very small treatment effect can be found statistically significant. How important is such a treatment effect?
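The example above can be filled in with a short sketch (only the slides' numbers are used):

```python
import math

mu, sigma, m = 50, 10, 51
for n in (25, 400):
    sigma_m = sigma / math.sqrt(n)   # standard error shrinks as n grows
    z = (m - mu) / sigma_m
    print(n, sigma_m, z)
# -> 25 2.0 0.5    (not significant)
# -> 400 0.5 2.0   (significant at alpha = .05, two-tailed)
```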
Effect Size Helps you evaluate the impact of a
significant treatment effect Not just comparing the z-score, but also comparing
the mean difference in terms of the standard deviation
Cohen’s d A ratio of the mean difference to the standard
deviation Large, medium, and small effects
Cohen’s d Measures how far apart two different
populations are
Power For the same purpose, but to measure the
probability that the test will correctly reject the null hypothesis
Compares two normal distributions If two populations are claimed to be different,
what proportion of the distribution of sample means of one population falls into the critical region of the other?
Power and Effect Size Both are an indication of the strength or
magnitude of a treatment effect Power is influenced by many factors,
including the sample size Effect size is not.
Directional Hypothesis Test Two-tailed hypothesis Critical regions are located on both sides of
the distribution The null hypothesis: equal means
Often, we have a hypothesis about an increased or decreased mean
We then need a directional hypothesis test
Difference Between One-Tailed and Two-Tailed Hypothesis Tests Hypotheses
Two-tailed Η0 : µ with stimulation = 80 Η1: µ with stimulation ≠ 80
One-tailed Η0 : µ with stimulation ≤ 80 Η1: µ with stimulation > 80
Critical region Two-tailed: both sides
Dividing the alpha value by two and then finding z-scores One-tailed: one side
Finding the z-score based on the alpha value
The Problems of Using z How can we know the population standard
deviation? Very often, it is exactly what researchers are
pursuing. If we don’t know it, how can we do the
hypothesis test?
We use the t statistic rather than z!
t-Tests
Hypothesis Test with t-Statistics Procedures are similar to those using z-
statistics, except Finding the critical region from the t-distribution
table Using estimated standard errors to compute the
scores
t Statistics Use estimated standard errors to replace standard
errors Estimated standard errors are from sample statistics,
rather than population parameter
Sample variance (unbiased variance):
s² = SS / (n − 1) = SS / df
Estimated standard error:
s_M = √(s² / n)
t Statistics Use estimated standard errors to replace
standard errors Estimated standard errors are from sample
statistics, rather than population parameter.
Degree of freedom Critical region
Example Infants, even newborns, prefer to look at attractive
faces compared to less attractive faces (Slater et al., 1998). Subjects: infants from 1 to 6 days old IV: face in photo DV: time spent looking at a photo (in seconds) Method:
showing two photographs of women's faces (one significantly more attractive than the other)
for 20 seconds
M = 13 (attractive face), SS = 72, n = 9
Steps Hypothesis Η0 : µ attractive = 10 Η1: µ attractive ≠ 10
Critical region df = n - 1 = 8, critical t = ±2.306 (α = .05, two-tailed)
Steps Calculation
M=13, n=9, SS = 72 t = ?
Conclusion
t = (M − µ) / s_M
s² = SS / (n − 1) = SS / df
s_M = √(s² / n)
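Applying these formulas to the infant example (M = 13, SS = 72, n = 9, µ = 10 under H0) as a sketch:

```python
import math

mu, m, ss, n = 10, 13, 72, 9
df = n - 1
s2 = ss / df                  # sample variance: 72/8 = 9.0
s_m = math.sqrt(s2 / n)       # estimated standard error: sqrt(9/9) = 1.0
t = (m - mu) / s_m            # (13 - 10) / 1 = 3.0
print(t)  # -> 3.0, beyond the critical t(8) = ±2.306, so reject H0
```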
Effect size Cohen’s d
d = mean difference / sample standard deviation = (M − µ) / s
Magnitude of d | Evaluation of Effect Size
d = 0.2 | Small effect (mean difference: 0.2 SD)
d = 0.5 | Medium effect
d = 0.8 | Large effect
Effect size Percentage of Variance Explained
An easy way to calculate:
r² = variability accounted for / total variability = t² / (t² + df)
For this example: r² = 3² / (3² + 8) = 9/17 = .5294
(Directly from the variability: 81 / 153 = .5294)
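Both effect-size measures for the infant example, as a sketch:

```python
import math

mu, m, ss, n, t = 10, 13, 72, 9, 3.0
df = n - 1
s = math.sqrt(ss / df)        # sample standard deviation: 3.0
d = (m - mu) / s              # Cohen's d: 3/3 = 1.0, a large effect
r2 = t**2 / (t**2 + df)       # 9/17
print(d, round(r2, 4))        # -> 1.0 0.5294
```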
Confidence Intervals for Estimating μ
• Alternative technique for describing effect size
• Estimates μ from the sample mean (M)
• Based on the reasonable assumption that M should be “near” μ
• Based on the estimated standard error of the mean (sM)
Confidence Intervals for Estimating μ (continued)
• Every sample mean has a corresponding t: t = (M − µ) / s_M
• Rearranging the equation and solving for µ: µ = M ± t·s_M
Distribution with df = 8
For an 80% confidence interval, t = ±1.397:
µ = M ± t·s_M = 13 ± 1.397 × 1.00 = 13 ± 1.397
We are 80% confident that the average time spent looking at the attractive face is 13 seconds, with a margin of error of 1.397 seconds.
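The interval can be reproduced with scipy's t quantile function (scipy assumed available):

```python
from scipy.stats import t

m, s_m, df, conf = 13, 1.0, 8, 0.80
t_crit = t.ppf(1 - (1 - conf) / 2, df)    # ~1.397 for df = 8
lo, hi = m - t_crit * s_m, m + t_crit * s_m
print(round(lo, 3), round(hi, 3))         # -> 11.603 14.397
```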
Report t-Test Results The subjects averaged M = 13 seconds on the more attractive
face with SD = 3.0. Statistical analysis indicated that the time spent on the attractive face was significantly more than would be expected by chance, t(8) = 3.00, p < .05, r2 = 52.94%
How about this? Statistical analysis indicated that the time spent on the
attractive face was significantly more than would be expected by chance, t(8) = 3.00, p <.05, r2 = 52.94%. The subjects averaged M = 13 seconds on the plain side of the apparatus with SD = 3.00.
General Rule Report the descriptive statistics first. Mean, standard deviation, …
Present inferential statistics. z, t, F, …
How about One Tailed?
This t-Test Is Better, But It still requires knowledge of the
population mean. Often, we don’t know the population mean. In practice, we are not just inferring population
parameters based on samples, but also checking the mean difference between two populations based on two samples.
Two Different t-Tests Checking whether scores from two groups
are different Between-subjects design
Checking whether scores from different treatments are different Within-subjects design
t-Test for Two Independent Samples
Between-Subjects Design
The t-Test for Independent Measures Hypotheses The null hypothesis: no difference between the two
populations H0: µ1 = µ2
The alternative hypothesis: there is a mean difference H1: µ1 ≠ µ2
Set the Criteria The alpha value The critical region How to determine the df?
We have two samples Two dfs
The overall df is the sum of two dfs df = df1 + df2
Compute Statistics The t value
t = (sample mean − hypothesized population mean) / estimated standard error
For independent measures:
t = (sample mean difference − hypothesized population mean difference) / estimated standard error = (M1 − M2) / s(M1−M2)
(For a single sample: t = (M − µ) / s_M)
HOW TO COMPUTE S(M1-M2) ?
Estimated Standard Error Measure of standard or average distance
between sample statistic (M1-M2) and the population parameter
Unbiased only if n1 = n2
s(M1−M2) = √(s1²/n1 + s2²/n2)
Pooled Variance Instead, we use a pooled variance to replace
s1² and s2²
Pooled variance (sp² provides an unbiased basis for calculating the standard error):
sp² = (SS1 + SS2) / (df1 + df2)
df = df1 + df2 = (n1 − 1) + (n2 − 1)
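A sketch of the pooled-variance t computation; the sample values here are invented for illustration, not taken from the slides:

```python
import math

def pooled_t(m1, ss1, n1, m2, ss2, n2):
    """Independent-measures t, assuming the hypothesized mu1 - mu2 is 0."""
    df1, df2 = n1 - 1, n2 - 1
    sp2 = (ss1 + ss2) / (df1 + df2)          # pooled variance
    se = math.sqrt(sp2 / n1 + sp2 / n2)      # estimated standard error
    return (m1 - m2) / se, df1 + df2

t, df = pooled_t(m1=13, ss1=72, n1=9, m2=10, ss2=72, n2=9)  # invented data
print(round(t, 2), df)  # -> 2.12 16
```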
Make a Decision Rejecting the null hypothesis M1 − M2 ≠ 0 The mean difference between samples
represents the mean difference between populations
Not rejecting the null hypothesis No evidence to show the population means
are different
Example Impact of TV time on student academic
performance IV: Sesame Street DV: high school grades
Steps
df = df1 + df2 = (n1 − 1) + (n2 − 1) = 9 + 9 = 18
α = 0.01
Calculation
t = [(M1 − M2) − (µ1 − µ2)] / s(M1−M2)
where s(M1−M2) = √(sp²/n1 + sp²/n2)
Effect Size Cohen’s d Use sp
r²
r² = t² / (t² + df)
Assumptions for the independent-measures t-test Independent observations within each sample Two populations are normal Two populations have equal variances
Homogeneity of variance Hartley’s F-max test:
F-max = s²(largest) / s²(smallest)
Hartley’s F-max Test To test whether two
samples have equal variances The desired outcome: fail
to reject the null hypothesis
F-max table: α value, n, df
http://archive.bio.ed.ac.uk/jdeacon/statistics/table8.htm
F-max = s²(largest) / s²(smallest)
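The F-max statistic itself is just a ratio; a sketch with invented sample variances:

```python
# Hartley's F-max: largest sample variance over smallest.
variances = [9.0, 12.0, 16.0]    # invented variances, one per sample
f_max = max(variances) / min(variances)
print(round(f_max, 2))  # -> 1.78; compare with the tabled F-max critical value
```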
t-Test for Two Related Samples
Within-Subjects Design Subjects are compared with themselves Different conditions No groups to compare
Data in different treatments are actually related
To study the impact of treatments, we can still use t-test Slightly different from comparing independent
samples
The Interest in Related Samples The mean difference between two populations
The difference between scores in different conditions In related samples, we can put scores into pairs
based on subjects This is impossible with independent samples
The target: the difference score D = X2 − X1
Example “Swearing as a response to pain” Tolerance for pain when repeating a neutral vs. a swear word
Both conditions for each subject, in counterbalanced order: neutral then swearing vs. swearing then neutral
Procedures Hypotheses
H0: µD = 0 H1: µD ≠0
The alpha value and critical region 5% as usual
t statistic
Estimated standard error:
t = (M_D − µ_D) / s_MD
s² = SS / df
s_MD = √(s² / n)
Example
M_D = −2
SS = ΣD² − (ΣD)²/n = 68 − (−18)²/9 = 32
s² = SS/df = 32/8 = 4
s_MD = √(s²/n) = √(4/9) = 2/3
t = (M_D − µ_D) / s_MD = (−2 − 0) / (2/3) = −3.00
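The same arithmetic from raw difference scores; these nine D values are invented but chosen to match the slides' summary numbers (M_D = −2, SS = 32, n = 9):

```python
import math

d = [0, 0, 0, 0, -4, -4, -4, -4, -2]        # hypothetical difference scores
n = len(d)
md = sum(d) / n                             # -2.0
ss = sum(x**2 for x in d) - sum(d)**2 / n   # 68 - 36 = 32.0
s2 = ss / (n - 1)                           # 4.0
s_md = math.sqrt(s2 / n)                    # 2/3
t = (md - 0) / s_md                         # H0: mu_D = 0
print(round(t, 2))  # -> -3.0
```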
Effect Size Cohen’s d:
d = M_D / s
r²:
r² = t² / (t² + df)
Uses and Assumptions of the Repeated-Measures t-test Fewer subjects needed Reduces individual differences
But watch for order effects
Independent observations within each treatment. Normal distribution.
t-test for Unequal Variance Samples
s_M = √(s1²/n1 + s2²/n2)
df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)² / (n1 − 1) + (s2²/n2)² / (n2 − 1) ]
t = [(M1 − M2) − (µ1 − µ2)] / s_M
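A sketch of the unequal-variance computation (sample values invented for illustration):

```python
import math

def welch_t(m1, s2_1, n1, m2, s2_2, n2):
    """t and Welch-Satterthwaite df for unequal variances (H0: mu1 = mu2)."""
    v1, v2 = s2_1 / n1, s2_2 / n2
    se = math.sqrt(v1 + v2)                  # standard error without pooling
    t = (m1 - m2) / se
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

t, df = welch_t(m1=13, s2_1=9, n1=9, m2=10, s2_2=25, n2=16)  # invented data
print(round(t, 2), round(df, 1))  # -> 1.87 22.8
```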
t-tests in statistics software SPSS R: the t.test() function
Excel
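In Python, scipy offers the same family of tests as R's t.test() (scipy assumed available; the data below are invented):

```python
from scipy import stats

a = [12, 14, 11, 13, 15, 12, 14, 13, 13]   # invented scores
b = [10, 11, 9, 12, 10, 11, 10, 9, 11]

t1, p1 = stats.ttest_1samp(a, popmean=10)        # one-sample t-test
t2, p2 = stats.ttest_ind(a, b, equal_var=True)   # independent, pooled variance
t3, p3 = stats.ttest_ind(a, b, equal_var=False)  # Welch, unequal variances
t4, p4 = stats.ttest_rel(a, b)                   # related (paired) samples
print(t1 > 0, t2 > 0)  # -> True True
```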