hypothesis test - pennsylvania state university

Hypothesis Test

An Old Example Again

How about This? Students from University A rush to Fort Lauderdale.

We know the average SAT score of students from this university is 600, with a standard deviation of 15. We sampled 100 college students in Fort Lauderdale and

found that their average SAT score is 620. Are they from University A?
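As a rough check on this example, the sketch below converts the sample mean into a z-score using the standard error of the mean. It assumes SAT scores in the University A population have the stated mean and standard deviation, and the variable names are illustrative.

```python
from math import sqrt

mu, sigma = 600, 15          # population mean and standard deviation from the slide
n, sample_mean = 100, 620    # sample size and observed sample mean

standard_error = sigma / sqrt(n)             # 15 / 10 = 1.5
z = (sample_mean - mu) / standard_error      # 20 / 1.5

print(f"standard error = {standard_error:.2f}, z = {z:.2f}")
```

A sample mean more than 13 standard errors above 600 would be extremely unlikely if the students were a random sample from University A.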

Does one sample belong to a known population or to a totally different population? Two populations:

Totally different groups of people Same group of people before and after a particular treatment

Another Example: Does A Treatment Have An Effect?


Hypothesis Test What is hypothesis testing? Use sample statistics to evaluate a hypothesis

about a population Steps

1. State the hypotheses about the value of the population mean

2. Set the criteria for a decision

3. Collect data and compute sample statistics

4. Make a decision based on the criteria and statistics

The logic of Hypothesis Testing

[Figure: population with mean µ and standard deviation σ, and a sample with mean M]

Example A study on the effect of mild electrical brain

stimulation on mathematics skills IV: electric current DV: scores on a standard math test Construct validity Stimulation: electric current Math skills: test score

Step 1: Hypotheses Two hypotheses are needed here The null hypothesis The IV has no effect on the DV

The sample mean is “equal” to the population mean H0: µ with stimulation = 80

The alternative hypothesis The IV has an effect on the DV.

The sample mean is not “equal” to the population mean H1: µ with stimulation ≠ 80

The H0 and H1 are mutually exclusive and exhaustive Only one can be true One must be true

Step 2: Set the Criteria When we have a sample mean different from

the population mean, we often ask a question: How likely is the difference due to random error

rather than a systematic effect? For hypothesis testing, this question is: How likely are we to get this particular sample mean

if H0 is true? We need the criteria for decision making.

The Alpha Level A value to separate the high-probability

samples from the low probability samples Also called the level of significance Define the very unlikely sample outcomes if

the null hypothesis is true Must be small 5%, 1%, 0.1%

Its implication If H0 is true, the probability of making a mistaken claim (rejecting H0) is no greater than

the alpha level

Critical Region The region composed of extreme sample

values Value falling into the critical region is very

unlikely to occur if H0 is true If a sample mean falls into the region It is unlikely the sample is from the population. The sample may come from a different population

with a different population mean The null hypothesis should be rejected.

In Practice … We use z-scores to specify

the boundaries that define the critical region

We need the unit normal table to locate the z-scores

One tail or two tails Depends on your

hypotheses

For this example, two tails: z-score: +/- 1.96
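The boundary values can also be computed instead of read from the unit normal table. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `two_tailed_critical_z` is an illustrative assumption.

```python
from statistics import NormalDist

def two_tailed_critical_z(alpha):
    """Return the positive z boundary that leaves alpha / 2 in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# The three conventional alpha levels mentioned on the slides
for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: z = +/- {two_tailed_critical_z(alpha):.2f}")
```

For alpha = 0.05 this gives the +/- 1.96 boundaries used in the example.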

Step 3: Collect Data and Compute Statistics Data can be collected in different ways Experiments Surveys

Statistics computation Find the mean Find the corresponding z-score Our interest: sample means, so the standard error should be used

Step 4: Make a Decision Two possible outcomes Reject the null hypothesis Fail to reject the null hypothesis

Rejecting the null hypothesis The z-score for the sample mean is beyond the

z-scores defining the critical region Big discrepancy between the sample and the null hypothesis

Unlikely to happen if the null hypothesis is true The data support the alternative hypothesis The treatment appears to have an impact

M = 92, σ = 20, sample size = 25 σ_M = ? z = ?

Step 4: Make a Decision (Cont.) Failing to reject the null hypothesis The sample does not fall in the critical region You cannot reject the null hypothesis

This does not mean the null hypothesis is true It may be false, but the study fails to prove it

M = 84, σ = 20, sample size = 25 z = ?
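The two decision examples (sample means of 92 and 84) can be worked through as follows. This is a minimal sketch that assumes the null value µ = 80 from Step 1 and the two-tailed boundaries of +/- 1.96; the function name is illustrative.

```python
from math import sqrt

def z_for_sample_mean(sample_mean, mu, sigma, n):
    """z-score of a sample mean, using the standard error sigma / sqrt(n)."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - mu) / standard_error

CRITICAL_Z = 1.96  # two-tailed boundary for alpha = .05

for m in (92, 84):
    z = z_for_sample_mean(m, mu=80, sigma=20, n=25)
    decision = "reject H0" if abs(z) > CRITICAL_Z else "fail to reject H0"
    print(f"M = {m}: z = {z:.2f} -> {decision}")
```

With a standard error of 20 / 5 = 4, M = 92 gives z = 3.00 (inside the critical region) and M = 84 gives z = 1.00 (outside it).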

Two outcomes: You have enough evidence to show the treatment has

an impact Evidence you gathered is not convincing. You cannot

prove the null hypothesis is wrong. All you can say is that your data fail to show the treatment has an impact. You cannot say the treatment has no effect.

Analogy: Hypothesis Test as Jury Trial

Null hypothesis The defendant is innocent until proven guilty.

H0: Defendant = not guilty H1: Defendant = guilty

Alpha level The jury must be convinced beyond a reasonable doubt before they believe

that the defendant is guilty The probability to wrongly convict the defendant

Critical region Sufficient evidence

Sample data Evidence presented by the prosecutors

Decision Reject the null hypothesis: the defendant is guilty

Sufficient evidence beyond a reasonable doubt The defendant was wrongfully convicted. Evidence is wrong.

Fail to reject the null hypothesis Fail to find the defendant is guilty based on given evidence The defendant could be guilty, but evidence is not strong enough

Uncertainty and Errors Hypothesis testing is an

inferential process Its conclusion could be

correct or incorrect. Four different outcomes

Which Error is More Dangerous?

Two Types of Errors Type I errors Reject the null hypothesis that is actually true The treatment is claimed to have an effect

although it actually does not. Type II errors Fail to reject the null hypothesis that is actually

false The treatment indeed has an effect, but the study

fails to find it.

The Danger of Type I Errors Rejecting the null hypothesis is very tempting. It means a scientific discovery.

But a false discovery could be fatal. Research as building unit theory Research is built upon previous results and

findings with an assumption that those results and findings are true. Standing upon the shoulders of giants

Type I errors may jeopardize the whole enterprise of scientific research What if Newton’s laws were wrong? What if the belief in the link between high

cholesterol and heart disease is wrong?

How to Prevent or Minimize Type I Errors

Revisit the critical region

The role of the alpha value To minimize the chance of a Type I error occurring The measure of the probability of a Type I error

Selecting the Alpha Level Sets the boundary of the critical region Measures the risk of a Type I error

A small alpha level Minimizes the risk of a Type I error But also demands more evidence from research

or even makes it impossible to reject the null hypothesis

Trade-off Three values 5%, 1%, and 0.1%

Type II Errors Less severe Fail to find something significant Personally, miss the chance to make a big

breakthrough Generally, science slows down a little bit

Example Population without stimulation µ = 80, σ = 20

Sample n = 25, M = 87

Solve It State the hypotheses Set the alpha value and the critical region Compute the sample statistics Make the decision Report the result The stimulation has a significant effect on math

skills (z = 1.75, p < .05, one-tailed) Significant: statistically significant You can reject the null hypothesis Note: z = 1.75 exceeds the one-tailed boundary of 1.65 but not the two-tailed boundary of 1.96 p-value: the probability of obtaining a result at least this extreme if the null hypothesis is true Usually reported in the form p < alpha
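To see where the reported z and p come from, the sketch below recomputes them for this example. It uses the standard-library normal distribution, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n, sample_mean = 80, 20, 25, 87

z = (sample_mean - mu) / (sigma / sqrt(n))   # 7 / 4 = 1.75

upper_tail = 1 - NormalDist().cdf(z)   # area beyond z in the upper tail
p_one_tailed = upper_tail
p_two_tailed = 2 * upper_tail          # both tails, for a non-directional test

print(f"z = {z:.2f}")
print(f"one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
```

The one-tailed probability falls below .05 while the two-tailed probability does not, so whether this result counts as significant depends on which test was planned before the data were collected.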

Weakest Link(s) in This Process What potential problems do you see that can

jeopardize the conclusion from the research?

Some Underlying Assumptions for Hypothesis Tests Random sampling Independent observation Observed data from subjects should be

independent The standard deviation is the same for the

sample and the population The treatment only affects the sample data by

adding a constant to each score Normal sampling distribution can be used to

analyze the problem Normally distributed scores, or A fairly large sample size

Recent Discussions on p value Statisticians Found One Thing They Can Agree On: It’s Time To Stop Misusing P-Values http://fivethirtyeight.com/features/statisticians-found-one-thing-they-can-agree-on-its-time-to-stop-misusing-p-values/

Sifting the evidence - what's wrong with significance tests? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1119478/

Effect Size and Power Revisit the formula We use the standard error to measure the distance

between a sample mean and the population mean Standard error Determined by the standard deviation and the sample size

A very large sample size makes the standard error very small Increases the chance that a sample mean falls into

the critical region Increasing the sample size tends to bring the sample mean closer to

the population mean, but this is not guaranteed The null hypothesis becomes more likely to be rejected

How big is a significant effect really?

Example µ = 50, σ = 10, M = 51

when n = 25: σ_M = ? z = ? Conclusion: ?

when n = 400: σ_M = ? z = ? Conclusion: ?

So, if the sample size is large enough, even a very small treatment effect can be found statistically significant. How important is such a treatment effect?
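The effect of sample size in this example can be checked numerically. The sketch below assumes a two-tailed test with alpha = .05 (boundaries of +/- 1.96); the `results` dictionary is an illustrative name.

```python
from math import sqrt

mu, sigma, sample_mean = 50, 10, 51
CRITICAL_Z = 1.96  # two-tailed, alpha = .05 (assumed for this illustration)

results = {}
for n in (25, 400):
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu) / standard_error
    results[n] = (standard_error, z)
    decision = "reject H0" if abs(z) > CRITICAL_Z else "fail to reject H0"
    print(f"n = {n}: standard error = {standard_error:.2f}, z = {z:.2f} -> {decision}")
```

The same 1-point difference gives z = 0.50 with n = 25 but z = 2.00 with n = 400, because the standard error shrinks from 2 to 0.5.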

Effect Size Help you to evaluate the impact of a

significant treatment effect Not just compare the z-score, but also compare

the mean difference in terms of the standard deviation

Cohen’s d A ratio of the mean difference to the standard

deviation Large, medium and small

Cohen’s d Measures how far away two different

populations are separated
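As a small illustration, Cohen's d for the earlier example (µ = 50, σ = 10, M = 51) can be computed as below. The function name `cohens_d` is an assumption made for this sketch.

```python
def cohens_d(mean_difference, standard_deviation):
    """Ratio of the mean difference to the standard deviation."""
    return mean_difference / standard_deviation

d = cohens_d(51 - 50, 10)
print(f"d = {d:.2f}")
```

The sample size does not appear in the calculation, so d stays at 0.10 whether n is 25 or 400, even though the z-score changes.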

Power For the same purpose, but to measure the

probability that the test will correctly reject the null hypothesis

Compares two normal distributions If two populations are claimed to be different,

what is proportion of the distribution of sample means of one population that fall into the critical region of the other?

Power and Effect Size They are both an indication of the strength or

magnitude of a treatment effect Power is influenced by many factors,

including the sample size Effect size is not.
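One way to make the power idea concrete is to compute it for a z-test. The sketch below assumes a hypothetical treated population mean of 88 (not a value from the slides) against the null value of 80 with σ = 20, and shows how power changes with sample size.

```python
from math import sqrt
from statistics import NormalDist

def power_two_tailed_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Probability that a two-tailed z-test rejects H0 when the true mean is mu_true."""
    standard_error = sigma / sqrt(n)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Distribution of sample means if the treatment shifts the mean to mu_true
    treated = NormalDist(mu_true, standard_error)
    upper = 1 - treated.cdf(mu0 + z_critical * standard_error)
    lower = treated.cdf(mu0 - z_critical * standard_error)
    return upper + lower

# mu_true = 88 is a hypothetical treatment effect chosen for illustration
power_n25 = power_two_tailed_z(mu0=80, mu_true=88, sigma=20, n=25)
power_n100 = power_two_tailed_z(mu0=80, mu_true=88, sigma=20, n=100)
print(f"power with n = 25: {power_n25:.2f}, with n = 100: {power_n100:.2f}")
```

The effect size is the same in both runs (d = 8 / 20 = 0.4), yet power rises from roughly one half to well above 0.9 as n grows from 25 to 100.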

Directional Hypothesis Test Two-tailed hypothesis test Critical regions are located on both sides of

the distribution The null hypothesis: equal mean

Often, we have a hypothesis about an increased or decreased mean

We will need a directional hypothesis test

Difference Between One-Tailed and Two Tailed Hypothesis Tests Hypotheses

Two-tailed H0: µ with stimulation = 80 H1: µ with stimulation ≠ 80

One-tailed H0: µ with stimulation ≤ 80 H1: µ with stimulation > 80

Critical region Two-tailed: both sides

Dividing the alpha value by two and then finding z-scores One-tailed: one side

Finding the z-score based on the alpha value

The Problems of Using z How can we know the population standard

deviation? Very often, population parameters are exactly what researchers are

pursuing. If we don’t know it, how can we do the

hypothesis test?

We use the t statistic rather than z!

t-Tests

Hypothesis Test with t-Statistics Procedures are similar to those using z-

statistics, except Finding the critical region from the t-distribution

table Using estimated standard errors to compute the

scores

t Statistics Use estimated standard errors to replace standard

errors Estimated standard errors are computed from sample statistics,

rather than population parameters

Sample variance (unbiased variance)

$$s^2 = \frac{SS}{n - 1} = \frac{SS}{df}$$

Estimated standard error

$$s_M = \sqrt{\frac{s^2}{n}}$$

t Statistics (Cont.) Degrees of freedom: df = n − 1 Critical region: located from the t-distribution table using df and the alpha level

Example Infants, even newborns, prefer to look at attractive

faces compared to less attractive faces (Slater, et al., 1998). Subjects: infants from 1 to 6 days old IV: face in photo DV: time spent looking at a photo (in seconds) Method:

showing two photographs of women's faces (one significantly more attractive than the other)

20 seconds of looking time in total across both photos (no preference implies 10 seconds each)

M = 13 (attractive face), SS = 72, n = 9

Steps Hypothesis H0: µ attractive = 10 H1: µ attractive ≠ 10

Critical region df = n - 1 = 8

Steps Calculation

M = 13, n = 9, SS = 72, t = ?

Conclusion: ?

$$s^2 = \frac{SS}{n - 1} = \frac{SS}{df}$$

$$s_M = \sqrt{\frac{s^2}{n}}$$

$$t = \frac{M - \mu}{s_M}$$
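The calculation for this example can be traced step by step as below. This is a sketch rather than a general-purpose routine; the critical value 2.306 is the two-tailed t-table entry for df = 8 at alpha = .05, hard-coded here as an assumption about the alpha level used.

```python
from math import sqrt

M, n, SS, mu = 13, 9, 72, 10   # values from the infant example; mu is the null value

df = n - 1                 # 8
s2 = SS / df               # sample variance: 72 / 8 = 9
s_M = sqrt(s2 / n)         # estimated standard error: sqrt(9 / 9) = 1
t = (M - mu) / s_M         # (13 - 10) / 1 = 3

T_CRITICAL = 2.306         # t table, df = 8, two-tailed alpha = .05
decision = "reject H0" if abs(t) > T_CRITICAL else "fail to reject H0"
print(f"s^2 = {s2:.2f}, s_M = {s_M:.2f}, t({df}) = {t:.2f} -> {decision}")
```

Since t = 3.00 is beyond 2.306, the sample mean falls in the critical region.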

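Effect size Cohen’s d

$$d = \frac{\text{mean difference}}{\text{sample standard deviation}} = \frac{M - \mu}{s}$$

Magnitude of d    Evaluation of Effect Size
d = 0.2           Small effect (mean difference around 0.2 standard deviation)
d = 0.5           Medium effect (mean difference around 0.5 standard deviation)
d = 0.8           Large effect (mean difference around 0.8 standard deviation)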

Effect size Percentage of Variance Explained

$$r^2 = \frac{\text{variability accounted for}}{\text{total variability}} = \frac{81}{153} = 0.5294$$

An easy way to calculate

$$r^2 = \frac{t^2}{t^2 + df} = \frac{9}{9 + 8} = 0.5294$$
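Both effect-size measures for the infant example can be reproduced from the summary statistics. In the sketch below, the "variability accounted for" is taken as n(M − µ)², which matches the 81 shown above; the variable names are illustrative.

```python
from math import sqrt

M, mu, SS, n = 13, 10, 72, 9
df = n - 1

s = sqrt(SS / df)                  # sample standard deviation: 3
d = (M - mu) / s                   # Cohen's d: 3 / 3 = 1
t = (M - mu) / (s / sqrt(n))       # t statistic: 3

# Shortcut formula based on t and df
r2_from_t = t**2 / (t**2 + df)     # 9 / 17

# Direct ratio: variability around the null value that the mean shift explains
accounted_for = n * (M - mu) ** 2  # 81
total = SS + accounted_for         # 153
r2_direct = accounted_for / total

print(f"d = {d:.2f}, r^2 = {r2_from_t:.4f} (shortcut), {r2_direct:.4f} (direct)")
```

The two routes to r² agree, and d = 1.00 is well above the 0.8 benchmark for a large effect.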

Confidence Intervals for Estimating μ

• Alternative technique for describing effect size

• Estimates μ from the sample mean (M)

• Based on the reasonable assumption that M should be “near” μ

• Based on the estimated standard error of the mean (s_M)

Confidence Intervals for Estimating μ (continued)

• Every sample mean has a corresponding t:

$$t = \frac{M - \mu}{s_M}$$

• Rearrange the equation, solving for μ:

$$\mu = M \pm t\,s_M$$

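Distribution with df = 8

$$\mu = M \pm t\,s_M = 13 \pm 1.397 \times 1.00 = 13 \pm 1.397$$

We are 80% confident that the average time spent looking at the more attractive face is 13 seconds, with a margin of error of 1.397 seconds (that is, between 11.603 and 14.397 seconds).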
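The interval above follows directly from the rearranged formula. The sketch below hard-codes t = 1.397, the table value used on the slide for df = 8 and 80% confidence; the variable names are illustrative.

```python
M = 13        # sample mean
s_M = 1.00    # estimated standard error from the example
t_80 = 1.397  # t-table value for df = 8 with 80% confidence (10% in each tail)

margin = t_80 * s_M
lower, upper = M - margin, M + margin
print(f"80% confidence interval: {lower:.3f} to {upper:.3f}")
```

A higher confidence level would use a larger t value from the same table row and therefore give a wider interval.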

Report t-Test Results The subjects averaged M = 13 seconds on the more attractive

face with SD = 3.00. Statistical analysis indicated that the time spent on the attractive face was significantly more than would be expected by chance, t(8) = 3.00, p < .05, r² = 0.5294.

How about this? Statistical analysis indicated that the time spent on the

attractive face was significantly more than would be expected by chance, t(8) = 3.00, p < .05, r² = 0.5294. The subjects averaged M = 13 seconds on the more attractive face with SD = 3.00.

General Rule Report the descriptive statistics first. Mean, standard deviation, …

Present inferential statistics. z, t, F, …

How about One Tailed?

This t-Test Is Better, But It still requires the knowledge of the

population mean. Often, we don’t know the population mean. In practice, not just inferring population

parameters based on samples, but also checking the mean difference between two populations based on two samples.

Two Different t-Tests Checking whether scores from two groups

are different? Between-subjects design

Checking whether scores from different treatments are different Within-subjects design

t-Test for Two Independent Samples

Between-Subjects Design

The t-Test for Independent Measures Hypotheses The null hypothesis: no difference between two

populations H0: µ1 = µ2

The alternative hypothesis: there is a mean difference H1: µ1 ≠ µ2

Set the Criteria The alpha value The critical region How to determine the df?

We have two samples Two dfs

The overall df is the sum of two dfs df = df1 + df2

Compute Statistics The t value

$$t = \frac{\text{sample mean} - \text{hypothesized population mean}}{\text{estimated standard error}} = \frac{M - \mu}{s_M}$$

For independent measures

$$t = \frac{\text{sample mean difference} - \text{hypothesized population mean difference}}{\text{estimated standard error}} = \frac{(M_1 - M_2) - (\mu_1 - \mu_2)}{s_{(M_1 - M_2)}}$$

Under H0 (µ1 − µ2 = 0), this reduces to

$$t = \frac{\text{sample mean difference}}{\text{estimated standard error}} = \frac{M_1 - M_2}{s_{(M_1 - M_2)}}$$

HOW TO COMPUTE S(M1-M2) ?

Estimated Standard Error Measure of standard or average distance

between the sample statistic (M1 − M2) and the population parameter (µ1 − µ2)

$$s_{(M_1 - M_2)} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Unbiased only if n1 = n2

Pooled Variance Instead, we use a pooled variance to replace $s_1^2$ and $s_2^2$

The pooled variance $s_p^2$ provides an unbiased basis for calculating the standard error

$$s_p^2 = \frac{SS_1 + SS_2}{df_1 + df_2}$$

$$df = df_1 + df_2 = (n_1 - 1) + (n_2 - 1)$$

Make a Decision Rejecting the null hypothesis µ1 − µ2 ≠ 0 The mean difference between samples

reflects a real mean difference between populations

Not rejecting the null hypothesis Not enough evidence to show that the population means

are different

Example Impact of TV time on students' academic

performance IV: watching Sesame Street DV: high school grades

Steps

df = df1 + df2 = (n1 − 1) + (n2 − 1) = 9 + 9 = 18

α = 0.01

Calculation

$$t = \frac{(M_1 - M_2) - (\mu_1 - \mu_2)}{s_{(M_1 - M_2)}}$$

$$s_{(M_1 - M_2)} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$$

$$s_p^2 = \frac{SS_1 + SS_2}{df_1 + df_2}$$
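The independent-measures calculation can be sketched as a small function. The summary values passed in below (means of 93 and 85, SS of 200 and 160) are illustrative assumptions rather than data from the slides; only n = 10 per group and alpha = .01 follow from the df and alpha stated above.

```python
from math import sqrt

def independent_t(m1, ss1, n1, m2, ss2, n2):
    """Pooled-variance t statistic for two independent samples (H0: mu1 - mu2 = 0)."""
    df1, df2 = n1 - 1, n2 - 1
    pooled_variance = (ss1 + ss2) / (df1 + df2)
    standard_error = sqrt(pooled_variance / n1 + pooled_variance / n2)
    t = (m1 - m2) / standard_error
    return t, df1 + df2, pooled_variance, standard_error

# Hypothetical summary statistics, chosen only to illustrate the arithmetic
t, df, sp2, se = independent_t(m1=93, ss1=200, n1=10, m2=85, ss2=160, n2=10)

T_CRITICAL = 2.878  # t table, df = 18, two-tailed alpha = .01
print(f"pooled variance = {sp2:.2f}, standard error = {se:.2f}, t({df}) = {t:.2f}")
print("reject H0" if abs(t) > T_CRITICAL else "fail to reject H0")
```

With these made-up numbers the pooled variance is 360 / 18 = 20, the standard error is 2, and t(18) = 4.00.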

Effect Size Cohen’s d: use $s_p$ as the standard deviation

$$d = \frac{M_1 - M_2}{\sqrt{s_p^2}}$$

$r^2$

$$r^2 = \frac{t^2}{t^2 + df}$$

Assumptions for independent-measures t-test Independent observations within each sample The two populations are normal. The two populations have equal variances.

Homogeneity of variance Hartley’s Fmax test:

$$F_{\max} = \frac{s^2(\text{largest})}{s^2(\text{smallest})}$$

Hartley’s F-max Test To test whether the two samples come from populations with equal variances. The desired outcome: fail to reject the null hypothesis

F-max table: α value, n, df http://archive.bio.ed.ac.uk/jdeacon/statistics/table8.htm
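The F-max statistic itself is just a ratio of sample variances. The sketch below uses hypothetical variances of 12 and 8, and the function name `f_max` is an assumption; the critical value still has to be looked up in the F-max table.

```python
def f_max(sample_variances):
    """Hartley's F-max: largest sample variance divided by the smallest."""
    return max(sample_variances) / min(sample_variances)

# Hypothetical sample variances for two groups
f = f_max([12.0, 8.0])
print(f"F-max = {f:.2f}")
```

A value near 1.00 suggests similar variances; the homogeneity assumption is questioned only when F-max exceeds the tabled critical value.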

t-Test for Two Related Samples

Within-Subjects Design Subjects are compared with themselves Different conditions No groups to compare

Data in different treatments are actually related

To study the impact of treatments, we can still use t-test Slightly different from comparing independent

samples

The Interest in Related Samples The mean difference between two populations

The difference between scores in different samples In related samples, we can put scores into pairs

based on subjects This is impossible in independent samples

The target: the score difference D = X2-X1

Example “Swearing as a response to pain” Pain tolerance when saying a neutral word vs. a swear word

Both conditions for each subject, in either order: neutral then swearing (N S) vs. swearing then neutral (S N)

Procedures Hypotheses

H0: µD = 0 H1: µD ≠0

The alpha value and critical region 5% as usual

t statistic

$$t = \frac{M_D - \mu_D}{s_{M_D}}$$

Estimated standard error

$$s^2 = \frac{SS}{df}$$

$$s_{M_D} = \sqrt{\frac{s^2}{n}}$$

Example

$M_D = -2$

$SS = \Sigma D^2 - \frac{(\Sigma D)^2}{n} = 68 - \frac{(-18)^2}{9} = 68 - 36 = 32$

$s^2 = \frac{SS}{df} = \frac{32}{8} = 4$

$s_{M_D} = \sqrt{\frac{s^2}{n}} = \sqrt{\frac{4}{9}} = \frac{2}{3}$

$t = \frac{M_D - \mu_D}{s_{M_D}} = \frac{-2 - 0}{2/3} = -3.00$
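The related-samples calculation can be reproduced from the sums of the difference scores. In the sketch below, n = 9 and ΣD = −18 are inferred from df = 8 and M_D = −2 rather than stated directly on the slide, and the variable names are illustrative.

```python
from math import sqrt

n = 9          # number of subjects (inferred from df = 8)
sum_D = -18    # sum of difference scores (inferred from M_D = -2 with n = 9)
sum_D2 = 68    # sum of squared difference scores, from the slide

M_D = sum_D / n                  # mean difference: -2
SS = sum_D2 - sum_D**2 / n       # 68 - 324 / 9 = 32
s2 = SS / (n - 1)                # variance of the differences: 32 / 8 = 4
s_MD = sqrt(s2 / n)              # estimated standard error: sqrt(4 / 9) = 2/3
t = (M_D - 0) / s_MD             # H0: mu_D = 0

print(f"M_D = {M_D:.2f}, SS = {SS:.2f}, s_MD = {s_MD:.3f}, t({n - 1}) = {t:.2f}")
```

The result, t(8) = −3.00, lies beyond the two-tailed boundaries of ±2.306 for alpha = .05.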

Effect Size Cohen’s d

$$d = \frac{M_D}{s}$$

$r^2$

$$r^2 = \frac{t^2}{t^2 + df}$$

Uses and Assumptions of Repeated Measures t-test Fewer subjects Reduce individual differences

Order effects

Independent observations within each treatment. Normal distribution.

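t-test for Unequal Variance Samples

$$s_M = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$$

$$t = \frac{M - \mu}{s_M}$$

where $M = M_1 - M_2$ is the sample mean difference and $\mu = \mu_1 - \mu_2$ is the hypothesized population mean difference.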
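The unequal-variance formulas can be sketched as a short function. The means, variances, and sample sizes passed in below are hypothetical values chosen for easy arithmetic, and the function name `unequal_variance_t` is an assumption.

```python
from math import sqrt

def unequal_variance_t(m1, var1, n1, m2, var2, n2):
    """t statistic and adjusted df for two samples without pooling the variances."""
    a, b = var1 / n1, var2 / n2          # s1^2 / n1 and s2^2 / n2
    standard_error = sqrt(a + b)
    t = (m1 - m2) / standard_error       # H0: mu1 - mu2 = 0
    df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics for illustration only
t, df = unequal_variance_t(m1=20, var1=16, n1=8, m2=15, var2=36, n2=12)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The adjusted df is usually not a whole number and never exceeds n1 + n2 − 2, so the critical value is read from the t table at the nearest lower df or computed by software.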

t-test in statistics software SPSS R t.test() function

Excel
