lecture 7 inferential statistics: hypothesis testing

Lecture 7

Inferential Statistics: Hypothesis Testing

Preview

Jason decides to sue the city after he was involved in a near-fatal collision at Division St. and Trailridge Dr. Jason is claiming that his accident could have been prevented if the city had placed a stoplight at the intersection. It turns out that this intersection has had an abnormally high number of accidents in the past 5 years. The city is arguing that the number of accidents at this particular intersection is not abnormal and that there are no more accidents at this particular intersection than at others in town.

What does Jason’s story have to do with statistics?

The primary application of inferential stats is to help researchers interpret their data.
(1) Are the differences in the data due to chance?
(2) Are the differences in the data due to something more than chance?

In the example with Jason:
(1) Is there an increased number of accidents due to external factors (the lack of a stoplight, not chance)?
(2) Is the number of accidents at that particular intersection no more than would be expected by chance?

Hypothesis testing is a statistical procedure that allows researchers to use sample data to draw inferences about the population of interest.
– Is the mean of my observed sample consistent with the known population mean, or did it come from some other distribution?

To Start…

Is the mean of my observed sample consistent with the known population mean or did it come from some other distribution?

We are given the following problem:
– There exists a sample of cars (some kind)
– They get a mean of 19 MPG
– Are they midsize cars? (We can’t go look at them.)

We know:
– A midsize car gets 18 MPG
– Is 19 different enough from 18 to fall outside this distribution?
– Or is it part of some other distribution?

Example

Here’s what we know:
– μ = 18
– M = 19
– σ_M = 0.4

z = (M - μ) / σ_M

z = (19 - 18) / 0.4

z = +2.5

p = .0062 or PR = .62%
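The same arithmetic can be checked in code. This is only a minimal sketch: the function names `z_score` and `upper_tail_p` are made up for illustration, and it assumes Python 3.8+ so that `statistics.NormalDist` can stand in for the unit normal table.

```python
from statistics import NormalDist

def z_score(sample_mean, pop_mean, standard_error):
    """How many standard errors the sample mean lies from the population mean."""
    return (sample_mean - pop_mean) / standard_error

def upper_tail_p(z):
    """Proportion of the standard normal curve that lies above z."""
    return 1.0 - NormalDist().cdf(z)

# Car example from the slide: M = 19, mu = 18, standard error = 0.4
z = z_score(19, 18, 0.4)
p = upper_tail_p(z)
print(f"z = {z:+.2f}, p = {p:.4f}")
```

The tail proportion is the chance of drawing a sample mean at least this far above 18 if the cars really are midsize cars.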

How do we decide: More than intuition

If the z-score falls outside the middle 95% of the curve, we conclude it must be from some other distribution (yesterday’s p < .05 convention in psychology).

Main assumption: We assume that weird, unusual, or rare things don’t happen.

If a score falls out into the extreme 5% range, we conclude that it “must be” from some other distribution. Less than 5% is rare enough.

Hypothesis Testing

Hypothesis Testing: a statistical procedure that allows researchers to use sample data to draw inferences about a population

– Use the concepts of:
• z-scores
• probability
• distribution of sample means

Basic Logic of Hypothesis Testing

(1) State a hypothesis about the population
– Hypothesis: prediction about the relationship between variables; how the IV affects the DV
– e.g. People who prefer Häagen-Dazs ice cream will have a mean IQ that is higher than average, at 130

(2) Use the hypothesis to predict the characteristics the sample should have

(3) Obtain a random sample
– Random sampling: when all potential observations in the population have equal chances of being selected
– RANDOM: Survey every 100th house from the list of all addresses in Tucson
– NOT RANDOM: Survey internet users about access to new technology in their schools

(4) Compare sample data with the hypothesis
– Using a statistical test (today we’ll continue to use z-tests, but keep in mind that other statistical tests can be used in a similar fashion)

[Figure: Known population before treatment (μ = 24, σ = 4) → Treatment → Unknown population after treatment (μ = ?, σ = 4)]

One basic assumption = If the treatment has any effect it is simply to add a constant amount to (or subtract a constant amount from) each individual’s score

No change in shape of distribution or standard deviation

Unknown pop. is just theoretical (we never administer a treatment to the entire pop.), but we do have a real sample that represents the pop., so this is what we use

Rules of Hypothesis Testing (1)

(1) STATE THE HYPOTHESIS about the unknown population mean.
– Null hypothesis: H0 = statement that the treatment has no effect; the IV has no effect on the DV.
– Alternative hypothesis: H1 = the treatment had an effect on the DV. The alternative hypothesis does not specify the direction of change. In some cases it might be useful to specify it (we’ll get to that).
– NOTE: the null and alternative hypotheses are mutually exclusive and exhaustive. They can’t both be true.

Example: on average the population remembers 7 words in a particular situation with a SD = 2. You test a new intervention designed to enhance memory with 10 participants and find that they remember an average of 9 words.

H0: μ new intervention = 7

H1: μ new intervention ≠ 7

Rules of Hypothesis Testing (2)

(2) SET THE CRITERIA for a decision
– Data will either support or refute the null hypothesis
– Distribution gets divided into 2 sections:
• Sample means that are likely if H0 is true
• Sample means that are very unlikely if H0 is true
– Must set the boundaries that separate the high-p samples from the low-p samples
• The level of significance or alpha level (α) defines the critical region
• Convention says α = .05 or 5%, but other commonly used alpha levels are .01 (1%) and .001 (0.1%)
• A z-score can mark the boundary set by alpha!

* Because the extreme 5% is split between 2 tails, there is 2.5% or .025 in each tail (2-tailed)
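The boundary z-scores for the common alpha levels are normally read off the unit normal table. As a rough sketch, they can also be computed directly. The helper name `critical_z` is hypothetical, and `statistics.NormalDist` (Python 3.8+) is assumed as a substitute for the table.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Positive z-score that cuts off the critical region for a given alpha."""
    # A two-tailed test splits alpha evenly between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1.0 - tail_area)

for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: critical z = +/-{critical_z(alpha):.2f}")
```

A sample mean whose z-score is more extreme than the boundary falls in the critical region.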

(3) COLLECT DATA and compute sample statistics
– Select a random sample from the population
– NOTE: it is important to collect the data after stating the hypothesis and establishing the criteria in order to make an objective evaluation of the data
– Compute a sample/test statistic (today we are illustrating hypothesis testing through a statistic we already know: z-scores)
• We don’t know μ, so we make a hypothesis about the value and then plug it in to evaluate our hypothesis

Rules of Hypothesis Testing (3)

Test statistic: sample data are converted to a single statistic that is used to test the hypothesis.

z = (M - μ) / σ_M, where σ_M = σ / √n

z = (9 - 7) / (2 / √10)

z = 2 / 0.63

z = 3.16

The z-score could also be expressed in words to fit into the context of hypothesis testing and inferential stats:

z = (sample mean - hypothesized population mean) / (standard error between M and μ)

In other words, z = obtained difference / difference expected by chance. If z is near 0, the obtained difference is no greater than chance, which is consistent with the null. If z is greater than 1, the difference is larger than expected by chance alone. But given rule #2 (the alpha level), we want an obtained difference that is roughly 2 to 3 times what chance would produce before we reject the null.

(4) MAKE A DECISION
– Use the sample statistic (z-score) calculated in step 3 to make a decision about the null hypothesis
• Reject the null hypothesis: sample data fall in the critical region. The data collected support the conclusion that the treatment really works.
• Fail to reject the null hypothesis: sample data do not fall in the critical region. The data collected are not convincing, so you conclude there is currently not enough evidence.



z = 3.16; that is beyond the boundary of z = +/- 1.96, so our data fall in the critical region. p < .05.

We reject the null hypothesis! Our memory enhancement technique works! We say our result is statistically significant.
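Steps 2 through 4 for the memory example can be sketched in a few lines. This is an illustration, not a standard routine: `z_test` and its parameter names are invented here, the test is assumed to be two-tailed, and `statistics.NormalDist` (Python 3.8+) replaces the unit normal table.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, n, pop_mean, pop_sd, alpha=0.05):
    """Two-tailed z-test. Returns (z, critical z, whether H0 is rejected)."""
    standard_error = pop_sd / sqrt(n)                # sigma_M = sigma / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error    # step 3: test statistic
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2)   # step 2: criterion
    return z, z_crit, abs(z) > z_crit                # step 4: decision

# Memory example: mu = 7, SD = 2, n = 10, M = 9
z, z_crit, reject = z_test(sample_mean=9, n=10, pop_mean=7, pop_sd=2)
print(f"z = {z:.2f}, critical z = +/-{z_crit:.2f}, reject H0: {reject}")
```

Note that the function takes alpha as an argument, mirroring the rule that the criterion is set before the data are evaluated.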

Results of Hypothesis Testing: Uncertainty and Error

Hypothesis testing is an inferential process, so it uses limited info (a sample) to reach a general conclusion about a population.

Support for H1 is indirect:
– We can’t prove the alternative hypothesis, we can only support it
– Easier to show that the null is false

Two types of errors:
– Type I
– Type II

If we reject the null hypothesis do we accept that the alternative hypothesis is true?
– Almost. If we reject the null, we have strong support that the alternative is true.

If we do not reject the null hypothesis do we “accept” that the null is true?
– NO!! There are lots of reasons for not rejecting the null hypothesis. If we fail to reject, we were only unable to find support for our alternative hypothesis.
• Often researchers run the experiment again, changing a few small elements in order to make their test more sensitive.

Results of Hypothesis Testing: Uncertainty

Type I Error: null hypothesis is true, but researcher rejects it.
– Probability of a Type I error is equal to alpha

Type I errors have serious implications
– It is likely that the researcher will report or publish these results. Other researchers may try to build theories or develop other experiments based on these false results.
– Fortunately, we structure the hypothesis test to make this relatively unlikely. And the researcher gets to choose alpha!!
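One way to see that the Type I error rate equals alpha is to simulate many experiments in which the null really is true and count how often it gets rejected anyway. The sketch below makes several assumptions that are not from the lecture: the function name, the number of simulated experiments, the fixed random seed, and the reuse of the memory example’s μ = 7, σ = 2, n = 10.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_i_error_rate(n_experiments=10_000, n=10, mu=7.0, sigma=2.0,
                      alpha=0.05, seed=0):
    """Fraction of true-null experiments in which a two-tailed z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2)
    standard_error = sigma / sqrt(n)
    false_rejections = 0
    for _ in range(n_experiments):
        # The sample comes from the null population itself: no treatment effect.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        z = (fmean(sample) - mu) / standard_error
        if abs(z) > z_crit:
            false_rejections += 1
    return false_rejections / n_experiments

rate = type_i_error_rate()
print(f"Proportion of false rejections: {rate:.3f}")
```

The proportion should land close to the chosen alpha, which is why lowering alpha makes Type I errors rarer.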

Results of Hypothesis Testing: ERROR


Turns out that by chance we selected an extreme sample. Our sample has an average IQ of 130, so they are “smarter” than average. And prior research has shown that IQ is correlated with working memory, such that individuals with a higher IQ have a higher working memory. Our results are simply due to this confound of IQ.

Type II Error: null hypothesis is false, but researcher fails to reject it.
– Often happens when the treatment effect is small OR the variance is big
– Impossible to determine a single exact probability value for a Type II error. It depends on multiple factors.
– Represented by the Greek letter beta, β

Consequences:
– Not as serious as Type I
– Only means that one particular experiment does not show evidence for the alternative hypothesis. 2 choices:
• Accept this outcome and assume the effect is not worth pursuing
• Repeat the experiment with improvements

Results of Hypothesis Testing: ERROR

Possible Outcomes of a Statistical Decision

Experimenter’s Decision    Data: H0 True                 Data: H0 False
Reject H0                  Type I error (false alarm)    correct (1 - β)
Fail to reject H0          correct (1 - α)               Type II error (miss)

Let’s think about what happens as a result of our decision

What if we were looking to see if an individual were guilty of a crime?
– Null hypothesis = the person is innocent; there is no crime
– Type I error: rejecting the null when it is true
• We send an innocent person to prison (false alarm)
– Type II error: not rejecting a false null hypothesis
• We set a guilty person free (miss)

Let’s Try One

Is the mean of my observed sample consistent with the known population mean or did it come from some other distribution?

• We are in a sci-fi film
• There is a sample of beings (n = 5).
• On average they are 8 feet tall.
• Are they humans or nonhuman?
• We know that humans in sci-fi movies average 5.5 feet tall with a SD = 1.5.
• Is 8 different enough from 5.5 to be in some other distribution?

Hypothesis Testing

(1) State hypotheses.

(2) Set criterion

(3) Collect data (done) and compute a test statistic

(4) Make a decision

Let’s Do One: Evidence

(1) H0: μ new beings = 5.5
H1: μ new beings ≠ 5.5

(2) Set criterion: α = .05, so .05/2 = .025 in each tail (2-tailed). Critical z = +/- 1.96.

(3) Test statistic

z = (M - μ) / σ_M

z = (8 - 5.5) / (1.5 / √5)

z = 2.5 / .67

z = 3.73

(4) Decision: reject the null. The beings are not humans.

In the Literature

Findings are said to be significant or statistically significant
– The height of the beings in the film is significantly different from the height of human beings, z = 3.73, p < .05
– There was no evidence that the height of the beings in the film was different from the height of human beings, z = .89, p > .05

APA style dictates that no 0 should precede the decimal point

When using a statistical program, report the exact p value, e.g. z = 2.45, p = .0142

Scientific papers don’t report null or alternative hypotheses, but they are an important logical part of hypothesis testing
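The reporting conventions above (exact p value, no leading zero) are easy to apply in code. The following is a small sketch with invented helper names, and it assumes a two-tailed test. The computed p can differ in the last digit from a value built out of rounded table entries.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Exact two-tailed p value for a z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def apa_p(p, digits=4):
    """Format p without a leading zero, e.g. 0.05 becomes '.0500'."""
    return f"{p:.{digits}f}".lstrip("0")

z = 2.45
print(f"z = {z:.2f}, p = {apa_p(two_tailed_p(z))}")
```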

Try One…

A psychologist examined the effect of chronic alcohol abuse on memory. In this experiment a standardized memory test was used. Scores on this test for the general population form a normal distribution with μ = 50 and σ = 6. A sample of n = 22 alcohol abusers has a mean score of M = 47. Is there evidence for memory impairment among alcoholics? Use a criterion of alpha = .01.

So far, 2-tailed or non-directional hypothesis tests (most widely accepted procedure for hypothesis testing)
– Two-tailed or nondirectional test: regions of rejection are located in both tails of the distribution; alpha is divided

1-tailed or directional hypothesis tests:
– One-tailed or directional test: region of rejection is located in just one tail of the distribution; alpha is not divided
– The statistical hypotheses (H0 and H1) specify either an increase or decrease in the population mean score
– Researcher must begin with a specific prediction about the direction of the treatment effect a priori!!

Directional Hypothesis Tests

Hypothesis for Directional Tests

STATE THE HYPOTHESIS
– Still about the unknown population mean
– Null hypothesis = H0 = no effect (or an effect in the opposite direction)
– Alternative hypothesis = H1 = effect in a particular direction

Example: A therapist is trying to find the best way to treat depression. She decides that meditation will boost mood in depressed people. The average mood score for depressed people on a standardized test for depression is 25 with a SD = 5. The therapist predicts that meditation will boost the mood score. She takes a sample of 10 depressed people and teaches them to practice meditation. The mean mood score for this sample is 28.

H0: μ meditation ≤ 25

H1: μ meditation > 25

Critical Region for Directional Tests

Critical region is located entirely in one tail of the distribution.
– Good because…more sensitive to finding an effect if the predicted direction is correct.
– Bad because…completely unable to find an effect if your predicted direction is wrong.

* Critical z is smaller (1.65 instead of 1.96 at α = .05) because we don’t have to divide our proportion (.05) in half to account for both tails

Test Statistic and Decision for Directional Hypothesis Testing

Steps 3 and 4 of hypothesis testing are the same in directional tests. But…we still need to finish our example.

Example (continued): μ = 25, σ = 5, n = 10, M = 28.

z = (28 - 25) / (5 / √10)

z = 3 / 1.58

z = 1.90

We reject the null hypothesis (z = 1.90 is beyond the one-tailed critical z = +1.65). Meditation helps depressed people. Please note that if we had decided to do a 2-tailed test (critical z = +/- 1.96), we would not have been able to reject the null. This is why we decide a priori…
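The way the decision depends on the choice of tails can be made concrete with a short sketch. The helper names and the `tail` labels ('two', 'upper', 'lower') are assumptions made for this illustration, and `statistics.NormalDist` (Python 3.8+) stands in for the unit normal table.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, n, pop_mean, pop_sd):
    """z = (M - mu) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))

def rejects_null(z, alpha=0.05, tail="two"):
    """tail: 'two', 'upper' (predicted increase), or 'lower' (predicted decrease)."""
    std_normal = NormalDist()
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1.0 - alpha / 2)
    if tail == "upper":
        return z > std_normal.inv_cdf(1.0 - alpha)
    if tail == "lower":
        return z < std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'upper', or 'lower'")

# Meditation example: mu = 25, SD = 5, n = 10, M = 28
z = z_statistic(28, 10, 25, 5)
for tail in ("upper", "two", "lower"):
    print(f"{tail}-tailed: z = {z:.2f}, reject H0: {rejects_null(z, tail=tail)}")
```

The 'lower' case shows the cost of a directional test: if the prediction had been a decrease, this sample could never have led to rejection.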

Let’s Try One

You are testing a new diet drug. Americans eat an average of 2000 calories per day with a SD of 500. You want to see if the drug decreases the amount of calories consumed. You take a sample of 10 people who eat an average of 1700 calories per day. Did your diet drug work?

Answers

(1) H0: μ diet drug ≥ 2000
H1: μ diet drug < 2000

(2) Set alpha level at .05. Since this is a one-tailed test, we can find the critical z-score for .05 in the tail. Critical z = -1.65

(3) z = (1700 - 2000) / (500 / √10)

z = -300 / 158.11

z = -1.90

(4) Reject the null hypothesis. Our diet drug works!

One vs. Two Tails

Some contend that the two-tailed test is more rigorous and therefore more convincing
– Requires more evidence to reject the null

Others feel that one-tailed tests are better because they are more sensitive
– More precise tests of specific hypotheses

In general, 2-tailed tests should be used when there is no strong directional expectation OR when there are two competing predictions
– e.g. one theory predicts an increase of scores following treatment while another predicts a decrease

Never use a 1-tailed test as a second attempt for significance

Statistical Power

Power: the probability that the test will correctly reject a false null hypothesis (1 - β). Or, the likelihood that we will obtain sample data in the critical region when the treatment really has an effect.

The more powerful our statistical test, the more readily we will detect a treatment effect when one really exists.

Power

– Treatment size: the larger the treatment size, the greater the power
– Alpha level: reducing alpha decreases power
– One-tailed v. two-tailed tests: one-tailed tests will cause a larger proportion of the treatment distribution to be in the critical region
– Sample size: the larger the sample, the smaller the standard error, so the more separate the distributions
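Each of these four factors can be checked numerically. The sketch below computes power for a z-test under the lecture’s assumption that the treatment simply adds a constant to every score. The function name is hypothetical, the baseline numbers (a 2-point effect, σ = 4, n = 16) are illustrative only, and the one-tailed branch assumes the effect is in the predicted (upper) direction.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, pop_sd, n, alpha=0.05, two_tailed=True):
    """Probability of rejecting H0 when the true mean shift equals `effect`."""
    std_normal = NormalDist()
    shift = effect / (pop_sd / sqrt(n))  # true shift in standard-error units
    if two_tailed:
        z_crit = std_normal.inv_cdf(1.0 - alpha / 2)
        upper = 1.0 - std_normal.cdf(z_crit - shift)  # area beyond +z_crit
        lower = std_normal.cdf(-z_crit - shift)       # area beyond -z_crit
        return upper + lower
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    return 1.0 - std_normal.cdf(z_crit - shift)

baseline = z_test_power(effect=2, pop_sd=4, n=16)
bigger_effect = z_test_power(effect=3, pop_sd=4, n=16)
smaller_alpha = z_test_power(effect=2, pop_sd=4, n=16, alpha=0.01)
one_tailed = z_test_power(effect=2, pop_sd=4, n=16, two_tailed=False)
bigger_n = z_test_power(effect=2, pop_sd=4, n=36)

print(f"baseline {baseline:.2f}, bigger effect {bigger_effect:.2f}, "
      f"alpha .01 {smaller_alpha:.2f}, one-tailed {one_tailed:.2f}, "
      f"n = 36 {bigger_n:.2f}")
```

Changing one argument at a time shows the direction in which each factor moves power.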

Statistical Power

Assumptions for the Hypothesis Test with Z-scores

Random sampling
– In order to generalize the findings from our sample to the population we need to select the sample randomly, so we don’t add bias

Independent observations
– There can’t be a consistent and predictable relationship between 2 data points
• e.g. in a coin toss, even if you just tossed the coin 4 times and got “heads” every time, there is still a 50% chance of getting heads the next time
• Gambler’s fallacy
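The coin-toss claim can be checked with a quick simulation: among runs that start with four heads, how often is the fifth toss also heads? This is only a sketch, with an invented function name, an arbitrary number of simulated sequences, and a fixed seed so the run is repeatable.

```python
import random

def heads_rate_after_streak(streak=4, n_sequences=200_000, seed=0):
    """Among sequences that open with `streak` heads, the share whose next toss is heads."""
    rng = random.Random(seed)
    qualifying = 0
    heads_next = 0
    for _ in range(n_sequences):
        # True means heads; each toss is independent of the others.
        tosses = [rng.random() < 0.5 for _ in range(streak + 1)]
        if all(tosses[:streak]):
            qualifying += 1
            heads_next += tosses[streak]
    return heads_next / qualifying

rate = heads_rate_after_streak()
print(f"P(heads | 4 heads in a row) is about {rate:.2f}")
```

Because the tosses are independent, the streak carries no information about the next toss.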

Assumptions for the Hypothesis Test with Z-scores

The value of σ is unchanged by the treatment
– This assumes that the treatment effect is constant and additive (or subtractive)
– So, the mean should change, but the standard deviation should not
– This is a theoretical ideal. In actual experiments the treatment effect may vary a bit.

Normal sampling distribution
– To evaluate z-scores (this will change slightly with other test statistics we will introduce in the following days) we have to use the unit normal table, and that requires a distribution of sample means that is normal

Violations of any of these assumptions can invalidate the results of our experiments!

Criticisms about Hypothesis Testing

(1) Hypothesis testing doesn’t tell you very much: just statistical significance and direction, not the size or location of the results.
– It gives us all-or-nothing information
– What really is the difference between a z-score of 1.88 (p ≈ .06) and 1.96 (p = .05)?
– Some people respond that some criterion level has to be set, so that while .05 is arbitrary, it is necessary.

(2) Idea of a null hypothesis is artificial.
– Every treatment must have some effect.
– This means you can never have a Type I error.
– But, because we are using an inferential process it is impossible to prove the alternative hypothesis true. We can only show that the null hypothesis is very unlikely.


(3) Anything is statistically significant with a large enough N - H0 is always false

(4) Statistical significance can arise spuriously (e.g., family-wise error, the tendency of measures from the same study to be related [“the crud factor”]).

(5) Significance testing places too great an emphasis on Type I error and not enough on Type II error


(6) Hypothesis testing rejects the null to conclude that a treatment has a significant effect, but it does not mean the treatment has a “substantial” effect


Most important criticism…How do we speak to this?

Measuring Effect Size

To address the fact that we don’t know whether an effect is “substantial,” we report effect size.

Cohen’s d = mean difference / SD
– Standardizes the mean difference in terms of standard deviations (like z-scores)

Diet drug example:
– d = 300 / 500 (note sample size is not taken into account)
– d = .6

Magnitude
– 0 < d < .2: small effect (difference less than .2 SD)
– .2 < d < .8: medium effect (difference around .5 SD)
– d > .8: large effect (difference greater than .8 SD)
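A short sketch makes the contrast between effect size and the test statistic visible. The function names are made up for this example, and the handling of values that fall exactly on .2 or .8 is an assumption, since the slide does not say which label they get.

```python
from math import sqrt

def cohens_d(mean_difference, sd):
    """Cohen's d: size of the mean difference in standard deviation units."""
    return abs(mean_difference) / sd

def effect_label(d):
    """Magnitude labels from the slide. Boundary handling is an assumption."""
    if d < 0.2:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

def z_statistic(mean_difference, sd, n):
    """z depends on sample size through the standard error."""
    return mean_difference / (sd / sqrt(n))

# Diet drug example: M = 1700, mu = 2000, SD = 500
d = cohens_d(1700 - 2000, 500)
print(f"d = {d:.1f} ({effect_label(d)} effect)")
for n in (10, 40, 160):
    print(f"n = {n}: z = {z_statistic(1700 - 2000, 500, n):.2f}, d = {d:.1f}")
```

As n grows, z moves further from 0 while d stays the same, which is why effect size is reported alongside significance.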

Homework Chapter 8

1, 2, 7, 8, 11, 16, 18, 25, 26, 30
