normal distribution

Statistics: Unlocking the Power of Data Lock5

Normal Distribution

STAT 101

Dr. Kari Lock Morgan

Chapter 5

• Normal distribution

• Central limit theorem

• Normal distribution for confidence intervals

• Normal distribution for p-values

• Standard normal


Re-grade Requests

4e potential grading mistake: 0.025 is correct

Requests for a re-grade must be submitted in writing by class on Wednesday, March 5th

Partial credit will NOT be adjusted

Valid re-grade requests:

• You got points off but believe your answer is correct

• Points were added incorrectly

Warning: scores may go up or down


[Figure: dot plots of six bootstrap and randomization distributions]

• Slope: Restaurant tips

• Correlation: Malevolent uniforms

• Mean: Body temperatures

• Difference in means: Finger taps

• Mean: Atlanta commutes

• Proportion: Owners/dogs

What do you notice?

Bootstrap and Randomization Distributions


• The symmetric, bell-shaped curve we have seen for almost all of our bootstrap and randomization distributions is called a normal distribution

Normal Distribution

[Figure: frequency histogram of a symmetric, bell-shaped distribution; x-axis from -3 to 3]


Central Limit Theorem!

For a sufficiently large sample size, the distribution of sample statistics for a mean or a proportion is approximately normal

www.lock5stat.com/StatKey


[Figure: distribution of the sample proportion for p = 0.5, 0.7, 0.1 and n = 1, 10, 30, 50, 100; each x-axis runs from 0.0 to 1.0]


CLT for a Mean

[Figure: a population distribution; for n = 10, n = 30, and n = 50, the distribution of sample data (frequency histogram) alongside the distribution of sample means]


Central Limit Theorem

• The central limit theorem holds for ANY original distribution, although “sufficiently large sample size” varies

• The more skewed the original distribution is (the farther from normal), the larger the sample size has to be for the CLT to work

• For small samples, it is more important that the data itself is approximately normal
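The effect of sample size on a skewed population can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the number of repetitions, and the function name `sample_mean_skewness` are assumptions made here, not part of the course materials. It measures how lopsided the distribution of sample means is (skewness near 0 means roughly symmetric).

```python
import random
import statistics

def sample_mean_skewness(n, reps=4000, seed=0):
    """Skewness of `reps` simulated sample means, each from n exponential draws.

    The exponential population is an assumed example of a strongly
    right-skewed distribution.
    """
    rng = random.Random(seed)
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(reps)]
    mu = statistics.fmean(means)
    sd = statistics.pstdev(means)
    # Standardized third moment: 0 for a perfectly symmetric distribution.
    return statistics.fmean(((m - mu) / sd) ** 3 for m in means)

skew_by_n = {n: sample_mean_skewness(n) for n in (1, 10, 30, 50)}
for n, skew in skew_by_n.items():
    print(f"n = {n:>2}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward 0 as n grows, which is the CLT at work; a more skewed population would need a larger n to get equally close.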


Central Limit Theorem

• For distributions of a quantitative variable that are not very skewed and without large outliers, n ≥ 30 is usually sufficient to use the CLT

• For distributions of a categorical variable, counts of at least 10 within each category are usually sufficient to use the CLT
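These two rules of thumb can be written as quick checks. This is a minimal sketch; the function names and the small-sample example (n = 40, proportion 0.1) are hypothetical, while the hearing-loss numbers (n = 1771, proportion 0.195) come from the example in these slides. The mean check covers only the sample-size part of the rule; skewness and outliers still have to be judged from a plot.

```python
def counts_large_enough(n, p_hat, min_count=10):
    """True if both category counts, n*p_hat and n*(1 - p_hat), reach min_count."""
    successes = n * p_hat
    failures = n * (1 - p_hat)
    return successes >= min_count and failures >= min_count

def mean_sample_large_enough(n, min_n=30):
    """True if n meets the n >= 30 guideline for a quantitative variable."""
    return n >= min_n

print(counts_large_enough(1771, 0.195))  # about 345 and 1426 in each category
print(counts_large_enough(40, 0.1))      # only about 4 in the smaller category
print(mean_sample_large_enough(25))
```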


Accuracy

• The accuracy of intervals and p-values generated using simulation methods (bootstrapping and randomization) depends on the number of simulations (more simulations = more accurate)

• The accuracy of intervals and p-values generated using formulas and the normal distribution depends on the sample size (larger sample size = more accurate)

• If the distribution of the statistic is truly normal and you have generated many simulated randomizations, the p-values should be very close


• The normal distribution is fully characterized by its mean and standard deviation

Normal Distribution

N(mean, standard deviation)


Bootstrap Distributions

If a bootstrap distribution is approximately normally distributed, we can write it as

a) N(parameter, sd)

b) N(statistic, sd)

c) N(parameter, se)

d) N(statistic, se)

sd = standard deviation of variable

se = standard error = standard deviation of statistic


Hearing Loss

• In a random sample of 1771 Americans aged 12 to 19, 19.5% had some hearing loss (this is a dramatic increase from a decade ago!)

• What proportion of Americans aged 12 to 19 have some hearing loss? Give a 95% CI.

Rabin, R. “Childhood: Hearing Loss Grows Among Teenagers,” www.nytimes.com, 8/23/10.


Hearing Loss

(0.177, 0.214)


Hearing Loss

N(0.195, 0.0095)


Confidence Intervals

If the bootstrap distribution is normal:

To find a P% confidence interval, we just need to find the middle P% of the distribution

N(statistic, SE)


Area under a Curve

• The area under the curve of a normal distribution is equal to the proportion of the distribution falling within that range

• Knowing just the mean and standard deviation of a normal distribution allows you to calculate areas in the tails and percentiles

www.lock5stat.com/statkey


Hearing Loss

(0.176, 0.214)
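The same interval can be reproduced without StatKey. The sketch below assumes Python's standard-library `statistics.NormalDist` as a stand-in for the StatKey normal tool; the mean 0.195 and standard error 0.0095 are the values given for the hearing loss example.

```python
from statistics import NormalDist

# N(statistic, SE) for the hearing loss bootstrap distribution
boot = NormalDist(mu=0.195, sigma=0.0095)

# The middle 95% runs from the 2.5th to the 97.5th percentile.
lower = boot.inv_cdf(0.025)
upper = boot.inv_cdf(0.975)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")   # (0.176, 0.214)

# Area under the curve between the endpoints = proportion of the distribution
area = boot.cdf(upper) - boot.cdf(lower)
print(f"area between endpoints: {area:.3f}")   # 0.950
```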




Standardized Data

Often, we standardize the data to have mean 0 and standard deviation 1

This is done with z-scores

From x to z: $z = \frac{x - \text{mean}}{\text{sd}}$

From z to x: $x = \text{mean} + z \cdot \text{sd}$

Places everything on a common scale
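As a quick illustration of the two conversions, here is a minimal sketch. The function names `to_z` and `from_z` are made up for this example; the numbers reuse the hearing loss distribution N(0.195, 0.0095).

```python
def to_z(x, mean, sd):
    """From x to z: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

def from_z(z, mean, sd):
    """From z to x: return to the original scale."""
    return mean + z * sd

z = to_z(0.214, mean=0.195, sd=0.0095)
x = from_z(z, mean=0.195, sd=0.0095)
print(f"z = {z:.2f}, back to x = {x:.3f}")
```

Here 0.214 sits about 2 standard deviations above 0.195, and converting back recovers 0.214.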


Standard Normal

• The standard normal distribution is the normal distribution with mean 0 and standard deviation 1

N(0, 1)

[Figure: standard normal curve, titled “Distribution of Statistic Assuming Null”; x-axis (Statistic) from -3 to 3]


Standardized Data

Confidence Interval (bootstrap distribution):

mean = sample statistic, sd = SE

From z to x (CI): $x = \text{statistic} + z \cdot SE$


P% Confidence Interval

1. Find z-scores (–z* and z*) that capture the middle P% of the standard normal

2. Return to original scale with statistic ± z* · SE

[Figure: standard normal curve with the middle P% shaded between –z* and z*]


Confidence Interval using N(0,1)

If a statistic is normally distributed, we find a confidence interval for the parameter using

$\text{statistic} \pm z^* \cdot SE$

where the area between –z* and +z* in the standard normal distribution is the desired level of confidence.


Confidence Intervals

Find z* for a 99% confidence interval.

z* = 2.576



Why use the standard normal?

Common confidence levels:

95%: z* = 1.96 (but 2 is close enough)

90%: z* = 1.645

99%: z* = 2.576
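These z* values can be recovered from the standard normal directly. The sketch below again assumes `statistics.NormalDist` from the Python standard library in place of StatKey; the helper name `z_star` is hypothetical.

```python
from statistics import NormalDist

def z_star(confidence):
    """z* such that the area between -z* and +z* under N(0, 1) equals `confidence`."""
    tail = (1 - confidence) / 2          # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z* = {z_star(level):.3f}")
```

Because every normal distribution standardizes to N(0, 1), these values only need to be found once and can be reused for any statistic and SE.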


In March 2011, a random sample of 1000 US adults were asked

“Do you favor or oppose ‘sin taxes’ on soda and junk food?”

320 adults responded in favor of sin taxes.

Give a 99% CI for the proportion of all US adults that favor these sin taxes.

From a bootstrap distribution, we find SE = 0.015

Sin Taxes


Sin Taxes
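One way to work the sin taxes interval is sketched below, assuming the bootstrap distribution is close enough to normal to use statistic ± z* · SE. The sample counts and SE = 0.015 are from the problem statement; the use of `statistics.NormalDist` is an assumption of this sketch.

```python
from statistics import NormalDist

p_hat = 320 / 1000                      # sample proportion in favor
se = 0.015                              # from the bootstrap distribution
z_star = NormalDist().inv_cdf(0.995)    # middle 99% leaves 0.5% in each tail

lower = p_hat - z_star * se
upper = p_hat + z_star * se
print(f"99% CI: ({lower:.3f}, {upper:.3f})")   # (0.281, 0.359)
```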


Randomization Distributions

If a randomization distribution is approximately normally distributed, we can write it as

a) N(null value, se)

b) N(statistic, se)

c) N(parameter, se)


p-values

If the randomization distribution is normal:

To calculate a p-value, we just need to find the area in the appropriate tail(s) beyond the observed statistic of the distribution


First Born Children

• Are first born children actually smarter?

• Explanatory variable: first born or not

• Response variable: combined SAT score

• Based on a sample of college students, we find an observed difference in mean SAT scores of 30.26

• From a randomization distribution, we find SE = 37


First Born Children

SE = 37

What normal distribution should we use to find the p-value?

a) N(30.26, 37)

b) N(37, 30.26)

c) N(0, 37)

d) N(0, 30.26)


Hypothesis Testing

[Figure: distribution of the statistic assuming the null (x-axis from -3 to 3), with the observed statistic marked and the p-value shown as the tail area beyond it]


First Born Children

N(0, 37)


p-value = 0.207



Standardized Data

Hypothesis test (randomization distribution):

mean = null value, sd = SE

From x to z (test): $z = \frac{x - \text{mean}}{\text{sd}} = \frac{\text{statistic} - \text{null value}}{SE}$


p-value using N(0,1)

If a statistic is normally distributed under H0, the p-value is the probability a standard normal is beyond

$z = \frac{\text{statistic} - \text{null value}}{SE}$


First Born Children

SE = 37

1) Find the standardized test statistic

2) Compute the p-value


First Born Children
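The two steps can be sketched as follows. This assumes the observed difference is 30.26 (the value appearing in the answer choices), a null value of 0, and an upper-tail alternative (first born children score higher); `statistics.NormalDist` stands in for StatKey.

```python
from statistics import NormalDist

observed = 30.26     # assumed observed difference in mean SAT scores
null_value = 0.0     # no difference under H0
se = 37.0            # from the randomization distribution

# 1) Standardized test statistic
z = (observed - null_value) / se

# 2) p-value: upper-tail area beyond z under N(0, 1)
p_standard = 1 - NormalDist().cdf(z)

# Same area computed on the original scale, under N(0, 37)
p_original = 1 - NormalDist(null_value, se).cdf(observed)

print(f"z = {z:.3f}")
print(f"p-value from N(0, 1):  {p_standard:.3f}")
print(f"p-value from N(0, 37): {p_original:.3f}")
```

Both routes give the same tail area, matching the 0.207 found from N(0, 37); standardizing only changes the scale, not the answer.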


z-statistic

If z = –3, using α = 0.05 we would

(a) Reject the null

(b) Not reject the null

(c) Impossible to tell

(d) I have no idea


z-statistic

• Calculating the number of standard errors a statistic is from the null value allows us to assess extremity on a common scale
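To see how extreme a given z-statistic is, the tail areas of N(0, 1) can be computed directly. This is a minimal sketch using z = –3 as the example value; the function name `tail_areas` is hypothetical.

```python
from statistics import NormalDist

def tail_areas(z):
    """One-tail and two-tail areas beyond |z| under the standard normal."""
    one_tail = NormalDist().cdf(-abs(z))
    return one_tail, 2 * one_tail

one_tail, two_tail = tail_areas(-3.0)
print(f"one-tail area: {one_tail:.5f}")
print(f"two-tail area: {two_tail:.5f}")
```

A statistic 3 standard errors from the null value leaves well under 1% of the distribution in the tails, whichever alternative is used.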


Confidence Interval Formula

$\text{sample statistic} \pm z^* \cdot SE$

• sample statistic: from original data

• z*: from N(0,1)

• SE: from bootstrap distribution

IF SAMPLE SIZES ARE LARGE…


Formula for p-values

$z = \frac{\text{sample statistic} - \text{null value}}{SE}$

• sample statistic: from original data

• null value: from H0

• SE: from randomization distribution

Compare z to N(0,1) for p-value

IF SAMPLE SIZES ARE LARGE…


Standard Error

• Wouldn’t it be nice if we could compute the standard error without doing thousands of simulations?

• We can!!!

• Or at least we’ll be able to next class…


• For quantitative data, we use a t-distribution instead of the normal distribution

• The t-distribution is very similar to the standard normal, but with slightly fatter tails (to reflect the uncertainty in the sample standard deviations)

t-distribution


• The t-distribution is characterized by its degrees of freedom (df)

• Degrees of freedom are based on sample size

• Single mean: df = n – 1

• Difference in means: df = min(n1, n2) – 1

• Correlation: df = n – 2

• The higher the degrees of freedom, the closer the t-distribution is to the standard normal

Degrees of Freedom
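The three degrees-of-freedom rules translate directly into code. The function names below are made up for this sketch; the sample sizes 255 and 65 are the group sizes from the Pygmalion example later in these slides.

```python
def df_single_mean(n):
    """Single mean: df = n - 1."""
    return n - 1

def df_difference_in_means(n1, n2):
    """Difference in means: df = min(n1, n2) - 1."""
    return min(n1, n2) - 1

def df_correlation(n):
    """Correlation: df = n - 2."""
    return n - 2

print(df_single_mean(30))                 # 29
print(df_difference_in_means(255, 65))    # 64
print(df_correlation(30))                 # 28
```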


t-distribution


Aside: William Sealy Gosset


The Pygmalion Effect

Source: Rosenthal, R. and Jacobson, L. (1968). “Pygmalion in the Classroom: Teacher Expectation and Pupils’ Intellectual Development.” Holt, Rinehart and Winston, Inc.

Teachers were told that certain children (chosen randomly) were expected to be intellectual “growth spurters,” based on the Harvard Test of Inflected Acquisition (a test that didn’t actually exist). These children were selected randomly.

The response variable is change in IQ over the course of one year.


The Pygmalion Effect

                      n      x̄       s
Control Students      255    8.42    12.0
“Growth Spurters”     65     12.22   13.3

Can this provide evidence that merely expecting a child to do well actually causes the child to do better?

If so, how much better?

*s1 and s2 were not given, so I set them to give the correct p-value

SE = 1.8


Pygmalion Effect


Pygmalion Effect

From the paper: “The difference in gains could be ascribed to chance about 2 in 100 times”
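A rough check of both questions (is there evidence, and how much better?) can be sketched with the formulas from this lecture. This uses the group means and SE = 1.8 given above, assumes an upper-tail alternative, and uses the standard normal as an approximation; a t-distribution with df = 64 would have slightly fatter tails and so give a slightly larger p-value and a slightly wider interval.

```python
from statistics import NormalDist

mean_control = 8.42
mean_spurters = 12.22
se = 1.8                                   # given on the slide

diff = mean_spurters - mean_control        # observed difference in mean IQ gain
z = (diff - 0) / se                        # null value is 0 (no difference)
p_value = 1 - NormalDist().cdf(z)          # upper-tail area (assumed alternative)

z_star = NormalDist().inv_cdf(0.975)       # for a 95% interval
ci = (diff - z_star * se, diff + z_star * se)

print(f"difference = {diff:.2f}, z = {z:.2f}")
print(f"p-value (normal approximation) = {p_value:.3f}")
print(f"95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The p-value comes out near 0.02, in line with the paper's "about 2 in 100 times".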


Pygmalion Effect


To Do

Do Project 1 (due 3/7)

http://stat.duke.edu/courses/Spring13/sta101.002/project1.pdf

Read Chapter 5
