for 95 out of 100 (large) samples, the interval will contain the true population mean. but we...


Upload: janice-newman

Post on 13-Jan-2016



For 95 out of 100 (large) samples, the interval

$\bar{x} \pm 1.96\,\sigma/\sqrt{n}$

will contain the true population mean.

But we don’t know $\sigma$?!
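The "95 out of 100" claim can be checked by simulation. The sketch below assumes a Normal population with a known standard deviation and builds the interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ for many samples. The population values (mean 50, standard deviation 10), the sample size and the function name are illustrative assumptions, not taken from the slides.

```python
import math
import random

def coverage_known_sigma(mu, sigma, n, trials, seed=0):
    """Share of simulated samples whose 95% interval contains mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # margin of error when sigma is known
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials

# Hypothetical population: mean 50, standard deviation 10.
coverage = coverage_known_sigma(mu=50.0, sigma=10.0, n=40, trials=10000)
print(f"share of intervals containing mu: {coverage:.3f}")
```

The printed share should sit close to 0.95. The half-width uses the true sigma, which is exactly the quantity we do not have in practice.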


Inference for the Mean of a Population

To estimate $\mu$, we use a confidence interval around $\bar{x}$.

The confidence interval is built with $\sigma$, which we replace with s (the sample std. dev.) if $\sigma$ is not known.

$\bar{x} \pm 1.96\,\sigma/\sqrt{n}$


t-distributions

The “standard error” of $\bar{x}$:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

The one-sample t-statistic:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

For an SRS of size n from a Normal population, the one-sample t-statistic has the t-distribution with n-1 degrees of freedom.

(see Table D)


t-distributions

t-distributions with k (=n-1) degrees of freedom
– are labeled t(k),
– are symmetric around 0,
– are bell-shaped,
– … but have more variability than Normal distributions, due to the substitution of s in the place of $\sigma$.
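The extra variability can be seen by simulation. The sketch below draws samples from a standard Normal population, computes the one-sample t-statistic with s in place of the population standard deviation, and counts how often |t| exceeds 1.96. The sample sizes, number of trials and function name are assumptions chosen for illustration.

```python
import math
import random

def share_beyond_196(n, trials, seed=1):
    """Share of one-sample t-statistics with |t| > 1.96 when the true mean is 0."""
    rng = random.Random(seed)
    beyond = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(sample) / n
        # Sample standard deviation (n - 1 in the denominator).
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = xbar / (s / math.sqrt(n))
        if abs(t) > 1.96:
            beyond += 1
    return beyond / trials

small = share_beyond_196(n=5, trials=10000)    # t(4): heavy tails
large = share_beyond_196(n=100, trials=10000)  # t(99): close to Normal
print(f"n=5:   {small:.3f}")
print(f"n=100: {large:.3f}")
```

For a Normal statistic the share would be about 5%. With n = 5 it should come out well above that, and with n = 100 it should be close to 5%, which is why t* is larger than 1.96 for small samples.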


Example: Estimating the level of vitamin C

Data:

26 31 23 22 11 22 14 31

Find a 95% confidence interval for $\mu$.

A: ( , )

Write it as “estimate plus margin of error”.
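As a cross-check outside STATA, the interval can be computed by hand in a few lines. This is a minimal sketch using the eight observations above. The critical value t* = 2.365 is the standard 95% table entry for 7 degrees of freedom and is typed in by hand, since the Python standard library has no t-distribution.

```python
import math
import statistics

vitamin_c = [26, 31, 23, 22, 11, 22, 14, 31]

n = len(vitamin_c)
xbar = statistics.mean(vitamin_c)   # the estimate
s = statistics.stdev(vitamin_c)     # sample std. dev. (n - 1 denominator)
se = s / math.sqrt(n)               # standard error of the mean
t_star = 2.365                      # t(7) critical value for 95%, from a t table
margin = t_star * se                # margin of error

lower, upper = xbar - margin, xbar + margin
print(f"{xbar:.1f} +/- {margin:.1f}  ->  ({lower:.1f}, {upper:.1f})")
```

With these data the interval works out to roughly 22.5 ± 6.0.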

STATA Exercise 1


STATA Exercise 2


STATA Exercise 2


STATA Exercises 3 and 4


Paired, unpaired tests

“Paired” tests compare each individual between two variables and ask whether the mean difference (“gain” in this example) is zero.

Ho: mean(pretest - posttest) = mean(diff) = 0

STATA Exercise 5


STATA Exercise 6


Robustness of t procedures

t-tests are only appropriate for testing a hypothesis on a single mean in these cases:
– If n<15: only if the data is Normally distributed (with no outliers or strong skewness)
– If n≥15: only if there are no outliers or strong skewness
– If n≥40: even if clearly skewed (because of the Central Limit Theorem)


Comparing Two Means


Comparing Two Means

Suppose we make a change to the registration procedure. Does this reduce the number of mistakes?

Basically, we’re looking at two populations:
– the before-change population (population 1)
– the after-change population (population 2)

Is the mean number of mistakes (per student) different? Is $\mu_1 - \mu_2 = 0$ or $\neq 0$?


Comparing Two Means

Notice that we are not matching pairs. We compare two groups.


Comparing Two Means

Population   Variable   Mean       Standard Deviation
1            $x_1$      $\mu_1$    $\sigma_1$
2            $x_2$      $\mu_2$    $\sigma_2$


Comparing Two Means

Population   Sample Size   Sample Mean    Sample Standard Deviation
1            $n_1$         $\bar{x}_1$    $s_1$
2            $n_2$         $\bar{x}_2$    $s_2$


Comparing Two Means

The population, really, is every single student using each registration procedure, an infinite number of times.
– Suppose we get a “good” result today: how do we know it will be repeated tomorrow?

We can’t repeat the procedure an infinite number of times; we only have a “sample”: numbers from one year.

We estimate $(\mu_1 - \mu_2)$ with $(\bar{x}_1 - \bar{x}_2)$.


Comparing Two Means

Remember $\bar{x}$ is a Random Variable. To estimate $\mu$ we need both $\bar{x}$ and the margin of error around $\bar{x}$, which is

$$t^* \, \sigma_{\bar{x}}$$

So we need to know

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

or rather, the appropriate standard error for this estimation.

Because we are estimating a difference, we need the standard error of a difference.


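Comparing Two Means

If the standard error for $\bar{x}_1$ is

$$\sigma_{\bar{x}_1} = \frac{\sigma_1}{\sqrt{n_1}}$$

then the standard error for $(\bar{x}_1 - \bar{x}_2)$ is

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$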


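Two-sample significance test

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

With the unequal option, STATA uses the Satterthwaite approximation for the degrees of freedom as a default. This t statistic does not have exactly a t-distribution because we are replacing two standard deviations by their sample equivalents.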
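To make the unequal-variance two-sample statistic concrete, here is a minimal sketch that computes the standard error of the difference, the t statistic under Ho: $\mu_1 - \mu_2 = 0$, and the Satterthwaite approximation to the degrees of freedom. The two lists of mistake counts are made-up numbers for illustration only, and the function name is an assumption.

```python
import math
import statistics

def unequal_variance_t(sample1, sample2):
    """Two-sample t statistic, SE of the difference, and Satterthwaite df."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    se_diff = math.sqrt(v1 + v2)             # standard error of (xbar1 - xbar2)
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / se_diff
    # Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se_diff, df

# Hypothetical mistakes per student before and after the change.
before = [4, 6, 5, 7, 3, 6, 5, 8]
after = [3, 4, 2, 5, 3, 4]

t, se_diff, df = unequal_variance_t(before, after)
print(f"t = {t:.3f}, SE = {se_diff:.3f}, df = {df:.2f}")
```

The approximate degrees of freedom always fall between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2, and are usually not a whole number.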



STATA Exercise 7


Paired, unpaired tests

“Paired” tests compare each individual between two variables and ask whether the mean difference (“gain” in this example) is zero.

Ho: mean(pretest - posttest) = mean(diff) = 0

“Unpaired” tests take the mean of each variable and test whether the difference of the means is zero.

Ho: mean(pretest) - mean(posttest) = diff = 0
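The two hypotheses look alike, but the tests use different standard errors. The sketch below computes both t statistics on the same made-up pretest and posttest scores. The scores are hypothetical and only serve to show the mechanics.

```python
import math
import statistics

# Hypothetical scores for the same six individuals.
pretest = [12, 15, 11, 18, 14, 16]
posttest = [14, 18, 12, 21, 15, 19]
n = len(pretest)

# Paired: a one-sample t-test on the individual differences.
diffs = [pre - post for pre, post in zip(pretest, posttest)]
se_paired = statistics.stdev(diffs) / math.sqrt(n)
t_paired = statistics.mean(diffs) / se_paired

# Unpaired: difference of the two means over the SE of a difference.
se_unpaired = math.sqrt(
    statistics.variance(pretest) / n + statistics.variance(posttest) / n
)
t_unpaired = (statistics.mean(pretest) - statistics.mean(posttest)) / se_unpaired

print(f"paired   t = {t_paired:.2f}")
print(f"unpaired t = {t_unpaired:.2f}")
```

Both numerators are the same number, since the mean of the differences equals the difference of the means. The paired standard error is much smaller here because pairing removes the variation between individuals. With equal sample sizes, as in this sketch, the pooled and unequal-variance standard errors coincide.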

STATA Exercise 5


STATA Exercise 8

ttest ego, by(group) unequal


Robustness and Small Samples

Two-sample methods are more robust than one-sample methods.
– More so if the two samples have similar shapes and sample sizes.

STATA assumes that the variances are the same (what the book calls “pooled t procedures”), unless you tell it the opposite, using the unequal option.

Small samples, as always, make the test less robust.


Pooled two-sample t procedures


Pooled two-sample t procedures

Suppose the two Normal population distributions have the same standard deviation.

Then the t-statistic that compares the means of samples from those two populations has exactly a t-distribution.


Pooled two-sample t procedures

The common, but unknown standard deviation of both populations is $\sigma$. The sample standard deviations $s_1$ and $s_2$ estimate $\sigma$.

The best way to combine these estimates is to take a “weighted average” of the two, using the dfs as the weights:

$$s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}$$
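The weighted average is easy to compute directly. This sketch pools the two sample variances with weights n1 - 1 and n2 - 1. The function name and the two small samples are hypothetical.

```python
import statistics

def pooled_variance(sample1, sample2):
    """Average of the two sample variances, weighted by degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)
    s2_sq = statistics.variance(sample2)
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Hypothetical samples from two populations assumed to share one sigma.
group1 = [4, 6, 5, 7, 3, 6, 5, 8]
group2 = [3, 4, 2, 5, 3, 4]

sp_sq = pooled_variance(group1, group2)
print(f"pooled variance = {sp_sq:.3f}, pooled std. dev. = {sp_sq ** 0.5:.3f}")
```

The pooled variance always lies between the two sample variances, closer to the one from the larger sample.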


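THE POOLED TWO-SAMPLE T PROCEDURES

(assuming $\sigma$ is the same for both populations)

A level C confidence interval for $\mu_1 - \mu_2$ is

$$(\bar{x}_1 - \bar{x}_2) \pm t^* \, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Here, $t^*$ is the value for the $t(n_1 + n_2 - 2)$ density curve with area C between $-t^*$ and $t^*$.

To test the hypothesis Ho: $\mu_1 = \mu_2$, compute the pooled two-sample t statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

and use P-values from the $t(n_1 + n_2 - 2)$ distribution.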

ttest ego, by(group)
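For comparison with the STATA output, the pooled procedure can be reproduced by hand. The sketch below uses two made-up samples, not the ego data from the exercise. The critical value 2.179 is the standard 95% table entry for 12 degrees of freedom, typed in by hand.

```python
import math
import statistics

# Hypothetical samples, assumed to come from populations with a common sigma.
group1 = [4, 6, 5, 7, 3, 6, 5, 8]
group2 = [3, 4, 2, 5, 3, 4]
n1, n2 = len(group1), len(group2)

# Pooled standard deviation: variances weighted by their degrees of freedom.
sp = math.sqrt(
    ((n1 - 1) * statistics.variance(group1) + (n2 - 1) * statistics.variance(group2))
    / (n1 + n2 - 2)
)
se = sp * math.sqrt(1 / n1 + 1 / n2)   # standard error of the difference
diff = statistics.mean(group1) - statistics.mean(group2)

t = diff / se                          # pooled two-sample t statistic
df = n1 + n2 - 2                       # degrees of freedom
t_star = 2.179                         # t(12) critical value for 95%, from a t table
margin = t_star * se

print(f"t = {t:.3f} on {df} df")
print(f"95% CI for mu1 - mu2: ({diff - margin:.2f}, {diff + margin:.2f})")
```

Unlike the unequal-variance version, the degrees of freedom here are the whole number n1 + n2 - 2, and the statistic has exactly a t-distribution when the equal-sigma and Normality assumptions hold.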
