
Page 1: More on Sampling Distributions and Confidence Intervals · 2020-05-03

More on Sampling Distributions and Confidence Intervals

Jared S. Murray
The University of Texas at Austin
McCombs School of Business

Page 2:

Recall: Sampling Distributions and Standard Errors

Sampling distributions describe how our estimates are likely to change if we had seen slightly different data (a different sample from the same population).

Large spread in the sampling distribution → low confidence that our estimate (which is one random draw from this distribution) is close to the true value (usually the mean, or close to the mean, of this distribution).

An estimate's standard error is the standard deviation (spread) of its sampling distribution.

Page 3:

Estimating standard errors

Page 4:

Estimating standard errors

We saw last week how to estimate sampling distributions and standard errors using the bootstrap. This approach is useful, general, and easy to implement.

For some important statistics, we can also directly calculate estimates of standard errors, under some assumptions. This is probably how you did it in your last stats class.

For example: the standard error of the sample mean ȳ is

s_ȳ = √(σ²/n) ≈ √(s²_y/n)

where n is the sample size, σ² is the population variance of y, and s²_y is the sample variance of y.
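As a numeric sketch of the plug-in formula (Python rather than the course's R; the data here are made up for illustration):

```python
import math
import statistics

# Hypothetical sample of n = 8 observations
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n = len(y)

s2_y = statistics.variance(y)   # sample variance s^2_y (divides by n - 1)
se_ybar = math.sqrt(s2_y / n)   # plug-in estimate of the SE of the sample mean

print(round(se_ybar, 3))
```

The population variance σ² is unknown, so the sample variance stands in for it; that substitution is exactly the ≈ in the formula above.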

Page 5:

Example: AFC data

Let's see how this works out in the AFC data... (R script)

Page 6:

Bootstrap vs Direct estimation of standard errors

Why bother with the bootstrap? It's more general, easy, often makes fewer assumptions, and works in cases where a mathematical expression for the standard error is impossible to obtain.

Why bother with direct estimation? Often faster to compute, and it tells us something about how the estimates behave. (How does the standard error of the sample mean change with the sample size? With the spread of the data (population variance)?)
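The two routes can be compared directly; a minimal Python sketch with made-up data (the course does this in R):

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical sample (any numeric data would do)
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n = len(y)

# Direct (formula-based) estimate: s_y / sqrt(n)
se_formula = statistics.stdev(y) / math.sqrt(n)

# Bootstrap estimate: resample with replacement, recompute the mean each time
boot_means = [statistics.mean(random.choices(y, k=n)) for _ in range(5000)]
se_boot = statistics.stdev(boot_means)

# The two estimates should be in the same ballpark
print(se_formula, se_boot)
```

With a sample this small the two won't match exactly (the bootstrap SE of a mean runs slightly low in tiny samples), but they agree closely as n grows.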

Page 7:

Normal approximations to sampling distributions

Page 8:

Normal approximations to sampling distributions

We've seen several examples where sampling distributions looked approximately normal.

This is not a coincidence! For many statistics the sampling distribution looks like a normal distribution, especially in large samples. This is the Central Limit Theorem at work.

Page 9:

Normal approximations to sampling distributions

For example, for a sample mean, if n is large then

ȳ ∼ N(µ, σ²/n)

(approximately). For sample means, this approximation is quite good.
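This can be checked by simulation; a Python sketch with a made-up, deliberately skewed population (an exponential with mean 2, so σ = 2):

```python
import math
import random
import statistics

random.seed(7)

# Skewed "population": exponential with mean mu = 2 (for which sigma = 2)
mu, sigma, n = 2.0, 2.0, 50

# Draw many samples of size n and record each sample mean
means = [statistics.mean(random.expovariate(1 / mu) for _ in range(n))
         for _ in range(2000)]

# The sampling distribution of ybar centers at mu with spread about sigma/sqrt(n),
# and its histogram looks normal even though the population is skewed
print(statistics.mean(means), statistics.stdev(means), sigma / math.sqrt(n))
```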

Page 10:

Confidence intervals

Page 11:

Confidence Intervals

At a high level, confidence intervals give us a set of plausible values for the quantity we're trying to estimate.

What do we mean by plausible? Consistent with the data we observe and what we expect the error to be in repeated samples (i.e., the spread of the sampling distribution).

We have a few different ways to compute confidence intervals...

Page 12:

Confidence Intervals (Standard Error Method)

Consider estimating a confidence interval for the sample mean. We have (approximately)

Ȳ ∼ N(µ, s²_Ȳ)

so our error has the distribution

(Ȳ − µ) ∼ N(0, s²_Ȳ)

- What is a good prediction for µ? What is our best guess? Ȳ
- How do we make mistakes? How far from µ might we be? About 95% of the time our error is ±2 × s_Ȳ
- [Ȳ ± 2 × s_Ȳ] gives a 95% confidence interval for µ. You can think of this as a set of plausible values for µ.

Page 13:

Confidence Intervals (Standard Error Method)

We can use a different critical value (number of standard errors) to get a confidence interval with a level other than 95%.

We can either compute an estimate for the standard error using the data directly (e.g., s_y/√n for a sample mean) or using the bootstrap.
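A sketch of the arithmetic in Python (made-up data; the course's examples use R). The critical value 2 (more precisely 1.96) gives roughly a 95% interval, and 1.645 roughly a 90% interval:

```python
import math
import statistics

# Hypothetical sample
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n = len(y)

ybar = statistics.mean(y)
se = statistics.stdev(y) / math.sqrt(n)   # direct SE estimate: s_y / sqrt(n)

# Standard error method: estimate +/- (critical value) x SE
ci_95 = (ybar - 2 * se, ybar + 2 * se)
ci_90 = (ybar - 1.645 * se, ybar + 1.645 * se)

print(ci_95, ci_90)
```

As expected, the 90% interval sits strictly inside the 95% interval: a lower level means a smaller critical value and a narrower set of plausible values.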

Page 14:

Confidence Intervals (Percentile Method)

When the normal approximation to the sampling distribution is good, a 90% confidence interval (for example) runs approximately from the 5th to the 95th percentile of the bootstrap distribution.

When the normal approximation is bad, we can use percentiles of the bootstrap distribution directly. (With some corrections; see footnotes in DSGI Ch 5.)

Often the percentile and standard error confidence intervals are close; if they differ, reporting the wider one (or the union of the two) is a reasonable thing to do.
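A percentile-method sketch in Python (hypothetical data; the course does this in R, and this sketch omits the DSGI corrections mentioned above):

```python
import random
import statistics

random.seed(4)

# Hypothetical sample
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n, B = len(y), 5000

# Bootstrap distribution of the sample mean, sorted for percentile lookup
boot_means = sorted(statistics.mean(random.choices(y, k=n)) for _ in range(B))

# Percentile-method 90% CI: 5th and 95th percentiles of the bootstrap distribution
lo = boot_means[int(0.05 * B)]
hi = boot_means[int(0.95 * B)]
print(lo, hi)
```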

Page 15:

Summary: 3 ways to estimate sampling distributions/CIs

There are three tools in our toolchest:

- The Central Limit Theorem: Assuming the estimator is (approximately or asymptotically) unbiased, compute the standard error and do calculations based on the normal distribution
- Bootstrapped standard errors: Use the bootstrap to estimate the standard error of the estimator and do calculations based on the normal distribution (CLT)
- Percentile bootstrap: Use the bootstrap to estimate the sampling distribution directly, and form confidence intervals using quantiles of the estimated sampling distribution

These are all easy to do in R. (See examples in the R script.)

Technically item 3 requires additional adjustments to be correct; see footnotes in DSGI.

Page 16:

Interpreting Confidence Intervals

However they're constructed, the goal of a 100(1 − α)% confidence interval is to "cover" the true quantity in 100(1 − α)% of the datasets it is computed from.

(See p. 119 of DSGI and the module linked from this week's post.)

In any particular dataset, we can't say whether the confidence interval actually contains the true value. But over many analyses, reporting a confidence interval as the set of plausible values means that our intervals will usually contain the true value. That is far better than reporting just the estimate!
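The coverage property can be checked by simulation; a Python sketch under an assumed normal population where we know the true µ:

```python
import math
import random
import statistics

random.seed(11)

# Known truth for the simulation (made-up values)
mu, sigma, n = 5.0, 2.0, 40
reps = 1000
covered = 0

for _ in range(reps):
    # A fresh dataset from the same population each time
    y = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = statistics.mean(y)
    se = statistics.stdev(y) / math.sqrt(n)
    # Does the standard-error-method 95% interval cover the truth?
    if ybar - 2 * se <= mu <= ybar + 2 * se:
        covered += 1

# Coverage should be close to 95% across repeated datasets
print(covered / reps)
```

Any single interval either covers µ or it doesn't; the 95% is a statement about the procedure over many datasets, which is exactly what the loop imitates.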

Page 17:

Revisiting standard errors and confidence intervals for regression models

Page 18:

Sampling Distribution of Regression Coefficients

Like the sample mean, regression coefficients estimated via least squares have their own central limit theorem, so in large samples their sampling distribution is approximately normal.

We've seen how to bootstrap regression models; with some assumptions we can directly estimate their standard errors too. In particular, we assume:

- The residuals are independent of each other
- The residual standard deviation is constant (the spread of the residuals doesn't change with X, or other factors like time)

Page 19:

Sampling distribution of the slope

Can we intuit what should be in the formula for the standard error of the slope, s_β1?

- How should the residual standard deviation s_e figure in the formula?
- What about n?
- Anything else?

s²_β1 ≈ s²_e / Σ(x_i − x̄)² = s²_e / ((n − 1) s²_x)

Three factors: sample size (n), residual variance (s²_e), and X-spread (s_x).
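A numeric check of the slope formula in Python, fitting least squares by hand on made-up (x, y) data (in the course these numbers come from lm in R):

```python
import math
import statistics

# Hypothetical (x, y) data with a roughly linear relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

xbar, ybar = statistics.mean(x), statistics.mean(y)

# Least-squares slope and intercept
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residual variance s_e^2 (divide by n - 2, the residual degrees of freedom)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2_e = sum(r * r for r in resid) / (n - 2)

# SE of the slope: sqrt( s_e^2 / ((n - 1) s_x^2) ), i.e. sqrt(s_e^2 / Sxx)
se_b1 = math.sqrt(s2_e / ((n - 1) * statistics.variance(x)))
print(b1, se_b1)
```

Note the three factors at work: more data (larger n), tighter residuals (smaller s²_e), or more spread in x all shrink the SE.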

Page 20:

Sampling distribution of the intercept

s²_β0 ≈ s²_e (1/n + (1/(n − 1)) (x̄/s_x)²)

Three factors: sample size (n), residual variance (s²_e), and the standardized distance between x̄ and zero:

x̄/s_x = (x̄ − 0)/s_x
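The same check works for the intercept formula; a Python sketch (the x values and the residual variance plugged in below are made up for illustration):

```python
import math
import statistics

# Hypothetical x values; suppose s_e^2 was already estimated from the residuals
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
s2_e = 0.0277   # made-up residual variance estimate

n = len(x)
xbar = statistics.mean(x)
s_x = statistics.stdev(x)

# SE of the intercept, per the formula above:
# sqrt( s_e^2 * (1/n + (1/(n-1)) * (xbar/s_x)^2) )
se_b0 = math.sqrt(s2_e * (1 / n + (1 / (n - 1)) * (xbar / s_x) ** 2))
print(se_b0)
```

The farther x̄ sits from zero (in standard-deviation units of x), the more extrapolation the intercept requires, and the larger its standard error.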

Page 21:

Extracting these standard errors from lm

R script...

Page 22:

About that bootstrap...

Again: these formulas "work" when the residuals are independent and have the same variance (i.e., the spread of values around the regression line is constant).

The bootstrapped SEs don't require the constant variance assumption, and can be more appropriate if it seems to be violated (see the AFC example in the R script).

But if the constant variance assumption seems OK, the standard errors/confidence intervals from lm will tend to be good in large samples and/or when the residuals are approximately normal.
