
Page 1: More on Sampling Distributions and Confidence Intervals · 2020-05-03

More on Sampling Distributions and Confidence Intervals

Jared S. Murray
The University of Texas at Austin
McCombs School of Business

Page 2:

Recall: Sampling Distributions and Standard Errors

Sampling distributions describe how our estimates are likely to change if we had seen slightly different data (a different sample from the same population).

Large spread in the sampling distribution → low confidence that our estimate (which is one random draw from this distribution) is close to the true value (usually the mean, or close to the mean, of this distribution).

An estimate's standard error is the standard deviation (spread) of its sampling distribution.

Page 3:

Estimating standard errors

Page 4:

Estimating standard errors

We saw last week how to estimate sampling distributions and standard errors using the bootstrap. This approach is useful, general, and easy to implement.

For some important statistics, we can also directly calculate estimates of standard errors, under some assumptions. This is probably how you did it in your last stats class.

For example: the standard error of the sample mean ȳ is

s_ȳ = √(σ²/n) ≈ √(s²_y/n)

where n is the sample size, σ² is the population variance of y, and s²_y is the sample variance of y.
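As a numeric sketch of the plug-in formula (Python rather than the course's R; the data here are made up for illustration):

```python
import math
import statistics

# Hypothetical sample of n = 8 observations
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n = len(y)

s2_y = statistics.variance(y)   # sample variance s^2_y (divides by n - 1)
se_ybar = math.sqrt(s2_y / n)   # plug-in estimate of the SE of the sample mean

print(round(se_ybar, 3))
```

The population variance σ² is unknown, so the sample variance stands in for it; that substitution is exactly the ≈ in the formula above.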

Page 5:

Example: AFC data

Let's see how this works out in the AFC data... (R script)

Page 6:

Bootstrap vs Direct estimation of standard errors

Why bother with the bootstrap? It's more general, easy, often makes fewer assumptions, and works in cases where a mathematical expression for the standard error is impossible to obtain.

Why bother with direct estimation? Often faster to compute, and it tells us something about how the estimates behave. (How does the standard error of the sample mean change with the sample size? With the spread of the data (population variance)?)
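The two routes can be compared directly; a minimal Python sketch with made-up data (the course does this in R):

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical sample (any numeric data would do)
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n = len(y)

# Direct (formula-based) estimate: s_y / sqrt(n)
se_formula = statistics.stdev(y) / math.sqrt(n)

# Bootstrap estimate: resample with replacement, recompute the mean each time
boot_means = [statistics.mean(random.choices(y, k=n)) for _ in range(5000)]
se_boot = statistics.stdev(boot_means)

# The two estimates should be in the same ballpark
print(se_formula, se_boot)
```

With a sample this small the two won't match exactly (the bootstrap SE of a mean runs slightly low in tiny samples), but they agree closely as n grows.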

Page 7:

Normal approximations to sampling distributions

Page 8:

Normal approximations to sampling distributions

We've seen several examples where sampling distributions looked approximately normal.

This is not a coincidence! For many statistics the sampling distribution looks like a normal distribution, especially in large samples. This is the Central Limit Theorem at work.

Page 9:

Normal approximations to sampling distributions

For example, for a sample mean, if n is large then

ȳ ∼ N(µ, σ²/n)

(approximately). For sample means, this approximation is quite good.
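This can be checked by simulation; a Python sketch with a made-up, deliberately skewed population (an exponential with mean 2, so σ = 2):

```python
import math
import random
import statistics

random.seed(7)

# Skewed "population": exponential with mean mu = 2 (for which sigma = 2)
mu, sigma, n = 2.0, 2.0, 50

# Draw many samples of size n and record each sample mean
means = [statistics.mean(random.expovariate(1 / mu) for _ in range(n))
         for _ in range(2000)]

# The sampling distribution of ybar centers at mu with spread about sigma/sqrt(n),
# and its histogram looks normal even though the population is skewed
print(statistics.mean(means), statistics.stdev(means), sigma / math.sqrt(n))
```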

Page 10:

Confidence intervals

Page 11:

Confidence Intervals

At a high level, confidence intervals give us a set of plausible values for the quantity we're trying to estimate.

What do we mean by plausible? Consistent with the data we observe and what we expect the error to be in repeated samples (i.e., the spread of the sampling distribution).

We have a few different ways to compute confidence intervals...

Page 12:

Confidence Intervals (Standard Error Method)

Consider estimating a confidence interval for the sample mean. We have (approximately)

Ȳ ∼ N(µ, s²_Ȳ)

so our error has the distribution

(Ȳ − µ) ∼ N(0, s²_Ȳ)

- What is a good prediction for µ? What is our best guess? Ȳ
- How do we make mistakes? How far from µ might we be? About 95% of the time our error is ±2 × s_Ȳ
- [Ȳ ± 2 × s_Ȳ] gives a 95% confidence interval for µ. You can think of this as a set of plausible values for µ.

Page 13:

Confidence Intervals (Standard Error Method)

We can use a different critical value (number of standard errors) to get a confidence interval with a level other than 95%.

We can either compute an estimate for the standard error using the data directly (e.g., s_y/√n for a sample mean) or using the bootstrap.
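A sketch of the arithmetic in Python (made-up data; the course's examples use R). The critical value 2 (more precisely 1.96) gives roughly a 95% interval, and 1.645 roughly a 90% interval:

```python
import math
import statistics

# Hypothetical sample
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n = len(y)

ybar = statistics.mean(y)
se = statistics.stdev(y) / math.sqrt(n)   # direct SE estimate: s_y / sqrt(n)

# Standard error method: estimate +/- (critical value) x SE
ci_95 = (ybar - 2 * se, ybar + 2 * se)
ci_90 = (ybar - 1.645 * se, ybar + 1.645 * se)

print(ci_95, ci_90)
```

As expected, the 90% interval sits strictly inside the 95% interval: a lower level means a smaller critical value and a narrower set of plausible values.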

Page 14:

Confidence Intervals (Percentile Method)

When the normal approximation to the sampling distribution is good, a 90% confidence interval (for example) runs approximately from the 5th to the 95th percentile of the bootstrap distribution.

When the normal approximation is bad, we can use percentiles of the bootstrap distribution directly. (With some corrections; see footnotes in DSGI Ch 5.)

Often the percentile and standard error confidence intervals are close; if they differ, reporting the wider one (or the union of the two) is a reasonable thing to do.
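A percentile-method sketch in Python (hypothetical data; the course does this in R, and this sketch omits the DSGI corrections mentioned above):

```python
import random
import statistics

random.seed(4)

# Hypothetical sample
y = [12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.9, 11.4]
n, B = len(y), 5000

# Bootstrap distribution of the sample mean, sorted for percentile lookup
boot_means = sorted(statistics.mean(random.choices(y, k=n)) for _ in range(B))

# Percentile-method 90% CI: 5th and 95th percentiles of the bootstrap distribution
lo = boot_means[int(0.05 * B)]
hi = boot_means[int(0.95 * B)]
print(lo, hi)
```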

Page 15:

Summary: 3 ways to estimate sampling distributions/CIs

There are three tools in our toolchest:

- The Central Limit Theorem: Assuming the estimator is (approximately or asymptotically) unbiased, compute the standard error and do calculations based on the normal distribution
- Bootstrapped standard errors: Use the bootstrap to estimate the standard error of the estimator and do calculations based on the normal distribution (CLT)
- Percentile bootstrap: Use the bootstrap to estimate the sampling distribution directly, and form confidence intervals using quantiles of the estimated sampling distribution

These are all easy to do in R. (See examples in the R script.)

Technically item 3 requires additional adjustments to be correct; see footnotes in DSGI.

Page 16:

Interpreting Confidence Intervals

However they're constructed, the goal of a 100(1 − α)% confidence interval is to "cover" the true quantity in 100(1 − α)% of the datasets it is computed from.

(See p. 119 of DSGI and the module linked from this week's post.)

In any particular dataset, we can't say whether the confidence interval actually contains the true value. But over many analyses, reporting a confidence interval as the set of plausible values means that our intervals will usually contain the true value. That is far better than reporting just the estimate!
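The coverage property can be checked by simulation; a Python sketch under an assumed normal population where we know the true µ:

```python
import math
import random
import statistics

random.seed(11)

# Known truth for the simulation (made-up values)
mu, sigma, n = 5.0, 2.0, 40
reps = 1000
covered = 0

for _ in range(reps):
    # A fresh dataset from the same population each time
    y = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = statistics.mean(y)
    se = statistics.stdev(y) / math.sqrt(n)
    # Does the standard-error-method 95% interval cover the truth?
    if ybar - 2 * se <= mu <= ybar + 2 * se:
        covered += 1

# Coverage should be close to 95% across repeated datasets
print(covered / reps)
```

Any single interval either covers µ or it doesn't; the 95% is a statement about the procedure over many datasets, which is exactly what the loop imitates.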

Page 17:

Revisiting standard errors and confidence intervals for regression models

Page 18:

Sampling Distribution of Regression Coefficients

Like the sample mean, regression coefficients estimated via least squares have their own central limit theorem, so in large samples their sampling distribution is approximately normal.

We've seen how to bootstrap regression models; with some assumptions we can directly estimate their standard errors too. In particular, we assume:

- The residuals are independent of each other
- The residual standard deviation is constant (the spread of the residuals doesn't change with X, or other factors like time)

Page 19:

Sampling distribution of the slope

Can we intuit what should be in the formula for the standard error of the slope, s_β1?

- How should the residual standard deviation s_e figure in the formula?
- What about n?
- Anything else?

s²_β1 ≈ s²_e / Σ(x_i − x̄)² = s²_e / ((n − 1) s²_x)

Three factors: sample size (n), residual variance (s²_e), and X-spread (s_x).
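A numeric check of the slope formula in Python, fitting least squares by hand on made-up (x, y) data (in the course these numbers come from lm in R):

```python
import math
import statistics

# Hypothetical (x, y) data with a roughly linear relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

xbar, ybar = statistics.mean(x), statistics.mean(y)

# Least-squares slope and intercept
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residual variance s_e^2 (divide by n - 2, the residual degrees of freedom)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2_e = sum(r * r for r in resid) / (n - 2)

# SE of the slope: sqrt( s_e^2 / ((n - 1) s_x^2) ), i.e. sqrt(s_e^2 / Sxx)
se_b1 = math.sqrt(s2_e / ((n - 1) * statistics.variance(x)))
print(b1, se_b1)
```

Note the three factors at work: more data (larger n), tighter residuals (smaller s²_e), or more spread in x all shrink the SE.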

Page 20:

Sampling distribution of the intercept

s²_β0 ≈ s²_e (1/n + (1/(n − 1)) (x̄/s_x)²)

Three factors: sample size (n), residual variance (s²_e), and the standardized distance between x̄ and zero:

x̄/s_x = (x̄ − 0)/s_x
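The same check works for the intercept formula; a Python sketch (the x values and the residual variance plugged in below are made up for illustration):

```python
import math
import statistics

# Hypothetical x values; suppose s_e^2 was already estimated from the residuals
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
s2_e = 0.0277   # made-up residual variance estimate

n = len(x)
xbar = statistics.mean(x)
s_x = statistics.stdev(x)

# SE of the intercept, per the formula above:
# sqrt( s_e^2 * (1/n + (1/(n-1)) * (xbar/s_x)^2) )
se_b0 = math.sqrt(s2_e * (1 / n + (1 / (n - 1)) * (xbar / s_x) ** 2))
print(se_b0)
```

The farther x̄ sits from zero (in standard-deviation units of x), the more extrapolation the intercept requires, and the larger its standard error.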

Page 21:

Extracting these standard errors from lm

R script...

Page 22:

About that bootstrap...

Again: these formulas "work" when the residuals are independent and have the same variance (i.e., the spread of values around the regression line is constant).

The bootstrapped SEs don't require the constant variance assumption, and can be more appropriate if it seems to be violated (see the AFC example in the R script).

But if the constant variance assumption seems OK, the standard errors/confidence intervals from lm will tend to be good in large samples and/or when the residuals are approximately normal.
