
Bootstrap (Part 3)

Christof Seiler

Stanford University, Spring 2016, Stats 205

Overview

- So far we used three different bootstraps:
  - Nonparametric bootstrap on the rows (e.g. regression, PCA with random rows and columns)
  - Nonparametric bootstrap on the residuals (e.g. regression)
  - Parametric bootstrap (e.g. PCA with fixed rows and columns)
- Today, we will look at some tricks to improve the bootstrap for confidence intervals:
  - Studentized bootstrap

Introduction

- A statistic is (asymptotically) pivotal if its limiting distribution does not depend on unknown quantities
- For example, with observations X1, ..., Xn from a normal distribution with unknown mean and variance, a pivotal quantity is

      T(X1, ..., Xn) = √n (θ̂ − θ) / σ̂

  with unbiased estimates for sample mean and variance

      θ̂ = (1/n) ∑_{i=1}^n Xi        σ̂² = (1/(n − 1)) ∑_{i=1}^n (Xi − θ̂)²

- Then T(X1, ..., Xn) is a pivot following the Student's t-distribution with ν = n − 1 degrees of freedom
- This is because the distribution of T(X1, ..., Xn) does not depend on µ or σ²

Introduction

- The bootstrap is better at estimating the distribution of a pivotal statistic than that of a nonpivotal statistic
- We will see an asymptotic argument using Edgeworth expansions
- But first, let us look at an example

Motivation

- Take n = 20 random exponential variables with mean 3

      n = 20
      x = rexp(n, rate = 1/3)

- Generate B = 1000 bootstrap samples of x, and calculate the mean for each bootstrap sample

      B = 1000
      s = numeric(B)
      for (j in 1:B) {
        boot = sample(n, replace = TRUE)
        s[j] = mean(x[boot])
      }

- Form a confidence interval from the bootstrap samples using quantiles (α = .025)

      simple.ci = quantile(s, c(.025, .975))

- Repeat this process 100 times
- Check how often the intervals actually contain the true mean
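The "repeat 100 times and check coverage" step can be sketched in base R as follows. The loop structure and seed are illustrative additions; the setup (n = 20 exponentials with mean 3, B = 1000 bootstrap samples) is from the slides.

```r
# Coverage check for the simple percentile interval (sketch)
set.seed(205)
n = 20; B = 1000; true.mean = 3
covered = 0
for (rep in 1:100) {
  x = rexp(n, rate = 1/3)
  s = numeric(B)
  for (j in 1:B) {
    boot = sample(n, replace = TRUE)
    s[j] = mean(x[boot])
  }
  simple.ci = quantile(s, c(.025, .975))
  if (simple.ci[1] <= true.mean && true.mean <= simple.ci[2])
    covered = covered + 1
}
covered / 100  # empirical coverage over the 100 repetitions
```

For skewed data such as the exponential, this empirical coverage tends to fall short of the nominal 95%, which motivates the pivotal construction that follows.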

Motivation

[Figure: "bootstrap conf intervals" — the 100 simple percentile intervals plotted against the true mean]

Motivation

- Another way is to bootstrap a pivotal quantity instead of the statistic itself
- Calculate the mean and standard deviation

      x = rexp(n, rate = 1/3)
      mean.x = mean(x)
      sd.x = sd(x)

- For each bootstrap sample, calculate

      z = numeric(B)
      for (j in 1:B) {
        boot = sample(n, replace = TRUE)
        z[j] = (mean.x - mean(x[boot])) / sd(x[boot])
      }

- Form a confidence interval like this

      pivot.ci = mean.x + sd.x * quantile(z, c(.025, .975))
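The same 100-repetition coverage check can be run for this pivotal interval. This is a sketch with an illustrative seed; the interval construction is the one from the slide above.

```r
# Coverage check for the pivotal interval (sketch; same setup as before)
set.seed(205)
n = 20; B = 1000; true.mean = 3
covered = 0
for (rep in 1:100) {
  x = rexp(n, rate = 1/3)
  mean.x = mean(x)
  sd.x = sd(x)
  z = numeric(B)
  for (j in 1:B) {
    boot = sample(n, replace = TRUE)
    z[j] = (mean.x - mean(x[boot])) / sd(x[boot])
  }
  pivot.ci = mean.x + sd.x * quantile(z, c(.025, .975))
  if (pivot.ci[1] <= true.mean && true.mean <= pivot.ci[2])
    covered = covered + 1
}
covered / 100  # empirical coverage over the 100 repetitions
```

The pivotal intervals are typically closer to the nominal 95% coverage than the simple percentile intervals, which is the point of the comparison.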

Motivation

[Figure: "bootstrap conf intervals" — the 100 pivotal intervals plotted against the true mean]

Studentized Bootstrap

- Consider X1, ..., Xn from F
- Let θ̂ be an estimate of some parameter θ
- Let σ̂ be a standard error for θ̂ estimated using the bootstrap
- Most of the time, as n grows,

      (θ̂ − θ) / σ̂ ∼̇ N(0, 1)

- Let z^(α) be the 100·α-th percentile of N(0, 1)
- Then a standard confidence interval with coverage probability 1 − 2α is

      θ̂ ± z^(1−α) · σ̂

- As n → ∞, the bootstrap and standard intervals converge
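The standard interval above can be computed in a couple of lines. This sketch reuses the exponential sample from the motivation (an illustrative assumption; any sample works) and takes σ̂ to be the usual standard error of the mean.

```r
# Standard normal interval: theta.hat ± z^(1-alpha) * sigma.hat (sketch)
set.seed(205)
n = 20
x = rexp(n, rate = 1/3)
alpha = .025
theta.hat = mean(x)
sigma.hat = sd(x) / sqrt(n)          # standard error of the mean
std.ci = theta.hat + c(-1, 1) * qnorm(1 - alpha) * sigma.hat
```

Note that this interval is always symmetric around θ̂, which is exactly what the studentized bootstrap below relaxes.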

Studentized Bootstrap

- How can we improve the standard confidence interval?
- These intervals are valid under the assumption that

      Z = (θ̂ − θ) / σ̂ ∼̇ N(0, 1)

- But this is only exact as n → ∞
- And only approximate for finite n
- When θ̂ is the sample mean, a better approximation is

      Z = (θ̂ − θ) / σ̂ ∼̇ t_{n−1}

  where t_{n−1} is the Student's t distribution with n − 1 degrees of freedom

Studentized Bootstrap

- With this new approximation, we have

      θ̂ ± t_{n−1}^{(1−α)} · σ̂

- As n grows, the t distribution converges to the normal distribution
- Intuitively, it widens the interval to account for the unknown standard error
- But, for instance, it does not account for skewness in the underlying population
- This can happen when θ̂ is not the sample mean
- The studentized bootstrap can adjust for such errors

Studentized Bootstrap

- We estimate the distribution of

      Z = (θ̂ − θ) / σ̂ ∼̇ ?

- by generating B bootstrap samples X*1, X*2, ..., X*B
- and computing

      Z*b = (θ̂*b − θ̂) / σ̂*b

- Then the α-th percentile of Z*b is estimated by the value t̂^(α) such that

      #{Z*b ≤ t̂^(α)} / B = α

- This yields the studentized bootstrap interval

      (θ̂ − t̂^(1−α) · σ̂,  θ̂ − t̂^(α) · σ̂)
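The construction above can be sketched directly in base R. The exponential sample and the plug-in standard error of the mean for σ̂*b are illustrative assumptions; the interval formula is the one from the slide.

```r
# Studentized bootstrap interval (sketch):
# (theta.hat - t^(1-alpha)*sigma.hat, theta.hat - t^(alpha)*sigma.hat)
set.seed(205)
n = 20; B = 1000; alpha = .025
x = rexp(n, rate = 1/3)
theta.hat = mean(x)
sigma.hat = sd(x) / sqrt(n)

z.star = numeric(B)
for (b in 1:B) {
  xb = x[sample(n, replace = TRUE)]
  # Z*b = (theta*b - theta.hat) / sigma*b
  z.star[b] = (mean(xb) - theta.hat) / (sd(xb) / sqrt(n))
}

# t^(alpha) such that #{Z*b <= t^(alpha)} / B = alpha
t.lo = quantile(z.star, alpha)
t.hi = quantile(z.star, 1 - alpha)

stud.ci = c(theta.hat - t.hi * sigma.hat, theta.hat - t.lo * sigma.hat)
```

Because the percentiles t̂^(α) and t̂^(1−α) come from the bootstrap distribution rather than a symmetric table, the resulting interval can be asymmetric around θ̂.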

Asymptotic Argument in Favor of Pivoting

- Consider a parameter θ estimated by θ̂ with variance σ²/n
- Take the pivotal statistic

      S = √n (θ̂ − θ) / σ̂

  with estimate θ̂ and asymptotic variance estimate σ̂²
- Then, we can use an Edgeworth expansion

      P(S ≤ x) = Φ(x) + q(x) φ(x) / √n + O(1/n)

  with
  Φ the standard normal distribution function,
  φ the standard normal density, and
  q an even polynomial of degree 2

Asymptotic Argument in Favor of Pivoting

- The bootstrap version of the statistic is

      S* = √n (θ̂* − θ̂) / σ̂*

- Again, we can use an Edgeworth expansion

      P(S* ≤ x | X1, ..., Xn) = Φ(x) + q̂(x) φ(x) / √n + O(1/n)

- q̂ is obtained by replacing the unknowns in q with bootstrap estimates
- Asymptotically, we further have

      q̂ − q = O(1/√n)

Asymptotic Argument in Favor of Pivoting

- Then, the error of the bootstrap approximation to the distribution of S is

      P(S ≤ x) − P(S* ≤ x | X1, ..., Xn)
        = (Φ(x) + q(x) φ(x) / √n + O(1/n)) − (Φ(x) + q̂(x) φ(x) / √n + O(1/n))
        = O(1/n)

  since q̂ − q = O(1/√n) contributes (q − q̂) φ(x) / √n = O(1/n)
- Compare this to the normal approximation, whose error is O(1/√n)
- which is the same as the error of the bootstrap applied to a nonpivotal statistic (as can be shown with the same argument)

Studentized Bootstrap

- These pivotal intervals are more accurate in large samples than standard intervals and t intervals
- Accuracy comes at the cost of generality:
  - standard normal tables apply to all samples and all sample sizes
  - t tables apply to all samples of fixed n
  - studentized bootstrap tables apply only to the given sample
- The studentized bootstrap interval can be asymmetric
- It can be used for simple statistics, like the mean, median, trimmed mean, and sample percentiles
- But for more general statistics, like the correlation coefficient, there are some problems:
  - the interval can fall outside of the allowable range
  - computational issues arise if both the parameter and its standard error have to be bootstrapped

Studentized Bootstrap

- The studentized bootstrap works better for variance-stabilized parameters
- Consider a random variable X with mean θ and standard deviation s(θ) that varies as a function of θ
- Using the delta method and solving an ordinary differential equation, we can show that

      g(x) = ∫^x 1/s(u) du

  will make the variance of g(X) approximately constant
- Usually s(u) is unknown
- So we need to estimate s(u) = se(θ̂ | θ = u) using the bootstrap

Studentized Bootstrap

1. First bootstrap θ̂, then run a second bootstrap to get se(θ̂*) from each θ̂*
2. Fit a curve through the points (θ̂*1, se(θ̂*1)), ..., (θ̂*B, se(θ̂*B))
3. Obtain the variance stabilization g(θ̂) by numerical integration
4. Run the studentized bootstrap using g(θ̂*) − g(θ̂) (no denominator, since the variance is now approximately one)
5. Map the interval back through the inverse transformation g^{−1}

Source: Efron and Tibshirani (1994)
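A minimal sketch of these steps, under the simplifying assumption that we are back in the exponential-mean example: there se(θ̂) is proportional to θ, so s(u) ∝ u and g(x) = ∫^x 1/u du = log x can be written down analytically, replacing the curve fit and numerical integration of steps 2–3.

```r
# Variance-stabilized studentized bootstrap (sketch). For the mean of
# exponentials, se(theta.hat) is proportional to theta, so g = log
# stabilizes the variance; steps 2-3 are therefore done analytically here.
set.seed(205)
n = 20; B = 1000
x = rexp(n, rate = 1/3)
theta.hat = mean(x)

# step 1: bootstrap theta
theta.star = replicate(B, mean(x[sample(n, replace = TRUE)]))

# step 4: studentized bootstrap on the g scale, no denominator needed
z = log(theta.star) - log(theta.hat)
q = quantile(z, c(.025, .975))

# step 5: map the interval back through g^{-1} = exp
stab.ci = exp(log(theta.hat) - q[c(2, 1)])
```

In the general case, where s(u) is unknown, steps 2–3 would instead fit a smooth curve through the nested-bootstrap points and integrate 1/s(u) numerically, as the algorithm above describes.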

Studentized Bootstrap in R

    library(boot)
    mean.fun = function(d, i) {
      m = mean(d$hours[i])
      n = length(i)
      v = (n-1)*var(d$hours[i])/n^2
      c(m, v)
    }
    air.boot <- boot(aircondit, mean.fun, R = 999)
    results = boot.ci(air.boot, type = c("basic", "stud"))

Studentized Bootstrap in R

    results

    ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    ## Based on 999 bootstrap replicates
    ##
    ## CALL :
    ## boot.ci(boot.out = air.boot, type = c("basic", "stud"))
    ##
    ## Intervals :
    ## Level      Basic             Studentized
    ## 95%   ( 22.2, 171.2 )   ( 49.0, 303.0 )
    ## Calculations and Intervals on Original Scale

References

- Efron (1987). Better Bootstrap Confidence Intervals
- Hall (1992). The Bootstrap and Edgeworth Expansion
- Efron and Tibshirani (1994). An Introduction to the Bootstrap
- Love (2010). Bootstrap-t Confidence Intervals (link to blog entry)
