BOOTSTRAPPING LINEAR MODELS
Stat 6601 Presentation
Presented by:
Xiao Li (Winnie)
Wenlai Wang
Ke Xu
Nov. 17, 2004
V & R 6.6
Preview of the Presentation
- Introduction to Bootstrap
- Data and Modeling
- Methods on Bootstrapping LM
- Results
- Issues and Discussion
- Summary
What is Bootstrapping?
Invented by Bradley Efron, and further developed by Efron and Tibshirani
A method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample
A method to determine the trustworthiness of a statistic (generalization of the standard deviation)
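The idea can be made concrete in a few lines. Below is a minimal Python sketch (not the R code used later in these slides; `bootstrap_se` and the toy sample are purely illustrative): draw many resamples with replacement, recompute the statistic on each, and take the standard deviation of the replicates.

```python
import random
import statistics

def bootstrap_se(data, stat, R=999, seed=1):
    """Bootstrap standard error: recompute `stat` on R resamples
    drawn with replacement from `data`, then take the SD of the replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(R)]
    return statistics.stdev(reps)

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
print(bootstrap_se(sample, statistics.mean))  # close to the classical s/sqrt(n)
```

The same function works for statistics with no closed-form standard error (medians, ratios, regression coefficients), which is the point of the method.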
Why Use Bootstrapping?
Start with two questions:
- What estimator should be used?
- Having chosen an estimator, how accurate is it?
For a linear model with normal random errors and constant variance: least squares.
For generalized non-normal errors and non-constant variance: ??? (the bootstrap offers an answer)
The Mammals Data
A data frame with average brain and body weights for 62 species of land mammals:
- "body": body weight in kg
- "brain": brain weight in g
- "name": common name of the species
Data and Model
[Figure: scatterplots of the "Original Data" (brain weight vs. body weight) and the "Log-Transformed Data" (log brain weight vs. log body weight)]
Linear Regression Model:
y_j = β0 + β1 x_j + ε_j,  j = 1, …, n, where ε_j is considered random
y = log(brain weight)
x = log(body weight)
Summary of Original Fit
Residuals:
Min 1Q Median 3Q Max
-1.71550 -0.49228 -0.06162 0.43597 1.94829
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.13479 0.09604 22.23 <2e-16 ***
log(body) 0.75169 0.02846 26.41 <2e-16 ***
Residual standard error: 0.6943 on 60 DF
Multiple R-Squared: 0.9208
Adjusted R-squared: 0.9195
F-statistic: 697.4 on 1 and 60 DF
p-value: < 2.2e-16
Code for Original Modeling
library(MASS)    # provides the mammals data
library(boot)

par(mfrow=c(1,2))
data(mammals)
# plot of data
plot(mammals$body, mammals$brain, main='Original Data',
     xlab='body weight', ylab='brain weight', col='brown')
# plot of log-transformed data
plot(log(mammals$body), log(mammals$brain), main='Log-Transformed Data',
     xlab='log body weight', ylab='log brain weight', col='brown')

mammal <- data.frame(body=log(mammals$body), brain=log(mammals$brain))
log.fit <- lm(brain ~ body, data=mammal)
summary(log.fit)
Two Methods
Case-based resampling: randomly sample pairs (X_j, Y_j) with replacement
- No assumption about variance homogeneity
- The design fixes the information content of a sample
Model-based resampling: resample the residuals
- Assumes the model is correct, with homoscedastic errors
- The resampling model has the same "design" as the data
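The two schemes differ only in what gets resampled. A self-contained Python sketch of the contrast (the helper names `fit_line`, `case_resample`, and `model_resample` are illustrative, not from these slides' R code):

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def case_resample(xs, ys, rng):
    """Case-based: resample whole (x, y) pairs, so the design varies."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]

def model_resample(xs, ys, rng):
    """Model-based: keep the design x fixed, resample the residuals."""
    a, b = fit_line(xs, ys)
    res = [y - (a + b * x) for x, y in zip(xs, ys)]
    star = [res[rng.randrange(len(res))] for _ in xs]
    return xs, [a + b * x + e for x, e in zip(xs, star)]

rng = random.Random(1)
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x + rng.uniform(-1, 1) for x in xs]
slopes = [fit_line(*model_resample(xs, ys, rng))[1] for _ in range(500)]
```

Swapping `model_resample` for `case_resample` in the last line gives case-based replicates; the SD of `slopes` then estimates the slope's standard error under each scheme.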
Case-Based Resample Algorithm
For r = 1, …, R:
1. Sample i_1*, …, i_n* randomly with replacement from {1, 2, …, n}
2. For j = 1, …, n, set (x_j*, y_j*) = (x_{i_j*}, y_{i_j*})
3. Fit a least-squares regression to (x_1*, y_1*), …, (x_n*, y_n*), giving estimates β0*, β1*, s*²
Model-Based Resample Algorithm
For r = 1, …, R:
1. For j = 1, …, n:
   a) Set x_j* = x_j
   b) Randomly sample ε_j* from the residuals e_1, …, e_n
   c) Set y_j* = ŷ_j + ε_j*
2. Fit a least-squares regression to (x_1*, y_1*), …, (x_n*, y_n*), giving estimates β0*, β1*, s*²
Case-Based Bootstrap
ORDINARY NONPARAMETRIC BOOTSTRAP
Bootstrap Statistics :
original bias std. error
t1* 2.134789 -0.0022155790 0.08708311
t2* 0.751686 0.0001295280 0.02277497
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Intervals :
Level Normal Percentile BCa
95% ( 1.966, 2.308 ) ( 1.963, 2.310 ) ( 1.974, 2.318 )
95% ( 0.7069, 0.7962 ) ( 0.7082, 0.7954 ) ( 0.7080, 0.7953 )
Calculations and Intervals on Original Scale
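The percentile intervals in this output can be mimicked directly from bootstrap replicates: sort them and read off the empirical quantiles. A Python sketch (the `percentile_ci` helper is hypothetical; R's `boot.ci` uses refined quantile interpolation, so results differ slightly):

```python
def percentile_ci(reps, alpha=0.05):
    """Percentile bootstrap CI: the alpha/2 and 1 - alpha/2 empirical
    quantiles of the sorted bootstrap replicates."""
    s = sorted(reps)
    k = int(len(s) * alpha / 2)
    return s[k], s[len(s) - 1 - k]

# With 1000 replicates, a 95% interval clips off the bottom and top 25 values.
lo, hi = percentile_ci(list(range(1, 1001)))
print(lo, hi)  # 26 975
```

Normal and BCa intervals require more work (a normal approximation with bias correction, and acceleration estimated by jackknife, respectively), which is why the `boot` package reports all three.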
Case-Based Bootstrap
[Figure: histograms of t* with normal Q-Q plots]
Bootstrap Distribution Plots for Intercept and Slope
Case-Based Bootstrap
Standardized Jackknife-after-Bootstrap Plots for Intercept and Slope
[Figure: 5, 10, 16, 50, 84, 90, 95 percentiles of (T* - t) plotted against standardized jackknife values, with case numbers along the axis]
Code for Case-Based Resampling
# Case-Based Resampling
fit.case <- function(data) coef(lm(log(data$brain)~log(data$body)))
mam.case <- function(data, i) fit.case(data[i, ])
mam.case.boot <- boot(mammals, mam.case, R = 999)
mam.case.boot
boot.ci(mam.case.boot, type=c("norm", "perc", "bca"))
boot.ci(mam.case.boot, index=2, type=c("norm", "perc", "bca"))
plot(mam.case.boot)
plot(mam.case.boot, index=2)
jack.after.boot(mam.case.boot)
jack.after.boot(mam.case.boot, index=2)
Model-Based Bootstrap
ORDINARY NONPARAMETRIC BOOTSTRAP
Bootstrap Statistics :
original bias std. error
t1* 2.134789 0.0049756072 0.09424796
t2* 0.751686 -0.0006573983 0.02719809
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Intervals :
Level Normal Percentile BCa
95% ( 1.945, 2.315 ) ( 1.948, 2.322 ) ( 1.941, 2.316 )
95% ( 0.6990, 0.8057 ) ( 0.6982, 0.8062 ) ( 0.6987, 0.8077 )
Calculations and Intervals on Original Scale
Model-Based Bootstrap
[Figure: histograms of t* with normal Q-Q plots]
Bootstrap Distribution Plots for Intercept and Slope
Model-Based Bootstrap
Standardized Jackknife-after-Bootstrap Plots for Intercept and Slope
[Figure: 5, 10, 16, 50, 84, 90, 95 percentiles of (T* - t) plotted against standardized jackknife values, with case numbers along the axis]
Code for Model-Based Resampling
# Model-Based Resampling (Resample Residuals)
fit.res <- lm(brain ~ body, data=mammal)
mam.res.data <- data.frame(mammal, res=resid(fit.res), fitted=fitted(fit.res))
mam.res <- function(data, i){
d <- data
d$brain <- d$fitted + d$res[i]
coef(update(fit.res, data=d))
}
fit.res.boot <- boot(mam.res.data, mam.res, R = 999)
fit.res.boot
boot.ci(fit.res.boot, type=c("norm", "perc", "bca"))
boot.ci(fit.res.boot, index=2, type=c("norm", "perc", "bca"))
plot(fit.res.boot)
plot(fit.res.boot, index=2)
jack.after.boot(fit.res.boot)
jack.after.boot(fit.res.boot, index=2)
Comparisons and Discussion
Field             Original Model   Case-Based (Fixed)   Model-Based (Random)
Intercept (t1*)   2.13479          2.134789             2.134789
  Std. Error      0.09604          0.08708311           0.09424796
Slope (t2*)       0.75169          0.751686             0.751686
  Std. Error      0.02846          0.02277497           0.02719809
Case-Based vs. Model-Based
- Model-based resampling enforces the assumption that errors are randomly distributed by resampling the residuals from a common distribution
- If the model is not specified correctly (e.g., unmodeled nonlinearity, non-constant error variance, or outliers), these attributes do not carry over to the bootstrap samples
- The effect of outliers is clear in the case-based results, but not in the model-based results
2222
When Might Bootstrapping When Might Bootstrapping Fail?Fail?
11/17/2004
Bootstrapping Linear Models
Incomplete data
- Assumes that missing data are not problematic
- Safer if multiple imputation is used beforehand
Dependent data
- The ordinary bootstrap imposes mutual independence on the Y_j, and thus misrepresents their joint distribution
Outliers and influential cases
- Remove or correct obvious outliers
- Avoid letting the simulations depend on particular observations
Review & More Resampling
Resampling techniques are powerful tools:
-- for estimating the SD from small samples
-- when a statistic does not have an easily determined SD
Bootstrapping involves:
-- taking 'new' random samples with replacement from the original data
-- calculating the bootstrap SD and statistical tests from the statistic's values across the bootstrap samples
More resampling techniques:
-- Jackknife resampling
-- Cross-validation
SUMMARY
- Introduction to Bootstrap
- Data and Modeling
- Methods on Bootstrapping LM
- Results and Comparisons
- Issues and Discussion
References
Anderson, B. "Resampling and Regression." McMaster University. http://socserv.mcmaster.ca/anderson
Davison, A.C. and Hinkley, D.V. (1997). Bootstrap Methods and Their Application, pp. 256-273. Cambridge University Press.
Efron, B. and Gong, G. (1983). "A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation." The American Statistician.
Holmes, S. "Introduction to the Bootstrap." Stanford University. http://wwwstat.stanford.edu/~susan/courses/s208/
Venables, W.N. and Ripley, B.D. (2002). Modern Applied Statistics with S, 4th ed., pp. 163-165. Springer.
Extra Stuff…
Jackknife Resampling
-- takes new samples of the data by omitting each case in turn and recalculating the statistic each time
-- the number of jackknife samples used = the number of cases in the original sample
-- works well for robust estimators of location, but not for the SD
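The leave-one-out scheme can be sketched in Python (the `jackknife_se` helper is hypothetical). A useful sanity check: for the sample mean, the jackknife SE reproduces the classical s/sqrt(n) exactly.

```python
import statistics

def jackknife_se(data, stat):
    """Jackknife standard error: omit each case in turn, recompute `stat`,
    and combine the n leave-one-out replicates."""
    n = len(data)
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_rep = sum(reps) / n
    var = (n - 1) / n * sum((r - mean_rep) ** 2 for r in reps)
    return var ** 0.5

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
jk = jackknife_se(sample, statistics.mean)
classical = statistics.stdev(sample) / len(sample) ** 0.5
print(jk, classical)  # the two agree for the mean
```

Unlike the bootstrap, the number of replicates is fixed at n, and no random sampling is involved.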
Cross-Validation
-- randomly splits the sample into two groups and compares the model results from one sample to the results from the other
-- the first subset is used to estimate a statistical model (screening/training sample)
-- the findings are then tested on the second subset (confirmatory/test sample)
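The split-and-confirm idea can be sketched as follows (the `split_sample` helper and the mean-as-model step are illustrative; any model fit would go where the mean is computed):

```python
import random

def split_sample(data, frac=0.5, seed=1):
    """Randomly split data into a training subset and a test subset."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(20))
train, test = split_sample(data)
# "Fit" on the training sample (here, just a mean), then score on the test sample.
fit = sum(train) / len(train)
error = sum((y - fit) ** 2 for y in test) / len(test)
```

Evaluating on held-out data guards against the over-optimism of judging a model on the same observations used to fit it.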