TRANSCRIPT
Don't Be Loopy: Re-Sampling and Simulation the SAS® Way
David L. Cassell
Design Pathways
Corvallis, OR
David L. Cassell, Design Pathways
Introduction
Bootstrapping
Jackknifing
Cross-validation
Simulations
Monte Carlo
… and on and on…

First, the BAD WAY
The typical bootstrap code – a huge macro loop
Slow
Awkward
Very complex code
Log-filling
Output-clogging
Did I mention ‘slow’?

The typical BAD bootstrap code – a huge macro loop
%macro bootie ( input=, reps= );
   %do i = 1 %to &REPS ;
      %* steps to generate one data set;
      %* the proc to do the analysis;
      %* some way of appending the new results;
   %end;
   %* a proc to compute the bootstrap estimates;
%mend;

Interlude – What is a bootstrap?
Types of Re-sampling:
Random draws
Designed subsets
Exchange labels

Interlude – What is a bootstrap?
Want to approximate sampling distribution
Simple: SRS with replacement from original sample
Non-parametric (mostly)
Want: bias, std error, CI, or …
Assumptions: exchangeability, …

Interlude – What is a bootstrap?
We’ll start with the simple bootstrap
Get a URS sample (unrestricted random sampling, i.e. SRS with replacement) of size N
Compute your statistic
Repeat B=1000 or 10,000 or … times
Look at the behavior of your B values

Interlude – What is a bootstrap?
Warning: do not forget exchangeability!
The simple / naïve bootstrap doesn’t work right on:
Time series data
Repeated measures data
Survey sample data
Data with analytic weights
. . . .

Interlude – What is a bootstrap?
A common approach is the bootstrap percentile interval:
Take your B values from before
Pull the 2.5th and 97.5th percentiles to get a 95% percentile interval as your CI

The typical BAD bootstrap code – a huge macro loop
%macro bootie ( input=, reps= );
   %do i = 1 %to &REPS ;
      %* steps to generate one data set;
      %* the proc to do the analysis;
      %* some way of appending the new results;
   %end;
   %* a proc to compute the bootstrap estimates;
%mend;

A Better Bootstrap
1. Generate ALL of the bootstrap samples as one data set
2. Use the same proc as before, but use BY-processing
3. Use the same computations to get the bootstrap estimates

A Better Bootstrap
proc surveyselect data=YourData
   out=outboot       /* all replicates land in one data set */
   seed=30459584     /* fixed seed so the run is reproducible */
   method=urs        /* unrestricted random sampling, i.e. with replacement */
   samprate=1        /* each replicate is as large as the input */
   outhits           /* one output record per selection of a unit */
   rep=1000;         /* number of bootstrap replicates */
run;

A Better Bootstrap
proc univariate data=outboot;
   var x;
   by Replicate;
   output out=out1 q1=q1 median=med q3=q3;
run;
data out2;
   set out1;
   trimean = (q1 + 2*med + q3) / 4;
run;

A Better Bootstrap
proc univariate data=out2;
   var trimean;
   output out=final
      pctlpts=2.5, 97.5
      pctlpre=ci;
run;
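
The percentile interval covers the CI; an earlier slide also lists the standard error as something you might want. A minimal sketch, assuming the OUT2 data set from the previous step holds one trimean per replicate (the names BOOTSE, boot_mean and boot_se are made up for illustration):

```sas
/* Sketch only: summarize the B bootstrap values of the trimean.      */
/* BOOTSE, boot_mean and boot_se are illustrative names, not from the */
/* original slides.                                                   */
proc means data=out2 noprint;
   var trimean;
   output out=bootse mean=boot_mean std=boot_se;
run;
```

Here boot_se is the usual bootstrap standard error estimate, and boot_mean minus the trimean computed on the original data gives the bootstrap estimate of bias.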

A Better Bootstrap – More
sasfile YourData load;
proc surveyselect data=YourData out=outboot
   seed=30459584
   method=urs samprate=1 outhits
   rep=1000;
run;
sasfile YourData close;

A Better Bootstrap – More
ods listing close;
proc univariate data=outboot;
   var x;
   by Replicate;
   output out=out1 q1=q1 median=med q3=q3;
run;
ods listing;

A Better Bootstrap – ODS OUTPUT
ods output Modes=modal;
proc univariate data=outboot modes;
   var YourVariable;
   by Replicate;
run;
ods output close;
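
To use ODS OUTPUT you need the name of the table the proc produces. One way to find it, sketched here under the assumption that OUTBOOT and YourVariable are as above, is to run the proc once on a single replicate with ODS TRACE switched on and read the table names from the log:

```sas
/* Sketch only: list the ODS table names PROC UNIVARIATE produces. */
/* Restricting to one replicate keeps the log short.               */
ods trace on;
proc univariate data=outboot(where=(Replicate=1)) modes;
   var YourVariable;
run;
ods trace off;
```

Any table name reported in the log can then go on the left side of an ODS OUTPUT statement, as Modes does above.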

Case Resampling
Simple bootstrap, as before
Apply to: PROC REG, PROC LOGISTIC, ….
The approach can be criticized on several grounds

Case Resampling
data test;
   x=1; y=45; output;
   do x = 2 to 29;
      y = 3*x + 6*rannor(1234);
      output;
   end;
   x=30; y=45; output;
run;

Case Resampling
[Figure: scatter plot of y (axis 0 to 90) against x (axis 0 to 30)]

Case Resampling
ods listing close;
proc surveyselect data=test out=boot1 seed=38474
   method=urs samprate=1 outhits rep=1000;
run;
proc reg data=boot1 outest=est1(drop=_:);
   model y=x;
   by replicate;
run;
ods listing;

Case Resampling
proc univariate data=est1;
   var x;
   output out=final pctlpts=2.5, 97.5
      pctlpre=ci;
run;

Case Resampling
proc robustreg data=test method=MM;
   model y=x;
run;

Case Resampling
Method                         95% CI for the slope
PROC REG                       (1.74, 2.80)
bootstrap (case resampling)    (1.65, 2.90)
PROC ROBUSTREG                 (2.39, 3.13)

Resampling residuals
Fit the model
Bootstrap sample for the residuals
Add the randomly resampled e to Y-hat
Fit the model for each of the B reps
Compute bootstrap estimates

Resampling residuals
1. perform the regression, get Y-hat and e
2. split the data
3. copy the FIT data set repeatedly
4. URS sample of residuals for each replicate
5. merge residuals with records
6. fit the model on each replicate
7. compute bootstrap estimates

Resampling residuals
proc reg data=test;
   model y=x;
   output out=out1 p=yhat r=res;
run;

Resampling residuals
data fit(keep=yhat x order) resid(keep=res);
   set out1;
   order+1;
run;

Resampling residuals
proc surveyselect data=fit out=outfit
   method=srs samprate=1 rep=1000;
run;

Resampling residuals
data outres2;
   do replicate = 1 to 1000;
      do order = 1 to numrecs;
         p = ceil( numrecs * ranuni(394747373) );
         set resid nobs=numrecs point=p;
         output;
      end;
   end;
   stop;
run;

Resampling residuals
data prepped;
   merge outfit outres2;
   by replicate order;
   new_y = yhat + res;
run;

Resampling residuals
proc reg data=prepped
   outest=est1( drop=_: );
   model new_y = x;
   by replicate;
run;

Resampling residuals
proc univariate data=est1;
   var x;
   output out=final pctlpts=2.5, 97.5
      pctlpre=ci;
run;

“The” Bootstrap?
Simple bootstrap
Residual resampling
Parametric bootstrap
Smooth bootstrap
Wild bootstrap
Double bootstrap
Various ‘adjusted’ bootstraps
. . . . .

The Jackknife
Non-parametric
N systematic samples of size N-1
Less general than the bootstrap
Easier to apply to complex sampling schemes

The Jackknife
data outb;
   do replicate = 1 to numrecs;
      do rec = 1 to numrecs;
         set test nobs=numrecs point=rec;
         if replicate ^= rec then output;
      end;
   end;
   stop;
run;

The Jackknife
ods listing close;
proc univariate data=outb;
   var y;
   by replicate;
   output out=outall kurtosis=curt;
run;
ods listing;

The Jackknife
proc univariate data=outall;
   var curt;
   output out=final mean=jmean std=jstd;
run;
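
The plain standard deviation of the leave-one-out values is not itself the usual jackknife standard error, which multiplies the corrected sum of squares of the N replicate values by (N-1)/N before taking the square root. A minimal sketch of that rescaling, assuming OUTALL holds one curt value per replicate as above (JK, JK_SE and jse are illustrative names):

```sas
/* Sketch only: jackknife standard error from the leave-one-out values. */
/* JK, JK_SE and jse are illustrative names, not from the slides.       */
proc means data=outall noprint;
   var curt;
   output out=jk n=n mean=jmean css=css;   /* css = sum of squared deviations */
run;

data jk_se;
   set jk;
   jse = sqrt( (n - 1) / n * css );   /* jackknife variance is (n-1)/n * css */
run;
```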

Randomization Tests
Resampling plan
Re-label the data points randomly
Compare against original
Random subset of full permutation test
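
The slides give no code for this step. One way the re-labeling might be done in the same one-big-data-set style is sketched below; the input data set TWOGROUP, its variables group and y, and every other name here are hypothetical:

```sas
/* Sketch only: random re-labeling for a randomization test.        */
/* TWOGROUP (variables group and y) is a hypothetical input.        */
%let reps = 1000;

/* copy the responses once per replicate, each with a random sort key */
data shuffle;
   set twogroup(keep=y);
   do replicate = 1 to &reps;
      u = ranuni(20398);
      output;
   end;
run;

proc sort data=shuffle;
   by replicate u;      /* random order of y within each replicate */
run;

/* copy the labels, in their original order, once per replicate */
data labels;
   do replicate = 1 to &reps;
      do rec = 1 to numrecs;
         set twogroup(keep=group) nobs=numrecs point=rec;
         output;
      end;
   end;
   stop;
run;

/* pairing records in order within each replicate attaches the */
/* original labels to the shuffled responses                   */
data permuted;
   merge labels shuffle(drop=u);
   by replicate;
run;
```

The test statistic can then be computed on PERMUTED with BY replicate and compared against the value from the original data.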

Cross-Validation
Another type of resampling plan
K replicate samples
Each sample uses (K-1)/K to model and 1/K for testing

Cross-Validation
LOOCV – Leave-One-Out Cross-Validation
K-fold Cross-Validation
Random K-fold Cross-Validation

Random K-Fold Cross-Validation
%let K=10;
%let rate= %sysevalf( (&K-1) / &K );
proc surveyselect data=test out=xv seed=495857
   samprate=&RATE outall rep=&K ;
run;
data xv;
   set xv;
   if selected then new_y=y;
run;

Random K-Fold Cross-Validation
proc reg data=xv;
   model new_y=x;
   by replicate;
   output out=out1(where=(new_y=.)) p=yhat;
run;

Random K-Fold Cross-Validation
data out2;
   set out1;
   d=y-yhat;
   absd=abs(d);
run;
proc summary data=out2;
   var d absd;
   output out=out3 std(d)=rmse mean(absd)=mae;
run;
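
STD(d) centers the prediction errors at their mean and divides by one less than the count, so it only approximates the root mean squared error when the mean error is near zero. A sketch of the uncentered version, assuming OUT2 as above (OUT3X, ssd and n are illustrative names):

```sas
/* Sketch only: RMSE from the uncorrected sum of squares.     */
/* OUT3X, ssd and n are illustrative names, not from the slides. */
proc summary data=out2;
   var d absd;
   output out=out3x uss(d)=ssd n(d)=n mean(absd)=mae;
run;

data out3x;
   set out3x;
   rmse = sqrt( ssd / n );   /* square root of the mean squared prediction error */
run;
```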

Monte Carlo Simulations
Sample from theoretical distributions
Sample from population of data points
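
Sampling from a theoretical distribution fits the same pattern as the bootstrap: build every replicate in one data step, then let BY-processing do the analysis. A minimal sketch, in which the distribution, the sample size of 25, the seed and all data set and variable names are arbitrary choices for illustration:

```sas
/* Sketch only: 1000 simulated samples of 25 standard normal values. */
/* All names, sizes and the seed are illustrative choices.           */
data sim;
   call streaminit(12345);
   do replicate = 1 to 1000;
      do i = 1 to 25;
         x = rand('NORMAL', 0, 1);
         output;
      end;
   end;
run;

/* one pass over the data, one summary row per replicate */
proc means data=sim noprint;
   by replicate;
   var x;
   output out=simstats mean=xbar;
run;
```

SIMSTATS then holds the simulated sampling distribution of the mean, ready for the same percentile step used for the bootstrap.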

Simulations
proc surveyselect data=largefile out=process_set
   seed=45884743 method=srs sampsize=1000;
run;
data processor;
   array a{5,5} a1-a25;
   set process_set;
   . . . . .
run;

Simulations
proc plan seed=4958584;
   factors replicate=100 ordered
           SiteNo = 30 of 200 / noprint;
   output out=plan9;
run;

CONCLUSIONS
Cassell’s “7 Habits of Highly Effective SAS-ers”
KNOW YOUR PROBLEM
USE THE RIGHT TOOL
FEWER STEPS GET YOU FARTHER
STAY TALL AND THIN
TOO MUCH OF A GOOD THING IS BAD
SKIP THE EXPENSIVE STUFF
SHARPEN THE SAW

CONCLUSIONS
SAS is great at resampling and simulations.
You just have to code it in SAS instead of something else!
Don’t run 5003 steps when 3 steps will do it.
Don’t assume everything is a macro problem.

CONCLUSIONS
Resampling methods and simulations do not solve all your problems.
Use your brain before you use your keyboard.

Contact Information
David L. Cassell
Design Pathways
3115 NW Norwood Pl.
Corvallis, OR 97330
[email protected]@msn.com
541-754-1304