Using Bootstrap in NLSCY

Upload: piera

Post on 14-Jan-2016

Page 1: Using Bootstrap in NLSCY

Using Bootstrap in NLSCY

Page 2: Using Bootstrap in NLSCY

Today’s Presentation

B O O T S T R A P

• We’ll discuss the guiding principles

• We’ll demonstrate the CV lookup spreadsheet (which is based on the bootstrap weights)

• Bootstrap macros by example

• Summarize some technical aspects.

Page 3: Using Bootstrap in NLSCY

Background

The National Longitudinal Survey of Children and Youth measures a wide array of characteristics related to child and youth development

There are many opportunities for statistical inference

The number of possibilities is further compounded by the longitudinal character of the survey

A basic problem of inference is finding the variability of the estimators.

Page 4: Using Bootstrap in NLSCY

The bootstrap approach

• Does not need exact formulas.

• Takes into account design information.

• It can be adapted to the desired level of precision.

• Computer intensive.

Page 5: Using Bootstrap in NLSCY

Basic Idea of Bootstrap

A) Take a subsample of the original sample - trying to mimic the initial selection process.

B) For this subsample compute weights as if it was the actual sample. The result is a bootstrap weight.

Repeat A) and B) many times to obtain a set of bootstrap weights

Note that both A) and B) make essential use of the design information
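Steps A) and B) can be sketched in a few lines. The slides do not say which resampling scheme produced the NLSCY bootstrap weights, so the sketch below swaps in one common choice, a Rao-Wu style rescaling: within each stratum, redraw all but one cluster with replacement, then rescale the release weights. The function name, the (stratum, cluster, weight) layout, and the sample values are all hypothetical, and real survey weights would also go through nonresponse and post-stratification adjustments that are omitted here.

```python
import random
from collections import defaultdict


def make_bootstrap_weights(units, n_replicates, seed=0):
    """Return n_replicates lists of bootstrap weights, one weight per unit.

    units: list of (stratum, cluster, release_weight) tuples (hypothetical layout).
    Assumes every stratum contains at least two clusters.
    """
    rng = random.Random(seed)
    clusters = defaultdict(set)
    for stratum, cluster, _ in units:
        clusters[stratum].add(cluster)

    replicates = []
    for _ in range(n_replicates):
        factor = {}
        for stratum, members in clusters.items():
            members = sorted(members)
            n_h = len(members)
            # Step A: redraw n_h - 1 clusters with replacement inside the stratum.
            draws = [rng.choice(members) for _ in range(n_h - 1)]
            # Step B: rescale so the factors in each stratum sum to n_h.
            for cluster in members:
                factor[(stratum, cluster)] = draws.count(cluster) * n_h / (n_h - 1)
        replicates.append([w * factor[(s, c)] for s, c, w in units])
    return replicates


# Hypothetical sample: two strata, three clusters each, one child per cluster.
sample = [("A", 1, 100.0), ("A", 2, 100.0), ("A", 3, 100.0),
          ("B", 1, 50.0), ("B", 2, 50.0), ("B", 3, 50.0)]
bs_weights = make_bootstrap_weights(sample, n_replicates=5)
```

Each inner list plays the role of one bootstrap weight variable in the weights file. Both the strata and the clusters enter the computation, which is the sense in which the design information is essential.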

Page 6: Using Bootstrap in NLSCY

Basic Idea of Bootstrap - continued

Now suppose we are interested in an estimate:

- Compute the estimate using each of the bootstrap weights

- Compute the variance of the obtained points.

Note: These two steps are implemented in any program or software that uses bootstrap weights to assess sampling variability.
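The two steps above can be sketched as follows. This is a minimal illustration, not the code of the NLSCY macros: the function names are hypothetical, and the choice to centre on the mean of the bootstrap estimates and divide by the number of weights is an assumption (some implementations centre on the full-sample estimate or divide by one less).

```python
def weighted_total(values, weights):
    """Estimate of a total: sum of value times weight."""
    return sum(v * w for v, w in zip(values, weights))


def bootstrap_variance(values, replicate_weights, estimator=weighted_total):
    """Variance of the estimator across the sets of bootstrap weights."""
    # Step 1: compute the estimate once with each set of bootstrap weights.
    estimates = [estimator(values, weights) for weights in replicate_weights]
    # Step 2: compute the variance of the points obtained.
    centre = sum(estimates) / len(estimates)
    return sum((e - centre) ** 2 for e in estimates) / len(estimates)


# Tiny hypothetical example: two records, three sets of bootstrap weights.
example_variance = bootstrap_variance([1, 2], [[1, 1], [2, 0], [0, 2]])
```

Any estimator that accepts values and weights can be passed in place of `weighted_total`, which is why the same two steps serve totals, ratios and regression coefficients alike.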

Page 7: Using Bootstrap in NLSCY

The Need to Use the Design Information

Using the release weights gives the correct estimates.

However, the variance of the estimator provided by SAS or SPSS is not the real one - most of the time it is less.

Here are the comparisons for two examples:

• Average

• Regression Coefficients

Page 8: Using Bootstrap in NLSCY

Using Bootstrap

We have two tools at hand:

• A database of variances for proportions - already computed by bootstrap

• The bootstrap macros

Page 9: Using Bootstrap in NLSCY

Results for Proportions

For the variability of proportion estimates we can use an Excel table of already computed results. The work has been done using bootstrap.

This table replaces the usual look-up tables for variance.

One can choose the domain based on age and province.

Page 10: Using Bootstrap in NLSCY

Results for Proportions - continued

This general framework allows for estimating variability of proportions in future cycles of the survey.

In most situations where proportions are involved, consulting this database may be enough.

Here are examples on how to use it.

Page 11: Using Bootstrap in NLSCY

Understanding the table

Example 1

Question: What is the quality (c.v.) of the estimates for the proportion of girls aged 3 in Newfoundland at cycle 3?

How many will there be in cycle 5?

Will the quality suffer from the smaller sample size?

Page 12: Using Bootstrap in NLSCY

Click on the right arrow in Province to select a province

Page 13: Using Bootstrap in NLSCY

Select Newfoundland (Terre-Neuve)

Page 14: Using Bootstrap in NLSCY

Click to select C3 Age = 3

Page 15: Using Bootstrap in NLSCY

Since the proportion of girls should be around 50%, click on Prop. Cible (target proportion) and select 50%.

Page 16: Using Bootstrap in NLSCY

The remaining rows contain the results that interest us ...

Page 17: Using Bootstrap in NLSCY

You can now see that the c.v. for that particular domain in cycle 3 was 17.5% with 44 children in the sample.

In cycle 5, we predict 35 children will be left in the sample (assuming a 90% response rate in cycles 4 and 5) and the c.v. will grow to 19.6%.
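The figures on this slide can be reproduced with simple arithmetic. The slides do not describe how the spreadsheet makes its projection; the sketch below assumes that the sample shrinks by the response rate in each cycle and that the c.v. scales with one over the square root of the sample size. That assumption happens to match the numbers shown. The function name is hypothetical.

```python
import math


def projected_cv(cv_now, n_now, n_future):
    """Scale a c.v. to a new sample size, assuming it varies as 1/sqrt(n)."""
    return cv_now * math.sqrt(n_now / n_future)


n_cycle3 = 44
# Two further cycles at an assumed 90% response rate each; 35.64 is truncated to 35.
n_cycle5 = int(n_cycle3 * 0.9 * 0.9)
cv_cycle5 = projected_cv(17.5, n_cycle3, n_cycle5)
print(n_cycle5, round(cv_cycle5, 1))
```

Under these assumptions the script gives 35 children and a c.v. of 19.6%, in agreement with the table.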

Page 18: Using Bootstrap in NLSCY

Understanding the table

Example 2

Question: What domains based on a 15% proportion are not publishable?

We are looking for domains with a c.v. higher than 33.33%.

Page 19: Using Bootstrap in NLSCY

Click and select Prop. cible of 15%

Page 20: Using Bootstrap in NLSCY

Click and select Custom in bs_cv

Page 21: Using Bootstrap in NLSCY

Select “is greater than” in the first field

Page 22: Using Bootstrap in NLSCY

Finally, type in 33.33 in the second field and click OK

Page 23: Using Bootstrap in NLSCY

You can now see the first few rows of estimates that we can’t release according to customary quality level guidelines.

Page 24: Using Bootstrap in NLSCY

Results for Proportions - Summary

The table contains variance estimations obtained by bootstrap - under general conditions.

It is best suited for quick general assessment of variance and projections for future cycles.

When we need the most accurate variance estimation we have to do the bootstrap for the specific variable of interest.

Page 25: Using Bootstrap in NLSCY

Macros - Outline

• Bootstrap weights are computed and made available by methodology.

• The user runs the macros.

Page 26: Using Bootstrap in NLSCY

Macros - Details

• Preparing the input

• Specifying the options and running the macros

• Saving and interpreting the results

Page 27: Using Bootstrap in NLSCY

Preparing the input

• Two input files are required:

– The bootstrap weights file

– The file with variables of interest

• These files must be merged - usually with the CHILDID identifier

Page 28: Using Bootstrap in NLSCY

Specifying options

The options to be specified are as follows:

(i) The kind of estimator.

(ii) Whether the analysis is done globally or by domains.

(iii) SAS libraries.

(iv) The names of the variables for analysis.

(v) The number of bootstrap weights to be used.

Page 29: Using Bootstrap in NLSCY

Specifying options - continued

(i) The built-in choices - in the current version - are:

• Totals

• Ratios

• Difference of Ratios

• Linear Regression

• Logistic Regression

Other estimators may require customizing the code.

(ii) If analysis by subgroup is desired, the user needs to specify the subgroup variable.

Page 30: Using Bootstrap in NLSCY

Examples with SAS Macros

a) Estimate variance of a total by region

b) Estimate the variance of an average

c) Estimate the variance of regression coefficients

Page 31: Using Bootstrap in NLSCY

Estimate variance of a total by region

Problem:

Find the variance of the total number of bedrooms in households with teenagers within each province - as estimated from the sample.

Page 32: Using Bootstrap in NLSCY

Estimate variance of a total by region - continued

/*

%partition(domains=); *no partition if no variable name provided;

%total(dataset=,variable=,nb_weights=);

COLLECT OUTPUT FROM DATASET: totals

%ratio(dataset=,numerator=,denominator=,nb_weights=);

COLLECT OUTPUT FROM DATASET: ratios - in PERCENTS -

%ratio_difference(dataset=,numerator1=,denominator1=,

numerator2=,denominator2=,nb_weights=);

COLLECT OUTPUT FROM DATASET: diffrat - in PERCENTS -

%regression (dataset=,dependent=,independent=,nb_weights=);

COLLECT OUTPUT FROM DATASET: bs_reg

%logistic_reg (dataset=,dependent=,independent=,nb_weights=);

COLLECT OUTPUT FROM DATASET: bs_reglg

NOTE: unless explicitly deleted, the datasets mentioned above

will keep accumulating the results of successive macro calls */

Page 35: Using Bootstrap in NLSCY

Estimate variance of a total by region - continued

%total(dataset=,variable=,nb_weights=);

COLLECT OUTPUT FROM DATASET: totals

Page 36: Using Bootstrap in NLSCY

Estimate variance of a total by region - continued

%include "C:\users\dochcat\bootstrap\NLSCY_VES.sas";

%let weight_path = C:\users\dochcat\bootstrap\Bs_Weights;

%let weights = bvar;

libname wt_lib "&weight_path";

%let data_path = C:\users\dochcat\Data;

%let data = basic_set;

libname dt_lib "&data_path";

%let save_path = C:\users\dochcat\bootstrap\Results;

%let output = table01;

libname sv_lib "&save_path";

Page 37: Using Bootstrap in NLSCY

Estimate variance of a total by region - continued

proc sort data=wt_lib.&weights out=weights;

by childid; run;

proc sort data=dt_lib.&data (where=(cmmcq01>12))

/*keep only teenagers*/ out=dataset;

by childid; run;

data data_and_weights;

merge dataset(in=a) weights(in=b);

by childid;

if a; * keep only the necessary records;

run;

Page 38: Using Bootstrap in NLSCY

Estimate variance of a total by region - continued

/* initialise totals */

proc datasets library=work; delete totals; run;

%partition(domains=cgehd03);

%total(dataset=data_and_weights,

variable=nb_bedrooms,

nb_weights=1000);

/*save results*/

data sv_lib.&output; set totals; run;

proc print data=sv_lib.table01; run;

Page 41: Using Bootstrap in NLSCY

Estimate the variance of an average

Problem:

For children of age 6, find the average number of years of education of the Person Most Knowledgeable about the child.

Note:

Even though the average was not mentioned as an available

option, it is easily computed as a ratio.
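To see why an average is a ratio, note that a weighted mean is the weighted total of the variable divided by the weighted total of a variable that equals 1 on every record. The sketch below shows this with made-up values and weights; the function name is hypothetical and the code is not taken from the macros.

```python
def weighted_ratio(numerator, denominator, weights):
    """Ratio of two weighted totals."""
    top = sum(w * x for w, x in zip(weights, numerator))
    bottom = sum(w * d for w, d in zip(weights, denominator))
    return top / bottom


# Hypothetical records: years of education and survey weights.
years = [10, 12, 16]
weights = [2.0, 1.0, 1.0]
count = [1] * len(years)  # constant variable used as the denominator

mean_as_ratio = weighted_ratio(years, count, weights)
direct_mean = sum(w * y for w, y in zip(weights, years)) / sum(weights)
```

Here `mean_as_ratio` and `direct_mean` are the same number, because the weighted total of `count` is simply the sum of the weights.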

Page 43: Using Bootstrap in NLSCY

Estimate variance of an average - continued

%ratio(dataset=,numerator=,denominator=,nb_weights=);

COLLECT OUTPUT FROM DATASET: ratios - in PERCENTS -

Page 44: Using Bootstrap in NLSCY

Estimate variance of an average - continued

%include

"C:\users\dochcat\bootstrap\NLSCY_VES.sas";

%let weight_path = C:\users\dochcat\bootstrap\Bs_Weights;

%let weights = bvar;

libname wt_lib "&weight_path";

%let data_path = C:\users\dochcat\Data;

%let data = basic_set;

libname dt_lib "&data_path";

%let save_path = C:\users\dochcat\bootstrap\Results;

%let output = table02;

libname sv_lib "&save_path";

Page 45: Using Bootstrap in NLSCY

Estimate variance of an average - continued

proc sort data=wt_lib.&weights out=weights;

by childid; run;

proc sort data=dt_lib.&data

(where=(cmmcq01=6 and cedpd04<96))

/*keep only age 6 kids with valid values

of the variable*/

out=dataset; by childid; run;

data data_and_weights;

merge dataset(in=a) weights(in=b);

by childid;

if a; * keep only the necessary records;

count=1; * necessary for average calculation;

run;

Page 46: Using Bootstrap in NLSCY

Estimate the variance of an average - continued

/* initialise ratios */

proc datasets library=work; delete ratios; run;

%partition(domains=); *ensure no partition;

%ratio(dataset=data_and_weights,

numerator=cedpd04,

denominator=count,

nb_weights=1000);

/* save the results */

data sv_lib.&output; set ratios; run;

proc print data=sv_lib.table02; run;

Page 47: Using Bootstrap in NLSCY

Estimate the variance of an average - results

Note that the result is expressed as a percentage. For our purposes we need to divide the standard errors, confidence limits, and estimates by 100, and the variance by 100*100. The coefficient of variation stays the same.

Hence the results are: Mean=12.8446, with a 95% confidence interval [12.6544 , 13.0348]
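The rescaling described above can be written as a small helper. This is a sketch only: the key names are hypothetical and do not come from the ratios dataset, and the percent-scale inputs are simply the slide's final figures multiplied by 100.

```python
# Divisors for quantities reported on the percent scale; anything not
# listed (for example the c.v.) is left unchanged.
DIVISORS = {
    "estimate": 100,
    "std_err": 100,
    "lower": 100,
    "upper": 100,
    "variance": 100 * 100,
}


def rescale_from_percent(row):
    """Convert a row of results from the percent scale to the original scale."""
    return {name: value / DIVISORS.get(name, 1) for name, value in row.items()}


# Percent-scale values implied by the results quoted on the slide.
as_percent = {"estimate": 1284.46, "lower": 1265.44, "upper": 1303.48}
rescaled = rescale_from_percent(as_percent)
```

The c.v. needs no adjustment because it is the standard error divided by the estimate, and both are divided by the same factor.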

Page 48: Using Bootstrap in NLSCY

Estimate the variance of an average - comments

Let us compare the confidence interval we just computed with the confidence interval produced by using only the release weights.

The following SAS code will produce these confidence limits:

proc means mean lclm uclm data=data_and_weights;

var cedpd04;

weight w_final;

run;

And they are …

Page 49: Using Bootstrap in NLSCY

Estimate the variance of an average - comments

… while the bootstrap estimate of the same confidence interval is:

[12.6544 , 13.0348]

This is what we get when we compare:

Bootstrap

Classical

We can see an increase by a factor of about 1.7 - for this variable.

Page 50: Using Bootstrap in NLSCY

Estimate the variance of regression coefficients

Problem:

Estimate the variance of regression coefficients of an outcome variable - the PPVT score. The independent variables are: the number of years of education of the PMK, and positive interaction in parenting.

Page 52: Using Bootstrap in NLSCY

Estimate variance of regression coefficients - continued

%regression (dataset=,dependent=,independent=, nb_weights=);

COLLECT OUTPUT FROM DATASET: bs_reg

Page 53: Using Bootstrap in NLSCY

Estimate variance of regression coefficients - continued

%include

"C:\users\dochcat\bootstrap\NLSCY_VES.sas";

%let weight_path = C:\users\dochcat\bootstrap\Bs_Weights;

%let weights = bvar;

libname wt_lib "&weight_path";

%let data_path = C:\users\dochcat\Data;

%let data = basic_set;

libname dt_lib "&data_path";

%let save_path = C:\users\dochcat\bootstrap\Results;

%let output = table03;

libname sv_lib "&save_path";

Page 54: Using Bootstrap in NLSCY

Estimate variance of regression coefficients - continued

proc sort data=wt_lib.&weights out=weights;

by childid; run;

proc sort data=dt_lib.&data

(where=(cedpd04<96 and cprcs03<96

and cppcs01<888 and cmmcq01 in (6))) out=dataset;

/*always eliminate records with nonresponse codes*/

by childid; run;

data data_and_weights;

merge dataset(in=a) weights;

by childid;

if a; * keep only the necessary records;

run;

Page 55: Using Bootstrap in NLSCY

Estimate variance of regression coefficients - continued

%start_chronometer;

/* initialise bs_reg */

proc datasets library=work; delete bs_reg; run;

%partition(domains=);

%regression( dataset=data_and_weights,

dependent=cppcs01,

independent=cprcs03 cedpd04,

nb_weights=100);

/* save the results */

data sv_lib.&output; set bs_reg; run;

%stop_chronometer;

Page 56: Using Bootstrap in NLSCY

Estimate variance of regression coefficients - results

Page 57: Using Bootstrap in NLSCY

Estimate variance of regression coefficients - comments

Now we compare the bootstrap coefficients of variation for the regression parameters with the coefficients of variation produced by the SAS regression procedure.

proc reg data=data_and_weights;

model cppcs01 = cprcs03 cedpd04;

weight an_weight;

run; quit;

Page 58: Using Bootstrap in NLSCY

Estimate variance of regression coefficients - comments

Here is the SAS output:

From the estimates and their standard errors we compute the CVs and compare:

             SAS       Bootstrap
Intercept    5.33%     8.75%
CPRCS03      27.55%    46.42%
CEDPD04      13.08%    21.61%

One can see that the results are very different. In such a situation the SAS results are not to be trusted.
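To put a number on "very different", the bootstrap CVs listed above can be divided by the SAS ones. The sketch below does only that arithmetic on the figures from the slide; the variable names are hypothetical.

```python
# (SAS c.v., bootstrap c.v.) in percent, copied from the comparison above.
cv_percent = {
    "Intercept": (5.33, 8.75),
    "CPRCS03": (27.55, 46.42),
    "CEDPD04": (13.08, 21.61),
}

# Ratio of the bootstrap c.v. to the SAS c.v. for each parameter.
ratios = {
    name: round(bootstrap / sas, 2)
    for name, (sas, bootstrap) in cv_percent.items()
}
print(ratios)
```

All three ratios fall between about 1.6 and 1.7, so for this model the SAS procedure understates the coefficients of variation by a similar margin for every parameter.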

Page 59: Using Bootstrap in NLSCY

Running Time

• Generally the running time is equal to the number of bootstrap weights used multiplied by the time it takes for one estimate. The most time consuming are the bootstraps for regression or logistic regression.

• As an example, consider a Pentium II 350 MHz with 128 MB of memory. On this machine we ran bootstraps for the variance of regression coefficients with 1,000 weights. The running time was usually 45 to 60 minutes.

• Since we are dealing with large files, the computing performance will greatly benefit from large amounts of RAM.
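The rule of thumb above (running time equals number of weights times the time for one estimate) makes it easy to budget a run. The sketch below back-calculates the time per estimate from the 45 to 60 minutes reported for 1,000 weights and then projects a shorter test run with 100 weights. It assumes the time per estimate stays constant, and the function name is hypothetical.

```python
def estimated_minutes(n_weights, seconds_per_estimate):
    """Running time in minutes: number of weights times time per estimate."""
    return n_weights * seconds_per_estimate / 60


# 45 to 60 minutes for 1,000 weights, converted to seconds per estimate.
per_estimate = [minutes * 60 / 1000 for minutes in (45, 60)]

# Projected running time for a test run with only 100 weights.
test_run = [estimated_minutes(100, seconds) for seconds in per_estimate]
```

On the machine described, this works out to roughly 2.7 to 3.6 seconds per estimate, so a 100-weight test run would take about 4.5 to 6 minutes.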

Page 60: Using Bootstrap in NLSCY

Running Time - Measuring

Two simple macros are provided in our file for measuring the running time. They are used as follows:

…….

%start_chronometer;

%partition(domains=cgehd03);

%total(dataset=data_and_weights,variable=nb_bedrooms,

nb_weights=1000);

%stop_chronometer;

…….

After the call of the %stop_chronometer macro, the time elapsed - in seconds - is written to the SAS log

Page 61: Using Bootstrap in NLSCY

Notes

• Averages are ratios. The user has to set up the necessary variables - see the case of the variable “count” in the examples.

• Generally, curious users may find useful tricks for their programs inside the NLSCY_VES.sas file. Of course, modifying the file itself will make it harder to benefit from the experience of other users.

Page 62: Using Bootstrap in NLSCY

Notes

• The results accumulate in the output dataset when several macros are run in sequence. The previous numbers are lost only when the respective dataset is explicitly deleted.

• The macros use temporary data sets. Conflicts of names with user defined datasets are possible. If you suspect this is the case inspect the log for all the datasets that are being created or deleted.

Page 63: Using Bootstrap in NLSCY

Notes

• The point estimates provided in the results are computed based on the release weights. Only their variances are computed by using the bootstrap weights. The reason for providing the estimates along with the respective variances is to allow the users to double check their work.

Page 64: Using Bootstrap in NLSCY

Notes

• When testing, it is good to keep in mind that some steps may not need to be executed again. For instance, once the “data_and_weights” are in memory, only the macros and their auxiliary statements are needed. Also, specifying a smaller number of weights may greatly reduce the testing time.

• Sometimes the macros have to output a lot of text to the log window. If you plan to run unsupervised jobs it is advisable to redirect the log output to a file.

Page 65: Using Bootstrap in NLSCY

Conclusions

• The macros presented here allow for an easy utilization of the bootstrap weights. In most situations the user only needs to write code pertaining to his/her dataset, and not to the bootstrap process.

• When more complex estimators are required, the user may need to write custom code. The code of the macros presented here can serve as a template.