topic 18: model selection and diagnostics
DESCRIPTION
Topic 18: Model Selection and Diagnostics. Variable Selection. We want to choose a “best” model that is a subset of the available explanatory variables Two separate problems How many explanatory variables should we use (i.e., subset size) - PowerPoint PPT PresentationTRANSCRIPT
Topic 18: Model Selection and Diagnostics
Variable Selection
• We want to choose a “best” model that is a subset of the available explanatory variables
• Two separate problems–How many explanatory variables
should we use (i.e., subset size)–Given the subset size, which
variables should we choose
KNNL Example
• Page 350, Section 9.2• Y : survival time of patient (liver op)• X’s (explanatory variables) are –Blood clotting score–Prognostic index–Enzyme function test–Liver function test
KNNL Example cont.
• n = 54 patients• Start with the usual plots and
descriptive statistics• Time-to-event data is often
heavily skewed and typically transformed with a log
Data
Data a1; infile 'U:\.www\datasets512\CH09TA01.txt‘ delimiter='09'x; input blood prog enz liver age gender alcmod alcheavy surv lsurv;run;
Tab delimited
Dummy variables for alcohol use
Ln(surv)
Obs blood prog enz liver age gender alcmod alcheavy surv lsurv1 6.7 62 81 2.59 50 0 1 0 695 6.544
2 5.1 59 66 1.70 39 0 0 0 403 5.999
3 7.4 57 83 2.16 55 0 0 0 710 6.565
4 6.5 73 41 2.01 48 0 0 0 349 5.854
5 7.8 65 115 4.30 45 0 0 1 2343 7.759
6 5.8 38 72 1.42 65 1 1 0 348 5.852
Data
Log Transform of Y
• Recall that regression model does not require Y to be Normally distributed
• In this case, transform reduces influence of long right tail and often stabilizes the variance of the residuals
Scatterplotsproc corr plot=matrix; var blood prog enz liver;run;
proc corr plot=scatter; var blood prog enz liver; with lsurv;run;
Correlation Summary
Pearson Correlation Coefficients, N = 54Prob > |r| under H0: Rho=0
blood prog enz liverlsurv 0.24619
0.07270.46994
0.00030.65389<.0001
0.64926<.0001
The Two Problems in Variable Selection
1. To determine an appropriate subset size
– Might use adjusted R2, Cp, MSE,
PRESS, AIC, SBC (BIC)
2. To determine best model of this fixed size– Might use R2
Adjusted R2
• R2 by its construction is guaranteed to increase with p
–SSE cannot decrease with additional X and SSTO constant
• Adjusted R2 uses df to account for p
MSTO
MSE
SSTO
SSE
pn
nR ppa
11
12
Adjusted R2
• Want to find model that maximizes • Since MSTO will remain constant for a given
data set – Depends only on Y
• Equivalent information to MSE• Thus could also find choice of model that
minimizes MSE• Details on pages 354-356
2aR
Cp Criterion
• The basic idea is to compare subset models with the full model
• A subset model is good if there is not substantial “bias” in the predicted values (relative to the full model)
• Looks at the ratio of total mean squared error and the true error variance
• See page 357-359 for details
Cp Criterion
)2()Full(MSE
SSEpnC p
p
SSE based on a specific choice of p-1 variables
MSE based on the full set of variables
Select the full set and Cp=(n-p)-(n-2p)=p
Use of Cp
• p is the number of regression coefficients including the intercept
• A model is good according to this
criterion if Cp ≤ p
• Rule: Pick the smallest model for
which Cp is smaller than p or pick the
model that minimizes Cp, provided
the minimum Cp is smaller than p
SBC (BIC) and AIC
Criterion based on log(likelihood) plus a penalty for more complexity
• AIC – minimize
• SBC – minimize
p2n
SSElogn p
)nlog(pn
SSElogn p
Other approaches
• PRESS (prediction SS)–For each case i–Delete the case and predict Y using
a model based on the other n-1 cases –Look at the SS for observed minus
predicted–Want to minimize the PRESS
Variable Selection • Additional proc reg model statement
options useful in variable selection
– INCLUDE=n forces the first n explanatory variables into all models
–BEST=n limits the output to the best n models of each subset size or total
–START=n limits output to models that include at least n explanatory variables
Variable Selection
• Step type procedures
–Forward selection (Step up)
–Backward elimination (Step down)
–Stepwise (forward selection with a backward glance)
• Very popular but now have much better search techniques like BEST
Ordering models of the same subset size
• Use R2 or SSE• This approach can lead us to consider
several models that give us approximately the same predicted values
• May need to apply knowledge of the subject matter to make a final selection
• Not that important if prediction is the key goal
Proc Reg
proc reg data=a1; model lsurv= blood prog enz liver/ selection=rsquare cp aic sbc b best=3;run;
Number inModel R-Square C(p) AIC SBC
1 0.4276 66.4889 -103.8269 -99.848891 0.4215 67.7148 -103.2615 -99.283571 0.2208 108.5558 -87.1781 -83.200112 0.6633 20.5197 -130.4833 -124.516342 0.5995 33.5041 -121.1126 -115.145612 0.5486 43.8517 -114.6583 -108.691383 0.7573 3.3905 -146.1609 -138.204943 0.7178 11.4237 -138.0232 -130.067233 0.6121 32.9320 -120.8442 -112.888234 0.7592 5.0000 -144.5895 -134.64461
Selection Results
Number inModel
Parameter Estimates
Intercept blood prog enz liver1 5.26426 . . 0.01512 .1 5.61218 . . . 0.298191 5.56613 . 0.01367 . .2 4.35058 . 0.01412 0.01539 .2 5.02818 . . 0.01073 0.209452 4.54623 0.10792 . 0.01634 .3 3.76618 0.09546 0.01334 0.01645 .3 4.40582 . 0.01101 0.01261 0.129773 4.78168 0.04482 . 0.01220 0.163604 3.85195 0.08368 0.01266 0.01563 0.03216
Selection Results
Proc Reg
proc reg data=a1; model lsurv= blood prog enz liver/ selection=cp aic sbc b best=3;run;
Selection ResultsNumber in
Model C(p) R-Square AIC SBC3 3.3905 0.7573 -146.1609 -138.20494
4 5.0000 0.7592 -144.5895 -134.64461
3 11.4237 0.7178 -138.0232 -130.06723
WARNING: “selection=cp” just lists the models in order based on lowest C(p), regardless of whether it is good or not
How to Choose with C(p)
1. Want small C(p)
2. Want C(p) near p
In original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria
Can be somewhat subjective when determining “near”
Proc Regproc reg data=a1 outest=b1; model lsurv=blood prog enz liver/ selection=rsquare cp aic sbc b;run;quit; symbol1 v=circle i=none;symbol2 v=none i=join;proc gplot data=b1; plot _Cp_*_P_ _P_*_P_ / overlay;run;
Creates data set with estimates & criteria
Mallows C(p)
0102030405060708090
100110120130140150
Number of parameters in model
2 3 4 5
Start to approach C(p)=p line here
Model Validation
• Since data used to generate parameter estimates, you’d expect model to predict fitted Y’s well
• Want to check model predictive ability for a separate data set
• Various techniques of cross validation (data split, leave one out) are possible
Regression Diagnostics
• Partial regression plots• Studentized deleted residuals• Hat matrix diagonals• Dffits, Cook’s D, DFBETAS• Variance inflation factor• Tolerance
KNNL Example
• Page 386, Section 10.1
• Y is amount of life insurance
• X1 is average annual income
• X2 is a risk aversion score
• n = 18 managers
Read in the data set
data a1; infile ‘../data/ch10ta01.txt'; input income risk insur;
Partial regression plots
• Also called added variable plots or adjusted variable plots
• One plot for each Xi
Partial regression plots• These plots show the strength of the
marginal relationship between Y and Xi in
the full model .
• They can also detect
–Nonlinear relationships
–Heterogeneous variances
–Outliers
Partial regression plots
• Consider plot for X1
–Use the other X’s to predict Y –Use the other X’s to predict X1
–Plot the residuals from the first regression vs the residuals from the second regression
The partial option with proc reg and plots=
proc reg data=a1 plots=partialplot; model insur=income risk /partial;run;
OutputAnalysis of Variance
Source DFSum of
SquaresMean
Square F Value Pr > FModel 2 173919 86960 542.33 <.0001Error 15 2405.1476 160.3431 Corrected Total 17 176324
Root MSE 12.66267 R-Square 0.9864Dependent Mean 134.44444 Adj R-Sq 0.9845Coeff Var 9.41851
Output
Parameter Estimates
Variable DFParameter
EstimateStandard
Error t Value Pr > |t| ToleranceIntercept 1 -205.71866 11.39268 -18.06 <.0001 .
income 1 6.28803 0.20415 30.80 <.0001 0.93524
risk 1 4.73760 1.37808 3.44 0.0037 0.93524
Output
• The partial option on the model statement in proc reg generates graphs in the output window
• These are ok for some purposes but we prefer better looking plots
• To generate these plots we follow the regression steps outlined earlier and use gplot or plots=partialplot
Partial regression plots
*partial regression plot for risk;proc reg data=a1; model insur risk = income; output out=a2 r=resins resris;
symbol1 v=circle i=sm70;proc gplot data=a2; plot resins*resris;run;
The plot for risk
Partial plot for incomecode not shown
Residual plot (vs risk)
proc reg data=a1; model insur= risk income; output out=a2 r=resins;symbol1 v=circle i=sm70;
proc sort data=a2; by risk;proc gplot data=a2; plot resins*risk;run;
Residuals vs Risk
Residual plot (vs income)
proc sort data=a2; by income;proc gplot data=a2; plot resins*income;run;
Residuals vs Income
insur = -205.72 +6.288 income +4.7376 risk
N 18 Rsq 0.9864AdjRsq0.9845RMSE 12.663
Res
idua
l
-20
-15
-10
-5
0
5
10
15
20
25
income
25 30 35 40 45 50 55 60 65 70 75 80
insur = -205.72 +6.288 income +4.7376 risk
N 18 Rsq 0.9864AdjRsq0.9845RMSE 12.663
Res
idua
l
-20
-15
-10
-5
0
5
10
15
20
25
risk
1 2 3 4 5 6 7 8 9 10
Other “Residuals”
• There are several versions of residuals1. Our usual residuals
ei= Yi – 2. Studentized residuals
•
•Studentized means dividing by its standard error
•Are distributed t(n-p) ( ≈ Normal)
iY
)hMSE(1
ee
ii
i*i
Other “Residuals”
–Studentized deleted residuals•Delete case i and refit the model•Compute the predicted value for
case i using this refitted model•Compute the “studentized residual”•Don’t do this literally but this is the
concept
Studentized Deleted Residuals
• We use the notation (i) to indicate that case i has been deleted from the model fit computations
• di = Yi - is the deleted residual
• Turns out di = ei/(1-hii)
• Also Var di=(Var ei)/(1-hii)2=MSE(i)/(1- hii)
•
i(i)Y
)1(MSE/e ii(i)ii ht
Residuals
• When we examine the residuals, regardless of version, we are looking for
–Outliers
–Non-normal error distributions
– Influential observations
The r option and studentized residuals
proc reg data=a1; model insur=income risk/r;run;
Output
StudentObs Residual 1 -1.206 2 -0.910 3 2.121 4 -0.363 5 -0.210
The influence option and studentized deleted
residuals
proc reg data=a1; model insur=income risk /influence;run;
Output
Obs Residual RStudent 1 -14.7311 -1.2259 2 -10.9321 -0.9048 3 24.1845 2.4487 4 -4.2780 -0.3518 5 -2.5522 -0.2028 6 10.3417 1.0138
Hat matrix diagonals
• hii is a measure of how much Yi is contributing to the prediction of
• = hi1Y1 + hi2 Y2 + hi3Y3 + …• hii is sometimes called the leverage
of the ith observation• It is a measure of the distance
between the X values for the ith case and the means of the X values
iYiY
Hat matrix diagonals
• 0 ≤ hii ≤ 1• Σ(hii) = p• Large value of hii suggess that ith case
is distant from the center of all X’s • The average value is p/n• Values far from this average point to
cases that should be examined carefully
Influence option gives hat diagonals
Hat DiagObs H 1 0.0693 2 0.1006 3 0.1890 4 0.1316 5 0.0756
DFFITS
• A measure of the influence of case i on (a single case)
• Thus, it is closely related to hii
• It is a standardized version of the difference between computed with and without case i
• Concern if greater than 1 for small data sets or greater than for large data sets
iY
iY
np2
Cook’s Distance
• A measure of the influence of case i on all of the ’s (all the cases)
• It is a standardized version of the sum of squares of the differences between the predicted values computed with and without case I
• Compare with F(p,n-p)
• Concern if distance above 50%-tile
iY
DFBETAS
• A measure of the influence of case i on each of the regression coefficients
• It is a standardized version of the difference between the regression coefficient computed with and without case i
• Concern if DFBETA greater than 1 in small data sets or greater than for large data sets
n/2
Variance Inflation Factor• The VIF is related to the variance of
the estimated regression coefficients
• We calculate it for each explanatory variable
• One suggested rule is that a value of 10 or more for VIF indicates excessive multicollinearity
Tolerance
• TOL = (1-R2k) where R2
k is the squared multiple correlation obtained in a regression where all other explanatory variables are used to predict Xk
• TOL = 1/VIF
• Described in comment on p 410
Output
Variable Tolerance
Intercept .income 0.93524risk 0.93524
Full diagnostics
proc reg data=a1; model insur=income risk /r partial influence tol; id income risk; plot r.*(income risk);run;
Plot statement inside Reg
• Can generate several plots within Proc Reg
• Need to know symbol names
• Available in Table 1 once you click on plot command inside REG syntax
– r. represents usual residuals
– rstudent. represents deleted resids
–p. represents predicted values
Last slide
• We went over KNNL Chapters 9 and 10
• We used program topic18.sas to generate the output