Topic 18: Model Selection and Diagnostics
Variable Selection
• We want to choose a “best” model from among subsets of the available explanatory variables
• Two separate problems:
  – How many explanatory variables should we use (i.e., the subset size)?
  – Given the subset size, which variables should we choose?
KNNL Example
• Page 350, Section 9.2
• Y: survival time of patient (liver operation)
• X’s (explanatory variables):
  – Blood clotting score
  – Prognostic index
  – Enzyme function test
  – Liver function test
KNNL Example cont.
• n = 54 patients
• Start with the usual plots and descriptive statistics
• Time-to-event data are often heavily skewed and typically transformed with a log
Data
    data a1;
      infile 'U:\.www\datasets512\CH09TA01.txt'
        delimiter='09'x;            /* tab delimited */
      input blood prog enz liver age gender
            alcmod alcheavy         /* dummy variables for alcohol use */
            surv lsurv;             /* lsurv = ln(surv) */
    run;
Data (first six observations)

    Obs  blood  prog  enz  liver  age  gender  alcmod  alcheavy  surv  lsurv
      1    6.7    62   81   2.59   50       0       1         0   695  6.544
      2    5.1    59   66   1.70   39       0       0         0   403  5.999
      3    7.4    57   83   2.16   55       0       0         0   710  6.565
      4    6.5    73   41   2.01   48       0       0         0   349  5.854
      5    7.8    65  115   4.30   45       0       0         1  2343  7.759
      6    5.8    38   72   1.42   65       1       1         0   348  5.852
Log Transform of Y
• Recall that the regression model does not require Y to be Normally distributed
• In this case, the transform reduces the influence of the long right tail and often stabilizes the variance of the residuals
Scatterplots

    proc corr plot=matrix;
      var blood prog enz liver;
    run;

    proc corr plot=scatter;
      var blood prog enz liver;
      with lsurv;
    run;
Correlation Summary
Pearson Correlation Coefficients, N = 54
Prob > |r| under H0: Rho=0

            blood     prog      enz       liver
    lsurv   0.24619   0.46994   0.65389   0.64926
            0.0727    0.0003    <.0001    <.0001
The Two Problems in Variable Selection
1. To determine an appropriate subset size
   – Might use adjusted R2, Cp, MSE, PRESS, AIC, SBC (BIC)
2. To determine the best model of this fixed size
   – Might use R2
Adjusted R2
• R2 by its construction is guaranteed to increase with p
  – SSE cannot increase when an X is added, and SSTO stays constant
• Adjusted R2 uses df to account for p:

    R2a = 1 − ((n − 1)/(n − p)) (SSE/SSTO) = 1 − MSE/MSTO
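The course code is SAS, but the formula itself is easy to check numerically. Here is a small Python sketch (an illustration, not part of the KNNL materials); the toy y and yhat values are made up:

```python
import numpy as np

def adjusted_r2(y, yhat, p):
    # R2_a = 1 - MSE/MSTO = 1 - ((n-1)/(n-p)) * SSE/SSTO,
    # where p counts all regression coefficients, including the intercept
    n = len(y)
    sse = np.sum((y - yhat) ** 2)
    ssto = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - (n - 1) / (n - p) * sse / ssto

# made-up fitted values for a tiny example
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
yhat = np.array([1.1, 1.9, 3.0, 4.2, 4.8])
r2a = adjusted_r2(y, yhat, p=2)
```

Unlike plain R2, this quantity can decrease when a variable that adds little is included, because the (n − 1)/(n − p) factor grows with p.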
Adjusted R2
• Want to find the model that maximizes R2a
• Since MSTO remains constant for a given data set (it depends only on Y), R2a carries information equivalent to MSE
• Thus we could also find the choice of model that minimizes MSE
• Details on pages 354-356
Cp Criterion
• The basic idea is to compare subset models with the full model
• A subset model is good if there is not substantial “bias” in the predicted values (relative to the full model)
• Looks at the ratio of total mean squared error and the true error variance
• See pages 357-359 for details
Cp Criterion
    Cp = SSE_p / MSE(Full) − (n − 2p)

• SSE_p is based on a specific choice of p − 1 explanatory variables
• MSE(Full) is based on the full set of variables
• If you select the full set, Cp = (n − p) − (n − 2p) = p
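Since the statistic is just a rescaled SSE comparison, it can be sketched in a few lines of Python (an illustration, not course code; the n and MSE numbers below are made up):

```python
def mallows_cp(sse_p, mse_full, n, p):
    # C_p = SSE_p / MSE(Full) - (n - 2p), where p is the number of
    # coefficients (including the intercept) in the subset model
    return sse_p / mse_full - (n - 2 * p)

# sanity check from the slide: for the full model itself,
# SSE = (n - p) * MSE(Full), so C_p reduces to p exactly
n, p_full, mse_full = 54, 5, 0.107          # made-up MSE value
sse_full = (n - p_full) * mse_full
cp_full = mallows_cp(sse_full, mse_full, n, p_full)
```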
Use of Cp
• p is the number of regression coefficients including the intercept
• A model is good according to this criterion if Cp ≤ p
• Rule: pick the smallest model for which Cp is smaller than p, or pick the model that minimizes Cp, provided the minimum Cp is smaller than p
SBC (BIC) and AIC
Criteria based on the log(likelihood) plus a penalty for model complexity
• AIC – minimize  n log(SSE_p) − n log(n) + 2p
• SBC – minimize  n log(SSE_p) − n log(n) + p log(n)
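The two criteria differ only in the penalty term, which is easy to see numerically. A small Python sketch (illustration only; the n and SSE values are made up):

```python
import math

def aic(sse, n, p):
    # AIC_p = n*log(SSE_p) - n*log(n) + 2p
    return n * math.log(sse) - n * math.log(n) + 2 * p

def sbc(sse, n, p):
    # SBC_p = n*log(SSE_p) - n*log(n) + p*log(n)
    return n * math.log(sse) - n * math.log(n) + p * math.log(n)

# for fixed SSE, adding one coefficient costs 2 under AIC
# but log(n) under SBC -- about 3.99 when n = 54
n, sse = 54, 12.5
aic_gap = aic(sse, n, 5) - aic(sse, n, 4)
sbc_gap = sbc(sse, n, 5) - sbc(sse, n, 4)
```

So once log(n) exceeds 2 (n larger than about 8), SBC penalizes extra variables more heavily than AIC and tends to favor smaller models.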
Other approaches
• PRESS (prediction SS)
  – For each case i, delete the case and predict Yi using a model based on the other n − 1 cases
  – Look at the SS for observed minus predicted
  – Want to minimize the PRESS
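The leave-one-out predictions do not actually require n refits: the deleted residual satisfies d_i = e_i/(1 − h_ii). A Python sketch of PRESS using that shortcut (illustration only; the simulated X and y are made up):

```python
import numpy as np

def press_statistic(X, y):
    # PRESS = sum of squared deleted residuals, using d_i = e_i/(1 - h_ii)
    # so no case actually has to be deleted; X includes an intercept column
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                 # ordinary residuals
    h = np.diag(H)                # hat-matrix diagonals
    return np.sum((e / (1.0 - h)) ** 2)

# simulated data for illustration
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=20)
press = press_statistic(X, y)
```

Because each term is inflated by 1/(1 − h_ii)², PRESS is always at least as large as the ordinary SSE.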
Variable Selection

• Additional proc reg model statement options useful in variable selection:
  – INCLUDE=n forces the first n explanatory variables into all models
  – BEST=n limits the output to the best n models of each subset size or in total
  – START=n limits output to models that include at least n explanatory variables
Variable Selection
• Step type procedures
–Forward selection (Step up)
–Backward elimination (Step down)
–Stepwise (forward selection with a backward glance)
• Very popular, but better search techniques such as BEST are now available
Ordering models of the same subset size
• Use R2 or SSE
• This approach can lead us to consider several models that give approximately the same predicted values
• May need to apply knowledge of the subject matter to make a final selection
• Not that important if prediction is the key goal
Proc Reg
    proc reg data=a1;
      model lsurv=blood prog enz liver /
        selection=rsquare cp aic sbc b best=3;
    run;
Selection Results

    Number in
    Model    R-Square    C(p)       AIC        SBC
      1       0.4276     66.4889  -103.8269   -99.84889
      1       0.4215     67.7148  -103.2615   -99.28357
      1       0.2208    108.5558   -87.1781   -83.20011
      2       0.6633     20.5197  -130.4833  -124.51634
      2       0.5995     33.5041  -121.1126  -115.14561
      2       0.5486     43.8517  -114.6583  -108.69138
      3       0.7573      3.3905  -146.1609  -138.20494
      3       0.7178     11.4237  -138.0232  -130.06723
      3       0.6121     32.9320  -120.8442  -112.88823
      4       0.7592      5.0000  -144.5895  -134.64461
Selection Results

    Parameter Estimates

    Number in
    Model    Intercept   blood     prog      enz       liver
      1      5.26426     .         .         0.01512   .
      1      5.61218     .         .         .         0.29819
      1      5.56613     .         0.01367   .         .
      2      4.35058     .         0.01412   0.01539   .
      2      5.02818     .         .         0.01073   0.20945
      2      4.54623     0.10792   .         0.01634   .
      3      3.76618     0.09546   0.01334   0.01645   .
      3      4.40582     .         0.01101   0.01261   0.12977
      3      4.78168     0.04482   .         0.01220   0.16360
      4      3.85195     0.08368   0.01266   0.01563   0.03216
Proc Reg
    proc reg data=a1;
      model lsurv=blood prog enz liver /
        selection=cp aic sbc b best=3;
    run;
Selection Results

    Number in
    Model    C(p)       R-Square    AIC        SBC
      3       3.3905     0.7573   -146.1609  -138.20494
      4       5.0000     0.7592   -144.5895  -134.64461
      3      11.4237     0.7178   -138.0232  -130.06723

WARNING: “selection=cp” just lists the models in order of lowest C(p), regardless of whether a model is good or not
How to Choose with C(p)
1. Want small C(p)
2. Want C(p) near p
In original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria
Can be somewhat subjective when determining “near”
Proc Reg

    proc reg data=a1 outest=b1;  /* outest= creates data set with estimates & criteria */
      model lsurv=blood prog enz liver /
        selection=rsquare cp aic sbc b;
    run;
    quit;

    symbol1 v=circle i=none;
    symbol2 v=none i=join;
    proc gplot data=b1;
      plot _Cp_*_P_ _P_*_P_ / overlay;
    run;
[Figure: Mallows C(p) plotted against the number of parameters in the model (p = 2 to 5), with C(p) running from 0 up to about 150; the points start to approach the C(p) = p line at p = 4]
Model Validation
• Since the data were used to generate the parameter estimates, you’d expect the model to predict those same Y’s well
• Want to check the model’s predictive ability on a separate data set
• Various techniques of cross-validation (data splitting, leave-one-out) are possible
Regression Diagnostics
• Partial regression plots
• Studentized deleted residuals
• Hat matrix diagonals
• DFFITS, Cook’s D, DFBETAS
• Variance inflation factor
• Tolerance
KNNL Example
• Page 386, Section 10.1
• Y is amount of life insurance
• X1 is average annual income
• X2 is a risk aversion score
• n = 18 managers
Read in the data set
    data a1;
      infile '../data/ch10ta01.txt';
      input income risk insur;
    run;
Partial regression plots
• Also called added variable plots or adjusted variable plots
• One plot for each Xi
Partial regression plots

• These plots show the strength of the marginal relationship between Y and Xi in the full model
• They can also detect
–Nonlinear relationships
–Heterogeneous variances
–Outliers
Partial regression plots
• Consider the plot for X1:
  – Use the other X’s to predict Y
  – Use the other X’s to predict X1
  – Plot the residuals from the first regression vs the residuals from the second regression
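These three steps can be mimicked directly in Python (illustration only, not course code; the simulated data are made up). A useful check is that the least-squares slope in the resulting plot equals the coefficient of X1 in the full model:

```python
import numpy as np

def added_variable_residuals(X, y, j):
    # regress y and X_j each on the OTHER explanatory variables (plus an
    # intercept); plotting ry against rx is the partial regression plot
    n = X.shape[0]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    H = others @ np.linalg.inv(others.T @ others) @ others.T
    return y - H @ y, X[:, j] - H @ X[:, j]

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = 1 + 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=40)
ry, rx = added_variable_residuals(X, y, 0)
slope = rx @ ry / (rx @ rx)   # slope of the partial regression plot
```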
The partial option with proc reg and plots=
    proc reg data=a1 plots=partialplot;
      model insur=income risk / partial;
    run;
Output

    Analysis of Variance

    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              2          173919         86960     542.33   <.0001
    Error             15       2405.1476      160.3431
    Corrected Total   17          176324

    Root MSE         12.66267    R-Square   0.9864
    Dependent Mean  134.44444    Adj R-Sq   0.9845
    Coeff Var         9.41851
Output
    Parameter Estimates

    Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance
    Intercept   1          -205.71866         11.39268     -18.06     <.0001           .
    income      1             6.28803          0.20415      30.80     <.0001     0.93524
    risk        1             4.73760          1.37808       3.44     0.0037     0.93524
Output
• The partial option on the model statement in proc reg generates graphs in the output window
• These are ok for some purposes but we prefer better looking plots
• To generate these plots we follow the regression steps outlined earlier and use gplot or plots=partialplot
Partial regression plots
    * partial regression plot for risk;
    proc reg data=a1;
      model insur risk = income;
      output out=a2 r=resins resris;

    symbol1 v=circle i=sm70;
    proc gplot data=a2;
      plot resins*resris;
    run;
The plot for risk
Partial plot for income (code not shown)
Residual plot (vs risk)
    proc reg data=a1;
      model insur=risk income;
      output out=a2 r=resins;

    symbol1 v=circle i=sm70;
    proc sort data=a2; by risk;
    proc gplot data=a2;
      plot resins*risk;
    run;
Residuals vs Risk
Residual plot (vs income)
    proc sort data=a2; by income;
    proc gplot data=a2;
      plot resins*income;
    run;
Residuals vs Income
[Figure: residuals vs income (25 to 80) for the fitted model insur = -205.72 + 6.288 income + 4.7376 risk; N 18, Rsq 0.9864, AdjRsq 0.9845, RMSE 12.663]
[Figure: residuals vs risk (1 to 10) for the fitted model insur = -205.72 + 6.288 income + 4.7376 risk; N 18, Rsq 0.9864, AdjRsq 0.9845, RMSE 12.663]
Other “Residuals”
• There are several versions of residuals
  1. Our usual residuals: ei = Yi − Ŷi
  2. Studentized residuals:

         ei* = ei / √(MSE(1 − hii))

     – Studentized means dividing a residual by its standard error
     – They are distributed t(n − p) (≈ Normal)
Other “Residuals”
• Studentized deleted residuals
  – Delete case i and refit the model
  – Compute the predicted value for case i using this refitted model
  – Compute the “studentized residual”
  – We don’t do this literally, but this is the concept
Studentized Deleted Residuals
• We use the notation (i) to indicate that case i has been deleted from the model fit computations
• di = Yi − Ŷi(i) is the deleted residual
• It turns out that di = ei/(1 − hii)
• Also Var(di) = Var(ei)/(1 − hii)² = MSE(i)/(1 − hii)
• ti = di/√(Var di) = ei/√(MSE(i)(1 − hii))
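The identities above mean the studentized deleted residual (RStudent) can be computed from a single fit. A Python sketch (illustration only, with made-up simulated data), which can be verified against an actual delete-and-refit:

```python
import numpy as np

def rstudent(X, y):
    # t_i = e_i / sqrt(MSE_(i) * (1 - h_ii)), where the deleted-case MSE
    # comes from MSE_(i) = ((n-p)*MSE - e_i**2/(1-h_ii)) / (n-p-1),
    # so case i never has to be literally removed; X includes an intercept
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y
    h = np.diag(H)
    mse = np.sum(e ** 2) / (n - p)
    mse_i = ((n - p) * mse - e ** 2 / (1 - h)) / (n - p - 1)
    return e / np.sqrt(mse_i * (1 - h))

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(18), rng.normal(size=(18, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=18)
t = rstudent(X, y)
```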
Residuals
• When we examine the residuals, regardless of version, we are looking for
–Outliers
–Non-normal error distributions
– Influential observations
The r option and studentized residuals
    proc reg data=a1;
      model insur=income risk / r;
    run;
Output
            Student
    Obs    Residual
      1      -1.206
      2      -0.910
      3       2.121
      4      -0.363
      5      -0.210
The influence option and studentized deleted residuals

    proc reg data=a1;
      model insur=income risk / influence;
    run;
Output
    Obs    Residual    RStudent
      1    -14.7311     -1.2259
      2    -10.9321     -0.9048
      3     24.1845      2.4487
      4     -4.2780     -0.3518
      5     -2.5522     -0.2028
      6     10.3417      1.0138
Hat matrix diagonals
• hii is a measure of how much Yi is contributing to the prediction of Ŷi
• Ŷi = hi1Y1 + hi2Y2 + hi3Y3 + …
• hii is sometimes called the leverage of the ith observation
• It is a measure of the distance between the X values for the ith case and the means of the X values
Hat matrix diagonals
• 0 ≤ hii ≤ 1
• Σ(hii) = p
• A large value of hii suggests that the ith case is distant from the center of all the X’s
• The average value is p/n
• Values far from this average point to cases that should be examined carefully
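These properties are easy to confirm numerically. A Python sketch (illustration only; the data and the specific 2p/n flagging cutoff are my additions — the slides just say "far from the average p/n"):

```python
import numpy as np

def leverage(X):
    # hat-matrix diagonals h_ii; X includes an intercept column
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
X[0, 1] = 8.0                    # plant one case far from the X center
h = leverage(X)
flagged = h > 2 * X.shape[1] / X.shape[0]   # common 2p/n rule of thumb
```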
Influence option gives hat diagonals
           Hat Diag
    Obs           H
      1      0.0693
      2      0.1006
      3      0.1890
      4      0.1316
      5      0.0756
DFFITS
• A measure of the influence of case i on Ŷi (a single fitted value)
• Thus, it is closely related to hii
• It is a standardized version of the difference between Ŷi computed with and without case i
• Concern if |DFFITS| is greater than 1 for small data sets, or greater than 2√(p/n) for large data sets
Cook’s Distance
• A measure of the influence of case i on all of the fitted values Ŷ (all the cases)
• It is a standardized version of the sum of squared differences between the predicted values computed with and without case i
• Compare with F(p, n − p)
• Concern if the distance is above the 50th percentile of that distribution
DFBETAS
• A measure of the influence of case i on each of the regression coefficients
• It is a standardized version of the difference between a regression coefficient computed with and without case i
• Concern if |DFBETAS| is greater than 1 in small data sets, or greater than 2/√n for large data sets
Variance Inflation Factor

• The VIF is related to the variance of the estimated regression coefficients
• We calculate it for each explanatory variable
• One suggested rule is that a value of 10 or more for VIF indicates excessive multicollinearity
Tolerance
• TOL = 1 − R²k, where R²k is the squared multiple correlation obtained in a regression where all the other explanatory variables are used to predict Xk
• TOL = 1/VIF
• Described in the comment on page 410
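The TOL = 1/VIF relationship can be demonstrated directly. A Python sketch (illustration only; the simulated predictors are made up):

```python
import numpy as np

def vif(X):
    # VIF_k = 1/(1 - R_k^2) = SSTO_k/SSE_k from regressing X_k on all the
    # other explanatory variables (with intercept); tolerance is 1/VIF
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        sse = np.sum((X[:, j] - others @ beta) ** 2)
        ssto = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = ssto / sse
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=50)
X = np.column_stack([x1, x2])
v = vif(X)
```

With just two predictors, both VIFs equal 1/(1 − r²), where r is their correlation. In the insurance example, TOL = 0.93524 corresponds to VIF ≈ 1.07, nowhere near the suggested trouble threshold of 10.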
Output
    Variable    Tolerance
    Intercept   .
    income      0.93524
    risk        0.93524
Full diagnostics
    proc reg data=a1;
      model insur=income risk / r partial influence tol;
      id income risk;
      plot r.*(income risk);
    run;
Plot statement inside Reg
• Can generate several plots within Proc Reg
• Need to know the symbol names
• They are listed in Table 1 once you click on the plot command inside the REG syntax documentation
  – r. represents the usual residuals
  – rstudent. represents the deleted residuals
  – p. represents the predicted values
Last slide
• We went over KNNL Chapters 9 and 10
• We used program topic18.sas to generate the output