model selection
![Page 1: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/1.jpg)
1
Model Selection
In multiple regression we often have many explanatory variables.
How do we find the “best” model?
![Page 2: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/2.jpg)
2
Model Selection
How can we select the set of explanatory variables that explains the most variation in the response, with each variable adding significantly to the model?
![Page 3: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/3.jpg)
3
Cruising Timber
Response: Mean Diameter at Breast Height (MDBH) of a tree.
Explanatory:
X1 = Mean Height of Pines
X2 = Age of Tract times the Number of Pines
X3 = Mean Height of Pines divided by the Number of Pines
![Page 4: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/4.jpg)
4
Forward Selection
Begin with no variables in the model.
At each step, check to see if you can add a variable to the model.
If you can, add the variable. If not, stop.
![Page 5: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/5.jpg)
5
Forward Selection – Step 1
Select the variable that has the highest correlation with the response.
If this correlation is statistically significant, add the variable to the model.
![Page 6: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/6.jpg)
6
JMP
Multivariate Methods, then Multivariate
Put MDBH, X1, X2, and X3 in the Y, Columns box.
![Page 7: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/7.jpg)
7
Multivariate
Correlations

| | MDBH | X1 | X2 | X3 |
|---|---|---|---|---|
| MDBH | 1.0000 | 0.7731 | 0.2435 | 0.8404 |
| X1 | 0.7731 | 1.0000 | 0.7546 | 0.6345 |
| X2 | 0.2435 | 0.7546 | 1.0000 | 0.0557 |
| X3 | 0.8404 | 0.6345 | 0.0557 | 1.0000 |

Scatterplot Matrix
[Figure: scatterplot matrix of MDBH, X1, X2, and X3]
![Page 8: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/8.jpg)
8
Correlation with response
Multivariate
Pairwise Correlations

| Variable | by Variable | Correlation | Count | Signif Prob |
|---|---|---|---|---|
| X1 | MDBH | 0.7731 | 20 | <.0001* |
| X2 | MDBH | 0.2435 | 20 | 0.3008 |
| X2 | X1 | 0.7546 | 20 | 0.0001* |
| X3 | MDBH | 0.8404 | 20 | <.0001* |
| X3 | X1 | 0.6345 | 20 | 0.0027* |
| X3 | X2 | 0.0557 | 20 | 0.8156 |
![Page 9: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/9.jpg)
9
Comment
The explanatory variable X3 has the highest correlation with the response MDBH: r = 0.8404.
The correlation between X3 and MDBH is statistically significant: Signif Prob < 0.0001, a small P-value.
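Step 1 can be sketched in a few lines of Python. This is only an illustration built from the correlations reported above; the function name `correlation_t` is hypothetical, and the t statistic uses the standard formula for testing a zero correlation with n - 2 degrees of freedom.

```python
import math

# Correlations of each candidate with the response MDBH, as reported
# in the Pairwise Correlations output.
corr_with_response = {"X1": 0.7731, "X2": 0.2435, "X3": 0.8404}
n = 20  # the Count column of the same output


def correlation_t(r, n):
    """t statistic for H0: correlation = 0, on n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Step 1: the candidate with the largest absolute correlation.
best = max(corr_with_response, key=lambda name: abs(corr_with_response[name]))
t_best = correlation_t(corr_with_response[best], n)
print(best, round(t_best, 2))
```

The sketch picks X3, and its t statistic is far from zero, which agrees with the small Signif Prob in the JMP output.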
![Page 10: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/10.jpg)
10
Step 1 - Action
Fit the simple linear regression of MDBH on X3.
Predicted MDBH = 3.896 + 32.937*X3
R² = 0.7063, RMSE = 0.4117
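As a small worked example, the fitted line can be evaluated directly, and in simple linear regression R² is the square of the correlation between MDBH and X3. The function name `predict_mdbh` and the input value 0.08 are hypothetical, chosen only because 0.08 lies inside the plotted range of X3.

```python
def predict_mdbh(x3):
    """Predicted MDBH from the fitted line reported on the slide."""
    return 3.896 + 32.937 * x3


r = 0.8404          # correlation between MDBH and X3
r_squared = r ** 2  # R² for the simple linear regression

x3_example = 0.08   # hypothetical value, not an observation from the data
prediction = predict_mdbh(x3_example)
print(round(r_squared, 4), round(prediction, 3))
```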
![Page 11: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/11.jpg)
11
SLR of MDBH on X3
Test of Model Utility: F = 43.2886, P-value < 0.0001
Statistical Significance of X3: t = 6.58, P-value < 0.0001
Exactly the same as the test for significant correlation.
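The equivalence can be seen from the standard formulas, written here with r for the correlation between MDBH and X3 and n for the number of observations:

$$
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad F = t^{2} = \frac{(n-2)\,r^{2}}{1-r^{2}}
$$

With r = 0.8404 and n = 20 this gives t ≈ 6.58 and F ≈ 43.28. The small gap from the reported F = 43.2886 is presumably because r is rounded to four decimals.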
![Page 12: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/12.jpg)
12
Can we do better?
Can we explain more variation in MDBH by adding one of the other variables to the model with X3?
Will that addition be statistically significant?
![Page 13: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/13.jpg)
13
Forward Selection – Step 2
Which variable should we add, X1 or X2?
How can we decide?
![Page 14: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/14.jpg)
14
Correlation among explanatory variables
Multivariate
Pairwise Correlations

| Variable | by Variable | Correlation | Count | Signif Prob |
|---|---|---|---|---|
| X1 | MDBH | 0.7731 | 20 | <.0001* |
| X2 | MDBH | 0.2435 | 20 | 0.3008 |
| X2 | X1 | 0.7546 | 20 | 0.0001* |
| X3 | MDBH | 0.8404 | 20 | <.0001* |
| X3 | X1 | 0.6345 | 20 | 0.0027* |
| X3 | X2 | 0.0557 | 20 | 0.8156 |
![Page 15: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/15.jpg)
15
Multicollinearity
Because some explanatory variables are correlated, they may carry overlapping information about the response.
You can’t rely on the simple correlations between the explanatory variables and the response to tell you which variable to add.
![Page 16: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/16.jpg)
16
Forward selection – Step 2
Look at partial residual plots.
Determine statistical significance.
![Page 17: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/17.jpg)
17
Partial Residual Plots
Look at the residuals from the SLR of Y on X3 plotted against the other variables once the overlapping information with X3 has been removed.
![Page 18: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/18.jpg)
18
How is this done?
Fit MDBH versus X3 and obtain residuals: Resid(Y on X3)
Fit X1 versus X3 and obtain residuals: Resid(X1 on X3)
Fit X2 versus X3 and obtain residuals: Resid(X2 on X3)
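The residual-on-residual recipe above can be sketched in plain Python. The data below are made-up illustrative values, not the timber data, and the helper names (`slr_residuals`, `pearson`) are hypothetical. The sketch also checks that the correlation between the two sets of residuals equals the first-order partial correlation. Plots of one set of residuals against the other are often called added-variable plots elsewhere.

```python
import math


def mean(values):
    return sum(values) / len(values)


def slr_residuals(y, x):
    """Residuals from the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def pearson(u, v):
    """Pearson correlation of two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    num = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    den = math.sqrt(sum((a - u_bar) ** 2 for a in u)
                    * sum((b - v_bar) ** 2 for b in v))
    return num / den


# Made-up values for illustration only (NOT the timber data).
y = [5.1, 5.8, 6.4, 6.0, 7.2, 6.9]
x1 = [30.0, 34.0, 41.0, 37.0, 50.0, 46.0]
x3 = [0.05, 0.06, 0.08, 0.07, 0.11, 0.09]

# Remove the information shared with X3 from both Y and X1.
resid_y = slr_residuals(y, x3)
resid_x1 = slr_residuals(x1, x3)
r_resid = pearson(resid_y, resid_x1)

# The same quantity from the first-order partial correlation formula.
r_y1, r_y3, r_13 = pearson(y, x1), pearson(y, x3), pearson(x1, x3)
r_partial = (r_y1 - r_y3 * r_13) / math.sqrt((1 - r_y3 ** 2) * (1 - r_13 ** 2))
print(r_resid, r_partial)
```

The two printed numbers agree up to floating-point error, which is why the correlations between residuals can be read as partial correlations given X3.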
![Page 19: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/19.jpg)
19
[Figure: scatterplot matrix of Resid(YonX3), Resid(X1onX3), and Resid(X2onX3)]
![Page 20: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/20.jpg)
20
Correlations

| | Resid(YonX3) | Resid(X1onX3) | Resid(X2onX3) |
|---|---|---|---|
| Resid(YonX3) | 1.0000 | 0.5726 | 0.3636 |
| Resid(X1onX3) | 0.5726 | 1.0000 | 0.9320 |
| Resid(X2onX3) | 0.3636 | 0.9320 | 1.0000 |
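These residual correlations are partial correlations given X3, so they can be approximately recovered from the original correlation matrix without the raw data. The sketch below assumes the four-decimal correlations reported in the JMP output; the helper names are hypothetical, and small differences in the last digit are expected from rounding.

```python
import math

# Pairwise correlations as reported in the JMP output.
r = {
    ("MDBH", "X1"): 0.7731, ("MDBH", "X2"): 0.2435, ("MDBH", "X3"): 0.8404,
    ("X1", "X2"): 0.7546, ("X1", "X3"): 0.6345, ("X2", "X3"): 0.0557,
}


def corr(a, b):
    """Look up a correlation regardless of the order of the pair."""
    return r[(a, b)] if (a, b) in r else r[(b, a)]


def partial_corr(a, b, given):
    """First-order partial correlation of a and b, adjusting for `given`."""
    num = corr(a, b) - corr(a, given) * corr(b, given)
    den = math.sqrt((1 - corr(a, given) ** 2) * (1 - corr(b, given) ** 2))
    return num / den


p_y1 = partial_corr("MDBH", "X1", "X3")  # compare with 0.5726
p_y2 = partial_corr("MDBH", "X2", "X3")  # compare with 0.3636
p_12 = partial_corr("X1", "X2", "X3")    # compare with 0.9320
print(round(p_y1, 4), round(p_y2, 4), round(p_12, 4))
```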
![Page 21: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/21.jpg)
21
Comment
The residuals (unexplained variation in the response) from the SLR of MDBH on X3 have the highest correlation with X1 (0.5726, versus 0.3636 for X2) once we have adjusted for the overlapping information with X3.
![Page 22: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/22.jpg)
22
Statistical Significance
Does X1 add significantly to the model that already contains X3?
t = 2.88, P-value = 0.0104
F = 8.29, P-value = 0.0104
Because the P-value is small, X1 adds significantly to the model with X3.
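The t and F values for adding X1 follow from the partial correlation between MDBH and X1 given X3 (0.5726 in the residual correlation table). A minimal sketch, assuming the usual degrees of freedom n - p - 2 when one variable joins a model that already has p predictors; the function name `t_to_enter` is hypothetical.

```python
import math


def t_to_enter(partial_r, n, p_in_model):
    """t statistic for adding one variable to a model with p_in_model predictors."""
    df = n - p_in_model - 2  # residual df after the new variable is added
    return partial_r * math.sqrt(df) / math.sqrt(1 - partial_r ** 2)


t_x1 = t_to_enter(0.5726, n=20, p_in_model=1)
f_x1 = t_x1 ** 2  # the partial F statistic is the square of t
print(round(t_x1, 2), round(f_x1, 2))
```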
![Page 23: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/23.jpg)
23
Summary
Step 1 – add X3: R² = 0.706
Step 2 – add X1 to X3: R² = 0.803
Can we do better?
![Page 24: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/24.jpg)
24
Forward Selection – Step 3
Does X2 add significantly to the model that already contains X3 and X1?
t = -2.79, P-value = 0.0131
F = 7.78, P-value = 0.0131
Because the P-value is small, X2 adds significantly to the model with X3 and X1.
![Page 25: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/25.jpg)
25
Summary
Step 1 – add X3: R² = 0.706
Step 2 – add X1 to X3: R² = 0.803
Step 3 – add X2 to X1 and X3: R² = 0.867
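The three steps summarized above can be reproduced, approximately, from the reported correlation matrix alone. The sketch below is not the JMP procedure: it computes R² for a set of predictors from the correlations and adds the variable with the largest partial F. The fixed `f_to_enter=4.0` threshold is an assumption standing in for the P-value check used in the slides, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


NAMES = ["MDBH", "X1", "X2", "X3"]  # index 0 is the response
CORR = [
    [1.0000, 0.7731, 0.2435, 0.8404],
    [0.7731, 1.0000, 0.7546, 0.6345],
    [0.2435, 0.7546, 1.0000, 0.0557],
    [0.8404, 0.6345, 0.0557, 1.0000],
]
N_OBS = 20


def r_squared(predictors):
    """R² of MDBH on the given predictor indexes, from correlations only."""
    if not predictors:
        return 0.0
    a = [[CORR[i][j] for j in predictors] for i in predictors]
    b = [CORR[i][0] for i in predictors]
    beta = solve(a, b)  # standardized regression coefficients
    return sum(w * c for w, c in zip(beta, b))


def forward_select(f_to_enter=4.0):
    """Forward selection; f_to_enter is an assumed threshold, not from the slides."""
    chosen, steps = [], []
    candidates = [1, 2, 3]
    while candidates:
        base = r_squared(chosen)
        df_resid = N_OBS - len(chosen) - 2
        scored = []
        for c in candidates:
            full = r_squared(chosen + [c])
            partial_f = (full - base) / ((1 - full) / df_resid)
            scored.append((partial_f, c, full))
        partial_f, c, full = max(scored)
        if partial_f < f_to_enter:
            break  # no remaining variable adds enough; stop
        chosen.append(c)
        candidates.remove(c)
        steps.append((NAMES[c], full, partial_f))
    return steps


steps = forward_select()
for name, r2, partial_f in steps:
    print(name, round(r2, 3), round(partial_f, 2))
```

Under these assumptions the variables enter in the order X3, X1, X2, with R² values close to 0.706, 0.803, and 0.867. The partial F values differ slightly from the JMP output because the correlations are rounded.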
![Page 26: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/26.jpg)
26
Summary
At each step the variable being added is statistically significant.
Has the forward selection procedure found the “best” model?
![Page 27: Model Selection](https://reader036.vdocuments.us/reader036/viewer/2022062309/56814bfe550346895db8fc73/html5/thumbnails/27.jpg)
27
“Best” Model?
The model with all three variables is useful: F = 34.83, P-value < 0.0001.
The variable X3 does not add significantly to the model with just X1 and X2: t = 0.41, P-value = 0.6844.
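A rough check of this last point: the t statistic for each variable in the full model can be approximated from its partial correlation with MDBH given the other two variables. The sketch uses the standard recursive partial-correlation formula on the rounded correlations, so the values only approximate the JMP output; the function names are hypothetical.

```python
import math

NAMES = ["MDBH", "X1", "X2", "X3"]  # index 0 is the response
CORR = [
    [1.0000, 0.7731, 0.2435, 0.8404],
    [0.7731, 1.0000, 0.7546, 0.6345],
    [0.2435, 0.7546, 1.0000, 0.0557],
    [0.8404, 0.6345, 0.0557, 1.0000],
]


def partial_corr(i, j, given):
    """Partial correlation of variables i and j given a list of other indexes."""
    if not given:
        return CORR[i][j]
    k, rest = given[0], given[1:]
    r_ij = partial_corr(i, j, rest)
    r_ik = partial_corr(i, k, rest)
    r_jk = partial_corr(j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def t_given_others(var, n=20):
    """t statistic for `var` in the model containing all three predictors."""
    others = [v for v in (1, 2, 3) if v != var]
    r = partial_corr(0, var, others)
    df = n - 4  # intercept plus three slopes
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)


t_stats = {NAMES[v]: t_given_others(v) for v in (1, 2, 3)}
for name, t in t_stats.items():
    print(name, round(t, 2))
```

The value for X3 comes out close to the reported t = 0.41, and the value for X2 close to the Step 3 value of -2.79. So although X3 entered first, it contributes little once X1 and X2 are both in the model.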