lecture 4 - logistic regression + residual analysiscr173/sta444_sp18/slides/lec04/lec04.pdf ·...
TRANSCRIPT
![Page 1: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/1.jpg)
Lecture 4Logistic Regression + Residual Analysis
Colin Rundel1/29/2018
1
![Page 2: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/2.jpg)
Background
Today we’ll be looking at data on the presence and absence of theshort-finned eel (Anguilla australis) at a number of sites in New Zealand.
These data come from
• Leathwick, J. R., Elith, J., Chadderton, W. L., Rowe, D. and Hastie, T. (2008),Dispersal, disturbance and the contrasting biogeographies of NewZealand’s diadromous and non-diadromous fish species. Journal ofBiogeography, 35: 1481–1497.
2
![Page 3: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/3.jpg)
Species Distribution
3
![Page 4: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/4.jpg)
Codebook:
• presence - presence (1) or absence (0) of Anguilla australis at thesampling location
• SegSumT - Summer air temperature (degrees C)• DSDist - Distance to coast (km)• DSMaxSlope - Maximum downstream slope (degrees)• USRainDays - days per month with rain greater than 25 mm• USSlope - average slope in the upstream catchment (degrees)• USNative - area with indigenous forest (proportion)• DSDam - Presence of known downstream obstructions, mostly dams• Method - fishing method (electric, net, spot, trap, or mixture)• LocSed - weighted average of proportional cover of bed sediment
1. mud2. sand3. fine gravel4. coarse gravel5. cobble6. boulder7. bedrock 4
![Page 5: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/5.jpg)
Data
anguilla## # A tibble: 617 x 10## presence SegSumT DSDist DSMaxSlope USRainDays USSlope USNative DSDam## * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>## 1 1 18.7 133 1.15 1.15 8.30 0.340 0## 2 0 18.3 107 0.570 0.847 0.400 0 0## 3 0 16.7 167 1.72 0.210 0.400 0.220 1## 4 0 15.1 11.2 1.72 3.30 25.7 1.00 0## 5 0 12.7 42.4 2.86 0.430 9.60 0.0900 0## 6 1 18.2 94.4 3.43 0.847 20.5 0.920 0## 7 1 18.3 91.9 1.72 0.861 6.70 0.580 1## 8 1 17.1 6.80 0.520 0.620 0.700 0 0## 9 0 13.4 190 3.43 0.770 20.1 0.990 0## 10 0 13.1 224 6.84 0.290 9.80 0.980 0## # ... with 607 more rows, and 2 more variables: Method <fct>, LocSed <dbl>
5
![Page 6: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/6.jpg)
EDA
Corr:
−0.325
Corr:
−0.383
Corr:0.304
Corr:
0.0878
Corr:−0.297
Corr:
−0.0717
Corr:
−0.192
Corr:0.105
Corr:
0.341
Corr:
0.0933
Corr:
−0.318
Corr:0.133
Corr:
0.346
Corr:
0.223
Corr:0.637
Corr:
−0.319
Corr:0.0896
Corr:
0.248
Corr:
0.112
Corr:0.379
Corr:
0.396
presence SegSumT DSDist DSMaxSlopeUSRainDays USSlope USNative DSDam Method LocSed presenceSegSum
TDS
DistD
SM
axSlope
US
RainD
aysU
SS
lopeUS
NativeD
SD
amM
ethodLocS
ed
02040600204060 12.515.017.520.00100200300 0 5 101520 1 2 3 0 1020300.000.250.500.751.000204060020406002040600204060020406002040600204060 2 4 6
0100200300400500
12.515.017.520.0
0100200300400
05
101520
123
0102030
0.000.250.500.751.00
0100200300400
0100200300400
01002003004000100200300400010020030040001002003004000100200300400
246
6
![Page 7: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/7.jpg)
EDA (part 1)
Corr:
−0.325
Corr:
−0.383
Corr:
0.304
Corr:
0.0878
Corr:
−0.297
Corr:−0.0717
presence SegSumT DSDist DSMaxSlope USRainDayspresence
SegS
umT
DS
Dist
DS
MaxS
lopeU
SR
ainDays
0 204060 0 204060 12.5 15.0 17.5 20.00 100 200 300 0 5 10 15 20 1 2 3
0100200300400500
12.5
15.0
17.5
20.0
0
100
200
300
400
05
101520
1
2
3
7
![Page 8: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/8.jpg)
EDA (part 2)
Corr:
0.637
Corr:
0.379
Corr:
0.396
presence USSlope USNative DSDam Method LocSedpresence
US
Slope
US
Native
DS
Dam
Method
LocSed
0 2040600 204060 0 10 20 30 0.000.250.500.751.000 2040600 20406002040600204060020406002040600204060 2 4 6
0100200300400500
0102030
0.000.250.500.751.00
0100200300400
0100200300400
0100200300400
0100200300400
0100200300400
0100200300400
0100200300400
2
4
6
8
![Page 9: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/9.jpg)
EDA (part 1)
Corr:
−0.325
Corr:
−0.383
Corr:
0.304
Corr:
0.0878
Corr:
−0.297
Corr:−0.0717
presence SegSumT DSDist DSMaxSlope USRainDayspresence
SegS
umT
DS
Dist
DS
MaxS
lopeU
SR
ainDays
0 204060 0 204060 12.5 15.0 17.5 20.00 100 200 300 0 5 10 15 20 1 2 3
0100200300400500
12.5
15.0
17.5
20.0
0
100
200
300
400
05
101520
1
2
3
9
![Page 10: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/10.jpg)
EDA (part 3)
presence SegSumTpresence
SegS
umT
0 10 20 30 40 50 0 10 20 30 40 50 12.5 15.0 17.5 20.0
0
100
200
300
400
500
12.5
15.0
17.5
20.0
10
![Page 11: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/11.jpg)
Simple Model
11
![Page 12: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/12.jpg)
Model
inv_logit = function(x) 1/(1+exp(-x))
g = glm(presence~SegSumT, family=binomial, data=anguilla)summary(g)#### Call:## glm(formula = presence ~ SegSumT, family = binomial, data = anguilla)#### Deviance Residuals:## Min 1Q Median 3Q Max## -1.5755 -0.6260 -0.3452 -0.1299 3.0039#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) -16.74184 1.65897 -10.092 <2e-16 ***## SegSumT 0.90009 0.09413 9.562 <2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 621.91 on 616 degrees of freedom## Residual deviance: 479.39 on 615 degrees of freedom## AIC: 483.39#### Number of Fisher Scoring iterations: 6
12
![Page 13: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/13.jpg)
Fit
d_g = anguilla %>%mutate(p_hat = predict(g, anguilla, type=”response”))
d_g_pred = data.frame(SegSumT = seq(11,25,by=0.1)) %>%modelr::add_predictions(g,”p_hat”) %>%mutate(p_hat = inv_logit(p_hat))
0.0
0.4
0.8
10 15 20 25
SegSumT
pres
ence
13
![Page 14: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/14.jpg)
Separation
ggplot(d_g, aes(x=p_hat, y=presence, color=as.factor(presence))) +geom_jitter(height=0.1, alpha=0.5) +labs(color=”presence”)
0.0
0.4
0.8
0.0 0.2 0.4 0.6
p_hat
pres
ence presence
0
1
14
![Page 15: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/15.jpg)
Residuals
d_g = d_g %>% mutate(resid = presence - p_hat)
ggplot(d_g, aes(x=SegSumT, y=resid)) +geom_point(alpha=0.1)
−0.5
0.0
0.5
1.0
12.5 15.0 17.5 20.0
SegSumT
resi
d
15
![Page 16: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/16.jpg)
Binned Residuals
d_g %>%mutate(bin = p_hat - (p_hat %% 0.05)) %>%group_by(bin) %>%summarize(resid_mean = mean(resid)) %>%ggplot(aes(y=resid_mean, x=bin)) +
geom_point()
−0.4
−0.3
−0.2
−0.1
0.0
0.1
0.0 0.2 0.4 0.6
bin
resi
d_m
ean
16
![Page 17: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/17.jpg)
Pearson Residuals
𝑟𝑖 = 𝑌𝑖 − 𝐸(𝑌𝑖)𝑉 𝑎𝑟(𝑌𝑖)
= 𝑌𝑖 − ̂𝑝𝑖̂𝑝𝑖(1 − ̂𝑝𝑖)
d_g = d_g %>% mutate(pearson = (presence - p_hat) / (p_hat * (1-p_hat)))
0
25
50
75
0.0 0.2 0.4 0.6
p_hat
pear
son
17
![Page 18: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/18.jpg)
Binned Pearson Residuals
−1.5
−1.0
−0.5
0.0
0.5
0.0 0.2 0.4 0.6
bin
pear
son_
mea
n
18
![Page 19: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/19.jpg)
Deviance Residuals
𝑑𝑖 = sign(𝑌𝑖 − ̂𝑝𝑖)√−2 (𝑌𝑖 log ̂𝑝𝑖 + (1 − 𝑌𝑖) log(1 − ̂𝑝𝑖))d_g = d_g %>%
mutate(deviance = sign(presence - p_hat) *sqrt(-2 * (presence*log(p_hat) + (1-presence)*log(1 - p_hat) )))
−1
0
1
2
3
0.0 0.2 0.4 0.6
p_hat
devi
ance
19
![Page 20: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/20.jpg)
Binned Deviance Residuals
−0.75
−0.50
−0.25
0.00
0.25
0.0 0.2 0.4 0.6
bin
devi
ance
_mea
n
20
![Page 21: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/21.jpg)
Checking Deviance
sum(d_g$deviance^2)## [1] 479.3914
glm(presence~SegSumT, family=binomial, data=anguilla)#### Call: glm(formula = presence ~ SegSumT, family = binomial, data = anguilla)#### Coefficients:## (Intercept) SegSumT## -16.7418 0.9001#### Degrees of Freedom: 616 Total (i.e. Null); 615 Residual## Null Deviance: 621.9## Residual Deviance: 479.4 AIC: 483.4
21
![Page 22: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/22.jpg)
All together
deviance
pearson
resid
0.0 0.2 0.4 0.6
−0.4
−0.3
−0.2
−0.1
0.0
0.1
−1.5
−1.0
−0.5
0.0
0.5
−0.75
−0.50
−0.25
0.00
0.25
bin
mea
n
type
resid
pearson
deviance
22
![Page 23: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/23.jpg)
Full Model
23
![Page 24: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/24.jpg)
Model
f = glm(presence~., family=binomial, data=anguilla)summary(f)#### Call:## glm(formula = presence ~ ., family = binomial, data = anguilla)#### Deviance Residuals:## Min 1Q Median 3Q Max## -2.10254 -0.53092 -0.27156 -0.08821 3.12463#### Coefficients:## Estimate Std. Error z value Pr(>|z|)## (Intercept) -11.554287 1.872102 -6.172 6.75e-10 ***## SegSumT 0.765864 0.103173 7.423 1.14e-13 ***## DSDist -0.002551 0.002103 -1.213 0.22523## DSMaxSlope -0.062525 0.063093 -0.991 0.32169## USRainDays -0.619025 0.227316 -2.723 0.00647 **## USSlope -0.041399 0.024657 -1.679 0.09315 .## USNative -0.607045 0.475456 -1.277 0.20169## DSDam -0.922073 0.483492 -1.907 0.05651 .## Methodmixture -0.231175 0.498189 -0.464 0.64263## Methodnet -1.229762 0.534845 -2.299 0.02149 *## Methodspo -1.493876 0.733468 -2.037 0.04168 *## Methodtrap -2.476408 0.628486 -3.940 8.14e-05 ***## LocSed -0.175944 0.098204 -1.792 0.07319 .## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### (Dispersion parameter for binomial family taken to be 1)#### Null deviance: 621.91 on 616 degrees of freedom## Residual deviance: 420.18 on 604 degrees of freedom## AIC: 446.18#### Number of Fisher Scoring iterations: 6
24
![Page 25: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/25.jpg)
Separation
0.0
0.4
0.8
0.0 0.2 0.4 0.6
p_hat
pres
ence presence
0
1
SegSumT Model
0.0
0.4
0.8
0.00 0.25 0.50 0.75
p_hat
pres
ence presence
0
1
Full Model
25
![Page 26: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/26.jpg)
Residuals vs fitted
resid
pearson
deviance
0.00 0.25 0.50 0.75
−1.0
−0.5
0.0
0.5
−3
−2
−1
0
1
−0.4
−0.2
0.0
0.2
bin
mea
n
type
deviance
pearson
resid
26
![Page 27: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/27.jpg)
Model Performance
27
![Page 28: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/28.jpg)
Confusion Tables
0.0
0.4
0.8
0.00 0.25 0.50 0.75
p_hat
pres
ence presence
0
1
Full Model
28
![Page 29: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/29.jpg)
Predictive Performance (ROC / AUC)
AUC = c(0.878, 0.824)
0.00
0.10
0.25
0.50
0.75
0.90
1.00
0.00 0.10 0.25 0.50 0.75 0.90 1.00
False positive fraction
True
pos
itive
frac
tion
model
Full Model
SegSumT Model
29
![Page 30: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/30.jpg)
Out of sample predictive performance
AUC = c(0.824, 0.748, 0.878, 0.828)
0.00
0.10
0.25
0.50
0.75
0.90
1.00
0.00 0.10 0.25 0.50 0.75 0.90 1.00
False positive fraction
True
pos
itive
frac
tion
type
Train
Test
model
SegSumT Model
Full Model
30
![Page 31: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/31.jpg)
What about something non-parametric?
31
![Page 32: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/32.jpg)
Gradient Boosting Model
y = anguilla$presence %>% as.integer()x = model.matrix(presence~.-1, data=anguilla)x_test = model.matrix(presence~.-1, data=anguilla_test)
xg = xgboost::xgboost(data=x, label=y, nthead=4, nround=25,objective=”binary:logistic”, verbose = FALSE)
MethodspoMethodnetDSDamMethodmixtureMethodtrapMethodelectricDSMaxSlopeUSSlopeUSNativeUSRainDaysLocSedDSDistSegSumT
Relative importance
0.0 0.2 0.4 0.6 0.8 1.00.0 0.2 0.4 0.6 0.8 1.0
32
![Page 33: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/33.jpg)
Residuals?
resid pearson deviance
0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75
−1.0
−0.5
0.0
0.5
1.0
−1
0
1
−0.25
0.00
0.25
bin
mea
n
type
resid
pearson
deviance
33
![Page 34: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/34.jpg)
Separation?
0.0
0.4
0.8
0.00 0.25 0.50 0.75 1.00
p_hat
pres
ence presence
0
1
34
![Page 35: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/35.jpg)
Effect of nround - Training Data
0.0
0.4
0.8
0.2 0.4 0.6 0.8
p_hat
pres
ence presence
0
1
XGBoost − 5 rounds − Training Data
0.0
0.4
0.8
0.00 0.25 0.50 0.75
p_hat
pres
ence presence
0
1
XGBoost − 10 rounds − Training Data
0.0
0.4
0.8
0.00 0.25 0.50 0.75 1.00
p_hat
pres
ence presence
0
1
XGBoost − 15 rounds − Training Data
0.0
0.4
0.8
0.00 0.25 0.50 0.75 1.00
p_hat
pres
ence presence
0
1
XGBoost − 25 rounds − Training Data
35
![Page 36: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/36.jpg)
Effect of nround - Test Data
0.0
0.4
0.8
0.2 0.4 0.6 0.8
p_hat
pres
ence presence
0
1
XGBoost − 5 rounds − Test Data
0.0
0.4
0.8
0.00 0.25 0.50 0.75
p_hat
pres
ence presence
0
1
XGBoost − 10 rounds − Test Data
0.0
0.4
0.8
0.00 0.25 0.50 0.75
p_hat
pres
ence presence
0
1
XGBoost − 15 rounds − Test Data
0.0
0.4
0.8
0.00 0.25 0.50 0.75 1.00
p_hat
pres
ence presence
0
1
XGBoost − 25 rounds − Test Data
36
![Page 37: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/37.jpg)
ROC Curves
0.00
0.10
0.25
0.50
0.75
0.90
1.00
0.00 0.10 0.25 0.50 0.75 0.90 1.00
False positive fraction
True
pos
itive
frac
tion
type
Train
Test
model
SegSumT Model
Full Model
XGBoost − 5 rounds
XGBoost − 10 rounds
XGBoost − 15 rounds
XGBoost − 25 rounds
37
![Page 38: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/38.jpg)
Aside: Species Distribution Modeling
38
![Page 39: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/39.jpg)
Model Choice
We have been fitting a model that looks like the following,
𝑦𝑖 ∼ Bern(𝑝𝑖)logit(𝑝𝑖) = 𝐗𝑖⋅𝜷
Interpretation of 𝑦𝑖 and 𝑝𝑖?
39
![Page 40: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/40.jpg)
Absence of evidence …
If we observe a species at a particular location what does that tell us?
If we don’t observe a species at a particular location what does that tell us?
40
![Page 41: Lecture 4 - Logistic Regression + Residual Analysiscr173/Sta444_Sp18/slides/Lec04/Lec04.pdf · 2018. 1. 30. · 400 500 12.5 15.0 17.5 20.0 0 100 200 300 400 0 5 10 15 20 1 2 3 0](https://reader033.vdocuments.us/reader033/viewer/2022060715/607b59a2f6b8f007f806d98a/html5/thumbnails/41.jpg)
Revised Model
If we allow for crypsis, then
𝑦𝑖 ∼ Bern(𝑞𝑖 𝑧𝑖)𝑧𝑖 ∼ Bern(𝑝𝑖)
logit(𝑞𝑖) = 𝐗𝑖⋅𝜸logit(𝑝𝑖) = 𝐗𝑖⋅𝜷
Interpretation of 𝑦𝑖, 𝑧𝑖, 𝑝𝑖, and 𝑞𝑖?
41