Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chapter 10
[Figure 10.1 diagram: a training sample feeds classifier G_1(x); successive weighted samples feed G_2(x), G_3(x), ..., G_M(x); the final classifier is $G(x) = \operatorname{sign}\bigl[\sum_{m=1}^{M} \alpha_m G_m(x)\bigr]$.]
FIGURE 10.1. Schematic of AdaBoost. Classifiers are trained on weighted versions of the dataset, and then combined to produce a final prediction.
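The scheme in this figure is short enough to sketch directly. Below is a minimal AdaBoost.M1 implementation using scikit-learn decision stumps as the weak learners G_m; the function names (fit_adaboost, predict_adaboost) and the n_rounds default are illustrative, not from the book.

```python
# Minimal AdaBoost.M1 sketch matching Figure 10.1: stumps are fit to
# reweighted versions of the training sample and combined by weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, n_rounds=400):
    """y must be coded +1/-1; returns the stumps G_m and their weights alpha_m."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # start from the unweighted training sample
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)   # fit G_m to the weighted sample
        miss = stump.predict(X) != y
        err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)  # weighted error (w sums to 1)
        alpha = np.log((1.0 - err) / err)  # classifier weight alpha_m
        w *= np.exp(alpha * miss)          # up-weight the misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict_adaboost(stumps, alphas, X):
    # Final classifier: G(x) = sign(sum_m alpha_m * G_m(x))
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```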
[Figure 10.2 plot: Test Error (0.0 to 0.5) versus Boosting Iterations (0 to 400); reference lines: Single Stump, 244-Node Tree.]
FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 244-node classification tree.
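For reference, the simulated example (10.2) of the text draws ten independent standard Gaussian features and labels a point +1 when its squared norm exceeds the median of a chi-squared variate with 10 degrees of freedom (about 9.34), with 2000 training and 10,000 test observations. A sketch of that generator (make_data_10_2 is an illustrative name):

```python
# Generator for the simulated data of equation (10.2).
import numpy as np
from scipy.stats import chi2

def make_data_10_2(n, rng):
    X = rng.standard_normal((n, 10))       # X_1, ..., X_10 ~ N(0, 1)
    y = np.where((X ** 2).sum(axis=1) > chi2.ppf(0.5, df=10), 1, -1)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_data_10_2(2000, rng)    # sizes used in the book's example
X_test, y_test = make_data_10_2(10000, rng)
```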
[Figure 10.3 plot: Training Error (0.0 to 1.0) versus Boosting Iterations (0 to 400); curves: Misclassification Rate, Exponential Loss.]
FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss: $(1/N)\sum_{i=1}^{N} \exp(-y_i f(x_i))$. After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.
[Figure 10.4 plot: Loss (0.0 to 3.0) versus margin y · f (−2 to 2); curves: Misclassification, Exponential, Binomial Deviance, Squared Error, Support Vector.]
FIGURE 10.4. Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification: $I(\operatorname{sign}(f) \ne y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).
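As functions of the margin m = y·f, the five losses are easy to write down. A sketch (with y = ±1, squared error (y − f)² equals (1 − yf)², and the deviance must be divided by log 2 to pass through (0, 1) as in the figure):

```python
# The five two-class losses of Figure 10.4 as functions of the margin m = y*f.
import numpy as np

def misclassification(m): return (m <= 0).astype(float)    # I(sign(f) != y)
def exponential(m):       return np.exp(-m)                 # exp(-yf)
def binomial_deviance(m): return np.log1p(np.exp(-2 * m))   # log(1 + exp(-2yf))
def squared_error(m):     return (1 - m) ** 2               # (y - f)^2 for y = +/-1
def support_vector(m):    return np.maximum(0.0, 1 - m)     # (1 - yf)_+
```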
[Figure 10.5 plot: Loss (0 to 8) versus y − f (−3 to 3); curves: Squared Error, Absolute Error, Huber.]
FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin y − f. The Huber loss function combines the good properties of squared-error loss near zero and absolute-error loss when |y − f| is large.
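A sketch of the Huber criterion, which is quadratic for small residuals r = y − f and linear beyond a threshold delta (the default delta=1.0 here is arbitrary):

```python
# Huber loss: squared error near zero, absolute error (up to constants) in the tails.
import numpy as np

def huber(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, a ** 2, 2 * delta * a - delta ** 2)
```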
[Figure 10.6 chart: relative importance (0 to 100) of the spam predictors; the top-ranked predictors include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, and edu.]
[Figure 10.7 plots: four panels of Partial Dependence versus the frequencies of the predictors !, remove, edu, and hp.]
FIGURE 10.7. Partial dependence of log-odds of spam on four important predictors. The red ticks at the base of the plots are deciles of the input variable.
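Partial dependence has a simple brute-force estimate: clamp the chosen feature at each grid value and average the model's prediction over the training data. A sketch (partial_dependence is an illustrative name; decision_function gives the log-odds scale for a binomial-deviance boosted classifier):

```python
# Brute-force one-variable partial dependence, as plotted in Figure 10.7.
import numpy as np

def partial_dependence(model, X, j, grid):
    """Average prediction with feature j clamped to each value in grid."""
    Xc = X.copy()
    pd = []
    for v in grid:
        Xc[:, j] = v                                  # fix feature j everywhere
        pd.append(model.decision_function(Xc).mean()) # average over the data
    return np.array(pd)
```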
[Figure 10.8 surface: partial dependence (−1.0 to 1.0) over the joint frequencies of hp (0 to 3) and ! (0 to 1).]
FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a function of the joint frequencies of hp and the character !.
[Figure 10.9 plot: Test Error (0.0 to 0.4) versus Number of Terms (0 to 400); curves: Stumps, 10 Node, 100 Node, AdaBoost.]
FIGURE 10.9. Boosting with different sized trees, applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.
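A sketch of this comparison with scikit-learn's gradient boosting, whose classification loss is the binomial deviance. Tree size is controlled with max_leaf_nodes (J terminal nodes), and staged_predict traces the test error after every iteration; setting max_depth=None so that the leaf budget binds requires a reasonably recent scikit-learn, and the no-shrinkage learning_rate=1.0 is an illustrative choice.

```python
# Test-error curves for boosted trees of different sizes, as in Figure 10.9.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

curves = {}
for label, leaves in [("stumps", 2), ("10-node", 10), ("100-node", 100)]:
    gbm = GradientBoostingClassifier(max_leaf_nodes=leaves, max_depth=None,
                                     n_estimators=400, learning_rate=1.0)
    gbm.fit(X_train, y_train)
    curves[label] = [np.mean(pred != y_test)
                     for pred in gbm.staged_predict(X_test)]
```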
[Figure 10.10 panels: Coordinate Functions for Additive Logistic Trees, f_1(x_1) through f_10(x_10).]
FIGURE 10.10. Coordinate functions estimated by boosting stumps for the simulated example used in Figure 10.9. The true quadratic functions are shown for comparison.
[Figure 10.11 plots: four panels of Test Set Deviance and Test Set Misclassification Error versus Boosting Iterations (0 to 2000). Top row: Stumps, No shrinkage versus Shrinkage = 0.2; bottom row: 6-Node Trees, No shrinkage versus Shrinkage = 0.6.]
FIGURE 10.11. Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, either stumps or six terminal-node trees, and with or without shrinkage. The left panels report test deviance, while the right panels report misclassification error.
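Shrinkage scales each tree's contribution, $f_m(x) = f_{m-1}(x) + \nu \, T_m(x)$; in scikit-learn the factor ν is the learning_rate parameter, with ν = 1.0 meaning no shrinkage. A sketch of the two stump models compared in the top row:

```python
# With and without shrinkage, as in the top panels of Figure 10.11.
from sklearn.ensemble import GradientBoostingClassifier

no_shrink = GradientBoostingClassifier(max_depth=1, n_estimators=2000,
                                       learning_rate=1.0).fit(X_train, y_train)
shrunk    = GradientBoostingClassifier(max_depth=1, n_estimators=2000,
                                       learning_rate=0.2).fit(X_train, y_train)
```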
[Figure 10.12 plots: left panel, Test Set Deviance versus Boosting Iterations (0 to 1000) for 4-Node Trees; right panel, Test Set Absolute Error. Curves: No shrinkage; Shrink = 0.1; Sample = 0.5; Shrink = 0.1, Sample = 0.5.]
FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity. For the curves labeled “Sample = 0.5”, a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right panel, using squared-error loss.
[Figure 10.13 plot: Training and Test Absolute Error (0.0 to 0.8) versus Iterations M (0 to 800); curves: Train Error, Test Error.]
FIGURE 10.13. Average absolute error as a function of the number of iterations for the California housing data.
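A sketch of these curves with scikit-learn's bundled copy of the California housing data (the book used the original StatLib version; the loss and tree-size settings below are illustrative, and loss="absolute_error" requires scikit-learn 1.0 or later):

```python
# Train/test mean absolute error per boosting iteration, as in Figure 10.13.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

Xh, yh = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(Xh, yh, test_size=0.2, random_state=0)

reg = GradientBoostingRegressor(loss="absolute_error", max_leaf_nodes=6,
                                max_depth=None, n_estimators=800,
                                learning_rate=0.1).fit(X_tr, y_tr)
train_mae = [np.mean(np.abs(p - y_tr)) for p in reg.staged_predict(X_tr)]
test_mae  = [np.mean(np.abs(p - y_te)) for p in reg.staged_predict(X_te)]
```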
[Figure 10.14 chart: Relative importance (0 to 100) of MedInc, Longitude, AveOccup, Latitude, HouseAge, AveRooms, AveBedrms, and Population.]
FIGURE 10.14. Relative importance of the predictors for the California housing data.
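With a fitted scikit-learn model, per-feature importances are available directly; the book's convention rescales so that the largest equals 100. A sketch continuing from the housing fit above:

```python
# Relative importances as in Figure 10.14 (largest predictor scaled to 100).
import numpy as np
from sklearn.datasets import fetch_california_housing

names = fetch_california_housing().feature_names
rel = 100.0 * reg.feature_importances_ / reg.feature_importances_.max()
for i in np.argsort(rel)[::-1]:
    print(f"{names[i]:12s} {rel[i]:6.1f}")
```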
[Figure 10.15 plots: four panels of Partial Dependence versus MedInc (2 to 10), AveOccup (2 to 5), HouseAge (10 to 50), and AveRooms (4 to 10).]
FIGURE 10.15. Partial dependence of housing value on the non-location variables for the California housing data. The red ticks at the base of the plots are deciles of the input variables.
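scikit-learn's inspection module can draw these panels directly from the fitted model; a sketch (feature names follow fetch_california_housing's column order):

```python
# One-variable partial-dependence panels, as in Figure 10.15.
from sklearn.datasets import fetch_california_housing
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    reg, X_tr, features=["MedInc", "AveOccup", "HouseAge", "AveRooms"],
    feature_names=fetch_california_housing().feature_names)
```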
[Figure 10.16 surface: partial dependence (0.0 to 1.0) over AveOccup (2 to 5) and HouseAge (10 to 50).]
FIGURE 10.16. Partial dependence of house value on median age and average occupancy. There appears to be a strong interaction effect between these two variables.
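Passing a pair of features to the same display produces the joint partial-dependence surface used to diagnose this interaction; a sketch:

```python
# Two-variable partial dependence, as in Figure 10.16.
from sklearn.datasets import fetch_california_housing
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    reg, X_tr, features=[("AveOccup", "HouseAge")],
    feature_names=fetch_california_housing().feature_names)
```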
[Figure 10.17 map: partial dependence (−1.0 to 1.0) over Longitude (−124 to −114) and Latitude (34 to 42).]
FIGURE 10.17. Partial dependence of median house value on location in California. One unit is $100,000, at 1990 prices, and the values plotted are relative to the overall median of $180,000.
FIGURE 10.18. Map of New Zealand and its surrounding exclusive economic zone, showing the locations of the sampling sites for the fisheries data.
[Figure 10.19 plots: left panel, Mean Deviance (0.24 to 0.34) versus Number of Trees (0 to 1500), curves GBM Test, GBM CV, GAM Test; right panel, ROC curves of Sensitivity versus Specificity, with AUC 0.97 (GAM) and 0.98 (GBM).]
FIGURE 10.19. The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (and 1 × s.e. bars), and test deviance on the test data. Also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves on the test data for the chosen GBM model (vertical line in left plot) and the GAM model.
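The ROC panel plots sensitivity against specificity. With held-out labels and predicted presence probabilities (y_true and scores below are stand-ins, since the fisheries data is not publicly bundled), the curve and AUC follow directly:

```python
# ROC curve and AUC as in the right panel of Figure 10.19.
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(y_true, scores)    # scores = predicted P(presence)
sensitivity, specificity = tpr, 1.0 - fpr  # the figure's two axes
auc = roc_auc_score(y_true, scores)
```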
[Figure 10.20 panels: relative influence (0 to 25) of TempResid, AvgDepth, SusPartMatter, SalResid, SSTGrad, ChlaCase2, Slope, TidalCurr, Pentade, CodendSize, DisOrgMatter, Distance, Speed, and OrbVel; partial-dependence panels for TempResid, AvgDepth, SusPartMatter, SalResid, and SSTGrad.]
FIGURE 10.20. The top-left panel shows the relative influence computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the leading five variables, all plotted on the same scale for comparison.
FIGURE 10.21. Geospatial prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.
[Figure 10.22 chart: Error Rate (0.0 to 1.0) for the occupations Sales, Unemployed, Military, Clerical, Labor, Homemaker, Prof/Man, Retired, and Student; Overall Error Rate = 0.425.]
FIGURE 10.22. Error rate for each occupation in the demographics data.
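Per-class error rates like these come straight from a row-normalized confusion matrix; a sketch (y_true and y_pred stand in for the demographics fit):

```python
# Per-class and overall error rates, as reported in Figure 10.22.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)  # one rate per true class
overall_error = 1.0 - np.trace(cm) / cm.sum()         # 0.425 in the book's fit
```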
[Figure 10.23 chart: Relative Importance (0 to 100) of age, income, edu, hsld-stat, mar-dlinc, sex, ethnic, mar-stat, typ-home, lang, num-hsld, children, and yrs-BA.]
FIGURE 10.23. Relative importance of the predictors, averaged over all classes, for the demographics data.
[Figure 10.24 charts: four panels of Relative Importance (0 to 100), one each for Class = Retired, Class = Student, Class = Prof/Man, and Class = Homemaker.]
FIGURE 10.24. Predictor variable importances separately for each of the four classes with lowest error rate for the demographics data.
[Figure 10.25 plots: three panels of Partial Dependence versus age (coded 1 to 7), for the classes Retired, Student, and Prof/Man.]
FIGURE 10.25. Partial dependence of the odds of three different occupations on age, for the demographics data.