Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chapter 10
[Figure 10.1 diagram: a training sample feeds classifier G_1(x); successive weighted samples feed G_2(x), G_3(x), ..., G_M(x); the final classifier is $G(x) = \operatorname{sign}\bigl[\sum_{m=1}^{M} \alpha_m G_m(x)\bigr]$.]
FIGURE 10.1. Schematic of AdaBoost. Classifiers are trained on weighted versions of the dataset, and then combined to produce a final prediction.
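The scheme in this figure is short enough to sketch directly. Below is a minimal AdaBoost.M1 implementation using scikit-learn decision stumps as the weak learners G_m; the function names (fit_adaboost, predict_adaboost) and the n_rounds default are illustrative, not from the book.

```python
# Minimal AdaBoost.M1 sketch matching Figure 10.1: stumps are fit to
# reweighted versions of the training sample and combined by weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, n_rounds=400):
    """y must be coded +1/-1; returns the stumps G_m and their weights alpha_m."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # start from the unweighted training sample
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)   # fit G_m to the weighted sample
        miss = stump.predict(X) != y
        err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)  # weighted error (w sums to 1)
        alpha = np.log((1.0 - err) / err)  # classifier weight alpha_m
        w *= np.exp(alpha * miss)          # up-weight the misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict_adaboost(stumps, alphas, X):
    # Final classifier: G(x) = sign(sum_m alpha_m * G_m(x))
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```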
[Figure 10.2 plot: Test Error (0.0 to 0.5) versus Boosting Iterations (0 to 400); reference lines: Single Stump, 244-Node Tree.]
FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 244-node classification tree.
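For reference, the simulated example (10.2) of the text draws ten independent standard Gaussian features and labels a point +1 when its squared norm exceeds the median of a chi-squared variate with 10 degrees of freedom (about 9.34), with 2000 training and 10,000 test observations. A sketch of that generator (make_data_10_2 is an illustrative name):

```python
# Generator for the simulated data of equation (10.2).
import numpy as np
from scipy.stats import chi2

def make_data_10_2(n, rng):
    X = rng.standard_normal((n, 10))       # X_1, ..., X_10 ~ N(0, 1)
    y = np.where((X ** 2).sum(axis=1) > chi2.ppf(0.5, df=10), 1, -1)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_data_10_2(2000, rng)    # sizes used in the book's example
X_test, y_test = make_data_10_2(10000, rng)
```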
[Figure 10.3 plot: Training Error (0.0 to 1.0) versus Boosting Iterations (0 to 400); curves: Misclassification Rate, Exponential Loss.]
FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss: $(1/N)\sum_{i=1}^{N} \exp(-y_i f(x_i))$. After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.
[Figure 10.4 plot: Loss (0.0 to 3.0) versus margin y · f (−2 to 2); curves: Misclassification, Exponential, Binomial Deviance, Squared Error, Support Vector.]
FIGURE 10.4. Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification: $I(\operatorname{sign}(f) \ne y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).
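As functions of the margin m = y·f, the five losses are easy to write down. A sketch (with y = ±1, squared error (y − f)² equals (1 − yf)², and the deviance must be divided by log 2 to pass through (0, 1) as in the figure):

```python
# The five two-class losses of Figure 10.4 as functions of the margin m = y*f.
import numpy as np

def misclassification(m): return (m <= 0).astype(float)    # I(sign(f) != y)
def exponential(m):       return np.exp(-m)                 # exp(-yf)
def binomial_deviance(m): return np.log1p(np.exp(-2 * m))   # log(1 + exp(-2yf))
def squared_error(m):     return (1 - m) ** 2               # (y - f)^2 for y = +/-1
def support_vector(m):    return np.maximum(0.0, 1 - m)     # (1 - yf)_+
```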
[Figure 10.5 plot: Loss (0 to 8) versus y − f (−3 to 3); curves: Squared Error, Absolute Error, Huber.]
FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin y − f. The Huber loss function combines the good properties of squared-error loss near zero and absolute-error loss when |y − f| is large.
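A sketch of the Huber criterion, which is quadratic for small residuals r = y − f and linear beyond a threshold delta (the default delta=1.0 here is arbitrary):

```python
# Huber loss: squared error near zero, absolute error (up to constants) in the tails.
import numpy as np

def huber(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, a ** 2, 2 * delta * a - delta ** 2)
```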
[Figure 10.6 chart: relative importance (0 to 100) of the spam predictors; the top-ranked predictors include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, and edu.]
[Figure 10.7 plots: four panels of Partial Dependence versus the frequencies of the predictors !, remove, edu, and hp.]
FIGURE 10.7. Partial dependence of log-odds of spam on four important predictors. The red ticks at the base of the plots are deciles of the input variable.
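Partial dependence has a simple brute-force estimate: clamp the chosen feature at each grid value and average the model's prediction over the training data. A sketch (partial_dependence is an illustrative name; decision_function gives the log-odds scale for a binomial-deviance boosted classifier):

```python
# Brute-force one-variable partial dependence, as plotted in Figure 10.7.
import numpy as np

def partial_dependence(model, X, j, grid):
    """Average prediction with feature j clamped to each value in grid."""
    Xc = X.copy()
    pd = []
    for v in grid:
        Xc[:, j] = v                                  # fix feature j everywhere
        pd.append(model.decision_function(Xc).mean()) # average over the data
    return np.array(pd)
```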
[Figure 10.8 surface: partial dependence (−1.0 to 1.0) over the joint frequencies of hp (0 to 3) and ! (0 to 1).]
FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a function of the joint frequencies of hp and the character !.
[Figure 10.9 plot: Test Error (0.0 to 0.4) versus Number of Terms (0 to 400); curves: Stumps, 10 Node, 100 Node, AdaBoost.]
FIGURE 10.9. Boosting with different sized trees, applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.
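A sketch of this comparison with scikit-learn's gradient boosting, whose classification loss is the binomial deviance. Tree size is controlled with max_leaf_nodes (J terminal nodes), and staged_predict traces the test error after every iteration; setting max_depth=None so that the leaf budget binds requires a reasonably recent scikit-learn, and the no-shrinkage learning_rate=1.0 is an illustrative choice.

```python
# Test-error curves for boosted trees of different sizes, as in Figure 10.9.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

curves = {}
for label, leaves in [("stumps", 2), ("10-node", 10), ("100-node", 100)]:
    gbm = GradientBoostingClassifier(max_leaf_nodes=leaves, max_depth=None,
                                     n_estimators=400, learning_rate=1.0)
    gbm.fit(X_train, y_train)
    curves[label] = [np.mean(pred != y_test)
                     for pred in gbm.staged_predict(X_test)]
```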
[Figure 10.10 panels: Coordinate Functions for Additive Logistic Trees, f_1(x_1) through f_10(x_10).]
FIGURE 10.10. Coordinate functions estimated by boosting stumps for the simulated example used in Figure 10.9. The true quadratic functions are shown for comparison.
[Figure 10.11 plots: four panels of Test Set Deviance and Test Set Misclassification Error versus Boosting Iterations (0 to 2000). Top row: Stumps, No shrinkage versus Shrinkage = 0.2; bottom row: 6-Node Trees, No shrinkage versus Shrinkage = 0.6.]
FIGURE 10.11. Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, either stumps or six terminal-node trees, and with or without shrinkage. The left panels report test deviance, while the right panels report misclassification error.
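Shrinkage scales each tree's contribution, $f_m(x) = f_{m-1}(x) + \nu \, T_m(x)$; in scikit-learn the factor ν is the learning_rate parameter, with ν = 1.0 meaning no shrinkage. A sketch of the two stump models compared in the top row:

```python
# With and without shrinkage, as in the top panels of Figure 10.11.
from sklearn.ensemble import GradientBoostingClassifier

no_shrink = GradientBoostingClassifier(max_depth=1, n_estimators=2000,
                                       learning_rate=1.0).fit(X_train, y_train)
shrunk    = GradientBoostingClassifier(max_depth=1, n_estimators=2000,
                                       learning_rate=0.2).fit(X_train, y_train)
```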
[Figure 10.12 plots: left panel, Test Set Deviance versus Boosting Iterations (0 to 1000) for 4-Node Trees; right panel, Test Set Absolute Error. Curves: No shrinkage; Shrink = 0.1; Sample = 0.5; Shrink = 0.1, Sample = 0.5.]
FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity. For the curves labeled “Sample = 0.5”, a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right panel, using squared-error loss.
[Figure 10.13 plot: Training and Test Absolute Error (0.0 to 0.8) versus Iterations M (0 to 800); curves: Train Error, Test Error.]
FIGURE 10.13. Average absolute error as a function of the number of iterations for the California housing data.
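A sketch of these curves with scikit-learn's bundled copy of the California housing data (the book used the original StatLib version; the loss and tree-size settings below are illustrative, and loss="absolute_error" requires scikit-learn 1.0 or later):

```python
# Train/test mean absolute error per boosting iteration, as in Figure 10.13.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

Xh, yh = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(Xh, yh, test_size=0.2, random_state=0)

reg = GradientBoostingRegressor(loss="absolute_error", max_leaf_nodes=6,
                                max_depth=None, n_estimators=800,
                                learning_rate=0.1).fit(X_tr, y_tr)
train_mae = [np.mean(np.abs(p - y_tr)) for p in reg.staged_predict(X_tr)]
test_mae  = [np.mean(np.abs(p - y_te)) for p in reg.staged_predict(X_te)]
```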
[Figure 10.14 chart: Relative importance (0 to 100) of MedInc, Longitude, AveOccup, Latitude, HouseAge, AveRooms, AveBedrms, and Population.]
FIGURE 10.14. Relative importance of the predictors for the California housing data.
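With a fitted scikit-learn model, per-feature importances are available directly; the book's convention rescales so that the largest equals 100. A sketch continuing from the housing fit above:

```python
# Relative importances as in Figure 10.14 (largest predictor scaled to 100).
import numpy as np
from sklearn.datasets import fetch_california_housing

names = fetch_california_housing().feature_names
rel = 100.0 * reg.feature_importances_ / reg.feature_importances_.max()
for i in np.argsort(rel)[::-1]:
    print(f"{names[i]:12s} {rel[i]:6.1f}")
```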
[Figure 10.15 plots: four panels of Partial Dependence versus MedInc (2 to 10), AveOccup (2 to 5), HouseAge (10 to 50), and AveRooms (4 to 10).]
FIGURE 10.15. Partial dependence of housing value on the non-location variables for the California housing data. The red ticks at the base of the plots are deciles of the input variables.
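scikit-learn's inspection module can draw these panels directly from the fitted model; a sketch (feature names follow fetch_california_housing's column order):

```python
# One-variable partial-dependence panels, as in Figure 10.15.
from sklearn.datasets import fetch_california_housing
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    reg, X_tr, features=["MedInc", "AveOccup", "HouseAge", "AveRooms"],
    feature_names=fetch_california_housing().feature_names)
```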
[Figure 10.16 surface: partial dependence (0.0 to 1.0) over AveOccup (2 to 5) and HouseAge (10 to 50).]
FIGURE 10.16. Partial dependence of house value on median age and average occupancy. There appears to be a strong interaction effect between these two variables.
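Passing a pair of features to the same display produces the joint partial-dependence surface used to diagnose this interaction; a sketch:

```python
# Two-variable partial dependence, as in Figure 10.16.
from sklearn.datasets import fetch_california_housing
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    reg, X_tr, features=[("AveOccup", "HouseAge")],
    feature_names=fetch_california_housing().feature_names)
```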
[Figure 10.17 map: partial dependence (−1.0 to 1.0) over Longitude (−124 to −114) and Latitude (34 to 42).]
FIGURE 10.17. Partial dependence of median house value on location in California. One unit is $100,000, at 1990 prices, and the values plotted are relative to the overall median of $180,000.
FIGURE 10.18. Map of New Zealand and its surrounding exclusive economic zone, showing the locations of the sampling sites for the fisheries data.
[Figure 10.19 plots: left panel, Mean Deviance (0.24 to 0.34) versus Number of Trees (0 to 1500), curves GBM Test, GBM CV, GAM Test; right panel, ROC curves of Sensitivity versus Specificity, with AUC 0.97 (GAM) and 0.98 (GBM).]
FIGURE 10.19. The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (and 1 × s.e. bars), and test deviance on the test data. Also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves on the test data for the chosen GBM model (vertical line in left plot) and the GAM model.
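The ROC panel plots sensitivity against specificity. With held-out labels and predicted presence probabilities (y_true and scores below are stand-ins, since the fisheries data is not publicly bundled), the curve and AUC follow directly:

```python
# ROC curve and AUC as in the right panel of Figure 10.19.
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(y_true, scores)    # scores = predicted P(presence)
sensitivity, specificity = tpr, 1.0 - fpr  # the figure's two axes
auc = roc_auc_score(y_true, scores)
```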
[Figure 10.20 panels: relative influence (0 to 25) of TempResid, AvgDepth, SusPartMatter, SalResid, SSTGrad, ChlaCase2, Slope, TidalCurr, Pentade, CodendSize, DisOrgMatter, Distance, Speed, and OrbVel; partial-dependence panels for TempResid, AvgDepth, SusPartMatter, SalResid, and SSTGrad.]
FIGURE 10.20. The top-left panel shows the relative influence computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the leading five variables, all plotted on the same scale for comparison.
FIGURE 10.21. Geospatial prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.
[Figure 10.22 chart: Error Rate (0.0 to 1.0) for the occupations Sales, Unemployed, Military, Clerical, Labor, Homemaker, Prof/Man, Retired, and Student; Overall Error Rate = 0.425.]
FIGURE 10.22. Error rate for each occupation in the demographics data.
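Per-class error rates like these come straight from a row-normalized confusion matrix; a sketch (y_true and y_pred stand in for the demographics fit):

```python
# Per-class and overall error rates, as reported in Figure 10.22.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)  # one rate per true class
overall_error = 1.0 - np.trace(cm) / cm.sum()         # 0.425 in the book's fit
```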
[Figure 10.23 chart: Relative Importance (0 to 100) of age, income, edu, hsld-stat, mar-dlinc, sex, ethnic, mar-stat, typ-home, lang, num-hsld, children, and yrs-BA.]
FIGURE 10.23. Relative importance of the predictors, averaged over all classes, for the demographics data.
[Figure 10.24 charts: four panels of Relative Importance (0 to 100), one each for Class = Retired, Class = Student, Class = Prof/Man, and Class = Homemaker.]
FIGURE 10.24. Predictor variable importances separately for each of the four classes with lowest error rate for the demographics data.
[Figure 10.25 plots: three panels of Partial Dependence versus age (coded 1 to 7), for the classes Retired, Student, and Prof/Man.]
FIGURE 10.25. Partial dependence of the odds of three different occupations on age, for the demographics data.