Boosting Trevor Hastie, Stanford University 1
Trees, Bagging, Random Forests and Boosting
• Classification Trees
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
Methods for improving the performance of weak learners such as trees. Classification trees are adaptive and robust, but do not generalize well. The techniques discussed here enhance their performance considerably.
Two-class Classification
• Observations are classified into two or more classes, coded by a response variable Y taking values 1, 2, . . . , K.
• We have a feature vector X = (X1, X2, . . . , Xp), and we hope to build a classification rule C(X) to assign a class label to an individual with feature X.
• We have a sample of pairs (yi, xi), i = 1, . . . , N. Note that each xi is a vector: xi = (xi1, xi2, . . . , xip).
• Example: Y indicates whether an email is spam or not. X represents the relative frequency of a subset of specially chosen words in the email message.
• The technology described here estimates C(X) directly, or via the probability function P(C = k|X).
Classification Trees
• Represented by a series of binary splits.
• Each internal node represents a value query on one of the variables — e.g. “Is X3 > 0.4?”. If the answer is “Yes”, go right, else go left.
• The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.
• The tree is grown using training data, by recursive splitting.
• The tree is often pruned to an optimal size, evaluated by cross-validation.
• New observations are classified by passing their X down to a terminal node of the tree, and then using majority vote.
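The recursive-splitting-and-majority-vote recipe above can be sketched with scikit-learn (an assumed library choice; the slides do not name one):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # feature vectors xi = (xi1, ..., xip)
y = (X[:, 1] > 0.39).astype(int)         # class labels, cf. the split "X2 > 0.39"

# Grow a small tree by recursive binary splitting; each internal node
# queries one variable against a threshold.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# New observations are passed down to a terminal node and classified by
# the majority vote of the training points that landed there.
print(tree.predict(X[:5]))
```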
[Figure: Classification Tree — a small example tree splitting on x.2 < 0.39 and then x.3 < −1.575; each node is labeled with its misclassification count (e.g. 10/30) and majority class.]
Properties of Trees
✔ Can handle huge datasets
✔ Can handle mixed predictors—quantitative and qualitative
✔ Easily ignore redundant variables
✔ Handle missing data elegantly
✔ Small trees are easy to interpret
✖ Large trees are hard to interpret
✖ Often prediction performance is poor
Example: Predicting e-mail spam
• data from 4601 email messages
• goal: predict whether an email message is spam (junk email) or good.
• input features: relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.
• for this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.
• we coded spam as 1 and email as 0.
• A system like this would be trained for each user separately (e.g. their word lists would be different).
Predictors
• 48 quantitative predictors—the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.
• 6 quantitative predictors—the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.
• The average length of uninterrupted sequences of capital letters: CAPAVE.
• The length of the longest uninterrupted sequence of capital letters: CAPMAX.
• The sum of the lengths of uninterrupted sequences of capital letters: CAPTOT.
Details
• A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.
• A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.
• This bushy tree was pruned back using cost-complexity pruning, and the tree size was chosen by 10-fold cross-validation.
• We then compute the test error and ROC curve on the test data.
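A minimal sketch of this grow-then-prune pipeline, assuming scikit-learn's cost-complexity pruning in place of the original CART software, and synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grow a bushy tree: keep splitting until a minimum bucket (leaf) size of 5.
full = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)

# Candidate cost-complexity parameters along the pruning path.
alphas = full.cost_complexity_pruning_path(X, y).ccp_alphas

# Choose the complexity parameter (hence the tree size) by 10-fold CV.
scores = [cross_val_score(
    DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=a, random_state=0),
    X, y, cv=10).mean() for a in alphas]
best = alphas[int(np.argmax(scores))]

pruned = DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=best,
                                random_state=0).fit(X, y)
print(pruned.get_n_leaves(), "leaves after pruning")
```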
Some important features
39% of the training data were spam.
Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

|       | george | you  | your | hp   | free | hpl  | !    | our  | re   | edu  | remove |
|-------|--------|------|------|------|------|------|------|------|------|------|--------|
| spam  | 0.00   | 2.26 | 1.38 | 0.02 | 0.52 | 0.01 | 0.51 | 0.51 | 0.13 | 0.01 | 0.28   |
| email | 1.27   | 1.27 | 0.44 | 0.90 | 0.07 | 0.43 | 0.11 | 0.18 | 0.42 | 0.29 | 0.01   |
[Figure: pruned classification tree for the SPAM data. Root node: 600/1536. The first split is on ch$ < 0.0555; later splits involve remove, ch!, george, hp, CAPMAX, receive, edu, our, CAPAVE, free, business, and 1999. Each node shows its misclassification count, and terminal nodes are labeled email or spam.]
[Figure: ROC curve for the pruned tree on the SPAM data, plotting Sensitivity against Specificity. TREE — Error: 8.7%.]
SPAM Data
Overall error rate on test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P(+1|X) > c0. Sensitivity: proportion of true spam identified. Specificity: proportion of true email identified.
We may want specificity to be high, and suffer some spam: Specificity: 95% ⟹ Sensitivity: 79%.
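The threshold trade-off can be illustrated with toy predicted probabilities (hypothetical numbers, not the SPAM fit):

```python
import numpy as np

# Toy predicted probabilities P(spam|x) and true labels (1 = spam).
p_spam = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
truth  = np.array([1,   1,   0,   1,   0,   1,   0,   0])

def sens_spec(c0):
    """Classify as spam when P(spam|x) > c0; return (sensitivity, specificity)."""
    pred = (p_spam > c0).astype(int)
    sens = (pred[truth == 1] == 1).mean()   # proportion of true spam identified
    spec = (pred[truth == 0] == 0).mean()   # proportion of true email identified
    return sens, spec

# Raising c0 trades sensitivity for specificity.
for c0 in (0.15, 0.5, 0.75):
    print(c0, sens_spec(c0))
```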
[Figure: ROC curves for TREE vs SVM on the SPAM data. SVM — Error: 6.7%; TREE — Error: 8.7%.]
TREE vs SVM
Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.
Toy Classification Problem
[Figure: scatterplot of the two classes (labeled 0 and 1) in the (X1, X2) plane, with the Bayes decision boundary drawn in black. Bayes Error Rate: 0.25.]
• Data X and Y, with Y taking values +1 or −1.
• Here X = (X1, X2).
• The black boundary is the Bayes Decision Boundary - the best one can do.
• Goal: Given N training pairs (Xi, Yi), produce a classifier C(X) ∈ {−1, 1}.
• Also estimate the probability of the class labels P(Y = +1|X).
Toy Example - No Noise
[Figure: scatterplot of the two classes (0 and 1) for the noise-free nested-spheres problem in the (X1, X2) plane. Bayes Error Rate: 0.]
• Deterministic problem; noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here Bayes Error is 0%.
Classification Tree
[Figure: classification tree fit to the toy data (root node 94/200). The root split is on x.2 < −1.06711; subsequent splits are on x.2 < 1.14988, x.1 < 1.13632, x.1 < −0.900735, x.1 < −1.1668, x.1 < −1.07831, and x.2 < −0.823968. Nodes show misclassification counts and majority class.]
Decision Boundary: Tree
[Figure: decision boundary of the classification tree in the (X1, X2) plane, overlaid on the two classes. Error Rate: 0.073.]
When the nested spheres are in 10 dimensions, classification trees produce a rather noisy and inaccurate rule C(X), with error rates around 30%.
Model Averaging
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
• Random Forests (Breiman 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
Bagging
Bagging or bootstrap aggregation averages a given procedure over many samples, to reduce its variance — a poor man's Bayes. See p. 246.

Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x.

To bag C, we draw bootstrap samples S∗1, . . . , S∗B, each of size N, with replacement from the training data. Then

Cbag(x) = Majority Vote { C(S∗b, x) }, b = 1, . . . , B.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However any simple structure in C (e.g. a tree) is lost.
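A from-scratch sketch of Cbag, assuming scikit-learn trees as the base classifier and synthetic nested-spheres-flavored data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)   # toy two-class problem

N, B = len(X), 25
votes = np.zeros((B, N), dtype=int)
for b in range(B):
    # Bootstrap sample S*b: size N, drawn with replacement.
    idx = rng.integers(0, N, size=N)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[b] = tree.predict(X)

# Majority vote over the B bootstrap trees.
c_bag = (votes.mean(axis=0) > 0.5).astype(int)
print("bagged training error:", (c_bag != y).mean())
```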
[Figure: the original tree (root split x.2 < 0.39, then x.3 < −1.575) alongside five trees grown on bootstrap samples of the same data. The bootstrap trees split on different variables and thresholds (e.g. x.2 < 0.36, x.4 < 0.395, x.2 < 0.255), illustrating the instability of trees.]
Decision Boundary: Bagging
[Figure: decision boundary produced by bagging in the (X1, X2) plane, overlaid on the two classes. Error Rate: 0.032.]
Bagging averages many trees, and produces smoother decision boundaries.
Random forests
• refinement of bagged trees; quite popular
• at each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2 p, where p is the number of features
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the “out-of-bag” error rate.
• random forests tries to improve on bagging by “de-correlating” the trees. Each tree has the same expectation.
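A sketch using scikit-learn's RandomForestClassifier (a stand-in for the R randomForest package the slides use), with m = √p candidate features per split and the out-of-bag error monitored, on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # m = sqrt(p) features considered at each split
    oob_score=True,        # monitor the "out-of-bag" error rate
    random_state=0,
).fit(X, y)

print("OOB error rate:", 1 - rf.oob_score_)
```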
[Figure: ROC curves for TREE, SVM and Random Forest on the SPAM data. Random Forest — Error: 5.0%; SVM — Error: 6.7%; TREE — Error: 8.7%.]
TREE, SVM and RF
Random Forest dominates both other methods on the SPAM data — 5.0% error. Used 500 trees with default settings for the randomForest package in R.
[Figure: schematic of boosting. Starting from the training sample, a sequence of reweighted samples is produced, each yielding a classifier C1(x), C2(x), C3(x), . . . , CM(x).]
Boosting
• Average many trees, each grown to re-weighted versions of the training data.
• Final classifier is a weighted average of classifiers:

C(x) = sign[ Σ_{m=1}^M αm Cm(x) ]
[Figure: test error as a function of the number of terms, for Bagging and AdaBoost using 100-node trees.]
Boosting vs Bagging
• 2000 points from Nested Spheres in R10.
• Bayes error rate is 0%.
• Trees are grown best-first without pruning.
• Leftmost term is a single tree.
AdaBoost (Freund & Schapire, 1996)
1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N.
2. For m = 1 to M repeat steps (a)–(d):
   (a) Fit a classifier Cm(x) to the training data using weights wi.
   (b) Compute the weighted error of the newest tree:

   errm = [ Σ_{i=1}^N wi I(yi ≠ Cm(xi)) ] / [ Σ_{i=1}^N wi ].

   (c) Compute αm = log[(1 − errm)/errm].
   (d) Update the weights for i = 1, . . . , N:

   wi ← wi · exp[ αm · I(yi ≠ Cm(xi)) ]

   and renormalize the wi to sum to 1.
3. Output C(x) = sign[ Σ_{m=1}^M αm Cm(x) ].
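The steps above transcribe almost line-for-line into code. This sketch uses depth-1 trees (stumps) as the weak classifiers Cm on assumed synthetic nested-spheres data, with labels coded −1/+1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = np.where((X ** 2).sum(axis=1) > 10, 1, -1)   # nested spheres in R^10

N, M = len(X), 50
w = np.full(N, 1.0 / N)                          # step 1: w_i = 1/N
alphas, stumps = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1)  # (a) fit using weights w_i
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    miss = (pred != y).astype(float)
    err = (w * miss).sum() / w.sum()             # (b) weighted error
    alpha = np.log((1 - err) / err)              # (c) classifier weight
    w = w * np.exp(alpha * miss)                 # (d) reweight misclassified
    w = w / w.sum()                              #     points and renormalize
    alphas.append(alpha)
    stumps.append(stump)

# step 3: sign of the weighted vote
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training error:", (np.sign(F) != y).mean())
```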
[Figure: test error versus boosting iterations for boosted stumps on the nested-spheres problem, with reference lines for a single stump and a 400-node tree.]
Boosting Stumps
A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.
[Figure: training and test error versus number of terms for boosting on the nested-spheres problem; the training error falls to zero.]
• Nested spheres in 10 dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero.
• Further iterations continue to improve test error in many examples.
[Figure: training and test error versus number of terms for a noisy problem; the Bayes error is marked by a horizontal line.]
Noisy Problems
• Nested Gaussians in 10 dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.
Stagewise Additive Modeling
Boosting builds an additive model

f(x) = Σ_{m=1}^M βm b(x; γm).

Here b(x; γm) is a tree, and γm parametrizes the splits.

We do things like that in statistics all the time!

• GAMs: f(x) = Σ_j fj(xj)
• Basis expansions: f(x) = Σ_{m=1}^M θm hm(x)

Traditionally the parameters fm, θm are fit jointly (i.e. least squares, maximum likelihood).

With boosting, the parameters (βm, γm) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.
Additive Trees
• Simple example: stagewise least squares.
• Fix the past M − 1 functions, and update the Mth using a tree:

min_{fM ∈ Tree(x)} E( Y − Σ_{m=1}^{M−1} fm(x) − fM(x) )²

• If we define the current residuals to be

R = Y − Σ_{m=1}^{M−1} fm(x)

then at each stage we fit a tree to the residuals:

min_{fM ∈ Tree(x)} E( R − fM(x) )²
Stagewise Least Squares
Suppose we have available a basis family b(x; γ) parametrized by γ.

• After m − 1 steps, suppose we have the model fm−1(x) = Σ_{j=1}^{m−1} βj b(x; γj).
• At the mth step we solve

min_{β,γ} Σ_{i=1}^N ( yi − fm−1(xi) − β b(xi; γ) )²

• Denoting the residuals at the mth stage by rim = yi − fm−1(xi), the previous step amounts to

min_{β,γ} Σ_{i=1}^N ( rim − β b(xi; γ) )²

• Thus the term βm b(x; γm) that best fits the current residuals is added to the expansion at each step.
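The residual-fitting step can be sketched as a loop, assuming small regression trees as the basis family and synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=500)

f = np.zeros(500)                        # f_0(x) = 0
for m in range(100):
    r = y - f                            # residuals r_im = y_i - f_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    f = f + tree.predict(X)              # add the term that best fits them

print("final residual MSE:", np.mean((y - f) ** 2))
```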
Adaboost: Stagewise Modeling
• AdaBoost builds an additive logistic regression model

f(x) = log [ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^M αm Gm(x)

by stagewise fitting using the loss function

L(y, f(x)) = exp(−y f(x)).

• Given the current fm−1(x), our solution for (βm, Gm) is

arg min_{β,G} Σ_{i=1}^N exp[ −yi ( fm−1(xi) + β G(xi) ) ]

where Gm(x) ∈ {−1, 1} is a tree classifier and βm is a coefficient.
• With w_i^(m) = exp(−yi fm−1(xi)), this can be re-expressed as

arg min_{β,G} Σ_{i=1}^N w_i^(m) exp(−β yi G(xi))

• We can show that this leads to the AdaBoost algorithm; see p. 305.
[Figure: loss functions plotted against the margin y · f: misclassification, exponential, binomial deviance, squared error, and support vector loss.]
Why Exponential Loss?
• e^(−y f(x)) is a monotone, smooth upper bound on misclassification loss at x.
• Leads to a simple reweighting scheme.
• Has the logit transform as its population minimizer:

f∗(x) = (1/2) log [ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• Other loss functions, like binomial deviance, are more robust.
General Stagewise Algorithm
We can do the same for more general loss functions, not only least squares.

1. Initialize f0(x) = 0.
2. For m = 1 to M:
   (a) Compute (βm, γm) = arg min_{β,γ} Σ_{i=1}^N L( yi, fm−1(xi) + β b(xi; γ) ).
   (b) Set fm(x) = fm−1(x) + βm b(x; γm).

Sometimes we replace step (b) in item 2 by

(b∗) Set fm(x) = fm−1(x) + ν βm b(x; γm)

Here ν is a shrinkage factor, and often ν < 0.1. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.
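The effect of step (b∗) can be sketched on assumed synthetic data: the same residual-fitting loop run with and without shrinkage, showing that ν slows the fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=500)

def boost(nu, M=200):
    """Stagewise least squares with each new term shrunk by nu."""
    f = np.zeros(len(y))
    for m in range(M):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - f)
        f = f + nu * tree.predict(X)     # (b*) shrunken update
    return np.mean((y - f) ** 2)

# After the same number of terms, nu = 0.1 has made less progress on the
# training error than nu = 1 — shrinkage slows the model-building.
print(boost(0.1), boost(1.0))
```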
Gradient Boosting
• General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.
• Tree size is a parameter that determines the order of interaction (next slide).
• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in section 10.10.
[Figure: test error versus number of terms for boosted trees of different sizes: stumps, 10-node trees, 100-node trees, and AdaBoost.]
Tree Size
The tree size J determines the interaction order of the model:

η(X) = Σ_j ηj(Xj) + Σ_{jk} ηjk(Xj, Xk) + Σ_{jkl} ηjkl(Xj, Xk, Xl) + · · ·
Stumps win!
Since the true decision boundary is the surface of a sphere, the function that describes it has the form

f(X) = X1² + X2² + · · · + Xp² − c = 0.

Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.

[Figure: Coordinate Functions for Additive Logistic Trees: the fitted functions f1(x1), . . . , f10(x10).]
Spam Example Results
With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model

f(x) = log [ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.
Gradient Boosting achieves a test error of 4%, compared to 5.3% for an additive GAM, 5.0% for Random Forests, and 8.7% for CART.
Spam: Variable Importance
[Figure: relative importance (0–100) of the spam predictors; the most important are "!", "$", "hp", "remove", "free", "CAPAVE", "your", "CAPMAX", "george", "CAPTOT", "edu", . . . ]
Spam: Partial Dependence
[Figure: partial dependence of f(x) on the predictors "!", "remove", "edu", and "hp"]
Comparison of Learning Methods
Some characteristics of different learning methods, rated good / fair / poor (shown as colored dots in the original table) for Neural Nets, SVM, CART, GAM, KNN/kernel methods, and Gradient Boosting:
• Natural handling of data of “mixed” type
• Handling of missing values
• Robustness to outliers in input space
• Insensitivity to monotone transformations of inputs
• Computational scalability (large N)
• Ability to deal with irrelevant inputs
• Ability to extract linear combinations of features
• Interpretability
• Predictive power
Software
• R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
– randomForest: implementation of Leo Breiman’s algorithms.
– rpart: Terry Therneau’s implementation of classification and regression trees.
– gbm: Greg Ridgeway’s implementation of Friedman’s gradient boosting algorithm.
• Salford Systems: commercial implementation of trees, random forests and gradient boosting.
• Splus (Insightful): commercial version of S.
• Weka: GPL software from the University of Waikato, New Zealand. Includes trees, random forests and many other procedures.
Ensembles
Leon Bottou
COS 424 – 4/8/2010
Readings
• T. G. Dietterich (2000)
“Ensemble Methods in Machine Learning”.
• R. E. Schapire (2003):
“The Boosting Approach to Machine Learning”.
Sections 1,2,3,4,6.
Summary
1. Why ensembles?
2. Combining outputs.
3. Constructing ensembles.
4. Boosting.
I. Ensembles
Ensemble of classifiers
Ensemble of classifiers
– Consider a set of classifiers h1, h2, . . . , hL.
– Construct a classifier by combining their individual decisions.
– For example by voting their outputs.
Accuracy
– The ensemble works if the classifiers have low error rates.
Diversity
– No gain if all classifiers make the same mistakes.
– What if classifiers make different mistakes?
Uncorrelated classifiers
Assume ∀r ≠ s: Cov[ 1I{hr(x) = y}, 1I{hs(x) = y} ] = 0.
The tally of classifier votes then follows a binomial distribution.
Example: twenty-one uncorrelated classifiers, each with a 30% error rate.
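A quick way to see the strength of this example is to compute the majority-vote error directly (an illustrative sketch; the helper name is made up): with 21 independent classifiers that are each wrong 30% of the time, the vote is wrong only when 11 or more of them err, which happens about 2.6% of the time.

```python
from math import comb

def majority_vote_error(L, p):
    """Probability that a majority vote of L independent classifiers,
    each wrong with probability p, is itself wrong."""
    # The vote errs when more than half of the L classifiers err.
    return sum(comb(L, k) * p**k * (1 - p)**(L - k)
               for k in range(L // 2 + 1, L + 1))

print(round(majority_vote_error(21, 0.3), 4))  # ≈ 0.0264
```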
Statistical motivation
blue: classifiers that work well on the training set(s); f: best classifier.
Computational motivation
blue: classifier search may reach local optima; f: best classifier.
Representational motivation
blue: classifier space may not contain the best classifier; f: best classifier.
Practical success
Recommendation system
– Netflix “movies you may like”.
– Customers sometimes rate movies they rent.
– Input: (movie, customer)
– Output: rating
Netflix competition
– $1M for the first team to do 10% better than their system.
Winner: BellKor team and friends
– Ensemble of more than 800 rating systems.
Runner-up: everybody else
– Ensemble of all the rating systems built by the other teams.
Bayesian ensembles
Let D represent the training data.
Enumerating all the classifiers:

P(y|x, D) = ∑h P(y, h|x, D)
          = ∑h P(h|x, D) P(y|h, x, D)
          = ∑h P(h|D) P(y|x, h)

P(h|D): how well h matches the training data.
P(y|x, h): what h predicts for pattern x.

Note that this is a weighted average.
II. Combining Outputs
Simple averaging
Weighted averaging a priori
Weights derived from the training errors, e.g. exp(−β TrainingError(ht)). This approximates the Bayesian ensemble.
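A minimal sketch of this a-priori weighting scheme (the β value and function names are illustrative, not from the slides): classifiers with lower training error get exponentially larger weight in the vote.

```python
import math

def prior_weights(train_errors, beta=5.0):
    """Weights w_t ∝ exp(-beta * TrainingError(h_t)), normalized to sum to 1."""
    w = [math.exp(-beta * e) for e in train_errors]
    total = sum(w)
    return [x / total for x in w]

def weighted_vote(outputs, weights):
    """Weighted average of classifier outputs in {-1,+1}, thresholded at 0."""
    return 1 if sum(w * o for w, o in zip(weights, outputs)) >= 0 else -1

w = prior_weights([0.10, 0.30, 0.45])
print([round(x, 3) for x in w])  # the low-error classifier dominates
```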
Weighted averaging with trained weights
Train the weights on the validation set. Training the weights on the training set overfits easily. You need another validation set to estimate the performance!
Stacked classifiers
Second tier classifier trained on the validation set.
You need another validation set to estimate the performance!
III. Constructing Ensembles
Diversification
Cause of the mistake → diversification strategy:
– Pattern was difficult → hopeless
– Overfitting (?) → vary the training sets
– Some features were noisy → vary the set of input features
– Multiclass decisions were inconsistent → vary the class encoding
Manipulating the training examples
Bootstrap replication simulates training set selection
– Given a training set of size n, construct a new training set by sampling n examples with replacement.
– About 37% of the examples (a fraction of roughly 1/e) are excluded on average.
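The exclusion rate can be checked with a quick simulation (an illustrative sketch, not from the slides): the chance that a given example never appears in a bootstrap sample of size n is (1 − 1/n)^n, which tends to 1/e ≈ 0.368.

```python
import random

def excluded_fraction(n, trials=200, seed=0):
    """Average fraction of the n examples that never appear in a
    bootstrap sample of size n drawn with replacement."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        seen = {rng.randrange(n) for _ in range(n)}  # distinct indices drawn
        total += (n - len(seen)) / n
    return total / trials

print(round(excluded_fraction(1000), 3))  # close to 1/e ≈ 0.368
```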
Bagging
– Create bootstrap replicates of the training set.
– Build a decision tree for each replicate.
– Estimate tree performance using out-of-bootstrap data.
– Average the outputs of all decision trees.
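The bagging recipe above can be sketched end to end in a few lines; the 1-D threshold "stump" learner here is just a stand-in for the decision trees on the slide, and the toy data and names are illustrative.

```python
import random

def train_stump(data):
    """Fit a 1-D threshold classifier minimizing training error."""
    best = None
    for t, _ in data:                      # candidate thresholds at the data points
        for sign in (1, -1):
            err = sum(1 for x, y in data
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x, t=t, s=sign: s if x > t else -s

def bagging(data, k=25, seed=0):
    """Train one stump per bootstrap replicate of the training set."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(k)]

def bagged_predict(models, x):
    """Classify by majority vote over the bagged models."""
    return 1 if sum(m(x) for m in models) > 0 else -1

data = [(x / 10, 1 if x >= 5 else -1) for x in range(10)]
models = bagging(data)
print(bagged_predict(models, 0.9), bagged_predict(models, 0.1))
```

Each replicate sees a slightly different training set, so the stumps pick slightly different thresholds; the vote averages away that variance.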
Boosting
– See part IV.
Manipulating the features
Random forests
– Construct decision trees on bootstrap replicas.
– Restrict the node decisions to a small subset of features, picked randomly for each node.
– Do not prune the trees.
– Estimate tree performance using out-of-bootstrap data.
– Average the outputs of all decision trees.
Multiband speech recognition
– Filter speech to eliminate a random subset of the frequencies.
– Train speech recognizer on filtered data.
– Repeat and combine with a second tier classifier.
– Resulting recognizer is more robust to noise.
Manipulating the output codes
Reducing multiclass problems to binary classification
– We have seen one versus all.
– We have seen all versus all.
Error correcting codes for multiclass problems
– Code the class numbers with an error correcting code.
– Construct a binary classifier for each bit of the code.
– Run the error correction algorithm on the binary classifier outputs.
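A decoding sketch for the last two steps (the 4-class codebook below is a made-up example whose codewords are pairwise Hamming distance 4 apart, so any single erring binary classifier is corrected):

```python
def hamming(a, b):
    """Number of positions where two codewords disagree."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(bits, codebook):
    """Assign the class whose codeword is nearest to the observed
    binary-classifier outputs in Hamming distance."""
    return min(codebook, key=lambda c: hamming(bits, codebook[c]))

codebook = {
    0: (0, 0, 0, 0, 0, 0),
    1: (0, 1, 1, 1, 1, 0),
    2: (1, 0, 1, 1, 0, 1),
    3: (1, 1, 0, 0, 1, 1),
}
# Bit 3 of class 2's codeword has been flipped; decoding still recovers 2.
print(ecoc_decode((1, 0, 1, 0, 0, 1), codebook))  # → 2
```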
IV. Boosting
Motivation
• Easy to come up with rough rules of thumb for classifying data
– email contains more than 50% capital letters.
– email contains the expression “buy now”.
• Each alone isn’t great, but better than random.
• Boosting converts rough rules of thumb into an accurate classifier.
Boosting was invented by Prof. Schapire.
Adaboost
Given examples (x1, y1) . . . (xn, yn) with yi = ±1.
Let D1(i) = 1/n for i = 1 . . . n.
For t = 1 . . . T do
• Run the weak learner using examples with weights Dt.
• Get weak classifier ht.
• Compute the error: εt = ∑i Dt(i) 1I(ht(xi) ≠ yi).
• Compute the magic coefficient: αt = (1/2) log((1 − εt) / εt).
• Update the weights: Dt+1(i) = Dt(i) e^{−αt yi ht(xi)} / Zt.
Output the final classifier: fT(x) = sign(∑_{t=1}^{T} αt ht(x)).
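The loop above translates almost line for line into code. This is a sketch (the fixed stump pool and toy data are illustrative, not from the slides): the "weak learner" simply picks the pool member with the smallest weighted error. The toy labels form an interval that no single threshold stump separates, yet the boosted vote classifies them all correctly.

```python
import math

def adaboost(X, y, weak_pool, T=10):
    """AdaBoost over a fixed pool of weak classifiers.
    X: examples, y: labels in {-1,+1}, weak_pool: callables x -> {-1,+1}."""
    n = len(X)
    D = [1.0 / n] * n                              # D_1(i) = 1/n
    ensemble = []                                  # (alpha_t, h_t) pairs
    for _ in range(T):
        # weak learner: pick the classifier with the smallest weighted error
        eps, h = min(((sum(D[i] for i in range(n) if s(X[i]) != y[i]), s)
                      for s in weak_pool), key=lambda p: p[0])
        if eps == 0 or eps >= 0.5:                 # no useful weak classifier left
            break
        alpha = 0.5 * math.log((1 - eps) / eps)    # the "magic coefficient"
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)                                 # normalizer Z_t
        D = [d / Z for d in D]
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy 1-D data: the positive class sits in the middle, so no single
# threshold classifier is correct, but a weighted vote of three is.
X = [0, 1, 2, 3, 4, 5]
y = [-1, -1, 1, 1, -1, -1]
pool = [(lambda x, t=t, s=s: s if x > t else -s)
        for t in range(6) for s in (1, -1)]
H = adaboost(X, y, pool)
print([predict(H, x) for x in X])
```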
Toy example
Weak classifiers: vertical or horizontal half-planes.
Adaboost round 1
Adaboost round 2
Adaboost round 3
Adaboost final classifier
From weak learner to strong classifier (1)
Preliminary
DT+1(i) = D1(i) · e^{−α1 yi h1(xi)}/Z1 · · · e^{−αT yi hT(xi)}/ZT = (1/n) · e^{−yi fT(xi)} / ∏t Zt

Bounding the training error

(1/n) ∑i 1I{fT(xi) ≠ yi} ≤ (1/n) ∑i e^{−yi fT(xi)} = ∑i DT+1(i) ∏t Zt = ∏t Zt

Idea: make Zt as small as possible.

Zt = ∑_{i=1}^{n} Dt(i) e^{−αt yi ht(xi)} = (1 − εt) e^{−αt} + εt e^{αt}

1. Pick ht to minimize εt.
2. Pick αt to minimize Zt.
From weak learner to strong classifier (2)
Pick αt to minimize Zt (the magic coefficient)

∂Zt/∂αt = −(1 − εt) e^{−αt} + εt e^{αt} = 0  =⇒  αt = (1/2) log((1 − εt)/εt)

Weak learner assumption: γt = 1/2 − εt is positive and small.

Zt = (1 − εt) √(εt/(1 − εt)) + εt √((1 − εt)/εt) = √(4 εt (1 − εt)) = √(1 − 4γt²) ≤ exp(−2γt²)

TrainingError(fT) ≤ ∏_{t=1}^{T} Zt ≤ exp(−2 ∑_{t=1}^{T} γt²)

The training error decreases exponentially if inf γt > 0.
But that does not happen beyond a certain point. . .
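Two facts carry this derivation: Zt = 2√(εt(1 − εt)) equals √(1 − 4γt²), and √(1 − x) ≤ e^{−x/2} (since 1 − x ≤ e^{−x}). Both are easy to check numerically:

```python
import math

for eps in (0.10, 0.25, 0.40, 0.49):
    gamma = 0.5 - eps
    Z = 2 * math.sqrt(eps * (1 - eps))
    # Z_t rewritten in terms of the edge gamma_t = 1/2 - eps_t ...
    assert math.isclose(Z, math.sqrt(1 - 4 * gamma**2))
    # ... and dominated by the exponential bound exp(-2 * gamma_t^2).
    assert Z <= math.exp(-2 * gamma**2)
    print(f"eps={eps:.2f}  Z={Z:.4f}  bound={math.exp(-2 * gamma**2):.4f}")
```

Note how close Z and the bound are when εt approaches 1/2: a barely-better-than-random weak learner shrinks the training error bound only slightly per round.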
Boosting and exponential loss
Proofs are instructive
We obtain the bound

TrainingError(fT) ≤ (1/n) ∑i e^{−yi fT(xi)} = ∏_{t=1}^{T} Zt

– without saying how Dt relates to ht
– without using the value of αt

Conclusion
– Round T chooses the hT and αT that maximize the exponential loss reduction from fT−1 to fT.

Exercise
– Tweak Adaboost to minimize the log loss instead of the exp loss.
Boosting and margins
marginH(x, y) = y H(x) / ∑t |αt| = ∑t αt y ht(x) / ∑t |αt|
Remember support vector machines?
Ensemble learning Lecture 12
David Sontag New York University
Slides adapted from Luke Zettlemoyer, Vibhav Gogate, Rob Schapire, and Tommi Jaakkola
Ensemble methods
Machine learning competition with a $1 million prize
Bias/Variance Tradeoff
Hastie, Tibshirani, Friedman “Elements of Statistical Learning” 2001
Reduce Variance Without Increasing Bias
• Averaging reduces variance: averaging N models divides the variance by N when the predictions are independent.
• One problem: only one training set. Where do multiple models come from?
Bagging: Bootstrap Aggregation
• Leo Breiman (1994)
• Take repeated bootstrap samples from training set D.
• Bootstrap sampling: given set D containing N training examples, create D′ by drawing N examples at random with replacement from D.
• Bagging:
– Create k bootstrap samples D1 … Dk.
– Train a distinct classifier on each Di.
– Classify a new instance by majority vote / average.
Bagging
• Best case: the variance is reduced by a factor of N (when predictions are independent).
• In practice: models are correlated, so the reduction is smaller than 1/N, and the variance of models trained on fewer training cases is usually somewhat larger.
decision tree learning algorithm; very similar to ID3
shades of blue/red indicate strength of vote for particular classification
Reduce Bias² and Decrease Variance?
• Bagging reduces variance by averaging.
• Bagging has little effect on bias.
• Can we average and reduce bias? Yes:
• Boosting
Theory and Applications of Boosting
Rob Schapire
Example: “How May I Help You?” [Gorin et al.]
• goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)
• yes I’d like to place a collect call long distance please (Collect)
• operator I need to make a call but I need to bill it to my office (ThirdNumber)
• yes I’d like to place a call on my master card please (CallingCard)
• I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
• observation:
• easy to find “rules of thumb” that are “often” correct
• e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
• hard to find a single highly accurate prediction rule
The Boosting Approach
• devise computer program for deriving rough rules of thumb
• apply procedure to subset of examples
• obtain rule of thumb
• apply to 2nd subset of examples
• obtain 2nd rule of thumb
• repeat T times
Key Details
• how to choose examples on each round?
• concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into a single prediction rule?
• take a (weighted) majority vote of the rules of thumb
Boosting
• boosting = general method of converting rough rules of thumb into a highly accurate prediction rule
• technically:
• assume given a “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in the two-class setting) [the “weak learning assumption”]
• given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%
Preamble: Early History
Strong and Weak Learnability
• boosting’s roots are in the “PAC” learning model [Valiant ’84]
• get random examples from unknown, arbitrary distribution
• strong PAC learning algorithm:
• for any distribution, with high probability, given polynomially many examples (and polynomial time), can find classifier with arbitrarily small generalization error
• weak PAC learning algorithm:
• same, but generalization error only needs to be slightly better than random guessing (1/2 − γ)
• [Kearns & Valiant ’88]: does weak learnability imply strong learnability?
If Boosting Possible, Then...
• can use (fairly) wild guesses to produce highly accurate predictions
• if can learn “part way” then can learn “all the way”
• should be able to improve any learning algorithm
• for any learning problem: either can always learn with nearly perfect accuracy, or there exist cases where cannot learn even slightly better than random guessing
First Boosting Algorithms
• [Schapire ’89]: first provable boosting algorithm
• [Freund ’90]: “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]: first experiments using boosting; limited by practical drawbacks
• [Freund & Schapire ’95]: introduced the “AdaBoost” algorithm, with strong practical advantages over previous boosting algorithms
Application: Detecting Faces [Viola & Jones]
• problem: find faces in photograph or movie
• weak classifiers: detect light/dark rectangles in image
• many clever tricks to make extremely fast and accurate
Basic Algorithm and Core Theory
• introduction to AdaBoost
• analysis of training error
• analysis of test error and the margins theory
• experiments and applications
A Formal Description of Boosting
• given training set (x1, y1), . . . , (xm, ym)
• yi ∈ {−1, +1} correct label of instance xi ∈ X
• for t = 1, . . . , T:
  • construct distribution Dt on {1, . . . , m}
  • find weak classifier (“rule of thumb”)

      ht : X → {−1, +1}

    with error εt on Dt:

      εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ]

• output final/combined classifier Hfinal
AdaBoost [with Freund]
• constructing Dt:
  • D1(i) = 1/m
  • given Dt and ht:

      Dt+1(i) = (Dt(i)/Zt) × { e^{−αt}  if yi = ht(xi)
                               e^{αt}   if yi ≠ ht(xi) }
              = (Dt(i)/Zt) · exp( −αt yi ht(xi) )

    where Zt is a normalization factor and

      αt = (1/2) ln( (1 − εt)/εt ) > 0

• final classifier:

      Hfinal(x) = sign( Σt αt ht(x) )
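The reweighting rule above takes only a few lines to sketch; the following is an illustrative NumPy snippet (not from the slides), run on made-up data for a single round:

```python
import numpy as np

def update_distribution(D, y, h_pred, eps):
    """One AdaBoost step: D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t."""
    alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
    D_new = D * np.exp(-alpha * y * h_pred)    # shrink correct points, grow mistakes
    Z = D_new.sum()                            # normalization factor Z_t
    return D_new / Z, alpha

# toy check: 4 points, the weak rule errs only on the last one
y      = np.array([+1.0, -1.0, +1.0, -1.0])
h_pred = np.array([+1.0, -1.0, +1.0, +1.0])
D      = np.full(4, 0.25)                      # D_1(i) = 1/m
eps    = D[y != h_pred].sum()                  # weighted error = 0.25
D2, alpha = update_distribution(D, y, h_pred, eps)
```

After one update the misclassified point carries half the total weight, so the next weak learner is forced to pay attention to it.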
Toy Example
D1: the initial (uniform) distribution over training points
weak classifiers = vertical or horizontal half-planes
Round 1
h1: ε1 = 0.30, α1 = 0.42 → updated distribution D2
Round 2
h2: ε2 = 0.21, α2 = 0.65 → updated distribution D3
Round 3
h3: ε3 = 0.14, α3 = 0.92
Final Classifier
Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )
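Plugging the round errors εt into αt = ½ ln((1 − εt)/εt) roughly reproduces the votes in the final classifier; the slide's εt values appear to be rounded, so this quick check (illustrative, not from the lecture) agrees only to about ±0.02:

```python
import math

def alpha(eps):
    # alpha_t = 1/2 * ln((1 - eps_t)/eps_t): smaller weighted error -> larger vote
    return 0.5 * math.log((1 - eps) / eps)

# errors from rounds 1-3 of the toy example
alphas = [alpha(e) for e in (0.30, 0.21, 0.14)]
```

Note how the votes increase monotonically as the weighted error drops.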
Voted combination of classifiers
• The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier
• We consider voted combinations of simple binary ±1 component classifiers

    hm(x) = α1 h(x; θ1) + . . . + αm h(x; θm)

  where the (non-negative) votes αi can be used to emphasize component classifiers that are more reliable than others
Tommi Jaakkola, MIT CSAIL 3
Components: decision stumps
• Consider the following simple family of component classifiers generating ±1 labels:

    h(x; θ) = sign( w1 xk − w0 )

  where θ = {k, w1, w0}. These are called decision stumps.
• Each decision stump pays attention to only a single component of the input vector
[Figure: a decision stump thresholding a single input component]
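A minimal sketch of such a stump (illustrative code, not the lecture's; the parameter names mirror θ = {k, w1, w0}):

```python
import numpy as np

def stump_predict(X, k, w1, w0):
    """h(x; theta) = sign(w1 * x_k - w0): threshold a single feature k."""
    s = np.sign(w1 * X[:, k] - w0)
    return np.where(s == 0, 1.0, s)        # break exact ties toward +1

# two 2-D points; the stump looks only at feature k = 0 and ignores feature 1
X = np.array([[0.0, 5.0], [1.0, -5.0]])
pred = stump_predict(X, k=0, w1=1.0, w0=0.5)   # +1 iff x_0 > 0.5
```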
Voted combination cont’d
• We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive

    hm(x) = α1 h(x; θ1) + . . . + αm h(x; θm)

• While there are many options for the loss function we consider here only a simple exponential loss

    exp{ −y hm(x) }
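One useful property of this choice (a standard observation, not stated on the slide): since e^{−z} ≥ 1 whenever z ≤ 0, the exponential loss upper-bounds the 0-1 loss at every point. A small check with made-up margins:

```python
import math

def exp_loss(y, score):
    return math.exp(-y * score)

def zero_one_loss(y, score):
    return 1.0 if y * score <= 0 else 0.0

# (label, ensemble score): confident-correct, barely wrong, confidently wrong
cases = [(+1, 2.0), (+1, -0.1), (-1, 3.0)]
gaps = [exp_loss(y, s) - zero_one_loss(y, s) for y, s in cases]
```

Because the bound holds pointwise, driving the empirical exponential loss down also drives down the training error.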
Modularity, errors, and loss
• Consider adding the mth component:

    Σ_{i=1}^n exp{ −yi [ hm−1(xi) + αm h(xi; θm) ] }
      = Σ_{i=1}^n exp{ −yi hm−1(xi) − yi αm h(xi; θm) }
      = Σ_{i=1}^n exp{ −yi hm−1(xi) } · exp{ −yi αm h(xi; θm) }    (first factor fixed at stage m)
      = Σ_{i=1}^n W_i^(m−1) exp{ −yi αm h(xi; θm) }

So at the mth iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).
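The factorization into fixed weights times a per-component loss can be verified numerically; everything below is made-up illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
y      = rng.choice([-1.0, 1.0], size=n)   # labels y_i
h_prev = rng.normal(size=n)                # fixed ensemble scores h_{m-1}(x_i)
h_new  = rng.choice([-1.0, 1.0], size=n)   # candidate component h(x_i; theta_m)
alpha  = 0.7

# left-hand side: exponential loss of the enlarged ensemble
lhs = np.exp(-y * (h_prev + alpha * h_new)).sum()

# right-hand side: weights W_i^(m-1), fixed at stage m, times the new factor
W   = np.exp(-y * h_prev)
rhs = (W * np.exp(-y * alpha * h_new)).sum()
```

The two sides agree exactly, which is what makes the stage-wise (modular) optimization possible.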
Empirical exponential loss cont’d
• To increase modularity we’d like to further decouple the optimization of h(x; θm) from the associated votes αm
• To this end we select h(x; θm) that optimizes the rate at which the loss would decrease as a function of αm:

    ∂/∂αm |_{αm=0}  Σ_{i=1}^n W_i^(m−1) exp{ −yi αm h(xi; θm) }
      = [ Σ_{i=1}^n W_i^(m−1) exp{ −yi αm h(xi; θm) } · ( −yi h(xi; θm) ) ]_{αm=0}
      = − Σ_{i=1}^n W_i^(m−1) yi h(xi; θm)
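The closed-form derivative at αm = 0 can be sanity-checked against a central finite difference (illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
y = rng.choice([-1.0, 1.0], size=n)        # labels
h = rng.choice([-1.0, 1.0], size=n)        # candidate component outputs
W = rng.random(n)                          # weights W_i^(m-1) (unnormalized)

def loss(alpha):
    return (W * np.exp(-y * alpha * h)).sum()

closed_form = -(W * y * h).sum()           # derivative of the loss at alpha = 0
eps = 1e-6
numeric = (loss(eps) - loss(-eps)) / (2 * eps)
```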
Empirical exponential loss cont’d
• We find h(x; θm) that minimizes

    − Σ_{i=1}^n W_i^(m−1) yi h(xi; θm)

  We can also normalize the weights:

    − Σ_{i=1}^n [ W_i^(m−1) / Σ_{j=1}^n W_j^(m−1) ] yi h(xi; θm)  =  − Σ_{i=1}^n W_i^(m−1) yi h(xi; θm)

  so that Σ_{i=1}^n W_i^(m−1) = 1 (from here on the W_i^(m−1) denote the normalized weights).
Selecting a new component: summary
• We find h(x; θm) that minimizes

    − Σ_{i=1}^n W_i^(m−1) yi h(xi; θm)

  where Σ_{i=1}^n W_i^(m−1) = 1.
• αm is subsequently chosen to minimize

    Σ_{i=1}^n W_i^(m−1) exp{ −yi αm h(xi; θm) }
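For ±1-valued h the second minimization has a closed form: writing W+ for the total weight on correctly classified points and W− for the weight on mistakes, the loss is W+ e^{−αm} + W− e^{αm}, minimized at αm = ½ ln(W+/W−) = ½ ln((1 − εm)/εm). A grid-search sketch with made-up weights confirming this:

```python
import math

# made-up normalized weights; True marks points the candidate h misclassifies
W    = [0.1, 0.2, 0.3, 0.4]
miss = [True, False, False, False]
eps  = sum(w for w, m in zip(W, miss) if m)    # weighted error W- = 0.1

def weighted_exp_loss(alpha):
    # mistakes contribute w * e^{+alpha}, correct points w * e^{-alpha}
    return sum(w * math.exp(alpha if m else -alpha) for w, m in zip(W, miss))

closed = 0.5 * math.log((1 - eps) / eps)       # = 0.5 * ln(9) ~ 1.0986
best_alpha = min(range(3001), key=lambda i: weighted_exp_loss(i / 1000)) / 1000
```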
The AdaBoost algorithm
0) Set W_i^(0) = 1/n for i = 1, . . . , n.
1) At the mth iteration we find (any) classifier h(x; θm) for which the weighted classification error εm,

    εm = 0.5 − 0.5 [ Σ_{i=1}^n W_i^(m−1) yi h(xi; θm) ],

   is better than chance.
2) The new component is assigned votes based on its error:

    αm = 0.5 log( (1 − εm)/εm )

3) The weights are updated according to (Zm is chosen so that the new weights W_i^(m) sum to one):

    W_i^(m) = (1/Zm) · W_i^(m−1) · exp{ −yi αm h(xi; θm) }
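Steps 0)–3) can be assembled into a short, self-contained sketch (a hedged illustration using exhaustively searched decision stumps on a made-up 1-D dataset, not the lecture's own code; it assumes each round's error satisfies 0 < εm < 0.5):

```python
import numpy as np

def best_stump(X, y, W):
    """Exhaustively search stumps s * sign(x_k - t) for the smallest
    weighted error under weights W. Returns (err, k, t, s)."""
    n, p = X.shape
    best = None
    for k in range(p):
        xs = np.sort(X[:, k])
        # thresholds: one below all points (constant classifier), then midpoints
        thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2))
        for t in thresholds:
            for s in (+1.0, -1.0):
                pred = s * np.where(X[:, k] > t, 1.0, -1.0)
                err = W[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, k, t, s)
    return best

def adaboost(X, y, rounds):
    n = len(y)
    W = np.full(n, 1.0 / n)                       # step 0: uniform weights
    ensemble = []
    for _ in range(rounds):
        err, k, t, s = best_stump(X, y, W)        # step 1: weak learner
        alpha = 0.5 * np.log((1 - err) / err)     # step 2: votes from error
        pred = s * np.where(X[:, k] > t, 1.0, -1.0)
        W = W * np.exp(-alpha * y * pred)         # step 3: reweight ...
        W /= W.sum()                              # ... and renormalize (Z_m)
        ensemble.append((alpha, k, t, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, k] > t, 1.0, -1.0)
                for a, k, t, s in ensemble)
    return np.where(score >= 0, 1.0, -1.0)

# made-up 1-D data: a "negative interval" that no single stump can fit
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1], dtype=float)
model = adaboost(X, y, rounds=3)
train_err = (predict(model, X) != y).mean()
```

Three rounds suffice here: each weak stump is only slightly better than chance, yet their weighted vote classifies every training point correctly, which is exactly the weak-to-strong behavior the earlier slides promise.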