Decision Trees: Boosting and Bagging
CS780/880 Introduction to Machine Learning
04/06/2017
SVM: Classification with Maximum Margin Hyperplane
[Figure: maximum margin hyperplane separating two classes in the (X1, X2) plane]
Kernel SVM: Polynomial and Radial Kernels
[Figure: decision boundaries of polynomial and radial kernel SVMs in the (X1, X2) plane]
Regression Methods

- Covered 6+ classification methods
- Regression methods (4+)?
- Which ones are generative/discriminative?
Regression Trees
- Predict Baseball Salary based on Years played and Hits
- Example:

[Figure: regression tree splitting first on Years < 4.5 and then on Hits < 117.5, with leaf predictions 5.11, 6.00, and 6.74]
Trees Partition the Space

[Figure: the same tree (Years < 4.5, Hits < 117.5) shown next to the partition of the (Years, Hits) plane into regions R1, R2, R3, with cut points at Years = 4.5 and Hits = 117.5]
Advantages/Disadvantages of Decision Trees

- Advantages:
  - Interpretability
  - Non-linearity
  - Little data preparation, scale invariance
  - Works with qualitative and quantitative features
- Disadvantages:
  - Hard to encode prior knowledge
  - Difficult to fit
  - Limited generalization
Decision Tree Terminology
- Internal nodes
- Branches
- Leaves

[Figure: the Years < 4.5 / Hits < 117.5 regression tree, used to illustrate the terminology]
Types of Decision Trees
- Regression trees
- Classification trees
Learning a Decision Tree

- NP-hard problem
- Approximate algorithms (heuristics):
  - ID3, C4.5, C5.0 (classification)
  - CART (classification and regression trees)
  - MARS (regression trees)
  - ...
CART: Learning Regression Trees
Two basic steps:
1. Divide predictor space into regions $R_1, \ldots, R_J$
2. Make the same prediction for all data points that fall in $R_j$
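A minimal scikit-learn sketch of these two steps (the library, the synthetic stand-in for the Years/Hits data, and all parameters are my own illustration, not the course's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Hitters data: Years played and Hits
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 25, 200),     # Years
                     rng.integers(1, 238, 200)])   # Hits
y = np.log(50 + 40 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 20, 200))

# A shallow tree partitions (Years, Hits) into a few regions and predicts
# the mean log-salary of the training points that fall in each region.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[5, 150]]))   # prediction for Years=5, Hits=150
```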
CART: Recursive Binary Splitting

- Greedy top-to-bottom approach
- Recursively divide regions to minimize RSS:
  $$\sum_{x_i \in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{x_i \in R_2} (y_i - \hat{y}_{R_2})^2$$
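One step of this greedy search can be written directly from the RSS criterion above; a pure-NumPy sketch (illustrative only, not the course's implementation):

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the single (feature, threshold) split that
    minimizes RSS over the two resulting regions R1 and R2."""
    best_j, best_t, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_t, best_rss = j, t, rss
    return best_j, best_t, best_rss
```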
CART: Splitting Example

[Figure: recursive binary splitting of the (X1, X2) plane at cut points t1, t2, t3, t4 into regions R1, ..., R5, together with the corresponding tree: split on X1 ≤ t1, then on X2 ≤ t2 and X1 ≤ t3, then on X2 ≤ t4]
Tree Pruning

- Bias-variance trade-off with regression trees?
- May overfit with many leaves.
- Better to build a large tree and then prune it to minimize:
  $$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$
- Why is it better to prune than to stop early?
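Recent versions of scikit-learn expose this cost-complexity criterion through the ccp_alpha parameter; a sketch on synthetic data (my own example, not from the slides):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow a large tree, then consider the sequence of pruned subtrees
# indexed by the penalty alpha (ccp_alpha in scikit-learn).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas[::20]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:8.1f}  leaves={pruned.get_n_leaves():3d}  "
          f"test R^2={pruned.score(X_te, y_te):.3f}")
```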
Pruning Example
[Figure: the pruned three-leaf tree (Years < 4.5, Hits < 117.5; predictions 5.11, 6.00, 6.74) next to the full unpruned tree, which additionally splits on RBI, Putouts, Walks, Runs, and further Years/Hits thresholds and has twelve leaves]
Impact of Pruning
[Figure: Mean Squared Error versus Tree Size (2 to 10) for Training, Cross-Validation, and Test data]
Classification Trees

- Similar to regression trees
- Except RSS does not make sense
- Use other measures of quality:
  1. Classification error rate:
     $$1 - \max_k \hat{p}_{mk}$$
     Often too pessimistic in practice
  2. Gini (impurity) index (CART):
     $$\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$$
  3. Cross-entropy (information gain) (ID3, C4.5):
     $$-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$
- ID3, C4.5 do not prune
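A small NumPy sketch (my own, following the three formulas above; logarithm base is a choice, base 2 here) that computes all three measures for a node from its labels:

```python
import numpy as np

def node_impurities(labels):
    """Impurity measures for a single node, from its class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    error   = 1 - p.max()              # classification error rate
    gini    = np.sum(p * (1 - p))      # Gini index
    entropy = -np.sum(p * np.log2(p))  # cross-entropy (in bits)
    return error, gini, entropy

print(node_impurities(["yes"] * 9 + ["no"] * 1))  # nearly pure node
print(node_impurities(["yes"] * 5 + ["no"] * 5))  # maximally impure node
```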
Why Not Use Classification Error?
Decision tree with classification error
Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html
Why Not Use Classification Error?
Decision tree with information gain
Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html
Why Not Use Classification Error?

Entropy is more optimistic
Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html
Pruning in Classification Trees
[Figure: the unpruned classification tree for the Heart data (splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, RestECG, with Yes/No leaves), the training / cross-validation / test error as a function of tree size, and the pruned six-leaf tree splitting on Thal, Ca, MaxHR, and ChestPain]
Trees vs. Linear Models
[Figure: four panels in the (X1, X2) plane comparing the decision boundaries produced by a linear model and by a tree]
Trees vs. KNN

- Trees do not require a distance metric
- Trees work well with categorical predictors
- Trees work well in large dimensions
- KNN is better in low-dimensional problems with complex decision boundaries
Bagging and Boosting

- Methods for reducing variance of decision trees
- Make predictions using a weighted vote of multiple trees
- Boosted trees are some of the most successful general machine learning methods (on Kaggle)
- Disadvantage of using votes of multiple trees?
Bagging

- Stands for "Bootstrap Aggregating"
- Construct multiple bootstrapped training sets:
  $$T_1, T_2, \ldots, T_B$$
- Fit a tree to each one:
  $$f_1, f_2, \ldots, f_B$$
- Make predictions by averaging individual tree predictions:
  $$f(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$
- Large values of B are not likely to overfit; B ≈ 100 is a good choice
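A sketch of these three steps with scikit-learn regression trees as the base learner (illustrative; the helper names are my own, and sklearn.ensemble also ships a ready-made BaggingRegressor):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, B=100, seed=0):
    """Fit B regression trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), len(X))   # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """f(x) = (1/B) * sum_b f_b(x): average the individual tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```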
Random Forests
- Many trees in bagging will be similar
- Algorithms choose the same features to split on
- Random forests help to address similarity:
  - At each split, choose only from m randomly sampled features
  - Good empirical choice is m = √p

[Figure: test classification error versus number of trees (up to 500) for m = p, m = p/2, and m = √p]
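In scikit-learn the m randomly sampled features per split correspond to max_features; a sketch on synthetic data (dataset and parameters are my own illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# max_features="sqrt" restricts each split to m = sqrt(p) random features;
# max_features=None would consider all p features, i.e. plain bagging.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))
```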
Cross-validation and Bagging
- No need for cross-validation when bagging
- Evaluating trees on out-of-bag samples is sufficient

[Figure: error versus number of trees (up to 300) for bagging and random forests, evaluated on test data and on out-of-bag samples]
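scikit-learn reports this out-of-bag estimate directly when oob_score=True; a sketch on synthetic data (my own example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# Each tree is trained on a bootstrap sample, so roughly a third of the rows
# are "out of bag" for it; those rows give an almost-free test-error estimate.
forest = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy estimated from out-of-bag samples
```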
Boosting (Gradient Boosting, AdaBoost)
What Kaggle has to say:
Source: http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
Gradient Boosting (Regression)

- Boosting uses all of the data, not a random subset (usually)
- Also builds trees $f_1, f_2, \ldots$
- and weights $\lambda_1, \lambda_2, \ldots$
- Combined prediction:
  $$f(x) = \sum_i \lambda_i f_i(x)$$
- Assume we already have trees and weights $1, \ldots, m$; what is the next best tree?
Gradient Boosting (Regression)

- Just use gradient descent
- Objective is to minimize RSS (times 1/2):
  $$\frac{1}{2} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
- Objective with the new tree $m+1$:
  $$\frac{1}{2} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} f_j(x_i) - f_{m+1}(x_i) \right)^2$$
- Greatest reduction in RSS: gradient
  $$y_i - \sum_{j=1}^{m} f_j(x_i) \approx f_{m+1}(x_i)$$
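A one-step derivation (my own filling-in, consistent with the notation above) of why the residual is the steepest-descent direction:

```latex
% Write F_m(x_i) = \sum_{j=1}^{m} f_j(x_i) for the current prediction and
% differentiate the 1/2-scaled RSS with respect to it:
\frac{\partial}{\partial F_m(x_i)} \; \frac{1}{2} \sum_{\ell=1}^{n} \bigl( y_\ell - F_m(x_\ell) \bigr)^2
  \;=\; -\bigl( y_i - F_m(x_i) \bigr)
% so the negative gradient at x_i is exactly the residual
% y_i - \sum_{j=1}^{m} f_j(x_i), which the next tree f_{m+1} is fit to match.
```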
Gradient Boosting

- Greatest reduction in RSS: gradient
  $$y_i - \sum_{j=1}^{m} f_j(x_i) \approx f_{m+1}(x_i)$$
- Fit the new tree to the following target (instead of $y_i$):
  $$y_i - \sum_{j=1}^{m} f_j(x_i)$$
- Compute the weight $\lambda$ by line search
- And many other bells and whistles
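Putting the pieces together, a bare-bones gradient boosting loop for squared loss (a fixed shrinkage weight lam stands in for the line search over λ; the data and hyperparameters are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Repeatedly fit a small tree to the current residuals and add it with a
# fixed shrinkage weight lam (standing in for the line search over lambda).
trees, lam = [], 0.1
pred = np.zeros_like(y, dtype=float)
for m in range(200):
    residuals = y - pred                      # negative gradient of (1/2)*RSS
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += lam * tree.predict(X)

print(np.mean((y - pred) ** 2))               # training MSE after boosting
```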
XGBoost
- Scalable and flexible gradient boosting
- Interfaces for many languages and environments
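A minimal usage sketch of its scikit-learn-style Python interface (assumes the xgboost package is installed; the dataset and parameters are illustrative):

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# XGBRegressor wraps boosted trees behind the familiar fit/predict API.
model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=2)
model.fit(X, y)
print(model.predict(X[:5]))
```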