Decision Trees: Boosting and Bagging
CS780/880 Introduction to Machine Learning
04/06/2017
SVM: Classification with Maximum Margin Hyperplane
[Figure: maximum margin hyperplane separating two classes in the (X1, X2) plane]
Kernel SVM: Polynomial and Radial Kernels
[Figure: decision boundaries of polynomial and radial kernel SVMs in the (X1, X2) plane]
Regression Methods

- Covered 6+ classification methods
- Regression methods (4+)?
- Which ones are generative/discriminative?
Regression Trees
- Predict Baseball Salary based on Years played and Hits
- Example:

[Figure: regression tree splitting first on Years < 4.5 and then on Hits < 117.5, with leaf predictions 5.11, 6.00, and 6.74]
Trees Partition the Space

[Figure: the same tree (Years < 4.5, Hits < 117.5) shown next to the partition of the (Years, Hits) plane into regions R1, R2, R3, with cut points at Years = 4.5 and Hits = 117.5]
Advantages/Disadvantages of Decision Trees

- Advantages:
  - Interpretability
  - Non-linearity
  - Little data preparation, scale invariance
  - Works with qualitative and quantitative features
- Disadvantages:
  - Hard to encode prior knowledge
  - Difficult to fit
  - Limited generalization
Decision Tree Terminology
- Internal nodes
- Branches
- Leaves

[Figure: the Years < 4.5 / Hits < 117.5 regression tree, used to illustrate the terminology]
Types of Decision Trees
- Regression trees
- Classification trees
Learning a Decision Tree

- NP-hard problem
- Approximate algorithms (heuristics):
  - ID3, C4.5, C5.0 (classification)
  - CART (classification and regression trees)
  - MARS (regression trees)
  - ...
CART: Learning Regression Trees
Two basic steps:
1. Divide predictor space into regions $R_1, \ldots, R_J$
2. Make the same prediction for all data points that fall in $R_j$
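A minimal scikit-learn sketch of these two steps (the library, the synthetic stand-in for the Years/Hits data, and all parameters are my own illustration, not the course's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Hitters data: Years played and Hits
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 25, 200),     # Years
                     rng.integers(1, 238, 200)])   # Hits
y = np.log(50 + 40 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 20, 200))

# A shallow tree partitions (Years, Hits) into a few regions and predicts
# the mean log-salary of the training points that fall in each region.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[5, 150]]))   # prediction for Years=5, Hits=150
```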
CART: Recursive Binary Splitting

- Greedy top-to-bottom approach
- Recursively divide regions to minimize RSS:
  $$\sum_{x_i \in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{x_i \in R_2} (y_i - \hat{y}_{R_2})^2$$
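One step of this greedy search can be written directly from the RSS criterion above; a pure-NumPy sketch (illustrative only, not the course's implementation):

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the single (feature, threshold) split that
    minimizes RSS over the two resulting regions R1 and R2."""
    best_j, best_t, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_t, best_rss = j, t, rss
    return best_j, best_t, best_rss
```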
CART: Splitting Example

[Figure: recursive binary splitting of the (X1, X2) plane at cut points t1, t2, t3, t4 into regions R1, ..., R5, together with the corresponding tree: split on X1 ≤ t1, then on X2 ≤ t2 and X1 ≤ t3, then on X2 ≤ t4]
Tree Pruning

- Bias-variance trade-off with regression trees?
- May overfit with many leaves.
- Better to build a large tree and then prune it to minimize:
  $$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$
- Why is it better to prune than to stop early?
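Recent versions of scikit-learn expose this cost-complexity criterion through the ccp_alpha parameter; a sketch on synthetic data (my own example, not from the slides):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow a large tree, then consider the sequence of pruned subtrees
# indexed by the penalty alpha (ccp_alpha in scikit-learn).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas[::20]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:8.1f}  leaves={pruned.get_n_leaves():3d}  "
          f"test R^2={pruned.score(X_te, y_te):.3f}")
```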
Pruning Example
[Figure: the pruned three-leaf tree (Years < 4.5, Hits < 117.5; predictions 5.11, 6.00, 6.74) next to the full unpruned tree, which additionally splits on RBI, Putouts, Walks, Runs, and further Years/Hits thresholds and has twelve leaves]
Impact of Pruning
[Figure: Mean Squared Error versus Tree Size (2 to 10) for Training, Cross-Validation, and Test data]
Classification Trees

- Similar to regression trees
- Except RSS does not make sense
- Use other measures of quality:
  1. Classification error rate:
     $$1 - \max_k \hat{p}_{mk}$$
     Often too pessimistic in practice
  2. Gini (impurity) index (CART):
     $$\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$$
  3. Cross-entropy (information gain) (ID3, C4.5):
     $$-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$
- ID3, C4.5 do not prune
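A small NumPy sketch (my own, following the three formulas above; logarithm base is a choice, base 2 here) that computes all three measures for a node from its labels:

```python
import numpy as np

def node_impurities(labels):
    """Impurity measures for a single node, from its class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    error   = 1 - p.max()              # classification error rate
    gini    = np.sum(p * (1 - p))      # Gini index
    entropy = -np.sum(p * np.log2(p))  # cross-entropy (in bits)
    return error, gini, entropy

print(node_impurities(["yes"] * 9 + ["no"] * 1))  # nearly pure node
print(node_impurities(["yes"] * 5 + ["no"] * 5))  # maximally impure node
```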
Why Not Use Classification Error?
Decision tree with classification error
Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html
Why Not Use Classification Error?
Decision tree with information gain
Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html
Why Not Use Classification Error?

Entropy is more optimistic
Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html
Pruning in Classification Trees
[Figure: the unpruned classification tree for the Heart data (splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, RestECG, with Yes/No leaves), the training / cross-validation / test error as a function of tree size, and the pruned six-leaf tree splitting on Thal, Ca, MaxHR, and ChestPain]
Trees vs. Linear Models
[Figure: four panels in the (X1, X2) plane comparing the decision boundaries produced by a linear model and by a tree]
Trees vs. KNN

- Trees do not require a distance metric
- Trees work well with categorical predictors
- Trees work well in large dimensions
- KNN is better in low-dimensional problems with complex decision boundaries
Bagging and Boosting

- Methods for reducing variance of decision trees
- Make predictions using a weighted vote of multiple trees
- Boosted trees are some of the most successful general machine learning methods (on Kaggle)
- Disadvantage of using votes of multiple trees?
Bagging

- Stands for "Bootstrap Aggregating"
- Construct multiple bootstrapped training sets:
  $$T_1, T_2, \ldots, T_B$$
- Fit a tree to each one:
  $$f_1, f_2, \ldots, f_B$$
- Make predictions by averaging individual tree predictions:
  $$f(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$
- Large values of B are not likely to overfit; B ≈ 100 is a good choice
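A sketch of these three steps with scikit-learn regression trees as the base learner (illustrative; the helper names are my own, and sklearn.ensemble also ships a ready-made BaggingRegressor):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, B=100, seed=0):
    """Fit B regression trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), len(X))   # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """f(x) = (1/B) * sum_b f_b(x): average the individual tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```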
Random Forests
- Many trees in bagging will be similar
- Algorithms choose the same features to split on
- Random forests help to address similarity:
  - At each split, choose only from m randomly sampled features
  - Good empirical choice is m = √p

[Figure: test classification error versus number of trees (up to 500) for m = p, m = p/2, and m = √p]
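In scikit-learn the m randomly sampled features per split correspond to max_features; a sketch on synthetic data (dataset and parameters are my own illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# max_features="sqrt" restricts each split to m = sqrt(p) random features;
# max_features=None would consider all p features, i.e. plain bagging.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))
```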
Cross-validation and Bagging
- No need for cross-validation when bagging
- Evaluating trees on out-of-bag samples is sufficient

[Figure: error versus number of trees (up to 300) for bagging and random forests, evaluated on test data and on out-of-bag samples]
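scikit-learn reports this out-of-bag estimate directly when oob_score=True; a sketch on synthetic data (my own example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# Each tree is trained on a bootstrap sample, so roughly a third of the rows
# are "out of bag" for it; those rows give an almost-free test-error estimate.
forest = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy estimated from out-of-bag samples
```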
Boosting (Gradient Boosting, AdaBoost)
What Kaggle has to say:
Source: http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
Gradient Boosting (Regression)

- Boosting uses all of the data, not a random subset (usually)
- Also builds trees $f_1, f_2, \ldots$
- and weights $\lambda_1, \lambda_2, \ldots$
- Combined prediction:
  $$f(x) = \sum_i \lambda_i f_i(x)$$
- Assume we already have trees and weights $1, \ldots, m$; what is the next best tree?
Gradient Boosting (Regression)

- Just use gradient descent
- Objective is to minimize RSS (times 1/2):
  $$\frac{1}{2} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
- Objective with the new tree $m+1$:
  $$\frac{1}{2} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} f_j(x_i) - f_{m+1}(x_i) \right)^2$$
- Greatest reduction in RSS: gradient
  $$y_i - \sum_{j=1}^{m} f_j(x_i) \approx f_{m+1}(x_i)$$
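A one-step derivation (my own filling-in, consistent with the notation above) of why the residual is the steepest-descent direction:

```latex
% Write F_m(x_i) = \sum_{j=1}^{m} f_j(x_i) for the current prediction and
% differentiate the 1/2-scaled RSS with respect to it:
\frac{\partial}{\partial F_m(x_i)} \; \frac{1}{2} \sum_{\ell=1}^{n} \bigl( y_\ell - F_m(x_\ell) \bigr)^2
  \;=\; -\bigl( y_i - F_m(x_i) \bigr)
% so the negative gradient at x_i is exactly the residual
% y_i - \sum_{j=1}^{m} f_j(x_i), which the next tree f_{m+1} is fit to match.
```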
Gradient Boosting

- Greatest reduction in RSS: gradient
  $$y_i - \sum_{j=1}^{m} f_j(x_i) \approx f_{m+1}(x_i)$$
- Fit the new tree to the following target (instead of $y_i$):
  $$y_i - \sum_{j=1}^{m} f_j(x_i)$$
- Compute the weight $\lambda$ by line search
- And many other bells and whistles
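Putting the pieces together, a bare-bones gradient boosting loop for squared loss (a fixed shrinkage weight lam stands in for the line search over λ; the data and hyperparameters are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Repeatedly fit a small tree to the current residuals and add it with a
# fixed shrinkage weight lam (standing in for the line search over lambda).
trees, lam = [], 0.1
pred = np.zeros_like(y, dtype=float)
for m in range(200):
    residuals = y - pred                      # negative gradient of (1/2)*RSS
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += lam * tree.predict(X)

print(np.mean((y - pred) ** 2))               # training MSE after boosting
```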
XGBoost
- Scalable and flexible gradient boosting
- Interfaces for many languages and environments
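A minimal usage sketch of its scikit-learn-style Python interface (assumes the xgboost package is installed; the dataset and parameters are illustrative):

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# XGBRegressor wraps boosted trees behind the familiar fit/predict API.
model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=2)
model.fit(X, y)
print(model.predict(X[:5]))
```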