
Page 1: Decision Trees: Boosting and Bagging

CS780/880 Introduction to Machine Learning

04/06/2017

Page 2: SVM: Classification with Maximum Margin Hyperplane

[Figure: two-class scatter plot with the maximum-margin separating hyperplane; axes X1 and X2]

Page 3: Kernel SVM: Polynomial and Radial Kernels

[Figure: two panels showing polynomial-kernel and radial-kernel SVM decision boundaries; axes X1 and X2]

Pages 4-6: Regression Methods

- Covered 6+ classification methods
- Regression methods (4+)?
- Which ones are generative/discriminative?


Page 7: Regression Trees

- Predict Baseball Salary based on Years played and Hits
- Example:

[Tree: split on Years < 4.5; left leaf 5.11; otherwise split on Hits < 117.5 with leaves 6.00 and 6.74]

Page 8: Tree Partition Space

[Figure: the same tree shown next to the partition of the (Years, Hits) plane it induces: R1 = {Years < 4.5}, R2 = {Years >= 4.5, Hits < 117.5}, R3 = {Years >= 4.5, Hits >= 117.5}]
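The tree on these two slides can be reproduced in a few lines. A minimal sketch, assuming synthetic (Years, Hits) data in place of the Hitters dataset used on the slides:

    # Fit a depth-2 regression tree like the one above; the data is synthetic,
    # generated to follow the slide's three-region partition.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(0)
    X = np.column_stack([rng.integers(1, 25, 200),     # Years
                         rng.integers(1, 239, 200)])   # Hits
    y = np.where(X[:, 0] < 4.5, 5.11,
                 np.where(X[:, 1] < 117.5, 6.00, 6.74)) + rng.normal(0, 0.1, 200)

    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["Years", "Hits"]))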

Pages 9-11: Advantages/Disadvantages of Decision Trees

- Advantages:
  - Interpretability
  - Non-linearity
  - Little data preparation, scale invariance
  - Works with qualitative and quantitative features
- Disadvantages:
  - Hard to encode prior knowledge
  - Difficult to fit
  - Limited generalization


Page 12: Decision Tree Terminology

- Internal nodes
- Branches
- Leaves

[Tree: the Years < 4.5 / Hits < 117.5 example with leaves 5.11, 6.00, 6.74]

Page 13: Types of Decision Trees

- Regression trees
- Classification trees

Pages 14-15: Learning a Decision Tree

- NP-hard problem
- Approximate algorithms (heuristics):
  - ID3, C4.5, C5.0 (classification)
  - CART (classification and regression trees)
  - MARS (regression trees)
  - ...


Page 16: CART: Learning Regression Trees

Two basic steps:

1. Divide the predictor space into regions $R_1, \dots, R_J$
2. Make the same prediction for all data points that fall in $R_j$

Page 17: CART: Recursive Binary Splitting

- Greedy top-down approach
- Recursively divide regions to minimize the RSS (the split search is sketched below):

$$\sum_{x_i \in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{x_i \in R_2} (y_i - \hat{y}_{R_2})^2$$
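The split search on this slide is easy to write down directly. A sketch of a single greedy split under the RSS criterion above, not the full CART algorithm, which applies this search recursively to each resulting region:

    import numpy as np

    def best_split(X, y):
        """Scan all features j and cutpoints s; return the split minimizing RSS."""
        best_j, best_s, best_rss = None, None, np.inf
        for j in range(X.shape[1]):
            for s in np.unique(X[:, j]):
                left, right = y[X[:, j] < s], y[X[:, j] >= s]
                if len(left) == 0 or len(right) == 0:
                    continue  # degenerate split
                rss = ((left - left.mean()) ** 2).sum() \
                    + ((right - right.mean()) ** 2).sum()
                if rss < best_rss:
                    best_j, best_s, best_rss = j, s, rss
        return best_j, best_s, best_rss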

Page 18: CART: Splitting Example

[Figure: recursive binary splitting of the (X1, X2) plane with cutpoints t1-t4 into regions R1-R5, alongside the corresponding tree: X1 ≤ t1 at the root, then X2 ≤ t2 on one branch and X1 ≤ t3 followed by X2 ≤ t4 on the other]

Pages 19-22: Tree Pruning

- Bias-variance trade-off with regression trees?
- May overfit with many leaves.
- Better to build a large tree and then prune it to minimize (a pruning sketch follows the slide):

$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$

- Why is it better to prune than to stop early?

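Cost-complexity pruning with the RSS + α|T| criterion above is available in scikit-learn. A minimal sketch, choosing α by cross-validation; X and y stand for any regression dataset:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    def prune_by_cv(X, y):
        # Candidate alphas for the RSS + alpha*|T| trade-off, computed by sklearn
        path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
        scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                                  X, y, cv=5).mean()
                  for a in path.ccp_alphas]
        best_alpha = path.ccp_alphas[int(np.argmax(scores))]
        return DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)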

Page 23: Pruning Example

[Figure: the pruned three-leaf tree (Years < 4.5, then Hits < 117.5; leaves 5.11, 6.00, 6.74) next to the full unpruned tree, which also splits on RBI, Putouts, Walks, and Runs and has twelve leaves with predictions between 4.622 and 7.289]

Page 24: Impact of Pruning

[Figure: mean squared error vs. tree size (2-10) for training, cross-validation, and test data]

Page 25: Classification Trees

- Similar to regression trees
- Except RSS does not make sense
- Use other measures of quality (computed in the sketch below):

1. Classification error rate:

$$1 - \max_k \hat{p}_{mk}$$

Often too pessimistic in practice.

2. Gini (impurity) index (CART):

$$\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$$

3. Cross-entropy (information gain) (ID3, C4.5):

$$-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$

- ID3, C4.5 do not prune
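The three measures compare directly as functions of the class proportions $\hat{p}_{mk}$ in a node. A quick sketch, assuming p holds proportions that sum to 1:

    import numpy as np

    def classification_error(p):
        return 1.0 - np.max(p)

    def gini(p):
        return np.sum(p * (1.0 - p))

    def cross_entropy(p):
        p = p[p > 0]                 # avoid log(0)
        return -np.sum(p * np.log(p))

    p = np.array([0.5, 0.25, 0.25])  # class proportions in one node
    print(classification_error(p), gini(p), cross_entropy(p))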

Page 26: Why Not Use Classification Error?

[Figure: decision tree grown using classification error as the splitting criterion]

Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html

Page 27: Why Not Use Classification Error?

[Figure: decision tree grown using information gain as the splitting criterion]

Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html

Page 28: Why Not Use Classification Error?

Entropy is more optimistic: a split can make the child nodes purer without changing the majority class in either child, which leaves the classification error unchanged but still reduces entropy (and Gini), so the purer split is preferred.

Source: https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html

Page 29: Pruning in Classification Trees

[Figure: Heart data example. Left: the full classification tree with splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, and RestECG, and Yes/No leaves. Center: training, cross-validation, and test error vs. tree size. Right: the pruned tree, splitting only on Thal, Ca, MaxHR, and ChestPain]

Page 30: Trees vs. Linear Models

[Figure: four panels on the (X1, X2) plane comparing the decision boundaries of a linear model and of a tree, for one problem with a linear true boundary and one with an axis-aligned rectangular boundary]

Pages 31-32: Trees vs. KNN

- Trees do not require a distance metric
- Trees work well with categorical predictors
- Trees work well in large dimensions
- KNN is better in low-dimensional problems with complex decision boundaries


Pages 33-34: Bagging and Boosting

- Methods for reducing variance of decision trees
- Make predictions using a weighted vote of multiple trees
- Boosted trees are some of the most successful general machine learning methods (on Kaggle)
- Disadvantage of using votes of multiple trees?


Page 35: Bagging

- Stands for "Bootstrap Aggregating"
- Construct multiple bootstrapped training sets: $T_1, T_2, \dots, T_B$
- Fit a tree to each one: $f_1, f_2, \dots, f_B$
- Make predictions by averaging the individual tree predictions (see the sketch below):

$$f(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$

- Large values of $B$ are not likely to overfit; $B \approx 100$ is a good choice
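Bagging is short enough to write out directly. A bare-bones sketch of the procedure above for regression; X and y are arbitrary training data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bag_trees(X, y, B=100, seed=0):
        """Fit B trees, each on a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        return [DecisionTreeRegressor().fit(X[idx], y[idx])
                for idx in (rng.integers(0, n, n) for _ in range(B))]

    def bagged_predict(trees, X):
        # Average the individual tree predictions: f(x) = (1/B) sum_b f_b(x)
        return np.mean([t.predict(X) for t in trees], axis=0)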

Page 36: Random Forests

- Many trees in bagging will be similar
- Algorithms choose the same features to split on
- Random forests help to address similarity:
  - At each split, choose only from $m$ randomly sampled features
- Good empirical choice is $m = \sqrt{p}$

[Figure: test classification error vs. number of trees (0-500) for m = p, m = p/2, and m = √p]
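In scikit-learn the $m = \sqrt{p}$ rule corresponds to max_features="sqrt", and setting max_features to the total number of features recovers plain bagging. A minimal sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    # m = sqrt(p) features considered at each split
    forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                    random_state=0).fit(X, y)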

Page 37: Cross-validation and Bagging

- No need for cross-validation when bagging
- Evaluating trees on out-of-bag samples is sufficient

[Figure: error vs. number of trees (0-300) for bagging and random forests, measured on the test set and on out-of-bag (OOB) samples]
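Each tree sees only its bootstrap sample, so roughly one third of the data is "out of bag" for it and acts as a built-in validation set. A sketch using scikit-learn's OOB score, again on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    forest = RandomForestClassifier(n_estimators=300, oob_score=True,
                                    random_state=0).fit(X, y)
    print("OOB error:", 1.0 - forest.oob_score_)  # no cross-validation needed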

Page 38: Boosting (Gradient Boosting, AdaBoost)

What Kaggle has to say:

[Image: quotation from the Kaggle blog on gradient boosting]

Source: http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

Pages 39-43: Gradient Boosting (Regression)

- Boosting uses all of the data, not a random subset (usually)
- Also builds trees $f_1, f_2, \dots$
- ... and weights $\lambda_1, \lambda_2, \dots$
- Combined prediction:

$$f(x) = \sum_i \lambda_i f_i(x)$$

- Assume we have trees and weights $1, \dots, m$; what is the next best tree?


Pages 44-47: Gradient Boosting (Regression)

- Just use gradient descent
- Objective is to minimize the RSS (scaled by 1/2):

$$\frac{1}{2} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2$$

- Objective with the new tree $m + 1$:

$$\frac{1}{2} \sum_{i=1}^{n} \Bigl(y_i - \sum_{j=1}^{m} f_j(x_i) - f_{m+1}(x_i)\Bigr)^2$$

- Greatest reduction in RSS: the negative gradient with respect to the current predictions is the residual, so the best new tree satisfies

$$y_i - \sum_{j=1}^{m} f_j(x_i) \approx f_{m+1}(x_i)$$


Pages 48-51: Gradient Boosting

- Greatest reduction in RSS: the gradient

$$y_i - \sum_{j=1}^{m} f_j(x_i) \approx f_{m+1}(x_i)$$

- Fit the new tree to the following target (instead of $y_i$), as in the sketch below:

$$y_i - \sum_{j=1}^{m} f_j(x_i)$$

- Compute the weight $\lambda$ by line search
- And many other bells and whistles

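The whole loop on these slides fits in a few lines: fit a small tree to the current residuals, pick λ by an exact line search (closed form for squared loss), and repeat. A minimal sketch, without the bells and whistles:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, M=100, max_depth=3):
        trees, lams = [], []
        resid = y.astype(float).copy()             # residuals = negative gradient
        for _ in range(M):
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, resid)
            pred = tree.predict(X)
            lam = (resid @ pred) / (pred @ pred)   # line search for squared loss
            trees.append(tree)
            lams.append(lam)
            resid -= lam * pred                    # update residuals
        return trees, lams

    def boosted_predict(trees, lams, X):
        # f(x) = sum_i lambda_i f_i(x)
        return sum(lam * t.predict(X) for t, lam in zip(trees, lams))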

Page 52: XGBoost

- Scalable and flexible gradient boosting
- Interfaces for many languages and environments
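A minimal usage sketch, assuming the xgboost Python package is installed; the parameter values are illustrative, not tuned:

    import xgboost as xgb
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
    model = xgb.XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
    model.fit(X, y)
    print(model.predict(X[:5]))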