Boosting and Additive Trees (2)


  • Slide 1/26

    Boosting and Additive Trees (2)

    Yi Zhang, Kevyn Collins-Thompson

    Advanced Statistical Seminar 11-745, Oct 29, 2002

  • Slide 2/26

    Recap: Boosting (1)

    Background: Ensemble Learning

    Boosting Definitions, Example

    AdaBoost

    Boosting as an Additive Model

    Boosting Practical Issues

    Exponential Loss

    Other Loss Functions

    Boosting Trees

    Boosting as Entropy Projection

    Data Mining Methods

  • Slide 3/26

    Outline for This Class

    Find the solution based on numerical optimization

    Control the model complexity and avoid overfitting

    Right-sized trees for boosting

    Number of iterations

    Regularization

    Understand the final model (interpretation)

    Single variable

    Correlation of variables

  • Slide 4/26

    Numerical Optimization

    Goal: Find f that minimizes the loss function over the training data

    Gradient descent search in the unconstrained function space to minimize the loss on the training data

    Loss on training data converges to zero

    $$\mathbf{f} \equiv \{f(x_1), f(x_2), \ldots, f(x_N)\}, \qquad \hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}) = \arg\min_{\mathbf{f}} \sum_{i=1}^{N} L(y_i, f(x_i))$$

    Gradient descent updates $\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m$, where the gradient $\mathbf{g}_m \in \mathbb{R}^N$ has components

    $$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}$$

    and the step length is $\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\, \mathbf{g}_m)$.

    With $\mathbf{f}$ unconstrained, the training loss is driven to zero: $\{f(x_1), \ldots, f(x_N)\} \to \{y_1, \ldots, y_N\}$.
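    As a concrete illustration (not from the original slides), here is a minimal numpy sketch of unconstrained gradient descent in function space for squared-error loss. The fitted values f(x_i) are free parameters, so the training loss goes to zero and f converges to y, exactly as noted above; a fixed step size stands in for the line search over rho_m.

    import numpy as np

    # Toy targets: with squared-error loss L(f) = 0.5 * sum_i (y_i - f(x_i))^2,
    # the gradient with respect to the fitted value f(x_i) is -(y_i - f(x_i)).
    y = np.array([1.0, -2.0, 0.5, 3.0])
    f = np.zeros_like(y)            # f = {f(x_1), ..., f(x_N)}, start from 0
    rho = 0.5                       # fixed step length standing in for the line search over rho_m

    for m in range(60):
        g = -(y - f)                # gradient g_m evaluated at f_{m-1}
        f = f - rho * g             # f_m = f_{m-1} - rho_m * g_m

    print(f)                              # ~[ 1.  -2.   0.5  3. ]
    print(0.5 * np.sum((y - f) ** 2))     # training loss has converged to (numerically) zero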

  • Slide 5/26

    Gradient Search on a Constrained Function Space: Gradient Tree Boosting

    Introduce a tree at the m-th iteration whose predictions $t_m$ are as close as possible to the negative gradient

    Advantage compared with unconstrained gradient search: robust, less likely to overfit

    $$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big( -g_{im} - T(x_i; \Theta) \big)^2$$
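    A minimal sketch of one such constrained step (assuming scikit-learn is available): grow a small regression tree whose predictions approximate the negative gradient, shown here for squared-error loss, where the negative gradient is simply the residual.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

    f_prev = np.zeros(200)              # current fit f_{m-1}(x_i)
    neg_grad = y - f_prev               # -g_im for squared-error loss is the residual

    # Approximate argmin_Theta sum_i (-g_im - T(x_i; Theta))^2 by growing a small
    # regression tree on the negative gradient; its predictions t_m track -g_m.
    tree = DecisionTreeRegressor(max_leaf_nodes=6).fit(X, neg_grad)
    t_m = tree.predict(X)
    print(np.mean((neg_grad - t_m) ** 2))   # small: the tree tracks the negative gradient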

  • Slide 6/26

    Algorithm 3: MART (Multiple Additive Regression Trees)

    1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$ (a single terminal node tree).

    2. For m = 1 to M:

       a) Compute pseudo-residuals based on the loss function: $r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$, for i = 1, ..., N.

       b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, j = 1, 2, ..., $J_m$.

       c) Find the optimal value of the coefficient within each different region $R_{jm}$: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$.

       d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.

    3. Output $\hat{f}(x) = f_M(x)$.
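    A minimal, self-contained sketch of Algorithm 3 for squared-error loss (not from the slides; it assumes scikit-learn trees as the base learner). With L(y, f) = 0.5 * (y - f)^2 the pseudo-residuals are y_i - f_{m-1}(x_i) and the optimal gamma_jm in each region is just the mean residual there, so steps (a)-(d) collapse; other losses would need their own gradient and a per-region line search in step (c).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def mart_squared_error(X, y, M=100, J=8):
        """Gradient tree boosting (Algorithm 3) for squared-error loss."""
        # 1. Initialize f_0(x) to the optimal constant (a single terminal node tree).
        f0 = y.mean()
        f = np.full_like(y, f0, dtype=float)
        trees = []
        for m in range(M):
            # a) Pseudo-residuals: negative gradient of 0.5*(y - f)^2 w.r.t. f.
            r = y - f
            # b) Fit a J-terminal-node regression tree to the residuals.
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)
            # c) For squared-error loss the optimal gamma_jm in region R_jm is the
            #    mean residual there, which is exactly what the tree predicts.
            # d) Update f_m(x) = f_{m-1}(x) + sum_j gamma_jm I(x in R_jm).
            f += tree.predict(X)
            trees.append(tree)
        return f0, trees

    def predict(f0, trees, X):
        return f0 + sum(t.predict(X) for t in trees)

    # Example usage on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 2))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=300)
    f0, trees = mart_squared_error(X, y, M=50, J=6)
    print(np.mean((y - predict(f0, trees, X)) ** 2))   # small training error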

  • Slide 7/26

    View Boosting as a Linear Model

    Basis expansion: use basis functions $T_m$ (m = 1, ..., M, each $T_m$ a weak learner) to transform the input vector X into T-space, then use a linear model in this new space

    Special for boosting: the choice of basis function $T_m$ depends on $T_1, \ldots, T_{m-1}$

  • Slide 8/26

    Improve Boosting as Linear Model

    Recap: Linear models in Chapter 3

    Bias-variance trade-off:

    1. Subset selection (feature selection, discrete)

    2. Coefficient shrinkage (smoothing: ridge, lasso)

    3. Using derived input directions (PCA, PLS)

    Multiple outcome shrinkage and selection: exploit correlations in different outcomes

    This chapter: improve boosting

    1. Size of the constituent trees J

    2. Number of boosting iterations M (subset selection)

    3. Regularization (shrinkage)

  • Slide 9/26

    Right-Sized Trees for Boosting (?)

    The best for one step is not the best in the long run

    Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation

    Simple approach: restrict all trees to be the same size J

    J limits the interaction level of the input features in the tree-based approximation

    In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well (?)
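    To illustrate the "same size J" recipe (an illustration, not part of the slides): scikit-learn's GradientBoostingRegressor exposes max_leaf_nodes, which caps the number of terminal regions J of every constituent tree, so values in the suggested 4-8 range can be compared against much larger trees on held-out data.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 4))
    y = np.sin(X[:, 0]) * X[:, 1] + 0.2 * rng.normal(size=1000)   # a low-order interaction
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for J in (2, 4, 8, 32):   # every tree in the expansion has at most J terminal nodes
        gbm = GradientBoostingRegressor(max_leaf_nodes=J, n_estimators=200, random_state=0)
        gbm.fit(X_tr, y_tr)
        print(J, gbm.score(X_te, y_te))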

  • Slide 10/26

  • Slide 11/26

    Number of Boosting Iterations (subset selection)

    Boosting will overfit as M → ∞

    Use a validation set

    Other methods (later)
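    A minimal sketch of picking M with a validation set (scikit-learn assumed): staged_predict yields predictions after each boosting iteration, so the validation error can be tracked across all M stages and the minimizing M chosen.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 4))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=1000)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    gbm = GradientBoostingRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

    # Validation error after each of the M boosting iterations.
    val_err = [np.mean((y_val - pred) ** 2) for pred in gbm.staged_predict(X_val)]
    best_M = int(np.argmin(val_err)) + 1
    print(best_M, val_err[best_M - 1])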

  • Slide 12/26

    Shrinkage

    Scale the contribution of each tree by a factor 0 < ν < 1
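    A small illustration (again assuming scikit-learn, not part of the slides): the shrinkage factor ν multiplies each tree's contribution in the update of Algorithm 3, and corresponds to the learning_rate parameter; smaller ν typically needs a larger M but generalizes better.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 4))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=1000)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for nu in (1.0, 0.1, 0.01):   # the shrinkage factor 0 < nu <= 1
        gbm = GradientBoostingRegressor(learning_rate=nu, n_estimators=1000, random_state=0)
        gbm.fit(X_tr, y_tr)
        print(nu, gbm.score(X_te, y_te))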

  • Slide 13/26

  • Slide 14/26

    Penalized Regression

    Ridge regression or lasso regression:

    $$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha) \right\}$$

    $$J(\alpha) = \sum_{k=1}^{K} \alpha_k^2 \quad \text{(ridge regression, L2 norm)}$$

    $$J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \quad \text{(lasso)}$$
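    A brief sketch of the two penalties (scikit-learn assumed; not from the slides). The columns of the design matrix here stand in for a hypothetical, pre-computed dictionary of tree predictions T_k(x_i); Ridge applies the L2 penalty and Lasso the L1 penalty, and the L1 penalty drives most coefficients alpha_k exactly to zero.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    N, K = 200, 50
    T = rng.normal(size=(N, K))           # T[i, k] stands in for T_k(x_i), the k-th tree's prediction
    alpha_true = np.zeros(K)
    alpha_true[:5] = rng.normal(size=5)   # only a few basis trees actually matter
    y = T @ alpha_true + 0.1 * rng.normal(size=N)

    ridge = Ridge(alpha=1.0).fit(T, y)    # penalty lambda * sum(alpha_k^2)
    lasso = Lasso(alpha=0.05).fit(T, y)   # penalty lambda * sum(|alpha_k|)

    print(np.sum(np.abs(ridge.coef_) > 1e-8))   # ridge keeps essentially all coefficients
    print(np.sum(np.abs(lasso.coef_) > 1e-8))   # lasso zeroes out most of them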

  • Slide 15/26

    Algorithm 4: Forward Stagewise Linear Regression

    1. Initialize $\alpha_k = 0$, k = 1, ..., K. Set $\varepsilon > 0$ to some small constant, and M large.

    2. For m = 1 to M:

       a) $(\beta^*, k^*) = \arg\min_{\beta, k} \sum_{i=1}^{N} \Big( y_i - \sum_{l=1}^{K} \alpha_l T_l(x_i) - \beta T_k(x_i) \Big)^2$

       b) $\alpha_{k^*} \leftarrow \alpha_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$

    3. Output $f(x) = \sum_{k=1}^{K} \alpha_k T_k(x)$.
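    A minimal numpy sketch of Algorithm 4 (not from the slides): at every pass, find the basis column whose single-coefficient update most reduces the squared residual, then move that coefficient by epsilon * sign(beta*). With small epsilon and many passes, the coefficient paths mimic the lasso.

    import numpy as np

    def forward_stagewise(T, y, eps=0.01, M=5000):
        """Algorithm 4: forward stagewise linear regression on basis matrix T (N x K)."""
        N, K = T.shape
        alpha = np.zeros(K)                       # 1. initialize alpha_k = 0
        for m in range(M):                        # 2. for m = 1..M
            resid = y - T @ alpha
            # a) best (beta*, k*): for fixed k, the optimal beta is the least-squares
            #    coefficient of the current residual on column k.
            betas = T.T @ resid / np.sum(T ** 2, axis=0)
            sse = np.sum(resid ** 2) - betas ** 2 * np.sum(T ** 2, axis=0)
            k_star = int(np.argmin(sse))
            # b) take a small step in the direction of the best single-column fit
            alpha[k_star] += eps * np.sign(betas[k_star])
        return alpha                              # 3. f(x) = sum_k alpha_k T_k(x)

    # Example usage on a random basis.
    rng = np.random.default_rng(0)
    T = rng.normal(size=(100, 10))
    y = 2 * T[:, 0] - 1.5 * T[:, 3] + 0.1 * rng.normal(size=100)
    print(np.round(forward_stagewise(T, y), 2))   # large weights on columns 0 and 3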

  • Slide 16/26

    (ε, M) vs. lasso regression's t

    If $\hat{\alpha}_k(\lambda)$ is monotone in $\lambda$, and we set $\sum_k |\alpha_k| = M\varepsilon = t$, then the solution of Algorithm 4 is identical to the lasso regression result as described on page 64.

  • Slide 17/26

    More About Algorithm 4

    Algorithm 4 ≈ Algorithm 3 + shrinkage

    L1 norm vs. L2 norm: more details later, in Chapter 12 after learning SVMs

  • Slide 18/26

    Interpretation: Understanding the Final Model

    Single decision trees are easy to interpret

    A linear combination of trees is difficult to understand

    Which features are important?

    What are the interactions between features?

  • Slide 19/26

    Relative Importance of Individual Variables

    For a single tree T, define the (squared) importance of $x_l$ as

    $$\mathcal{I}_l^2(T) = \sum_{\text{all nodes using } x_l \text{ for the partition}} \big( \text{improvement in squared-error risk over that for a constant fit over the region} \big)$$

    For additive trees, define the importance of $x_l$ as

    $$\mathcal{I}_l^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_l^2(T_m)$$

    For K-class classification, just treat it as K two-class classification tasks
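    For illustration (not part of the slides): scikit-learn's gradient boosting exposes feature_importances_, an impurity-based measure in the same spirit as the averaged squared importance above.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 5))
    y = 3 * X[:, 0] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=1000)   # features 1, 3, 4 are noise

    gbm = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)
    for l, imp in enumerate(gbm.feature_importances_):
        print(f"x_{l}: {imp:.3f}")    # importance concentrates on x_0 and x_2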

  • Slide 20/26

  • Slide 21/26

    Partial Dependence Plots

    Visualize the dependence of the approximation f(X) on the joint values of important features

    Usually the size of the subsets is small (1-3)

    Define the average or partial dependence

    $$\bar{f}_S(X_S) = E_{X_C}\, f(X_S, X_C) = \int f(X_S, X_C)\, P(X_C)\, dX_C$$

    It can be estimated empirically using the training data:

    $$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC})$$
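    A minimal sketch of the empirical estimate above (scikit-learn assumed for the fitted model): for each value of the chosen feature X_S, average the model's predictions over the training values of the complement features X_C.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 3))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    def partial_dependence(model, X, s, grid):
        """f_bar_S(v) = (1/N) * sum_i f(v, x_iC): average prediction with feature s clamped at v."""
        pd = []
        for v in grid:
            Xv = X.copy()
            Xv[:, s] = v                     # clamp X_S at v, keep each row's own X_C
            pd.append(model.predict(Xv).mean())
        return np.array(pd)

    grid = np.linspace(-3, 3, 7)
    print(np.round(partial_dependence(model, X, s=0, grid=grid), 2))   # roughly sin(grid) + const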

  • Slide 22/26

    10.50 vs. 10.52

    They are the same if the predictor variables are independent

    Why use 10.50 instead of 10.52 to measure partial dependence?

    Example 1: $f(X) = h_1(X_S) + h_2(X_C)$

    Example 2: $f(X) = h_1(X_S) \cdot h_2(X_C)$

    $$10.50:\quad \bar{f}_S(X_S) = E_{X_C}\, f(X_S, X_C) = \int f(X_S, X_C)\, P(X_C)\, dX_C$$

    $$10.52:\quad \tilde{f}_S(X_S) = E\big( f(X_S, X_C) \mid X_S \big) = \int f(X_S, X_C)\, P(X_C \mid X_S)\, dX_C$$

    For Example 1, 10.50 gives

    $$\bar{f}_S(X_S) = \int \big( h_1(X_S) + h_2(X_C) \big)\, P(X_C)\, dX_C = h_1(X_S) + \int h_2(X_C)\, P(X_C)\, dX_C = h_1(X_S) + \text{constant}$$
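    A small numerical check of Example 2 (an illustration under assumed forms of h_1 and h_2, not from the slides): with dependent predictors, the partial dependence (10.50) still recovers the shape of h_1 up to a constant factor, while the conditional expectation (10.52) mixes in the dependence of X_C on X_S.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    x_s = rng.uniform(0, 1, size=N)
    x_c = x_s + 0.1 * rng.normal(size=N)          # X_C depends strongly on X_S

    h1 = lambda s: 1.0 + s                        # assumed h_1
    h2 = lambda c: np.exp(c)                      # assumed h_2
    f = lambda s, c: h1(s) * h2(c)                # Example 2: f(X) = h_1(X_S) * h_2(X_C)

    grid = np.linspace(0.1, 0.9, 5)
    pd_1050   = [f(v, x_c).mean() for v in grid]                          # E_{X_C} f(v, X_C)
    cond_1052 = [f(v, x_c[np.abs(x_s - v) < 0.02]).mean() for v in grid]  # approx E[f | X_S = v]

    print(np.round(np.array(pd_1050) / h1(grid), 3))    # ~constant: 10.50 is proportional to h_1
    print(np.round(np.array(cond_1052) / h1(grid), 3))  # varies with v: 10.52 distorts h_1's shape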

  • Slide 23/26

  • Slide 24/26

  • Slide 25/26

    Conclusion

    Find the solution based on numerical optimization

    Control the model complexity and avoid overfitting

    Right-sized trees for boosting

    Number of iterations

    Regularization

    Understand the final model (interpretation)

    Single variable

    Correlation of variables

  • Slide 26/26