Boosting and Additive Trees (2)
Yi Zhang, Kevyn Collins-Thompson
Advanced Statistical Seminar 11-745, Oct 29, 2002


TRANSCRIPT

Page 1: Boosting and Additive Trees (2)

Boosting and Additive Trees (2)

Yi Zhang, Kevyn Collins-Thompson
Advanced Statistical Seminar 11-745

Oct 29, 2002

Page 2: Boosting and Additive Trees (2)

Recap: Boosting (1)

• Background: Ensemble Learning
• Boosting Definitions, Example
• AdaBoost
• Boosting as an Additive Model
• Boosting Practical Issues
• Exponential Loss
• Other Loss Functions
• Boosting Trees
• Boosting as Entropy Projection
• Data Mining Methods

Page 3: Boosting and Additive Trees (2)

Outline for This Class

• Find the solution based on numerical optimization
• Control the model complexity and avoid overfitting
  – Right-sized trees for boosting
  – Number of iterations
  – Regularization
• Understand the final model (Interpretation)
  – Single variable
  – Correlation of variables

Page 4: Boosting and Additive Trees (2)

Numerical Optimization

• Goal: Find the f that minimizes the loss function over the training data

• Gradient descent: search in the unconstrained function space to minimize the loss on the training data

• The loss on the training data converges to zero

Treating the fitted values at the N training points as the parameters to be optimized,

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}) = \arg\min_{\mathbf{f}} \sum_{i=1}^{N} L(y_i, f(x_i)), \qquad \mathbf{f}_m = \{f_m(x_1), f_m(x_2), \ldots, f_m(x_N)\}, \quad \mathbf{y} = \{y_1, y_2, \ldots, y_N\}.$$

Gradient descent updates $\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m$, where the components of the gradient $\mathbf{g}_m$ and the step length $\rho_m$ are

$$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}, \qquad \rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\, \mathbf{g}_m).$$
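As a minimal numerical sketch of this idea (my own illustration, not part of the slides): for squared-error loss the gradient at a training point is just f(x_i) - y_i, so gradient descent on the fitted values, treated as free parameters, drives the training loss toward zero, exactly as the last bullet claims.

```python
import numpy as np

# Sketch: gradient descent directly on the fitted values f(x_i), treated as
# free parameters, for squared-error loss L(y, f) = 0.5 * (y - f)^2.
rng = np.random.default_rng(0)
y = rng.normal(size=20)          # toy targets
f = np.zeros_like(y)             # f_0: initial fitted values
rho = 0.1                        # fixed step length for simplicity

for m in range(200):
    g = f - y                    # g_im = dL/df at the current fit
    f = f - rho * g              # f_m = f_{m-1} - rho * g_m

print("training loss:", 0.5 * np.sum((y - f) ** 2))   # approaches zero
```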

Page 5: Boosting and Additive Trees (2)

Gradient Search on Constrained Function Space: Gradient Tree Boosting

• Introduce a tree at the m-th iteration whose predictions T(x_i; Θ_m) are as close as possible to the negative gradient

• Advantage compared with unconstrained gradient search: more robust, less likely to overfit

$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \left( -g_{im} - T(x_i; \Theta) \right)^2$$

Page 6: Boosting and Additive Trees (2)

Algorithm 3: MART

1. Initialize $f_0(x)$ to a single-terminal-node tree: $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$

2. For m = 1 to M:

   a) Compute the pseudo-residuals based on the loss function: $r_{im} = -\left[ \dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}, \quad i = 1, \ldots, N$

   b) Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, 2, \ldots, J_m$

   c) Find the optimal value of the coefficient within each region: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$

   d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$

   End

3. Output $\hat{f}(x) = f_M(x)$
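A compact sketch of Algorithm 3 for squared-error loss (my own illustration, not from the slides), using scikit-learn's DecisionTreeRegressor as the weak learner; the helper names mart_fit and mart_predict are made up here. For squared error the pseudo-residuals are simply y - f(x), and step (c) is handled automatically because a squared-error tree already stores the mean residual in each leaf. A shrinkage factor nu is included, anticipating the later slide on shrinkage.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, M=100, J=6, nu=0.1):
    """Gradient tree boosting (MART) sketch for squared-error loss."""
    f0 = np.mean(y)                      # single-terminal-node initialization
    f = np.full(len(y), f0)
    trees = []
    for m in range(M):
        r = y - f                        # pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_leaf_nodes=J)  # J-terminal-node tree
        tree.fit(X, r)
        f = f + nu * tree.predict(X)     # shrunken update (shrinkage factor nu)
        trees.append(tree)
    return f0, trees

def mart_predict(X, f0, trees, nu=0.1):
    f = np.full(X.shape[0], f0)
    for tree in trees:
        f += nu * tree.predict(X)
    return f

# Toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
f0, trees = mart_fit(X, y)
print("train MSE:", np.mean((y - mart_predict(X, f0, trees)) ** 2))
```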

Page 7: Boosting and Additive Trees (2)

View Boosting as Linear Model

• Basis expansion:
  – Use basis functions T_m (m = 1..M, each T_m a weak learner) to transform the input vector X into T-space, then use a linear model in this new space

• Special to boosting: the choice of basis function T_m depends on T_1, ..., T_{m-1}

Page 8: Boosting and Additive Trees (2)

Improve Boosting as Linear Model

Recap: Linear Models in Chapter 3

• Bias-variance trade-off:
  1. Subset selection (feature selection, discrete)
  2. Coefficient shrinkage (smoothing: ridge, lasso)
  3. Using derived input directions (PCA, PLS)

• Multiple-outcome shrinkage and selection
  – Exploit correlations among different outcomes

This Chapter: Improve Boosting

1. Size of the constituent trees, J
2. Number of boosting iterations M (subset selection)
3. Regularization (shrinkage)

Page 9: Boosting and Additive Trees (2)

Right-Sized Trees for Boosting (?)

• The best tree for one step is not the best in the long run
  – Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion. This usually degrades performance and increases computation
• Simple approach: restrict all trees to the same size J (see the snippet below)
• J limits the interaction level among input features in the tree-based approximation
• In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well (?)
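For concreteness, a small sketch of fixing the tree size (my own illustration, assuming scikit-learn's GradientBoostingRegressor and its make_friedman1 toy dataset, neither of which is mentioned in the slides); max_leaf_nodes plays the role of J here.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

# max_leaf_nodes is the number of terminal nodes J of every tree, so a value
# in the 4-8 range keeps the expansion to low-order interactions.
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
model = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=200,
                                  learning_rate=0.1, random_state=0)
model.fit(X, y)
```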

Page 10: Boosting and Additive Trees (2)
Page 11: Boosting and Additive Trees (2)

Number of Boosting Iterations (subset selection)

• Boosting will overfit as M → ∞
• Use a validation set (see the sketch below)

• Other methods … (later)
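A minimal validation-set sketch for choosing M (my own illustration, assuming scikit-learn's GradientBoostingRegressor, whose staged_predict yields the prediction after each boosting iteration):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Fit with a generous M, then pick the iteration that minimizes validation
# error using the staged (per-iteration) predictions.
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(n_estimators=1000, max_leaf_nodes=6,
                                  learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)

val_err = [np.mean((y_val - pred) ** 2)
           for pred in model.staged_predict(X_val)]
best_M = int(np.argmin(val_err)) + 1
print("best number of iterations M:", best_M)
```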

Page 12: Boosting and Additive Trees (2)

Shrinkage

• Scale the contribution of each tree by a factor ν, 0 < ν < 1, to control the learning rate

• Both ν and M control prediction risk on the training data, and they do not operate independently

$$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$$

Page 13: Boosting and Additive Trees (2)
Page 14: Boosting and Additive Trees (2)

Penalized Regression

• Ridge regression or Lasso regression

$$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha) \right\}$$

$$J(\alpha) = \sum_{k=1}^{K} \alpha_k^2 \;\; \text{(ridge regression, } L_2 \text{ norm)}, \qquad J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \;\; \text{(lasso)}$$
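To make the connection to boosting concrete, here is a small sketch (my own, not from the slides) that treats the trees of a fitted booster as the basis functions T_k(x) and refits their coefficients with an L1 penalty, assuming scikit-learn's Lasso and the estimators_ attribute of GradientBoostingRegressor. The L1 penalty zeroes out many tree coefficients, which is the "subset selection" flavor of regularization.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

# Build a dictionary of tree basis functions T_k(x) from a fitted booster,
# then refit their coefficients alpha_k with an L1 (lasso) penalty.
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
booster = GradientBoostingRegressor(n_estimators=100, max_leaf_nodes=6,
                                    random_state=0).fit(X, y)

# Column k holds the predictions of the k-th regression tree.
T = np.column_stack([est[0].predict(X) for est in booster.estimators_])

lasso = Lasso(alpha=0.01, max_iter=10_000).fit(T, y)
print("non-zero tree coefficients:", np.sum(lasso.coef_ != 0), "of", T.shape[1])
```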

Page 15: Boosting and Additive Trees (2)

Algorithm 4: Forward stagewise linear

1. Initialize $\alpha_k = 0$, $k = 1, \ldots, K$; set $\varepsilon$ to some small constant and M large.

2. For m = 1 to M:

   a) $(\beta^*, k^*) = \arg\min_{\beta, k} \sum_{i=1}^{N} \Big( y_i - \sum_{l=1}^{K} \alpha_l T_l(x_i) - \beta\, T_k(x_i) \Big)^2$

   b) $\alpha_{k^*} \leftarrow \alpha_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$

3. Output $f(x) = \sum_{k=1}^{K} \alpha_k T_k(x)$
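A direct sketch of Algorithm 4 (my own illustration; the helper name forward_stagewise and the toy basis matrix are made up): for each candidate basis T_k, the inner minimization over β is just a one-variable least-squares fit to the current residuals, and the winning coefficient is nudged by ε in the direction of that fit.

```python
import numpy as np

def forward_stagewise(T, y, eps=0.01, M=2000):
    """Forward stagewise linear regression over a fixed basis matrix T (N x K)."""
    N, K = T.shape
    alpha = np.zeros(K)
    for m in range(M):
        r = y - T @ alpha                              # current residuals
        # Least-squares coefficient of each basis function on the residuals
        betas = (T * r[:, None]).sum(axis=0) / (T ** 2).sum(axis=0)
        # Pick the (beta, k) pair with the smallest residual sum of squares
        rss = ((r[:, None] - T * betas) ** 2).sum(axis=0)
        k_star = int(np.argmin(rss))
        alpha[k_star] += eps * np.sign(betas[k_star])  # small step of size eps
    return alpha

# Toy usage: random basis matrix and targets
rng = np.random.default_rng(0)
T = rng.normal(size=(100, 10))
y = T[:, 0] - 2 * T[:, 3] + 0.1 * rng.normal(size=100)
print(np.round(forward_stagewise(T, y), 2))
```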

Page 16: Boosting and Additive Trees (2)

(ε, M) vs. lasso regression (t)

If $\hat{\alpha}_k(\lambda)$ is monotone in $\lambda$, then $\sum_k |\alpha_k| = M \cdot \varepsilon$, and the solution of Algorithm 4 is identical to the result of lasso regression as described on page 64.

Page 17: Boosting and Additive Trees (2)

More about algorithm 4

• Algorithm 4 ≈ Algorithm 3 + shrinkage

• L1 norm vs. L2 norm: more details later
  – Chapter 12, after learning SVMs

Page 18: Boosting and Additive Trees (2)

Interpretation: Understanding the final model

• Single decision trees are easy to interpret

• Linear combination of trees is difficult to understand
  – Which features are important?
  – What's the interaction between features?

Page 19: Boosting and Additive Trees (2)

Relative Importance of Individual Variables

– For a single tree T, define the importance of x_l as in the first equation below

– For an additive tree model, define the importance of x_l as in the second equation below

– For K-class classification, treat it as K two-class classification tasks

$$\mathcal{I}_l^2(T) = \sum_{\text{all nodes that split on } x_l} \big(\text{improvement in squared-error risk over a constant fit to the region}\big)$$

$$\mathcal{I}_l^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_l^2(T_m)$$
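For reference, scikit-learn's boosted-tree implementation exposes a closely related quantity (this is my own aside, not from the slides): feature_importances_ averages the impurity-based split improvements attributed to each variable over all trees and normalizes them to sum to one.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

# Fit a boosted model and inspect per-variable relative importance.
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=200, max_leaf_nodes=6,
                                  random_state=0).fit(X, y)

for l, imp in enumerate(model.feature_importances_):
    print(f"x{l}: {imp:.3f}")
```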

Page 20: Boosting and Additive Trees (2)
Page 21: Boosting and Additive Trees (2)

Partial Dependence Plots

• Visualize dependence of approximation f(x) on the joint values of important features

• Usually the size of the subsets is small (1-3)

• Define average or partial dependence

• Can be estimated empirically using the training data:

$$f_S(X_S) = E_{X_C}\big[f(X_S, X_C)\big] = \int f(X_S, X_C)\, p_C(X_C)\, dX_C$$

$$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC})$$
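The empirical estimate is easy to code by hand. A sketch (my own, with a made-up helper name partial_dependence; scikit-learn is used only to supply a fitted model): for each grid value of the feature of interest, overwrite that column for every training row, predict, and average over the remaining (complement) features. scikit-learn's sklearn.inspection.partial_dependence performs essentially the same averaging.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

def partial_dependence(model, X, feature, grid):
    """Empirical partial dependence: f_bar_S(x_S) = (1/N) * sum_i f(x_S, x_iC)."""
    pd = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v            # fix the feature of interest at v
        pd.append(model.predict(X_mod).mean())
    return np.array(pd)

# Toy usage
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=200, max_leaf_nodes=6,
                                  random_state=0).fit(X, y)
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
print(partial_dependence(model, X, feature=0, grid=grid))
```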

Page 22: Boosting and Additive Trees (2)

10.50 vs. 10.52

• Same if the predictor variables are independent
• Why use 10.50 instead of 10.52 to measure partial dependence?
  – Example 1: f(X) = h_1(X_S) + h_2(X_C)
  – Example 2: f(X) = h_1(X_S) · h_2(X_C)

$$\text{10.50:}\quad f_S(X_S) = E_{X_C}\big[f(X_S, X_C)\big] = \int f(X_S, X_C)\, p_C(X_C)\, dX_C$$

$$\text{10.52:}\quad \tilde{f}_S(X_S) = E\big[f(X_S, X_C) \mid X_S\big] = \int f(X_S, X_C)\, p(X_C \mid X_S)\, dX_C$$

For Example 1, equation 10.50 gives

$$f_S(X_S) = \int \big(h_1(X_S) + h_2(X_C)\big)\, p_C(X_C)\, dX_C = h_1(X_S) + \text{constant},$$

so the partial dependence recovers $h_1(X_S)$ up to an additive constant (and, by the same argument, up to a multiplicative constant in Example 2). The conditional expectation 10.52 would instead mix in the dependence of $X_C$ on $X_S$ whenever the predictors are correlated.

Page 23: Boosting and Additive Trees (2)
Page 24: Boosting and Additive Trees (2)
Page 25: Boosting and Additive Trees (2)

Conclusion

• Find the solution based on numerical optimization
• Control the model complexity and avoid overfitting
  – Right-sized trees for boosting
  – Number of iterations
  – Regularization
• Understand the final model (Interpretation)
  – Single variable
  – Correlation of variables

Page 26: Boosting and Additive Trees (2)