Boosting and Additive Trees (2)
Yi Zhang, Kevyn Collins-Thompson
Advanced Statistical Seminar 11-745
Oct 29, 2002
Recap: Boosting (1)
• Background: Ensemble Learning
• Boosting Definitions, Example
• AdaBoost
• Boosting as an Additive Model
• Boosting Practical Issues
• Exponential Loss
• Other Loss Functions
• Boosting Trees
• Boosting as Entropy Projection
• Data Mining Methods
Outline for This Class
• Find the solution based on numerical optimization
• Control the model complexity and avoid overfitting
  – Right-sized trees for boosting
  – Number of iterations
  – Regularization
• Understand the final model (interpretation)
  – Single variables
  – Correlations between variables
Numerical Optimization
• Goal: find the f that minimizes the loss function over the training data
• Gradient descent search in the unconstrained function space to minimize the loss on the training data
• The loss on the training data converges to zero
Goal (unconstrained optimization over the N fitted values, $\mathbf{f} = \{f(x_1), f(x_2), \ldots, f(x_N)\}^T$):

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}) = \arg\min_{\mathbf{f}} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big), \qquad \mathbf{f}_m = \{f_m(x_1), f_m(x_2), \ldots, f_m(x_N)\}, \quad \mathbf{y} = \{y_1, y_2, \ldots, y_N\}$$

Steepest descent update with gradient $\mathbf{g}_m = \{g_{1m}, \ldots, g_{Nm}\}$ and step length $\rho_m$:

$$g_{im} = \left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f(x_i)=f_{m-1}(x_i)}, \qquad \rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\, \mathbf{g}_m), \qquad \mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m$$
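A minimal numeric sketch of this unconstrained view, assuming squared-error loss (so each gradient component is simply f(x_i) − y_i); the function name and the fixed step size are illustrative, and an exact line search would give ρ_m = 1 here:

```python
import numpy as np

def steepest_descent_fit(y, rho=0.5, n_steps=50):
    """Steepest descent directly on the N fitted values f = (f(x_1), ..., f(x_N)).

    Assumes squared-error loss L(y, f) = (y - f)^2 / 2, so the gradient
    component g_i is f(x_i) - y_i.  The 'model' here is just the vector of
    training-set predictions, which is why the training loss can be driven
    to zero but nothing is learned that generalizes.
    """
    f = np.zeros_like(y, dtype=float)      # f_0: start from all-zero predictions
    for _ in range(n_steps):
        g = f - y                          # gradient of the loss w.r.t. each f(x_i)
        f = f - rho * g                    # f_m = f_{m-1} - rho * g_m
    return f

y = np.array([1.0, 2.0, 3.0])
print(steepest_descent_fit(y))             # converges to y itself: [1. 2. 3.]
```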
Gradient Search on Constrained Function Space: Gradient Tree Boosting
• Introduce a tree at the m-th iteration whose predictions t_m are as close as possible to the negative gradient
• Advantage over unconstrained gradient search: robust, less likely to overfit
$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big(-g_{im} - T(x_i; \Theta)\big)^2$$
Algorithm 3: MART
1. Initialize $f_0(x)$ to the single terminal node tree $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$
2. For m = 1 to M:
   a) Compute the pseudo-residuals based on the loss function: $r_{im} = -\left[\dfrac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f = f_{m-1}}$
   b) Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, 2, \ldots, J_m$
   c) Find the optimal value of the coefficient within each region: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma\big)$
   d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$
   End For
3. Output $\hat{f}(x) = f_M(x)$
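A minimal Python sketch of Algorithm 3 for the special case of squared-error loss, where the pseudo-residuals in step (a) are simply y_i − f_{m−1}(x_i) and the optimal region coefficients in step (c) are exactly the leaf means the fitted tree already provides. The helper names (mart_fit, mart_predict) and the use of scikit-learn's DecisionTreeRegressor are assumptions for illustration, not the slides' implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, M=100, J=8):
    """Gradient tree boosting (MART) sketch for squared-error loss."""
    f0 = float(np.mean(y))               # step 1: the best single-terminal-node tree
    f = np.full(len(y), f0)
    trees = []
    for m in range(M):                    # step 2
        r = y - f                         # (a) pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)  # (b)+(c): leaf means = optimal gammas
        f += tree.predict(X)              # (d) update the model
        trees.append(tree)
    return f0, trees

def mart_predict(f0, trees, X):
    """Step 3: output f_M(x)."""
    return f0 + sum(t.predict(X) for t in trees)

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)
f0, trees = mart_fit(X, y, M=50, J=6)
print(np.mean((y - mart_predict(f0, trees, X)) ** 2))   # training MSE after boosting
```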
View Boosting as Linear Model
• Basis expansion:
  – Use basis functions T_m (m = 1, ..., M, each T_m a weak learner) to transform the input vector X into T-space, then use a linear model in this new space
• Special to boosting: the choice of basis function T_m depends on T_1, ..., T_{m-1}
Improve Boosting as Linear Model
Recap: Linear Models in Chapter 3
• Bias-variance trade-off:
  1. Subset selection (feature selection, discrete)
  2. Coefficient shrinkage (smoothing: ridge, lasso)
  3. Derived input directions (PCA, PLS)
• Multiple-outcome shrinkage and selection
  – Exploit correlations among the different outcomes
This Chapter: Improve Boosting
1. Size of the constituent trees, J
2. Number of boosting iterations, M (subset selection)
3. Regularization (shrinkage)
Right-Sized Trees for Boosting (?)
• The best for one step is not the best in the long run
  – Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation
• Simple approach: restrict all trees to be the same size J (see the sketch below)
• J limits the interaction level of the input features in the tree-based approximation
• In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well (?)
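One concrete way to impose a common size J in practice, shown here with scikit-learn's GradientBoostingRegressor as an assumed stand-in for MART (the parameter mapping is an illustration, not from the slides):

```python
from sklearn.ensemble import GradientBoostingRegressor

# max_leaf_nodes plays the role of J: each tree is grown to at most this many terminal regions.
stumps  = GradientBoostingRegressor(max_leaf_nodes=2)   # J = 2: main effects only, no interactions
low_int = GradientBoostingRegressor(max_leaf_nodes=6)   # 4 <= J <= 8: low-order interactions allowed
# No more than J - 1 variables can interact within any single tree.
```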
Number of Boosting Iterations (subset selection)
• Boosting will overfit as M → ∞
• Use a validation set to choose M (see the sketch below)
• Other methods … (later)
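A hedged sketch of the validation-set approach, using scikit-learn's GradientBoostingRegressor and its staged_predict method as an assumed stand-in (synthetic data and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data stands in for a real training set.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=500)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=500, max_leaf_nodes=6).fit(X_tr, y_tr)

# staged_predict yields predictions after each boosting iteration m = 1, ..., M.
val_err = [mean_squared_error(y_val, p) for p in gbm.staged_predict(X_val)]
best_M = int(np.argmin(val_err)) + 1        # keep only the first best_M trees
print("chosen M:", best_M)
```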
Shrinkage
• Scale the contribution of each tree by a factor 0 < ν < 1 to control the learning rate
• Both ν and M control prediction risk on the training data, and they do not operate independently (a smaller ν requires a larger M)

$$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$$
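The only change relative to the MART sketch above is the factor ν multiplying each tree's contribution; a minimal sketch, again assuming squared-error loss and illustrative names:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit_shrunk(X, y, M=200, J=6, nu=0.1):
    """MART sketch (squared-error loss) with the shrunken update
    f_m(x) = f_{m-1}(x) + nu * sum_j gamma_jm I(x in R_jm)."""
    f0 = float(np.mean(y))
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, y - f)
        f += nu * tree.predict(X)          # each tree's contribution is scaled by nu
        trees.append(tree)
    return f0, trees
# A smaller nu generally needs a larger M to reach the same training error.
```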
Penalized Regression
• Ridge regression or Lasso regression
$$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha) \right\}$$

$$J(\alpha) = \sum_{k=1}^{K} \alpha_k^2 \quad \text{(ridge regression, L2 norm)}, \qquad J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \quad \text{(lasso)}$$
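A hedged sketch of this penalized-regression view: grow a dictionary of candidate trees, form the N × K basis matrix of their predictions, and fit the coefficients with an L1 penalty. Growing the dictionary from bootstrap samples and using scikit-learn's Lasso (whose objective rescales the squared-error term by 1/(2N)) are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=300)

# Grow K candidate trees on bootstrap samples to serve as the basis functions T_k.
K = 50
trees = []
for _ in range(K):
    idx = rng.integers(0, len(y), len(y))
    trees.append(DecisionTreeRegressor(max_leaf_nodes=4).fit(X[idx], y[idx]))

# N x K basis matrix: column k holds T_k(x_i) for the training points.
B = np.column_stack([t.predict(X) for t in trees])

# Lasso approximates argmin_alpha sum_i (y_i - sum_k alpha_k T_k(x_i))^2 + lambda * sum_k |alpha_k|.
alpha_hat = Lasso(alpha=0.01).fit(B, y).coef_
print((alpha_hat != 0).sum(), "of", K, "trees kept")   # the L1 penalty zeroes many coefficients
```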
Algorithm 4: Forward stagewise linear
1. Initialize $\alpha_k = 0$, $k = 1, \ldots, K$; set $\varepsilon$ to some small constant and M large
2. For m = 1 to M:
   a) $(\beta^*, k^*) = \arg\min_{\beta,\, k} \sum_{i=1}^{N} \Big( y_i - \sum_{l=1}^{K} \alpha_l T_l(x_i) - \beta\, T_k(x_i) \Big)^2$
   b) $\alpha_{k^*} \leftarrow \alpha_{k^*} + \varepsilon \cdot \operatorname{sign}(\beta^*)$
3. Output $f(x) = \sum_{k=1}^{K} \alpha_k T_k(x)$
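A minimal sketch of Algorithm 4 for squared-error loss, assuming the basis-function values have been precomputed into an N × K matrix B (column k holds T_k(x_i)); all names are illustrative:

```python
import numpy as np

def forward_stagewise(B, y, eps=0.01, M=5000):
    """Forward stagewise linear regression on a fixed basis (Algorithm 4 sketch)."""
    K = B.shape[1]
    alpha = np.zeros(K)                       # step 1: all coefficients start at zero
    for _ in range(M):                        # step 2
        r = y - B @ alpha                     # current residuals
        # (a) for each k, the best beta is <T_k, r> / <T_k, T_k>; pick the k with the lowest RSS
        betas = (B * r[:, None]).sum(axis=0) / (B ** 2).sum(axis=0)
        rss = ((r[:, None] - B * betas) ** 2).sum(axis=0)
        k_star = int(np.argmin(rss))
        # (b) take a tiny step of size eps in the winning direction
        alpha[k_star] += eps * np.sign(betas[k_star])
    return alpha                              # step 3: f(x) = sum_k alpha_k T_k(x)
```

Because each iteration changes one coefficient by at most ε, the total L1 norm of the coefficients after M steps is at most M·ε, which is what ties this algorithm to the lasso discussion on the next slide.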
(ε, M) and Lasso Regression
If $\hat{\alpha}(\lambda)$ is monotone in λ, we have $\sum_k |\alpha_k| = M\varepsilon$, and the solution of Algorithm 4 is identical to the result of lasso regression as described on page 64.
More about Algorithm 4
• Algorithm 4 ≈ Algorithm 3 + shrinkage
• L1 norm vs. L2 norm: more details later
  – Chapter 12, after learning SVMs
Interpretation: Understanding the final model
• Single decision trees are easy to interpret
• A linear combination of trees is difficult to understand
  – Which features are important?
  – What are the interactions between features?
Relative Importance of Individual Variables
– For a single tree T, define the importance of x_l as the total improvement in squared-error risk (over a constant fit in each region), summed over all internal nodes that use x_l for the partition:

$$I_l^2(T) = \sum_{\text{internal nodes } t \,:\, x_l \text{ used for the partition at } t} \hat{\imath}_t^{\,2}$$

– For the additive tree model, define the importance of x_l as the average over the M trees:

$$I_l^2 = \frac{1}{M} \sum_{m=1}^{M} I_l^2(T_m)$$

– For K-class classification, just treat it as K two-class classification tasks
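A hedged sketch of the additive-tree importance, averaging per-tree importances over the M trees. It assumes scikit-learn's gradient boosting, whose per-tree feature_importances_ are normalized split-improvement scores; these stand in for I_l^2(T_m) only approximately, and the synthetic data and names are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=300)

gbm = GradientBoostingRegressor(n_estimators=100, max_leaf_nodes=4).fit(X, y)
per_tree = np.array([t[0].feature_importances_ for t in gbm.estimators_])  # one row per tree T_m
importance = per_tree.mean(axis=0)                                         # I_l^2 = (1/M) sum_m I_l^2(T_m)
print(importance / importance.max())   # conventionally rescaled so the largest importance is 1 (or 100)
```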
Partial Dependence Plots
• Visualize dependence of approximation f(x) on the joint values of important features
• Usually the size of the subsets is small (1-3)
• Define average or partial dependence
• Can be estimated empirically using the training data:
Average (partial) dependence:

$$f_S(X_S) = E_{X_C}\, f(X_S, X_C) = \int f(X_S, X_C)\, p_C(X_C)\, dX_C$$

Empirical estimate from the training data:

$$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC})$$
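A minimal sketch of the empirical estimate: for each candidate value of X_S, fix the S-columns of every training row at that value, keep each row's observed X_C, and average the model's predictions. The helper name partial_dependence is illustrative, and any fitted model's predict function can be passed in:

```python
import numpy as np

def partial_dependence(predict, X, s_cols, grid):
    """Empirical partial dependence: bar f_S(x_S) = (1/N) sum_i f(x_S, x_{iC})."""
    pd_vals = []
    for x_s in grid:
        X_mod = X.copy()
        X_mod[:, s_cols] = x_s                 # fix X_S, keep each row's observed X_C
        pd_vals.append(predict(X_mod).mean())  # average over the N training points
    return np.array(pd_vals)

# Usage sketch with any fitted model exposing .predict (e.g. the gbm above):
# grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
# pd0 = partial_dependence(gbm.predict, X, s_cols=[0], grid=grid)
```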
10.50 vs. 10.52
• The two are the same if the predictor variables are independent
• Why use 10.50 instead of 10.52 to measure partial dependence?
  – Example 1: f(X) = h1(X_S) + h2(X_C)
  – Example 2: f(X) = h1(X_S) · h2(X_C)

$$10.50\!:\quad f_S(X_S) = E_{X_C}\, f(X_S, X_C) = \int f(X_S, X_C)\, p_C(X_C)\, dX_C$$

$$10.52\!:\quad \tilde{f}_S(X_S) = E\big(f(X_S, X_C) \mid X_S\big) = \int f(X_S, X_C)\, p(X_C \mid X_S)\, dX_C$$

For Example 1, 10.50 recovers h1 up to an additive constant:

$$f_S(X_S) = \int \big(h_1(X_S) + h_2(X_C)\big)\, p_C(X_C)\, dX_C = h_1(X_S) + \int h_2(X_C)\, p_C(X_C)\, dX_C = h_1(X_S) + \text{Constant}$$

For Example 2, 10.50 likewise recovers h1 up to a multiplicative constant, whereas 10.52 does neither in general when X_S and X_C are dependent.
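A quick numeric check of Example 1 with correlated predictors (synthetic data and variable names are mine, not the slides'): averaging over the marginal of X_C, as in 10.50, shifts h1 by the same constant everywhere, while the conditional expectation in 10.52 folds part of h2's effect back into the curve:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x_s = rng.normal(size=N)
x_c = x_s + 0.5 * rng.normal(size=N)     # X_C strongly correlated with X_S

def h1(t): return np.sin(t)
def h2(t): return t ** 2
def f(s, c): return h1(s) + h2(c)        # additive f, as in Example 1

s0 = 2.0
# 10.50: average over the *marginal* of X_C  ->  h1(s0) + E[h2(X_C)], the same constant for every s0
pd_marginal = f(s0, x_c).mean()
# 10.52: average over X_C *conditional on* X_S near s0  ->  the shift depends on s0
near = np.abs(x_s - s0) < 0.05
pd_conditional = f(s0, x_c[near]).mean()

print(pd_marginal - h1(s0))      # ~1.25 regardless of s0
print(pd_conditional - h1(s0))   # ~4.25 at s0 = 2, i.e. it varies with s0
```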
Conclusion
• Find the solution based on numerical optimization
• Control the model complexity and avoid overfitting
  – Right-sized trees for boosting
  – Number of iterations
  – Regularization
• Understand the final model (interpretation)
  – Single variables
  – Correlations between variables