Additive Model and Boosting Tree
DESCRIPTION
This is the fourth deck of the machine learning workshop at Hulu. Machine learning methods are summarized at the beginning of the deck, and boosting tree is introduced afterwards. You are recommended to try boosting tree when the number of features is not too large.
TRANSCRIPT
Machine Learning Workshop [email protected]
Machine learning introduction · Logistic regression · Feature selection
Additive Model and Boosting Tree
See more machine learning posts: http://dongguo.me
Machine learning problem
• Goal of a machine learning problem – Based on observed samples, find a prediction function (mapping the input variable space to the response value space) that has prediction ability on unseen samples
• Minimize risk
– Expected risk: R_{exp}(f) = E_P[L(Y, f(X))] = \int L(y, f(x)) P(x, y) \, dx \, dy
– Empirical risk: R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))
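As a minimal illustration of the empirical-risk view (my sketch, not from the original slides), the snippet below computes R_emp for a quadratic loss in plain NumPy; the toy data and the prediction function f are made up for the example.

import numpy as np

def empirical_risk(loss, f, X, y):
    """R_emp(f) = (1/N) * sum_i L(y_i, f(x_i))."""
    return np.mean(loss(y, f(X)))

squared_loss = lambda y, pred: (y - pred) ** 2   # quadratic loss for regression

# toy data and a toy linear prediction function (illustrative only)
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 1.9, 3.2])
f = lambda X: 1.0 * X[:, 0]                      # f(x) = x

print(empirical_risk(squared_loss, f, X, y))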
Components of a machine learning 'algorithm'
• ML = Representation + Strategy + Optimization
– Representation: turn the function optimization problem into a parameter optimization problem by choosing a family space for the prediction function;
– Strategy: define a loss function to evaluate the error between the prediction value and the response value;
– Optimization: search for an optimal prediction function by minimizing the loss
Representation
• Determine the hypothesis space of the prediction function by choosing a 'model'
– E.g. linear model, multi-level linear model, trees, Bayesian network, additive model and so on
– Need to balance expressive ability and generalization ability
• Choose the model with the following factors considered
– About the learning problem
• Difficulty of the learning problem
• What models have been used successfully in other, similar learning problems
– About the data
• Amount of samples that can be observed; number of features; interactions between features; outliers in the data
– Specific requirements
• Interpretability, computational/storage cost
Strategy
• Distinguish good classifiers from bad ones in the hypothesis space by defining a loss function
• Typical loss functions
• For classification
– 0-1 LF, logarithmic LF, binomial deviance LF, exponential LF, hinge LF
• For regression
– Quadratic LF, absolute LF, Huber LF
R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + regularization
Logarithmic loss function
• Loss function
L(Y, P(Y|X)) = -\log P(Y|X)
– Binomial logarithmic loss function
L(y, P(y|X)) = -y \log P(y=1|X) - (1 - y) \log P(y=0|X)
• Minimizing logarithmic loss = maximizing likelihood estimation
3 typical loss functions for classification
• Binomial deviance loss function
L(y, f(x)) = \log[1 + \exp(-y f(x))]
• Exponential loss function
L(y, f(x)) = \exp(-y f(x))
• Hinge loss function
L(y, f(x)) = [1 - y f(x)]_{+}
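A small sketch (mine, not from the slides) that evaluates the three classification losses as functions of the margin y·f(x), assuming labels y ∈ {-1, +1}; it can be used to reproduce the comparison plot referenced below.

import numpy as np

def binomial_deviance(margin):
    # L(y, f(x)) = log(1 + exp(-y * f(x)))
    return np.log1p(np.exp(-margin))

def exponential(margin):
    # L(y, f(x)) = exp(-y * f(x))
    return np.exp(-margin)

def hinge(margin):
    # L(y, f(x)) = max(0, 1 - y * f(x))
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2, 2, 5)   # margin = y * f(x)
for name, loss in [("deviance", binomial_deviance),
                   ("exponential", exponential),
                   ("hinge", hinge)]:
    print(name, np.round(loss(margins), 3))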
Loss functions for classification
[Figure: comparison of classification loss functions, from "Elements of Statistical Learning"]
Loss functions for regression
[Figure: comparison of regression loss functions, from "Elements of Statistical Learning"]
Optimization
• Nothing to share this time
Components of typical algorithms
"Model" | Representation | Strategy | Optimization
Polynomial regression | polynomial function | squared loss usually | has closed-form solution
Linear regression | linear model of variables | squared loss usually | has closed-form solution
LR | linear function + logit link | logarithmic loss | gradient descent, Newton's method
ANN | multi-level linear function + logit link | squared loss usually | gradient descent
SVM | linear function | hinge loss | quadratic programming (SMO)
HMM | Bayes network | logarithmic loss | EM
Adaboost | additive model | exponential loss | stagewise + optimize base learner
Boosting Tree
• Additive model and forward stagewise algorithm
• Boosting tree
• Adaboost
• Gradient boosting tree
Additive model
• Linear combination of base predictors
f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)
• Determine f(x) by solving
\min_{\beta_m, \gamma_m} \sum_{i=1}^{N} L(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m))
– which is difficult to solve directly for a general loss function and base learner
Forward Stagewise Additive Modeling
• Idea: solve approximately by learning the base functions one at a time
(1). f_0(x) = 0
(2). For m = 1, 2, ..., M:
  (a). (\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma));
  (b). f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)
(3). f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)
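A minimal sketch of this algorithm (my illustration, not the slides' code), assuming squared loss so that step (a) splits into fitting the base function to the current residual and a closed-form line search for β; any scikit-learn regressor can serve as the base family b(x; γ).

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, base_learner, M=20):
    """f(x) = sum_m beta_m * b(x; gamma_m), learned one stage at a time (squared loss)."""
    f = np.zeros(len(y))
    stages = []
    for m in range(M):
        residual = y - f                           # what the next stage must explain
        b = clone(base_learner).fit(X, residual)   # step (a): fit gamma_m
        h = b.predict(X)
        beta = h.dot(residual) / max(h.dot(h), 1e-12)  # step (a): line search for beta_m
        f += beta * h                              # step (b): f_m = f_{m-1} + beta_m * b
        stages.append((beta, b))
    return stages

# usage: regression stumps as the base family
# stages = forward_stagewise(X, y, DecisionTreeRegressor(max_depth=1))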
Boosting tree
• Boosting tree = forward stagewise additive modeling with a decision tree as the base learner
f_m(x) = f_{m-1}(x) + T(x; \Theta_m)
\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m))
• Different implementations of boosting tree come from different loss functions
• Can be used for both regression and classification
Boosting tree for regression
• When the quadratic loss function is chosen
Input: training set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, x_i \in R^n, y_i \in R
Output: boosting tree for regression f_M(x)
1. Init f_0(x) = 0
2. For m = 1 to M:
  (a). Compute residuals r_{mi} = y_i - f_{m-1}(x_i), i = 1, 2, ..., N
  (b). Learn a regression tree T(x; \Theta_m) by fitting r_{mi}
  (c). Update f_m(x) = f_{m-1}(x) + T(x; \Theta_m)
3. Get the final regression boosting tree
f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)
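A sketch of the residual-fitting loop above (my illustration; scikit-learn's DecisionTreeRegressor stands in for the regression tree, and the tree height and M values are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_tree_regression(X, y, M=50, height=4):
    """f_M(x) = sum_m T(x; Theta_m): each tree fits the current residuals."""
    trees, f = [], np.zeros(len(y))          # 1. f_0(x) = 0
    for m in range(M):
        r = y - f                            # (a) residuals r_mi = y_i - f_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=height).fit(X, r)  # (b) fit the tree
        f += tree.predict(X)                 # (c) f_m = f_{m-1} + T(x; Theta_m)
        trees.append(tree)
    return trees

def predict(trees, X):
    return sum(t.predict(X) for t in trees)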
Boosting tree for classification
• When the exponential loss function is chosen – Adaboost + classification tree
L(y, f(x)) = \exp(-y f(x))
• When the binomial deviance loss function is chosen – LogitBoost + classification tree
L(y, f(x)) = \log[1 + \exp(-y f(x))]
Adaboost review
Input: training set {(x_i, y_i)}_{i=1}^{N}, y_i \in {-1, +1}; iterations number M
1. Init the weights of the training samples:
W_1 = (w_{11}, ..., w_{1i}, ..., w_{1N}), w_{1i} = 1/N, i = 1, 2, ..., N
2. For m = 1 to M:
1). Fit a base learner G_m(x): X -> {-1, +1} using the dataset with weights W_m
2). Calculate the classification error on the training dataset: e_m = \sum_{i=1}^{N} w_{mi} I(G_m(x_i) \neq y_i)
3). Calculate the coefficient of G_m(x) using the classification error: a_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}
4). Update the weight of each training sample: W_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N}), w_{m+1,i} \leftarrow w_{mi} \exp(-a_m y_i G_m(x_i))
3. Get the final classifier
G(x) = sign(f(x)) = sign(\sum_{m=1}^{M} a_m G_m(x))
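A compact sketch of this loop (my illustration; decision stumps via scikit-learn stand in for the classification tree, and the sample weights are renormalized each round, which the slide leaves implicit):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """y must be in {-1, +1}. Returns the list of (a_m, G_m)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # 1. w_1i = 1/N
    learners = []
    for m in range(M):
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # 1)
        pred = G.predict(X)
        e = w[pred != y].sum()                     # 2) weighted training error
        a = 0.5 * np.log((1 - e) / max(e, 1e-12))  # 3) coefficient a_m
        w *= np.exp(-a * y * pred)                 # 4) reweight each sample
        w /= w.sum()                               # renormalize (implicit on the slide)
        learners.append((a, G))
    return learners

def classify(learners, X):
    return np.sign(sum(a * G.predict(X) for a, G in learners))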
Adaboost: forward stagewise additive modeling with exponential loss
• Exponential loss function
L(y, f(x)) = \exp[-y f(x)]
• Forward stagewise additive modeling
f_m(x) = f_{m-1}(x) + a_m G_m(x)
(a_m, G_m(x)) = \arg\min_{a, G} \sum_{i=1}^{N} \exp[-y_i (f_{m-1}(x_i) + a G(x_i))]
             = \arg\min_{a, G} \sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i a G(x_i)], where \bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]
• Infer a_m and G_m(x)
Adaboost: forward stagewise additive modeling with exponential loss (2)
• Continued...
Inference of G_m(x): for any a > 0, we have
G_m^*(x) = \arg\min_{G} \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G(x_i))
Inference of a_m:
\sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i a G(x_i)]
  = e^{-a} \sum_{y_i = G(x_i)} \bar{w}_{mi} + e^{a} \sum_{y_i \neq G(x_i)} \bar{w}_{mi}
  = (e^{a} - e^{-a}) \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G(x_i)) + e^{-a} \sum_{i=1}^{N} \bar{w}_{mi}
\Rightarrow a_m^* = \frac{1}{2} \log \frac{1 - e_m}{e_m}, where e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} \bar{w}_{mi}}
Adaboost: forward stagewise additive modeling with exponential loss (3)
• Weight update for each sample
f_m(x) = f_{m-1}(x) + a_m G_m(x)
\bar{w}_{m+1,i} = \exp[-y_i f_m(x_i)]
\Rightarrow \bar{w}_{m+1,i} = \bar{w}_{mi} \exp(-y_i a_m G_m(x_i))
CART review
• Select the split variable according to the Gini index
• Can be used for both regression and classification
• First grow the tree as large as possible, then prune via validation
• Parameters
– Height; stop-split conditions
Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2), where D_1 = \{(x, y) \in D \mid A(x) = a\}, D_2 = D - D_1
Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2
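A small sketch (mine, not from the slides) of the Gini computation used to score a binary split:

import numpy as np

def gini(labels):
    """Gini(p) = 1 - sum_k p_k^2 over the class distribution of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels, mask):
    """Gini(D, A): weighted Gini after splitting D into D1 (mask) and D2 (~mask)."""
    n, n1 = len(labels), mask.sum()
    return (n1 / n) * gini(labels[mask]) + ((n - n1) / n) * gini(labels[~mask])

y = np.array([0, 0, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False])  # hypothetical split A(x) = a
print(gini(y), gini_split(y, split))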
Experiment
• Goal: evaluate the performance of boosting tree
• Algorithms
– Logistic regression
– CART
– Boosting tree (Adaboost + CART)
• Hulu internal datasets
– Ad intelligence
Experiment (2)
• Task: predict whether the recall is high or low (binary classification)
• Dataset: Ad intelligence
– 718 samples; 93 features
– 5-fold cross-validation
• AUC with logistic regression: 0.89
• Parameters for boosting tree
– Tree height, base learner number, and stop-split conditions
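A sketch of this evaluation protocol (my illustration; the Ad-intelligence data is Hulu-internal, so `X, y` below are random placeholders, scikit-learn's AdaBoostClassifier stands in for the Adaboost + CART combination, and the `estimator` argument assumes scikit-learn >= 1.2):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# placeholders for the internal 718-sample, 93-feature dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(718, 93))
y = rng.integers(0, 2, size=718)

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=4),  # tree height
    n_estimators=50,                                # base learner number
)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")  # 5-fold CV
print(auc.mean())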
Experiment (3)
• Test result with boosting tree: 0.96 AUC – 0.79 for a single CART (height 6)
[Chart: AUC on the test dataset (5-fold cross-validation) vs. base learner number (1-50), one curve per tree height H = 2 to 6]
Gradient boosting
• Allows optimization of an arbitrary differentiable loss function
• Uses the gradient descent idea to approximate the residual with the pseudo-residual
r = -\left[ \frac{\partial L(y, f(x))}{\partial f(x)} \right]_{f(x) = f_{m-1}(x)}
– When the quadratic loss function L(y, f(x)) = \frac{1}{2} (y - f(x))^2 is chosen, the pseudo-residual is the ordinary residual y - f(x)
Gradient boosting: pseudo-code
Input: training set {(x_i, y_i)}_{i=1}^{n}; a differentiable loss function L(y, F(x)); iterations number M
1. Initialize the model with a constant value:
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
2. For m = 1 to M:
1). Compute pseudo-residuals:
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)} for i = 1, ..., n
2). Fit a base learner h_m(x) to the pseudo-residuals (train using the dataset {(x_i, r_{im})}_{i=1}^{n})
3). Compute the multiplier \gamma_m by solving the following optimization problem:
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
4). Update the model: F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
3. Output F_M(x)
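A sketch of this pseudo-code (mine), assuming squared loss so that the pseudo-residual is y - F(x) and the line search for γ has a closed form; another differentiable loss could be swapped in by replacing the residual and γ steps:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100):
    """Gradient boosting with squared loss L = 0.5 * (y - F)^2."""
    F0 = y.mean()                              # 1. constant minimizer of squared loss
    F = np.full(len(y), F0)
    stages = []
    for m in range(M):
        r = y - F                              # 2.1) pseudo-residuals = -dL/dF
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)   # 2.2) fit base learner
        pred = h.predict(X)
        gamma = pred.dot(r) / max(pred.dot(pred), 1e-12)   # 2.3) line search
        F += gamma * pred                      # 2.4) update the model
        stages.append((gamma, h))
    return F0, stages

def predict(F0, stages, X):
    return F0 + sum(g * h.predict(X) for g, h in stages)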
Gradient tree boosting
• Use a decision tree as the base learner
h_m(x) = \sum_{j=1}^{J} b_{jm} I(x \in R_{jm})
• Stagewise learning, choosing \gamma with line search
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
• Friedman proposes to choose a separate optimal value \gamma_{jm} for each of the tree's regions
F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}), \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
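To make the per-region update concrete, a sketch (mine) assuming absolute loss L = |y - F|, for which the optimal per-leaf γ_jm is the median of the residuals falling in that leaf; scikit-learn's `tree.apply` gives each sample's leaf (region) index:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_stage_per_leaf(X, y, F):
    """One gradient-tree-boosting stage for absolute loss L = |y - F|."""
    r = np.sign(y - F)                         # pseudo-residuals for absolute loss
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, r)
    leaf = tree.apply(X)                       # region R_jm each sample falls in
    gamma = {j: np.median((y - F)[leaf == j])  # gamma_jm = median of leaf residuals
             for j in np.unique(leaf)}
    return tree, gamma

def stage_predict(tree, gamma, X):
    return np.array([gamma[j] for j in tree.apply(X)])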
Parameters choice and tricks
• Parameters choice
– Terminal nodes J: [4, 8] is recommended
– Iterations M: selected by evaluation on test/validation data
• Tricks for improvement (see the usage sketch below)
– Shrinkage: F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), 0 < \nu \leq 1
– Stochastic gradient boosting (fit each tree on a random subsample)
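For reference, both tricks map directly onto scikit-learn's GradientBoostingClassifier (a usage sketch with illustrative parameter values, not from the slides): `learning_rate` is the shrinkage ν, and `subsample` < 1 enables stochastic gradient boosting.

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,      # iterations M
    max_leaf_nodes=8,      # terminal nodes J in the recommended [4, 8] range
    learning_rate=0.1,     # shrinkage nu, 0 < nu <= 1
    subsample=0.8,         # < 1.0 turns on stochastic gradient boosting
)
# model.fit(X_train, y_train); model.predict_proba(X_test)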
Boosting Tree Summary
• Forward stagewise additive model with trees
• Pros
– Performance is usually good
– Adapts to both regression and classification
– No need to transform/normalize the data
– Few parameters, easy to tune
• Tips
– Try more loss functions besides exponential loss, especially when noise exists in the data
– Bumping is usually good
Resources
• Implementation/Tools
– MART (Multiple Additive Regression Trees)
– Will share my implementation later
• More on boosting tree
– "Elements of Statistical Learning"
– 《统计学习方法》 (Statistical Learning Methods)
– Parallelization: "Scaling Up Machine Learning"