Additive Model and Boosting Tree
DESCRIPTION
This is the fourth deck of the machine learning workshop at Hulu. Machine learning methods are summarized at the beginning of the deck, and boosting tree is introduced afterwards. You are recommended to try boosting tree when the number of features is not too large.
TRANSCRIPT
Machine Learning Workshop [email protected]
Machine learning introduction · Logistic regression · Feature selection
Additive Model and Boosting Tree
See more machine learning posts: http://dongguo.me
Machine learning problem
• Goal of a machine learning problem – Based on observed samples, find a prediction function (mapping the input variable space to the response value space) that has prediction ability on unseen samples
• Minimize risk
– Expected risk: R_{exp}(f) = E_P[L(Y, f(X))] = \int L(y, f(x)) P(x, y) \, dx \, dy
– Empirical risk: R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))
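As a minimal illustration of the empirical-risk view (my sketch, not from the original slides), the snippet below computes R_emp for a quadratic loss in plain NumPy; the toy data and the prediction function f are made up for the example.

import numpy as np

def empirical_risk(loss, f, X, y):
    """R_emp(f) = (1/N) * sum_i L(y_i, f(x_i))."""
    return np.mean(loss(y, f(X)))

squared_loss = lambda y, pred: (y - pred) ** 2   # quadratic loss for regression

# toy data and a toy linear prediction function (illustrative only)
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 1.9, 3.2])
f = lambda X: 1.0 * X[:, 0]                      # f(x) = x

print(empirical_risk(squared_loss, f, X, y))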
Components of a machine learning 'algorithm'
• ML = Representation + Strategy + Optimization
– Representation: turn the function optimization problem into a parameter optimization problem by choosing a family space for the prediction function;
– Strategy: define a loss function to evaluate the error between the prediction value and the response value;
– Optimization: search for an optimal prediction function by minimizing the loss
Representation
• Determine the hypothesis space of the prediction function by choosing a 'model'
– E.g. linear model, multi-level linear model, trees, Bayesian network, additive model and so on
– Need to balance expressive ability and generalization ability
• Choose the model with the following factors considered
– About the learning problem
• Difficulty of the learning problem
• What models have been used successfully in other, similar learning problems
– About the data
• Amount of samples that can be observed; number of features; interactions between features; outliers in the data
– Specific requirements
• Interpretability, computational/storage cost
Strategy
• Distinguish good classifiers from bad ones in the hypothesis space by defining a loss function
• Typical loss functions
• For classification
– 0-1 LF, logarithmic LF, binomial deviance LF, exponential LF, hinge LF
• For regression
– Quadratic LF, absolute LF, Huber LF
R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + regularization
Logarithmic loss function
• Loss function
L(Y, P(Y|X)) = -\log P(Y|X)
– Binomial logarithmic loss function
L(y, P(y|X)) = -y \log P(y=1|X) - (1 - y) \log P(y=0|X)
• Minimizing logarithmic loss = maximizing likelihood estimation
3 typical loss functions for classification
• Binomial deviance loss function
L(y, f(x)) = \log[1 + \exp(-y f(x))]
• Exponential loss function
L(y, f(x)) = \exp(-y f(x))
• Hinge loss function
L(y, f(x)) = [1 - y f(x)]_{+}
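A small sketch (mine, not from the slides) that evaluates the three classification losses as functions of the margin y·f(x), assuming labels y ∈ {-1, +1}; it can be used to reproduce the comparison plot referenced below.

import numpy as np

def binomial_deviance(margin):
    # L(y, f(x)) = log(1 + exp(-y * f(x)))
    return np.log1p(np.exp(-margin))

def exponential(margin):
    # L(y, f(x)) = exp(-y * f(x))
    return np.exp(-margin)

def hinge(margin):
    # L(y, f(x)) = max(0, 1 - y * f(x))
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2, 2, 5)   # margin = y * f(x)
for name, loss in [("deviance", binomial_deviance),
                   ("exponential", exponential),
                   ("hinge", hinge)]:
    print(name, np.round(loss(margins), 3))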
Loss functions for classification
[Figure: comparison of classification loss functions, from "Elements of Statistical Learning"]
Loss functions for regression
[Figure: comparison of regression loss functions, from "Elements of Statistical Learning"]
Optimization
• Nothing to share this time
Components of typical algorithms
"Model" | Representation | Strategy | Optimization
Polynomial regression | polynomial function | squared loss usually | has closed-form solution
Linear regression | linear model of variables | squared loss usually | has closed-form solution
LR | linear function + logit link | logarithmic loss | gradient descent, Newton's method
ANN | multi-level linear function + logit link | squared loss usually | gradient descent
SVM | linear function | hinge loss | quadratic programming (SMO)
HMM | Bayes network | logarithmic loss | EM
Adaboost | additive model | exponential loss | stagewise + optimize base learner
Boosting Tree
• Additive model and forward stagewise algorithm
• Boosting tree
• Adaboost
• Gradient boosting tree
Additive model
• Linear combination of base predictors
f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)
• Determine f(x) by solving
\min_{\beta_m, \gamma_m} \sum_{i=1}^{N} L(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m))
– which is difficult to solve directly for a general loss function and base learner
Forward Stagewise Additive Modeling
• Idea: solve approximately by learning the base functions one at a time
(1). f_0(x) = 0
(2). For m = 1, 2, ..., M:
  (a). (\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma));
  (b). f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)
(3). f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)
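A minimal sketch of this algorithm (my illustration, not the slides' code), assuming squared loss so that step (a) splits into fitting the base function to the current residual and a closed-form line search for β; any scikit-learn regressor can serve as the base family b(x; γ).

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, base_learner, M=20):
    """f(x) = sum_m beta_m * b(x; gamma_m), learned one stage at a time (squared loss)."""
    f = np.zeros(len(y))
    stages = []
    for m in range(M):
        residual = y - f                           # what the next stage must explain
        b = clone(base_learner).fit(X, residual)   # step (a): fit gamma_m
        h = b.predict(X)
        beta = h.dot(residual) / max(h.dot(h), 1e-12)  # step (a): line search for beta_m
        f += beta * h                              # step (b): f_m = f_{m-1} + beta_m * b
        stages.append((beta, b))
    return stages

# usage: regression stumps as the base family
# stages = forward_stagewise(X, y, DecisionTreeRegressor(max_depth=1))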
Boosting tree
• Boosting tree = forward stagewise additive modeling with a decision tree as the base learner
f_m(x) = f_{m-1}(x) + T(x; \Theta_m)
\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m))
• Different implementations of boosting tree come from different loss functions
• Can be used for both regression and classification
Boosting tree for regression
• When the quadratic loss function is chosen
Input: training set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, x_i \in R^n, y_i \in R
Output: boosting tree for regression f_M(x)
1. Init f_0(x) = 0
2. For m = 1 to M:
  (a). Compute residuals r_{mi} = y_i - f_{m-1}(x_i), i = 1, 2, ..., N
  (b). Learn a regression tree T(x; \Theta_m) by fitting r_{mi}
  (c). Update f_m(x) = f_{m-1}(x) + T(x; \Theta_m)
3. Get the final regression boosting tree
f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)
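A sketch of the residual-fitting loop above (my illustration; scikit-learn's DecisionTreeRegressor stands in for the regression tree, and the tree height and M values are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_tree_regression(X, y, M=50, height=4):
    """f_M(x) = sum_m T(x; Theta_m): each tree fits the current residuals."""
    trees, f = [], np.zeros(len(y))          # 1. f_0(x) = 0
    for m in range(M):
        r = y - f                            # (a) residuals r_mi = y_i - f_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=height).fit(X, r)  # (b) fit the tree
        f += tree.predict(X)                 # (c) f_m = f_{m-1} + T(x; Theta_m)
        trees.append(tree)
    return trees

def predict(trees, X):
    return sum(t.predict(X) for t in trees)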
Boosting tree for classification
• When the exponential loss function is chosen – Adaboost + classification tree
L(y, f(x)) = \exp(-y f(x))
• When the binomial deviance loss function is chosen – LogitBoost + classification tree
L(y, f(x)) = \log[1 + \exp(-y f(x))]
Adaboost review
Input: training set {(x_i, y_i)}_{i=1}^{N}, y_i \in {-1, +1}; iterations number M
1. Init the weights of the training samples:
W_1 = (w_{11}, ..., w_{1i}, ..., w_{1N}), w_{1i} = 1/N, i = 1, 2, ..., N
2. For m = 1 to M:
1). Fit a base learner G_m(x): X -> {-1, +1} using the dataset with weights W_m
2). Calculate the classification error on the training dataset: e_m = \sum_{i=1}^{N} w_{mi} I(G_m(x_i) \neq y_i)
3). Calculate the coefficient of G_m(x) using the classification error: a_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}
4). Update the weight of each training sample: W_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N}), w_{m+1,i} \leftarrow w_{mi} \exp(-a_m y_i G_m(x_i))
3. Get the final classifier
G(x) = sign(f(x)) = sign(\sum_{m=1}^{M} a_m G_m(x))
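A compact sketch of this loop (my illustration; decision stumps via scikit-learn stand in for the classification tree, and the sample weights are renormalized each round, which the slide leaves implicit):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """y must be in {-1, +1}. Returns the list of (a_m, G_m)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # 1. w_1i = 1/N
    learners = []
    for m in range(M):
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # 1)
        pred = G.predict(X)
        e = w[pred != y].sum()                     # 2) weighted training error
        a = 0.5 * np.log((1 - e) / max(e, 1e-12))  # 3) coefficient a_m
        w *= np.exp(-a * y * pred)                 # 4) reweight each sample
        w /= w.sum()                               # renormalize (implicit on the slide)
        learners.append((a, G))
    return learners

def classify(learners, X):
    return np.sign(sum(a * G.predict(X) for a, G in learners))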
Adaboost: forward stagewise additive modeling with exponential loss
• Exponential loss function
L(y, f(x)) = \exp[-y f(x)]
• Forward stagewise additive modeling
f_m(x) = f_{m-1}(x) + a_m G_m(x)
(a_m, G_m(x)) = \arg\min_{a, G} \sum_{i=1}^{N} \exp[-y_i (f_{m-1}(x_i) + a G(x_i))]
             = \arg\min_{a, G} \sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i a G(x_i)], where \bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]
• Infer a_m and G_m(x)
Adaboost: forward stagewise additive modeling with exponential loss (2)
• Continued...
Inference of G_m(x): for any a > 0, we have
G_m^*(x) = \arg\min_{G} \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G(x_i))
Inference of a_m:
\sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i a G(x_i)]
  = e^{-a} \sum_{y_i = G(x_i)} \bar{w}_{mi} + e^{a} \sum_{y_i \neq G(x_i)} \bar{w}_{mi}
  = (e^{a} - e^{-a}) \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G(x_i)) + e^{-a} \sum_{i=1}^{N} \bar{w}_{mi}
\Rightarrow a_m^* = \frac{1}{2} \log \frac{1 - e_m}{e_m}, where e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} \bar{w}_{mi}}
Adaboost: forward stagewise additive modeling with exponential loss (3)
• Weight update for each sample
f_m(x) = f_{m-1}(x) + a_m G_m(x)
\bar{w}_{m+1,i} = \exp[-y_i f_m(x_i)]
\Rightarrow \bar{w}_{m+1,i} = \bar{w}_{mi} \exp(-y_i a_m G_m(x_i))
CART review
• Select the split variable according to the Gini index
• Can be used for both regression and classification
• First grow the tree as large as possible, then prune via validation
• Parameters
– Height; stop-split conditions
Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2), where D_1 = \{(x, y) \in D \mid A(x) = a\}, D_2 = D - D_1
Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2
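A small sketch (mine, not from the slides) of the Gini computation used to score a binary split:

import numpy as np

def gini(labels):
    """Gini(p) = 1 - sum_k p_k^2 over the class distribution of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels, mask):
    """Gini(D, A): weighted Gini after splitting D into D1 (mask) and D2 (~mask)."""
    n, n1 = len(labels), mask.sum()
    return (n1 / n) * gini(labels[mask]) + ((n - n1) / n) * gini(labels[~mask])

y = np.array([0, 0, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False])  # hypothetical split A(x) = a
print(gini(y), gini_split(y, split))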
Experiment
• Goal: evaluate the performance of boosting tree
• Algorithms
– Logistic regression
– CART
– Boosting tree (Adaboost + CART)
• Hulu internal datasets
– Ad intelligence
Experiment (2)
• Task: predict whether the recall is high or low (binary classification)
• Dataset: Ad intelligence
– 718 samples; 93 features
– 5-fold cross-validation
• AUC with logistic regression: 0.89
• Parameters for boosting tree
– Tree height, base learner number, and stop-split conditions
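A sketch of this evaluation protocol (my illustration; the Ad-intelligence data is Hulu-internal, so `X, y` below are random placeholders, scikit-learn's AdaBoostClassifier stands in for the Adaboost + CART combination, and the `estimator` argument assumes scikit-learn >= 1.2):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# placeholders for the internal 718-sample, 93-feature dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(718, 93))
y = rng.integers(0, 2, size=718)

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=4),  # tree height
    n_estimators=50,                                # base learner number
)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")  # 5-fold CV
print(auc.mean())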
Experiment (3)
• Test result with boosting tree: 0.96 AUC – 0.79 for a single CART (height 6)
[Chart: AUC on the test dataset (5-fold cross-validation) vs. base learner number (1-50), one curve per tree height H = 2 to 6]
Gradient boosting
• Allows optimization of an arbitrary differentiable loss function
• Uses the gradient descent idea to approximate the residual with the pseudo-residual
r = -\left[ \frac{\partial L(y, f(x))}{\partial f(x)} \right]_{f(x) = f_{m-1}(x)}
– When the quadratic loss function L(y, f(x)) = \frac{1}{2} (y - f(x))^2 is chosen, the pseudo-residual is the ordinary residual y - f(x)
Gradient boosting: pseudo-code
Input: training set {(x_i, y_i)}_{i=1}^{n}; a differentiable loss function L(y, F(x)); iterations number M
1. Initialize the model with a constant value:
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
2. For m = 1 to M:
1). Compute pseudo-residuals:
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)} for i = 1, ..., n
2). Fit a base learner h_m(x) to the pseudo-residuals (train using the dataset {(x_i, r_{im})}_{i=1}^{n})
3). Compute the multiplier \gamma_m by solving the following optimization problem:
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
4). Update the model: F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
3. Output F_M(x)
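A sketch of this pseudo-code (mine), assuming squared loss so that the pseudo-residual is y - F(x) and the line search for γ has a closed form; another differentiable loss could be swapped in by replacing the residual and γ steps:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100):
    """Gradient boosting with squared loss L = 0.5 * (y - F)^2."""
    F0 = y.mean()                              # 1. constant minimizer of squared loss
    F = np.full(len(y), F0)
    stages = []
    for m in range(M):
        r = y - F                              # 2.1) pseudo-residuals = -dL/dF
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)   # 2.2) fit base learner
        pred = h.predict(X)
        gamma = pred.dot(r) / max(pred.dot(pred), 1e-12)   # 2.3) line search
        F += gamma * pred                      # 2.4) update the model
        stages.append((gamma, h))
    return F0, stages

def predict(F0, stages, X):
    return F0 + sum(g * h.predict(X) for g, h in stages)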
Gradient tree boosting
• Use a decision tree as the base learner
h_m(x) = \sum_{j=1}^{J} b_{jm} I(x \in R_{jm})
• Stagewise learning, choosing \gamma with line search
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
• Friedman proposes to choose a separate optimal value \gamma_{jm} for each of the tree's regions
F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}), \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
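To make the per-region update concrete, a sketch (mine) assuming absolute loss L = |y - F|, for which the optimal per-leaf γ_jm is the median of the residuals falling in that leaf; scikit-learn's `tree.apply` gives each sample's leaf (region) index:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_stage_per_leaf(X, y, F):
    """One gradient-tree-boosting stage for absolute loss L = |y - F|."""
    r = np.sign(y - F)                         # pseudo-residuals for absolute loss
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, r)
    leaf = tree.apply(X)                       # region R_jm each sample falls in
    gamma = {j: np.median((y - F)[leaf == j])  # gamma_jm = median of leaf residuals
             for j in np.unique(leaf)}
    return tree, gamma

def stage_predict(tree, gamma, X):
    return np.array([gamma[j] for j in tree.apply(X)])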
Parameters choice and tricks
• Parameters choice
– Terminal nodes J: [4, 8] is recommended
– Iterations M: selected by evaluation on test/validation data
• Tricks for improvement (see the usage sketch below)
– Shrinkage: F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), 0 < \nu \leq 1
– Stochastic gradient boosting (fit each tree on a random subsample)
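For reference, both tricks map directly onto scikit-learn's GradientBoostingClassifier (a usage sketch with illustrative parameter values, not from the slides): `learning_rate` is the shrinkage ν, and `subsample` < 1 enables stochastic gradient boosting.

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,      # iterations M
    max_leaf_nodes=8,      # terminal nodes J in the recommended [4, 8] range
    learning_rate=0.1,     # shrinkage nu, 0 < nu <= 1
    subsample=0.8,         # < 1.0 turns on stochastic gradient boosting
)
# model.fit(X_train, y_train); model.predict_proba(X_test)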
Boosting Tree Summary
• Forward stagewise additive model with trees
• Pros
– Performance is usually good
– Adapts to both regression and classification
– No need to transform/normalize the data
– Few parameters, easy to tune
• Tips
– Try more loss functions besides exponential loss, especially when noise exists in the data
– Bumping is usually good
Resources
• Implementation/Tools
– MART (Multiple Additive Regression Trees)
– Will share my implementation later
• More on boosting tree
– "Elements of Statistical Learning"
– 《统计学习方法》 (Statistical Learning Methods)
– Parallelization: "Scaling Up Machine Learning"