Additive Model and Boosting Tree


DESCRIPTION

This is the fourth slide deck for the machine learning workshop at Hulu. Machine learning methods are summarized at the beginning of the deck, and boosting tree is introduced afterwards. You are recommended to try boosting tree when the feature number is not too large (

TRANSCRIPT

Page 1: Additive model and boosting tree

Machine Learning Workshop [email protected]

Machine learning introduction / Logistic regression / Feature selection

Additive Model and Boosting Tree

See more machine learning posts: http://dongguo.me

Page 2: Additive model and boosting tree

Machine learning problem

• Goal of a machine learning problem – Based on observed samples, find a prediction function (mapping the input variable space to the response value space) that has prediction ability on unseen samples

• Minimize risk

– Expected risk: $R_{\exp}(f) = E_P[L(Y, f(X))] = \int L(y, f(x))\,P(x, y)\,dx\,dy$

– Empirical risk: $R_{emp}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$

Page 3: Additive model and boosting tree

Components of a machine learning 'algorithm'

• ML = Representation + Strategy + Optimization – Representation: change the function optimization problem into a parameter optimization problem by choosing a family space for the prediction function;

– Strategy: define a loss function to evaluate the error between the prediction value and the response value;

– Optimization: search for an optimal prediction function by minimizing the loss

Page 4: Additive model and boosting tree

Representation

• Determine the hypothesis space of the prediction function by choosing a 'model' – E.g. linear model, multi-level linear model, trees, Bayesian network, additive model and so on

– Need to balance expressive power and generalization ability • Choose the model with the following factors considered

– About the learning problem • Difficulty of the learning problem • What models have been used successfully in similar learning problems

– About the data • Amount of samples that can be observed; number of features; interaction between features; outliers in the data

– Specific requirements • Interpretability, computational/storage cost

Page 5: Additive model and boosting tree

Strategy

• Distinguish good classifiers from bad ones in the hypothesis space by defining a loss function

• Typical loss functions • For classification

– 0-1 LF, logarithmic LF, binomial deviance LF, exponential LF, hinge LF

• For regression – quadratic LF, absolute LF, Huber LF

$R_{emp}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i)) + \text{regularization}$

Page 6: Additive model and boosting tree

Logarithmic loss function

• Loss function

$L(Y, P(Y \mid X)) = -\log P(Y \mid X)$

– Binomial logarithmic loss function

$L(y, P(y \mid X)) = -y \log P(y=1 \mid X) - (1-y) \log P(y=0 \mid X)$

• Minimizing logarithmic loss = maximizing likelihood estimation
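The equivalence in the last bullet can be checked in one line (a sketch, using the binomial logarithmic loss defined above): summing the loss over $N$ independent samples gives

$\sum_{i=1}^{N} L\bigl(y_i, P(y_i \mid x_i)\bigr) = -\sum_{i=1}^{N} \log P(y_i \mid x_i) = -\log \prod_{i=1}^{N} P(y_i \mid x_i)$,

which is exactly the negative log-likelihood of the data, so minimizing the logarithmic loss maximizes the likelihood.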

Page 7: Additive model and boosting tree

3 typical loss functions for classification

• Binomial deviance loss function

$L(y, f(x)) = \log[1 + \exp(-y f(x))]$

• Exponential loss function

$L(y, f(x)) = \exp(-y f(x))$

• Hinge loss function

$L(y, f(x)) = [1 - y f(x)]_{+}$
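A minimal numpy sketch (not from the slides) that evaluates the three losses above as functions of the margin $y f(x)$, which is how they are compared in the figure on the next slide:

```python
import numpy as np

def binomial_deviance_loss(y, fx):
    """log[1 + exp(-y f(x))]; labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * fx))

def exponential_loss(y, fx):
    """exp(-y f(x)); the AdaBoost loss."""
    return np.exp(-y * fx)

def hinge_loss(y, fx):
    """[1 - y f(x)]_+; the SVM loss."""
    return np.maximum(0.0, 1.0 - y * fx)

# Evaluate each loss along a grid of margins y*f(x)
margin = np.linspace(-2.0, 2.0, 9)
ones = np.ones_like(margin)
for name, loss in [("deviance", binomial_deviance_loss),
                   ("exponential", exponential_loss),
                   ("hinge", hinge_loss)]:
    print(name, loss(ones, margin))
```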

Page 8: Additive model and boosting tree

Loss functions for classification

[Figure from "Elements of Statistical Learning"]

Page 9: Additive model and boosting tree

Loss functions for regression

[Figure from "Elements of Statistical Learning"]

Page 10: Additive model and boosting tree

Optimization

• Nothing to share this time

Page 11: Additive model and boosting tree

Components of typical algorithms

"model" | Representation | Strategy | Optimization
Polynomial regression | Polynomial function | Squared loss usually | Closed-form solution
Linear regression | Linear model of variables | Squared loss usually | Closed-form solution
LR | Linear function + logit link | Logarithmic loss | Gradient descent, Newton method
ANN | Multi-level linear function + logit link | Squared loss usually | Gradient descent
SVM | Linear function | Hinge loss | Quadratic programming (SMO)
HMM | Bayes network | Logarithmic loss | EM
Adaboost | Additive model | Exponential loss | Stagewise + optimize base learner

Page 12: Additive model and boosting tree

Boosting Tree

• Additive model and forward stagewise algorithm • Boosting tree • Adaboost • Gradient boosting tree

Page 13: Additive model and boosting tree

Additive model

• Linear combination of base predictors

$f(x) = \sum_{m=1}^{M} \beta_m b(x; r_m)$

• Determine f(x) by minimizing the empirical loss

$\min_{\{\beta_m, r_m\}} \sum_{i=1}^{N} L\Bigl(y_i, \sum_{m=1}^{M} \beta_m b(x_i; r_m)\Bigr)$

– which is difficult to solve directly for a general loss function and base learner

Page 14: Additive model and boosting tree

Forward Stagewise Additive Modeling

• Idea: approximate the joint optimization by learning the base functions one by one

(1). Initialize $f_0(x) = 0$

(2). For $m = 1, 2, \ldots, M$:

(a). $(\beta_m, r_m) = \arg\min_{\beta, r} \sum_{i=1}^{N} L\bigl(y_i, f_{m-1}(x_i) + \beta\, b(x_i; r)\bigr)$

(b). $f_m(x) = f_{m-1}(x) + \beta_m b(x; r_m)$

(3). $f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m b(x; r_m)$

Page 15: Additive model and boosting tree

Boosting tree

• Boosting tree = forward stagewise additive modeling with a decision tree as the base learner

• Different implementations of boosting tree correspond to different loss functions

• Can be used for both regression and classification

$f_m(x) = f_{m-1}(x) + T(x; \Theta_m)$

$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\bigl(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\bigr)$

Page 16: Additive model and boosting tree

Boosting tree for regression

• When the quadratic loss function is chosen

Input: training set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$
Output: boosting tree for regression $f_M(x)$

1. Initialize $f_0(x) = 0$
2. For $m = 1$ to $M$:
   (a). Compute residuals $r_{mi} = y_i - f_{m-1}(x_i)$, $i = 1, 2, \ldots, N$
   (b). Learn a regression tree $T(x; \Theta_m)$ by fitting the residuals $r_{mi}$
   (c). Update $f_m(x) = f_{m-1}(x) + T(x; \Theta_m)$
3. Get the final regression boosting tree $f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$
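A short Python sketch of the regression boosting tree above, using scikit-learn's DecisionTreeRegressor as the base learner (an illustration under that assumption, not code from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting_tree_regression(X, y, M=50, max_depth=3):
    """Forward stagewise fitting with squared loss: each tree is fit to the
    current residuals, and f_M(x) is the sum of all M trees."""
    trees = []
    f = np.zeros(len(y), dtype=float)        # f_0(x) = 0
    for m in range(M):
        residual = y - f                     # r_mi = y_i - f_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                # learn T(x; Theta_m) on residuals
        f += tree.predict(X)                 # f_m = f_{m-1} + T(x; Theta_m)
        trees.append(tree)
    return trees

def predict_boosting_tree(trees, X):
    """f_M(x) = sum over the M fitted trees."""
    return np.sum([t.predict(X) for t in trees], axis=0)
```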

Page 17: Additive model and boosting tree

Boosting tree for classification

• When the exponential loss function is chosen – Adaboost + classification tree

$L(y, f(x)) = \exp(-y f(x))$

• When the binomial deviance loss function is chosen – LogitBoost + classification tree

$L(y, f(x)) = \log[1 + \exp(-y f(x))]$

Page 18: Additive model and boosting tree

Adaboost review

Input: training set $\{(x_i, y_i)\}_{i=1}^{N}$, $y_i \in \{-1, +1\}$; number of iterations $M$

1. Initialize the weights of the training samples:
   $W_1 = (w_{1,1}, \ldots, w_{1,i}, \ldots, w_{1,N})$, $w_{1,i} = \frac{1}{N}$, $i = 1, 2, \ldots, N$

2. For $m = 1$ to $M$:
   1). Fit a base learner $G_m(x): \chi \to \{-1, +1\}$ using the dataset with weights $W_m$
   2). Calculate the classification error on the training dataset: $e_m = \sum_{i=1}^{N} w_{m,i}\, I(G_m(x_i) \ne y_i)$
   3). Calculate the coefficient of $G_m(x)$ using the classification error: $a_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$
   4). Update the weight of each training sample: $W_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,N})$, $w_{m+1,i} \leftarrow w_{m,i} \exp(-a_m y_i G_m(x_i))$

3. Get the final classifier: $G(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\Bigl(\sum_{m=1}^{M} a_m G_m(x)\Bigr)$
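A compact Python sketch of the algorithm above (an illustration; the choice of a depth-1 tree as the weak learner and the weight normalization step are my assumptions, not stated on the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, M=50):
    """AdaBoost as outlined above; labels y must be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    learners, alphas = [], []
    for m in range(M):
        G = DecisionTreeClassifier(max_depth=1)  # weak learner: a stump (assumption)
        G.fit(X, y, sample_weight=w)             # 1) fit base learner with weights W_m
        pred = G.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)  # 2) weighted error e_m
        err = np.clip(err, 1e-10, 1 - 1e-10)       # guard against division by zero
        a = 0.5 * np.log((1 - err) / err)        # 3) coefficient a_m
        w = w * np.exp(-a * y * pred)            # 4) reweight samples
        w /= w.sum()                             #    normalize so weights sum to 1
        learners.append(G)
        alphas.append(a)
    return learners, alphas

def predict_adaboost(learners, alphas, X):
    """G(x) = sign(sum_m a_m G_m(x))."""
    score = sum(a * G.predict(X) for G, a in zip(learners, alphas))
    return np.sign(score)
```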

Page 19: Additive model and boosting tree

Adaboost: forward stagewise additive modeling with exponential loss

• Exponential loss function

$L(y, f(x)) = \exp[-y f(x)]$

• Forward stagewise additive modeling

$f_m(x) = f_{m-1}(x) + a_m G_m(x)$

$(a_m, G_m(x)) = \arg\min_{a, G} \sum_{i=1}^{N} \exp[-y_i (f_{m-1}(x_i) + a G(x_i))] = \arg\min_{a, G} \sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i\, a\, G(x_i)]$, where $\bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]$

• Infer $a_m$ and $G_m(x)$

Page 20: Additive model and boosting tree

Adaboost: forward stagewise additive modeling with exponential loss (2)

• Continued

Inference of $G_m(x)$: for any $a > 0$, we have

$G_m^{*}(x) = \arg\min_{G} \sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \ne G(x_i))$

Inference of $a_m$:

$\sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i\, a\, G(x_i)] = \sum_{y_i = G(x_i)} \bar{w}_{mi}\, e^{-a} + \sum_{y_i \ne G(x_i)} \bar{w}_{mi}\, e^{a} = (e^{a} - e^{-a}) \sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \ne G(x_i)) + e^{-a} \sum_{i=1}^{N} \bar{w}_{mi}$

$\Rightarrow a_m^{*} = \frac{1}{2} \log \frac{1 - e_m}{e_m}$, where $e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi}\, I(y_i \ne G_m(x_i))}{\sum_{i=1}^{N} \bar{w}_{mi}}$

Page 21: Additive model and boosting tree

Adaboost: forward stagewise additive modeling with exponential loss (3)

• Weight update for each sample

$f_m(x) = f_{m-1}(x) + a_m G_m(x)$

$\bar{w}_{m+1, i} = \exp[-y_i f_m(x_i)]$

$\Rightarrow \bar{w}_{m+1, i} = \bar{w}_{m, i} \exp(-y_i a_m G_m(x_i))$

Page 22: Additive model and boosting tree

CART review

• Select the splitting variable according to the Gini index

• Can be used for both regression and classification • Grow the tree as large as possible first, then prune via validation

• Parameters – Height; stop-split conditions

$Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$

$Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$, where $D_1 = \{(x, y) \in D \mid A(x) = a\}$, $D_2 = D - D_1$
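A small numpy sketch (an illustration, not from the slides) of the two Gini quantities above, computed from empirical class frequencies:

```python
import numpy as np

def gini(labels):
    """Gini(p) = 1 - sum_k p_k^2 for the empirical class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels, feature_values, a):
    """Gini(D, A): weighted Gini after splitting D on the test 'feature == a'."""
    mask = (feature_values == a)
    d1, d2 = labels[mask], labels[~mask]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```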

Page 23: Additive model and boosting tree

Experiment

• Goal: evaluate the performance of boosting tree • Algorithms – Logistic regression – CART – Boosting tree (Adaboost + CART)

• Hulu internal datasets – Ad intelligence

Page 24: Additive model and boosting tree

Experiment (2)

• Task: predict whether the recall is high or low (binary classification)

• Dataset: Ad intelligence – 718 samples; 93 features – 5-fold cross validation

• AUC with logistic regression: 0.89 • Parameters for boosting tree – tree height, number of base learners, and stop-split conditions
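The Ad-intelligence dataset is Hulu-internal, but the evaluation protocol is easy to reproduce on any (X, y) pair; a sketch with scikit-learn, where the randomly generated X and y below are placeholders for the real 718 x 93 dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholders standing in for the internal Ad-intelligence data:
# X has shape (718, 93), y is the binary high/low recall label.
rng = np.random.default_rng(0)
X = rng.random((718, 93))
y = rng.integers(0, 2, 718)

# 5-fold cross-validated AUC for the logistic-regression baseline
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc")
print("mean AUC:", auc.mean())
```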

Page 25: Additive model and boosting tree

Experiment (3)

• Test AUC with boosting tree: 0.96 – 0.79 for a single CART (height 6)

[Figure: AUC on test dataset (5-fold cross validation) as a function of base learner number (1-50), for tree heights H=2 through H=6; AUC axis from 0.4 to 1.0]

Page 26: Additive model and boosting tree

Gradient boosting

• Allows optimization of an arbitrary differentiable loss function

• Uses the gradient descent idea to approximate the residual

– When the quadratic loss function is chosen, it is the ordinary residual

Pseudo-residual: $r = -\left[\frac{\partial L(y, f(x))}{\partial f(x)}\right]_{f(x) = f_{m-1}(x)}$

For the quadratic loss $L(y, f(x)) = \frac{1}{2}(y - f(x))^2$, the pseudo-residual is exactly $y - f(x)$.

Page 27: Additive model and boosting tree

Gradient boosting: pseudo code

Input: training set $\{(x_i, y_i)\}_{i=1}^{n}$; a differentiable loss function $L(y, F(x))$; number of iterations $M$

1. Initialize the model with a constant value: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$

2. For $m = 1$ to $M$:
   1). Compute pseudo-residuals: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for $i = 1, \ldots, n$
   2). Fit a base learner $h_m(x)$ to the pseudo-residuals (train using the dataset $\{(x_i, r_{im})\}_{i=1}^{n}$)
   3). Compute the multiplier $\gamma_m$ by solving the one-dimensional optimization problem: $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i))$
   4). Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$

3. Output $F_M(x)$
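A compact Python sketch of the pseudo code above for one concrete differentiable loss, the binomial deviance (an illustration; initializing F_0 to 0 and replacing the line-searched multiplier with a fixed shrinkage factor are simplifications of mine):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, M=100, max_depth=3, nu=0.1):
    """Gradient boosting with binomial deviance L = log(1 + exp(-y F)), y in {-1, +1}.
    The pseudo-residual -dL/dF is y / (1 + exp(y F))."""
    F = np.zeros(len(y))                       # F_0(x) = 0 (simplification of argmin over a constant)
    trees = []
    for m in range(M):
        r = y / (1.0 + np.exp(y * F))          # 1) pseudo-residuals
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, r)                            # 2) fit base learner to pseudo-residuals
        # 3)+4) a fixed shrinkage factor nu replaces the line-searched gamma_m here
        F += nu * h.predict(X)
        trees.append(h)
    return trees

def decision_function(trees, X, nu=0.1):
    """F_M(x); classify with sign(F_M(x))."""
    return nu * np.sum([t.predict(X) for t in trees], axis=0)
```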

Page 28: Additive model and boosting tree

Gradient tree boosting

• Use a decision tree as the base learner • Stagewise learning, choosing $\gamma$ with line search • Friedman proposes choosing a separate optimal value $\gamma_{jm}$ for each of the tree's regions

$h_m(x) = \sum_{j=1}^{J} b_{jm}\, I(x \in R_{jm})$

$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$, $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i))$

$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$, $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$

Page 29: Additive model and boosting tree

Parameter choices and tricks

• Parameter choices – Terminal nodes J: [4, 8] is recommended – Iterations M: selected by evaluation on test/validation data

• Tricks for improvement – Shrinkage (formula below) – Stochastic gradient boosting

$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x)$, $0 < \nu \le 1$
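Both tricks map directly onto scikit-learn's gradient boosting parameters; a configuration sketch (the concrete values are arbitrary illustrations, not recommendations from the slides):

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=200,   # iterations M, chosen by validation in practice
    max_leaf_nodes=8,   # terminal nodes J, in the recommended [4, 8] range
    learning_rate=0.1,  # shrinkage factor nu, 0 < nu <= 1
    subsample=0.8,      # stochastic gradient boosting: each tree sees 80% of samples
)
```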

Page 30: Additive model and boosting tree

Boosting Tree Summary

• Forward stagewise additive model with trees • Pros – Performance is usually good – Works for both regression and classification – No need to transform/normalize the data – Few parameters and easy to tune

• Tips – Try loss functions other than exponential loss, especially when noise exists in the data

–  Bump  is  usually  good  

Page 31: Additive model and boosting tree

Resources

• Implementation/Tools – MART (Multiple Additive Regression Trees) – Will share my implementation later

• More on boosting tree – "Elements of Statistical Learning" – 《统计学习方法》 (Statistical Learning Methods) – Parallelization: "Scaling Up Machine Learning"