Boosting and Additive Trees (2)


  • Slide 1/26

    Boosting and Additive Trees (2)

    Yi Zhang, Kevyn Collins-Thompson

    Advanced Statistical Seminar 11-745, Oct 29, 2002

  • Slide 2/26

    Recap: Boosting (1)

    Background: Ensemble Learning

    Boosting Definitions, Example

    AdaBoost

    Boosting as an Additive Model

    Boosting Practical Issues

    Exponential Loss

    Other Loss Functions

    Boosting Trees

    Boosting as Entropy Projection

    Data Mining Methods

  • Slide 3/26

    Outline for This Class

    Find the solution based on numerical optimization

    Control the model complexity and avoid overfitting

    Right-sized trees for boosting

    Number of iterations

    Regularization

    Understand the final model (interpretation)

    Single variable

    Correlation of variables

  • Slide 4/26

    Numerical Optimization

    Goal: Find f that minimizes the loss function over the training data

    Gradient descent search in the unconstrained function space to minimize the loss on the training data

    Loss on training data converges to zero

    $$\mathbf{f} \equiv \{f(x_1), f(x_2), \ldots, f(x_N)\}, \qquad \hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}) = \arg\min_{\mathbf{f}} \sum_{i=1}^{N} L(y_i, f(x_i))$$

    Gradient descent updates $\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m$, where the gradient $\mathbf{g}_m \in \mathbb{R}^N$ has components

    $$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}$$

    and the step length is $\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\, \mathbf{g}_m)$.

    With $\mathbf{f}$ unconstrained, the training loss is driven to zero: $\{f(x_1), \ldots, f(x_N)\} \to \{y_1, \ldots, y_N\}$.
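    As a concrete illustration (not from the original slides), here is a minimal numpy sketch of unconstrained gradient descent in function space for squared-error loss. The fitted values f(x_i) are free parameters, so the training loss goes to zero and f converges to y, exactly as noted above; a fixed step size stands in for the line search over rho_m.

    import numpy as np

    # Toy targets: with squared-error loss L(f) = 0.5 * sum_i (y_i - f(x_i))^2,
    # the gradient with respect to the fitted value f(x_i) is -(y_i - f(x_i)).
    y = np.array([1.0, -2.0, 0.5, 3.0])
    f = np.zeros_like(y)            # f = {f(x_1), ..., f(x_N)}, start from 0
    rho = 0.5                       # fixed step length standing in for the line search over rho_m

    for m in range(60):
        g = -(y - f)                # gradient g_m evaluated at f_{m-1}
        f = f - rho * g             # f_m = f_{m-1} - rho_m * g_m

    print(f)                              # ~[ 1.  -2.   0.5  3. ]
    print(0.5 * np.sum((y - f) ** 2))     # training loss has converged to (numerically) zero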

  • Slide 5/26

    Gradient Search on a Constrained Function Space: Gradient Tree Boosting

    Introduce a tree at the m-th iteration whose predictions $t_m$ are as close as possible to the negative gradient

    Advantage compared with unconstrained gradient search: robust, less likely to overfit

    $$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big( -g_{im} - T(x_i; \Theta) \big)^2$$
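    A minimal sketch of one such constrained step (assuming scikit-learn is available): grow a small regression tree whose predictions approximate the negative gradient, shown here for squared-error loss, where the negative gradient is simply the residual.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

    f_prev = np.zeros(200)              # current fit f_{m-1}(x_i)
    neg_grad = y - f_prev               # -g_im for squared-error loss is the residual

    # Approximate argmin_Theta sum_i (-g_im - T(x_i; Theta))^2 by growing a small
    # regression tree on the negative gradient; its predictions t_m track -g_m.
    tree = DecisionTreeRegressor(max_leaf_nodes=6).fit(X, neg_grad)
    t_m = tree.predict(X)
    print(np.mean((neg_grad - t_m) ** 2))   # small: the tree tracks the negative gradient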

  • Slide 6/26

    Algorithm 3: MART (Multiple Additive Regression Trees)

    1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$ (a single terminal node tree).

    2. For m = 1 to M:

       a) Compute pseudo-residuals based on the loss function: $r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$, for i = 1, ..., N.

       b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, j = 1, 2, ..., $J_m$.

       c) Find the optimal value of the coefficient within each different region $R_{jm}$: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$.

       d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.

    3. Output $\hat{f}(x) = f_M(x)$.
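    A minimal, self-contained sketch of Algorithm 3 for squared-error loss (not from the slides; it assumes scikit-learn trees as the base learner). With L(y, f) = 0.5 * (y - f)^2 the pseudo-residuals are y_i - f_{m-1}(x_i) and the optimal gamma_jm in each region is just the mean residual there, so steps (a)-(d) collapse; other losses would need their own gradient and a per-region line search in step (c).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def mart_squared_error(X, y, M=100, J=8):
        """Gradient tree boosting (Algorithm 3) for squared-error loss."""
        # 1. Initialize f_0(x) to the optimal constant (a single terminal node tree).
        f0 = y.mean()
        f = np.full_like(y, f0, dtype=float)
        trees = []
        for m in range(M):
            # a) Pseudo-residuals: negative gradient of 0.5*(y - f)^2 w.r.t. f.
            r = y - f
            # b) Fit a J-terminal-node regression tree to the residuals.
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)
            # c) For squared-error loss the optimal gamma_jm in region R_jm is the
            #    mean residual there, which is exactly what the tree predicts.
            # d) Update f_m(x) = f_{m-1}(x) + sum_j gamma_jm I(x in R_jm).
            f += tree.predict(X)
            trees.append(tree)
        return f0, trees

    def predict(f0, trees, X):
        return f0 + sum(t.predict(X) for t in trees)

    # Example usage on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 2))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=300)
    f0, trees = mart_squared_error(X, y, M=50, J=6)
    print(np.mean((y - predict(f0, trees, X)) ** 2))   # small training error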

  • Slide 7/26

    View Boosting as a Linear Model

    Basis expansion: use basis functions $T_m$ (m = 1, ..., M, each $T_m$ a weak learner) to transform the input vector X into T-space, then use a linear model in this new space

    Special for boosting: the choice of basis function $T_m$ depends on $T_1, \ldots, T_{m-1}$

  • Slide 8/26

    Improve Boosting as Linear Model

    Recap: Linear models in Chapter 3

    Bias-variance trade-off:

    1. Subset selection (feature selection, discrete)

    2. Coefficient shrinkage (smoothing: ridge, lasso)

    3. Using derived input directions (PCA, PLS)

    Multiple outcome shrinkage and selection: exploit correlations in different outcomes

    This chapter: improve boosting

    1. Size of the constituent trees J

    2. Number of boosting iterations M (subset selection)

    3. Regularization (shrinkage)

  • Slide 9/26

    Right-Sized Trees for Boosting (?)

    The best for one step is not the best in the long run

    Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation

    Simple approach: restrict all trees to be the same size J

    J limits the interaction level of the input features in the tree-based approximation

    In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well (?)
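    To illustrate the "same size J" recipe (an illustration, not part of the slides): scikit-learn's GradientBoostingRegressor exposes max_leaf_nodes, which caps the number of terminal regions J of every constituent tree, so values in the suggested 4-8 range can be compared against much larger trees on held-out data.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 4))
    y = np.sin(X[:, 0]) * X[:, 1] + 0.2 * rng.normal(size=1000)   # a low-order interaction
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for J in (2, 4, 8, 32):   # every tree in the expansion has at most J terminal nodes
        gbm = GradientBoostingRegressor(max_leaf_nodes=J, n_estimators=200, random_state=0)
        gbm.fit(X_tr, y_tr)
        print(J, gbm.score(X_te, y_te))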

  • Slide 10/26

  • Slide 11/26

    Number of Boosting Iterations (subset selection)

    Boosting will overfit as M → ∞

    Use a validation set

    Other methods (later)
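    A minimal sketch of picking M with a validation set (scikit-learn assumed): staged_predict yields predictions after each boosting iteration, so the validation error can be tracked across all M stages and the minimizing M chosen.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 4))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=1000)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    gbm = GradientBoostingRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

    # Validation error after each of the M boosting iterations.
    val_err = [np.mean((y_val - pred) ** 2) for pred in gbm.staged_predict(X_val)]
    best_M = int(np.argmin(val_err)) + 1
    print(best_M, val_err[best_M - 1])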

  • Slide 12/26

    Shrinkage

    Scale the contribution of each tree by a factor 0 < ν < 1
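    A small illustration (again assuming scikit-learn, not part of the slides): the shrinkage factor ν multiplies each tree's contribution in the update of Algorithm 3, and corresponds to the learning_rate parameter; smaller ν typically needs a larger M but generalizes better.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 4))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=1000)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for nu in (1.0, 0.1, 0.01):   # the shrinkage factor 0 < nu <= 1
        gbm = GradientBoostingRegressor(learning_rate=nu, n_estimators=1000, random_state=0)
        gbm.fit(X_tr, y_tr)
        print(nu, gbm.score(X_te, y_te))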

  • Slide 13/26

  • Slide 14/26

    Penalized Regression

    Ridge regression or lasso regression:

    $$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha) \right\}$$

    $$J(\alpha) = \sum_{k=1}^{K} \alpha_k^2 \quad \text{(ridge regression, L2 norm)}$$

    $$J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \quad \text{(lasso)}$$
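    A brief sketch of the two penalties (scikit-learn assumed; not from the slides). The columns of the design matrix here stand in for a hypothetical, pre-computed dictionary of tree predictions T_k(x_i); Ridge applies the L2 penalty and Lasso the L1 penalty, and the L1 penalty drives most coefficients alpha_k exactly to zero.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    N, K = 200, 50
    T = rng.normal(size=(N, K))           # T[i, k] stands in for T_k(x_i), the k-th tree's prediction
    alpha_true = np.zeros(K)
    alpha_true[:5] = rng.normal(size=5)   # only a few basis trees actually matter
    y = T @ alpha_true + 0.1 * rng.normal(size=N)

    ridge = Ridge(alpha=1.0).fit(T, y)    # penalty lambda * sum(alpha_k^2)
    lasso = Lasso(alpha=0.05).fit(T, y)   # penalty lambda * sum(|alpha_k|)

    print(np.sum(np.abs(ridge.coef_) > 1e-8))   # ridge keeps essentially all coefficients
    print(np.sum(np.abs(lasso.coef_) > 1e-8))   # lasso zeroes out most of them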

  • Slide 15/26

    Algorithm 4: Forward Stagewise Linear Regression

    1. Initialize $\alpha_k = 0$, k = 1, ..., K. Set $\varepsilon > 0$ to some small constant, and M large.

    2. For m = 1 to M:

       a) $(\beta^*, k^*) = \arg\min_{\beta, k} \sum_{i=1}^{N} \Big( y_i - \sum_{l=1}^{K} \alpha_l T_l(x_i) - \beta T_k(x_i) \Big)^2$

       b) $\alpha_{k^*} \leftarrow \alpha_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$

    3. Output $f(x) = \sum_{k=1}^{K} \alpha_k T_k(x)$.
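    A minimal numpy sketch of Algorithm 4 (not from the slides): at every pass, find the basis column whose single-coefficient update most reduces the squared residual, then move that coefficient by epsilon * sign(beta*). With small epsilon and many passes, the coefficient paths mimic the lasso.

    import numpy as np

    def forward_stagewise(T, y, eps=0.01, M=5000):
        """Algorithm 4: forward stagewise linear regression on basis matrix T (N x K)."""
        N, K = T.shape
        alpha = np.zeros(K)                       # 1. initialize alpha_k = 0
        for m in range(M):                        # 2. for m = 1..M
            resid = y - T @ alpha
            # a) best (beta*, k*): for fixed k, the optimal beta is the least-squares
            #    coefficient of the current residual on column k.
            betas = T.T @ resid / np.sum(T ** 2, axis=0)
            sse = np.sum(resid ** 2) - betas ** 2 * np.sum(T ** 2, axis=0)
            k_star = int(np.argmin(sse))
            # b) take a small step in the direction of the best single-column fit
            alpha[k_star] += eps * np.sign(betas[k_star])
        return alpha                              # 3. f(x) = sum_k alpha_k T_k(x)

    # Example usage on a random basis.
    rng = np.random.default_rng(0)
    T = rng.normal(size=(100, 10))
    y = 2 * T[:, 0] - 1.5 * T[:, 3] + 0.1 * rng.normal(size=100)
    print(np.round(forward_stagewise(T, y), 2))   # large weights on columns 0 and 3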

  • Slide 16/26

    (ε, M) vs. lasso regression's t

    If $\hat{\alpha}_k(\lambda)$ is monotone in $\lambda$, and we set $\sum_k |\alpha_k| = M\varepsilon = t$, then the solution of Algorithm 4 is identical to the lasso regression result as described on page 64.

  • Slide 17/26

    More About Algorithm 4

    Algorithm 4 ≈ Algorithm 3 + shrinkage

    L1 norm vs. L2 norm: more details later, in Chapter 12 after learning SVMs

  • Slide 18/26

    Interpretation: Understanding the Final Model

    Single decision trees are easy to interpret

    A linear combination of trees is difficult to understand

    Which features are important?

    What are the interactions between features?

  • Slide 19/26

    Relative Importance of Individual Variables

    For a single tree T, define the (squared) importance of $x_l$ as

    $$\mathcal{I}_l^2(T) = \sum_{\text{all nodes using } x_l \text{ for the partition}} \big( \text{improvement in squared-error risk over that for a constant fit over the region} \big)$$

    For additive trees, define the importance of $x_l$ as

    $$\mathcal{I}_l^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_l^2(T_m)$$

    For K-class classification, just treat it as K two-class classification tasks
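    For illustration (not part of the slides): scikit-learn's gradient boosting exposes feature_importances_, an impurity-based measure in the same spirit as the averaged squared importance above.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(1000, 5))
    y = 3 * X[:, 0] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=1000)   # features 1, 3, 4 are noise

    gbm = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)
    for l, imp in enumerate(gbm.feature_importances_):
        print(f"x_{l}: {imp:.3f}")    # importance concentrates on x_0 and x_2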

  • Slide 20/26

  • Slide 21/26

    Partial Dependence Plots

    Visualize the dependence of the approximation f(X) on the joint values of important features

    Usually the size of the subsets is small (1-3)

    Define the average or partial dependence

    $$\bar{f}_S(X_S) = E_{X_C}\, f(X_S, X_C) = \int f(X_S, X_C)\, P(X_C)\, dX_C$$

    It can be estimated empirically using the training data:

    $$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC})$$
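    A minimal sketch of the empirical estimate above (scikit-learn assumed for the fitted model): for each value of the chosen feature X_S, average the model's predictions over the training values of the complement features X_C.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 3))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    def partial_dependence(model, X, s, grid):
        """f_bar_S(v) = (1/N) * sum_i f(v, x_iC): average prediction with feature s clamped at v."""
        pd = []
        for v in grid:
            Xv = X.copy()
            Xv[:, s] = v                     # clamp X_S at v, keep each row's own X_C
            pd.append(model.predict(Xv).mean())
        return np.array(pd)

    grid = np.linspace(-3, 3, 7)
    print(np.round(partial_dependence(model, X, s=0, grid=grid), 2))   # roughly sin(grid) + const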

  • Slide 22/26

    10.50 vs. 10.52

    They are the same if the predictor variables are independent

    Why use 10.50 instead of 10.52 to measure partial dependence?

    Example 1: $f(X) = h_1(X_S) + h_2(X_C)$

    Example 2: $f(X) = h_1(X_S) \cdot h_2(X_C)$

    $$10.50:\quad \bar{f}_S(X_S) = E_{X_C}\, f(X_S, X_C) = \int f(X_S, X_C)\, P(X_C)\, dX_C$$

    $$10.52:\quad \tilde{f}_S(X_S) = E\big( f(X_S, X_C) \mid X_S \big) = \int f(X_S, X_C)\, P(X_C \mid X_S)\, dX_C$$

    For Example 1, 10.50 gives

    $$\bar{f}_S(X_S) = \int \big( h_1(X_S) + h_2(X_C) \big)\, P(X_C)\, dX_C = h_1(X_S) + \int h_2(X_C)\, P(X_C)\, dX_C = h_1(X_S) + \text{constant}$$
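    A small numerical check of Example 2 (an illustration under assumed forms of h_1 and h_2, not from the slides): with dependent predictors, the partial dependence (10.50) still recovers the shape of h_1 up to a constant factor, while the conditional expectation (10.52) mixes in the dependence of X_C on X_S.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    x_s = rng.uniform(0, 1, size=N)
    x_c = x_s + 0.1 * rng.normal(size=N)          # X_C depends strongly on X_S

    h1 = lambda s: 1.0 + s                        # assumed h_1
    h2 = lambda c: np.exp(c)                      # assumed h_2
    f = lambda s, c: h1(s) * h2(c)                # Example 2: f(X) = h_1(X_S) * h_2(X_C)

    grid = np.linspace(0.1, 0.9, 5)
    pd_1050   = [f(v, x_c).mean() for v in grid]                          # E_{X_C} f(v, X_C)
    cond_1052 = [f(v, x_c[np.abs(x_s - v) < 0.02]).mean() for v in grid]  # approx E[f | X_S = v]

    print(np.round(np.array(pd_1050) / h1(grid), 3))    # ~constant: 10.50 is proportional to h_1
    print(np.round(np.array(cond_1052) / h1(grid), 3))  # varies with v: 10.52 distorts h_1's shape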

  • Slide 23/26

  • Slide 24/26

  • Slide 25/26

    Conclusion

    Find the solution based on numerical optimization

    Control the model complexity and avoid overfitting

    Right-sized trees for boosting

    Number of iterations

    Regularization

    Understand the final model (interpretation)

    Single variable

    Correlation of variables

  • Slide 26/26