
    BOOSTING (ADABOOST ALGORITHM)

    Eric Emer


    Consider Horse-Racing Gambler

    Rules of thumb for determining win/loss:
    • Most favored odds
    • Fastest recorded lap time
    • Most wins recently, say, in the past month

    It is hard to determine how the gambler combines his analysis of the feature set into a single bet.


    Consider MIT Admissions

    • 2-class system (Admit/Deny)
    • Both quantitative data and qualitative data

    We consider (Y/N) answers to be quantitative (-1, +1). Region, for instance, is qualitative.


    Rules of Thumb, Weak Classifiers

    It is easy to come up with rules of thumb that correctly classify the training data at better than chance.

    E.g., IF GoodAtMath == Y THEN predict Admit.

    It is difficult to find a single, highly accurate prediction rule. This is where boosting, via the AdaBoost algorithm, helps us.


    What is a Weak Learner?

    For any distribution, with high probability, given polynomially many examples and polynomial time, we can find a classifier with generalization error better than random guessing:

    \exists\, \gamma > 0 \text{ such that generalization error} \le \tfrac{1}{2} - \gamma
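
    A slightly more explicit phrasing of this definition, reconstructed in standard PAC terms rather than taken verbatim from the slide: for every distribution D over labeled examples and every failure probability \delta > 0, given poly(1/\delta) examples and running time, the weak learner returns a classifier h with

    \Pr\big[\, \mathrm{err}_D(h) \le \tfrac{1}{2} - \gamma \,\big] \ge 1 - \delta \qquad \text{for some fixed } \gamma > 0

    Boosting only needs this fixed edge \gamma over random guessing; the next slide calls this the Weak Learning Assumption.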


    Weak Learning Assumption

    We assume that our weak learning algorithm (WeakLearner) can consistently find weak classifiers (rules of thumb which classify the data correctly at better than 50%).

    Given this assumption, we can use boosting to generate a single weighted classifier which correctly classifies our training data at 99%-100%.


    AdaBoost Specifics

    How does AdaBoost weight training examples optimally? It focuses on the difficult data points: the data points that were misclassified most by the previous weak classifier.

    How does AdaBoost combine these weak classifiers into a comprehensive prediction? It uses an optimally weighted majority vote of the weak classifiers.


    AdaBoost Technical Description

    Missing details: How do we generate the distribution D_t? How do we get a single classifier from the weak classifiers? Both are constructed on the next two slides; a code sketch of the full loop follows below.
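
    Since the algorithm box from the original slide did not survive the transcript, here is a minimal sketch of the AdaBoost loop described on this and the next two slides. It assumes binary labels in {-1, +1} and uses a one-feature threshold stump as the weak learner; the function names (train_stump, adaboost, predict) are illustrative, not from the slides.

    import numpy as np

    def train_stump(X, y, D):
        """Weak learner: choose the single-feature threshold stump with the
        lowest weighted error under the current distribution D."""
        m, n = X.shape
        best, best_err = None, np.inf
        for j in range(n):                       # feature index
            for thr in np.unique(X[:, j]):       # candidate threshold
                for sign in (+1, -1):            # orientation of the stump
                    pred = np.where(X[:, j] <= thr, sign, -sign)
                    err = D[pred != y].sum()     # weighted training error
                    if err < best_err:
                        best_err, best = err, (j, thr, sign)
        return best, best_err

    def stump_predict(stump, X):
        j, thr, sign = stump
        return np.where(X[:, j] <= thr, sign, -sign)

    def adaboost(X, y, T=10):
        """y must use -1/+1 labels. Returns a list of (alpha_t, h_t) pairs."""
        m = len(y)
        D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
        ensemble = []
        for t in range(T):
            h_t, eps_t = train_stump(X, y, D)    # weak classifier and its weighted error
            eps_t = np.clip(eps_t, 1e-12, 1 - 1e-12)
            alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)   # alpha_t = 1/2 ln((1-eps_t)/eps_t)
            pred = stump_predict(h_t, X)
            D = D * np.exp(-alpha_t * y * pred)  # shrink correct points, grow mistakes
            D = D / D.sum()                      # divide by Z_t (normalization constant)
            ensemble.append((alpha_t, h_t))
        return ensemble

    def predict(ensemble, X):
        """H_final(x) = sign(sum_t alpha_t h_t(x))."""
        votes = sum(alpha * stump_predict(h, X) for alpha, h in ensemble)
        return np.where(votes >= 0, 1, -1)

    In the MIT-admissions example from earlier, each stump plays the role of a rule of thumb such as IF GoodAtMath == Y THEN Admit; boosting supplies the weights for the vote.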


    Constructing D_t

    D_1(i) = 1/m

    Given D_t and h_t:

    D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot c(x_i), \quad \text{where} \quad
    c(x_i) = \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases}

    Equivalently,

    D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}

    where Z_t is a normalization constant and

    \alpha_t = \frac{1}{2} \ln\!\Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big) > 0
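
    As a small worked example of this update (the numbers are invented for illustration): with five training points, uniform weights, and a weak classifier that misclassifies only the fifth point, we get \epsilon_t = 0.2 and \alpha_t = \tfrac{1}{2} \ln(0.8/0.2) \approx 0.693.

    import numpy as np

    # One round of the D_t -> D_{t+1} update on 5 examples (illustrative numbers).
    D_t   = np.full(5, 0.2)                      # uniform D_1(i) = 1/m
    y     = np.array([+1, +1, -1, -1, +1])
    h_t   = np.array([+1, +1, -1, -1, -1])       # h_t misclassifies only example 5
    eps_t = np.sum(D_t[h_t != y])                # weighted error = 0.2
    alpha = 0.5 * np.log((1 - eps_t) / eps_t)    # ~ 0.693

    D_unnorm = D_t * np.exp(-alpha * y * h_t)    # correct points shrink, the mistake grows
    Z_t      = D_unnorm.sum()                    # normalization constant
    D_next   = D_unnorm / Z_t

    print(np.round(D_next, 3))   # -> [0.125 0.125 0.125 0.125 0.5]

    Note that Z_t = 0.8 = 2\sqrt{\epsilon_t(1 - \epsilon_t)}, the quantity that reappears in the training error analysis a few slides later.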


    Getting a Single Classifier

    H_{final}(x) = \mathrm{sign}\Big(\sum_t \alpha_t h_t(x)\Big)
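
    A direct transcription of this formula, assuming the list-of-(alpha_t, h_t) representation from the sketch above; the names H_final and ensemble are illustrative:

    def H_final(x, ensemble):
        """ensemble: list of (alpha_t, h_t) pairs, each h_t a callable returning +1 or -1."""
        score = sum(alpha * h(x) for alpha, h in ensemble)
        return 1 if score >= 0 else -1

    # Example with two hand-made weak classifiers on a scalar input:
    hs = [(0.7, lambda x: 1 if x > 0 else -1),
          (0.3, lambda x: 1 if x > 5 else -1)]
    print(H_final(2, hs))   # 0.7*(+1) + 0.3*(-1) = 0.4 >= 0  ->  prints 1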


    Mini-Problem


    Training Error Analysis

    Claim: writing the weighted error of each weak classifier as \epsilon_t = \tfrac{1}{2} - \gamma_t (so \gamma_t is its edge over random guessing), then

    \text{training error}(H_{final}) \le \prod_t 2\sqrt{\epsilon_t(1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\Big(-2\sum_t \gamma_t^2\Big)

    so the training error drops exponentially fast in the number of rounds.


    Proof

    Step 1: unwrapping the recurrence

    Step 2: show that \text{training error}(H_{final}) \le \prod_t Z_t

    Step 3: show that Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)}
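
    Spelling the three steps out (a reconstruction from the distribution update defined earlier, not a verbatim copy of the original slide):

    % Step 1: unwrap D_{t+1}(i) = D_t(i) e^{-\alpha_t y_i h_t(x_i)} / Z_t starting from D_1(i) = 1/m:
    D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_t Z_t}, \qquad f(x) = \sum_t \alpha_t h_t(x)

    % Step 2: a misclassified point has y_i f(x_i) \le 0, so \mathbf{1}[y_i \ne H_{final}(x_i)] \le e^{-y_i f(x_i)}:
    \text{training error}(H_{final}) \le \frac{1}{m} \sum_i e^{-y_i f(x_i)} = \sum_i D_{T+1}(i) \prod_t Z_t = \prod_t Z_t

    % Step 3: plug \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} into the definition of Z_t:
    Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 2\sqrt{\epsilon_t (1 - \epsilon_t)}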


    How might test error react to AdaBoost?

    We expect to encounter:

    • Occam's Razor
    • Overfitting


    Empirical results of test error

    Test error does not increase even after 1,000 rounds. Test error continues to drop after training error reaches zero.


    Difference from Expectation: The Margins

    Explanation: Our training error only measures the correctness of classifications; it neglects the confidence of classifications. How can we measure the confidence of classifications?

    H_{final}(x) = \mathrm{sign}(f(x)), \qquad f(x) = \frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \in [-1, 1], \qquad \mathrm{margin}(x, y) = y f(x)

    • margin(x, y) close to +1: high confidence, correct.
    • margin(x, y) close to -1: high confidence, incorrect.
    • margin(x, y) close to 0: low confidence.
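
    A hedged sketch of how these margins could be computed, again assuming the (alpha_t, h_t) representation from the earlier sketches (names are illustrative):

    import numpy as np

    def margin(x, y, ensemble):
        """margin(x, y) = y * f(x), with f(x) = sum_t alpha_t h_t(x) / sum_t alpha_t in [-1, 1]."""
        alphas = np.array([a for a, _ in ensemble])
        votes  = np.array([h(x) for _, h in ensemble])    # each h returns +1 or -1
        f_x = np.dot(alphas, votes) / alphas.sum()
        return y * f_x

    # Example: two weak classifiers that both vote +1 on this point -> margin of +1.
    hs = [(0.9, lambda x: 1 if x > 0 else -1),
          (0.4, lambda x: 1 if x > -1 else -1)]
    print(margin(x=2.0, y=+1, ensemble=hs))   # (0.9*1 + 0.4*1) / 1.3 = 1.0

    Plotting the cumulative distribution of these margins over the training set is exactly what the next slide refers to.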


    Empirical Evidence Supporting Margins

    Cumulative distribution of margins on training examples


    Pros/Cons of AdaBoost

    Pros

    • Fast
    • Simple and easy to program
    • No parameters to tune (except the number of rounds T)
    • No prior knowledge needed about the weak learner
    • Provably effective given the Weak Learning Assumption
    • Versatile

    Cons

    • Weak classifiers that are too complex can lead to overfitting.
    • Weak classifiers that are too weak can lead to low margins, and can also lead to overfitting.
    • From empirical evidence, AdaBoost is particularly vulnerable to uniform noise.


    Predicting College Football Results

    Training data: 2009 NCAAF season
    Test data: 2010 NCAAF season