
Page 1: Additive Logistic Regression: a Statistical View of Boosting

J. Friedman, T. Hastie, & R. Tibshirani

The Annals of Statistics (2000)

Page 2: Outline

- Introduction
- A brief history of boosting
- Additive models
- AdaBoost – an additive logistic regression model
- Simulation studies

Page 3: Discrete AdaBoost
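The algorithm box on this slide did not survive extraction. As a stand-in, here is a minimal sketch of Discrete AdaBoost in the paper's formulation, assuming labels y in {-1, +1} and a weak learner that accepts observation weights (a depth-1 tree, i.e. a stump, is used purely for illustration):

```python
# A sketch of Discrete AdaBoost (empirical version of the algorithm the slide shows).
# Assumes y takes values in {-1, +1} and the weak learner accepts sample weights.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, n_rounds=100):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start from uniform weights
    learners, coefs = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        f = stump.predict(X)                     # f(x) in {-1, +1}
        err = np.clip(np.sum(w * (f != y)) / np.sum(w), 1e-12, 1 - 1e-12)
        c = np.log((1.0 - err) / err)            # coefficient for this weak learner
        w *= np.exp(c * (f != y))                # up-weight the misclassified points
        w /= w.sum()
        learners.append(stump)
        coefs.append(c)
    return learners, coefs

def adaboost_predict(learners, coefs, X):
    # Weighted vote: sign of the additive fit F(x) = sum_m c_m f_m(x)
    F = sum(c * l.predict(X) for l, c in zip(learners, coefs))
    return np.sign(F)
```

The scikit-learn stump is only a convenience; any classifier that accepts sample weights could serve as the weak learner.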

Page 4: Performance of Discrete AdaBoost

Page 5: Re-sampling in AdaBoost

Connection with bagging: bagging is a variance-reduction technique. Is boosting also a variance-reduction technique?

Boosting performs comparably well when:
- the weighted tree-growing algorithm is used rather than weighted resampling;
- the randomization component is removed.

Stumps have low variance but high bias.

Boosting is capable of both bias and variance reduction.

Page 6: Real AdaBoost
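The algorithm listing for Real AdaBoost is likewise missing from the transcript. A minimal sketch, assuming the weak learner returns weighted class-probability estimates p_m(x) = P_w(y = 1 | x) (here via a shallow tree's predict_proba):

```python
# Real AdaBoost: each round contributes f_m(x) = (1/2) log[p_m(x) / (1 - p_m(x))].
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, n_rounds=100, eps=1e-6):
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners = []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=2)
        tree.fit(X, y, sample_weight=w)
        # column 1 of predict_proba is class +1 when the labels are {-1, +1}
        p = np.clip(tree.predict_proba(X)[:, 1], eps, 1 - eps)
        f = 0.5 * np.log(p / (1.0 - p))          # half log-odds under the current weights
        w *= np.exp(-y * f)                      # re-weight and renormalize
        w /= w.sum()
        learners.append(tree)
    return learners

def real_adaboost_predict(learners, X, eps=1e-6):
    F = np.zeros(len(X))
    for tree in learners:
        p = np.clip(tree.predict_proba(X)[:, 1], eps, 1 - eps)
        F += 0.5 * np.log(p / (1.0 - p))
    return np.sign(F)
```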

Page 7: Statistical Interpretation of AdaBoost

Fitting an additive model by minimizing squared-error loss in a forward stagewise manner: at the mth stage, F_{m-1}(x) is held fixed and the squared error is minimized to obtain the next term f_m(x).

AdaBoost fits an additive model using a criterion similar to, but not the same as, the binomial log-likelihood, which suggests that better loss functions for classification may exist.
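The displayed equations for the stagewise step are missing from the transcript; a reconstruction in the paper's notation (F_{m-1} is the fit after m-1 terms):

```latex
% Forward stagewise fitting with squared-error loss: hold F_{m-1} fixed,
% choose the next term to best fit the current residual.
f_m = \arg\min_{f}\; E\!\left[\bigl(y - F_{m-1}(x) - f(x)\bigr)^{2}\right],
\qquad
F_m(x) = F_{m-1}(x) + f_m(x).
```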

Page 8: A Brief History of Boosting

The first simple boosting procedure was developed in the PAC-learning framework ("The Strength of Weak Learnability"):
- An initial classifier h1 is learned on the first N training points.
- h2 is learned on a new sample of N points, half of which are misclassified by h1.
- h3 is learned on the N points for which h1 and h2 disagree.
- The boosted classifier is hB = Majority Vote(h1, h2, h3).

Page 9: Additive Models

- Additive regression models
- Extended additive models
- Classification problems

Page 10: Additive Regression Models

Modeling the mean E(y | x) with the additive model E(y | x) = Σ_j f_j(x_j): there is a separate function f_j(x_j) for each of the p input variables x_j.

Backfitting algorithm: a modular "Gauss-Seidel" algorithm for fitting additive models. Each backfitting update re-fits one component against the partial residuals of the others, and the backfitting cycles are repeated until convergence. Backfitting converges to the minimizer of E[y - Σ_j f_j(x_j)]² under fairly general conditions.
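A reconstruction of the additive model and the backfitting update referred to above (the displayed equations did not survive extraction):

```latex
% Additive model for the conditional mean, and the Gauss-Seidel (backfitting) update:
E(y \mid x) = \sum_{j=1}^{p} f_j(x_j),
\qquad
f_j(x_j) \leftarrow E\!\Bigl[\, y - \sum_{k \neq j} f_k(x_k) \;\Bigm|\; x_j \Bigr],
\quad j = 1, \dots, p,\; 1, \dots, p,\; \dots
```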

Page 11: Extended Additive Models (1)

Additive models whose elements are functions of potentially all of the input features x: each term has the form β_m b(x; γ_m), where b(x; γ) is a simple basis function of x parameterized by γ.

Generalized backfitting algorithm: update one term at a time against the partial residual of the remaining terms.

Greedy forward stepwise approach: keep the terms already fit and add one new term at a time.
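The two update formulas are missing from the transcript; a reconstruction, writing each term as β_m b(x; γ_m):

```latex
% Generalized backfitting: re-fit one term against the partial residual of the others.
\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta,\gamma}\;
  E\!\Bigl[\, y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta\, b(x; \gamma) \Bigr]^{2}

% Greedy forward stepwise: keep the previous terms fixed and add one new term.
\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta,\gamma}\;
  E\!\bigl[\, y - F_{m-1}(x) - \beta\, b(x; \gamma) \bigr]^{2},
\qquad
F_m(x) = F_{m-1}(x) + \beta_m b(x; \gamma_m).
```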

Page 12: Extended Additive Models (2)

In the forward stepwise procedure, the base learner (an algorithm for fitting a single weak learner to data) is applied repeatedly, one term at a time. This can be viewed as a procedure for "boosting" a weak learner to form a powerful committee.

Page 13: Classification Problems

Additive logistic regression models the log-odds additively:

log [ P(y = 1 | x) / P(y = -1 | x) ] = F(x) = Σ_m f_m(x).

Inverting gives P(y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}).

These models are usually fit by maximizing the binomial log-likelihood.

Page 14: AdaBoost – an Additive Logistic Regression Model

AdaBoost can be interpreted as a stage-wise estimation procedure for fitting an additive logistic regression model.

AdaBoost optimizes an exponential criterion which, to second order, is equivalent to the binomial log-likelihood criterion.

A more standard likelihood-based boosting procedure (LogitBoost) is proposed.
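A sketch of the second-order equivalence claimed above (a reconstruction): expanding both criteria around F(x) = 0 and using y² = 1,

```latex
e^{-yF} \;\approx\; 1 - yF + \tfrac{1}{2}F^{2},
\qquad
\log\bigl(1 + e^{-2yF}\bigr) \;\approx\; \log 2 - yF + \tfrac{1}{2}F^{2},
```

so, up to an additive constant, the exponential criterion and the negative binomial log-likelihood agree to second order near F = 0.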

Page 15: An Exponential Criterion (1)

Minimizing the criterion J(F) = E[e^{-yF(x)}]: the function F(x) that minimizes J(F) is the symmetric logistic transform of P(y = 1 | x),

F(x) = (1/2) log [ P(y = 1 | x) / P(y = -1 | x) ].

This can be proved by setting the derivative to zero.
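The one-line proof behind the last bullet (a reconstruction):

```latex
E\!\left[e^{-yF(x)} \mid x\right]
  = P(y{=}1 \mid x)\, e^{-F(x)} + P(y{=}{-1} \mid x)\, e^{F(x)},
\qquad
\frac{\partial}{\partial F(x)}
  = -P(y{=}1 \mid x)\, e^{-F(x)} + P(y{=}{-1} \mid x)\, e^{F(x)} = 0
\;\Longrightarrow\;
F(x) = \frac{1}{2}\,\log\frac{P(y{=}1 \mid x)}{P(y{=}{-1} \mid x)}.
```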

Page 16: An Exponential Criterion (2)

The usual logistic model takes P(y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}); the symmetric transform above corresponds to p(x) = e^{F(x)} / (e^{F(x)} + e^{-F(x)}), differing only by a factor of 2 in F(x).

The Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing J(F) = E[e^{-yF(x)}].

Page 17: Derivation

Write the criterion after adding a term c·f(x), with f(x) ∈ {-1, 1}:

J(F + cf) = E[ e^{-y(F(x) + c f(x))} ] = E[ w(x, y) e^{-c y f(x)} ],   (a)

where w(x, y) = e^{-yF(x)}.

For c > 0, minimizing (a) over f is equivalent to maximizing the weighted agreement E_w[ y f(x) ], i.e. to minimizing the weighted misclassification error of f.

Page 18: Continued…

The solution is f(x) = 1 if E_w(y | x) > 0 and f(x) = -1 otherwise.

Note that minimizing a quadratic approximation to the criterion leads to a weighted least-squares choice of f(x).

Minimizing J(F + cf) over c for this f(x) gives

c = (1/2) log [ (1 - err) / err ],   where err = E_w[ 1(y ≠ f(x)) ].

Page 19: Update for F(x)

Since F(x) ← F(x) + c f(x), the weights update as w(x, y) ← w(x, y) e^{-c y f(x)}, and because -y f(x) = 2·1(y ≠ f(x)) - 1 this is proportional to w(x, y) e^{2c·1(y ≠ f(x))}.

The function and weight updates are therefore of an identical form to those used in Discrete AdaBoost.

Page 20: Corollary

After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.

The weights are updated to make the new weighted problem maximally difficult for the next weak learner.

Page 21: Derivation

The Real AdaBoost algorithm fits an additive logistic regression model by stage-wise and approximate optimization of J(F) = E[e^{-yF(x)}].

Minimizing E[ e^{-yF(x)} e^{-y f(x)} | x ] over f(x), dividing through by E[ e^{-yF(x)} | x ], and setting the derivative with respect to f(x) to zero gives

f(x) = (1/2) log [ P_w(y = 1 | x) / P_w(y = -1 | x) ],   with weights w(x, y) = e^{-yF(x)}.

Page 22: Corollary

At the optimal F(x), the weighted conditional mean of y is 0.

Page 23: Why E[e^{-yF(x)}]?

The population minimizer of E[e^{-yF(x)}] and the maximizer of the expected binomial log-likelihood coincide: both equal (1/2) log [ P(y = 1 | x) / P(y = -1 | x) ].

Page 24: Losses as Approximations to Misclassification Error
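The figure on this slide plots several losses against the margin m = yF(x); the functions being compared (a reconstruction, with squared error rewritten using y² = 1):

```latex
L_{\text{misclass}}(m) = \mathbf{1}[\, m < 0 \,], \qquad
L_{\exp}(m) = e^{-m}, \qquad
L_{\text{log-lik}}(m) = \log\bigl(1 + e^{-2m}\bigr), \qquad
L_{\text{squared}}(m) = (1 - m)^{2}.
```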

Page 25: Direct Optimization of the Binomial Log-likelihood

Fitting additive logistic regression models by stage-wise optimization of the Bernoulli log-likelihood (this yields the LogitBoost algorithm).

Page 26: Derivation of LogitBoost (1)

Newton update for maximizing the expected binomial log-likelihood:

F(x) ← F(x) + (1/2) · E[ y* - p(x) | x ] / E[ p(x)(1 - p(x)) | x ],

where y* = (y + 1)/2 ∈ {0, 1} and p(x) = e^{F(x)} / (e^{F(x)} + e^{-F(x)}).

Page 27: Derivation of LogitBoost (2)

Equivalently, the Newton update f(x) solves a weighted least-squares approximation to the log-likelihood: minimize E_w[ (z - f(x))² | x ] with working response z = (y* - p(x)) / (p(x)(1 - p(x))) and weights w(x) = p(x)(1 - p(x)).
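A minimal code sketch of the resulting two-class LogitBoost loop, assuming a regression-tree base learner that accepts sample weights (an illustration following the weighted least-squares view above, not the paper's exact pseudocode):

```python
# LogitBoost (two classes): each round fits f_m(x) to the working response z
# by weighted least squares and takes a half Newton step.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y, n_rounds=100, z_max=4.0):
    y_star = (y + 1) / 2.0                         # map {-1, +1} to {0, 1}
    F = np.zeros(len(y))                           # current additive fit
    learners = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))         # p(x) = e^F / (e^F + e^-F)
        w = np.clip(p * (1.0 - p), 1e-8, None)     # Newton weights
        z = np.clip((y_star - p) / w, -z_max, z_max)   # working response, clipped for stability
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z, sample_weight=w)            # weighted least-squares fit of f_m(x)
        F += 0.5 * tree.predict(X)                 # half-step Newton update
        learners.append(tree)
    return learners

def logitboost_predict(learners, X):
    F = 0.5 * sum(t.predict(X) for t in learners)
    return np.sign(F)
```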

Page 28: Optimizing E[e^{-yF(x)}] by Newton Stepping

The "Gentle AdaBoost" procedure instead takes adaptive Newton steps, much like the LogitBoost algorithm just described.

Page 29: Derivation

The Gentle AdaBoost algorithm uses Newton steps for minimizing E[e^{-yF(x)}].

Newton update: F(x) ← F(x) + E_w(y | x), with weights w(x, y) = e^{-yF(x)}; the new term f(x) = E_w(y | x) is obtained by a weighted least-squares regression of y on x.
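A minimal sketch of that update in code, again assuming a weighted regression-tree base learner (an illustration, not the paper's algorithm listing):

```python
# Gentle AdaBoost: each round fits f_m(x) ~ E_w(y | x) by weighted least squares,
# adds it to F without a line search, and re-weights exponentially.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_adaboost(X, y, n_rounds=100):
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, y, sample_weight=w)            # weighted LS regression of y on x
        f = tree.predict(X)                        # estimate of E_w(y | x), always in [-1, 1]
        w *= np.exp(-y * f)
        w /= w.sum()
        learners.append(tree)
    return learners

def gentle_adaboost_predict(learners, X):
    F = sum(t.predict(X) for t in learners)
    return np.sign(F)
```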

Page 30: Comparison with Real AdaBoost

Update in Gentle AdaBoost: f(x) = E_w(y | x) = P_w(y = 1 | x) - P_w(y = -1 | x), which always lies in [-1, 1].

Update in Real AdaBoost: f(x) = (1/2) log [ P_w(y = 1 | x) / P_w(y = -1 | x) ]. Log-ratios can be numerically unstable, leading to very large updates in pure regions.

Empirical evidence suggests that this more conservative algorithm has performance similar to both the Real AdaBoost and LogitBoost algorithms.

Page 31: Simulation Studies

Four boosting methods are compared:
- DAB: Discrete AdaBoost
- RAB: Real AdaBoost
- LB: LogitBoost
- GAB: Gentle AdaBoost

Page 32: Data Generation

All of the simulated examples involve fairly complex decision boundaries.

- Ten input features, randomly drawn from a 10-dimensional standard normal distribution.
- Approximately 1000 training observations in each of the classes C1, C2, C3, and 10000 observations in the test set.
- Results are averaged over 10 such independently drawn training/test set combinations, as sketched in the code below.
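A sketch of this setup in code. The rule defining the classes C1, C2, C3 appeared as a displayed equation on the slide and is not reproduced here, so the class-assignment function below is only a placeholder:

```python
# Simulated data: ten standard-normal features, ~1000 training observations per class,
# 10000 test observations, averaged over 10 independently drawn train/test replicates.
import numpy as np

rng = np.random.default_rng(0)

def assign_class(X):
    # Placeholder for the slide's class definitions C1, C2, C3 (a function of the
    # ten features); substitute the actual rule from the slide here.
    raise NotImplementedError

def make_dataset(n_obs, n_features=10):
    X = rng.standard_normal((n_obs, n_features))
    y = assign_class(X)
    return X, y

n_train, n_test, n_replicates = 3000, 10000, 10
```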

Page 33: Additive Decision Boundary (1)

Page 34: Additive Decision Boundary (2)

Page 35: Additive Decision Boundary (3)

Page 36: Boosting Trees with 8-Terminal Nodes

Page 37: Analysis

The optimal decision boundary for the above examples is also additive in the original features, with each component a function of a single x_j.

For RAB, GAB, and LB, the error rate using the bigger trees is in fact 33% higher than that for stumps at 800 iterations, even though the former is four times more complex.

For non-additive decision boundaries, boosting stumps would be less advantageous than using larger trees.

Page 38: Non-additive Decision Boundaries

Higher-order basis functions provide the possibility to more accurately estimate decision boundaries with high-order interactions.

Data generation:
- 2 classes; 5000 training observations drawn from a 10-dimensional normal distribution.
- Class labels were randomly assigned to each observation with log-odds given by a function of the features that involves high-order interactions.

Page 39: Non-additive Decision Boundaries (2)

Page 40: Non-additive Decision Boundaries (3)

Page 41: Analysis

Boosting stumps can sometimes be superior to using larger trees when decision boundaries can be closely approximated by functions that are additive in the original predictor features.

Page 42: Some Experiments with Real-World Data

Datasets: the UC-Irvine machine learning archive plus a popular simulated dataset.

The real-data examples fail to demonstrate performance differences between the various boosting methods.

Page 43: Additive Logistic Trees

ANOVA decomposition: F(x) = Σ_j f_j(x_j) + Σ_{j,k} f_{jk}(x_j, x_k) + Σ_{j,k,l} f_{jkl}(x_j, x_k, x_l) + …

Allowing the base classifier to produce higher order interactions can reduce the accuracy of the final boosted model. Higher order interactions are produced by deeper trees.

Maximum depth becomes a “meta-parameter” of the procedure to be estimated by some model selection technique, such as cross-validation.

Page 44: Additive Logistic Trees (2)

Trees are grown until a maximum number M of terminal nodes is induced.

"Additive logistic trees" (ALT): the combination of truncated best-first trees with boosting.

Another advantage of low-order approximations is model visualization.

Page 45: Weight Trimming

Training observations with weight w_i less than a threshold are not passed to the base learner at that iteration.

Observations deleted at a particular iteration may therefore re-enter at later iterations.

LogitBoost sometimes gains an advantage from weight trimming: its weights measure nearness to the currently estimated decision boundary.

For the other three procedures the weight is monotone in -y_i F(x_i), so the subsample passed to the base learner can be highly unbalanced.
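A sketch of weight trimming in code. The slide does not give the exact threshold rule, so the fraction-of-total-weight criterion below is an assumption used only for illustration:

```python
import numpy as np

def trimmed_subsample(w, beta=0.9):
    # Keep the highest-weight observations that together carry at least a fraction
    # `beta` of the total weight; the rest are skipped for this iteration only
    # (they keep their weights and may re-enter later). The exact threshold rule
    # is an assumption, not taken from the slide.
    order = np.argsort(w)[::-1]                  # indices, largest weight first
    cum = np.cumsum(w[order]) / np.sum(w)
    keep = order[: np.searchsorted(cum, beta) + 1]
    return np.sort(keep)

# Usage inside a boosting round (sketch):
#   idx = trimmed_subsample(w)
#   weak_learner.fit(X[idx], y[idx], sample_weight=w[idx])
```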

Page 46: The Test Error for the Letter Recognition Problem

Page 47: Further Generalizations of Boosting

The Newton step can be replaced by a gradient step, slowing down the fitting procedure and reducing susceptibility to overfitting.

Any smooth loss function can be used.

Page 48: Concluding Remarks

Bagging and randomized trees are "variance"-reducing techniques.

Boosting appears to be mainly a "bias"-reducing procedure.

Boosting seems resistant to overfitting:
- As the LogitBoost iterations proceed, the overall impact of the changes introduced by f_m(x) diminishes.
- The stage-wise nature of the boosting algorithms does not allow the full collection of parameters to be fit jointly, so the fit has far lower variance than the full parameterization might suggest.
- Classifiers are hurt less by overfitting than other function estimators.