1
Additive Logistic Regression: a Statistical View of Boosting
J. Friedman, T. Hastie, & R. Tibshirani
The Annals of Statistics
2
Outline
Introduction
A brief history of boosting
Additive models
AdaBoost – an additive logistic regression model
Simulation studies
3
Discrete AdaBoost
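The algorithm on this slide is shown only as a figure. Below is a minimal sketch of the data version of Discrete AdaBoost, with decision stumps from scikit-learn as the weak learner; the stump choice and the 1/2 factor in the step size (which follows the paper's derivation and differs from the original AdaBoost only by an overall scale) are my assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, n_rounds=50):
    """Discrete AdaBoost sketch; y must be coded as +/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # uniform starting weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # weak learner: a stump (assumption)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)                      # f_m(x) in {-1, +1}
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # c_m = 1/2 log((1 - err)/err)
        w *= np.exp(-alpha * y * pred)               # up-weight misclassified points
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def discrete_predict(learners, alphas, X):
    F = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(F)                                # sign of the weighted committee
```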
4
Performance of Discrete AdaBoost
5
Re-sampling in AdaBoost
Connection with bagging: bagging is a variance reduction technique. Is boosting also a variance reduction technique?
Boosting performs comparably well when a weighted tree-growing algorithm is used rather than weighted resampling, i.e. when the randomization component is removed.
Stumps have low variance but high bias.
Boosting is capable of both bias and variance reduction.
6
Real AdaBoost
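The Real AdaBoost algorithm is likewise shown only as a figure. A minimal sketch follows, assuming a class-probability-estimating weak learner (here a small scikit-learn tree, my choice): each stage converts the weighted probability estimate into a half log-ratio contribution and re-weights.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, n_rounds=50, eps=1e-5):
    """Real AdaBoost sketch; y coded as +/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stages = []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_leaf_nodes=8)          # probability-estimating weak learner (assumption)
        tree.fit(X, y, sample_weight=w)
        p = np.clip(tree.predict_proba(X)[:, 1], eps, 1 - eps)   # p_m(x) = P_w(y = 1 | x)
        f = 0.5 * np.log(p / (1 - p))                            # contribution f_m(x) = 1/2 log-ratio
        w *= np.exp(-y * f)                                      # exponential re-weighting
        w /= w.sum()
        stages.append(tree)
    return stages

def real_predict(stages, X, eps=1e-5):
    F = np.zeros(len(X))
    for tree in stages:
        p = np.clip(tree.predict_proba(X)[:, 1], eps, 1 - eps)
        F += 0.5 * np.log(p / (1 - p))
    return np.sign(F)
```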
7
Statistical Interpretation of AdaBoost
Fitting an additive model F(x) = Σ_m f_m(x) by minimizing squared-error loss in a forward stagewise manner: at the mth stage, fix f_1(x), ..., f_{m-1}(x) and minimize the squared error to obtain f_m(x).
AdaBoost fits an additive model using a criterion similar to, but not the same as, the binomial log-likelihood. A loss function better suited to classification than squared error is needed.
8
A Brief History of Boosting
The first simple boosting procedure was developed in the PAC-learning framework ("The Strength of Weak Learnability").
After learning an initial classifier h1 on the first N training points:
h2 is learned on a new sample of N points, half of which are misclassified by h1.
h3 is learned on N points for which h1 and h2 disagree.
The boosted classifier is hB = Majority Vote(h1, h2, h3).
9
Additive Models
Additive regression models
Extended additive models
Classification problems
10
Additive Regression Models
Modeling the mean E(y|x). The additive model: E(y|x) = Σ_{j=1}^p f_j(x_j).
There is a separate function f_j(x_j) for each of the p input variables x_j.
Backfitting algorithm: a modular "Gauss-Seidel" algorithm for fitting additive models.
Backfitting update: f_j(x_j) ← E[ y − Σ_{k≠j} f_k(x_k) | x_j ], cycling over j = 1, ..., p.
Backfitting cycles are repeated until convergence; backfitting converges to the minimizer of E[y − Σ_j f_j(x_j)]² under fairly general conditions.
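A minimal sketch of backfitting for illustration, using a crude binned-mean smoother in place of a real scatterplot smoother (the smoother choice and the function names are my assumptions):

```python
import numpy as np

def smoother_fit(x, r, n_bins=20):
    """Crude scatterplot smoother: bin x and average the partial residual r within each bin."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return means[idx]                                # fitted values at the training points

def backfit(X, y, n_cycles=20):
    """Backfitting (Gauss-Seidel): cycle over coordinates, smoothing partial residuals."""
    n, p = X.shape
    f = np.zeros((n, p))                             # current estimates f_j(x_j) at the data points
    for _ in range(n_cycles):
        for j in range(p):
            partial = y - f.sum(axis=1) + f[:, j]    # y minus all the other components
            f[:, j] = smoother_fit(X[:, j], partial)
            f[:, j] -= f[:, j].mean()                # center each component for identifiability
    return f
```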
11
Extended Additive Models (1)
Additive models whose elements f_m(x) are functions of potentially all of the input features x.
If we set f_m(x) = β_m b(x; γ_m), a simple basis function of x characterized by parameters γ_m with multiplier β_m:
Generalized backfitting algorithm updates: {β_m, γ_m} ← arg min_{β,γ} E[ y − Σ_{k≠m} f_k(x) − β b(x; γ) ]².
Greedy forward stepwise approach: {β_m, γ_m} ← arg min_{β,γ} E[ y − F_{m−1}(x) − β b(x; γ) ]², where F_{m−1}(x) = Σ_{k<m} f_k(x).
12
Extended Additive Models (2)
An algorithm for fitting a single weak learner f(x) = β b(x; γ) to data is applied repeatedly, in the forward stepwise procedure, to modified versions of the data (the residuals y − F_{m−1}(x)).
This can be viewed as a procedure for boosting a weak learner to form a powerful committee F_M(x) = Σ_{m=1}^M β_m b(x; γ_m).
13
Classification Problems
Additive logistic regression: log[ P(y=1|x) / (1 − P(y=1|x)) ] = F(x) = Σ_{m=1}^M f_m(x).
Inverting: P(y=1|x) = e^{F(x)} / (1 + e^{F(x)}).
These models are usually fit by maximizing the binomial log-likelihood.
14
AdaBoost – an Additive Logistic Regression Model
AdaBoost can be interpreted as a stage-wise estimation procedure for fitting an additive logistic regression model.
AdaBoost optimizes an exponential criterion which, to second order, is equivalent to the binomial log-likelihood criterion.
A more standard likelihood-based boosting procedure is proposed.
15
An Exponential Criterion (1)
Minimizing the criterion J(F) = E[e^{−yF(x)}]. The function F(x) that minimizes J(F) is the symmetric logistic transform of P(y=1|x).
This can be proved by setting the derivative to zero, as sketched below.
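The slide states the result without the algebra; the missing step runs roughly as follows, conditioning on x:

```latex
% J(F) = E[e^{-yF(x)}]; condition on x and differentiate in F(x):
\[
\begin{aligned}
E\!\left[e^{-yF(x)} \mid x\right]
  &= P(y=1\mid x)\,e^{-F(x)} + P(y=-1\mid x)\,e^{F(x)},\\
\frac{\partial}{\partial F(x)} E\!\left[e^{-yF(x)} \mid x\right]
  &= -P(y=1\mid x)\,e^{-F(x)} + P(y=-1\mid x)\,e^{F(x)} = 0\\
\Longrightarrow\quad
F^{*}(x) &= \tfrac{1}{2}\,\log\frac{P(y=1\mid x)}{P(y=-1\mid x)}.
\end{aligned}
\]
```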
16
An Exponential Criterion (2)
The usual logistic model parameterizes P(y=1|x) = e^{F(x)} / (1 + e^{F(x)}); the exponential-criterion minimizer instead gives the symmetric form P(y=1|x) = e^{F(x)} / (e^{F(x)} + e^{−F(x)}).
The Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing J(F) = E[e^{−yF(x)}].
17
Derivation
Expand J(F + cf) = E[e^{−y(F(x) + cf(x))}] to second order about f(x) = 0 (a), where the expectation uses the weights w = w(x, y) = e^{−yF(x)}.
For c > 0, minimizing (a) is equivalent to maximizing E_w[y f(x)].
18
Continued…
The solution for f(x) is the weak learner minimizing the weighted misclassification error (see the reconstruction below).
Note that, since y² = f(x)² = 1, minimizing a quadratic approximation to the criterion leads to a weighted least-squares choice of f(x).
Minimizing J(F + cf) to determine c gives the half log-odds of the weighted error, as sketched below.
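The formulas on these slides are not reproduced in the transcript; the following reconstruction of the main steps follows the paper's derivation and is written from memory, so treat it as a sketch rather than a verbatim copy of the slide:

```latex
% Quadratic (Newton-like) expansion of J(F + cf) = E[e^{-y(F(x)+cf(x))}] for f(x) in {-1, +1}:
\[
\begin{aligned}
J(F+cf) &\approx E\!\left[e^{-yF(x)}\bigl(1 - y\,c\,f(x) + c^{2}/2\bigr)\right]
  \qquad (y^{2} = f(x)^{2} = 1),\\
\hat f &= \arg\max_{f}\; E_{w}\!\left[y f(x)\right],
  \quad w(x,y) = e^{-yF(x)}
  \;\;\text{(a weighted classification problem)},\\
\hat c &= \arg\min_{c}\; E_{w}\!\left[e^{-c\,y f(x)}\right]
  = \tfrac{1}{2}\log\frac{1-\mathrm{err}}{\mathrm{err}},
  \qquad \mathrm{err} = E_{w}\!\left[\mathbf{1}_{\,y \neq f(x)}\right].
\end{aligned}
\]
```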
19
Update for F(x)
Since −y f(x) = 2·1_{[y ≠ f(x)]} − 1, the weight update w ← w·e^{−c y f(x)} is, up to normalization, the same as w ← w·e^{2c·1_{[y ≠ f(x)]}}.
The function and weight updates are of an identical form to those used in Discrete AdaBoost.
20
Corollary
After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%
Weights are updated to make the new weighted problem maximally difficult for the next weak learner
21
Derivation
The Real AdaBoost algorithm fits an additive logistic regression model by stage-wise and approximate optimization of J(F) = E[e^{−yF(x)}].
Dividing through by E[e^{−yF(x)} | x] and setting the derivative w.r.t. f(x) to zero gives f(x) = (1/2) log[ P_w(y=1|x) / P_w(y=−1|x) ].
22
Corollary
At the optimal F(x), the weighted conditional mean of y is 0.
23
Why Ee^{−yF(x)}?
The population minimizers of E[e^{−yF(x)}] and E[log(1 + e^{−2yF(x)})] coincide.
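For comparison with the exponential criterion above, the population minimizer of the negative binomial log-likelihood can be written out the same way (assuming y ∈ {−1, +1}):

```latex
% Expected negative log-likelihood, conditional on x:
\[
\begin{aligned}
E\!\left[\log\!\bigl(1+e^{-2yF(x)}\bigr)\,\middle|\,x\right]
 &= P(y=1\mid x)\log\!\bigl(1+e^{-2F(x)}\bigr)
  + P(y=-1\mid x)\log\!\bigl(1+e^{2F(x)}\bigr),\\
\text{setting the derivative to zero:}\quad
F^{*}(x) &= \tfrac{1}{2}\log\frac{P(y=1\mid x)}{P(y=-1\mid x)},
\end{aligned}
\]
```

which is the same population minimizer as for the exponential criterion.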
24
Losses as Approximations to Misclassification Error
25
Direct Optimization of the Binomial Log-likelihood
Fitting additive logistic regression models by stage-wise optimization of the Bernoulli log-likelihood.
26
Derivation of LogitBoost (1)
Newton update: F(x) ← F(x) + (1/2)·E_w[z | x], where z = (y* − p(x)) / (p(x)(1 − p(x))), w = p(x)(1 − p(x)), y* = (y + 1)/2 ∈ {0, 1}, and p(x) = e^{F(x)} / (e^{F(x)} + e^{−F(x)}).
27
Derivation of LogitBoost (2)
Equivalently, the Newton update f(x) solves the weighted least-squares approximation to the log-likelihood: a weighted regression of the working response z on x.
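A minimal LogitBoost sketch for the two-class case, with regression stumps as the weighted least-squares learner; the stump choice, the clipping of the working response, and the scikit-learn usage are my assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y, n_rounds=50, z_max=4.0):
    """LogitBoost sketch (two classes); y coded as +/-1, mapped internally to y* in {0, 1}."""
    n = len(y)
    ystar = (y + 1) / 2.0                            # y* in {0, 1}
    F = np.zeros(n)
    p = np.full(n, 0.5)
    stages = []
    for _ in range(n_rounds):
        w = np.clip(p * (1 - p), 1e-10, None)        # Newton weights w = p(1 - p)
        z = np.clip((ystar - p) / w, -z_max, z_max)  # working response, clipped for stability
        tree = DecisionTreeRegressor(max_depth=1)    # weighted least-squares weak learner (assumption)
        tree.fit(X, z, sample_weight=w)
        F += 0.5 * tree.predict(X)                   # F(x) <- F(x) + (1/2) f_m(x)
        p = 1.0 / (1.0 + np.exp(-2.0 * F))           # p(x) = e^F / (e^F + e^{-F})
        stages.append(tree)
    return stages

def logitboost_predict(stages, X):
    F = 0.5 * sum(tree.predict(X) for tree in stages)
    return np.sign(F)
```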
28
Optimizing Ee^{−yF(x)} by Newton Stepping
The "Gentle AdaBoost" procedure instead takes adaptive Newton steps, much like the LogitBoost algorithm just described.
29
Derivation
The Gentle AdaBoost algorithm uses Newton steps for minimizing Ee^{−yF(x)}.
Newton update: F(x) ← F(x) + E_w[y | x], with weights w = e^{−yF(x)} (see the sketch below).
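A minimal Gentle AdaBoost sketch; the regression-stump weak learner and the scikit-learn usage are my choices, not the slide's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_adaboost(X, y, n_rounds=50):
    """Gentle AdaBoost sketch; each step fits f_m(x) ~ E_w[y | x] by weighted least squares."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stages = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=1)   # weak regression learner (assumption)
        tree.fit(X, y, sample_weight=w)             # weighted least-squares fit of y on x
        f = tree.predict(X)                         # f_m(x), the Newton step
        w *= np.exp(-y * f)                         # exponential re-weighting
        w /= w.sum()
        stages.append(tree)
    return stages

def gentle_predict(stages, X):
    F = sum(tree.predict(X) for tree in stages)
    return np.sign(F)
```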
30
Comparison with Real AdaBoost
Update in Gentle AdaBoost: f_m(x) = E_w[y | x].
Update in Real AdaBoost: f_m(x) = (1/2) log[ P_w(y=1|x) / P_w(y=−1|x) ].
Log-ratios can be numerically unstable, leading to very large updates in pure regions.
Empirical evidence suggests that this more conservative algorithm has similar performance to both the Real AdaBoost and LogitBoost algorithms.
31
Simulation Studies
Four boosting methods are compared here:
DAB: Discrete AdaBoost
RAB: Real AdaBoost
LB: LogitBoost
GAB: Gentle AdaBoost
32
Data Generation
All of the simulated examples involve fairly complex decision boundaries.
Ten input features randomly drawn from a 10-dimensional standard normal distribution.
Approximately 1000 training observations in each class; 10000 observations in the test set. Results are averaged over 10 such independently drawn training/test set combinations.
C1, C2, C3: definitions of the three class regions.
33
Additive Decision Boundary (1)
34
Additive Decision Boundary (2)
35
Additive Decision Boundary (3)
36
Boosting Trees with 8 Terminal Nodes
37
Analysis
The optimal decision boundary for the above examples is also additive in the original features.
For RAB, GAB, and LB the error rate using the bigger trees is in fact 33% higher than that for stumps at 800 iterations, even though the former is four times more complex.
For non-additive decision boundaries, boosting stumps would be less advantageous than using larger trees.
38
Non-additive Decision Boundaries
Higher-order basis functions provide the possibility to more accurately estimate decision boundaries with high-order interactions.
Data generation: 2 classes; 5000 training observations drawn from a 10-dimensional normal distribution. Class labels were randomly assigned to each observation according to a specified log-odds model.
39
Non-additive Decision Boundaries (2)
40
Non-additive Decision Boundaries (3)
41
Analysis
Boosting stumps can sometimes be superior to using larger trees when decision boundaries can be closely approximated by functions that are additive in the original predictor features.
42
Some Experiments with Real-World Data
Datasets from the UC-Irvine machine learning archive, plus a popular simulated dataset.
The real data examples fail to demonstrate performance differences between the various boosting methods.
43
Additive Logistic Trees
ANOVA decomposition of the boosted model (see the sketch below).
Allowing the base classifier to produce higher-order interactions can reduce the accuracy of the final boosted model; higher-order interactions are produced by deeper trees.
Maximum depth becomes a "meta-parameter" of the procedure, to be estimated by some model selection technique such as cross-validation.
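The ANOVA decomposition referred to above is not reproduced in the transcript; the standard form it refers to is:

```latex
% Functional ANOVA decomposition of the boosted model F(x):
\[
F(x) \;=\; \sum_{j} f_{j}(x_{j})
      \;+\; \sum_{j<k} f_{jk}(x_{j},x_{k})
      \;+\; \sum_{j<k<l} f_{jkl}(x_{j},x_{k},x_{l}) \;+\; \cdots
\]
```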
44
Additive Logistic Trees (2)
Growing trees until a maximum number M of terminal nodes is induced.
"Additive logistic trees" (ALT): the combination of truncated best-first trees with boosting.
Another advantage of low-order approximations is model visualization.
45
Weight Trimming
Training observations with weight w_i less than a threshold are not used to train the weak learner at that iteration (see the sketch below).
Observations deleted at a particular iteration may therefore re-enter at later iterations.
LogitBoost sometimes gains a particular advantage from weight trimming: its weights measure nearness to the currently estimated decision boundary.
For the other three procedures the weight is monotone in −y_i F(x_i), so the subsample passed to the base learner can be highly unbalanced.
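A minimal sketch of the subsetting step inside a boosting iteration: observations carrying the smallest share of the total weight are dropped for that iteration. The fraction alpha, the function names, and the surrounding boosting loop are my assumptions.

```python
import numpy as np

def trim_indices(w, alpha=0.1):
    """Return indices of observations kept for this iteration.

    Observations are sorted by weight; the lowest-weight points whose total
    mass is at most alpha are dropped.  Dropped points keep their weights and
    may re-enter at later iterations once their weights grow again.
    """
    order = np.argsort(w)                    # ascending by weight
    cum = np.cumsum(w[order]) / w.sum()      # cumulative share of total weight
    keep = order[cum > alpha]                # drop the lowest-weight mass alpha
    return np.sort(keep)

# Inside a boosting iteration one would then fit the weak learner on the subset:
#   idx = trim_indices(w, alpha=0.1)
#   weak_learner.fit(X[idx], y[idx], sample_weight=w[idx])
```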
46
The test error for the letter recognition problem
47
Further Generalizations of Boosting
The Newton step can be replaced by a gradient step, slowing down the fitting procedure and reducing susceptibility to overfitting (a sketch follows).
Any smooth loss function can be used.
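As an illustration of the gradient-step idea with a generic smooth loss L(y, F): fit the weak learner to the pointwise negative gradient and take a small step. All names, the step size, and the scikit-learn usage here are illustrative assumptions, not the slide's notation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_step(X, y, F, loss_grad, step=0.1):
    """One gradient-boosting step for a smooth loss.

    loss_grad(y, F) returns dL/dF evaluated pointwise; e.g. for the
    exponential loss L = exp(-y*F) it is -y * exp(-y*F).
    """
    residual = -loss_grad(y, F)                  # pseudo-residuals (negative gradient)
    tree = DecisionTreeRegressor(max_depth=1)    # weak learner (assumption)
    tree.fit(X, residual)
    return F + step * tree.predict(X), tree      # small step slows down the fit
```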
48
Concluding Remarks
Bagging and randomized trees are "variance" reducing techniques.
Boosting appears to be mainly a "bias" reducing procedure.
Boosting seems resistant to overfitting:
As the LogitBoost iterations proceed, the overall impact of changes introduced by f_m(x) decreases.
The stage-wise nature of the boosting algorithms does not allow the full collection of parameters to be jointly fit, and thus has far lower variance than the full parameterization might suggest.
Classifiers are hurt less by overfitting than other function estimators.