Ensemble Classification Methods
Rayid Ghani
IR Seminar – 9/26/00

What is Ensemble Classification?
- Set of classifiers
- Decisions combined in "some" way
- Often more accurate than the individual classifiers
- What properties should the base learners have?

Why should it work?
- More accurate ONLY if the individual classifiers disagree
- Error rate < 0.5 and errors are independent
- Ensemble error rate is highly correlated with the correlations of the errors made by the different learners (Ali & Pazzani)
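
The independence condition can be made concrete with a small calculation. The sketch below is an illustration, not something from the slides. It computes the probability that a majority vote of n classifiers is wrong, assuming every classifier has the same error rate p and that errors are fully independent (an idealization). The name `majority_vote_error` is made up for this example.

```python
from math import comb

def majority_vote_error(n, p):
    """Probability that more than half of n independent classifiers err.

    Assumes n is odd (no ties), each classifier errs with probability p,
    and errors are independent across classifiers.
    """
    k_min = n // 2 + 1  # smallest number of wrong votes that flips the majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# For n=3, p=0.3 this is 3 * 0.3^2 * 0.7 + 0.3^3, i.e. about 0.216.
for n in (1, 3, 21):
    print(n, majority_vote_error(n, 0.3), majority_vote_error(n, 0.6))
```

With p below 0.5 the vote error shrinks as n grows. With p above 0.5 it grows, which is why the error-rate condition matters as much as the independence condition.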

Averaging Fails!
- Use delta functions as classifiers (predict +1 at a point and -1 everywhere else)
- For training sample size m, construct a set of at most 2m classifiers s.t. the majority vote is always correct
  - Associate 1 delta function with every example
  - Add M+ (# of +ve examples) copies of the function that predicts +1 everywhere and M- (# of -ve examples) copies of the function that predicts -1 everywhere
- Applying boosting to this results in zero training error but bad generalization
- Applying the margin analysis results in zero training error but the margin is small, O(1/m)

Ideas?
- Subsampling training examples
  - Bagging, Cross-Validated Committees, Boosting
- Manipulating input features
  - Choose different features
- Manipulating output targets
  - ECOC and variants
- Injecting randomness
  - NN (different initial weights), DT (pick different splits), injecting noise, MCMC
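
As an illustration of the first idea, subsampling training examples, here is a minimal bagging sketch. It is not taken from the slides. The base learner (`train_stump`, a 1-D threshold classifier), the toy data, and all names are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a 1-D threshold classifier: predict `hi` if x > t, else `lo`."""
    best = None
    for t, _ in data:                       # candidate thresholds = observed x values
        for lo, hi in ((-1, 1), (1, -1)):
            errors = sum((hi if x > t else lo) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: hi if x > t else lo

def bagging(data, n_models, rng):
    """Train each model on a bootstrap sample; combine by unweighted vote."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # draw m examples with replacement
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

rng = random.Random(0)
data = [(x, 1 if x > 5 else -1) for x in range(11)]   # toy data: label flips at x = 5
ensemble = bagging(data, n_models=25, rng=rng)
print(ensemble(0), ensemble(10))
```

An odd number of models avoids ties in the two-class vote. The other ideas on the slide change what varies between models (features, targets, or random seeds) but can reuse the same voting step.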

Combining Classifiers
- Unweighted voting
  - Bagging, ECOC etc.
- Weighted voting
  - Weight by accuracy (training or holdout set), LSR (weights 1/variance)
- Bayesian model averaging
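
The first two combination schemes differ only in the weights given to each vote. A minimal sketch, with the hypothetical helper `weighted_vote` and made-up weights standing in for holdout accuracies:

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Return the label with the largest total weight.

    With weights=None every classifier counts equally (unweighted voting).
    Ties go to the label that was seen first.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)

votes = ["a", "b", "b"]
print(weighted_vote(votes))                    # unweighted: two votes for "b"
print(weighted_vote(votes, [0.9, 0.3, 0.3]))   # weighted: 0.9 for "a" vs 0.6 for "b"
```

Bayesian model averaging fits the same interface if the weights are posterior model probabilities.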

BMA
- All possible models in the model space are used, weighted by their probability of being the "correct" model
- Optimal given the correct model space and priors
- Not widely used even though it was said not to overfit (Buntine, 1990)

BMA - Equations

$$P(h \mid \vec{x}, \vec{c}) = \frac{\overbrace{P(h)}^{\text{prior}}}{P(\vec{x}, \vec{c})} \; \overbrace{\prod_{i=1}^{n} P(x_i, c_i \mid h)}^{\text{likelihood}}$$

$$P(x_i, c_i \mid h) = P(x_i \mid h) \, \underbrace{P(c_i \mid x_i, h)}_{\text{noise model}}$$

$$P(c \mid x, \vec{x}, \vec{c}, H) = \sum_{h \in H} P(c \mid x, h) \, P(h \mid \vec{x}, \vec{c})$$

Equations
- Posterior
- Uniform noise model
- Pure classification model
- Model space too large: approximation required
  - Model with highest posterior, sampling
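
To make the posterior under a uniform noise model concrete, the sketch below assumes a common form of that model: each training label agrees with the model's prediction with probability 1 - eps and disagrees with probability eps, so the likelihood is proportional to (1 - eps)^s * eps^(n - s) for a model that gets s of n training examples right. The uniform prior, the value of eps, and the name `bma_weights` are assumptions made for this illustration, and P(x_i | h) is taken to be the same for all models so that it cancels.

```python
import math

def bma_weights(n_correct, n, eps, priors=None):
    """Posterior model weights under a uniform class-noise model.

    n_correct[j] is the number of the n training examples model j gets right.
    Computed in log space to avoid underflow for large n.
    """
    k = len(n_correct)
    priors = priors or [1.0 / k] * k          # assumption: uniform prior over models
    log_post = [math.log(p) + s * math.log(1 - eps) + (n - s) * math.log(eps)
                for p, s in zip(priors, n_correct)]
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Three models that differ by only a few training errors out of 100.
weights = bma_weights([90, 88, 85], n=100, eps=0.1)
print(weights)
```

With eps = 0.1, each extra training error divides a model's weight by 9, so a two-error gap already gives a weight ratio of 81. Small differences in training accuracy therefore produce very skewed posteriors.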

BMA of Bagged C4.5 Rules
- Bagging as a form of importance sampling where all samples are weighted equally
- Experimental results
  - Every version of BMA performed worse than bagging on 19 out of 26 datasets
  - Posteriors skewed: dominated by a single rule model, so the result is model selection rather than averaging

BMA of various learners
- RISE
  - Rule sets with partitioning
  - 8 databases from UCI
  - BMA worse than RISE in every domain
- Trading rules
  - Intuition (there is no single right rule, so BMA should help)
  - BMA similar to choosing the single best rule

Overfitting in BMA
- Issue of overfitting is usually ignored (Freund et al. 2000)
- Is overfitting the explanation for the poor performance of BMA?
- Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data
- Overfitting is the result of the likelihood's exponential sensitivity to random fluctuations in the sample, and it increases with the # of models considered
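
The dependence on the number of models considered can be seen in a small simulation. This is a hedged illustration rather than anything from the talk. Every simulated model has the same true error rate, so any difference in training error is pure chance, yet the best training error keeps improving as more models are examined. The function name and parameter values are made up.

```python
import random

def best_training_error(n_models, n_examples, true_error, rng):
    """Lowest training error among models that all share the same true error."""
    best = 1.0
    for _ in range(n_models):
        # Each example is misclassified independently with probability true_error.
        errors = sum(rng.random() < true_error for _ in range(n_examples))
        best = min(best, errors / n_examples)
    return best

rng = random.Random(0)
for n_models in (1, 10, 100, 1000):
    print(n_models, best_training_error(n_models, 50, 0.3, rng))
```

The gap between the apparent winner's training error and its true error of 0.3 is the overfitting the slide describes. A likelihood that depends exponentially on the error count concentrates nearly all weight on that lucky model.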

To BMA or not to BMA?
- Net effect will depend on which effect prevails
  - Increased overfitting (small if few models are considered)
  - Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect)
- Ali & Pazzani (1996) report good results but bagging wasn't tried
- Domingos (2000) used bootstrapping before BMA so the models were built from less data

Why do they work?
- Bias / variance decomposition
- Training data insufficient for choosing a single best classifier
- Learning algorithms not "smart" enough!
- Hypothesis space may not contain the true function

Definitions
- Bias is the persistent/systematic error of a learner, independent of the training set. Zero for a learner that always makes the optimal prediction.
- Variance is the error incurred by fluctuations in response to different training sets. Independent of the true value of the predicted variable, and zero for a learner that always predicts the same class regardless of the training set.

Bias–Variance Decomposition
- Kong & Dietterich (1995): variance can be negative and noise is ignored
- Breiman (1996): undefined for any given example, and variance can be zero even when the learner's predictions fluctuate
- Tibshirani (1996), Hastie (1997)
- Kohavi & Wolpert (1996): allows the bias of the Bayes optimal classifier to be non-zero
- Friedman (1997): leaves bias and variance for zero-one loss undefined

Domingos (2000)
- Single definition of bias and variance
- Applicable to "any" loss function
- Explains the margin effect (Schapire et al. 1997) using the decomposition
- Incorporates variable misclassification costs
- Experimental study

Unified Decomposition
- Loss functions
  - Squared: L(t,y) = (t-y)^2
  - Absolute: L(t,y) = |t-y|
  - Zero-one: L(t,y) = 0 if y = t, else 1
- Goal = minimize average L(t,y) over all weighted examples
- Expected loss at x decomposes as c1 N(x) + B(x) + c2 V(x) (noise, bias, variance)
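
For reference, the terms of the expression c1 N(x) + B(x) + c2 V(x) can be written out as below. This is a sketch following the notation of Domingos (2000) as best recalled, and the slide itself does not spell it out. The symbols y_* (optimal prediction at x), y_m (main prediction over training sets D) and the expectation subscripts are notation introduced here.

$$
\begin{aligned}
y_m &= \arg\min_{y'} E_D[L(y, y')] \\
N(x) &= E_t[L(t, y_*)], \qquad B(x) = L(y_*, y_m), \qquad V(x) = E_D[L(y_m, y)] \\
E_{D,t}[L(t, y)] &= c_1 N(x) + B(x) + c_2 V(x)
\end{aligned}
$$

For squared loss, c_1 = c_2 = 1, which recovers the usual additive decomposition. For zero-one loss with two classes, c_1 = 2 P_D(y = y_*) - 1, and c_2 = +1 if y_m = y_* (an unbiased example) and -1 otherwise (a biased example), so variance increases the loss on unbiased examples and decreases it on biased ones.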

Properties of the unified decomposition
- Relation to order-correct learner
- Relation to margin of a learner
- Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones.

Experimental Study
- 30 UCI datasets
- Methodology
  - 100 bootstrap samples, averaged over the test set with uniform weights
  - Estimate bias, variance, zero-one loss
- DT, kNN, boosting
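
A rough sketch of how such estimates could be computed for zero-one loss follows. It assumes that one model has been trained per bootstrap sample and that all of them have been applied to the same test set. It also assumes zero noise, so the test label is treated as the optimal prediction. The function name, the input layout, and the toy numbers are hypothetical, and this is not the code used in the study.

```python
from collections import Counter

def zero_one_decomposition(predictions, targets):
    """Estimate average bias, variance and loss for zero-one loss.

    predictions[d][i] is the label predicted for test example i by the model
    trained on the d-th bootstrap sample; targets[i] is the test label, used
    here as the optimal prediction (noise is assumed to be zero).
    """
    n_runs = len(predictions)
    n = len(targets)
    bias = var_unbiased = var_biased = loss = 0.0
    for i, t in enumerate(targets):
        column = [run[i] for run in predictions]
        main = Counter(column).most_common(1)[0][0]   # main prediction = most frequent label
        v = sum(y != main for y in column) / n_runs   # how often a run departs from it
        loss += sum(y != t for y in column) / n_runs
        if main == t:
            var_unbiased += v
        else:
            bias += 1.0
            var_biased += v
    return {
        "bias": bias / n,
        "var_unbiased": var_unbiased / n,
        "var_biased": var_biased / n,
        "loss": loss / n,
    }

# Toy input: 3 bootstrap runs, 3 test examples, all with true label 1.
runs = [[1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
result = zero_one_decomposition(runs, targets=[1, 1, 1])
print(result)
```

For two classes and zero noise, the average loss equals bias + var_unbiased - var_biased. Reporting the two variance terms separately shows where variance hurts and where it helps. Using an odd number of runs avoids ties when picking the main prediction.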

Boosting C4.5 - Results
- Decreases both bias and variance
- Bulk of bias reduction happens in the first few rounds
- Variance reduction is more gradual and the dominant effect

kNN results
- kNN: the increase in bias with k dominates the variance reduction
- However, increasing k has the effect of reducing variance on unbiased examples while increasing it on biased ones

Issues
- Does not work with "any" loss function, e.g. absolute loss
- Decomposition is not purely additive, unlike the original one for squared loss

Spectrum of ensembles
[Figure: Bagging, Boosting and BMA arranged along a spectrum; axis labels: asymmetry of weights, overfitting]

Open Issues concerning ensembles
- Best way to construct ensembles?
- No extensive comparison done
- Computationally expensive
- Not easily comprehensible

Bibliography
- Overview
  - T. Dietterich
  - Bauer & Kohavi
- Averaging
  - Domingos
  - Freund, Mansour, Schapire
  - Ali, Pazzani
- Bias–Variance Decomposition
  - Kohavi & Wolpert
  - Domingos
  - Friedman
  - Kong & Dietterich