CS7616 Pattern Recognition – A. Bobick
Bootstrapping, Bagging, Stacking
Aaron Bobick, School of Interactive Computing
Administrivia
• End of Chapter 7 and most of Chapter 8 of the Hastie book.
• Slides brought to you by Bibhas Chakraborty and friends.
• New problem set out tonight or tomorrow:
  • Similar datasets
  • Apply K-NN and Naïve Bayes methods
  • For NB, do density estimation for each feature
  • For both, experiment with cross-validation (and maybe BIC)
  • For large datasets, create a real holdout set; use different CV to find the best K or other parameters to predict test error; then actually measure the test error.
Before the snow
• We were looking at AIC, BIC and cross-validation to determine the “best” parameters to use.
• The goal is to prevent over-fitting to the training data and to predict the actual error rate.
• Overall CV strategy:
  • Pick many sets of training samples and train a regressor or classifier.
  • The overall average behavior should be predictive of new data.
• But there is more we can do with multiple sets.
Bootstrap Method
• A general tool for assessing statistical accuracy.
• Suppose we have a model to fit the training data:
  $Z = \{(x_i, y_i),\ i = 1, \ldots, N\}$
• The idea is to draw random samples with replacement of size $N$ from the training data. This process is repeated $B$ times to get $B$ bootstrap datasets.
• Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the $B$ replications.
Bootstrap
The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
Bootstrap (Cont’d)
• Let $S(Z)$ be any quantity computed from the data $Z$. From the bootstrap sampling, we can estimate any aspect of the distribution of $S(Z)$.
• For example, its variance is estimated by
  $\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} \big(S(Z^{*b}) - \bar{S}^*\big)^2$,
  where $\bar{S}^* = \sum_b S(Z^{*b}) / B$.
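As a concrete illustration, here is a minimal non-parametric bootstrap sketch in Python (the variable names and the choice of the sample median as the statistic $S$ are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_variance(z, stat, B=1000):
        """Estimate Var[S(Z)] by recomputing `stat` on B bootstrap resamples."""
        N = len(z)
        # Draw B datasets of size N with replacement; compute S on each.
        stats = np.array([stat(z[rng.integers(0, N, size=N)]) for _ in range(B)])
        # (1 / (B-1)) * sum_b (S(Z*b) - S_bar)^2
        return stats.var(ddof=1)

    z = rng.normal(size=50)               # toy data
    print(bootstrap_variance(z, np.median))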
Bootstrap error
• Bootstrap estimate of prediction error:
  $\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{*b}(x_i)\big)$
• Problem: the test data are in the training data.
Bootstrap over-fitting
• Consider a very simple system: two classes, equally likely, and the labels $y_i$ independent of $x_i$. So the error rate (Bayes rate) should be 50%.
• Now consider a 1-NN method applied to bootstrap training sets. If a sample is in the bootstrap training set, its error is 0; otherwise it is 0.5.
• A sample is left out of a bootstrap set with probability $(1 - 1/N)^N \approx e^{-1} \approx 0.368$, so the error is $\widehat{\mathrm{Err}}_{\mathrm{boot}} = 0.5 \times 0.368 = 0.184$. Much too low…
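A quick simulation (a sketch under the slide's assumptions: scalar features carrying no information, random labels, 1-NN, 0–1 loss) reproduces the ≈ 0.184 figure:

    import numpy as np

    rng = np.random.default_rng(0)
    N, B = 100, 200
    x = rng.uniform(size=N)            # features carry no label information
    y = rng.integers(0, 2, size=N)     # labels independent of x: Bayes rate 0.5

    errs = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)      # bootstrap training set
        # 1-NN: nearest bootstrap point for every original point
        nn = np.abs(x[:, None] - x[idx][None, :]).argmin(axis=1)
        errs.append(np.mean(y != y[idx][nn]))
    print(np.mean(errs))               # ≈ 0.5 * 0.368 = 0.184, not 0.5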
Bootstrap: mimic cross-validation
• Fit the model on a set of bootstrap samples, keeping track of predictions from bootstrap samples not containing that observation.
• The leave-one-out bootstrap estimate of prediction error is
  $\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L\big(y_i, \hat{f}^{*b}(x_i)\big)$,
  where $C^{-i}$ is the set of bootstrap samples that do not contain observation $i$.
• This solves the over-fitting problem suffered by $\widehat{\mathrm{Err}}_{\mathrm{boot}}$, but has the training-set-size bias mentioned in the discussion of CV. So it is probably too high.
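A minimal sketch of the leave-one-out bootstrap estimate (the `fit_predict` interface and the 1-NN regression base model are illustrative assumptions):

    import numpy as np

    def err_loo_bootstrap(x, y, fit_predict, B=200, rng=None):
        """Score each point only with models whose bootstrap sample excluded it."""
        rng = rng or np.random.default_rng(0)
        N = len(x)
        loss_sums = np.zeros(N)       # accumulated loss per point
        counts = np.zeros(N)          # |C^{-i}|: samples excluding point i
        for _ in range(B):
            idx = rng.integers(0, N, size=N)
            out = np.setdiff1d(np.arange(N), idx)    # points not in this sample
            preds = fit_predict(x[idx], y[idx], x[out])
            loss_sums[out] += (y[out] - preds) ** 2  # squared-error loss L
            counts[out] += 1
        keep = counts > 0
        return np.mean(loss_sums[keep] / counts[keep])

    def nn1(x_tr, y_tr, x_te):        # 1-NN regression as a stand-in model
        return y_tr[np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)]

    rng = np.random.default_rng(1)
    x = rng.uniform(size=60)
    y = np.sin(4 * x) + 0.3 * rng.normal(size=60)
    print(err_loo_bootstrap(x, y, nn1))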
The “0.632 Estimator”
• The average number of distinct observations in each bootstrap sample is approximately $0.632 \cdot N$. (Why? A given observation is included with probability $1 - (1 - 1/N)^N \to 1 - e^{-1} \approx 0.632$.)
• The bias will roughly behave like that of two-fold cross-validation (biased upwards) – it will over-estimate the error.
• The “0.632 estimator” is designed to get rid of this bias:
  $\widehat{\mathrm{Err}}^{(0.632)} = 0.368 \cdot \overline{\mathrm{err}} + 0.632 \cdot \widehat{\mathrm{Err}}^{(1)}$
• Still too low in this case (0.316 vs 0.5), but better. [Why is $\overline{\mathrm{err}}$ equal to 0? Because 1-NN fits its training data perfectly.] See Hastie p. 252 for an improvement.
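A two-line numerical check of the 0.632 figure (plain arithmetic, no further assumptions):

    import numpy as np
    for N in (10, 100, 1000, 10000):
        print(N, 1 - (1 - 1 / N) ** N)   # -> 1 - 1/e ≈ 0.632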
Bootstrapping, MLE and Bayes
• Consider fitting some data ($N = 50$) with a linear combination of $K$ basis functions $h_j(x)$ (here, 7 cubic B-spline bases).
Linear fitting
• Fitting a linear system. Assume points $(x_i, y_i)$. We want the predictor
  $\hat{\mu}(x) = \sum_{j=1}^{7} \beta_j h_j(x)$
• Let $H$ be the $N \times 7$ matrix whose $ij$th element is $h_j(x_i)$ (call each row $h(x_i)^T$). We want to solve $H\beta = y$.
• Least squares solution (MLE?): $\hat{\beta} = (H^T H)^{-1} H^T y$
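A minimal sketch of this fit (a simple polynomial basis stands in here for the slide's 7 cubic B-splines; the algebra $\hat{\beta} = (H^T H)^{-1} H^T y$ is unchanged):

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 50, 7
    x = np.sort(rng.uniform(size=N))
    y = np.sin(3 * x) + 0.2 * rng.normal(size=N)

    H = np.vander(x, K, increasing=True)   # N x 7 basis matrix, h_j(x_i) = x_i^j
    # beta_hat = (H^T H)^{-1} H^T y, computed stably via least squares
    beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    mu_hat = H @ beta_hat                  # fitted values h(x_i)^T beta_hat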
Least squares spline fit for all the data: $\hat{\beta} = (H^T H)^{-1} H^T y$
Linear fit and variance
• Can show (Hastie eq. 8.3) that the variance can be estimated by
  $\widehat{\mathrm{Var}}(\hat{\beta}) = (H^T H)^{-1} \hat{\sigma}^2$
• where $\hat{\sigma}^2$ is estimated by
  $\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - \hat{\mu}(x_i)\big)^2$
• Then the standard error of $\hat{\mu}(x) = h(x)^T \hat{\beta}$ is
  $\widehat{\mathrm{se}}[\hat{\mu}(x)] = \big[h(x)^T (H^T H)^{-1} h(x)\big]^{1/2} \hat{\sigma}$
“Error bars” – a sense of the distribution of $\hat{\beta} = (H^T H)^{-1} H^T y$
Now bootstrap the same data…
• Choose 50 samples from the data (remember, with replacement).
• For each sample set, compute $\hat{\beta}$. 10 such sets shown:
Now estimate variance too
• 95% confidence band: with 200 bootstrap fits, the 95% band at each $x$ runs from the 5th smallest to the 5th largest fitted value.
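A sketch of that percentile band (the polynomial basis again standing in for the B-splines; 200 bootstrap fits, band from the 5th smallest and 5th largest values at each grid point):

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, B = 50, 7, 200
    x = np.sort(rng.uniform(size=N))
    y = np.sin(3 * x) + 0.2 * rng.normal(size=N)
    H = np.vander(x, K, increasing=True)      # stand-in basis matrix

    grid = np.linspace(0, 1, 100)
    Hg = np.vander(grid, K, increasing=True)
    fits = np.empty((B, grid.size))
    for b in range(B):
        idx = rng.integers(0, N, size=N)      # resample with replacement
        beta_b, *_ = np.linalg.lstsq(H[idx], y[idx], rcond=None)
        fits[b] = Hg @ beta_b                 # bootstrap fit on the grid

    band = np.sort(fits, axis=0)
    lo, hi = band[4], band[-5]                # 5th smallest / largest: 95% band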
Parametric Bootstrap
• The previous method is the “non-parametric” bootstrap – generate sample sets (with replacement).
• Parametric: assume a Gaussian model of known (or estimated) variance. Fit the model using all the data, then perturb the predictions by Gaussian noise:
  $y_i^* = \hat{\mu}(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \hat{\sigma}^2)$
• Do this $B$ times and compute statistics of the predictor.
• Same mean and variance as before… if Gaussian. See Hastie for more.
• The non-parametric version is easier…
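A parametric-bootstrap sketch of the same fit (the Gaussian noise model and the stand-in polynomial basis are the assumptions here):

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, B = 50, 7, 200
    x = np.sort(rng.uniform(size=N))
    y = np.sin(3 * x) + 0.2 * rng.normal(size=N)
    H = np.vander(x, K, increasing=True)

    beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    mu_hat = H @ beta_hat
    sigma = np.sqrt(np.mean((y - mu_hat) ** 2))   # sigma from residuals

    betas = np.empty((B, K))
    for b in range(B):
        y_star = mu_hat + rng.normal(scale=sigma, size=N)  # perturb predictions
        betas[b], *_ = np.linalg.lstsq(H, y_star, rcond=None)

    print(betas.std(axis=0))    # spread of beta over the B parametric refits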
Now Bayesian
• For the same fitting example, to be Bayesian we need priors on $\beta$. For example:
  $\beta \sim N(0, \tau \Sigma)$
  where $\tau$ controls how certain the prior is; large $\tau$ gives a “non-informative” prior.
• Look familiar? (The posterior draws resemble the bootstrap fits.)
Take-home message
• Bayesian inference with a “non-informative” prior results in a posterior distribution over parameters with no prior influence.
• That distribution will be “broad” if a large range of parameter values could yield high likelihoods of the data, and “peaked” if only a small range is plausible.
• Another way to ask: for small perturbations of the data, can you get large variations in the MLE parameters?
• Bootstrapping (non-parametric) yields data perturbations that reveal the possible parameter variations.
• The non-parametric bootstrap is “poor man's Bayes”, but much easier to implement.
Bagging
• Introduced by Breiman (Machine Learning, 1996).
• An acronym for “bootstrap aggregation”.
• In regression, it averages the prediction over a collection of bootstrap samples, thus reducing the variance of the prediction.
• For classification, a committee (or ensemble – later) of classifiers each casts a vote for the predicted class.
Bagging in regression
• Consider the regression problem with training data $Z = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. As always, we fit a model and get a prediction $\hat{f}(x)$ at input $x$.
• For each bootstrap sample set $Z^{*b}$, $b = 1, \ldots, B$, fit the model and get the prediction $\hat{f}^{*b}(x)$.
• Then the bagging (or bagged) estimate is:
  $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$
• Note: bagging only matters if $\hat{f}$ is a non-linear function of the data. (Why? If $\hat{f}$ is linear in the data, the average over bootstrap samples just converges to the original fit.)
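A minimal bagged-regression sketch (the tree base learner via sklearn's DecisionTreeRegressor and the toy data are illustrative assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    N = 100
    X = rng.uniform(size=(N, 1))
    y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=N)

    def bagged_predict(X_tr, y_tr, X_te, B=50):
        """Average the predictions of B trees, each fit to a bootstrap resample."""
        preds = np.zeros(len(X_te))
        for _ in range(B):
            idx = rng.integers(0, len(X_tr), size=len(X_tr))
            tree = DecisionTreeRegressor().fit(X_tr[idx], y_tr[idx])
            preds += tree.predict(X_te)
        return preds / B          # f_bag(x) = (1/B) * sum_b f^{*b}(x)

    print(bagged_predict(X, y, X[:5]))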
Bagging for classification
• Let $\hat{G}$ be a classifier for a K-class response. Consider an underlying indicator-vector function
  $\hat{f}(x) = (0, \ldots, 0, 1, 0, \ldots, 0)$,
  where the entry in the $k$th place is 1 if the prediction for $x$ is the $k$th class, such that
  $\hat{G}(x) = \arg\max_k \hat{f}(x)$.
• Then the bagged estimate is $\hat{f}_{\mathrm{bag}}(x) = (p_1, \ldots, p_K)$, where $p_k$ is the proportion of base classifiers predicting class $k$ at $x$, $k = 1, \ldots, K$.
• Finally:
  $\hat{G}_{\mathrm{bag}}(x) = \arg\max_k \hat{f}_{\mathrm{bag}}(x)$
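A matching classification sketch (majority vote; sklearn's DecisionTreeClassifier as an assumed base learner, with integer class labels 0..K-1):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def bagged_classify(X_tr, y_tr, X_te, K, B=50):
        """f_bag(x) = (p_1,...,p_K) vote proportions; G_bag(x) = argmax_k p_k."""
        votes = np.zeros((len(X_te), K))
        for _ in range(B):
            idx = rng.integers(0, len(X_tr), size=len(X_tr))
            clf = DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx])
            votes[np.arange(len(X_te)), clf.predict(X_te)] += 1
        p = votes / B             # proportion predicting each class at x
        return p.argmax(axis=1), p

    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    labels, props = bagged_classify(X, y, X[:5], K=2)
    print(labels, props)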
Bagging: a simulated example
• Generated a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95.
• The label Y was generated according to Pr(Y = 1 | x1 ≤ 0.5) = 0.2, Pr(Y = 1 | x1 > 0.5) = 0.8.
• So the Bayes error rate is 0.2.
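A sketch of this data-generating process (seed and variable names assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p, rho = 30, 5, 0.95
    cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # pairwise corr. 0.95
    X = rng.multivariate_normal(np.zeros(p), cov, size=N)
    # Pr(Y=1 | x1 <= 0.5) = 0.2,  Pr(Y=1 | x1 > 0.5) = 0.8
    p1 = np.where(X[:, 0] <= 0.5, 0.2, 0.8)
    Y = (rng.uniform(size=N) < p1).astype(int)           # Bayes error rate = 0.2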
Digression: classification trees in 1 slide…
Decision Tree Example
[Figure, built up over several slides: a 2-D feature space with axes Income and Debt, recursively partitioned by a tree – first a split on Income > t1, then Debt > t2, then Income > t3.]
Note: tree boundaries are piecewise linear and axis-parallel.
Bagging Trees
Notice that the bootstrap trees are different from the original tree.
Good bagging
The key is independence
Bagging is not always good
• A decision tree is easy to interpret; bagged trees (forests?) – not so much.
• Bagging makes good classifiers better, but can make bad classifiers worse.
• Simple example: suppose $Y = 1$ for all $x$, and suppose each $\hat{G}_b(x)$ predicts $Y = 1$ for only 40% of the cases, at random.
  • Then the misclassification rate of each $\hat{G}_b(x)$ is 60%, but for the bagged classifier it approaches 100%: with many independent votes, the majority almost always lands on the wrong class.
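A quick simulation of this example (setup assumed: 501 classifiers, each independently voting Y = 1 with probability 0.4 at each of many points):

    import numpy as np

    rng = np.random.default_rng(0)
    B, n_points = 501, 10000
    votes = rng.uniform(size=(B, n_points)) < 0.4   # True means "predict Y=1"
    bagged = votes.mean(axis=0) > 0.5               # majority vote per point
    print("single classifier error:", 1 - votes[0].mean())   # ≈ 0.60
    print("bagged classifier error:", 1 - bagged.mean())     # ≈ 1.00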
Bagging can only do so much
Model averaging
• MLE – using the entire training set to estimate $\theta$ – corresponds to the mode of the distribution over $\theta$: the maximum of the posterior given a non-informative prior. A MAP estimate would be the mode given a prior distribution.
• Bootstrapping gives a sense of the distribution.
• Bagging is the mean – so it minimizes a squared-error loss.
• These are all ways of thinking about model averaging.
Bayesian Model Averaging
• Suppose you want to estimate some quantity $\zeta$.
• Candidate models: $M_m$, $m = 1, \ldots, M$.
• Posterior distribution and mean:
  $\Pr(\zeta \mid Z) = \sum_{m=1}^{M} \Pr(\zeta \mid M_m, Z) \Pr(M_m \mid Z)$
  $E(\zeta \mid Z) = \sum_{m=1}^{M} E(\zeta \mid M_m, Z) \Pr(M_m \mid Z)$
• The Bayesian prediction (posterior mean) is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model.
• How would you guess those probabilities?
Bayesian model averaging (cont’d)
• “Committee methods” combine the models with an un-weighted average; this assumes equal posterior probability for each model.
• Or, if you had some estimate of model quality, you would use that. How do we evaluate the “goodness” of a model?
  • BIC – the quality of the model is a function of the fit and the degrees of freedom of the model.
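One common heuristic along these lines (an assumption here, not stated on the slide): turn per-model BIC scores into approximate posterior weights via $w_m \propto e^{-\mathrm{BIC}_m/2}$:

    import numpy as np

    def bic_weights(bics):
        """Approximate Pr(M_m | Z) from BIC scores: w_m ∝ exp(-BIC_m / 2)."""
        b = np.asarray(bics, dtype=float)
        w = np.exp(-(b - b.min()) / 2)   # shift by min for numerical stability
        return w / w.sum()

    print(bic_weights([102.3, 100.1, 107.8]))   # hypothetical BIC scores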
Frequentist Model Averaging
• Given predictions $\hat{f}_1(x), \ldots, \hat{f}_M(x)$, under squared-error loss we can seek the weights such that
  $\hat{w} = \arg\min_w E_P \Big[ Y - \sum_{m=1}^{M} w_m \hat{f}_m(x) \Big]^2$
• The solution is the population least-squares linear regression of $Y$ on $\hat{F}(x)^T \equiv [\hat{f}_1(x), \ldots, \hat{f}_M(x)]$:
  $\hat{w} = E_P[\hat{F}(x) \hat{F}(x)^T]^{-1} E_P[\hat{F}(x) Y]$
• One can show that combining models never makes things worse, at the population level.
• But the true population density is not available, so we can only apply this to the training set – and that is not going to work well. Imagine the different models have different degrees of freedom. Which will be selected? (The most complex one, which fits the training data best.)
Stacking
• Stacked generalization, or stacking, is a way to get around this problem.
• The stacking weights are given by
  $\hat{w}^{st} = \arg\min_w \sum_{i=1}^{N} \Big[ y_i - \sum_{m=1}^{M} w_m \hat{f}_m^{-i}(x_i) \Big]^2$,
  where $\hat{f}_m^{-i}$ is the estimate of model $m$ fit without the $i$th example.
• The final predictor is $\sum_{m=1}^{M} \hat{w}_m^{st} \hat{f}_m(x)$.
• There is a close connection with leave-one-out cross-validation.
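A minimal stacking sketch (the two base models, the brute-force leave-one-out loop, and the unconstrained least-squares weights are all illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    N = 60
    X = rng.uniform(size=(N, 1))
    y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=N)

    models = [LinearRegression(), DecisionTreeRegressor(max_depth=3)]
    M = len(models)

    # Leave-one-out predictions f_m^{-i}(x_i), computed by brute force.
    F = np.empty((N, M))
    for i in range(N):
        keep = np.arange(N) != i
        for m, model in enumerate(models):
            F[i, m] = model.fit(X[keep], y[keep]).predict(X[i:i+1])[0]

    # Stacking weights: least squares of y on the LOO predictions.
    w_st, *_ = np.linalg.lstsq(F, y, rcond=None)

    # Final predictor: sum_m w_m * f_m(x), each model refit on all data.
    full_preds = np.column_stack([m.fit(X, y).predict(X) for m in models])
    y_hat = full_preds @ w_st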
One more…
Bumping
• Sometimes models get stuck in local minima.
• Example: trees found by greedily splitting along one axis at a time.
Bumping (cont’d)
• Bumping is like bagging, but instead of averaging, pick the single model that does best on the original data (see the sketch below):
  • Select B bootstrap samples $Z^{*b}$, $b = 1, \ldots, B$, and fit models $\hat{f}^{*b}(x)$, $b = 1, \ldots, B$.
  • Then choose the model that gives the best fit over all the data:
    $\hat{b} = \arg\min_b \sum_{i=1}^{N} \big[y_i - \hat{f}^{*b}(x_i)\big]^2$
• This fixes two problems:
  • Pathological points
  • The need to ignore some of the data to get started well…
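A minimal bumping sketch (DecisionTreeRegressor and the toy data are assumed stand-ins; note the score is computed on the original data, not the bootstrap sample):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    N, B = 80, 20
    X = rng.uniform(size=(N, 1))
    y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=N)

    best_err, best_model = np.inf, None
    for b in range(B):
        idx = rng.integers(0, N, size=N)                 # bootstrap sample
        model = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
        err = np.mean((y - model.predict(X)) ** 2)       # score on ORIGINAL data
        if err < best_err:
            best_err, best_model = err, model            # keep best single fit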
Bumping with trees
With only 20 bumped bootstrap samples!
References
• Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning (ch. 7 and 8).