Page 1

Bootstrap, Bagging, Stacking
CS7616 Pattern Recognition – A. Bobick
Aaron Bobick, School of Interactive Computing
(CS7616, Spring 2014; slides dated 2014-02-25)

Page 2

Administrivia
• End of Chapter 7 and most of Chapter 8 of the Hastie book.
• Slides brought to you by Bibhas Chakraborty and friends.
• New problem set out tonight or tomorrow:
  • Similar datasets.
  • Apply k-NN and Naïve Bayes methods.
  • For NB, do density estimation for each feature.
  • For both, experiment with cross-validation (and maybe BIC).
  • For large datasets, create a real holdout set; use CV to find the best K or other variables to predict test error; then actually measure test error.

Page 3

Before the snow
• We were looking at AIC, BIC and cross-validation to determine the "best" parameters to use.
• The goal is to prevent over-fitting to the training data and to predict the actual error rate.
• Overall CV strategy:
  • Pick many sets of training samples and train a regressor or classifier on each.
  • The overall average behavior should be predictive of new data.
• But there is more we can do with multiple sample sets.

Page 4

Bootstrap Method
• General tool for assessing statistical accuracy.
• Suppose we have a model to fit the training data:

$$Z = \{(x_i, y_i),\; i = 1, \dots, N\}$$

• The idea is to draw random samples with replacement of size N from the training data. This process is repeated B times to get B bootstrap datasets.
• Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the B replications.
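As a concrete sketch, the resampling step can be written in a few lines of numpy (the `bootstrap_datasets` helper name and interface are illustrative, not from the slides):

```python
import numpy as np

def bootstrap_datasets(X, y, B, rng=None):
    """Draw B bootstrap datasets: sample N rows with replacement each time."""
    rng = np.random.default_rng(rng)
    N = len(X)
    out = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)  # indices drawn with replacement
        out.append((X[idx], y[idx]))
    return out

# Example: 5 points, 3 bootstrap replicates
X = np.arange(5).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
reps = bootstrap_datasets(X, y, B=3, rng=0)
```

Each replicate has the same size N as the original training set; on average about 63.2% of the original points appear in any one replicate.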

Page 5

Bootstrap
The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.

Page 6

Bootstrap (Cont'd)
• Let S(Z) be any quantity computed from the data Z. From the bootstrap sampling, we can estimate any aspect of the distribution of S(Z).
• For example, its variance is estimated by

$$\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^{B}\bigl(S(Z^{*b}) - \bar{S}^{*}\bigr)^{2},$$

where $\bar{S}^{*} = \sum_{b} S(Z^{*b})/B$.
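The variance formula above translates directly into code. A minimal sketch, using the sample median as the statistic S (the choice of median and the helper name are illustrative):

```python
import numpy as np

def bootstrap_variance(z, stat, B=1000, rng=None):
    """Estimate Var[S(Z)] from B bootstrap replications (B-1 denominator)."""
    rng = np.random.default_rng(rng)
    N = len(z)
    s_star = np.array([stat(z[rng.integers(0, N, size=N)]) for _ in range(B)])
    s_bar = s_star.mean()
    return ((s_star - s_bar) ** 2).sum() / (B - 1)  # same as s_star.var(ddof=1)

z = np.random.default_rng(0).normal(size=200)
v = bootstrap_variance(z, np.median, B=500, rng=1)
```

For the median of 200 standard-normal draws, the estimate should land near the asymptotic value pi/(2N), a quantity that is hard to get in closed form for many other statistics — which is the point of the tool.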

Page 7

Bootstrap error
• Bootstrap estimate of prediction error:

$$\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\bigl(y_i,\, \hat f^{*b}(x_i)\bigr).$$

• Problem: the test data are in the training data.

Page 8

Bootstrap over-fitting
• Consider a very simple system: two classes, equally likely, and the labels $y_i$ are independent of $x_i$. So the error rate (Bayes rate) should be 50%.
• Now consider a 1-NN method applied to bootstrap training sets. If a sample is in the bootstrap training set, its error is 0; otherwise it is 0.5.
• A sample is absent from a given bootstrap set with probability $(1 - 1/N)^N \approx e^{-1} \approx 0.368$, so

$$\widehat{\mathrm{Err}}_{\mathrm{boot}} = 0.5 \times 0.368 = 0.184.$$

Much too low…

Page 9

Bootstrap: mimic cross-validation
• Fit the model on a set of bootstrap samples, keeping track of predictions from bootstrap samples not containing that observation.
• The leave-one-out bootstrap estimate of prediction error is

$$\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\bigl(y_i,\, \hat f^{*b}(x_i)\bigr),$$

where $C^{-i}$ is the set of bootstrap samples that do not contain observation i.
• This solves the over-fitting problem suffered by $\widehat{\mathrm{Err}}_{\mathrm{boot}}$, but it has the training-set-size bias mentioned in the discussion of CV. So it is probably too high.

Page 10

The "0.632 Estimator"
• The average number of distinct observations in each bootstrap sample is approximately 0.632·N. (Why?)
• The bias will roughly behave like that of two-fold cross-validation (biased upwards) – it will overestimate the error.
• The "0.632 estimator" is designed to get rid of this bias:

$$\widehat{\mathrm{Err}}^{(0.632)} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)}.$$

• Still too low in this case (0.316 vs 0.5), but better. [Why is $\overline{\mathrm{err}}$, the training error, equal to 0 here?] See Hastie p. 252 for an improvement.
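A sketch of the estimator on the slide's own 1-NN example (labels independent of x). The `one_nn` and `err_632` names and the generic `fit_predict` interface are illustrative, not from the slides; refitting B times is fine for toy sizes:

```python
import numpy as np

def one_nn(Xtr, ytr, Xte):
    """1-nearest-neighbor prediction (brute force)."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d.argmin(1)]

def err_632(X, y, fit_predict, B=100, rng=None):
    """Err^(0.632) = 0.368*err_bar + 0.632*Err^(1), with 0/1 loss.
    Err^(1) averages, for each point i, the losses from bootstrap fits
    whose sample did not contain i (the set C^{-i})."""
    rng = np.random.default_rng(rng)
    N = len(y)
    err_bar = np.mean(fit_predict(X, y, X) != y)   # training error
    loss = [[] for _ in range(N)]
    for _ in range(B):
        idx = rng.integers(0, N, size=N)
        oob = np.setdiff1d(np.arange(N), idx)      # points not in this sample
        if oob.size == 0:
            continue
        pred = fit_predict(X[idx], y[idx], X[oob])
        for j, i in enumerate(oob):
            loss[i].append(pred[j] != y[i])
    err1 = np.mean([np.mean(l) for l in loss if l])
    return 0.368 * err_bar + 0.632 * err1

# Labels independent of x: 1-NN training error is 0 and Err^(1) is near 0.5,
# so the estimate lands near 0.632 * 0.5 = 0.316, as on the slide.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 2)), rng.integers(0, 2, size=40)
e = err_632(X, y, one_nn, B=200, rng=1)
```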

Page 11

Bootstrapping, MLE and Bayes
• Consider fitting some data with a linear combination of K basis functions $h_i(x)$ (here, K = 7 cubic B-spline basis functions), with N = 50 data points.

Page 12

Linear fitting
• Fitting a linear system. Assume points $(x_i, y_i)$; we want the predictor $\mu(x) = \sum_j \beta_j h_j(x)$.
• Let H be the N×7 matrix whose ij-th element is $h_j(x_i)$ (call each row $\mathbf{h}(x_i)^T$). We want to solve $H\beta = \mathbf{y}$.
• Least squares solution (MLE?): $\hat\beta = (H^T H)^{-1} H^T \mathbf{y}$
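The normal-equations solution above is a one-liner given the design matrix. A minimal sketch; a simple power basis $h_j(x) = x^j$ stands in for the slide's 7 cubic B-spline basis functions, since the algebra is identical for any basis:

```python
import numpy as np

def design_matrix(x, K=7):
    """N x K design matrix H; row i is h(x_i)^T for the basis h_j(x) = x^j."""
    return np.vander(x, K, increasing=True)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=50))        # N = 50, as on the slide
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)

H = design_matrix(x)
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)   # (H^T H)^{-1} H^T y, no explicit inverse
mu_hat = H @ beta_hat                          # fitted values h(x_i)^T beta_hat
```

Solving the normal equations with `np.linalg.solve` (or better, `lstsq`) avoids forming the inverse explicitly, which is both cheaper and numerically safer.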

Page 13

Least squares spline fit for all the data: $\hat\beta = (H^T H)^{-1} H^T \mathbf{y}$

Page 14

Linear fit and variance
• Can show (Hastie eq. 8.3) that the variance can be estimated by:

$$\widehat{\mathrm{Var}}(\hat\beta) = (H^T H)^{-1}\hat\sigma^2$$

• where $\hat\sigma^2$ is estimated by:

$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat\mu(x_i)\bigr)^2$$

• Then the standard error of $\hat\mu(x) = \mathbf{h}(x)^T\hat\beta$ is:

$$\widehat{\mathrm{se}}[\hat\mu(x)] = \bigl[\mathbf{h}(x)^T (H^T H)^{-1}\mathbf{h}(x)\bigr]^{1/2}\hat\sigma$$

Page 15

"Error bars" – a sense of the distribution: $\hat\beta = (H^T H)^{-1} H^T \mathbf{y}$

Page 16

Now bootstrap the same data…
• Choose 50 samples from the data (remember: with replacement).
• For each sample set, compute $\hat\beta$. Here are 10 such sets:

Page 17

Now estimate the variance too: 95% confidence bands. With 200 bootstrap replications, the 95% band at each point is given by the 5th smallest and 5th largest values.

Page 18

Parametric Bootstrap
• The previous method is the "non-parametric" bootstrap – generate sample sets (with replacement).
• Parametric: assume a Gaussian model of known (or estimated) variance. Fit the model using all the data, then perturb the predictions by Gaussian noise:

$$y_i^{*} = \hat\mu(x_i) + \varepsilon_i^{*}, \qquad \varepsilon_i^{*} \sim N(0, \hat\sigma^2)$$

• Do this B times and look at the statistics of the predictor.
• Same mean and variance as before… if Gaussian. See Hastie for more.
• Non-parametric is easier…
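The parametric recipe can be sketched as follows (reusing a power basis as a stand-in for the spline basis; variable names are illustrative): fit once on all the data, estimate the noise variance, then repeatedly perturb the fitted values and refit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=50))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)
H = np.vander(x, 7, increasing=True)

beta_hat = np.linalg.lstsq(H, y, rcond=None)[0]   # fit on ALL the data
mu_hat = H @ beta_hat
sigma2_hat = np.mean((y - mu_hat) ** 2)           # estimated noise variance

B = 200
betas = []
for _ in range(B):
    # perturb the fitted predictions with Gaussian noise, then refit
    y_star = mu_hat + rng.normal(scale=np.sqrt(sigma2_hat), size=len(y))
    betas.append(np.linalg.lstsq(H, y_star, rcond=None)[0])
betas = np.array(betas)   # B x 7: sampled distribution of beta_hat
```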

Page 19

Now Bayesian
• For the same fitting example, to be Bayesian we need priors on β. For example:

$$\beta \sim N(0, \tau\boldsymbol{\Sigma})$$

where τ controls how certain the prior is. Large τ gives a "non-informative" prior.
• Look familiar?

Page 20

Take home message
• Bayesian inference with a "non-informative" prior results in a posterior distribution over parameters with no prior influence.
• That distribution will be "broad" if a large range of parameter values could yield high likelihoods of the data, and "peaked" if only a small range is plausible.
• Put another way: for small perturbations of the data, can you get large variations of the MLE parameters?
• Bootstrapping (non-parametric) yields data perturbations that reveal the possible parameter variations.
• The non-parametric bootstrap is "poor man's Bayes", but much easier to implement.

Page 21

Bagging
• Introduced by Breiman (Machine Learning, 1996).
• An acronym for "bootstrap aggregation".
• In regression, it averages the prediction over a collection of bootstrap samples, thus reducing the variance in prediction.
• For classification, a committee (or ensemble – later) of classifiers each casts a vote for the predicted class.

Page 22

Bagging in regression
• Consider the regression problem with training data $Z = \{(x_1, y_1), \dots, (x_N, y_N)\}$. As always, we fit a model and get a prediction $\hat f(x)$ at the input x.
• For each bootstrap sample set $Z^{*b}$, $b = 1, \dots, B$, fit the model and get the prediction $\hat f^{*b}(x)$.
• Then the bagging (or bagged) estimate is:

$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f^{*b}(x).$$

• Note: bagging only matters if $\hat f$ is a non-linear function of the data. (Why?)
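The bagged estimate is just an average over bootstrap fits. A minimal sketch with a (non-linear) regression stump as the base learner; the stump and all names here are illustrative, not from the slides:

```python
import numpy as np

def bagged_predict(X, y, x_new, fit, predict, B=50, rng=None):
    """f_bag(x) = (1/B) * sum_b f*b(x): average predictions of B bootstrap fits."""
    rng = np.random.default_rng(rng)
    N = len(y)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)           # bootstrap sample Z*b
        preds.append(predict(fit(X[idx], y[idx]), x_new))
    return np.mean(preds, axis=0)

def fit_stump(X, y):
    """Regression stump: mean of y on each side of the best split of x."""
    order = np.argsort(X[:, 0])
    xs, ys = X[order, 0], y[order]
    best = (np.inf, xs[0], ys.mean(), ys.mean())
    for k in range(1, len(xs)):
        left, right = ys[:k].mean(), ys[k:].mean()
        sse = ((ys[:k] - left) ** 2).sum() + ((ys[k:] - right) ** 2).sum()
        if sse < best[0]:
            best = (sse, xs[k], left, right)
    return best[1:]                                 # (threshold, left_mean, right_mean)

def predict_stump(model, x_new):
    t, left, right = model
    return np.where(x_new[:, 0] < t, left, right)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = (X[:, 0] > 0).astype(float) + rng.normal(scale=0.1, size=100)
preds = bagged_predict(X, y, np.array([[-0.5], [0.5]]), fit_stump, predict_stump, B=25, rng=1)
```

Averaging many stumps smooths the single hard step of one stump, which is exactly the variance reduction the slide describes; a linear base learner would gain nothing, since an average of linear fits is again (approximately) the full-data linear fit.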

Page 23

Bagging for classification
• Let $\hat G$ be a classifier for a K-class response. Consider an underlying indicator-vector function

$$\hat f(x) = (0, \dots, 0, 1, 0, \dots, 0),$$

whose k-th entry is 1 if the prediction for x is the k-th class, such that

$$\hat G(x) = \arg\max_k \hat f_k(x).$$

• Then the bagged estimate is $\hat f_{\mathrm{bag}}(x) = (p_1, \dots, p_K)$, where $p_k$ is the proportion of base classifiers predicting class k at x, for $k = 1, \dots, K$.
• Finally:

$$\hat G_{\mathrm{bag}}(x) = \arg\max_k \hat f_{\mathrm{bag},k}(x).$$
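The vote-counting step above can be sketched directly (the `bag_classify` helper and its B x n votes layout are illustrative):

```python
import numpy as np

def bag_classify(votes):
    """Majority vote. `votes` is a B x n array of predicted class labels
    (B classifiers, n query points). Returns G_bag per point and the
    class-proportion vectors f_bag (K x n)."""
    votes = np.asarray(votes)
    K = votes.max() + 1
    # p[k, j] = proportion of the B classifiers predicting class k at point j
    p = np.stack([(votes == k).mean(axis=0) for k in range(K)])
    return p.argmax(axis=0), p

votes = [[0, 1, 2], [0, 1, 1], [1, 1, 2]]   # 3 classifiers, 3 query points
labels, p = bag_classify(votes)             # labels: [0 1 2]
```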

Page 24

Bagging: a simulated example
• Generated a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95.
• The label Y was generated according to Pr(Y = 1 | x₁ ≤ 0.5) = 0.2 and Pr(Y = 1 | x₁ > 0.5) = 0.8.
• So the Bayes error rate is 0.2.

Page 25

Digression: classification trees in 1 slide…

Page 26

Decision Tree Example: data plotted in the (Income, Debt) plane.

Page 27

Decision Tree Example: first split, Income > t1 (one region still undetermined).

Page 28

Decision Tree Example: second split, Debt > t2 (one region still undetermined).

Page 29

Decision Tree Example: third split, Income > t3.

Page 30

Decision Tree Example: final partition from the splits Income > t1, Debt > t2, Income > t3. Note: tree boundaries are piecewise linear and axis-parallel.
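The three-split tree from the example is just nested axis-parallel tests. The thresholds t1, t2, t3 and the leaf labels below are illustrative placeholders (the slides never give concrete values):

```python
# Hypothetical thresholds and leaf labels, for illustration only.
def classify(income, debt, t1=30_000, t2=10_000, t3=60_000):
    if income > t1:
        if income > t3:
            return "good"
        return "bad" if debt > t2 else "good"
    return "bad"

# e.g. classify(income=70_000, debt=5_000) -> "good"
```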

Page 31

Bagging Trees
Notice the bootstrap trees are different from the original tree.

Page 32

Good bagging

Page 33

The key is independence

Page 34

Bagging is not always good?
• A decision tree is easy to interpret; bagged trees (forests?) – not so much.
• Bagging makes good classifiers better, but it can make bad classifiers worse.
• Simple example: suppose Y = 1 for all x, and suppose each $\hat G^{*b}(x)$ predicts Y = 1 for only 40% of the cases.
• Then the misclassification rate of each $\hat G^{*b}$ is 60%, but for the bagged (majority-vote) classifier it is 100%.

Page 35

Bagging can only do so much

Page 36

Model averaging
• MLE – using the entire training set to estimate θ – corresponds to the mode of the distribution over θ: the maximum of the posterior given a non-informative prior. The MAP estimate would be the mode given a prior distribution.
• Bootstrapping gives a sense of the distribution.
• Bagging is the mean – so it minimizes a squared-error loss.
• These are all ways of thinking about model averaging.

Page 37

Bayesian Model Averaging
• Suppose you want to estimate some quantity ζ.
• Candidate models: $M_m$, $m = 1, \dots, M$.
• Posterior distribution and mean:

$$\Pr(\zeta \mid Z) = \sum_{m=1}^{M}\Pr(\zeta \mid M_m, Z)\,\Pr(M_m \mid Z),$$

$$E(\zeta \mid Z) = \sum_{m=1}^{M} E(\zeta \mid M_m, Z)\,\Pr(M_m \mid Z).$$

• The Bayesian prediction (posterior mean) is a weighted average of individual predictions, with weights proportional to the posterior probability of each model.
• How would you guess those probabilities?

Page 38

Bayesian model averaging (cont'd)
• "Committee methods" combine the models with an un-weighted average, assuming equal probability for each model.
• Or, if you had some estimate of model quality, you would use that. But how do we evaluate the "goodness" of a model?
• BIC – the quality of the model is a function of the fit and the DOF of the model.

Page 39

Frequentist Model Averaging
• Given predictions $\hat f_1(x), \dots, \hat f_M(x)$, under squared-error loss we can seek the weights such that

$$\hat w = \arg\min_{w} E_P\Bigl[Y - \sum_{m=1}^{M} w_m \hat f_m(x)\Bigr]^2.$$

• The solution is the population least-squares linear regression of Y on $\hat F(x)^T \equiv [\hat f_1(x), \dots, \hat f_M(x)]$:

$$\hat w = E_P[\hat F(x)\hat F(x)^T]^{-1} E_P[\hat F(x)\,Y].$$

• One can show that combining models never makes things worse, at the population level.
• But the true population is not available, so we can only apply this to the training set – and that is not going to work well. Imagine the different models have different DOFs: which will be selected?

Page 40

Stacking
• Stacked generalization, or stacking, is a way to get around this problem.
• The stacking weights are given by

$$\hat w^{\mathrm{st}} = \arg\min_{w}\sum_{i=1}^{N}\Bigl[y_i - \sum_{m=1}^{M} w_m \hat f_m^{-i}(x_i)\Bigr]^2,$$

where $\hat f_m^{-i}$ is the estimate from model m trained without the i-th example.
• The final predictor is $\sum_{m=1}^{M}\hat w_m^{\mathrm{st}}\hat f_m(x)$.
• There is a close connection with leave-one-out cross-validation.
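The stacking objective can be solved as one least-squares problem once the leave-one-out predictions are tabulated. A sketch for small N (refitting each model N times; the `fitters`/`predictors` interface and the two toy candidate models are illustrative assumptions):

```python
import numpy as np

def stacking_weights(X, y, fitters, predictors):
    """w^st = argmin_w sum_i [y_i - sum_m w_m f_m^{-i}(x_i)]^2,
    using leave-one-out fits f_m^{-i}."""
    N, M = len(y), len(fitters)
    F = np.empty((N, M))                    # F[i, m] = f_m^{-i}(x_i)
    for i in range(N):
        keep = np.arange(N) != i            # drop the i-th example
        for m in range(M):
            model = fitters[m](X[keep], y[keep])
            F[i, m] = predictors[m](model, X[i:i + 1])[0]
    w, *_ = np.linalg.lstsq(F, y, rcond=None)   # least squares in w
    return w

# Two candidate models: a constant and a line.
fit_const = lambda X, y: y.mean()
pred_const = lambda m, X: np.full(len(X), m)
fit_line = lambda X, y: np.polyfit(X[:, 0], y, 1)
pred_line = lambda m, X: np.polyval(m, X[:, 0])

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=30)
w = stacking_weights(X, y, [fit_const, fit_line], [pred_const, pred_line])
```

Because the weights are fit on held-out predictions, a model cannot win simply by having more degrees of freedom, which is the failure mode of the training-set version on the previous slide; here almost all the weight should land on the linear model.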

Page 41

One more…

Page 42

Bumping
• Sometimes models get stuck in local minima.
• Example: trees found by greedily splitting along one axis at a time.

Page 43

Bumping (cont'd)
• Bumping is like bagging, but instead of averaging, pick the single model that does best on the original data:
• Select B bootstrap samples $Z^{*b}$, $b = 1, \dots, B$ (by convention, including the original data as one of them), and fit models $\hat f^{*b}(x)$, $b = 1, \dots, B$.
• Then choose the model that gives the best fit over all the data:

$$\hat b = \arg\min_{b}\sum_{i=1}^{N}\bigl[y_i - \hat f^{*b}(x_i)\bigr]^2.$$

• This fixes two problems:
  • Pathological points.
  • The need to ignore some of the data to get started well…
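The procedure is a small loop: fit on each perturbed sample, score every fit on the full original data, keep the winner. A sketch with a trivial constant model standing in for a tree, just to show the mechanics (all names here are illustrative):

```python
import numpy as np

def bump(X, y, fit, predict, B=20, rng=None):
    """Bumping: fit on B bootstrap samples plus the original data (so the
    original fit can never lose), then keep the model with the smallest
    squared error measured on the FULL training set."""
    rng = np.random.default_rng(rng)
    N = len(y)
    samples = [np.arange(N)] + [rng.integers(0, N, size=N) for _ in range(B)]
    best_model, best_err = None, np.inf
    for idx in samples:
        model = fit(X[idx], y[idx])
        err = np.mean((predict(model, X) - y) ** 2)  # scored on original data
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err

# Trivial stand-in model (a constant) just to exercise the loop.
X, y = np.zeros((20, 1)), np.random.default_rng(1).normal(size=20)
m, e = bump(X, y, lambda X, y: y.mean(), lambda m, X: np.full(len(X), m))
```

With a constant model the original-data fit always wins (the mean minimizes squared error); the method only pays off for greedy, multi-minimum fitters like trees, where a bootstrap perturbation can knock the search out of a bad first split.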

Page 44

Bumping with trees

With only 20 bumped bootstrap samples!

Page 45

References
• Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning (Ch. 7 and 8).