CS7616 Pattern Recognition – A. Bobick
Bootstrapping, Bagging, Stacking
Aaron Bobick, School of Interactive Computing
Administrivia
• End of Chapter 7 and most of Chapter 8 of the Hastie book.
• Slides brought to you by Bibhas Chakraborty and friends.
• New problem set out tonight or tomorrow:
  • Similar datasets
  • Apply K-NN and Naïve Bayes methods
  • For NB, do density estimation for each feature
  • For both, experiment with cross-validation (and maybe BIC)
  • For large datasets, create a real holdout set; use different CV to find the best K or other parameters to predict test error; then actually measure the test error.
Before the snow
• We were looking at AIC, BIC and cross-validation to determine the “best” parameters to use.
• The goal is to prevent over-fitting to the training data and to predict the actual error rate.
• Overall CV strategy:
  • Pick many sets of training samples and train a regressor or classifier.
  • The overall average behavior should be predictive of new data.
• But there is more we can do with multiple sets.
Bootstrap Method
• A general tool for assessing statistical accuracy.
• Suppose we have a model to fit the training data:
  $Z = \{(x_i, y_i),\ i = 1, \ldots, N\}$
• The idea is to draw random samples with replacement of size $N$ from the training data. This process is repeated $B$ times to get $B$ bootstrap datasets.
• Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the $B$ replications.
Bootstrap
The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
Bootstrap (Cont’d)
• Let $S(Z)$ be any quantity computed from the data $Z$. From the bootstrap sampling, we can estimate any aspect of the distribution of $S(Z)$.
• For example, its variance is estimated by
  $\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} \big(S(Z^{*b}) - \bar{S}^*\big)^2$,
  where $\bar{S}^* = \sum_b S(Z^{*b}) / B$.
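As a concrete illustration, here is a minimal non-parametric bootstrap sketch in Python (the variable names and the choice of the sample median as the statistic $S$ are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_variance(z, stat, B=1000):
        """Estimate Var[S(Z)] by recomputing `stat` on B bootstrap resamples."""
        N = len(z)
        # Draw B datasets of size N with replacement; compute S on each.
        stats = np.array([stat(z[rng.integers(0, N, size=N)]) for _ in range(B)])
        # (1 / (B-1)) * sum_b (S(Z*b) - S_bar)^2
        return stats.var(ddof=1)

    z = rng.normal(size=50)               # toy data
    print(bootstrap_variance(z, np.median))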
Bootstrap error
• Bootstrap estimate of prediction error:
  $\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{*b}(x_i)\big)$
• Problem: the test data are in the training data.
Bootstrap over-fitting
• Consider a very simple system: two classes, equally likely, and the labels $y_i$ independent of $x_i$. So the error rate (Bayes rate) should be 50%.
• Now consider a 1-NN method applied to bootstrap training sets. If a sample is in the bootstrap training set, its error is 0; otherwise it is 0.5.
• A sample is left out of a bootstrap set with probability $(1 - 1/N)^N \approx e^{-1} \approx 0.368$, so the error is $\widehat{\mathrm{Err}}_{\mathrm{boot}} = 0.5 \times 0.368 = 0.184$. Much too low…
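A quick simulation (a sketch under the slide's assumptions: scalar features carrying no information, random labels, 1-NN, 0–1 loss) reproduces the ≈ 0.184 figure:

    import numpy as np

    rng = np.random.default_rng(0)
    N, B = 100, 200
    x = rng.uniform(size=N)            # features carry no label information
    y = rng.integers(0, 2, size=N)     # labels independent of x: Bayes rate 0.5

    errs = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)      # bootstrap training set
        # 1-NN: nearest bootstrap point for every original point
        nn = np.abs(x[:, None] - x[idx][None, :]).argmin(axis=1)
        errs.append(np.mean(y != y[idx][nn]))
    print(np.mean(errs))               # ≈ 0.5 * 0.368 = 0.184, not 0.5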
Bootstrap: mimic cross-validation
• Fit the model on a set of bootstrap samples, keeping track of predictions from bootstrap samples not containing that observation.
• The leave-one-out bootstrap estimate of prediction error is
  $\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L\big(y_i, \hat{f}^{*b}(x_i)\big)$,
  where $C^{-i}$ is the set of bootstrap samples that do not contain observation $i$.
• This solves the over-fitting problem suffered by $\widehat{\mathrm{Err}}_{\mathrm{boot}}$, but has the training-set-size bias mentioned in the discussion of CV. So it is probably too high.
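A minimal sketch of the leave-one-out bootstrap estimate (the `fit_predict` interface and the 1-NN regression base model are illustrative assumptions):

    import numpy as np

    def err_loo_bootstrap(x, y, fit_predict, B=200, rng=None):
        """Score each point only with models whose bootstrap sample excluded it."""
        rng = rng or np.random.default_rng(0)
        N = len(x)
        loss_sums = np.zeros(N)       # accumulated loss per point
        counts = np.zeros(N)          # |C^{-i}|: samples excluding point i
        for _ in range(B):
            idx = rng.integers(0, N, size=N)
            out = np.setdiff1d(np.arange(N), idx)    # points not in this sample
            preds = fit_predict(x[idx], y[idx], x[out])
            loss_sums[out] += (y[out] - preds) ** 2  # squared-error loss L
            counts[out] += 1
        keep = counts > 0
        return np.mean(loss_sums[keep] / counts[keep])

    def nn1(x_tr, y_tr, x_te):        # 1-NN regression as a stand-in model
        return y_tr[np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)]

    rng = np.random.default_rng(1)
    x = rng.uniform(size=60)
    y = np.sin(4 * x) + 0.3 * rng.normal(size=60)
    print(err_loo_bootstrap(x, y, nn1))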
The “0.632 Estimator”
• The average number of distinct observations in each bootstrap sample is approximately $0.632 \cdot N$. (Why? A given observation is included with probability $1 - (1 - 1/N)^N \to 1 - e^{-1} \approx 0.632$.)
• The bias will roughly behave like that of two-fold cross-validation (biased upwards) – it will over-estimate the error.
• The “0.632 estimator” is designed to get rid of this bias:
  $\widehat{\mathrm{Err}}^{(0.632)} = 0.368 \cdot \overline{\mathrm{err}} + 0.632 \cdot \widehat{\mathrm{Err}}^{(1)}$
• Still too low in this case (0.316 vs 0.5), but better. [Why is $\overline{\mathrm{err}}$ equal to 0? Because 1-NN fits its training data perfectly.] See Hastie p. 252 for an improvement.
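A two-line numerical check of the 0.632 figure (plain arithmetic, no further assumptions):

    import numpy as np
    for N in (10, 100, 1000, 10000):
        print(N, 1 - (1 - 1 / N) ** N)   # -> 1 - 1/e ≈ 0.632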
Bootstrapping, MLE and Bayes
• Consider fitting some data ($N = 50$) with a linear combination of $K$ basis functions $h_j(x)$ (here, 7 cubic B-spline bases).
Linear fitting
• Fitting a linear system. Assume points $(x_i, y_i)$. We want the predictor
  $\hat{\mu}(x) = \sum_{j=1}^{7} \beta_j h_j(x)$
• Let $H$ be the $N \times 7$ matrix whose $ij$th element is $h_j(x_i)$ (call each row $h(x_i)^T$). We want to solve $H\beta = y$.
• Least squares solution (MLE?): $\hat{\beta} = (H^T H)^{-1} H^T y$
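A minimal sketch of this fit (a simple polynomial basis stands in here for the slide's 7 cubic B-splines; the algebra $\hat{\beta} = (H^T H)^{-1} H^T y$ is unchanged):

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 50, 7
    x = np.sort(rng.uniform(size=N))
    y = np.sin(3 * x) + 0.2 * rng.normal(size=N)

    H = np.vander(x, K, increasing=True)   # N x 7 basis matrix, h_j(x_i) = x_i^j
    # beta_hat = (H^T H)^{-1} H^T y, computed stably via least squares
    beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    mu_hat = H @ beta_hat                  # fitted values h(x_i)^T beta_hat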
Least squares spline fit for all the data: $\hat{\beta} = (H^T H)^{-1} H^T y$
Linear fit and variance
• Can show (Hastie eq. 8.3) that the variance can be estimated by
  $\widehat{\mathrm{Var}}(\hat{\beta}) = (H^T H)^{-1} \hat{\sigma}^2$
• where $\hat{\sigma}^2$ is estimated by
  $\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - \hat{\mu}(x_i)\big)^2$
• Then the standard error of $\hat{\mu}(x) = h(x)^T \hat{\beta}$ is
  $\widehat{\mathrm{se}}[\hat{\mu}(x)] = \big[h(x)^T (H^T H)^{-1} h(x)\big]^{1/2} \hat{\sigma}$
“Error bars” – a sense of the distribution of $\hat{\beta} = (H^T H)^{-1} H^T y$
Now bootstrap the same data…
• Choose 50 samples from the data (remember, with replacement).
• For each sample set, compute $\hat{\beta}$. 10 such sets shown:
Now estimate variance too
• 95% confidence band: with 200 bootstrap fits, the 95% band at each $x$ runs from the 5th smallest to the 5th largest fitted value.
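A sketch of that percentile band (the polynomial basis again standing in for the B-splines; 200 bootstrap fits, band from the 5th smallest and 5th largest values at each grid point):

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, B = 50, 7, 200
    x = np.sort(rng.uniform(size=N))
    y = np.sin(3 * x) + 0.2 * rng.normal(size=N)
    H = np.vander(x, K, increasing=True)      # stand-in basis matrix

    grid = np.linspace(0, 1, 100)
    Hg = np.vander(grid, K, increasing=True)
    fits = np.empty((B, grid.size))
    for b in range(B):
        idx = rng.integers(0, N, size=N)      # resample with replacement
        beta_b, *_ = np.linalg.lstsq(H[idx], y[idx], rcond=None)
        fits[b] = Hg @ beta_b                 # bootstrap fit on the grid

    band = np.sort(fits, axis=0)
    lo, hi = band[4], band[-5]                # 5th smallest / largest: 95% band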
Parametric Bootstrap
• The previous method is the “non-parametric” bootstrap – generate sample sets (with replacement).
• Parametric: assume a Gaussian model of known (or estimated) variance. Fit the model using all the data, then perturb the predictions by Gaussian noise:
  $y_i^* = \hat{\mu}(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \hat{\sigma}^2)$
• Do this $B$ times and compute statistics of the predictor.
• Same mean and variance as before… if Gaussian. See Hastie for more.
• The non-parametric version is easier…
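A parametric-bootstrap sketch of the same fit (the Gaussian noise model and the stand-in polynomial basis are the assumptions here):

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, B = 50, 7, 200
    x = np.sort(rng.uniform(size=N))
    y = np.sin(3 * x) + 0.2 * rng.normal(size=N)
    H = np.vander(x, K, increasing=True)

    beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    mu_hat = H @ beta_hat
    sigma = np.sqrt(np.mean((y - mu_hat) ** 2))   # sigma from residuals

    betas = np.empty((B, K))
    for b in range(B):
        y_star = mu_hat + rng.normal(scale=sigma, size=N)  # perturb predictions
        betas[b], *_ = np.linalg.lstsq(H, y_star, rcond=None)

    print(betas.std(axis=0))    # spread of beta over the B parametric refits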
Now Bayesian
• For the same fitting example, to be Bayesian we need priors on $\beta$. For example:
  $\beta \sim N(0, \tau \Sigma)$
  where $\tau$ controls how certain the prior is; large $\tau$ gives a “non-informative” prior.
• Look familiar? (The posterior draws resemble the bootstrap fits.)
Take-home message
• Bayesian inference with a “non-informative” prior results in a posterior distribution over parameters with no prior influence.
• That distribution will be “broad” if a large range of parameter values could yield high likelihoods of the data, and “peaked” if only a small range is plausible.
• Another way to ask: for small perturbations of the data, can you get large variations in the MLE parameters?
• Bootstrapping (non-parametric) yields data perturbations that reveal the possible parameter variations.
• The non-parametric bootstrap is “poor man's Bayes”, but much easier to implement.
Bagging
• Introduced by Breiman (Machine Learning, 1996).
• An acronym for “bootstrap aggregation”.
• In regression, it averages the prediction over a collection of bootstrap samples, thus reducing the variance of the prediction.
• For classification, a committee (or ensemble – later) of classifiers each casts a vote for the predicted class.
Bagging in regression
• Consider the regression problem with training data $Z = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. As always, we fit a model and get a prediction $\hat{f}(x)$ at input $x$.
• For each bootstrap sample set $Z^{*b}$, $b = 1, \ldots, B$, fit the model and get the prediction $\hat{f}^{*b}(x)$.
• Then the bagging (or bagged) estimate is:
  $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$
• Note: bagging only matters if $\hat{f}$ is a non-linear function of the data. (Why? If $\hat{f}$ is linear in the data, the average over bootstrap samples just converges to the original fit.)
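A minimal bagged-regression sketch (the tree base learner via sklearn's DecisionTreeRegressor and the toy data are illustrative assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    N = 100
    X = rng.uniform(size=(N, 1))
    y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=N)

    def bagged_predict(X_tr, y_tr, X_te, B=50):
        """Average the predictions of B trees, each fit to a bootstrap resample."""
        preds = np.zeros(len(X_te))
        for _ in range(B):
            idx = rng.integers(0, len(X_tr), size=len(X_tr))
            tree = DecisionTreeRegressor().fit(X_tr[idx], y_tr[idx])
            preds += tree.predict(X_te)
        return preds / B          # f_bag(x) = (1/B) * sum_b f^{*b}(x)

    print(bagged_predict(X, y, X[:5]))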
Bagging for classification
• Let $\hat{G}$ be a classifier for a K-class response. Consider an underlying indicator-vector function
  $\hat{f}(x) = (0, \ldots, 0, 1, 0, \ldots, 0)$,
  where the entry in the $k$th place is 1 if the prediction for $x$ is the $k$th class, such that
  $\hat{G}(x) = \arg\max_k \hat{f}(x)$.
• Then the bagged estimate is $\hat{f}_{\mathrm{bag}}(x) = (p_1, \ldots, p_K)$, where $p_k$ is the proportion of base classifiers predicting class $k$ at $x$, $k = 1, \ldots, K$.
• Finally:
  $\hat{G}_{\mathrm{bag}}(x) = \arg\max_k \hat{f}_{\mathrm{bag}}(x)$
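A matching classification sketch (majority vote; sklearn's DecisionTreeClassifier as an assumed base learner, with integer class labels 0..K-1):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def bagged_classify(X_tr, y_tr, X_te, K, B=50):
        """f_bag(x) = (p_1,...,p_K) vote proportions; G_bag(x) = argmax_k p_k."""
        votes = np.zeros((len(X_te), K))
        for _ in range(B):
            idx = rng.integers(0, len(X_tr), size=len(X_tr))
            clf = DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx])
            votes[np.arange(len(X_te)), clf.predict(X_te)] += 1
        p = votes / B             # proportion predicting each class at x
        return p.argmax(axis=1), p

    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    labels, props = bagged_classify(X, y, X[:5], K=2)
    print(labels, props)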
Bagging: a simulated example
• Generated a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95.
• The label Y was generated according to Pr(Y = 1 | x1 ≤ 0.5) = 0.2, Pr(Y = 1 | x1 > 0.5) = 0.8.
• So the Bayes error rate is 0.2.
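A sketch of this data-generating process (seed and variable names assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p, rho = 30, 5, 0.95
    cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # pairwise corr. 0.95
    X = rng.multivariate_normal(np.zeros(p), cov, size=N)
    # Pr(Y=1 | x1 <= 0.5) = 0.2,  Pr(Y=1 | x1 > 0.5) = 0.8
    p1 = np.where(X[:, 0] <= 0.5, 0.2, 0.8)
    Y = (rng.uniform(size=N) < p1).astype(int)           # Bayes error rate = 0.2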
Digression: classification trees in 1 slide…
Decision Tree Example
[Figure, built up over several slides: a 2-D feature space with axes Income and Debt, recursively partitioned by a tree – first a split on Income > t1, then Debt > t2, then Income > t3.]
Note: tree boundaries are piecewise linear and axis-parallel.
Bagging Trees
Notice that the bootstrap trees are different from the original tree.
Good bagging
The key is independence
Bagging is not always good
• A decision tree is easy to interpret; bagged trees (forests?) – not so much.
• Bagging makes good classifiers better, but can make bad classifiers worse.
• Simple example: suppose $Y = 1$ for all $x$, and suppose each $\hat{G}_b(x)$ predicts $Y = 1$ for only 40% of the cases, at random.
  • Then the misclassification rate of each $\hat{G}_b(x)$ is 60%, but for the bagged classifier it approaches 100%: with many independent votes, the majority almost always lands on the wrong class.
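A quick simulation of this example (setup assumed: 501 classifiers, each independently voting Y = 1 with probability 0.4 at each of many points):

    import numpy as np

    rng = np.random.default_rng(0)
    B, n_points = 501, 10000
    votes = rng.uniform(size=(B, n_points)) < 0.4   # True means "predict Y=1"
    bagged = votes.mean(axis=0) > 0.5               # majority vote per point
    print("single classifier error:", 1 - votes[0].mean())   # ≈ 0.60
    print("bagged classifier error:", 1 - bagged.mean())     # ≈ 1.00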
Bagging can only do so much
Model averaging
• MLE – using the entire training set to estimate $\theta$ – corresponds to the mode of the distribution over $\theta$: the maximum of the posterior given a non-informative prior. A MAP estimate would be the mode given a prior distribution.
• Bootstrapping gives a sense of the distribution.
• Bagging is the mean – so it minimizes a squared-error loss.
• These are all ways of thinking about model averaging.
Bayesian Model Averaging
• Suppose you want to estimate some quantity $\zeta$.
• Candidate models: $M_m$, $m = 1, \ldots, M$.
• Posterior distribution and mean:
  $\Pr(\zeta \mid Z) = \sum_{m=1}^{M} \Pr(\zeta \mid M_m, Z) \Pr(M_m \mid Z)$
  $E(\zeta \mid Z) = \sum_{m=1}^{M} E(\zeta \mid M_m, Z) \Pr(M_m \mid Z)$
• The Bayesian prediction (posterior mean) is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model.
• How would you guess those probabilities?
Bayesian model averaging (cont’d)
• “Committee methods” combine the models with an un-weighted average; this assumes equal posterior probability for each model.
• Or, if you had some estimate of model quality, you would use that. How do we evaluate the “goodness” of a model?
  • BIC – the quality of the model is a function of the fit and the degrees of freedom of the model.
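One common heuristic along these lines (an assumption here, not stated on the slide): turn per-model BIC scores into approximate posterior weights via $w_m \propto e^{-\mathrm{BIC}_m/2}$:

    import numpy as np

    def bic_weights(bics):
        """Approximate Pr(M_m | Z) from BIC scores: w_m ∝ exp(-BIC_m / 2)."""
        b = np.asarray(bics, dtype=float)
        w = np.exp(-(b - b.min()) / 2)   # shift by min for numerical stability
        return w / w.sum()

    print(bic_weights([102.3, 100.1, 107.8]))   # hypothetical BIC scores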
Frequentist Model Averaging
• Given predictions $\hat{f}_1(x), \ldots, \hat{f}_M(x)$, under squared-error loss we can seek the weights such that
  $\hat{w} = \arg\min_w E_P \Big[ Y - \sum_{m=1}^{M} w_m \hat{f}_m(x) \Big]^2$
• The solution is the population least-squares linear regression of $Y$ on $\hat{F}(x)^T \equiv [\hat{f}_1(x), \ldots, \hat{f}_M(x)]$:
  $\hat{w} = E_P[\hat{F}(x) \hat{F}(x)^T]^{-1} E_P[\hat{F}(x) Y]$
• One can show that combining models never makes things worse, at the population level.
• But the true population density is not available, so we can only apply this to the training set – and that is not going to work well. Imagine the different models have different degrees of freedom. Which will be selected? (The most complex one, which fits the training data best.)
Stacking
• Stacked generalization, or stacking, is a way to get around this problem.
• The stacking weights are given by
  $\hat{w}^{st} = \arg\min_w \sum_{i=1}^{N} \Big[ y_i - \sum_{m=1}^{M} w_m \hat{f}_m^{-i}(x_i) \Big]^2$,
  where $\hat{f}_m^{-i}$ is the estimate of model $m$ fit without the $i$th example.
• The final predictor is $\sum_{m=1}^{M} \hat{w}_m^{st} \hat{f}_m(x)$.
• There is a close connection with leave-one-out cross-validation.
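A minimal stacking sketch (the two base models, the brute-force leave-one-out loop, and the unconstrained least-squares weights are all illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    N = 60
    X = rng.uniform(size=(N, 1))
    y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=N)

    models = [LinearRegression(), DecisionTreeRegressor(max_depth=3)]
    M = len(models)

    # Leave-one-out predictions f_m^{-i}(x_i), computed by brute force.
    F = np.empty((N, M))
    for i in range(N):
        keep = np.arange(N) != i
        for m, model in enumerate(models):
            F[i, m] = model.fit(X[keep], y[keep]).predict(X[i:i+1])[0]

    # Stacking weights: least squares of y on the LOO predictions.
    w_st, *_ = np.linalg.lstsq(F, y, rcond=None)

    # Final predictor: sum_m w_m * f_m(x), each model refit on all data.
    full_preds = np.column_stack([m.fit(X, y).predict(X) for m in models])
    y_hat = full_preds @ w_st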
One more…
Bumping
• Sometimes models get stuck in local minima.
• Example: trees found by greedily splitting along one axis at a time.
Bumping (cont’d)
• Bumping is like bagging, but instead of averaging, pick the single model that does best on the original data (see the sketch below):
  • Select B bootstrap samples $Z^{*b}$, $b = 1, \ldots, B$, and fit models $\hat{f}^{*b}(x)$, $b = 1, \ldots, B$.
  • Then choose the model that gives the best fit over all the data:
    $\hat{b} = \arg\min_b \sum_{i=1}^{N} \big[y_i - \hat{f}^{*b}(x_i)\big]^2$
• This fixes two problems:
  • Pathological points
  • The need to ignore some of the data to get started well…
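A minimal bumping sketch (DecisionTreeRegressor and the toy data are assumed stand-ins; note the score is computed on the original data, not the bootstrap sample):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    N, B = 80, 20
    X = rng.uniform(size=(N, 1))
    y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=N)

    best_err, best_model = np.inf, None
    for b in range(B):
        idx = rng.integers(0, N, size=N)                 # bootstrap sample
        model = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
        err = np.mean((y - model.predict(X)) ** 2)       # score on ORIGINAL data
        if err < best_err:
            best_err, best_model = err, model            # keep best single fit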
Bumping with trees
With only 20 bumped bootstrap samples!
References
• Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning (ch. 7 and 8).