CS 59000 Statistical Machine Learning, Lecture 7
Yuan (Alan) Qi, Purdue CS
Sept. 16, 2008
Acknowledgement: Sargur Srihari's slides
Outline
Review of noninformative priors, nonparametric methods, and nonlinear basis functions
Regularized regression
Bayesian regression
Equivalent kernel
Model comparison
The Exponential Family (1)
The exponential family of distributions over x, given parameter η, is

p(x|η) = h(x) g(η) exp{η^T u(x)},

where η is the natural parameter and g(η) satisfies the normalization condition

g(η) ∫ h(x) exp{η^T u(x)} dx = 1,

so g(η) can be interpreted as a normalization coefficient.
Property of Normalization Coefficient
From the definition of g(η),

g(η) ∫ h(x) exp{η^T u(x)} dx = 1,

taking the gradient with respect to η gives

∇g(η) ∫ h(x) exp{η^T u(x)} dx + g(η) ∫ h(x) exp{η^T u(x)} u(x) dx = 0.

Thus

−∇ ln g(η) = E[u(x)].
Conjugate priors
For any member of the exponential family p(x|η) = h(x) g(η) exp{η^T u(x)}, there exists a conjugate prior

p(η|χ, ν) = f(χ, ν) g(η)^ν exp{ν η^T χ}.

Combining with the likelihood function, we get a posterior of the same functional form,

p(η|X, χ, ν) ∝ g(η)^(ν+N) exp{η^T (Σ_n u(x_n) + ν χ)}.

The prior corresponds to ν pseudo-observations with value χ.
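The pseudo-observation view can be made concrete with the Bernoulli distribution, a member of the exponential family whose conjugate prior is the Beta. A minimal sketch (the Beta-Bernoulli pairing and the function name are illustrative, not from the slides):

```python
# Beta(a, b) behaves like pseudo-counts of roughly (a - 1) heads
# and (b - 1) tails; updating with real data just adds counts.

def posterior(a, b, data):
    """Update a Beta(a, b) prior with a list of 0/1 observations."""
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails   # pseudo-counts add to observed counts

# Three flips with two heads turn the flat Beta(1, 1) into Beta(3, 2).
```

The posterior has the same form as the prior, which is exactly what conjugacy guarantees.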
Noninformative Priors (1)
With little or no information available a priori, we might choose a noninformative prior.
• λ discrete with K states: p(λ) = 1/K.
• λ ∈ [a, b], real and bounded: p(λ) = 1/(b − a).
• λ real and unbounded: a constant prior is improper!

Note that a constant prior may no longer be constant after a change of variable. Consider p_λ(λ) constant and λ = η²; then

p_η(η) = p_λ(λ) |dλ/dη| = p_λ(η²) 2η ∝ η,

which is not constant.
Noninformative Priors (2)
Translation-invariant priors. Consider a density of the form

p(x|μ) = f(x − μ),

so that μ is a location parameter. For a corresponding prior over μ, translation invariance requires

∫_A^B p(μ) dμ = ∫_(A−c)^(B−c) p(μ) dμ

for any A and B. Thus p(μ) = p(μ − c), and p(μ) must be constant.
Noninformative Priors (4)
Scale-invariant priors. Consider a density of the form

p(x|σ) = (1/σ) f(x/σ),

so that σ is a scale parameter, and make the change of variable x̃ = x/c. For a corresponding prior over σ, scale invariance requires

∫_A^B p(σ) dσ = ∫_(A/c)^(B/c) p(σ) dσ

for any A and B. Thus p(σ) ∝ 1/σ, and so this prior is improper too. Note that this corresponds to p(ln σ) being constant.
Nonparametric Methods (1)
Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model.
Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.
Nonparametric Methods (2)
Histogram methods partition the data space into distinct bins with widths Δ_i and count the number of observations, n_i, falling in each bin:

p_i = n_i / (N Δ_i).

• Often the same width is used for all bins, Δ_i = Δ.
• Δ acts as a smoothing parameter.
• In a D-dimensional space, using M bins in each dimension requires M^D bins!
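The estimate p_i = n_i / (N Δ) can be sketched in a few lines; here the data are assumed to lie in [0, 1) with equal-width bins (the interval and names are illustrative):

```python
import numpy as np

def histogram_density(data, num_bins):
    data = np.asarray(data)
    delta = 1.0 / num_bins                          # common bin width
    idx = np.minimum((data / delta).astype(int), num_bins - 1)
    counts = np.bincount(idx, minlength=num_bins)   # n_i per bin
    return counts / (len(data) * delta)             # p_i = n_i / (N * delta)
```

Since each p_i is constant over its bin, the estimate integrates to one: Σ_i p_i Δ = 1.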
Nonparametric Methods (3)
Assume observations drawn from a density p(x), and consider a small region R containing x such that

P = ∫_R p(x) dx.

The probability that K out of N observations lie inside R is Bin(K|N, P), and if N is large,

K ≈ N P.

If the volume V of R is sufficiently small, p(x) is approximately constant over R and

P ≈ p(x) V.

Thus

p(x) ≈ K / (N V).
Nonparametric Methods (5)
To avoid the discontinuities of histogram estimates, use a smooth kernel, e.g. a Gaussian:

p(x) = (1/N) Σ_n (1 / (2π h²)^(D/2)) exp(−‖x − x_n‖² / (2h²)).

Any kernel k(u) such that

k(u) ≥ 0 and ∫ k(u) du = 1

will work. The bandwidth h acts as a smoother.
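The Gaussian kernel estimator above can be sketched for D = 1 (the grid and the three data points below are made up for illustration):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate at points x with bandwidth h (1-D sketch)."""
    x = np.asarray(x, dtype=float)[:, None]
    data = np.asarray(data, dtype=float)[None, :]
    norm = 1.0 / np.sqrt(2.0 * np.pi * h ** 2)
    k = norm * np.exp(-(x - data) ** 2 / (2.0 * h ** 2))
    return k.mean(axis=1)   # average of the N kernels centred on the data

density = gaussian_kde(np.linspace(-3, 3, 601), [-1.0, 0.0, 1.0], h=0.3)
```

Because each kernel integrates to one, so does their average, which makes the result a valid density estimate.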
K-Nearest-Neighbours for Classification (1)
Given a data set with N_k data points from class C_k, so that Σ_k N_k = N, grow a sphere of volume V around x until it contains K points, K_k of which belong to class C_k. We then have

p(x|C_k) = K_k / (N_k V)

and correspondingly

p(x) = K / (N V).

Since p(C_k) = N_k / N, Bayes' theorem gives

p(C_k|x) = p(x|C_k) p(C_k) / p(x) = K_k / K.
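The K_k / K rule can be sketched directly for one-dimensional inputs (the toy data and labels below are illustrative):

```python
import numpy as np

def knn_posterior(x, X, y, K, num_classes):
    """p(C_k | x) = K_k / K from the K nearest neighbours (1-D sketch)."""
    dist = np.abs(np.asarray(X, dtype=float) - x)   # distances to x
    nearest = np.argsort(dist)[:K]                  # indices of K nearest
    counts = np.bincount(np.asarray(y)[nearest], minlength=num_classes)
    return counts / K                               # K_k / K

X = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]
post = knn_posterior(0.5, X, y, K=3, num_classes=2)  # all 3 neighbours in class 0
```

Classifying x to the class with the largest posterior, i.e. the most common class among the K neighbours, minimizes the misclassification rate under this estimate.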
K-Nearest-Neighbours for Classification (2)
(Figure: K-nearest-neighbour decision boundaries for K = 1 and K = 3.)
Basis Functions
Examples of Basis Functions (1)
Maximum Likelihood Estimation (1)
Maximum Likelihood Estimation (2)
Sequential Estimation
Regularized Least Squares
More Regularizers
Visualization of Regularized Regression
Bayesian Linear Regression
Posterior Distributions of Parameters
Predictive Posterior Distribution
Examples of Predictive Distribution
Question
Suppose we use Gaussian basis functions.
What will happen to the predictive distribution if we evaluate it at places far from all training data points?
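One way to probe this question numerically is to compute the Bayesian linear regression predictive variance, σ²(x) = 1/β + φ(x)^T S_N φ(x), with localized Gaussian basis functions. This is only a sketch; the precisions α and β, the basis centres, and the synthetic data are assumptions:

```python
import numpy as np

np.random.seed(0)
alpha, beta = 1.0, 25.0                      # assumed prior/noise precisions
centres = np.linspace(-1, 1, 9)              # Gaussian bases cover [-1, 1]

def phi(x):
    return np.exp(-(np.asarray(x, dtype=float)[:, None] - centres) ** 2
                  / (2 * 0.2 ** 2))

X = np.random.uniform(-1, 1, 20)             # training inputs
t = np.sin(2 * np.pi * X) + np.random.normal(0, 0.2, 20)

Phi = phi(X)
S_N = np.linalg.inv(alpha * np.eye(len(centres)) + beta * Phi.T @ Phi)

def pred_var(x):
    p = phi(x)
    return 1.0 / beta + np.sum((p @ S_N) * p, axis=1)

# Far from the data every Gaussian basis function decays to zero,
# so the predictive variance collapses to the noise level 1/beta.
```

That collapse means the model becomes (over)confident far from the training data, since nothing in the basis expansion signals "no data here".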
Equivalent Kernel
Given the posterior mean m_N = β S_N Φ^T t, with S_N^(−1) = α I + β Φ^T Φ, the predictive mean at x is

y(x, m_N) = m_N^T φ(x) = Σ_n k(x, x_n) t_n,

where the equivalent kernel is

k(x, x') = β φ(x)^T S_N φ(x').
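The equivalent kernel can be computed directly from its definition; in this sketch the precisions, basis centres, and training grid are assumptions chosen to make the localization visible:

```python
import numpy as np

alpha, beta = 2.0, 25.0                      # assumed precisions
centres = np.linspace(-1, 1, 12)             # assumed Gaussian basis centres

def phi(x):
    return np.exp(-(np.asarray(x, dtype=float)[:, None] - centres) ** 2
                  / (2 * 0.2 ** 2))

X = np.linspace(-1, 1, 21)                   # training inputs (X[10] = 0)
Phi = phi(X)
S_N = np.linalg.inv(alpha * np.eye(len(centres)) + beta * Phi.T @ Phi)

def equivalent_kernel(x, x_prime):
    # k(x, x') = beta * phi(x)^T S_N phi(x')
    return beta * (phi(x) @ S_N @ phi(x_prime).T)

k = equivalent_kernel(np.array([0.0]), X)[0]  # weights placed on each t_n
```

The weights k(0, x_n) are largest on the training points nearest x = 0, so the predictive mean behaves like a local weighted average of the targets.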
Equivalent kernel
(Figures: equivalent kernels for Gaussian, polynomial, and sigmoidal basis functions.)
Covariance between two predictions
The covariance between the predictions at two points is

cov[y(x), y(x')] = φ(x)^T S_N φ(x') = β^(−1) k(x, x'),

so the predictive means at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller.
Bayesian Model Comparison
Suppose we want to compare models {M_i}. Given a training set D, we compute

p(M_i|D) ∝ p(M_i) p(D|M_i).

Model evidence (also known as marginal likelihood):

p(D|M_i) = ∫ p(D|w, M_i) p(w|M_i) dw.

Bayes factor:

p(D|M_i) / p(D|M_j).
Likelihood, Parameter Posterior & Evidence
Likelihood and evidence: the evidence averages the likelihood over the prior, p(D|M_i) = ∫ p(D|w, M_i) p(w|M_i) dw.

Parameter posterior distribution and evidence: the evidence is the normalizing constant of the parameter posterior, p(w|D, M_i) = p(D|w, M_i) p(w|M_i) / p(D|M_i).
Crude Evidence Approximation
Assume the posterior distribution is sharply peaked around its mode w_MAP with width Δw_posterior, and the prior is flat with width Δw_prior. Then

p(D) = ∫ p(D|w) p(w) dw ≈ p(D|w_MAP) (Δw_posterior / Δw_prior).
Evidence penalizes over-complex models
Given M parameters, each with the same ratio of posterior to prior width, the approximation gives

ln p(D) ≈ ln p(D|w_MAP) + M ln(Δw_posterior / Δw_prior).

The second term is negative (since Δw_posterior < Δw_prior) and grows in magnitude with M, so maximizing the evidence leads to a natural trade-off between data fit and model complexity.
Evidence Approximation & Empirical Bayes

Approximate the predictive distribution by maximizing the marginal likelihood:

p(t|T) ≈ p(t|T, α̂, β̂),

where the hyperparameters α̂, β̂ maximize the evidence p(T|α, β).

This is known as empirical Bayes or type-2 maximum likelihood.
Model Evidence and Cross-Validation
(Figures: root-mean-square error and model evidence as functions of polynomial order, for fitted polynomial regression models.)
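The comparison can be sketched by computing the log evidence of Bayesian polynomial regression for each order. This is a sketch under assumed fixed hyperparameters α and β and made-up sinusoidal data; the closed-form log evidence used is that of Bayesian linear regression with a Gaussian prior:

```python
import numpy as np

def log_evidence(X, t, M, alpha=5e-3, beta=11.1):
    """Log marginal likelihood of a polynomial model with M coefficients."""
    N = len(t)
    Phi = np.vander(X, M, increasing=True)           # 1, x, x^2, ...
    A = alpha * np.eye(M) + beta * Phi.T @ Phi       # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)       # posterior mean
    E = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))

np.random.seed(1)
X = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * X) + np.random.normal(0, 0.3, 10)
scores = [log_evidence(X, t, M) for M in range(1, 10)]
```

Unlike the training error, which keeps shrinking with order, the evidence typically peaks at an intermediate model size, which is the complexity penalty at work.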
Next class
Linear Classification