TRANSCRIPT
Over-fitting and Regularization
Chapter 4 of the textbook; Lectures 11 and 12 on amlbook.com
Over-fitting is easy to recognize in 1D: parabolic target function, 4th-order hypothesis, 5 data points → Ein = 0.
The origin of over-fitting can be analyzed in 1D: the bias/variance dilemma. How does this apply to the case on the previous slide?
The shape of the fit is very sensitive to noise in the data; out-of-sample error will vary greatly from one dataset to another.
Over-fitting is easy to avoid in 1D: results from HW1.
[Figure: sum of squared deviations vs. degree of polynomial, with curves for Eval and Ein]
Using Eval to avoid over-fitting works in all dimensions, but the computation grows rapidly for large d.
[Figure: curves for Ein, Ecv, and Eval]
Digit recognition, one vs. not-one: d = 2 (intensity and symmetry), terms in F5(x) added successively, 500 points in the training set.
The validation set needs to be large: 8798 points in this case.
What if we want to add higher-order terms to a linear model but don't have enough data for a validation set?
Solution: Augment the error function used to optimize weights
Example: augment Ein with a term that penalizes choices with large |w|. This is called "weight decay".
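A plausible explicit form, following the amlbook weight-decay convention (the λ/N scaling is that textbook's; other texts fold the 1/N into λ):

$$E_{aug}(\mathbf{w}) = E_{in}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$$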
Normal equations with weight decay are essentially unchanged:

$$(Z^T Z + \lambda I)\,\mathbf{w}_{reg} = Z^T\mathbf{y}$$
The best value of λ is subjective.
In this case λ = 0.0001 is large enough to suppress swings, but the data are still important in determining the optimum weights.
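A minimal NumPy sketch of solving the weight-decay normal equations (the toy data and variable names are illustrative assumptions; λ = 0.0001 mirrors the slide):

```python
import numpy as np

# Toy 1D data: noisy parabola, fit with a 4th-order polynomial as in the slides
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5)
y = x**2 + 0.1 * rng.standard_normal(5)

# Z: polynomial feature matrix with columns [1, x, x^2, x^3, x^4]
Z = np.vander(x, N=5, increasing=True)

lam = 1e-4  # regularization strength lambda

# Weight-decay normal equations: (Z^T Z + lambda*I) w_reg = Z^T y
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
print(w_reg)
```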
Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
2 classes are distinguished by a threshold value of a linear combination of d attributes. Explain how $h(\mathbf{x}; \mathbf{w}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$ becomes a hypothesis set for linear binary classification.
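A sketch of the standard answer, using the convention of absorbing the threshold into an artificial coordinate $x_0 = 1$:

$$h(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{i=1}^{d} w_i x_i - \text{threshold}\right) = \mathrm{sign}(\mathbf{w}^T\mathbf{x}), \quad \text{with } w_0 = -\text{threshold},\; x_0 = 1.$$

Each weight vector $\mathbf{w} \in \mathbb{R}^{d+1}$ defines one hypothesis, so $\mathcal{H} = \{\,\mathbf{x} \mapsto \mathrm{sign}(\mathbf{w}^T\mathbf{x}) : \mathbf{w} \in \mathbb{R}^{d+1}\,\}$ is the hypothesis set.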
More Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
We have used 1-step optimization in 4 ways:
• polynomial regression in 1D (curve fitting)
• multivariate linear regression
• extending linear models by transformation
• regularization by weight decay
2 of these are equivalent; which ones?
More Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
1-step optimization requires the in-sample error to be the sum of squared residuals. Define the in-sample error for the following:
• multivariate linear regression
• extending linear models by transformation
• regularization by weight decay
For multivariate linear regression:
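A sketch in the textbook's notation (X the N×(d+1) data matrix, Z = Φ(X) the transformed data matrix, y the target vector):

$$E_{in}(\mathbf{w}) = \frac{1}{N}\,\lVert X\mathbf{w} - \mathbf{y}\rVert^2$$

For the extended (transformed) linear model, replace $X$ with $Z$; with weight decay, add the penalty to get $E_{aug}(\mathbf{w}) = \frac{1}{N}\lVert Z\mathbf{w} - \mathbf{y}\rVert^2 + \frac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$.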
Derive the normal equations for extended linear regression with weight decay
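A sketch of the derivation, assuming the augmented error above with $E_{in}(\mathbf{w}) = \frac{1}{N}\lVert Z\mathbf{w} - \mathbf{y}\rVert^2$: setting the gradient to zero,

$$\nabla E_{aug}(\mathbf{w}) = \frac{2}{N}\left(Z^T Z\,\mathbf{w} - Z^T\mathbf{y} + \lambda\,\mathbf{w}\right) = \mathbf{0}$$

which rearranges to the normal equations quoted earlier: $(Z^T Z + \lambda I)\,\mathbf{w}_{reg} = Z^T\mathbf{y}$.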
Interpret the "learning curve" for multivariate linear regression when the training data has normally distributed noise.
• Why does Eout approach σ² from above?
• Why does Ein approach σ² from below?
• Why is Ein not defined for N < d+1?
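For reference, the expected errors quoted in the amlbook treatment of this model, valid for N ≥ d+1, are

$$\mathbb{E}[E_{in}] = \sigma^2\left(1 - \frac{d+1}{N}\right), \qquad \mathbb{E}[E_{out}] = \sigma^2\left(1 + \frac{d+1}{N}\right)$$

so Ein rises to σ² from below while Eout falls to σ² from above as N grows; for N < d+1 the linear system is underdetermined (the data can be fit exactly), and the expressions above do not apply.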
What do these learning curves say about simple vs. complex models?
[Plot annotation: Eout still larger than the bound set by noise]
How do we estimate a good level of complexity without sacrificing training data?
Why choose 3 rather than 4?
Review: Maximum Likelihood Estimation
• Estimate parameters θ of a probability distribution given a sample X drawn from that distribution
Form the likelihood function.
• Likelihood of θ given the sample X: $l(\theta|X) = p(X|\theta) = \prod_t p(x^t|\theta)$
• Log likelihood: $L(\theta|X) = \log l(\theta|X) = \sum_t \log p(x^t|\theta)$
• Maximum likelihood estimator (MLE): $\theta^* = \arg\max_\theta L(\theta|X)$, the value of θ that maximizes L(θ|X)
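A minimal numeric illustration of MLE for a 1D Gaussian (the sample and parameter values are assumptions for the demo, not from the lecture); the closed-form MLEs are the sample mean and the biased sample variance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.5, size=1000)  # sample from N(2, 1.5^2)

# Closed-form MLEs for a 1D Gaussian
mu_hat = X.mean()
sigma2_hat = ((X - mu_hat) ** 2).mean()  # divides by N, not N-1

def log_likelihood(mu, sigma2):
    # L(theta|X) = sum_t log p(x^t|theta) for the Gaussian density
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (X - mu) ** 2 / (2 * sigma2))

# The MLE should score at least as high as nearby parameter choices
print(log_likelihood(mu_hat, sigma2_hat) >= log_likelihood(mu_hat + 0.1, sigma2_hat))
print(mu_hat, sigma2_hat)
```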
How was MLE used in logistic regression to derive an expression for in-sample error?
In logistic regression, the parameters are the weights w.
• Likelihood of w given the sample X: $l(\mathbf{w}|X) = p(X|\mathbf{w}) = \prod_t p(x^t|\mathbf{w})$
• Log likelihood: $L(\mathbf{w}|X) = \log l(\mathbf{w}|X) = \sum_t \log p(x^t|\mathbf{w})$
• In logistic regression, $p(x_n|\mathbf{w}) = \theta(y_n \mathbf{w}^T x_n)$, where $\theta(s) = e^s/(1+e^s)$ is the logistic function
Since log is a monotone increasing function, maximizing log(likelihood) is equivalent to minimizing -log(likelihood). The text also normalizes by dividing by N; hence the error function becomes

$$E_{in}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n \mathbf{w}^T x_n}\right)$$
How?
Derive the log-likelihood function for a 1D Gaussian distribution
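A sketch of the derivation (a standard result, in the notation above): with $p(x^t \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x^t - \mu)^2}{2\sigma^2}\right)$,

$$L(\mu, \sigma^2 \mid X) = \sum_t \log p(x^t \mid \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_t (x^t - \mu)^2.$$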
Given $e(h(x_n), y_n) = \ln\!\left(1 + \exp(-y_n \mathbf{w}^T x_n)\right)$, derive

$$\nabla e(h(x_n), y_n) = \frac{-\,y_n x_n}{1 + \exp(y_n \mathbf{w}^T x_n)}$$
Stochastic gradient descent: correct the weights using the error in each data point.
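A minimal sketch of SGD for logistic regression using the pointwise gradient derived above (the toy data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data; each x has a leading 1 so the bias is folded into w0
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_true = np.array([0.5, 2.0, -1.0])
y = np.sign(X @ w_true + 0.5 * rng.standard_normal(N))  # labels in {-1, +1}

w = np.zeros(d + 1)
eta = 0.1  # learning rate

for epoch in range(100):
    for n in rng.permutation(N):  # visit points in random order
        # Pointwise gradient: -y_n x_n / (1 + exp(y_n w^T x_n))
        grad = -y[n] * X[n] / (1.0 + np.exp(y[n] * (w @ X[n])))
        w -= eta * grad  # correct weights using the error in this one point

print(w)
```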
I want to perform PCA on a dataset. What must I assume about the noise in the data?
PCA
Correlation coefficients of normally distributed attributes x are zero. What can we say about the covariance of x?
More PCA
Attributes x are normally distributed with mean μ and covariance Σ.
z = Mx is a linear transformation to feature space defined by matrix M.
What are the mean and covariance of these features?
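A sketch of the standard answer: by linearity of expectation and the definition of covariance,

$$\mathbb{E}[\mathbf{z}] = M\boldsymbol{\mu}, \qquad \mathrm{Cov}(\mathbf{z}) = \mathbb{E}\big[(M\mathbf{x} - M\boldsymbol{\mu})(M\mathbf{x} - M\boldsymbol{\mu})^T\big] = M\Sigma M^T$$

and z is again normally distributed, since linear transforms of Gaussians are Gaussian.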
More PCA
$z_k$ is the feature defined by projecting the attributes onto the direction of the eigenvector $\mathbf{w}_k$ of the covariance matrix.
Prove that the eigenvalue $\lambda_k$ is the variance of $z_k$.
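A sketch of the proof, assuming unit-length eigenvectors: with $z_k = \mathbf{w}_k^T \mathbf{x}$ and $\Sigma \mathbf{w}_k = \lambda_k \mathbf{w}_k$,

$$\mathrm{Var}(z_k) = \mathbf{w}_k^T \Sigma\, \mathbf{w}_k = \mathbf{w}_k^T (\lambda_k \mathbf{w}_k) = \lambda_k\, \mathbf{w}_k^T \mathbf{w}_k = \lambda_k.$$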
More PCA
How do we find values of x1 and x2 that minimize f(x1, x2) subject to the constraint g(x1, x2) = c?
Constrained optimization
Find the stationary points of $f(x_1, x_2) = 1 - x_1^2 - x_2^2$ subject to the constraint $g(x_1, x_2) = x_1 + x_2 = 1$.
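A worked sketch using a Lagrange multiplier: form $\mathcal{L}(x_1, x_2, \lambda) = f(x_1, x_2) - \lambda\,(g(x_1, x_2) - 1)$ and set its partial derivatives to zero:

$$\frac{\partial \mathcal{L}}{\partial x_1} = -2x_1 - \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial x_2} = -2x_2 - \lambda = 0, \qquad x_1 + x_2 = 1.$$

The first two equations give $x_1 = x_2$, and the constraint then gives $x_1 = x_2 = \tfrac{1}{2}$ (with $\lambda = -1$), so the stationary point is $(\tfrac{1}{2}, \tfrac{1}{2})$, where $f = \tfrac{1}{2}$.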