TRANSCRIPT
Over-fitting and Regularization
Chapter 4 of the textbook; Lectures 11 and 12 on amlbook.com
Over-fitting is easy to recognize in 1D: parabolic target function, 4th-order hypothesis, 5 data points → Ein = 0.
The origin of over-fitting can be analyzed in 1D: the bias/variance dilemma. How does this apply to the case on the previous slide?
The shape of the fit is very sensitive to noise in the data; out-of-sample error will vary greatly from one dataset to another.
Over-fitting is easy to avoid in 1D: results from HW1.
[Figure: sum of squared deviations vs. degree of polynomial, with curves for Eval and Ein]
Using Eval to avoid over-fitting works in all dimensions, but the computation grows rapidly for large d.
[Figure: curves for Ein, Ecv, and Eval]
Digit recognition, one vs. not-one: d = 2 (intensity and symmetry), terms in F5(x) added successively, 500 points in the training set.
The validation set needs to be large: 8798 points in this case.
What if we want to add higher-order terms to a linear model but don't have enough data for a validation set?
Solution: Augment the error function used to optimize weights
Example: augment Ein with a term that penalizes choices with large |w|. This is called "weight decay".
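A plausible explicit form, following the amlbook weight-decay convention (the λ/N scaling is that textbook's; other texts fold the 1/N into λ):

$$E_{aug}(\mathbf{w}) = E_{in}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$$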
Normal equations with weight decay are essentially unchanged:

$$(Z^T Z + \lambda I)\,\mathbf{w}_{reg} = Z^T\mathbf{y}$$
The best value of λ is subjective.
In this case λ = 0.0001 is large enough to suppress swings, but the data are still important in determining the optimum weights.
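A minimal NumPy sketch of solving the weight-decay normal equations (the toy data and variable names are illustrative assumptions; λ = 0.0001 mirrors the slide):

```python
import numpy as np

# Toy 1D data: noisy parabola, fit with a 4th-order polynomial as in the slides
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5)
y = x**2 + 0.1 * rng.standard_normal(5)

# Z: polynomial feature matrix with columns [1, x, x^2, x^3, x^4]
Z = np.vander(x, N=5, increasing=True)

lam = 1e-4  # regularization strength lambda

# Weight-decay normal equations: (Z^T Z + lambda*I) w_reg = Z^T y
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
print(w_reg)
```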
Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
2 classes are distinguished by a threshold value of a linear combination of d attributes. Explain how $h(\mathbf{x}; \mathbf{w}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$ becomes a hypothesis set for linear binary classification.
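A sketch of the standard answer, using the convention of absorbing the threshold into an artificial coordinate $x_0 = 1$:

$$h(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{i=1}^{d} w_i x_i - \text{threshold}\right) = \mathrm{sign}(\mathbf{w}^T\mathbf{x}), \quad \text{with } w_0 = -\text{threshold},\; x_0 = 1.$$

Each weight vector $\mathbf{w} \in \mathbb{R}^{d+1}$ defines one hypothesis, so $\mathcal{H} = \{\,\mathbf{x} \mapsto \mathrm{sign}(\mathbf{w}^T\mathbf{x}) : \mathbf{w} \in \mathbb{R}^{d+1}\,\}$ is the hypothesis set.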
More Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
We have used 1-step optimization in 4 ways:
• polynomial regression in 1D (curve fitting)
• multivariate linear regression
• extending linear models by transformation
• regularization by weight decay
2 of these are equivalent; which ones?
More Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
1-step optimization requires the in-sample error to be the sum of squared residuals. Define the in-sample error for the following:
• multivariate linear regression
• extending linear models by transformation
• regularization by weight decay
For multivariate linear regression:
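A sketch in the textbook's notation (X the N×(d+1) data matrix, Z = Φ(X) the transformed data matrix, y the target vector):

$$E_{in}(\mathbf{w}) = \frac{1}{N}\,\lVert X\mathbf{w} - \mathbf{y}\rVert^2$$

For the extended (transformed) linear model, replace $X$ with $Z$; with weight decay, add the penalty to get $E_{aug}(\mathbf{w}) = \frac{1}{N}\lVert Z\mathbf{w} - \mathbf{y}\rVert^2 + \frac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$.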
Derive the normal equations for extended linear regression with weight decay
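A sketch of the derivation, assuming the augmented error above with $E_{in}(\mathbf{w}) = \frac{1}{N}\lVert Z\mathbf{w} - \mathbf{y}\rVert^2$: setting the gradient to zero,

$$\nabla E_{aug}(\mathbf{w}) = \frac{2}{N}\left(Z^T Z\,\mathbf{w} - Z^T\mathbf{y} + \lambda\,\mathbf{w}\right) = \mathbf{0}$$

which rearranges to the normal equations quoted earlier: $(Z^T Z + \lambda I)\,\mathbf{w}_{reg} = Z^T\mathbf{y}$.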
Interpret the "learning curve" for multivariate linear regression when the training data has normally distributed noise.
• Why does Eout approach σ² from above?
• Why does Ein approach σ² from below?
• Why is Ein not defined for N < d+1?
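For reference, the expected errors quoted in the amlbook treatment of this model, valid for N ≥ d+1, are

$$\mathbb{E}[E_{in}] = \sigma^2\left(1 - \frac{d+1}{N}\right), \qquad \mathbb{E}[E_{out}] = \sigma^2\left(1 + \frac{d+1}{N}\right)$$

so Ein rises to σ² from below while Eout falls to σ² from above as N grows; for N < d+1 the linear system is underdetermined (the data can be fit exactly), and the expressions above do not apply.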
What do these learning curves say about simple vs. complex models?
[Plot annotation: Eout still larger than the bound set by noise]
How do we estimate a good level of complexity without sacrificing training data?
Why choose 3 rather than 4?
Review: Maximum Likelihood Estimation
• Estimate parameters θ of a probability distribution given a sample X drawn from that distribution
Form the likelihood function.
• Likelihood of θ given the sample X: $l(\theta|X) = p(X|\theta) = \prod_t p(x^t|\theta)$
• Log likelihood: $L(\theta|X) = \log l(\theta|X) = \sum_t \log p(x^t|\theta)$
• Maximum likelihood estimator (MLE): $\theta^* = \arg\max_\theta L(\theta|X)$, the value of θ that maximizes L(θ|X)
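A minimal numeric illustration of MLE for a 1D Gaussian (the sample and parameter values are assumptions for the demo, not from the lecture); the closed-form MLEs are the sample mean and the biased sample variance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.5, size=1000)  # sample from N(2, 1.5^2)

# Closed-form MLEs for a 1D Gaussian
mu_hat = X.mean()
sigma2_hat = ((X - mu_hat) ** 2).mean()  # divides by N, not N-1

def log_likelihood(mu, sigma2):
    # L(theta|X) = sum_t log p(x^t|theta) for the Gaussian density
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (X - mu) ** 2 / (2 * sigma2))

# The MLE should score at least as high as nearby parameter choices
print(log_likelihood(mu_hat, sigma2_hat) >= log_likelihood(mu_hat + 0.1, sigma2_hat))
print(mu_hat, sigma2_hat)
```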
How was MLE used in logistic regression to derive an expression for in-sample error?
In logistic regression, the parameters are the weights w.
• Likelihood of w given the sample X: $l(\mathbf{w}|X) = p(X|\mathbf{w}) = \prod_t p(x^t|\mathbf{w})$
• Log likelihood: $L(\mathbf{w}|X) = \log l(\mathbf{w}|X) = \sum_t \log p(x^t|\mathbf{w})$
• In logistic regression, $p(x_n|\mathbf{w}) = \theta(y_n \mathbf{w}^T x_n)$, where $\theta(s) = e^s/(1+e^s)$ is the logistic function
Since log is a monotone increasing function, maximizing log(likelihood) is equivalent to minimizing -log(likelihood). The text also normalizes by dividing by N; hence the error function becomes

$$E_{in}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n \mathbf{w}^T x_n}\right)$$
How?
Derive the log-likelihood function for a 1D Gaussian distribution
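A sketch of the derivation (a standard result, in the notation above): with $p(x^t \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x^t - \mu)^2}{2\sigma^2}\right)$,

$$L(\mu, \sigma^2 \mid X) = \sum_t \log p(x^t \mid \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_t (x^t - \mu)^2.$$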
Given $e(h(x_n), y_n) = \ln\!\left(1 + \exp(-y_n \mathbf{w}^T x_n)\right)$, derive

$$\nabla e(h(x_n), y_n) = \frac{-\,y_n x_n}{1 + \exp(y_n \mathbf{w}^T x_n)}$$
Stochastic gradient descent: correct the weights using the error in each data point.
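A minimal sketch of SGD for logistic regression using the pointwise gradient derived above (the toy data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data; each x has a leading 1 so the bias is folded into w0
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_true = np.array([0.5, 2.0, -1.0])
y = np.sign(X @ w_true + 0.5 * rng.standard_normal(N))  # labels in {-1, +1}

w = np.zeros(d + 1)
eta = 0.1  # learning rate

for epoch in range(100):
    for n in rng.permutation(N):  # visit points in random order
        # Pointwise gradient: -y_n x_n / (1 + exp(y_n w^T x_n))
        grad = -y[n] * X[n] / (1.0 + np.exp(y[n] * (w @ X[n])))
        w -= eta * grad  # correct weights using the error in this one point

print(w)
```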
I want to perform PCA on a dataset. What must I assume about the noise in the data?
PCA
Correlation coefficients of normally distributed attributes x are zero. What can we say about the covariance of x?
More PCA
Attributes x are normally distributed with mean μ and covariance Σ.
z = Mx is a linear transformation to feature space defined by matrix M.
What are the mean and covariance of these features?
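A sketch of the standard answer: by linearity of expectation and the definition of covariance,

$$\mathbb{E}[\mathbf{z}] = M\boldsymbol{\mu}, \qquad \mathrm{Cov}(\mathbf{z}) = \mathbb{E}\big[(M\mathbf{x} - M\boldsymbol{\mu})(M\mathbf{x} - M\boldsymbol{\mu})^T\big] = M\Sigma M^T$$

and z is again normally distributed, since linear transforms of Gaussians are Gaussian.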
More PCA
$z_k$ is the feature defined by projecting the attributes onto the direction of the eigenvector $\mathbf{w}_k$ of the covariance matrix.
Prove that the eigenvalue $\lambda_k$ is the variance of $z_k$.
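A sketch of the proof, assuming unit-length eigenvectors: with $z_k = \mathbf{w}_k^T \mathbf{x}$ and $\Sigma \mathbf{w}_k = \lambda_k \mathbf{w}_k$,

$$\mathrm{Var}(z_k) = \mathbf{w}_k^T \Sigma\, \mathbf{w}_k = \mathbf{w}_k^T (\lambda_k \mathbf{w}_k) = \lambda_k\, \mathbf{w}_k^T \mathbf{w}_k = \lambda_k.$$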
More PCA
How do we find values of x1 and x2 that minimize f(x1, x2) subject to the constraint g(x1, x2) = c?
Constrained optimization
Find the stationary points of $f(x_1, x_2) = 1 - x_1^2 - x_2^2$ subject to the constraint $g(x_1, x_2) = x_1 + x_2 = 1$.
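A worked sketch using a Lagrange multiplier: form $\mathcal{L}(x_1, x_2, \lambda) = f(x_1, x_2) - \lambda\,(g(x_1, x_2) - 1)$ and set its partial derivatives to zero:

$$\frac{\partial \mathcal{L}}{\partial x_1} = -2x_1 - \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial x_2} = -2x_2 - \lambda = 0, \qquad x_1 + x_2 = 1.$$

The first two equations give $x_1 = x_2$, and the constraint then gives $x_1 = x_2 = \tfrac{1}{2}$ (with $\lambda = -1$), so the stationary point is $(\tfrac{1}{2}, \tfrac{1}{2})$, where $f = \tfrac{1}{2}$.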