Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com


Page 1: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Over-fitting and Regularization
Chapter 4 textbook

Lectures 11 and 12 on amlbook.com

Page 2: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Over-fitting is easy to recognize in 1D: parabolic target function, 4th-order hypothesis, 5 data points -> Ein = 0
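A minimal sketch of this effect in code (numpy only; the noise level, seed, and test grid are illustrative choices, not values from the slide):

    import numpy as np

    rng = np.random.default_rng(0)

    # Parabolic target sampled at 5 points with a little noise
    f = lambda x: x**2
    x_train = np.linspace(-1, 1, 5)
    y_train = f(x_train) + 0.1 * rng.standard_normal(5)

    # 4th-order hypothesis: 5 coefficients fit 5 points exactly, so Ein = 0
    w = np.polyfit(x_train, y_train, deg=4)
    e_in = np.mean((np.polyval(w, x_train) - y_train) ** 2)

    # Out-of-sample error against the noiseless target on fresh points
    x_test = np.linspace(-1, 1, 200)
    e_out = np.mean((np.polyval(w, x_test) - f(x_test)) ** 2)

    print(f"Ein = {e_in:.2e}, Eout = {e_out:.2e}")  # Ein ~ 0, Eout much larger

The in-sample error is (numerically) zero because the hypothesis has as many parameters as there are data points, which is exactly the situation pictured on the slide.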

Page 3: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

The origin of over-fitting can be analyzed in 1D: the bias/variance dilemma. How does this apply to the case on the previous slide?

Page 4: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

The shape of the fit is very sensitive to noise in the data. The out-of-sample error will vary greatly from one dataset to another.

Page 5: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Over-fitting is easy to avoid in 1D: results from HW1

[Figure: sum of squared deviations vs. degree of polynomial, comparing Ein and Eval.]

Page 6: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Using Eval to avoid over-fitting works in all dimensions, but the computation grows rapidly for large d.

[Figure: Ein, Ecv, and Eval vs. number of terms in F5(x).]

Digit recognition, one vs. not-one; d = 2 (intensity and symmetry); terms in F5(x) added successively; 500 points in the training set.

The validation set needs to be large: 8798 points in this case.
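A sketch of this model-selection loop in code (the 1D data, feature map, and split sizes below are placeholders for illustration, not the digit-recognition setup from the slide):

    import numpy as np

    def poly_features(x, k):
        # First k+1 polynomial terms of a 1D input: [1, x, x^2, ..., x^k]
        return np.vander(x, k + 1, increasing=True)

    def fit_linear(Z, y):
        # One-step least-squares solution via the pseudo-inverse
        return np.linalg.pinv(Z) @ y

    def mse(Z, y, w):
        return np.mean((Z @ w - y) ** 2)

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, 600)
    y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

    # Hold out part of the data for validation
    x_tr, y_tr = x[:500], y[:500]
    x_val, y_val = x[500:], y[500:]

    # Add terms successively; keep the model with the smallest Eval
    scores = []
    for k in range(1, 11):
        w = fit_linear(poly_features(x_tr, k), y_tr)
        scores.append((mse(poly_features(x_val, k), y_val, w), k))
    best_eval, best_k = min(scores)
    print(f"chosen number of terms: {best_k + 1}, Eval = {best_eval:.3f}")

The price of the method is the data spent on validation, which motivates the regularization approach on the next slide.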

Page 7: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

What if we want to add higher-order terms to a linear model but don’t have enough data for a validation set?

Solution: Augment the error function used to optimize weights

Example: the augmented error Eaug(w) = Ein(w) + (λ/N) wTw

This penalizes choices with large |w| and is called “weight decay”.

Page 8: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

The normal equations with weight decay are essentially unchanged:

(ZTZ + λI) wreg = ZTy
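A minimal numerical sketch of this solve (numpy only; the feature map, data, and the comparison value λ = 0 are illustrative):

    import numpy as np

    def weight_decay_fit(Z, y, lam):
        # Solve (Z^T Z + lambda*I) w_reg = Z^T y
        d = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

    # Tiny illustration: 1D inputs mapped to [1, x, x^2, x^3, x^4]
    rng = np.random.default_rng(2)
    x = rng.uniform(-1, 1, 20)
    y = x**2 + 0.1 * rng.standard_normal(x.size)
    Z = np.vander(x, 5, increasing=True)

    w_plain = weight_decay_fit(Z, y, lam=0.0)     # ordinary least squares
    w_reg = weight_decay_fit(Z, y, lam=0.0001)    # weight decay shrinks |w|
    print(np.round(w_plain, 3))
    print(np.round(w_reg, 3))

With λ = 0 the call reduces to the ordinary normal equations, which is why the slide describes them as essentially unchanged.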

Page 9: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

The best value of λ is subjective.

In this case λ = 0.0001 is large enough to suppress swings, but the data are still important in determining the optimum weights.

Page 10: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com
Page 11: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Review for Quiz 2

Topics:
linear models
extending linear models by transformation
dimensionality reduction
over-fitting and regularization

2 classes are distinguished by a threshold value of a linear combination of d attributes. Explain how h(w|x) = sign(wTx) becomes a hypothesis set for linear binary classification.
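A sketch of the construction in code (the attribute values and weights below are made up for illustration):

    import numpy as np

    def h(w, X):
        # Prepend the constant coordinate x0 = 1 so the threshold becomes weight w0,
        # then classify each point by the sign of the linear combination w^T x.
        X1 = np.column_stack([np.ones(len(X)), X])
        return np.sign(X1 @ w)

    X = np.array([[2.0, 1.0], [0.5, -1.0], [-1.0, 0.2]])  # three points, d = 2 attributes
    w = np.array([-0.5, 1.0, 0.3])                        # [w0, w1, w2]
    print(h(w, X))

Each choice of the weight vector w gives one hypothesis; letting w range over all real vectors gives the hypothesis set for linear binary classification.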

Page 12: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

More Review for Quiz 2

Topics:
linear models
extending linear models by transformation
dimensionality reduction
over-fitting and regularization

We have used 1-step optimization in 4 ways:
polynomial regression in 1D (curve fitting)
multivariate linear regression
extending linear models by transformation
regularization by weight decay

2 of these are equivalent; which ones?

Page 13: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

More Review for Quiz 2

Topics:
linear models
extending linear models by transformation
dimensionality reduction
over-fitting and regularization

1-step optimization requires the in-sample error to be the sum of squared residuals. Define the in-sample error for each of the following (sketched after the list):

multivariate linear regression
extending linear models by transformation
regularization by weight decay
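For reference, a sketch of these in-sample errors in LaTeX, using the notation of the earlier slides (N data points, transformed feature vectors zn collected as rows of the matrix Z, weight-decay parameter λ; the augmented error is the one assumed on the weight-decay slides):

    E_{\mathrm{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\bigl(\mathbf{w}^{\mathsf T}\mathbf{x}_n - y_n\bigr)^{2}
    \quad\text{(multivariate linear regression)}

    E_{\mathrm{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\bigl(\mathbf{w}^{\mathsf T}\mathbf{z}_n - y_n\bigr)^{2}
    = \frac{1}{N}\lVert Z\mathbf{w}-\mathbf{y}\rVert^{2}
    \quad\text{(extended linear model)}

    E_{\mathrm{aug}}(\mathbf{w}) = E_{\mathrm{in}}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^{\mathsf T}\mathbf{w}
    \quad\text{(regularization by weight decay)}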

Page 14: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

For multivariate linear regression

Derive the normal equations for extended linear regression with weight decay
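A sketch of that derivation in LaTeX, starting from the augmented error assumed above:

    E_{\mathrm{aug}}(\mathbf{w}) = \frac{1}{N}\lVert Z\mathbf{w}-\mathbf{y}\rVert^{2} + \frac{\lambda}{N}\,\mathbf{w}^{\mathsf T}\mathbf{w}

    \nabla E_{\mathrm{aug}}(\mathbf{w}) = \frac{2}{N}\bigl(Z^{\mathsf T}Z\mathbf{w} - Z^{\mathsf T}\mathbf{y} + \lambda\mathbf{w}\bigr) = \mathbf{0}

    \Longrightarrow\quad \bigl(Z^{\mathsf T}Z + \lambda I\bigr)\mathbf{w}_{\mathrm{reg}} = Z^{\mathsf T}\mathbf{y}

Setting the gradient to zero recovers exactly the equation on Page 8.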

Page 15: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Interpret the “learning curve” for multivariate linear regression when training data has normally distributed noise

• Why does Eout approach σ² from above?
• Why does Ein approach σ² from below?
• Why is Ein not defined for N < d+1?
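For reference, the expected errors commonly quoted for this learning curve (e.g., in the amlbook analysis of linear regression with noise variance σ²), stated here without proof:

    \mathbb{E}\!\left[E_{\mathrm{in}}\right] = \sigma^{2}\left(1 - \frac{d+1}{N}\right),
    \qquad
    \mathbb{E}\!\left[E_{\mathrm{out}}\right] \approx \sigma^{2}\left(1 + \frac{d+1}{N}\right)

Both tend to σ² as N grows, one from below and one from above. For N < d+1 the least-squares problem has more parameters than equations, so the fit can interpolate the data and the expression for the expected in-sample error no longer applies.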

Page 16: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

What do these learning curves say about simple vs. complex models?

Still larger than the bound set by noise.

Page 17: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

How do we estimate a good level of complexity without sacrificing training data?

Page 18: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Why choose 3 rather than 4?

Page 19: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Review: Maximum Likelihood Estimation

• Estimate parameters θ of a probability distribution given a sample X drawn from that distribution

Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e © 2010 The MIT Press (V1.0)

Page 20: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Form the likelihood function

• Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏t p(xt|θ)

• Log likelihood: L(θ|X) = log(l(θ|X)) = ∑t log p(xt|θ)

• Maximum likelihood estimator (MLE): θ* = argmaxθ L(θ|X), the value of θ that maximizes L(θ|X)

Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e © 2010 The MIT Press (V1.0)

Page 21: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

How was MLE used in logistic regression to derive an expression for in-sample error?

Page 22: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

In logistic regression, the parameters are the weights.

Likelihood of w given the sample X: l(w|X) = p(X|w) = ∏t p(xt|w)

Log likelihood: L(w|X) = log(l(w|X)) = ∑t log p(xt|w)

In logistic regression, p(xn|w) = θ(yn wTxn), where θ is the logistic function.

Page 23: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Since log is a monotone increasing function, maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. The text also normalizes by dividing by N; hence the error function becomes Ein(w) = (1/N) ∑n ln(1 + exp(-yn wTxn)).

How?

Page 24: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Derive the log-likelihood function for a 1D Gaussian distribution
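A sketch of the derivation in LaTeX, assuming N i.i.d. samples x^t from a Gaussian with mean μ and variance σ²:

    l(\mu,\sigma^{2}\mid X) = \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(x^{t}-\mu)^{2}}{2\sigma^{2}}\right)

    \mathcal{L}(\mu,\sigma^{2}\mid X) = \log l(\mu,\sigma^{2}\mid X)
    = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^{2} - \frac{1}{2\sigma^{2}}\sum_{t=1}^{N}\bigl(x^{t}-\mu\bigr)^{2}

Setting the partial derivatives with respect to μ and σ² to zero gives the familiar estimators: the sample mean and the (biased) sample variance.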

Page 25: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Given e_in(h(xn), yn) = ln(1 + exp(-yn wTxn)), derive

∇e_in(w) = -yn xn / (1 + exp(yn wTxn))

Stochastic gradient descent: correct the weights using the error from each data point.
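A compact sketch of this update in code (the learning rate, number of passes, and toy data are illustrative choices, not values from the lecture):

    import numpy as np

    def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
        # SGD on Ein(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for n in rng.permutation(N):                  # visit points in random order
                # single-point gradient: -y_n x_n / (1 + exp(y_n w^T x_n))
                grad = -y[n] * X[n] / (1.0 + np.exp(y[n] * (w @ X[n])))
                w -= eta * grad                           # correct weights using this point
        return w

    # Toy data: two blobs with labels in {-1, +1}, plus a constant coordinate x0 = 1
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
    X = np.column_stack([np.ones(len(X)), X])
    y = np.r_[-np.ones(50), np.ones(50)]

    w = sgd_logistic(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w) == y))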

Page 26: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

I want to perform PCA on a dataset. What must I assume about the noise in the data?

PCA

Page 27: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

The correlation coefficients of normally distributed attributes x are zero. What can we say about the covariance matrix of x?

More PCA

Page 28: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

Attributes x are normally distributed with mean μ and covariance Σ.

z = Mx is a linear transformation to feature space defined by matrix M.

What are the mean and covariance of these features?

More PCA
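For reference, the standard result for a linear transformation of a Gaussian, in LaTeX (stated without proof):

    \mathbf{z} = M\mathbf{x},
    \qquad
    \mathbb{E}[\mathbf{z}] = M\boldsymbol{\mu},
    \qquad
    \operatorname{Cov}(\mathbf{z}) = M\,\Sigma\,M^{\mathsf T}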

Page 29: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

zk is the feature defined by projecting the attributes onto the direction of the eigenvector wk of the covariance matrix.

Prove that the eigenvalue λk is the variance of zk.

Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e © 2010 The MIT Press (V1.0)

More PCA
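A proof sketch in LaTeX, assuming centered data and unit-length eigenvectors (Σwk = λk wk, wkᵀwk = 1):

    \operatorname{Var}(z_k) = \operatorname{Var}\bigl(\mathbf{w}_k^{\mathsf T}\mathbf{x}\bigr)
    = \mathbf{w}_k^{\mathsf T}\,\Sigma\,\mathbf{w}_k
    = \mathbf{w}_k^{\mathsf T}\bigl(\lambda_k \mathbf{w}_k\bigr)
    = \lambda_k\,\mathbf{w}_k^{\mathsf T}\mathbf{w}_k
    = \lambda_k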

Page 30: Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com

How do we find values of x1 and x2 that minimize f(x1, x2) subject to the constraint g(x1, x2) = c?

Constrained optimization

Find the stationary points of f(x1, x2) = 1 - x1² - x2² subject to the constraint g(x1, x2) = x1 + x2 = 1.
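A worked sketch of this example with a Lagrange multiplier, in LaTeX:

    \mathcal{L}(x_1, x_2, \lambda) = 1 - x_1^{2} - x_2^{2} - \lambda\,(x_1 + x_2 - 1)

    \frac{\partial\mathcal{L}}{\partial x_1} = -2x_1 - \lambda = 0,
    \qquad
    \frac{\partial\mathcal{L}}{\partial x_2} = -2x_2 - \lambda = 0,
    \qquad
    x_1 + x_2 = 1

    \Longrightarrow\quad x_1 = x_2 = \tfrac{1}{2},\quad \lambda = -1,\quad f\bigl(\tfrac{1}{2},\tfrac{1}{2}\bigr) = \tfrac{1}{2}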