
Page 1

Lecture 10: Linear Regression

Madhavan Mukund

https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning, August–December 2020

Page 2

Predicting numerical values

Data about housing prices

Predict house price from living area

Scatterplot corresponding to the data

Fit a function to the points


Page 3

Linear predictors

A richer set of input data

Simplest case: fit a linear function with parameters $\theta = (\theta_0, \theta_1, \theta_2)$

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$

Input $x$ may have $k$ features $(x_1, x_2, \ldots, x_k)$

By convention, add a dummy feature $x_0 = 1$

For $k$ input features,

$$h_\theta(x) = \sum_{i=0}^{k} \theta_i x_i$$
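A minimal NumPy sketch of this hypothesis function (the function name and example values are illustrative, not from the slides):

    import numpy as np

    def h(theta, x):
        # Linear hypothesis h_theta(x) = sum_{i=0}^{k} theta_i * x_i,
        # where x already includes the dummy feature x_0 = 1.
        return np.dot(theta, x)

    theta = np.array([1.0, 0.5, -2.0])   # (theta_0, theta_1, theta_2)
    x = np.array([1.0, 3.0, 4.0])        # x_0 = 1, x_1 = 3, x_2 = 4
    print(h(theta, x))                   # 1.0 + 0.5*3.0 - 2.0*4.0 = -5.5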


Page 4

Finding the best fit line

Training input is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$

Each input $x_i$ is a vector $(x_i^1, \ldots, x_i^k)$

Add $x_i^0 = 1$ by convention

$y_i$ is the actual output

How far away is our prediction $h_\theta(x_i)$ from the true answer $y_i$?

Define a cost (loss) function

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \bigl(h_\theta(x_i) - y_i\bigr)^2$$

Essentially, the sum squared error (SSE)

Dividing by $n$ gives the mean squared error (MSE)
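A small sketch of both costs, assuming NumPy and the convention that each input row starts with the dummy feature $x_0 = 1$ (function names and data are illustrative):

    import numpy as np

    def sse_cost(theta, X, y):
        # J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2, looping over samples.
        total = 0.0
        for x_i, y_i in zip(X, y):
            residual = np.dot(theta, x_i) - y_i
            total += residual ** 2
        return 0.5 * total

    def mse_cost(theta, X, y):
        # Mean squared error: the sum of squared errors divided by n.
        residuals = X @ theta - y
        return np.mean(residuals ** 2)

    X = np.array([[1.0, 2.0], [1.0, 4.0]])   # two samples, one feature plus dummy
    y = np.array([3.0, 7.0])
    theta = np.array([0.0, 1.0])
    print(sse_cost(theta, X, y), mse_cost(theta, X, y))   # 5.0 and 5.0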

Page 5

Minimizing SSE

Write $x_i$ as a row vector $\begin{bmatrix} 1 & x_i^1 & \cdots & x_i^k \end{bmatrix}$

$$X = \begin{bmatrix} 1 & x_1^1 & \cdots & x_1^k \\ 1 & x_2^1 & \cdots & x_2^k \\ & \cdots & \\ 1 & x_i^1 & \cdots & x_i^k \\ & \cdots & \\ 1 & x_n^1 & \cdots & x_n^k \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \cdots \\ y_i \\ \cdots \\ y_n \end{bmatrix}$$

Write $\theta$ as a column vector, $\theta^T = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_k \end{bmatrix}$

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \bigl(h_\theta(x_i) - y_i\bigr)^2 = \frac{1}{2} (X\theta - y)^T (X\theta - y)$$

Minimize $J(\theta)$ — set $\nabla_\theta\, J(\theta) = 0$
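A sketch of the vectorized form of $J(\theta)$ in NumPy; the placeholder data and the way $X$ is built with a column of ones are illustrative, not from the slides:

    import numpy as np

    def sse_cost_vectorized(theta, X, y):
        # J(theta) = 1/2 * (X theta - y)^T (X theta - y)
        r = X @ theta - y
        return 0.5 * (r @ r)

    # Build X by prepending a column of ones to the raw feature matrix.
    features = np.array([[2.0, 3.0], [1.0, 5.0], [4.0, 1.0]])   # n = 3, k = 2
    y = np.array([1.0, 2.0, 3.0])
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    theta = np.zeros(X.shape[1])
    print(sse_cost_vectorized(theta, X, y))    # 0.5 * (1 + 4 + 9) = 7.0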


Page 6

Minimizing SSE

$$J(\theta) = \frac{1}{2}(X\theta - y)^T (X\theta - y)$$

$$\nabla_\theta\, J(\theta) = \nabla_\theta\, \frac{1}{2}(X\theta - y)^T (X\theta - y)$$

To minimize, set $\nabla_\theta\, \frac{1}{2}(X\theta - y)^T (X\theta - y) = 0$

Expand, $\frac{1}{2}\nabla_\theta\, (\theta^T X^T X\theta - y^T X\theta - \theta^T X^T y + y^T y) = 0$

Check that $y^T X\theta = \theta^T X^T y = \sum_{i=1}^{n} h_\theta(x_i)\, y_i$

Combining terms, $\frac{1}{2}\nabla_\theta\, (\theta^T X^T X\theta - 2\,\theta^T X^T y + y^T y) = 0$

After differentiating, $X^T X\theta - X^T y = 0$

Solve to get normal equation, $\theta = (X^T X)^{-1} X^T y$
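One possible way to compute this in NumPy. Rather than forming the inverse explicitly, the sketch below solves the linear system $X^T X\,\theta = X^T y$; the data and function name are made up for illustration:

    import numpy as np

    def fit_normal_equation(X, y):
        # Solve (X^T X) theta = X^T y instead of computing (X^T X)^{-1} explicitly:
        # cheaper and more numerically stable, but X^T X must still be invertible.
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Noise-free check: the fitted theta should recover the true parameters.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(50, 2))
    X = np.hstack([np.ones((50, 1)), features])
    true_theta = np.array([1.0, 2.0, -3.0])
    y = X @ true_theta
    print(fit_normal_equation(X, y))           # approximately [ 1.  2. -3.]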


Pages 7–10

Minimizing SSE iteratively

Normal equation $\theta = (X^T X)^{-1} X^T y$ is a closed-form solution

Computational challenges

Slow if $n$ is large, say $n > 10^4$

Matrix inversion $(X^T X)^{-1}$ is expensive; we also need $X^T X$ to be invertible

Iterative approach: make an initial guess

Keep adjusting the line to reduce SSE

Stop when we find the best fit line

How do we adjust the line?


Page 11

Gradient descent

How does cost vary with parameters $\theta = (\theta_0, \theta_1, \ldots, \theta_k)$?

Gradients $\dfrac{\partial}{\partial\theta_i}\, J(\theta)$

Adjust each parameter against gradient

$$\theta_i = \theta_i - \alpha\, \frac{\partial}{\partial\theta_i}\, J(\theta)$$

For a single training sample $(x, y)$,

$$\frac{\partial}{\partial\theta_i}\, J(\theta) = \frac{\partial}{\partial\theta_i}\, \frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2 = 2 \cdot \frac{1}{2}\bigl(h_\theta(x) - y\bigr)\, \frac{\partial}{\partial\theta_i}\bigl(h_\theta(x) - y\bigr)$$

$$= \bigl(h_\theta(x) - y\bigr)\, \frac{\partial}{\partial\theta_i}\Bigl(\sum_{j=0}^{k} \theta_j x_j - y\Bigr) = \bigl(h_\theta(x) - y\bigr)\cdot x_i$$
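A quick numerical sanity check of this result, assuming NumPy: the analytic expression $(h_\theta(x) - y)\,x_i$ is compared against a central finite-difference estimate (all names and values are illustrative):

    import numpy as np

    def cost(theta, x, y):
        # J(theta) = 1/2 * (h_theta(x) - y)^2 for a single sample
        return 0.5 * (np.dot(theta, x) - y) ** 2

    def grad(theta, x, y):
        # Analytic gradient: (h_theta(x) - y) * x_i in coordinate i
        return (np.dot(theta, x) - y) * x

    theta = np.array([0.5, -1.0, 2.0])
    x = np.array([1.0, 3.0, -2.0])     # includes dummy feature x_0 = 1
    y = 4.0

    eps = 1e-6
    numeric = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        numeric[i] = (cost(theta + d, x, y) - cost(theta - d, x, y)) / (2 * eps)
    print(np.allclose(numeric, grad(theta, x, y)))    # True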


Page 12

Gradient descent

For a single training sample $(x, y)$, $\dfrac{\partial}{\partial\theta_i}\, J(\theta) = (h_\theta(x) - y)\cdot x_i$

Over the entire training set, $\dfrac{\partial}{\partial\theta_i}\, J(\theta) = \displaystyle\sum_{j=1}^{n} (h_\theta(x_j) - y_j)\cdot x_j^i$

Batch gradient descent

Compute $h_\theta(x_j)$ for the entire training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$

Adjust each parameter

$$\theta_i = \theta_i - \alpha\, \frac{\partial}{\partial\theta_i}\, J(\theta) = \theta_i - \alpha \sum_{j=1}^{n} (h_\theta(x_j) - y_j)\cdot x_j^i$$

Repeat until convergence

Stochastic gradient descent

For each input $x_j$, compute $h_\theta(x_j)$

Adjust each parameter — $\theta_i = \theta_i - \alpha\,(h_\theta(x_j) - y_j)\cdot x_j^i$

Pros and cons

Faster progress than batch gradient descent when the training set (the full batch) is large

May oscillate indefinitely
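A sketch of both update rules in NumPy, under the assumptions that the step size $\alpha$ is small enough to converge and that $X$ already carries the dummy column of ones; function names, learning rates, and the synthetic data are illustrative:

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.005, n_iters=1000):
        # Each step uses the gradient summed over the whole training set:
        # theta_i <- theta_i - alpha * sum_j (h_theta(x_j) - y_j) * x_j^i
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            theta -= alpha * X.T @ (X @ theta - y)
        return theta

    def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50):
        # Each step uses a single sample:
        # theta_i <- theta_i - alpha * (h_theta(x_j) - y_j) * x_j^i
        theta = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_j, y_j in zip(X, y):
                theta -= alpha * (np.dot(theta, x_j) - y_j) * x_j
        return theta

    # Both should come close to the normal-equation solution on synthetic data.
    rng = np.random.default_rng(1)
    features = rng.normal(size=(100, 2))
    X = np.hstack([np.ones((100, 1)), features])
    y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)
    print(batch_gradient_descent(X, y))
    print(stochastic_gradient_descent(X, y))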


Page 13

Regression and SSE loss

Training input is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$

Noisy outputs from a linear function: $y_i = \theta^T x_i + \varepsilon$

$\varepsilon \sim \mathcal{N}(0, \sigma^2)$: Gaussian noise, mean 0, fixed variance $\sigma^2$

$y_i \sim \mathcal{N}(\mu_i, \sigma^2)$, $\mu_i = \theta^T x_i$

The model gives us an estimate for $\theta$, so regression learns $\mu_i$ for each $x_i$

Want Maximum Likelihood Estimator (MLE) — maximize

$$L(\theta) = \prod_{i=1}^{n} P(y_i \mid x_i; \theta)$$

Instead, maximize log likelihood

$$\ell(\theta) = \log\left(\prod_{i=1}^{n} P(y_i \mid x_i; \theta)\right) = \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)$$


Page 14

Log likelihood and SSE loss

$y_i \sim \mathcal{N}(\mu_i, \sigma^2)$, so

$$P(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \mu_i)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}}$$

Log likelihood (assuming natural logarithm)

$$\ell(\theta) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}}\right) = n \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \sum_{i=1}^{n} \frac{(y_i - \theta^T x_i)^2}{2\sigma^2}$$

To maximize $\ell(\theta)$ with respect to $\theta$, ignore all terms that do not depend on $\theta$

Optimum value of $\theta$ is given by

$$\hat{\theta}_{\mathrm{MSE}} = \arg\max_\theta \left[ -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \right] = \arg\min_\theta \left[ \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \right]$$

Assuming the data points are generated by a linear function and then perturbed by Gaussian noise, SSE is the “correct” loss function: minimizing it maximizes the likelihood
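A small numerical illustration of this point, assuming NumPy and a made-up one-parameter model: on a grid of candidate $\theta$ values, the $\theta$ that minimizes SSE is exactly the $\theta$ that maximizes the Gaussian log likelihood:

    import numpy as np

    # Simulate y_i = theta * x_i + Gaussian noise for a one-parameter model.
    rng = np.random.default_rng(2)
    sigma = 0.5
    x = rng.uniform(0.0, 5.0, size=200)
    y = 1.7 * x + rng.normal(0.0, sigma, size=200)

    def sse(theta):
        return np.sum((y - theta * x) ** 2)

    def log_likelihood(theta):
        # n * log(1/sqrt(2*pi*sigma^2)) - SSE / (2*sigma^2)
        n = len(y)
        return n * np.log(1.0 / np.sqrt(2 * np.pi * sigma ** 2)) - sse(theta) / (2 * sigma ** 2)

    thetas = np.linspace(1.0, 2.5, 1501)
    print(thetas[np.argmin([sse(t) for t in thetas])])             # SSE minimizer ...
    print(thetas[np.argmax([log_likelihood(t) for t in thetas])])  # ... equals the likelihood maximizer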


Page 15

Summary

Regression finds the best fit line through a set of points

Measure goodness of fit using sum squared error (SSE)

Justification: MLE, assuming the data points come from a linear function perturbed by Gaussian noise

Normal equation gives a direct solution in terms of matrix operations

Gradient descent provides an iterative solution

Batch gradient descent vs stochastic gradient descent
