Introduction to Machine Learning: Linear Regression
Bhaskar Mukhoty, Shivam Bansal
Indian Institute of Technology Kanpur, Summer School 2019
May 29, 2019
Introduction
Machine learning is a tool to extend human intelligence.
Problems involving large amounts of real-time data need the assistance of computers to process, e.g., astronomical / sub-particle-level data, fraud detection.
It is used for automating tasks, e.g., self-driving cars.
Future Applications
Potential future uses in law.
Fun to see top players getting defeated by simple algorithms, e.g., Go.
Health care.
Course Goals
We will focus on the theoretical understanding of common machine learning algorithms.
The mathematical tools needed will be introduced.
Students are encouraged to ask questions, to clarify even the simplest doubts.
Course Policy
Attendance will be taken in class.
Passing a 1-hour written exam is required for successful completion of the course.
Course website: https://www.cse.iitk.ac.in/users/bhaskarm/IntroToML.html
Lecture Outline
System of linear equations.
Overdetermined systems with one or no solution.
Addressing the case of no solution.
Ordinary least squares regression.
The maximum likelihood estimate and consistency.
The underdetermined system and rank deficiency.
Ridge regression.
System of linear equations
Given X ∈ R^(n×d), y ∈ R^n and n ≥ d:

    [x11 x12 .. x1d]   [w1]   [y1]
    [x21 x22 .. x2d] · [w2] = [y2]
    [ ·   ·      · ]   [ ·]   [ ·]
    [xn1 xn2 .. xnd]   [wd]   [yn]

that is, y_i = x_i^T w for i = 1, ..., n.

Question: find w ∈ R^d, e.g., by Gaussian elimination.

Figure: Over-determined system
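A minimal numpy sketch (not from the slides; the matrix and true w are made-up values) showing that a consistent system with n ≥ d is solved exactly by least squares:

```python
# Build a consistent over-determined system y = Xw and recover w.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))  # n equations, d unknowns, n >= d
w_true = rng.standard_normal(d)
y = X @ w_true                   # y lies in the column space of X

# For a consistent system, the least-squares solution is the exact one.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_true))    # True
```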
Example: An Overdetermined System
w2 + w3 = y1
w2 + w4 = y2
w1 + w2 = y3
w3 + w4 = y4
w1 + w3 = y5
w1 + w4 = y6
    [0 1 1 0]          [y1]
    [0 1 0 1]   [w1]   [y2]
    [1 1 0 0] · [w2] = [y3]
    [0 0 1 1]   [w3]   [y4]
    [1 0 1 0]   [w4]   [y5]
    [1 0 0 1]          [y6]
Example: An Overdetermined System

Augmented matrix:

    [0 1 1 0 | y1]
    [0 1 0 1 | y2]
    [1 1 0 0 | y3]
    [0 0 1 1 | y4]
    [1 0 1 0 | y5]
    [1 0 0 1 | y6]

Row reduction gives:

    [1 0 0 0 | −1/2 y1 − 1/2 y2 + y3 + 1/2 y4]
    [0 1 0 0 |  1/2 y1 + 1/2 y2 − 1/2 y4]
    [0 0 1 0 |  1/2 y1 − 1/2 y2 + 1/2 y4]
    [0 0 0 1 | −1/2 y1 + 1/2 y2 + 1/2 y4]
    [0 0 0 0 |  y2 + y5 − y3 − y4]
    [0 0 0 0 |  y1 + y6 − y3 − y4]

One solution vs. no solution:
The system of equations has a solution only if:

    y2 + y5 − y3 − y4 = 0
    y1 + y6 − y3 − y4 = 0

(Adding equations 2 and 5 gives w1 + w2 + w3 + w4 = y2 + y5, adding equations 1 and 6 gives w1 + w2 + w3 + w4 = y1 + y6, and adding equations 3 and 4 gives w1 + w2 + w3 + w4 = y3 + y4; the conditions equate these right-hand sides.)
https://math.stackexchange.com/questions/1860348/solve-an-overdetermined-system-of-linear-equations
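The same conclusion can be reached without row reduction by hand: y is consistent with the system exactly when z^T y = 0 for every z in the left null space of the coefficient matrix. A sympy sketch (not from the slides; the basis it prints may differ from the two conditions above by a linear combination):

```python
# Recover the consistency conditions as the left null space of A.
from sympy import Matrix, symbols

y = Matrix(symbols('y1:7'))      # column vector (y1, ..., y6)
A = Matrix([[0, 1, 1, 0],
            [0, 1, 0, 1],
            [1, 1, 0, 0],
            [0, 0, 1, 1],
            [1, 0, 1, 0],
            [1, 0, 0, 1]])

# Each basis vector z of null(A^T) yields one condition z^T y = 0.
for z in A.T.nullspace():
    print((z.T * y)[0], '= 0')
```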
The No Solution Regime
There exists no model w such that Xw = y. The data were generated as y = Xw* + e:

    X = [−2]        e = [ 0.0]        y = Xw* + e = [−2.0]
        [−1]            [ 0.1]                      [−0.9]
        [ 1]            [ 0.0]                      [ 1.0]
        [ 3]            [−0.2]                      [ 2.8]

(here w* = 1, so Xw* = X).

What to do now? Minimize the squared loss.

Figure: Linear system with no solution
The squared loss
L(w) = Σ_{i=1}^n (y_i − x_i^T w)² = ‖y − Xw‖²
Penalizes higher residuals more.
Symmetric loss function.
Can be derived from Gaussian error.
Figure: Squared loss ‖y − Xw‖² for y = (−2, −0.9, 1, 2.8)^T and X = (−2, −1, 1, 3)^T
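A tiny sketch of this loss on the toy data from the figure (assuming numpy; not from the slides):

```python
# Squared loss L(w) = sum_i (y_i - x_i^T w)^2 = ||y - Xw||^2.
import numpy as np

X = np.array([[-2.0], [-1.0], [1.0], [3.0]])  # the 4x1 design above
y = np.array([-2.0, -0.9, 1.0, 2.8])

def squared_loss(w):
    r = y - X @ w          # residual vector
    return float(r @ r)

print(squared_loss(np.array([1.0])))  # 0.05: the loss at w = 1
```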
The squared loss: continued
L(w) = (−2 + 2w)² + (−0.9 + w)² + (1 − w)² + (2.8 − 3w)²

dL/dw = 4(−2 + 2w) + 2(−0.9 + w) − 2(1 − w) − 6(2.8 − 3w)

Equating dL/dw = 0 gives 30w − 28.6 = 0, i.e., w = 143/150 ≈ 0.953.

    y_pred = Xw ≈ (−1.91, −0.95, 0.95, 2.86)^T ≈ y = (−2, −0.9, 1, 2.8)^T
Ordinary Least Squares Regression
L(w) = Σ_{i=1}^n (y_i − x_i^T w)² = ‖y − Xw‖²

L(w) = (y − Xw)^T (y − Xw)
     = y^T y − 2 w^T X^T y + w^T X^T X w

dL/dw = −2 X^T y + 2 X^T X w = 0

The Least Squares Estimate:

    w_OLS = arg min_{w ∈ R^d} L(w) = (X^T X)^(−1) X^T y
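A sketch of the closed form on the toy data (assuming numpy; not from the slides); it reproduces the minimizer w ≈ 0.953 found on the previous slide:

```python
# w_OLS = (X^T X)^{-1} X^T y, computed via a linear solve rather than
# an explicit matrix inverse (better numerics).
import numpy as np

X = np.array([[-2.0], [-1.0], [1.0], [3.0]])
y = np.array([-2.0, -0.9, 1.0, 2.8])

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ols)  # [0.9533...], i.e., 143/150
```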
The Gaussian Distribution
p(y_i; µ, σ²) = (1 / √(2πσ²)) exp( −(y_i − µ)² / (2σ²) )
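A small numerical sketch of this density (assuming numpy; not from the slides):

```python
# Evaluate the Gaussian density and check it integrates to ~1.
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

ys = np.linspace(-10.0, 10.0, 100001)
print(gaussian_pdf(ys, 0.0, 1.0).sum() * (ys[1] - ys[0]))  # ~ 1.0 (Riemann sum)
```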
The Assumption of Independent Errors
Assuming e_i ~ N(0, σ²) and y_i = x_i^T w* + e_i:

    p(y_i | x_i^T w, σ²) = (1 / √(2πσ²)) exp( −(y_i − x_i^T w)² / (2σ²) )

Our objective is to maximize:

    p(y | Xw, σ²I_n) = Π_{i=1}^n p(y_i | x_i^T w, σ²)
The maximum likelihood estimate
w_MLE = arg max_w Π_{i=1}^n p(y_i | x_i^T w, σ²)
      = arg max_w log Π_{i=1}^n p(y_i | x_i^T w, σ²)
      = arg min_w −log Π_{i=1}^n p(y_i | x_i^T w, σ²)
      = arg min_w NLL(w)          (NLL: Negative Log Likelihood)
The equivalence of MLE and OLS
NLL(w) = −log Π_{i=1}^n p(y_i | x_i^T w, σ²) = −Σ_{i=1}^n log p(y_i | x_i^T w, σ²)

       = −Σ_{i=1}^n log( (1 / √(2πσ²)) exp( −(y_i − x_i^T w)² / (2σ²) ) )

       = Σ_{i=1}^n log √(2πσ²) + Σ_{i=1}^n (y_i − x_i^T w)² / (2σ²)

Since the first term does not depend on w:

    w_MLE = arg min_w NLL(w) = arg min_w Σ_{i=1}^n (y_i − x_i^T w)² = w_OLS
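A numerical sanity check of this equivalence (assuming numpy and scipy; not from the slides): minimizing the NLL directly lands on the OLS closed form.

```python
# Minimize NLL(w) numerically and compare with (X^T X)^{-1} X^T y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 0.5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + sigma * rng.standard_normal(n)

def nll(w):
    # Dropping the constant term sum(log sqrt(2 pi sigma^2)),
    # which does not affect the argmin.
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(nll, x0=np.zeros(d)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ols, atol=1e-4))  # True
```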
Consistency of MLE
y = Xw* + e for unknown w* ∈ R^d

w_MLE = (X^T X)^(−1) X^T y
      = (X^T X)^(−1) X^T (Xw* + e)
      = w* + (X^T X)^(−1) X^T e

‖w_MLE − w*‖₂ = ‖(X^T X)^(−1) X^T e‖₂ = O(σ √(d/n))

Consistency: with more data points, we get closer to w*.
Consistency of MLE - Implementation
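A minimal simulation in the spirit of this slide (assumed, not the authors' original code; synthetic Gaussian data, one trial per n):

```python
# Empirical consistency: ||w_MLE - w*|| shrinks as n grows.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.0
w_star = rng.standard_normal(d)

for n in [100, 1000, 10000, 100000]:
    X = rng.standard_normal((n, d))
    y = X @ w_star + sigma * rng.standard_normal(n)
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    # Error decays roughly like sigma * sqrt(d / n).
    print(n, np.linalg.norm(w_mle - w_star))
```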
Drawback of OLS
We know that w_OLS = arg min_{w ∈ R^d} L(w) = (X^T X)^(−1) X^T y.

Fact: if rank(A) = p and rank(B) = q, then rank(AB) ≤ min(p, q).

If n < d, then X^T X has rank at most n, which is less than d.
That is, X^T X is not invertible.
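To see the failure concretely, a numpy sketch (not from the slides; the sizes are made up):

```python
# With n < d, the Gram matrix X^T X is rank-deficient and not invertible.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10                     # fewer equations than unknowns
X = rng.standard_normal((n, d))
G = X.T @ X                      # d x d, but rank(G) <= rank(X) <= n < d

print(np.linalg.matrix_rank(G))  # 5, not 10: the OLS formula breaks down
```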
Solution: regularization.
Ridge Regression
w_ridge = arg min_{w ∈ R^d} { ‖y − Xw‖² + λ‖w‖² }

Exercise: show that w_ridge = (X^T X + λI_d)^(−1) X^T y.
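A minimal sketch of the ridge estimate (assuming numpy; not from the slides). Adding λI_d makes X^T X + λI_d positive definite for any λ > 0, so the solve succeeds even when n < d:

```python
# w_ridge = (X^T X + lam * I_d)^{-1} X^T y
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))  # n < d: plain OLS would fail here
y = rng.standard_normal(5)
print(ridge(X, y, lam=0.1))
```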