Introduction to Machine Learning: Linear Regression
Bhaskar Mukhoty, Shivam Bansal
Indian Institute of Technology Kanpur, Summer School 2019
May 29, 2019
Introduction
Machine learning is a tool to extend human intelligence.
Problems involving large amounts of real-time data need the assistance of computers to process, e.g., astronomical / sub-particle-level data, fraud detection.
It is used for automating tasks, e.g., self-driving cars.
Future Applications
Potential future uses in law.
Fun to see top players getting defeated by simple algorithms, e.g., Go.
Health care.
Course Goals
We will focus on the theoretical understanding of common machine learning algorithms.
The mathematical tools needed will be introduced.
Students are encouraged to ask questions, to clarify even the simplest doubts.
Course Policy
Attendance will be taken in class.
Passing a 1-hour written exam is required for successful completion of the course.
Course website: https://www.cse.iitk.ac.in/users/bhaskarm/IntroToML.html
Lecture Outline
System of linear equations.
Overdetermined systems with one or no solution.
Addressing the case of no solution.
Ordinary least squares regression.
The maximum likelihood estimate and consistency.
The underdetermined system and rank deficiency.
Ridge regression.
System of linear equations
Given X ∈ R^(n×d), y ∈ R^n and n ≥ d:

    [x11 x12 .. x1d]   [w1]   [y1]
    [x21 x22 .. x2d] · [w2] = [y2]
    [ ·   ·      · ]   [ ·]   [ ·]
    [xn1 xn2 .. xnd]   [wd]   [yn]

that is, y_i = x_i^T w for i = 1, ..., n.

Question: find w ∈ R^d, e.g., by Gaussian elimination.

Figure: Over-determined system
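A minimal numpy sketch (not from the slides; the matrix and true w are made-up values) showing that a consistent system with n ≥ d is solved exactly by least squares:

```python
# Build a consistent over-determined system y = Xw and recover w.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))  # n equations, d unknowns, n >= d
w_true = rng.standard_normal(d)
y = X @ w_true                   # y lies in the column space of X

# For a consistent system, the least-squares solution is the exact one.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_true))    # True
```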
Example: An Overdetermined System
w2 + w3 = y1
w2 + w4 = y2
w1 + w2 = y3
w3 + w4 = y4
w1 + w3 = y5
w1 + w4 = y6
    [0 1 1 0]          [y1]
    [0 1 0 1]   [w1]   [y2]
    [1 1 0 0] · [w2] = [y3]
    [0 0 1 1]   [w3]   [y4]
    [1 0 1 0]   [w4]   [y5]
    [1 0 0 1]          [y6]
Example: An Overdetermined System

Augmented matrix:

    [0 1 1 0 | y1]
    [0 1 0 1 | y2]
    [1 1 0 0 | y3]
    [0 0 1 1 | y4]
    [1 0 1 0 | y5]
    [1 0 0 1 | y6]

Row reduction gives:

    [1 0 0 0 | −1/2 y1 − 1/2 y2 + y3 + 1/2 y4]
    [0 1 0 0 |  1/2 y1 + 1/2 y2 − 1/2 y4]
    [0 0 1 0 |  1/2 y1 − 1/2 y2 + 1/2 y4]
    [0 0 0 1 | −1/2 y1 + 1/2 y2 + 1/2 y4]
    [0 0 0 0 |  y2 + y5 − y3 − y4]
    [0 0 0 0 |  y1 + y6 − y3 − y4]

One solution vs. no solution:
The system of equations has a solution only if:

    y2 + y5 − y3 − y4 = 0
    y1 + y6 − y3 − y4 = 0

(Adding equations 2 and 5 gives w1 + w2 + w3 + w4 = y2 + y5, adding equations 1 and 6 gives w1 + w2 + w3 + w4 = y1 + y6, and adding equations 3 and 4 gives w1 + w2 + w3 + w4 = y3 + y4; the conditions equate these right-hand sides.)
https://math.stackexchange.com/questions/1860348/solve-an-overdetermined-system-of-linear-equations
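The same conclusion can be reached without row reduction by hand: y is consistent with the system exactly when z^T y = 0 for every z in the left null space of the coefficient matrix. A sympy sketch (not from the slides; the basis it prints may differ from the two conditions above by a linear combination):

```python
# Recover the consistency conditions as the left null space of A.
from sympy import Matrix, symbols

y = Matrix(symbols('y1:7'))      # column vector (y1, ..., y6)
A = Matrix([[0, 1, 1, 0],
            [0, 1, 0, 1],
            [1, 1, 0, 0],
            [0, 0, 1, 1],
            [1, 0, 1, 0],
            [1, 0, 0, 1]])

# Each basis vector z of null(A^T) yields one condition z^T y = 0.
for z in A.T.nullspace():
    print((z.T * y)[0], '= 0')
```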
The No Solution Regime
There exists no model w such that Xw = y. The data were generated as y = Xw* + e:

    X = [−2]        e = [ 0.0]        y = Xw* + e = [−2.0]
        [−1]            [ 0.1]                      [−0.9]
        [ 1]            [ 0.0]                      [ 1.0]
        [ 3]            [−0.2]                      [ 2.8]

(here w* = 1, so Xw* = X).

What to do now? Minimize the squared loss.

Figure: Linear system with no solution
The squared loss
L(w) = Σ_{i=1}^n (y_i − x_i^T w)² = ‖y − Xw‖²
Penalizes higher residuals more.
Symmetric loss function.
Can be derived from Gaussian error.
Figure: Squared loss ‖y − Xw‖² for y = (−2, −0.9, 1, 2.8)^T and X = (−2, −1, 1, 3)^T
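A tiny sketch of this loss on the toy data from the figure (assuming numpy; not from the slides):

```python
# Squared loss L(w) = sum_i (y_i - x_i^T w)^2 = ||y - Xw||^2.
import numpy as np

X = np.array([[-2.0], [-1.0], [1.0], [3.0]])  # the 4x1 design above
y = np.array([-2.0, -0.9, 1.0, 2.8])

def squared_loss(w):
    r = y - X @ w          # residual vector
    return float(r @ r)

print(squared_loss(np.array([1.0])))  # 0.05: the loss at w = 1
```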
The squared loss: continued
L(w) = (−2 + 2w)² + (−0.9 + w)² + (1 − w)² + (2.8 − 3w)²

dL/dw = 4(−2 + 2w) + 2(−0.9 + w) − 2(1 − w) − 6(2.8 − 3w)

Equating dL/dw = 0 gives 30w − 28.6 = 0, i.e., w = 143/150 ≈ 0.953.

    y_pred = Xw ≈ (−1.91, −0.95, 0.95, 2.86)^T ≈ y = (−2, −0.9, 1, 2.8)^T
Ordinary Least Squares Regression
L(w) = Σ_{i=1}^n (y_i − x_i^T w)² = ‖y − Xw‖²

L(w) = (y − Xw)^T (y − Xw)
     = y^T y − 2 w^T X^T y + w^T X^T X w

dL/dw = −2 X^T y + 2 X^T X w = 0

The Least Squares Estimate:

    w_OLS = arg min_{w ∈ R^d} L(w) = (X^T X)^(−1) X^T y
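A sketch of the closed form on the toy data (assuming numpy; not from the slides); it reproduces the minimizer w ≈ 0.953 found on the previous slide:

```python
# w_OLS = (X^T X)^{-1} X^T y, computed via a linear solve rather than
# an explicit matrix inverse (better numerics).
import numpy as np

X = np.array([[-2.0], [-1.0], [1.0], [3.0]])
y = np.array([-2.0, -0.9, 1.0, 2.8])

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ols)  # [0.9533...], i.e., 143/150
```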
The Gaussian Distribution
p(y_i; µ, σ²) = (1 / √(2πσ²)) exp( −(y_i − µ)² / (2σ²) )
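A small numerical sketch of this density (assuming numpy; not from the slides):

```python
# Evaluate the Gaussian density and check it integrates to ~1.
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

ys = np.linspace(-10.0, 10.0, 100001)
print(gaussian_pdf(ys, 0.0, 1.0).sum() * (ys[1] - ys[0]))  # ~ 1.0 (Riemann sum)
```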
The Assumption of Independent Errors
Assuming e_i ~ N(0, σ²) and y_i = x_i^T w* + e_i:

    p(y_i | x_i^T w, σ²) = (1 / √(2πσ²)) exp( −(y_i − x_i^T w)² / (2σ²) )

Our objective is to maximize:

    p(y | Xw, σ²I_n) = Π_{i=1}^n p(y_i | x_i^T w, σ²)
The maximum likelihood estimate
w_MLE = arg max_w Π_{i=1}^n p(y_i | x_i^T w, σ²)
      = arg max_w log Π_{i=1}^n p(y_i | x_i^T w, σ²)
      = arg min_w −log Π_{i=1}^n p(y_i | x_i^T w, σ²)
      = arg min_w NLL(w)          (NLL: Negative Log Likelihood)
The equivalence of MLE and OLS
NLL(w) = −log Π_{i=1}^n p(y_i | x_i^T w, σ²) = −Σ_{i=1}^n log p(y_i | x_i^T w, σ²)

       = −Σ_{i=1}^n log( (1 / √(2πσ²)) exp( −(y_i − x_i^T w)² / (2σ²) ) )

       = Σ_{i=1}^n log √(2πσ²) + Σ_{i=1}^n (y_i − x_i^T w)² / (2σ²)

Since the first term does not depend on w:

    w_MLE = arg min_w NLL(w) = arg min_w Σ_{i=1}^n (y_i − x_i^T w)² = w_OLS
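A numerical sanity check of this equivalence (assuming numpy and scipy; not from the slides): minimizing the NLL directly lands on the OLS closed form.

```python
# Minimize NLL(w) numerically and compare with (X^T X)^{-1} X^T y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 0.5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + sigma * rng.standard_normal(n)

def nll(w):
    # Dropping the constant term sum(log sqrt(2 pi sigma^2)),
    # which does not affect the argmin.
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(nll, x0=np.zeros(d)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ols, atol=1e-4))  # True
```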
Consistency of MLE
y = Xw* + e for unknown w* ∈ R^d

w_MLE = (X^T X)^(−1) X^T y
      = (X^T X)^(−1) X^T (Xw* + e)
      = w* + (X^T X)^(−1) X^T e

‖w_MLE − w*‖₂ = ‖(X^T X)^(−1) X^T e‖₂ = O(σ √(d/n))

Consistency: with more data points, we get closer to w*.
Consistency of MLE - Implementation
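A minimal simulation in the spirit of this slide (assumed, not the authors' original code; synthetic Gaussian data, one trial per n):

```python
# Empirical consistency: ||w_MLE - w*|| shrinks as n grows.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.0
w_star = rng.standard_normal(d)

for n in [100, 1000, 10000, 100000]:
    X = rng.standard_normal((n, d))
    y = X @ w_star + sigma * rng.standard_normal(n)
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    # Error decays roughly like sigma * sqrt(d / n).
    print(n, np.linalg.norm(w_mle - w_star))
```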
Drawback of OLS
We know that w_OLS = arg min_{w ∈ R^d} L(w) = (X^T X)^(−1) X^T y.

Fact: if rank(A) = p and rank(B) = q, then rank(AB) ≤ min(p, q).

If n < d, then X^T X has rank at most n, which is less than d.
That is, X^T X is not invertible.
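To see the failure concretely, a numpy sketch (not from the slides; the sizes are made up):

```python
# With n < d, the Gram matrix X^T X is rank-deficient and not invertible.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10                     # fewer equations than unknowns
X = rng.standard_normal((n, d))
G = X.T @ X                      # d x d, but rank(G) <= rank(X) <= n < d

print(np.linalg.matrix_rank(G))  # 5, not 10: the OLS formula breaks down
```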
Solution: regularization.
Ridge Regression
w_ridge = arg min_{w ∈ R^d} { ‖y − Xw‖² + λ‖w‖² }

Exercise: show that w_ridge = (X^T X + λI_d)^(−1) X^T y.
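A minimal sketch of the ridge estimate (assuming numpy; not from the slides). Adding λI_d makes X^T X + λI_d positive definite for any λ > 0, so the solve succeeds even when n < d:

```python
# w_ridge = (X^T X + lam * I_d)^{-1} X^T y
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))  # n < d: plain OLS would fail here
y = rng.standard_normal(5)
print(ridge(X, y, lam=0.1))
```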