Maximum Likelihood
Nando de Freitas

TRANSCRIPT

(Transcribed from lecture3.ppt; created 29 January 2015.)
Outline of the lecture
In this lecture, we formulate the problem of linear prediction using probabilities. We also introduce the maximum likelihood estimate and show that it coincides with the least squares estimate. The goal of the lecture is for you to learn:

- Gaussian distributions
- How to formulate the likelihood for linear regression
- Computing the maximum likelihood estimates for linear regression
- Entropy and its relation to loss, probability and learning
Univariate Gaussian distribution

The density of a univariate Gaussian with mean μ and variance σ² is:

p(x | μ, σ²) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )
Sampling from a Gaussian distribution
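The slide content here is an image; as a minimal sketch of how Gaussian samples can be generated from uniform random numbers, here is a Box–Muller example (the function name and parameter values are my own, chosen for illustration):

```python
import math
import random

def sample_gaussian(mu, sigma, n, seed=0):
    """Draw n samples from N(mu, sigma^2) via the Box-Muller transform:
    two uniforms (u1, u2) give a standard normal z, which is then
    shifted and scaled to the desired mean and standard deviation."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u1, u2 = rng.random(), rng.random()
        # 1 - u1 lies in (0, 1], so the log is always defined.
        z = math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)
        samples.append(mu + sigma * z)
    return samples

xs = sample_gaussian(mu=2.0, sigma=0.5, n=10000)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
```

With enough samples, the empirical mean and variance should be close to μ = 2 and σ² = 0.25.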
Covariance, correlation and multivariate Gaussians

The covariance of two variables is cov(x1, x2) = E[(x1 − E[x1])(x2 − E[x2])], and the correlation is the covariance normalized by the standard deviations: corr(x1, x2) = cov(x1, x2) / (σ1 σ2). A multivariate Gaussian in D dimensions with mean μ and covariance Σ has density:

p(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ) )
Bivariate Gaussian distribution example
Assume we have two independent univariate Gaussian variables

x1 ~ N(μ1, σ²) and x2 ~ N(μ2, σ²)

Since they are independent, their joint distribution p(x1, x2) is the product of the marginals:

p(x1, x2) = p(x1) p(x2) = (1/(2πσ²)) exp( −[(x1 − μ1)² + (x2 − μ2)²] / (2σ²) )
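A quick numeric check of this factorization (the evaluation points 0.3 and −0.7 are arbitrary choices for illustration):

```python
import math

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def joint_pdf(x1, x2, mu1, mu2, sigma):
    """Bivariate Gaussian with diagonal covariance sigma^2 * I:
    p(x1, x2) = (1/(2*pi*sigma^2)) * exp(-((x1-mu1)^2 + (x2-mu2)^2) / (2*sigma^2))."""
    norm = 1.0 / (2 * math.pi * sigma ** 2)
    quad = ((x1 - mu1) ** 2 + (x2 - mu2) ** 2) / (2 * sigma ** 2)
    return norm * math.exp(-quad)

# Independence: the joint density factorizes into the product of the marginals.
p_product = gauss_pdf(0.3, 0.0, 1.0) * gauss_pdf(-0.7, 1.0, 1.0)
p_joint = joint_pdf(0.3, -0.7, 0.0, 1.0, 1.0)
```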
We have n = 3 data points y1 = 1, y2 = 0.5, y3 = 1.5, which are independent and Gaussian with unknown mean θ and variance 1:

yi ~ N(θ, 1) = θ + N(0, 1)

with likelihood P(y1, y2, y3 | θ) = P(y1 | θ) P(y2 | θ) P(y3 | θ). Consider two guesses of θ, 1 and 2.5. Which has higher likelihood (probability of generating the three observations)?

Finding the θ that maximizes the likelihood is equivalent to moving the Gaussian until the product of the 3 green bars (the likelihood) is maximized.
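The comparison of the two guesses can be carried out directly by multiplying the three densities:

```python
import math

def gauss_pdf(y, mu, sigma=1.0):
    """Univariate Gaussian density N(y | mu, sigma^2)."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def likelihood(theta, data):
    """Likelihood P(y1,...,yn | theta): product of independent N(theta, 1) densities."""
    prod = 1.0
    for y in data:
        prod *= gauss_pdf(y, theta)
    return prod

data = [1.0, 0.5, 1.5]
lik_1 = likelihood(1.0, data)    # guess theta = 1
lik_2 = likelihood(2.5, data)    # guess theta = 2.5
mle = sum(data) / len(data)      # for a Gaussian mean, the MLE is the sample average
```

θ = 1 yields a much higher likelihood than θ = 2.5, and the maximizer is the sample mean, which for these data is exactly 1.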
The likelihood for linear regression
Let us assume that each label yi is Gaussian distributed with mean xiᵀθ and variance σ², which in short we write as:

yi ~ N(xiᵀθ, σ²) = xiᵀθ + N(0, σ²)
Assuming the n observations are independent, the likelihood of the data is the product of the individual Gaussian densities:

p(y | X, θ, σ²) = ∏_{i=1}^{n} N(yi | xiᵀθ, σ²) = (2πσ²)^(−n/2) exp( −(1/(2σ²)) Σ_{i=1}^{n} (yi − xiᵀθ)² )

Taking logs turns the product into a sum, giving the log-likelihood:

log p(y | X, θ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (yi − xiᵀθ)²
Maximum likelihood
Maximizing the log-likelihood with respect to θ is equivalent to minimizing the sum of squared errors, so the ML estimate of θ coincides with the least squares solution:

θ_ML = (XᵀX)⁻¹ Xᵀ y

Setting the derivative of the log-likelihood with respect to σ to zero gives the ML estimate of the noise variance:

σ²_ML = (1/n) Σ_{i=1}^{n} (yi − xiᵀθ_ML)²
Making predictions

The ML plugin prediction, given the training data D = (X, y), for a new input x* and known σ², is given by:

P(y | x*, D, σ²) = N(y | x*ᵀ θ_ML, σ²)
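Putting the estimates and the plugin prediction together, here is a small numpy sketch; the training data values are made up for illustration (roughly y = 2x with a little noise), with a constant feature included so θ holds an intercept and a slope:

```python
import numpy as np

# Hypothetical training data: each row of X is [1, x], so theta = (intercept, slope).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.9, 4.1, 5.9])

# ML estimate of theta: the least squares solution of the normal equations
# (X^T X) theta = X^T y.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance: the mean squared residual.
residuals = y - X @ theta_ml
sigma2_ml = np.mean(residuals ** 2)

# Plugin prediction at a new input x*: mean x*^T theta_ml, variance sigma2_ml.
x_star = np.array([1.0, 4.0])
y_pred = x_star @ theta_ml
```

Note that the plugin prediction uses a point estimate of θ, so its variance does not reflect uncertainty about θ itself.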
Confidence in the predictions
Bernoulli: a model for coins
A Bernoulli random variable (r.v.) X takes values in {0, 1}:

p(x|θ) = θ if x = 1, and p(x|θ) = 1 − θ if x = 0,

where θ ∈ (0, 1). We can write this probability more succinctly as follows:

p(x|θ) = θˣ (1 − θ)^(1−x)
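A small check that the compact form reproduces the two-case definition, together with the standard MLE for θ from coin flips (the fraction of ones); the flip sequence is made up for illustration:

```python
def bernoulli_pmf(x, theta):
    """p(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

# The compact form recovers both cases of the piecewise definition.
p_heads = bernoulli_pmf(1, 0.3)   # equals theta
p_tails = bernoulli_pmf(0, 0.3)   # equals 1 - theta

# MLE for theta from i.i.d. coin flips: the sample mean (fraction of ones).
flips = [1, 0, 1, 1, 0, 1]
theta_hat = sum(flips) / len(flips)
```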
Entropy
In information theory, entropy H is a measure of the uncertainty associated with a random variable. It is defined as:

H(X) = − Σ_x p(x|θ) log p(x|θ)

Example: For a Bernoulli variable X, the entropy is:

H(X) = −θ log θ − (1 − θ) log(1 − θ)
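Evaluating the Bernoulli entropy over a grid of θ values shows that uncertainty peaks at a fair coin, θ = 0.5, where H = log 2 nats:

```python
import math

def bernoulli_entropy(theta):
    """H(X) = -theta*log(theta) - (1-theta)*log(1-theta), in nats."""
    if theta in (0.0, 1.0):
        return 0.0   # limit value: a certain outcome carries no uncertainty
    return -theta * math.log(theta) - (1 - theta) * math.log(1 - theta)

# Sweep theta from 0 to 1 in steps of 0.1.
values = [bernoulli_entropy(t / 10) for t in range(11)]
```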
Entropy of a Gaussian in D dimensions

For X ~ N(μ, Σ) in D dimensions, the entropy is:

H(X) = (D/2) log(2πe) + (1/2) log |Σ|
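For the special case of a diagonal covariance, this formula can be checked directly: the entropy of independent dimensions is the sum of the marginal entropies (the covariance values below are arbitrary examples):

```python
import math

def gaussian_entropy(cov_diag):
    """Entropy of a D-dimensional Gaussian with diagonal covariance, in nats:
    H = (D/2) * log(2*pi*e) + (1/2) * log|Sigma|, where the log-determinant
    of a diagonal matrix is the sum of the logs of its entries."""
    D = len(cov_diag)
    log_det = sum(math.log(v) for v in cov_diag)
    return 0.5 * D * math.log(2 * math.pi * math.e) + 0.5 * log_det

# For independent dimensions, the joint entropy equals the sum of the marginals.
h_joint = gaussian_entropy([1.0, 4.0])
h_sum = gaussian_entropy([1.0]) + gaussian_entropy([4.0])
```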
MLE - properties

For independent and identically distributed (i.i.d.) data from p(x|θ0), the MLE minimizes the Kullback-Leibler divergence:

θ̂ = arg max_θ ∏_{i=1}^{N} p(xi | θ)
  = arg max_θ Σ_{i=1}^{N} log p(xi | θ)
  = arg max_θ (1/N) Σ_{i=1}^{N} log p(xi | θ) − (1/N) Σ_{i=1}^{N} log p(xi | θ0)
  = arg max_θ (1/N) Σ_{i=1}^{N} log [ p(xi | θ) / p(xi | θ0) ]
  → arg min_θ ∫ log [ p(x | θ0) / p(x | θ) ] p(x | θ0) dx

(The second step takes logs, which preserves the arg max; the third subtracts a term that does not depend on θ; the last line applies the law of large numbers as N → ∞, and the resulting integral is exactly KL( p(·|θ0) ‖ p(·|θ) ).)
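This can be illustrated with unit-variance Gaussians, for which the KL divergence has the well-known closed form KL( N(θ0, 1) ‖ N(θ, 1) ) = (θ − θ0)²/2, so the minimizer over θ is exactly θ0 (the value θ0 = 1.3 and the grid are arbitrary illustrative choices):

```python
def kl_gaussians(theta0, theta):
    """Closed-form KL divergence between N(theta0, 1) and N(theta, 1):
    KL = (theta - theta0)^2 / 2, minimized at theta = theta0."""
    return (theta - theta0) ** 2 / 2.0

theta0 = 1.3
grid = [i / 10.0 for i in range(-30, 31)]          # candidate thetas in [-3, 3]
best = min(grid, key=lambda t: kl_gaussians(theta0, t))
```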
Next lecture
In the next lecture, we introduce ridge regression and basis functions, and look at the issue of controlling complexity.