MLE vs. MAP
Aarti Singh
Machine Learning 10-701/15-781, Sept 15, 2010
MLE vs. MAP
When is MAP the same as MLE?
Maximum Likelihood estimation (MLE)
Choose value that maximizes the probability of observed data
Maximum a posteriori (MAP) estimation
Choose value that is most probable given observed data and prior belief
MAP using Conjugate Prior
Coin flip problem
Likelihood is Binomial
If the prior is a Beta distribution,
then the posterior is also a Beta distribution
For the Binomial, the conjugate prior is the Beta distribution.
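The update equations on this slide were lost in extraction. As a sketch of the conjugacy it describes (variable names are mine, not the slide's): with a Binomial likelihood and a Beta(α, β) prior, observing h heads and t tails yields a Beta(α + h, β + t) posterior:

```python
# Beta-Binomial conjugate update: prior Beta(alpha, beta),
# observe h heads and t tails -> posterior Beta(alpha + h, beta + t).
def beta_posterior(alpha, beta, h, t):
    return alpha + h, beta + t

# Prior Beta(2, 2); observe 3 heads and 1 tail.
a_post, b_post = beta_posterior(2, 2, 3, 1)
print(a_post, b_post)              # 5 3, i.e. posterior is Beta(5, 3)
print(a_post / (a_post + b_post))  # 0.625, the posterior mean
```

Because the posterior has the same form as the prior, the update is just adding counts, which is what makes the Beta prior convenient here.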
MLE vs. MAP
• Beta prior is equivalent to extra coin flips (regularization)
• As n → ∞, the prior is “forgotten”
• But for small sample sizes, the prior is important!
What if we toss the coin too few times?
• You say: Probability next toss is a head = 0
• Billionaire says: You’re fired! …with prob 1
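The two estimates being contrasted can be sketched as follows (an illustrative reconstruction, not the slide's own code): the MLE is h/n, while the MAP estimate under a Beta(α, β) prior is the posterior mode, which behaves like the MLE computed with α + β − 2 extra pseudo-flips:

```python
def mle(h, n):
    # Maximum likelihood estimate of P(heads) from h heads in n tosses.
    return h / n

def map_estimate(h, n, alpha=2.0, beta=2.0):
    # Posterior is Beta(alpha + h, beta + n - h); for alpha, beta > 1
    # its mode (the MAP estimate) is:
    return (h + alpha - 1) / (n + alpha + beta - 2)

# Three tosses, all tails: the MLE says heads are impossible.
print(mle(0, 3))           # 0.0
print(map_estimate(0, 3))  # 0.2 -- the prior acts like extra pseudo-flips
# With lots of data the prior is forgotten and MAP approaches the MLE:
print(map_estimate(500, 1000))  # 0.5
```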
Bayesians vs. Frequentists
“You are no good when the sample is small!”
“You give a different answer for different priors!”
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
P(x) = N(μ, σ²)
(Figure: Gaussian densities with mean μ = 0 and different variances σ².)
Gaussian distribution
Data: D = {x_1, …, x_n}
• Parameters: μ – mean, σ² – variance
• Sleep hours are i.i.d.:
– Independent events
– Identically distributed according to a Gaussian distribution
(Figure: observed sleep-hours data points along an axis from roughly 3 to 9 hrs.)
Properties of Gaussians
• Affine transformation (multiplying by a scalar and adding a constant):
– X ~ N(μ, σ²)
– Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
• Sum of independent Gaussians:
– X ~ N(μ_X, σ²_X),  Y ~ N(μ_Y, σ²_Y)
– Z = X + Y  ⇒  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
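A quick Monte Carlo check of the two properties above (parameter values are my own, and the tolerances are loose because the check is sampling-based):

```python
import random

random.seed(0)
n = 200_000
# X ~ N(1, 4), i.e. mu = 1, sigma = 2
xs = [random.gauss(1.0, 2.0) for _ in range(n)]

# Affine transform: Y = 3X + 2 should be N(3*1 + 2, 9*4) = N(5, 36).
ys = [3 * x + 2 for x in xs]
mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
print(mean_y, var_y)  # roughly 5 and 36

# Sum of independent Gaussians: W ~ N(-1, 9), so Z = X + W ~ N(0, 13).
ws = [random.gauss(-1.0, 3.0) for _ in range(n)]
zs = [x + w for x, w in zip(xs, ws)]
mean_z = sum(zs) / n
var_z = sum((z - mean_z) ** 2 for z in zs) / n
print(mean_z, var_z)  # roughly 0 and 13
```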
MLE for Gaussian mean and variance
μ̂_MLE = (1/n) Σ_i x_i,   σ̂²_MLE = (1/n) Σ_i (x_i − μ̂_MLE)²
MLE for Gaussian mean and variance
Note: the MLE for the variance of a Gaussian is biased
– The expected value of the estimate is not the true parameter!
– Unbiased variance estimator: σ̂²_unbiased = (1/(n−1)) Σ_i (x_i − μ̂)²
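The bias is easy to see by simulation (an illustrative sketch with made-up parameters): with n = 5 samples from N(0, 4), the divide-by-n estimator averages (n−1)/n · σ² = 3.2 rather than 4, while dividing by n−1 removes the bias.

```python
import random

random.seed(1)
n = 5            # small sample, so the bias is clearly visible
trials = 100_000

mle_sum = unbiased_sum = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]  # true variance = 4
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    mle_sum += ss / n             # MLE: divide by n (biased)
    unbiased_sum += ss / (n - 1)  # divide by n - 1 (unbiased)

print(mle_sum / trials)       # roughly 3.2 = (n-1)/n * 4, underestimates
print(unbiased_sum / trials)  # roughly 4.0
```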
MAP for Gaussian mean and variance
• Conjugate priors
– Mean: Gaussian prior
– Variance: Wishart Distribution
• Prior for mean: P(μ) = N(η, λ²)
MAP for Gaussian Mean
MAP under Gauss-Wishart prior - Homework
(Assuming known variance σ²)
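For the known-variance case, the standard Gaussian-Gaussian posterior mean is μ_MAP = (σ²η + λ² Σ_i x_i) / (σ² + nλ²), which interpolates between the prior mean η and the sample mean. A sketch (the data values are made up):

```python
def map_mean(xs, sigma2, eta, lam2):
    # MAP estimate of the mean mu under prior N(eta, lam2),
    # with known observation-noise variance sigma2:
    #   mu_MAP = (sigma2 * eta + lam2 * sum(x)) / (sigma2 + n * lam2)
    n = len(xs)
    return (sigma2 * eta + lam2 * sum(xs)) / (sigma2 + n * lam2)

data = [7.0, 6.5, 8.0, 7.5]   # e.g. observed sleep hours; sample mean 7.25
# A weak prior (large lam2) leaves the estimate near the sample mean:
print(map_mean(data, sigma2=1.0, eta=8.0, lam2=100.0))
# A strong prior (small lam2) pulls the estimate toward eta = 8:
print(map_mean(data, sigma2=1.0, eta=8.0, lam2=0.01))
```

As λ² → ∞ the prior carries no information and the MAP estimate reduces to the MLE, the sample mean.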
What you should know…
• Learning parametric distributions: form known, parameters unknown
– Bernoulli (θ, probability of heads)
– Gaussian (μ, mean and σ², variance)
• MLE
• MAP
What loss function are we minimizing?
• Learning distributions/densities – Unsupervised learning
• Task: Learn P(X; θ) (form of P known, parameter θ unknown)
• Experience: D = {x_1, …, x_n}
• Performance: Negative log-likelihood loss
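A minimal sketch of this loss for a fixed-variance Gaussian (function and variable names are mine), showing that the negative log-likelihood is minimized at the sample mean, i.e. at the MLE:

```python
import math

def gaussian_nll(mu, xs, sigma2=1.0):
    # Negative log-likelihood of i.i.d. data xs under N(mu, sigma2).
    n = len(xs)
    const = 0.5 * n * math.log(2 * math.pi * sigma2)
    return const + sum((x - mu) ** 2 for x in xs) / (2 * sigma2)

xs = [6.0, 7.0, 7.5, 8.0, 9.0]
mu_mle = sum(xs) / len(xs)  # sample mean = 7.5
# The NLL at the MLE is lower than at nearby candidate values:
print(gaussian_nll(mu_mle, xs) < gaussian_nll(7.0, xs))  # True
print(gaussian_nll(mu_mle, xs) < gaussian_nll(8.0, xs))  # True
```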
Recitation Tomorrow!
• Linear Algebra and Matlab
• Strongly recommended!!
• Place: NSH 1507 (Note: change from last time)
• Time: 5-6 pm
Leman
Bayes Optimal Classifier
Aarti Singh
Machine Learning 10-701/15-781, Sept 15, 2010
Goal: Classification
Map features X (e.g. a document) to labels Y (e.g. Sports, Science, News), minimizing the probability of error.
Optimal Classification
Optimal predictor (Bayes classifier): f*(x) = argmax_y P(Y = y | X = x)
• Even the optimal classifier makes mistakes: the Bayes risk R(f*) = P(f*(X) ≠ Y) > 0
• The optimal classifier depends on the unknown distribution
Optimal Classifier
Bayes rule: P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
Optimal classifier: f*(x) = argmax_y P(X = x | Y = y) P(Y = y)
where P(X = x | Y = y) is the class-conditional density and P(Y = y) is the class prior.
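A toy sketch of this classifier over a single discrete feature (all probabilities below are invented for illustration). The evidence P(X = x) is the same for every class, so it can be dropped from the argmax:

```python
# Bayes-optimal classifier from a class prior P(Y) and a
# class-conditional likelihood P(X | Y) over one word feature.
priors = {"Sports": 0.5, "Science": 0.3, "News": 0.2}
likelihood = {  # P(X = word | Y = class)
    "Sports":  {"ball": 0.6, "atom": 0.1, "vote": 0.3},
    "Science": {"ball": 0.1, "atom": 0.8, "vote": 0.1},
    "News":    {"ball": 0.2, "atom": 0.1, "vote": 0.7},
}

def bayes_classify(x):
    # argmax_y P(X = x | Y = y) P(Y = y)
    return max(priors, key=lambda y: likelihood[y][x] * priors[y])

print(bayes_classify("atom"))  # Science
print(bayes_classify("ball"))  # Sports
print(bayes_classify("vote"))  # Sports -- the large prior outweighs the likelihood
```

The last line shows why the prior matters: "vote" is most likely under News, but the Sports prior is large enough to tip the posterior.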
Example Decision Boundaries
• Gaussian class conditional densities (1-dimension/feature)
(Figure: two 1-d Gaussian class-conditional densities; the decision boundary is the point where the prior-weighted densities cross.)
Example Decision Boundaries
• Gaussian class conditional densities (2-dimensions/features)
(Figure: two 2-d Gaussian class-conditional densities; the decision boundary is a curve in the feature plane.)
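For two 1-d Gaussian class-conditional densities with equal variance, the boundary can be found in closed form: it is the midpoint of the means, shifted by the log-ratio of the priors. A sketch (parameter values are illustrative):

```python
import math

def posterior_odds(x, mu0, mu1, sigma, p0, p1):
    # log[p1 * N(x; mu1, sigma^2)] - log[p0 * N(x; mu0, sigma^2)];
    # zero exactly on the decision boundary.
    def log_gauss(x, mu):
        return -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.log(p1) + log_gauss(x, mu1) - math.log(p0) - log_gauss(x, mu0)

mu0, mu1, sigma = 0.0, 4.0, 1.0
# Equal priors: the boundary is the midpoint (mu0 + mu1) / 2 = 2.
print(posterior_odds(2.0, mu0, mu1, sigma, 0.5, 0.5))  # 0.0
# Unequal priors shift the boundary toward the rarer class:
boundary = (mu0 + mu1) / 2 + sigma**2 * math.log(0.8 / 0.2) / (mu1 - mu0)
print(boundary)  # about 2.35, pushed toward class 1
```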
Learning the Optimal Classifier
Optimal classifier: f*(x) = argmax_y P(X = x | Y = y) P(Y = y)
Need to know:
– Class prior: P(Y = y) for all y
– Class-conditional density (likelihood): P(X = x | Y = y) for all x, y
Learning the Optimal Classifier
Task: Predict whether or not a picnic spot is enjoyable
Training Data:
Let’s learn P(Y|X) – how many parameters?
Training data: n rows, each X = (X1 X2 X3 … Xd) with label Y
– Prior P(Y = y) for all y: K − 1 parameters (K labels)
– Likelihood P(X = x | Y = y) for all x, y: (2^d − 1)K parameters (d binary features)
Learning the Optimal Classifier
Task: Predict whether or not a picnic spot is enjoyable
Training Data:
Let’s learn P(Y|X) – how many parameters?
Training data: n rows, each X = (X1 X2 X3 … Xd) with label Y
In total: 2^d·K − 1 parameters (K classes, d binary features)
Need n >> 2^d·K − 1 training examples to learn all the parameters
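The counts above can be checked directly: the joint factorization and the prior-plus-likelihood factorization have the same number of free parameters, and that number grows exponentially in the number of features d (function names are mine):

```python
def joint_params(d, K):
    # P(X, Y) over d binary features and K classes: 2^d * K outcomes,
    # probabilities sum to 1, so 2^d * K - 1 free parameters.
    return 2**d * K - 1

def prior_plus_likelihood_params(d, K):
    # Prior P(Y): K - 1 parameters; likelihood P(X|Y): 2^d - 1 per class.
    return (K - 1) + (2**d - 1) * K

# Both factorizations give the same count:
print(joint_params(10, 3))                  # 3071
print(prior_plus_likelihood_params(10, 3))  # 3071
print(joint_params(30, 3))                  # over 3 billion -- exponential in d
```

This is the motivation for the models that follow in the course: without extra assumptions (such as conditional independence of features), the amount of data needed is hopeless even for modest d.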