Supervised Learning Maximum Likelihood Estimation
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Why MLE?
¤ Maximum Likelihood Estimation is a very very very very fundamental part of data analysis
¤ “MLE for Gaussians” is training wheels for our future techniques
¤ Learning Gaussians is more useful than you might guess
Why Gaussian (Normal)?
¤ Gaussian distributions
  ¤ are a family of distributions that have the same general shape
  ¤ are symmetric, with scores more concentrated in the middle than in the tails
¤ Why are they important?
  ¤ many psychological and educational variables are distributed approximately normally
  ¤ they are easy for mathematical statisticians to work with
¤ Formally
  ¤ the normal distribution N(μ, σ²) approximates sums of independent, identically distributed random variables

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$
Learning Gaussians from Data
¤ Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²)
¤ But you don’t know μ (you do know σ²)
¤ MLE: For which μ is x1, x2, ..., xR most likely?
$$
\begin{aligned}
\mu_{mle} &= \arg\max_\mu\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\max_\mu \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) && \text{by i.i.d.} \\
&= \arg\max_\mu \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) && \text{monotonicity of log} \\
&= \arg\max_\mu \sum_{i=1}^{R} -\frac{(x_i - \mu)^2}{2\sigma^2} && \text{plug in the formula of the Gaussian} \\
&= \arg\min_\mu \sum_{i=1}^{R} (x_i - \mu)^2 && \text{after simplification}
\end{aligned}
$$
MLE Strategy
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ=0 for a maximum, creating an equation in terms of θ
4. Solve it
5. Check that you have found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
The MLE μ
$$
\begin{aligned}
\mu_{mle} &= \arg\max_\mu\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\min_\mu \sum_{i=1}^{R} (x_i - \mu)^2 \\
&= \mu \text{ s.t. } 0 = \frac{\partial LL}{\partial \mu}
\end{aligned}
$$

$$\frac{\partial}{\partial \mu}\sum_{i=1}^{R}(x_i - \mu)^2 = -2\sum_{i=1}^{R}(x_i - \mu) = 0 \quad\Longrightarrow\quad \mu = \frac{1}{R}\sum_{i=1}^{R} x_i$$
Interesting…
¤ The best estimate of the mean of a distribution is the mean of the sample!

$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^{R} x_i$$
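These slides contain no code, but the result is easy to check numerically. The sketch below (numpy, synthetic data; all names are mine, not from the slides) confirms that the sample mean attains a higher Gaussian log-likelihood than nearby candidate values of μ:

```python
import numpy as np

# Gaussian log-likelihood of the whole sample as a function of mu
# (sigma^2 held fixed, since here only mu is unknown)
def log_likelihood(mu, x, sigma2=1.0):
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=200)  # synthetic data, true mu = 5

mu_mle = x.mean()  # the sample mean: the slides' closed-form answer
```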
A General MLE Strategy
Suppose θ = (θ1, θ2, ..., θn)T is a vector of parameters
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ,stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations
4. Check that you are at the maximum
$$\frac{\partial LL}{\partial \theta} = \begin{pmatrix} \partial LL / \partial \theta_1 \\ \partial LL / \partial \theta_2 \\ \vdots \\ \partial LL / \partial \theta_n \end{pmatrix}$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²)
¤ But you don’t know μ and σ²
¤ MLE: For which θ = (μ, σ²) is x1, x2, ..., xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \frac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \qquad\quad \frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²)
¤ But you don’t know μ and σ²
¤ MLE: For which θ = (μ, σ²) is x1, x2, ..., xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \frac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

Setting both partial derivatives to zero:

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \qquad\quad 0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²)
¤ But you don’t know μ or σ²
¤ MLE: For which θ = (μ, σ²) is x1, x2, ..., xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^{R} x_i \qquad\quad \sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{mle})^2$$
Exercise (Applying MLE to another model)
James Bond Tasting Martini
¤ We gave Mr. Bond a series of 16 taste tests
¤ At each test, we flipped a fair coin to determine
whether to stir or shake the martini
¤ Mr. Bond was correct on 13/16 tests
¤ How can we model this experiment?
¤ Use MLE to estimate the parameters of your model.
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Unbiased Estimators
¤ An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter
¤ If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²) then

$$E[\mu_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu$$

μ_mle is unbiased
Biased Estimators
¤ An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter
¤ If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²) then

$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{mle})^2\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\Big(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\Big)^2\right] \neq \sigma^2$$

σ²_mle is biased
MLE Variance Bias
¤ If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²) then

$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^{R}\Big(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\Big)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

Intuition check: consider the case of R = 1.
Why should we expect that σ²_mle would be an underestimate of the true σ²?
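The (1 − 1/R) factor can be checked by simulation. The sketch below (numpy; a hypothetical setup with R = 4, σ² = 1) averages the MLE variance over many repeated samples and recovers roughly 0.75 = (1 − 1/4)·σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
R, sigma2, trials = 4, 1.0, 200_000

# Many independent samples of size R from N(0, sigma^2), one per row
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, R))

# The MLE variance of each sample: mean squared deviation, dividing by R
sigma2_mle = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

expected = (1 - 1 / R) * sigma2  # the slide's (1 - 1/R) * sigma^2 = 0.75
```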
MLE Variance Bias
¤ If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²) then

$$E[\sigma^2_{mle}] = E\left[\frac{1}{R}\sum_{i=1}^{R}\Big(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\Big)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

so we define

$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{R}}$$

so that

$$E[\sigma^2_{unbiased}] = \sigma^2$$
Unbiasedness Discussion
¤ Which is best?
¤ Answer:
  ¤ It depends on the task
  ¤ And it makes little difference once R grows large

$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{mle})^2 \qquad\quad \sigma^2_{unbiased} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu_{mle})^2$$
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
MLE for m-dimensional Gaussian
¤ Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, Σ)
¤ But you don’t know μ or Σ
¤ MLE: For which θ = (μ, Σ) is x1, x2, ..., xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^{R} x_k \qquad\quad \Sigma_{mle} = \frac{1}{R}\sum_{k=1}^{R}(x_k - \mu_{mle})(x_k - \mu_{mle})^T$$
MLE for m-dimensional Gaussian
¤ Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, Σ)
¤ But you don’t know μ or Σ
¤ MLE: For which θ = (μ, Σ) is x1, x2, ..., xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^{R} x_k \qquad\quad \Sigma_{mle} = \frac{1}{R}\sum_{k=1}^{R}(x_k - \mu_{mle})(x_k - \mu_{mle})^T$$

Component-wise:

$$\mu_i^{mle} = \frac{1}{R}\sum_{k=1}^{R} x_{ki}$$

Where 1 ≤ i ≤ m, x_{ki} is the value of the ith component of x_k (the ith attribute of the kth record), and μ_i^{mle} is the ith component of μ_mle
MLE for m-dimensional Gaussian
¤ Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, Σ)
¤ But you don’t know μ or Σ
¤ MLE: For which θ = (μ, Σ) is x1, x2, ..., xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^{R} x_k \qquad\quad \Sigma_{mle} = \frac{1}{R}\sum_{k=1}^{R}(x_k - \mu_{mle})(x_k - \mu_{mle})^T$$

Component-wise:

$$\sigma_{ij}^{mle} = \frac{1}{R}\sum_{k=1}^{R}(x_{ki} - \mu_i^{mle})(x_{kj} - \mu_j^{mle})$$

Where 1 ≤ i ≤ m, 1 ≤ j ≤ m, x_{ki} is the value of the ith component of x_k (the ith attribute of the kth record), and σ_{ij}^{mle} is the (i,j)th component of Σ_mle
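A numerical sketch of the multivariate formulas (numpy, synthetic data; the parameter values are made up). Note that `np.cov` divides by R − 1 by default, so `bias=True` is needed to match the 1/R MLE form:

```python
import numpy as np

rng = np.random.default_rng(7)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)  # rows are records x_k
R = len(X)

mu_mle = X.mean(axis=0)                # (1/R) * sum_k x_k
centered = X - mu_mle
Sigma_mle = centered.T @ centered / R  # (1/R) * sum_k (x_k - mu)(x_k - mu)^T

# np.cov divides by R - 1 by default; bias=True gives the 1/R MLE form
assert np.allclose(Sigma_mle, np.cov(X.T, bias=True))
```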
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Confidence Intervals
¤ A confidence interval indicates the reliability of an estimate
¤ How can we estimate this interval using hypothesis testing?
Hypothesis Testing
¤ We gave Mr. Bond a series of 16 taste tests
¤ At each test, we flipped a fair coin to determine whether to stir or shake the martini
¤ Mr. Bond was correct on 13/16 tests:
  ¤ Can Mr. Bond distinguish between a stirred and a shaken martini?
  ¤ Or was he just lucky?

James Bond Tasting Martini: using a binomial distribution (N=16, k=13, p=0.5)

¤ P(someone who is just lucky would be correct on 13/16 or more) = 0.0106
¤ The hypothesis that Mr. Bond was lucky is not proven false (but considerable doubt is cast on it)
¤ There is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred
From http://onlinestatbook.com/rvls.html
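The 0.0106 figure can be reproduced exactly with the binomial tail sum (plain Python, no external libraries):

```python
from math import comb

# P(13 or more correct out of 16 under pure guessing, i.e. p = 0.5)
N, k, p = 16, 13, 0.5
p_value = sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(k, N + 1))
# p_value = 697/65536, which is approximately 0.0106
```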
Probability Value (p-value)
¤ In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he were just guessing
¤ Important
  ¤ This is not the probability that he cannot tell the difference
  ¤ The probability of 0.0106 is the probability of a certain outcome (13 or more out of 16) assuming a certain state of the world (James Bond was only guessing)
¤ Using statistical terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.
Null Hypothesis & Statistical Significance
¤ In the previous example, the hypothesis that an effect is due to chance is called the null hypothesis
¤ The null hypothesis is typically the opposite of the researcher's hypothesis
¤ The null hypothesis is rejected when the probability value is lower than a specific threshold (e.g. 0.05 or 0.01), called the test level
¤ When the null hypothesis is rejected, the effect is said to be statistically significant
Statistical Hypothesis Testing
¤ A hypothesis test determines, with probability 1−α (α is the test level, or significance level), whether a sample X1, …, Xn from some unknown probability distribution has a certain property
¤ Examples
  ¤ the sample follows a normal distribution, or has mean m
¤ General Form
  ¤ Null hypothesis H0 vs. alternative hypothesis H1
  ¤ Needs a test variable X (derived from X1, …, Xn, H0, H1) and a test region R with
    ¤ X ∈ R for rejecting H0 and
    ¤ X ∉ R for retaining H0

|         | Retain H0     | Reject H0    |
|---------|---------------|--------------|
| H0 true | ✓             | Type I error |
| H1 true | Type II error | ✓            |
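As a sketch of the test-region idea for the James Bond example (plain Python; H0: p = 0.5 vs. H1: p > 0.5 with α = 0.05 are my choices for illustration), the region R contains all counts k whose upper-tail probability under H0 is at most α:

```python
from math import comb

N, alpha, p0 = 16, 0.05, 0.5

def upper_tail(k):
    """P(X >= k) under H0, where X ~ Binomial(N, p0)."""
    return sum(comb(N, i) * p0**i * (1 - p0)**(N - i) for i in range(k, N + 1))

# Test region R = {k_crit, ..., N}: smallest k whose tail mass is <= alpha
k_crit = min(k for k in range(N + 1) if upper_tail(k) <= alpha)

reject_h0 = 13 >= k_crit  # Mr. Bond's 13 correct answers fall inside R
```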
Hypothesis and p-values
¤ Hypotheses
  ¤ A hypothesis of the form θ = θ0 is called a simple hypothesis
  ¤ A hypothesis of the form θ > θ0 or θ < θ0 is called a composite hypothesis
¤ Tests
  ¤ A test of the form H0: θ = θ0 vs. H1: θ ≠ θ0 is called a two-sided test
  ¤ A test of the form H0: θ ≤ θ0 vs. H1: θ > θ0 (or H0: θ ≥ θ0 vs. H1: θ < θ0) is called a one-sided test
¤ P-value
  ¤ A small p-value means strong evidence against H0
Summary
¤ The Recipe for MLE
¤ Understand MLE estimation of Gaussian parameters
¤ Understand “biased estimator” versus “unbiased estimator”
¤ Understand Confidence Intervals