Supervised Learning: Maximum Likelihood Estimation
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Why MLE?
¤ Maximum Likelihood Estimation is a very fundamental part of data analysis
¤ “MLE for Gaussians” is training wheels for our future techniques
¤ Learning Gaussians is more useful than you might guess
Why Gaussian (Normal)?
¤ Gaussian distributions are
  ¤ a family of distributions that have the same general shape
  ¤ symmetric, with scores more concentrated in the middle than in the tails
¤ Why are they important?
  ¤ many psychological and educational variables are distributed approximately normally
  ¤ they are easy for mathematical statisticians to work with
¤ Formally, the normal distribution N(µ, σ²) approximates sums of independent identically distributed random variables, and has density

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$
Learning Gaussians from Data
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ (you do know σ²)
¤ MLE: for which µ are x1, x2, ..., xR most likely?
$$
\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\max_{\mu}\; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) && \text{by i.i.d.} \\
&= \arg\max_{\mu}\; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) && \text{monotonicity of log} \\
&= \arg\max_{\mu}\; \sum_{i=1}^{R} -\frac{(x_i - \mu)^2}{2\sigma^2} && \text{plug in the Gaussian density} \\
&= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 && \text{after simplification}
\end{aligned}
$$
MLE Strategy
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ=0 for a maximum, creating an equation in terms of θ
4. Solve it
5. Check that you have found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
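A minimal numerical sketch of this recipe in Python (the sample values and the known σ² = 1 are made up for illustration): rather than solving step 4 by hand, we let an optimizer maximize the log-likelihood over µ and check that it lands on the sample mean, which the next slides derive in closed form.

```python
# Steps 1-5 of the MLE recipe, done numerically for the known-variance Gaussian.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x = np.array([2.1, 1.7, 3.2, 2.6, 1.9])  # hypothetical sample x_1 ... x_R
sigma2 = 1.0                              # assumed known variance

def neg_log_likelihood(mu):
    # Step 1: LL = sum_i log p(x_i | mu, sigma2); negated because we minimize
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

# Steps 2-4: the optimizer effectively solves dLL/dmu = 0 for us
result = minimize_scalar(neg_log_likelihood)
print(result.x, x.mean())  # both print the sample mean: 2.3
```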
The MLE μ
$$
\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)
= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 \\
&= \text{the } \mu \text{ such that } 0 = \frac{\partial}{\partial \mu} \sum_{i=1}^{R} (x_i - \mu)^2 = -2 \sum_{i=1}^{R} (x_i - \mu)
\end{aligned}
$$

Solving:

$$\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i$$
Interesting…
¤ The best estimate of the mean of a distribution is the mean of the sample!!
$$\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i$$
A General MLE Strategy
Suppose θ = (θ1, θ2, ..., θn)^T is a vector of parameters
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ,stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations ∂LL/∂θ = 0
4. Check that you are at the maximum
$$
\frac{\partial LL}{\partial \theta} =
\begin{pmatrix}
\partial LL / \partial \theta_1 \\
\partial LL / \partial \theta_2 \\
\vdots \\
\partial LL / \partial \theta_n
\end{pmatrix}
$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ and σ²
¤ MLE: for which θ = (µ, σ²) are x1, x2, ..., xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2$$

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)
\qquad
\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ and σ²
¤ MLE: for which θ = (µ, σ²) are x1, x2, ..., xR most likely? Set both partial derivatives of the log-likelihood above to zero:

$$0 = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)
\qquad
0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ or σ²
¤ MLE: for which θ = (µ, σ²) are x1, x2, ..., xR most likely? Solving the two equations gives

$$\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i
\qquad
\sigma^2_{mle} = \frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2$$
Exercise (Applying MLE to another model)
James Bond Tasting Martini
¤ We gave Mr. Bond a series of 16 taste tests
¤ At each test, we flipped a fair coin to determine
whether to stir or shake the martini
¤ Mr. Bond was correct on 13/16 tests
¤ How can we model this experiment?
¤ Use MLE to estimate the parameters of your model.
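As a hint (one natural model, not the only possible answer): treat each taste test as an independent Bernoulli trial with unknown success probability θ, so the number of correct answers out of n = 16 is Binomial(n, θ). Applying the MLE strategy above:

$$LL(\theta) = \log\left[\binom{16}{13}\,\theta^{13}(1-\theta)^{3}\right],
\qquad
\frac{\partial LL}{\partial \theta} = \frac{13}{\theta} - \frac{3}{1-\theta} = 0
\;\Longrightarrow\; \theta_{mle} = \frac{13}{16}$$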
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Unbiased Estimators
¤ An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then
$$E\left[\mu_{mle}\right] = E\left[\frac{1}{R} \sum_{i=1}^{R} x_i\right] = \mu$$

so µ_mle is unbiased.
Biased Estimators
¤ An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then
$$E\left[\sigma^2_{mle}\right]
= E\left[\frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2\right]
= E\left[\frac{1}{R} \sum_{i=1}^{R} \left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right]
\neq \sigma^2$$

so σ²_mle is biased.
MLE Variance Bias
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then
$$E\left[\sigma^2_{mle}\right]
= E\left[\frac{1}{R} \sum_{i=1}^{R} \left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right]
= \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

Intuition check: consider the case R = 1. Why should we expect σ²_mle to underestimate the true σ²? (With a single point, µ_mle = x1, so σ²_mle = 0 no matter what the true σ² is.)
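For completeness, a short derivation of this result, using the identity $\frac{1}{R}\sum_i (x_i - \bar{x})^2 = \frac{1}{R}\sum_i x_i^2 - \bar{x}^2$ together with $E[x_i^2] = \sigma^2 + \mu^2$ and $E[\bar{x}^2] = \sigma^2/R + \mu^2$ for i.i.d. samples (here $\bar{x} = \mu_{mle}$):

$$E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \bar{x})^2\right]
= E\left[\frac{1}{R}\sum_{i=1}^{R} x_i^2\right] - E\left[\bar{x}^2\right]
= (\sigma^2 + \mu^2) - \left(\frac{\sigma^2}{R} + \mu^2\right)
= \left(1 - \frac{1}{R}\right)\sigma^2$$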
MLE Variance Bias
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then, since

$$E\left[\sigma^2_{mle}\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2,$$

we define

$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{R}}$$

so that

$$E\left[\sigma^2_{unbiased}\right] = \sigma^2$$
Unbiaseditude discussion
¤ Which is best?
¤ Answer:
  ¤ It depends on the task
  ¤ And it does not make much difference once R is large
$$\sigma^2_{mle} = \frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2
\qquad
\sigma^2_{unbiased} = \frac{1}{R-1} \sum_{i=1}^{R} (x_i - \mu_{mle})^2$$
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
MLE for m-dimensional Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, Σ)
¤ But you don't know µ or Σ
¤ MLE: for which θ = (µ, Σ) are x1, x2, ..., xR most likely?

$$\mu_{mle} = \frac{1}{R} \sum_{k=1}^{R} x_k
\qquad
\Sigma_{mle} = \frac{1}{R} \sum_{k=1}^{R} (x_k - \mu_{mle})(x_k - \mu_{mle})^T$$
MLE for m-dimensional Gaussian
¤ Same setup: x1, x2, ..., xR ~ i.i.d. N(µ, Σ), with µ and Σ unknown
¤ Written componentwise, the mean estimate is

$$\mu_i^{mle} = \frac{1}{R} \sum_{k=1}^{R} x_{ki}$$

where 1 ≤ i ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and µ_i^mle is the ith component of µ_mle.
MLE for m-dimensional Gaussian
¤ Same setup: x1, x2, ..., xR ~ i.i.d. N(µ, Σ), with µ and Σ unknown
¤ Likewise, each entry of the covariance estimate is

$$\sigma_{ij}^{mle} = \frac{1}{R} \sum_{k=1}^{R} (x_{ki} - \mu_i^{mle})(x_{kj} - \mu_j^{mle})$$

where 1 ≤ i ≤ m, 1 ≤ j ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and σ_ij^mle is the (i,j)th component of Σ_mle.
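A minimal numpy sketch of the m-dimensional formulas on synthetic data; np.cov with bias=True uses the same divide-by-R convention as Σ_mle.

```python
# m-dimensional Gaussian MLE on synthetic data (R=200 records, m=3 attributes).
import numpy as np

X = np.random.default_rng(1).normal(size=(200, 3))
R = X.shape[0]

mu_mle = X.mean(axis=0)                  # (1/R) sum_k x_k
centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / R  # (1/R) sum_k (x_k - mu)(x_k - mu)^T

# np.cov with bias=True also divides by R, so the two must agree
print(np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True)))  # True
```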
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Confidence Intervals
¤ Confidence intervals indicate the reliability of an estimate
¤ How can we estimate such an interval using hypothesis testing?
Hypothesis Testing
¤ We gave Mr. Bond a series of 16 taste tests
¤ At each test, we flipped a fair coin to determine whether to stir or shake the martini
¤ Mr. Bond was correct on 13/16 tests:
  ¤ Is Mr. Bond able to distinguish between stirred and shaken martinis?
  ¤ Was he just lucky?
James Bond Tasting Martini
Using a binomial distribution (N = 16, k = 13, p = 0.5):
¤ P(someone who is lucky would be correct on 13/16 or more) = 0.0106
¤ The hypothesis that Mr. Bond was lucky is not proven false (but considerable doubt is cast on it)
¤ There is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred
From http://onlinestatbook.com/rvls.html
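A minimal scipy sketch reproducing the 0.0106 figure; binom.sf(12, 16, 0.5) gives P(X > 12), i.e. the probability of 13 or more correct answers under pure guessing.

```python
# Probability that a pure guesser gets 13 or more out of 16 correct.
from scipy.stats import binom

p_value = binom.sf(12, 16, 0.5)  # sf(k) = P(X > k), so sf(12) = P(X >= 13)
print(round(p_value, 4))         # 0.0106
```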
Probability Value (p-value)
¤ In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he were just guessing
¤ Important
  ¤ This is not the probability that he cannot tell the difference
  ¤ The probability of 0.0106 is the probability of a certain outcome (13 or more correct out of 16) assuming a certain state of the world (James Bond was only guessing)
¤ In statistical terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.
Null Hypothesis & Statistical Significance
¤ In the previous example, the hypothesis that an effect is due to chance is called the null hypothesis
¤ The null hypothesis is typically the opposite of the researcher's hypothesis
¤ The null hypothesis is rejected when the probability value is lower than a pre-specified threshold (typically 0.05 or 0.01), called the test level
¤ When the null hypothesis is rejected, the effect is said to be statistically significant
Statistical Hypothesis Testing
¤ A hypothesis test determines, with probability 1 − α (where α is the test level, or significance level), whether a sample X1, …, Xn from some unknown probability distribution has a certain property
¤ Examples
  ¤ the sample comes from a normal distribution
  ¤ the sample has mean m
¤ General form
  ¤ Null hypothesis H0 vs. alternative hypothesis H1
  ¤ Needs a test variable X (derived from X1, …, Xn, H0, H1) and a test region R with
    X ∈ R for rejecting H0 and
    X ∉ R for retaining H0

               Retain H0       Reject H0
  H0 true     correct         Type I error
  H1 true     Type II error   correct
Hypotheses and p-values
¤ Hypotheses
  ¤ A hypothesis of the form θ = θ0 is called a simple hypothesis
  ¤ A hypothesis of the form θ > θ0 or θ < θ0 is called a composite hypothesis
¤ Tests
  ¤ A test of the form H0: θ = θ0 vs. H1: θ ≠ θ0 is called a two-sided test
  ¤ A test of the form H0: θ ≤ θ0 vs. H1: θ > θ0, or H0: θ ≥ θ0 vs. H1: θ < θ0, is called a one-sided test
¤ p-value
  ¤ A small p-value means strong evidence against H0
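As a concrete illustration of the two forms (a sketch on synthetic data, not part of the original slides): scipy's one-sample t-test is a two-sided test of H0: µ = µ0 by default, and its alternative argument switches to the one-sided forms.

```python
# Two-sided vs. one-sided one-sample t-tests on synthetic data.
import numpy as np
from scipy.stats import ttest_1samp

x = np.random.default_rng(7).normal(loc=5.3, scale=1.0, size=30)

# H0: mu = 5 vs. H1: mu != 5 (two-sided, the default)
print(ttest_1samp(x, popmean=5.0).pvalue)

# H0: mu <= 5 vs. H1: mu > 5 (one-sided, needs scipy >= 1.6)
print(ttest_1samp(x, popmean=5.0, alternative='greater').pvalue)
```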
Summary
¤ The Recipe for MLE
¤ Understand MLE estimation of Gaussian parameters
¤ Understand “biased estimator” versus “unbiased estimator”
¤ Understand Confidence Intervals