Supervised Learning: Maximum Likelihood Estimation
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Why MLE?
¤ Maximum Likelihood Estimation is a very fundamental part of data analysis
¤ “MLE for Gaussians” is training wheels for our future techniques
¤ Learning Gaussians is more useful than you might guess
Why Gaussian (Normal)?
¤ Gaussian distributions are
  ¤ a family of distributions that have the same general shape
  ¤ symmetric, with scores more concentrated in the middle than in the tails
¤ Why are they important?
  ¤ many psychological and educational variables are distributed approximately normally
  ¤ they are easy for mathematical statisticians to work with
¤ Formally, the normal distribution N(µ, σ²) approximates sums of independent identically distributed random variables, and has density

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$
Learning Gaussians from Data
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ (you do know σ²)
¤ MLE: for which µ are x1, x2, ..., xR most likely?
$$
\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\max_{\mu}\; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) && \text{by i.i.d.} \\
&= \arg\max_{\mu}\; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) && \text{monotonicity of log} \\
&= \arg\max_{\mu}\; \sum_{i=1}^{R} -\frac{(x_i - \mu)^2}{2\sigma^2} && \text{plug in the Gaussian density} \\
&= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 && \text{after simplification}
\end{aligned}
$$
MLE Strategy
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ=0 for a maximum, creating an equation in terms of θ
4. Solve it
5. Check that you have found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
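A minimal numerical sketch of this recipe in Python (the sample values and the known σ² = 1 are made up for illustration): rather than solving step 4 by hand, we let an optimizer maximize the log-likelihood over µ and check that it lands on the sample mean, which the next slides derive in closed form.

```python
# Steps 1-5 of the MLE recipe, done numerically for the known-variance Gaussian.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x = np.array([2.1, 1.7, 3.2, 2.6, 1.9])  # hypothetical sample x_1 ... x_R
sigma2 = 1.0                              # assumed known variance

def neg_log_likelihood(mu):
    # Step 1: LL = sum_i log p(x_i | mu, sigma2); negated because we minimize
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

# Steps 2-4: the optimizer effectively solves dLL/dmu = 0 for us
result = minimize_scalar(neg_log_likelihood)
print(result.x, x.mean())  # both print the sample mean: 2.3
```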
The MLE μ
$$
\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)
= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 \\
&= \text{the } \mu \text{ such that } 0 = \frac{\partial}{\partial \mu} \sum_{i=1}^{R} (x_i - \mu)^2 = -2 \sum_{i=1}^{R} (x_i - \mu)
\end{aligned}
$$

Solving:

$$\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i$$
Interesting…
¤ The best estimate of the mean of a distribution is the mean of the sample!!
$$\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i$$
A General MLE Strategy
Suppose θ = (θ1, θ2, ..., θn)^T is a vector of parameters
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ,stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations ∂LL/∂θ = 0
4. Check that you are at the maximum
$$
\frac{\partial LL}{\partial \theta} =
\begin{pmatrix}
\partial LL / \partial \theta_1 \\
\partial LL / \partial \theta_2 \\
\vdots \\
\partial LL / \partial \theta_n
\end{pmatrix}
$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ and σ²
¤ MLE: for which θ = (µ, σ²) are x1, x2, ..., xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2$$

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)
\qquad
\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ and σ²
¤ MLE: for which θ = (µ, σ²) are x1, x2, ..., xR most likely? Set both partial derivatives of the log-likelihood above to zero:

$$0 = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)
\qquad
0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2$$
MLE for Univariate Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, σ²)
¤ But you don't know µ or σ²
¤ MLE: for which θ = (µ, σ²) are x1, x2, ..., xR most likely? Solving the two equations gives

$$\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i
\qquad
\sigma^2_{mle} = \frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2$$
Exercise (Applying MLE to another model)
James Bond Tasting Martini
¤ We gave Mr. Bond a series of 16 taste tests
¤ At each test, we flipped a fair coin to determine
whether to stir or shake the martini
¤ Mr. Bond was correct on 13/16 tests
¤ How can we model this experiment?
¤ Use MLE to estimate the parameters of your model.
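As a hint (one natural model, not the only possible answer): treat each taste test as an independent Bernoulli trial with unknown success probability θ, so the number of correct answers out of n = 16 is Binomial(n, θ). Applying the MLE strategy above:

$$LL(\theta) = \log\left[\binom{16}{13}\,\theta^{13}(1-\theta)^{3}\right],
\qquad
\frac{\partial LL}{\partial \theta} = \frac{13}{\theta} - \frac{3}{1-\theta} = 0
\;\Longrightarrow\; \theta_{mle} = \frac{13}{16}$$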
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Unbiased Estimators
¤ An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then
$$E\left[\mu_{mle}\right] = E\left[\frac{1}{R} \sum_{i=1}^{R} x_i\right] = \mu$$

so µ_mle is unbiased.
Biased Estimators
¤ An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then
$$E\left[\sigma^2_{mle}\right]
= E\left[\frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2\right]
= E\left[\frac{1}{R} \sum_{i=1}^{R} \left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right]
\neq \sigma^2$$

so σ²_mle is biased.
MLE Variance Bias
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then
$$E\left[\sigma^2_{mle}\right]
= E\left[\frac{1}{R} \sum_{i=1}^{R} \left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right]
= \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

Intuition check: consider the case R = 1. Why should we expect σ²_mle to underestimate the true σ²? (With a single point, µ_mle = x1, so σ²_mle = 0 no matter what the true σ² is.)
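For completeness, a short derivation of this result, using the identity $\frac{1}{R}\sum_i (x_i - \bar{x})^2 = \frac{1}{R}\sum_i x_i^2 - \bar{x}^2$ together with $E[x_i^2] = \sigma^2 + \mu^2$ and $E[\bar{x}^2] = \sigma^2/R + \mu^2$ for i.i.d. samples (here $\bar{x} = \mu_{mle}$):

$$E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \bar{x})^2\right]
= E\left[\frac{1}{R}\sum_{i=1}^{R} x_i^2\right] - E\left[\bar{x}^2\right]
= (\sigma^2 + \mu^2) - \left(\frac{\sigma^2}{R} + \mu^2\right)
= \left(1 - \frac{1}{R}\right)\sigma^2$$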
MLE Variance Bias
¤ If x1, x2, ..., xR ~ i.i.d. N(µ, σ²) then, since

$$E\left[\sigma^2_{mle}\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2,$$

we define

$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{R}}$$

so that

$$E\left[\sigma^2_{unbiased}\right] = \sigma^2$$
Unbiaseditude discussion
¤ Which is best?
¤ Answer:
  ¤ It depends on the task
  ¤ And it does not make much difference once R is large
$$\sigma^2_{mle} = \frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2
\qquad
\sigma^2_{unbiased} = \frac{1}{R-1} \sum_{i=1}^{R} (x_i - \mu_{mle})^2$$
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
MLE for m-dimensional Gaussian
¤ Suppose you have x1, x2, ..., xR ~ i.i.d. N(µ, Σ)
¤ But you don't know µ or Σ
¤ MLE: for which θ = (µ, Σ) are x1, x2, ..., xR most likely?

$$\mu_{mle} = \frac{1}{R} \sum_{k=1}^{R} x_k
\qquad
\Sigma_{mle} = \frac{1}{R} \sum_{k=1}^{R} (x_k - \mu_{mle})(x_k - \mu_{mle})^T$$
MLE for m-dimensional Gaussian
¤ Same setup: x1, x2, ..., xR ~ i.i.d. N(µ, Σ), with µ and Σ unknown
¤ Written componentwise, the mean estimate is

$$\mu_i^{mle} = \frac{1}{R} \sum_{k=1}^{R} x_{ki}$$

where 1 ≤ i ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and µ_i^mle is the ith component of µ_mle.
MLE for m-dimensional Gaussian
¤ Same setup: x1, x2, ..., xR ~ i.i.d. N(µ, Σ), with µ and Σ unknown
¤ Likewise, each entry of the covariance estimate is

$$\sigma_{ij}^{mle} = \frac{1}{R} \sum_{k=1}^{R} (x_{ki} - \mu_i^{mle})(x_{kj} - \mu_j^{mle})$$

where 1 ≤ i ≤ m, 1 ≤ j ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and σ_ij^mle is the (i,j)th component of Σ_mle.
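A minimal numpy sketch of the m-dimensional formulas on synthetic data; np.cov with bias=True uses the same divide-by-R convention as Σ_mle.

```python
# m-dimensional Gaussian MLE on synthetic data (R=200 records, m=3 attributes).
import numpy as np

X = np.random.default_rng(1).normal(size=(200, 3))
R = X.shape[0]

mu_mle = X.mean(axis=0)                  # (1/R) sum_k x_k
centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / R  # (1/R) sum_k (x_k - mu)(x_k - mu)^T

# np.cov with bias=True also divides by R, so the two must agree
print(np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True)))  # True
```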
Road Map
1. Learning Univariate Gaussians
2. What is a biased estimator?
3. Learning Multivariate Gaussians
4. Hypothesis Testing
Confidence Intervals
¤ Confidence intervals indicate the reliability of an estimate
¤ How can we estimate such an interval using hypothesis testing?
Hypothesis Testing
¤ We gave Mr. Bond a series of 16 taste tests
¤ At each test, we flipped a fair coin to determine whether to stir or shake the martini
¤ Mr. Bond was correct on 13/16 tests:
  ¤ Is Mr. Bond able to distinguish between stirred and shaken martinis?
  ¤ Was he just lucky?
James Bond Tasting Martini
Using a binomial distribution (N = 16, k = 13, p = 0.5):
¤ P(someone who is lucky would be correct on 13/16 or more) = 0.0106
¤ The hypothesis that Mr. Bond was lucky is not proven false (but considerable doubt is cast on it)
¤ There is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred
From http://onlinestatbook.com/rvls.html
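A minimal scipy sketch reproducing the 0.0106 figure; binom.sf(12, 16, 0.5) gives P(X > 12), i.e. the probability of 13 or more correct answers under pure guessing.

```python
# Probability that a pure guesser gets 13 or more out of 16 correct.
from scipy.stats import binom

p_value = binom.sf(12, 16, 0.5)  # sf(k) = P(X > k), so sf(12) = P(X >= 13)
print(round(p_value, 4))         # 0.0106
```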
Probability Value (p-value)
¤ In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he were just guessing
¤ Important
  ¤ This is not the probability that he cannot tell the difference
  ¤ The probability of 0.0106 is the probability of a certain outcome (13 or more correct out of 16) assuming a certain state of the world (James Bond was only guessing)
¤ In statistical terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.
Null Hypothesis & Statistical Significance
¤ In the previous example, the hypothesis that an effect is due to chance is called the null hypothesis
¤ The null hypothesis is typically the opposite of the researcher's hypothesis
¤ The null hypothesis is rejected when the probability value is lower than a pre-specified threshold (typically 0.05 or 0.01), called the test level
¤ When the null hypothesis is rejected, the effect is said to be statistically significant
Statistical Hypothesis Testing
¤ A hypothesis test determines, with probability 1 − α (where α is the test level, or significance level), whether a sample X1, …, Xn from some unknown probability distribution has a certain property
¤ Examples
  ¤ the sample comes from a normal distribution
  ¤ the sample has mean m
¤ General form
  ¤ Null hypothesis H0 vs. alternative hypothesis H1
  ¤ Needs a test variable X (derived from X1, …, Xn, H0, H1) and a test region R with
    X ∈ R for rejecting H0 and
    X ∉ R for retaining H0

               Retain H0       Reject H0
  H0 true     correct         Type I error
  H1 true     Type II error   correct
Hypotheses and p-values
¤ Hypotheses
  ¤ A hypothesis of the form θ = θ0 is called a simple hypothesis
  ¤ A hypothesis of the form θ > θ0 or θ < θ0 is called a composite hypothesis
¤ Tests
  ¤ A test of the form H0: θ = θ0 vs. H1: θ ≠ θ0 is called a two-sided test
  ¤ A test of the form H0: θ ≤ θ0 vs. H1: θ > θ0, or H0: θ ≥ θ0 vs. H1: θ < θ0, is called a one-sided test
¤ p-value
  ¤ A small p-value means strong evidence against H0
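As a concrete illustration of the two forms (a sketch on synthetic data, not part of the original slides): scipy's one-sample t-test is a two-sided test of H0: µ = µ0 by default, and its alternative argument switches to the one-sided forms.

```python
# Two-sided vs. one-sided one-sample t-tests on synthetic data.
import numpy as np
from scipy.stats import ttest_1samp

x = np.random.default_rng(7).normal(loc=5.3, scale=1.0, size=30)

# H0: mu = 5 vs. H1: mu != 5 (two-sided, the default)
print(ttest_1samp(x, popmean=5.0).pvalue)

# H0: mu <= 5 vs. H1: mu > 5 (one-sided, needs scipy >= 1.6)
print(ttest_1samp(x, popmean=5.0, alternative='greater').pvalue)
```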
Summary
¤ The Recipe for MLE
¤ Understand MLE estimation of Gaussian parameters
¤ Understand “biased estimator” versus “unbiased estimator”
¤ Understand Confidence Intervals