
Supervised Learning: Maximum Likelihood Estimation

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

Why MLE?

¤  Maximum Likelihood Estimation is a very, very fundamental part of data analysis

¤  “MLE for Gaussians” is training wheels for our future techniques

¤  Learning Gaussians is more useful than you might guess

Why Gaussian (Normal)?

¤  Gaussian distributions
¤  are a family of distributions that have the same general shape
¤  are symmetric, with scores more concentrated in the middle than in the tails

¤  Why are they important?
¤  many psychological and educational variables are distributed approximately normally
¤  they are easy for mathematical statisticians to work with

¤  Formally
¤  the normal distribution N(μ, σ²) approximates sums of independent, identically distributed random variables, and has density

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)}
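As a quick check of the formula, it can be evaluated directly; a minimal sketch (the helper name normal_pdf and the test value are ours, not from the slides):

import numpy as np

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma2) at x, straight from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(normal_pdf(0.0))  # 0.3989... = 1/sqrt(2*pi), the standard normal peak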

Learning Gaussians from Data

¤  Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²)

¤  But you don’t know μ (you do know σ²)

¤  MLE: for which μ are x1, x2, ..., xR most likely?

\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
          &= \arg\max_{\mu}\; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) && \text{by i.i.d.} \\
          &= \arg\max_{\mu}\; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) && \text{monotonicity of log} \\
          &= \arg\max_{\mu}\; \sum_{i=1}^{R} -\frac{(x_i - \mu)^2}{2\sigma^2} && \text{plug in the Gaussian formula} \\
          &= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 && \text{after simplification}
\end{aligned}

MLE Strategy

Task: Find MLE θ assuming known form for p(Data| θ,stuff)

1.  Write LL = log P(Data| θ, stuff)

2.  Work out ∂LL/∂θ using high-school calculus

3.  Set ∂LL/∂θ=0 for a maximum, creating an equation in terms of θ

4.  Solve it

5.  Check that you have found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
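The recipe can be run mechanically with a computer algebra system. A minimal sketch using sympy (the symbol names and the five-point symbolic data set are our own illustration, not part of the slides):

import sympy as sp

mu = sp.symbols('mu')
sigma2 = sp.symbols('sigma2', positive=True)
xs = sp.symbols('x1:6')  # five symbolic data points x1..x5

# Step 1: LL = log P(Data | mu, sigma2) for i.i.d. Gaussian samples
LL = sum(-sp.log(sp.sqrt(2 * sp.pi * sigma2)) - (x - mu) ** 2 / (2 * sigma2)
         for x in xs)

dLL = sp.diff(LL, mu)              # Step 2: work out dLL/dmu
sol = sp.solve(sp.Eq(dLL, 0), mu)  # Steps 3-4: set dLL/dmu = 0 and solve
print(sp.simplify(sol[0]))         # (x1 + x2 + x3 + x4 + x5)/5: the sample mean
# Step 5: d2LL/dmu2 = -R/sigma2 < 0, so this stationary point is a maximum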

The MLE μ

\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
          &= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 \\
          &= \text{the } \mu \text{ s.t. } 0 = \frac{\partial}{\partial\mu} \sum_{i=1}^{R} (x_i - \mu)^2 = -2 \sum_{i=1}^{R} (x_i - \mu)
\end{aligned}

Solving for μ:

\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i

Interesting…

¤  The best estimate of the mean of a distribution is the mean of the sample!!

\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i
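A numerical sanity check of this claim; a sketch where the simulated data, the known σ² = 4, and the helper log_lik are our own illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=50)   # R = 50 samples, sigma^2 = 4 known

def log_lik(mu, x, sigma2=4.0):
    """Log-likelihood of the data under N(mu, sigma2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

grid = np.linspace(x.mean() - 1, x.mean() + 1, 2001)
best = grid[np.argmax([log_lik(m, x) for m in grid])]
print(best, x.mean())   # the grid maximizer matches the sample mean (to grid resolution)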

A General MLE Strategy

Suppose θ = (θ1, θ2, ..., θn)^T is a vector of parameters

Task: Find MLE θ assuming known form for p(Data| θ,stuff)

1.  Write LL = log P(Data| θ,stuff)

2.  Work out ∂LL/∂θ using high-school calculus

3.  Solve the set of simultaneous equations ∂LL/∂θ = 0

4.  Check that you are at the maximum

\frac{\partial LL}{\partial \theta} =
\begin{pmatrix}
\partial LL / \partial \theta_1 \\
\partial LL / \partial \theta_2 \\
\vdots \\
\partial LL / \partial \theta_n
\end{pmatrix}
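When the simultaneous equations have no closed form, the same strategy can be run numerically. A sketch using scipy.optimize; the simulated data, the helper neg_ll, and the reparameterization through log σ² (which keeps σ² positive, the "be careful if θ is constrained" step) are our own choices:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(5.0, 3.0, size=200)    # simulated data

def neg_ll(theta, x):
    """Negative Gaussian log-likelihood; theta = (mu, log(sigma2))."""
    mu, log_sigma2 = theta
    sigma2 = np.exp(log_sigma2)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + ((x - mu) ** 2).sum() / (2 * sigma2)

res = minimize(neg_ll, x0=[0.0, 0.0], args=(x,))
print(res.x[0], np.exp(res.x[1]))     # ~ x.mean() and ~ x.var(ddof=0)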

MLE for Univariate Gaussian

¤  Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²)

¤  But you don’t know μ and σ²

¤  MLE: for which θ = (μ, σ²) are x1, x2, ..., xR most likely?

\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2

\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)

\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2

MLE for Univariate Gaussian

Setting both partial derivatives to zero:

0 = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)

0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2

MLE for Univariate Gaussian

Solving the two equations gives:

\mu_{mle} = \frac{1}{R} \sum_{i=1}^{R} x_i

\sigma^2_{mle} = \frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2

Exercise (Applying MLE to another model)

James Bond Tasting Martini

¤  We gave Mr. Bond a series of 16 taste tests

¤  At each test, we flipped a fair coin to determine whether to stir or shake the martini

¤  Mr. Bond was correct on 13/16 tests

¤  How can we model this experiment?

¤  Use MLE to estimate the parameters of your model.
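A sketch of one reasonable answer, offered as a hint rather than the official solution: treat each test as an i.i.d. Bernoulli(p) trial, so 13 correct out of 16 is a Binomial(16, p) outcome.

import numpy as np

# LL(p) = k log p + (N - k) log(1 - p) (dropping the constant binomial coefficient);
# setting dLL/dp = k/p - (N - k)/(1 - p) = 0 gives p_mle = k/N = 13/16.
N, k = 16, 13
p = np.linspace(0.001, 0.999, 999)
ll = k * np.log(p) + (N - k) * np.log(1 - p)
print(p[np.argmax(ll)], k / N)   # grid maximizer ~ 0.8125 = 13/16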

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

Unbiased Estimators

¤  An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter

¤  If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²), then

E\left[\mu_{mle}\right] = E\left[\frac{1}{R} \sum_{i=1}^{R} x_i\right] = \mu

so μmle is unbiased

Biased Estimators

¤  An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter

¤  If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²), then

E\left[\sigma^2_{mle}\right] = E\left[\frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2\right] = E\left[\frac{1}{R} \sum_{i=1}^{R} \left(x_i - \frac{1}{R} \sum_{j=1}^{R} x_j\right)^2\right] \neq \sigma^2

so σ²mle is biased

MLE Variance Bias

¤  If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²), then

E\left[\sigma^2_{mle}\right] = E\left[\frac{1}{R} \sum_{i=1}^{R} \left(x_i - \frac{1}{R} \sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right) \sigma^2 \neq \sigma^2

Intuition check: consider the case R = 1. Why should we expect σ²mle to be an underestimate of the true σ²? (With R = 1, μmle = x1, so σ²mle = 0 no matter what the data are.)

MLE Variance Bias

¤  If x1, x2, ..., xR ~ (i.i.d.) N(μ, σ²), then

E\left[\sigma^2_{mle}\right] = E\left[\frac{1}{R} \sum_{i=1}^{R} \left(x_i - \frac{1}{R} \sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right) \sigma^2 \neq \sigma^2

so we define

\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{R}}

which satisfies

E\left[\sigma^2_{unbiased}\right] = \sigma^2

Unbiaseditude discussion

¤  Which is best?

¤  Answer:
¤  It depends on the task
¤  And it does not make much difference once R gets large

\sigma^2_{mle} = \frac{1}{R} \sum_{i=1}^{R} (x_i - \mu_{mle})^2

\sigma^2_{unbiased} = \frac{1}{R-1} \sum_{i=1}^{R} (x_i - \mu_{mle})^2

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

MLE for m-dimensional Gaussian

¤  Suppose you have x1, x2, ..., xR ~ (i.i.d.) N(μ, Σ)

¤  But you don’t know μ or Σ

¤  MLE: for which θ = (μ, Σ) are x1, x2, ..., xR most likely?

\mu_{mle} = \frac{1}{R} \sum_{k=1}^{R} x_k

\Sigma_{mle} = \frac{1}{R} \sum_{k=1}^{R} (x_k - \mu_{mle})(x_k - \mu_{mle})^T

MLE for m-dimensional Gaussian

Component-wise, the mean estimate is

\mu_i^{mle} = \frac{1}{R} \sum_{k=1}^{R} x_{ki}

Where 1 ≤ i ≤ m, xki is the value of the ith component of xk (the ith attribute of the kth record), and μi^mle is the ith component of μmle

MLE for m-dimensional Gaussian

Component-wise, the covariance estimate is

\sigma_{ij}^{mle} = \frac{1}{R} \sum_{k=1}^{R} (x_{ki} - \mu_i^{mle})(x_{kj} - \mu_j^{mle})

Where 1 ≤ i ≤ m, 1 ≤ j ≤ m, xki is the value of the ith component of xk (the ith attribute of the kth record), and σij^mle is the (i,j)th component of Σmle
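Both multivariate estimators take a few lines of numpy. A sketch (the function name mvn_mle and the simulated data are ours):

import numpy as np

def mvn_mle(X):
    """X: (R, m) data matrix, one record per row. Returns the MLE mean and covariance."""
    R = X.shape[0]
    mu = X.mean(axis=0)            # component-wise sample mean
    D = X - mu                     # deviations x_k - mu_mle, shape (R, m)
    Sigma = D.T @ D / R            # (1/R) times the sum of outer products
    return mu, Sigma

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
mu, Sigma = mvn_mle(X)
print(np.allclose(Sigma, np.cov(X, rowvar=False, ddof=0)))   # True: same 1/R covariance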

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

Confidence Intervals

¤  Confidence intervals indicate the reliability of an estimate

¤  How can we estimate such an interval using hypothesis testing?
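The slides do not give a formula here; purely as an assumed illustration, a common normal-approximation sketch (the Wald interval) for Mr. Bond's success rate looks like this:

from math import sqrt

N, k = 16, 13
p_hat = k / N
z = 1.96                               # ~95% two-sided normal quantile
half = z * sqrt(p_hat * (1 - p_hat) / N)
print(p_hat - half, p_hat + half)      # ~ (0.62, 1.00); Wald intervals can spill
                                       # past 1, so clip to [0, 1] in practice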

Hypothesis Testing

¤  We gave Mr. Bond a series of 16 taste tests

¤  At each test, we flipped a fair coin to determine whether to stir or shake the martini

¤  Mr. Bond was correct on 13/16 tests:
¤  Is Mr. Bond able to distinguish between a stirred and a shaken martini?
¤  Was he just lucky?

James Bond Tasting Martini: Using a Binomial Distribution (N = 16, k = 13, p = 0.5)

¤  P(someone who is lucky would be correct on 13/16 or more) = 0.0106

¤  The hypothesis that Mr. Bond was lucky is not proven false (but considerable doubt is cast on it)

¤  There is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred

From http://onlinestatbook.com/rvls.html
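The 0.0106 can be reproduced exactly by summing the binomial tail, for example with Python's math.comb:

from math import comb

# P(13 or more correct out of 16 | just guessing, p = 0.5)
p_value = sum(comb(16, k) for k in range(13, 17)) / 2 ** 16
print(round(p_value, 4))   # 0.0106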

Probability Value (p-value)

¤  In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he was just guessing

¤  Important:
¤  This is not the probability that he cannot tell the difference
¤  The probability of 0.0106 is the probability of a certain outcome (13 or more out of 16) assuming a certain state of the world (James Bond was only guessing)

¤  Using statistical terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.

Null Hypothesis & Statistical Significance

¤  In the previous example, the hypothesis that an effect is due to chance is called the null hypothesis

¤  The null hypothesis is typically the opposite of the researcher's hypothesis

¤  The null hypothesis is rejected when the probability value is lower than a specific threshold (0.05 or 0.01) called the test level

¤  When the null hypothesis is rejected, the effect is said to be statistically significant

Statistical Hypothesis Testing

¤  A hypothesis test determines with probability 1 − α (test level α, the significance level) whether a sample X1, …, Xn from some unknown probability distribution has a certain property

¤  Examples:
¤  the sample comes from a normal distribution, or has mean m

¤  General form:

Null hypothesis H0 vs. alternative hypothesis H1

Needs a test variable X (derived from X1, …, Xn, H0, H1) and a test region R with

X ∈ R for rejecting H0 and
X ∉ R for retaining H0

            Retain H0        Reject H0
H0 true     √                Type I error
H1 true     Type II error    √
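A sketch tying the general form to the Bond example (the helper rejection_threshold is ours): choose the test region R = {X ≥ c} with the smallest c whose tail probability under H0 stays below α.

from math import comb

def rejection_threshold(N=16, p0=0.5, alpha=0.05):
    """Smallest c with P(X >= c | H0: p = p0) <= alpha for a one-sided binomial test."""
    for c in range(N + 1):
        tail = sum(comb(N, k) * p0 ** k * (1 - p0) ** (N - k) for k in range(c, N + 1))
        if tail <= alpha:
            return c, tail
    return None

print(rejection_threshold())   # (12, 0.038...): reject "just guessing" at 12+ correct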

Hypothesis and p-values

¤  Hypothesis
¤  A hypothesis of the form θ = θ0 is called a simple hypothesis
¤  A hypothesis of the form θ > θ0 or θ < θ0 is called a composite hypothesis

¤  Tests
¤  A test of the form H0: θ = θ0 vs. H1: θ ≠ θ0 is called a two-sided test
¤  A test of the form H0: θ ≤ θ0 vs. H1: θ > θ0, or H0: θ ≥ θ0 vs. H1: θ < θ0, is called a one-sided test

¤  P-value
¤  A small p-value means strong evidence against H0

Summary

¤  The Recipe for MLE

¤  Understand MLE estimation of Gaussian parameters

¤  Understand “biased estimator” versus “unbiased estimator”

¤  Understand Confidence Intervals
