Parameter Estimation: Maximum Likelihood Estimation
Chapter 3 (Duda et al.) – Sections 3.1-3.2
CS479/679 Pattern Recognition, Dr. George Bebis


Page 1:

Parameter Estimation: Maximum Likelihood Estimation

Chapter 3 (Duda et al.) – Sections 3.1-3.2

CS479/679 Pattern Recognition
Dr. George Bebis

Page 2:

Parameter Estimation

• Bayesian Decision Theory allows us to design an optimal classifier, given that we know P(ωi) and p(x/ωi):

• Estimating P(ωi) is usually not difficult.

• Estimating p(x/ωi) is more difficult:
– The number of samples is often too small.
– The dimensionality of the feature space is large.

P(ωj/x) = p(x/ωj) P(ωj) / p(x)
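As a small concrete illustration of the Bayes rule above (not from the slides: the two classes, priors, and Gaussian parameters below are made-up values), the posterior P(ωj/x) can be computed by normalizing the products p(x/ωj)P(ωj):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x/w_j) with mean mu and std sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical two-class problem: priors P(w_j) and class-conditional parameters.
priors = [0.6, 0.4]
params = [(0.0, 1.0), (2.0, 1.0)]   # (mu_j, sigma_j) for each class

def posteriors(x):
    """P(w_j/x) = p(x/w_j) P(w_j) / p(x) for every class j."""
    joint = [gaussian_pdf(x, mu, s) * P for (mu, s), P in zip(params, priors)]
    evidence = sum(joint)            # p(x), the normalizing factor
    return [j / evidence for j in joint]
```

Because p(x) normalizes the numerators, the posteriors always sum to one, and the class whose mean is closer to x wins here.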

Page 3:

Parameter Estimation (cont’d)

• Assumptions:
– A set of training samples D = {x1, x2, ..., xn}, where the samples were drawn according to p(x|ωj).
– p(x|ωj) has some known parametric form:

e.g., p(x/ωj) ~ N(μj, Σj), also denoted as p(x/θj) where θj = (μj, Σj)

• Parameter estimation problem: Given D, find the best possible θj.

Page 4:

Main Methods in Parameter Estimation

• Maximum Likelihood (ML)

• Bayesian Estimation (BE)

Page 5:

Main Methods in Parameter Estimation

• Maximum Likelihood (ML)
– Assumes that the values of the parameters are fixed but unknown.
– The best estimate is obtained by maximizing the probability of obtaining the samples x1, x2, ..., xn actually observed (i.e., the training data):

p(x1, x2, ..., xn/θ) = p(D/θ)

Page 6:

Main Methods in Parameter Estimation (cont’d)

• Bayesian Estimation (BE)
– Assumes that the parameters θ are random variables that have some known a-priori distribution p(θ).
– Estimates a distribution rather than making point estimates like ML:

p(x/D) = ∫ p(x/θ) p(θ/D) dθ

Note: the BE solution might not be of the parametric form assumed!
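A minimal numeric sketch of this integral, assuming a unit-variance Gaussian with unknown mean and approximating the integral by a sum over a grid of candidate θ values (the data, grid, and flat prior are made up for illustration):

```python
import math

def gauss(x, mu, var=1.0):
    """Unit-variance-by-default Gaussian density p(x/theta)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

D = [1.8, 2.2, 2.0, 1.9]                        # training samples (illustrative)
thetas = [i * 0.01 for i in range(-500, 501)]   # grid of candidate means

# Posterior over the grid: p(theta/D) ∝ p(D/theta) p(theta), with a flat prior here.
weights = [math.prod(gauss(xk, th) for xk in D) for th in thetas]
Z = sum(weights)
posterior = [w / Z for w in weights]

def predictive(x):
    """p(x/D) ≈ Σ_theta p(x/theta) p(theta/D): a weighted mixture of densities."""
    return sum(gauss(x, th) * w for th, w in zip(thetas, posterior))
```

The predictive density is a weighted average over all candidate θ, which illustrates the note above: in general the BE result is a mixture and need not be of the assumed parametric form.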

Page 7:

ML Estimation - Assumptions

• Let us assume c classes and that the training data consist of c sets (i.e., one for each class):

D1, D2, ..., Dc

• The samples in Dj have been drawn independently according to p(x/ωj).

• p(x/ωj) has a known parametric form with parameters θj:

e.g., θj = (μj, Σj) for a Gaussian distribution

Page 8:

ML Estimation - Problem Formulation and Solution

• Problem: given D1, D2, ..., Dc and a model for each class, estimate the parameters θ1, θ2, ..., θc.

• If the samples in Dj give no information about θi (i ≠ j), we need to solve c independent problems (i.e., one for each class).

• The ML estimate θ̂ for D = {x1, x2, ..., xn} is the value of θ that maximizes p(D/θ) (i.e., best supports the training data):

p(D/θ) = p(x1, x2, ..., xn/θ) = ∏_{k=1}^{n} p(x_k/θ)   (using the independence assumption)

Page 9:

ML Estimation - Problem Formulation and Solution (cont’d)

• How should we find the maximum of p(D/θ)?

∇_θ p(D/θ) = 0

where ∇_θ = [∂/∂θ_1, ∂/∂θ_2, ..., ∂/∂θ_p]^t is the gradient with respect to the parameters.

Page 10:

ML Estimation Using Log-Likelihood

• Consider the log-likelihood for simplicity:

p(D/θ) = p(x1, x2, ..., xn/θ) = ∏_{k=1}^{n} p(x_k/θ)

ln p(D/θ) = ∑_{k=1}^{n} ln p(x_k/θ)

• The solution θ̂ maximizes ln p(D/θ):

θ̂ = arg max_θ ln p(D/θ)

∇_θ ln p(D/θ) = 0   or   ∑_{k=1}^{n} ∇_θ ln p(x_k/θ) = 0
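A quick numeric check (with made-up samples) that the log of the likelihood product equals the sum of per-sample log-likelihoods, which is why the log form is more convenient, and numerically safer, to maximize:

```python
import math

samples = [0.5, 1.5, 1.0, 2.0]   # illustrative training data
mu = 1.0                          # candidate parameter (unit-variance Gaussian)

def pdf(x, mu):
    """Unit-variance Gaussian density p(x/mu)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

likelihood = math.prod(pdf(x, mu) for x in samples)           # p(D/mu)
log_likelihood = sum(math.log(pdf(x, mu)) for x in samples)   # ln p(D/mu)
# math.log(likelihood) agrees with log_likelihood up to rounding,
# but the sum avoids floating-point underflow when n is large.
```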

Page 11:

ML Estimation Using Log-Likelihood (cont’d)

[Figure: p(D/θ) and ln p(D/θ) plotted as functions of θ for training data with unknown mean and known variance; both curves peak at the same value θ̂ = μ̂.]

Page 12:

ML for Multivariate Gaussian Density: Case of Unknown θ=μ

• Consider p(x/μ) ~ N(μ, Σ), with Σ known:

ln p(x_k/μ) = −(1/2)(x_k − μ)^t Σ⁻¹ (x_k − μ) − (d/2) ln 2π − (1/2) ln |Σ|

• Computing the gradient, we have:

∇_μ ln p(D/μ) = ∑_{k=1}^{n} ∇_μ ln p(x_k/μ) = ∑_{k=1}^{n} Σ⁻¹ (x_k − μ)

Page 13:

• Setting ∇_μ ln p(D/μ) = 0, we have:

∑_{k=1}^{n} Σ⁻¹ (x_k − μ) = 0   or   ∑_{k=1}^{n} x_k − nμ = 0

• The solution μ̂ is given by:

μ̂ = (1/n) ∑_{k=1}^{n} x_k

The ML estimate is simply the “sample mean”.
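A small sanity check with made-up data: maximizing the unit-variance Gaussian log-likelihood over a grid of candidate μ values recovers the sample mean, matching the closed-form solution above:

```python
import math

samples = [0.5, 1.5, 1.0, 2.0]   # illustrative data; the sample mean is 1.25

def log_likelihood(mu):
    # Terms that do not depend on mu are dropped; they do not move the maximum.
    return sum(-0.5 * (x - mu) ** 2 for x in samples)

grid = [i * 0.01 for i in range(-200, 401)]   # candidate means in [-2, 4]
mu_hat = max(grid, key=log_likelihood)
sample_mean = sum(samples) / len(samples)
# mu_hat matches sample_mean up to the grid resolution (0.01)
```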

Page 14:

Special Case of ML: Maximum A-Posteriori Estimator (MAP)

• Assume that θ is a random variable with known p(θ).

• Consider:

p(θ/D) = p(D/θ) p(θ) / p(D)

• Maximize p(θ/D), or equivalently p(D/θ)p(θ), or ln p(D/θ)p(θ):

p(D/θ) p(θ) = ∏_{k=1}^{n} p(x_k/θ) p(θ)   or   ln [p(D/θ) p(θ)] = ∑_{k=1}^{n} ln p(x_k/θ) + ln p(θ)

Page 15:

Special Case of ML: Maximum A-Posteriori Estimator (MAP) (cont’d)

• What happens when p(θ) is uniform? Then ln p(θ) is the same constant for every θ, so maximizing

∑_{k=1}^{n} ln p(x_k/θ) + ln p(θ)

is the same as maximizing

∑_{k=1}^{n} ln p(x_k/θ)

MAP is equivalent to ML.

Page 16:

MAP for Multivariate Gaussian Density: Case of Unknown θ=μ

• Assume:

p(x/μ) ~ N(μ, Diag(σ_μ²))

p(μ) ~ N(μ₀, Diag(σ_{μ0}²))   where μ₀ and σ_{μ0} are known

• MAP maximizes ln p(D/μ)p(μ):

maximize ∑_{k=1}^{n} ln p(x_k/μ) + ln p(μ)

∇_μ ( ∑_{k=1}^{n} ln p(x_k/μ) + ln p(μ) ) = 0

Page 17:

MAP for Multivariate Gaussian Density: Case of Unknown θ=μ (cont’d)

• Setting the gradient to zero gives (componentwise):

∑_{k=1}^{n} (1/σ_μ²)(x_k − μ̂) − (1/σ_{μ0}²)(μ̂ − μ₀) = 0

or

μ̂ = (n σ_{μ0}² x̄ + σ_μ² μ₀) / (n σ_{μ0}² + σ_μ²),   where x̄ = (1/n) ∑_{k=1}^{n} x_k

• If σ_{μ0}²/σ_μ² ≫ 1 (a very broad prior), then μ̂ ≈ (1/n) ∑_{k=1}^{n} x_k, the ML estimate.

• What happens when σ_{μ0} = 0? Then μ̂ = μ₀: the prior dominates and the data are ignored.
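A hedged sketch of the univariate version of this estimate, using the closed form obtained by solving the gradient equation for μ (all numbers below are illustrative):

```python
# Univariate MAP estimate of a Gaussian mean with a Gaussian prior N(mu0, s02)
# and known data variance s2; data and parameter values are made up.
samples = [2.1, 1.9, 2.3, 2.0, 1.7]
s2 = 1.0                # data variance (sigma_mu^2)
mu0, s02 = 0.0, 0.25    # prior mean and prior variance (sigma_mu0^2)

n = len(samples)
xbar = sum(samples) / n

# Solving (1/s2) * sum(x_k - mu) - (1/s02) * (mu - mu0) = 0 for mu:
mu_map = (n * s02 * xbar + s2 * mu0) / (n * s02 + s2)

# A very broad prior (s02 -> infinity) recovers the ML estimate xbar;
# a degenerate prior (s02 -> 0) would force mu_map -> mu0.
mu_map_broad = (n * 1e12 * xbar + s2 * mu0) / (n * 1e12 + s2)
```

The MAP estimate lands between the prior mean μ₀ and the sample mean x̄, weighted by the two variances.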

Page 18:

ML for Univariate Gaussian Density: Case of Unknown θ=(μ,σ²)

• Assume p(x/θ) ~ N(μ, σ²), where θ = (θ₁, θ₂) = (μ, σ²):

ln p(x_k/θ) = −(1/2) ln 2πθ₂ − (1/(2θ₂)) (x_k − θ₁)²

or, equivalently,

ln p(x_k/θ) = −(1/2) ln 2πσ² − (1/(2σ²)) (x_k − μ)²

Page 19:

ML for Univariate Gaussian Density: Case of Unknown θ=(μ,σ²) (cont’d)

• Setting ∑_{k=1}^{n} ∇_θ ln p(x_k/θ) = 0, the solutions are given by:

μ̂ = (1/n) ∑_{k=1}^{n} x_k   (sample mean)

σ̂² = (1/n) ∑_{k=1}^{n} (x_k − μ̂)²   (sample variance)
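The two closed-form solutions can be computed directly; a short illustration with made-up data:

```python
samples = [4.0, 6.0, 5.0, 7.0, 3.0]   # illustrative data
n = len(samples)

mu_hat = sum(samples) / n                              # sample mean: (1/n) sum x_k
var_hat = sum((x - mu_hat) ** 2 for x in samples) / n  # sample variance, note the 1/n
```

Note the 1/n factor in the variance: as the later slides on bias show, this ML estimate is biased, and dividing by n−1 instead gives the unbiased version.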

Page 20:

ML for Multivariate Gaussian Density: Case of Unknown θ=(μ,Σ)

• In the general case (i.e., multivariate Gaussian), the solutions are:

μ̂ = (1/n) ∑_{k=1}^{n} x_k   (sample mean)

Σ̂ = (1/n) ∑_{k=1}^{n} (x_k − μ̂)(x_k − μ̂)^t   (sample covariance)

Page 21:

Biased and Unbiased Estimates

• An estimate θ̂ is unbiased when

E[θ̂] = θ

where θ is the true value.

• The ML estimate μ̂ is unbiased, i.e.,

E[μ̂] = μ

• The ML estimates σ̂² and Σ̂ are biased:

E[σ̂²] = ((n−1)/n) σ²   and   E[Σ̂] = ((n−1)/n) Σ

Page 22:

Biased and Unbiased Estimates (cont’d)

• The following are unbiased estimates of σ² and Σ:

(1/(n−1)) ∑_{k=1}^{n} (x_k − μ̂)²

(1/(n−1)) ∑_{k=1}^{n} (x_k − μ̂)(x_k − μ̂)^t
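The bias can be verified exactly with a tiny enumeration (a toy three-point distribution chosen so the arithmetic is exact, not from the slides): listing every equally likely size-n sample and averaging the two estimators shows E[σ̂²] = ((n−1)/n)σ², while the 1/(n−1) estimator is unbiased.

```python
from fractions import Fraction as F
from itertools import product

values = [F(0), F(1), F(2)]                                # uniform toy distribution
true_mean = sum(values) / 3                                # = 1
true_var = sum((v - true_mean) ** 2 for v in values) / 3   # = 2/3
n = 2

E_biased = F(0)    # expectation of (1/n) * sum (x_k - mu_hat)^2
E_unbiased = F(0)  # expectation of (1/(n-1)) * sum (x_k - mu_hat)^2
for sample in product(values, repeat=n):   # all 3^n equally likely samples
    m = sum(sample) / n                    # sample mean mu_hat
    ss = sum((x - m) ** 2 for x in sample)
    E_biased += (ss / n) / 3 ** n
    E_unbiased += (ss / (n - 1)) / 3 ** n

# E_biased == ((n-1)/n) * true_var, while E_unbiased == true_var exactly.
```

Exact rational arithmetic (`Fraction`) makes the equalities hold exactly rather than up to floating-point error.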

Page 23:

Comments about ML

• ML estimation is usually simpler than alternative methods.

• It provides more accurate estimates as the number of training samples increases.

• If the model chosen for p(x/θ) is correct and the independence assumptions among the samples hold, ML will give very good results.