Wireless Information Transmission System Lab.
Institute of Communications Engineering
National Sun Yat-sen University
2011 Summer Training Course
ESTIMATION THEORY
Chapter 7: Maximum Likelihood Estimation
Outline
◊ Why use MLE?
◊ How to find the MLE?
◊ Properties of MLE
◊ Numerical Determination of the MLE
Introduction
◊ We now investigate an alternative to the MVU estimator, which is desirable in situations where the MVU estimator does not exist or cannot be found even if it does exist.
◊ This estimator, which is based on the maximum likelihood principle, is overwhelmingly the most popular approach to obtaining practical estimators.
◊ It has the distinct advantage of being a turn-the-crank procedure, allowing it to be implemented for complicated estimation problems.
◊ In general, the MLE has the asymptotic properties of being unbiased, achieving the CRLB, and having a Gaussian PDF.
Why use MLE?
◊ Example 7.1 - DC Level in White Gaussian Noise
◊ Consider the observed data set

x[n] = A + w[n],   n = 0, 1, \ldots, N-1,

where A is an unknown level, which is assumed to be positive (A > 0), and w[n] is WGN with unknown variance A (w[n] \sim \mathcal{N}(0, A)).
◊ The PDF is:

p(\mathbf{x};A) = \frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n]-A)^2\right].   (7.1)
◊ Taking the derivative of the log-likelihood function, we have:

\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n]-A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n]-A)^2.

It is not obvious whether this can be put in the form I(A)(\hat{A} - A), so no efficient estimator is evident.
◊ We can still determine the CRLB for this problem, using

\mathrm{var}(\hat{A}) \geq \frac{1}{-E\left[\partial^2 \ln p(\mathbf{x};A)/\partial A^2\right]},

to find that:

\mathrm{var}(\hat{A}) \geq \frac{A^2}{N(A + 1/2)}.   (7.2)

◊ We next try to find the MVU estimator by resorting to the theory of sufficient statistics.
◊ Sufficient statistics: Theorem 5.1 (p.104), Theorem 5.2 (p.109).
◊ Theorem 5.1 (Neyman-Fisher factorization): T(\mathbf{x}) is a sufficient statistic for θ if and only if the PDF factors as

p(\mathbf{x};\theta) = g(T(\mathbf{x}),\theta)\, h(\mathbf{x}).

◊ Theorem 5.2: If \check{\theta} is an unbiased estimator of θ and T(\mathbf{x}) is a sufficient statistic for θ, then \hat{\theta} = E[\check{\theta} \mid T(\mathbf{x})] is
I. A valid estimator for θ (not dependent on θ)
II. Unbiased
III. Of lesser or equal variance than that of \check{\theta}, for all θ.
Additionally, if the sufficient statistic is complete, then \hat{\theta} is the MVU estimator. In essence, a statistic is complete if there is only one function of the statistic that is unbiased.
◊ First approach: Use Theorem 5.1.
◊ Two steps:
◊ First step: Find T(\mathbf{x}) from the factorization p(\mathbf{x};A) = g(T(\mathbf{x}),A)\, h(\mathbf{x}).
◊ Second step: Find a function g so that \hat{A} = g(T) is an unbiased estimator of A.
◊ First step: Attempting to factor (7.1) into the form of (5.3), we note that

\sum_{n=0}^{N-1}(x[n]-A)^2 = \sum_{n=0}^{N-1}x^2[n] - 2NA\bar{x} + NA^2,

where \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1}x[n], so that the PDF factors as

p(\mathbf{x};A) = \underbrace{\frac{1}{(2\pi A)^{N/2}}\exp\left[-\frac{1}{2A}\left(\sum_{n=0}^{N-1}x^2[n] + NA^2\right)\right]}_{g\left(\sum_{n=0}^{N-1}x^2[n],\, A\right)}\ \underbrace{\exp(N\bar{x})}_{h(\mathbf{x})}.

◊ Based on the Neyman-Fisher factorization theorem, a single sufficient statistic for A is T(\mathbf{x}) = \sum_{n=0}^{N-1}x^2[n].
◊ Second step: Assume that \sum_{n=0}^{N-1}x^2[n] is a complete sufficient statistic. We then need to find a function g such that

E\left[g\left(\sum_{n=0}^{N-1}x^2[n]\right)\right] = A \quad \text{for all } A > 0,

since

E\left[\sum_{n=0}^{N-1}x^2[n]\right] = N E[x^2[n]] = N\left(\mathrm{var}(x[n]) + E^2[x[n]]\right) = N(A + A^2).

◊ It is not obvious how to choose g.
◊ Second approach: Use Theorem 5.2. We would determine the conditional expectation

\hat{A} = E\left[\check{A} \,\middle|\, \sum_{n=0}^{N-1}x^2[n]\right],

where \check{A} is any unbiased estimator.
◊ Example: Let \check{A} = x[0]; then the MVU estimator would take the form

\hat{A} = E\left[x[0] \,\middle|\, \sum_{n=0}^{N-1}x^2[n]\right].

◊ Unfortunately, the evaluation of the conditional expectation appears to be a formidable task.
◊ Example 7.2 - DC Level in White Gaussian Noise (continued)
◊ We propose the following estimator:

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}.   (7.6)

◊ This estimator is biased since

E[\hat{A}] = E\left[-\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}\right] \neq -\frac{1}{2} + \sqrt{E\left[\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right] + \frac{1}{4}} = -\frac{1}{2} + \sqrt{A^2 + A + \frac{1}{4}} = A.
◊ As N → ∞, we have by the law of large numbers

\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] \to E[x^2[n]] = A + A^2,

and therefore from (7.6)

\hat{A} \to -\frac{1}{2} + \sqrt{A + A^2 + \frac{1}{4}} = -\frac{1}{2} + \left(A + \frac{1}{2}\right) = A.

◊ To find the mean and variance of Â as N → ∞, we use the statistical linearization argument described in Section 3.6.
◊ Section 3.6: x̄ is an efficient estimator of the DC level A (x[n] = A + w[n], with w[n] WGN of variance σ²).
◊ It might be supposed that g(x̄) is efficient for g(A).
◊ If we linearize g(x̄) about A, we have the approximation

g(\bar{x}) \approx g(A) + \frac{dg(A)}{dA}(\bar{x} - A).

Then, the estimator is unbiased (to first order):

E[g(\bar{x})] = g(A).

Also, the estimator achieves the CRLB (to first order):

\mathrm{var}(g(\bar{x})) = \left(\frac{dg(A)}{dA}\right)^2 \mathrm{var}(\bar{x}) = \left(\frac{dg(A)}{dA}\right)^2 \frac{\sigma^2}{N}

(for example, for g(A) = A^2 this equals 4A^2\sigma^2/N).
◊ The quantity u = \frac{1}{N}\sum_{n=0}^{N-1}x^2[n] is an estimator of E[x^2[n]] = A^2 + A.
◊ Let g be the function so that Â = g(u), where

g(u) = -\frac{1}{2} + \sqrt{u + \frac{1}{4}},

and therefore

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}.

◊ Linearizing about u_0 = E[u] = A^2 + A, we have

g(u) \approx g(u_0) + \left.\frac{dg(u)}{du}\right|_{u=u_0}(u - u_0),

or, since dg/du = 1/(2\sqrt{u + 1/4}) and \sqrt{u_0 + 1/4} = A + 1/2,

\hat{A} \approx A + \frac{1}{2(A + 1/2)}\left[\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] - (A + A^2)\right].   (7.7)
◊ It now follows that the asymptotic mean is

E[\hat{A}] = A,

so that Â is asymptotically unbiased. Additionally, the asymptotic variance becomes, from (7.7),

\mathrm{var}(\hat{A}) = \frac{1}{(2A+1)^2}\,\mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right) = \frac{1}{N(2A+1)^2}\,\mathrm{var}(x^2[n]).

◊ But var(x²[n]) can be shown to be 4A^3 + 2A^2 (the proof is on the next page), so that

\mathrm{var}(\hat{A}) = \frac{4A^3 + 2A^2}{N(2A+1)^2} = \frac{2A^2(2A+1)}{N(2A+1)^2} = \frac{A^2}{N(A + 1/2)},

which is exactly the CRLB of (7.2) (asymptotically efficient).
◊ Proof: From p.574, if x \sim \mathcal{N}(0, \sigma_x^2), then the moments of x are

E[x^k] = \begin{cases} \sigma_x^k \cdot 1 \cdot 3 \cdots (k-1) & k \text{ even} \\ 0 & k \text{ odd,} \end{cases}

and \mathrm{var}(x^2[n]) = E[x^4[n]] - E^2[x^2[n]].
◊ For Example 7.1, x[n] = A + w[n], n = 0, 1, \ldots, N-1, with w[n] \sim \mathcal{N}(0, A):

E[x^4[n]] = E[(A + w[n])^4] = E[A^4 + 4A^3 w[n] + 6A^2 w^2[n] + 4A w^3[n] + w^4[n]] = A^4 + 0 + 6A^2 \cdot A + 0 + 3A^2 = A^4 + 6A^3 + 3A^2,

and, since E[x^2[n]] = A^2 + A,

\mathrm{var}(x^2[n]) = E[x^4[n]] - E^2[x^2[n]] = A^4 + 6A^3 + 3A^2 - (A^4 + 2A^3 + A^2) = 4A^3 + 2A^2.  #
◊ Summarizing our results:
a. The proposed estimator given by (7.6) is asymptotically unbiased and asymptotically achieves the CRLB. Hence, it is asymptotically efficient.
b. Furthermore, by the central limit theorem the random variable \frac{1}{N}\sum_{n=0}^{N-1}x^2[n] is Gaussian as N → ∞. Because Â is approximately a linear function of this Gaussian random variable for large data records, it too will have a Gaussian PDF. (e.g., if y = ax + b and x is Gaussian, then y is Gaussian.)
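As a quick illustration (our addition, not part of the original slides; the values of A, N, and M are arbitrary choices), the following Python sketch checks points (a) and (b) by Monte Carlo: for large N the estimator (7.6) should have mean near A, variance near the CRLB of (7.2), and nearly Gaussian standardized errors.

```python
import numpy as np

# Monte Carlo check of estimator (7.6): x[n] = A + w[n], w[n] ~ N(0, A).
rng = np.random.default_rng(0)
A, N, M = 1.0, 1000, 20000                             # assumed test values

x = A + rng.normal(0.0, np.sqrt(A), size=(M, N))       # WGN with variance A
A_hat = -0.5 + np.sqrt((x**2).mean(axis=1) + 0.25)     # estimator (7.6)

crlb = A**2 / (N * (A + 0.5))                          # CRLB from (7.2)
z = (A_hat - A) / np.sqrt(crlb)                        # standardized errors
print(f"mean = {A_hat.mean():.4f}, var/CRLB = {A_hat.var() / crlb:.3f}")
print(f"P(|z| < 1) = {(np.abs(z) < 1).mean():.3f} (Gaussian: 0.683)")
print(f"P(|z| < 2) = {(np.abs(z) < 2).mean():.3f} (Gaussian: 0.954)")
```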
7.4 How to find the MLE?
◊ The MLE for a scalar parameter θ is defined to be the value of θ that maximizes p(\mathbf{x};\theta) for \mathbf{x} fixed, i.e., the value that maximizes the likelihood function.
◊ Since p(\mathbf{x};\theta) will also be a function of \mathbf{x}, the maximization produces a \hat{\theta} that is a function of \mathbf{x}.
◊ Example 7.3 - DC Level in White Gaussian Noise
◊ Consider x[n] = A + w[n], n = 0, 1, \ldots, N-1, where w[n] is WGN with unknown variance A.
◊ To actually find the MLE for this problem we first write the PDF from (7.1) as

p(\mathbf{x};A) = \frac{1}{(2\pi A)^{N/2}}\exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n]-A)^2\right].

◊ Differentiating the log-likelihood function, we have

\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n]-A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n]-A)^2.
◊ Setting it equal to zero produces

\hat{A}^2 + \hat{A} - \frac{1}{N}\sum_{n=0}^{N-1}x^2[n] = 0.

◊ We choose the solution

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}

to correspond to the permissible range of A, or A > 0.
◊ Not only does the maximum likelihood procedure yield an estimator that is asymptotically efficient, it also sometimes yields an efficient estimator for finite data records.
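To make the closed form concrete, here is a hedged Python sketch (our addition; the true level, noise realization, and grid range are assumed values) that compares the closed-form MLE with a brute-force grid maximization of the log-likelihood over A > 0; the two should agree.

```python
import numpy as np

# Example 7.3 check: closed-form MLE vs. grid search over the likelihood.
rng = np.random.default_rng(1)
A_true, N = 2.0, 200                                   # assumed test values
x = A_true + rng.normal(0.0, np.sqrt(A_true), size=N)  # WGN with variance A

def log_like(A):
    # ln p(x;A) for a DC level in WGN whose variance also equals A
    return -N / 2 * np.log(2 * np.pi * A) - np.sum((x - A) ** 2) / (2 * A)

A_mle = -0.5 + np.sqrt(np.mean(x**2) + 0.25)           # closed-form MLE
grid = np.linspace(0.1, 5.0, 20000)                    # search range, A > 0
A_grid = grid[np.argmax([log_like(a) for a in grid])]
print(f"closed form: {A_mle:.4f}   grid search: {A_grid:.4f}")
```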
◊ Example 7.4 - DC Level in White Gaussian Noise
◊ For the received data x[n] = A + w[n], n = 0, 1, \ldots, N-1, where A is the unknown level to be estimated and w[n] is WGN with known variance σ², the PDF is

p(\mathbf{x};A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right].

◊ Taking the derivative of the log-likelihood function produces

\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n]-A),

which, being set equal to zero, yields the MLE

\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1}x[n].

◊ This result is true in general: if an efficient estimator exists, the maximum likelihood procedure will produce it.
Proof: Because an efficient estimator exists, we have

\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = I(\theta)\left(g(\mathbf{x}) - \theta\right).

Following the maximum likelihood procedure, setting \partial \ln p(\mathbf{x};\theta)/\partial \theta = 0 gives \hat{\theta} = g(\mathbf{x}), the efficient estimator.
7.5 Properties of the MLE
◊ The example discussed in Section 7.3 led to an estimator that for large data records was unbiased, achieved the CRLB, and had a Gaussian PDF; the MLE was distributed as

\hat{\theta} \overset{a}{\sim} \mathcal{N}\left(\theta, I^{-1}(\theta)\right).   (7.8)

◊ Invariance property (MLE for transformed parameters).
◊ Of course, in practice it is seldom known in advance how large N must be in order for (7.8) to hold.
◊ An analytical expression for the PDF of the MLE is usually impossible to derive. As an alternative means of assessing performance, a computer simulation is usually required.
◊ Example 7.5 - DC Level in White Gaussian Noise
◊ A computer simulation was performed to determine how large the data record had to be for the asymptotic results to apply.
◊ In principle the exact PDF of

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}   (7.6)

could be found, but this would be extremely tedious.
◊ Using the Monte Carlo method, M = 1000 realizations of Â were generated for various data record lengths.
◊ The mean and variance were estimated by the sample mean and sample variance over the M realizations.
◊ Instead of the CRLB of (7.2),

\mathrm{var}(\hat{A}) \geq \frac{A^2}{N(A + 1/2)},   (7.2)

we tabulate the normalized variance N(A + 1/2)\,\mathrm{var}(\hat{A})/A^2, which should approach 1 as N grows.
◊ For a value of A equal to 1 the results are shown in Table 7.1.
◊ To check this, the number of realizations was increased to M = 5000 for a data record length of N = 20. This resulted in the mean and normalized variance shown in parentheses.
◊ Next, the PDF of Â was determined using a Monte Carlo computer simulation. This was done for data record lengths of N = 5 and N = 20 (M = 5000).
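A minimal version of this experiment can be sketched in Python (our addition; it mirrors the setup described above but does not reproduce the exact Table 7.1 entries):

```python
import numpy as np

# Monte Carlo study in the spirit of Example 7.5 (A = 1): the normalized
# variance N*(A + 1/2)*var(A_hat)/A^2 should approach 1 as N grows.
rng = np.random.default_rng(2)
A, M = 1.0, 1000

for N in (5, 20, 100, 1000):
    x = A + rng.normal(0.0, np.sqrt(A), size=(M, N))
    A_hat = -0.5 + np.sqrt((x**2).mean(axis=1) + 0.25)   # estimator (7.6)
    norm_var = N * (A + 0.5) * A_hat.var() / A**2
    print(f"N={N:4d}  mean={A_hat.mean():.4f}  normalized var={norm_var:.3f}")
```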
Theorem 7.1
◊ Theorem 7.1 (Asymptotic Properties of the MLE)
If the PDF p(x; θ) of the data x satisfies some "regularity" conditions, then the MLE of the unknown parameter θ is asymptotically distributed (for large data records) according to

\hat{\theta} \overset{a}{\sim} \mathcal{N}\left(\theta, I^{-1}(\theta)\right),

where I(θ) is the Fisher information,

I(\theta) = -E\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right],

evaluated at the true value of the unknown parameter.
◊ Regularity condition:

E\left[\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right] = 0 \quad \text{for all } \theta.

◊ From the asymptotic distribution, the MLE is seen to be asymptotically unbiased and asymptotically attains the CRLB.
◊ It is therefore asymptotically efficient, and hence asymptotically optimal.
7.5 Properties of the MLE
◊ Example 7.6 - MLE of the Sinusoidal Phase
◊ We wish to estimate the phase φ of a sinusoid embedded in noise, or

x[n] = A\cos(2\pi f_0 n + \phi) + w[n],   n = 0, 1, \ldots, N-1,

where w[n] is WGN with variance σ² and the amplitude A and frequency f₀ are assumed to be known.
◊ We saw in Chapter 5 that no single sufficient statistic exists for this problem.
◊ The sufficient statistics were

T_1(\mathbf{x}) = \sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)
T_2(\mathbf{x}) = \sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n),

and the PDF factors as

p(\mathbf{x};\phi) = \underbrace{\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{n=0}^{N-1}A^2\cos^2(2\pi f_0 n + \phi) - 2A\left(T_1(\mathbf{x})\cos\phi - T_2(\mathbf{x})\sin\phi\right)\right]\right\}}_{g(T_1(\mathbf{x}),\,T_2(\mathbf{x}),\,\phi)}\ \underbrace{\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]\right]}_{h(\mathbf{x})}.
◊ The MLE is found by maximizing p(x; φ),

p(\mathbf{x};\phi) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2\right],

or, equivalently, by minimizing

J(\phi) = \sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2.

◊ Differentiating with respect to φ produces

\frac{\partial J(\phi)}{\partial \phi} = 2\sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right) A\sin(2\pi f_0 n + \phi).
◊ Setting it equal to zero yields

\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n + \hat{\phi}) = A\sum_{n=0}^{N-1}\sin(2\pi f_0 n + \hat{\phi})\cos(2\pi f_0 n + \hat{\phi}).

◊ But the right-hand side may be approximated as zero since, using \sin 2\alpha = 2\sin\alpha\cos\alpha,

\frac{1}{N}\sum_{n=0}^{N-1}\sin(2\pi f_0 n + \hat{\phi})\cos(2\pi f_0 n + \hat{\phi}) = \frac{1}{2N}\sum_{n=0}^{N-1}\sin(4\pi f_0 n + 2\hat{\phi}) \approx 0

for f₀ not near 0 or 1/2. (p.33)
◊ Thus, the left-hand side when divided by N and set equal to zero will produce an approximate MLE, which satisfies

\frac{1}{N}\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n + \hat{\phi}) = 0.

Expanding via \sin(\alpha_1 + \alpha_2) = \sin\alpha_1\cos\alpha_2 + \cos\alpha_1\sin\alpha_2,

\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)\cos\hat{\phi} = -\sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)\sin\hat{\phi},

so that

\hat{\phi} = -\arctan\frac{\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)}{\sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)}.
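The estimator is easy to implement; the Python sketch below (our addition) uses the simulation values quoted later in this section (A = 1, f₀ = 0.08, φ = π/4, σ² = 0.05, N = 80), which should be treated as illustrative assumptions.

```python
import numpy as np

# Approximate MLE of sinusoidal phase (Example 7.6):
# phi_hat = -arctan( sum x[n] sin(2 pi f0 n) / sum x[n] cos(2 pi f0 n) ).
rng = np.random.default_rng(3)
A, f0, phi, sigma2, N = 1.0, 0.08, np.pi / 4, 0.05, 80

n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0.0, np.sqrt(sigma2), N)

num = np.sum(x * np.sin(2 * np.pi * f0 * n))
den = np.sum(x * np.cos(2 * np.pi * f0 * n))
phi_hat = -np.arctan2(num, den)          # arctan2 resolves the quadrant

eta = A**2 / (2 * sigma2)                # SNR
print(f"phi_hat = {phi_hat:.4f} (true {phi:.4f}), "
      f"asymptotic var = {1 / (N * eta):.2e}")
```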
◊ According to Theorem 7.1, the asymptotic PDF of the phase estimator is

\hat{\phi} \overset{a}{\sim} \mathcal{N}\left(\phi, I^{-1}(\phi)\right).

◊ From Example 3.4,

I(\phi) = \frac{NA^2}{2\sigma^2},

so that the asymptotic variance is

\mathrm{var}(\hat{\phi}) = \frac{2\sigma^2}{NA^2} = \frac{1}{N\eta},

where \eta = A^2/(2\sigma^2) is the SNR.
◊ To determine the data record length for the asymptotic mean and variance to apply, we performed a computer simulation using A = 1, f₀ = 0.08, φ = π/4, σ² = 0.05.
(Note: for short data records the simulated variance can fall below the CRLB, which suggests the estimator is biased there.)
◊ We fixed the data record length at N = 80 and varied the SNR.
◊ In Figure 7.4 we have plotted 10\log_{10}\mathrm{var}(\hat{\phi}); from the asymptotic variance,

10\log_{10}\mathrm{var}(\hat{\phi}) = 10\log_{10}\frac{1}{N\eta} = -10\log_{10}N - 10\log_{10}\eta.
◊ The large error estimates are said to be outliers and cause the threshold effect.
◊ Nonlinear estimators nearly always exhibit this effect.
◊ In summary, the asymptotic PDF of the MLE is valid for large enough data records.
◊ For signal-in-noise problems the CRLB may be attained even for short data records if the SNR is high enough.
◊ To see why this is so, the phase estimator of Example 7.6 can be written as follows.
\hat{\phi} = -\arctan\frac{\sum_{n=0}^{N-1}\left(A\cos(2\pi f_0 n + \phi) + w[n]\right)\sin(2\pi f_0 n)}{\sum_{n=0}^{N-1}\left(A\cos(2\pi f_0 n + \phi) + w[n]\right)\cos(2\pi f_0 n)} \approx -\arctan\frac{-\frac{NA}{2}\sin\phi + \sum_{n=0}^{N-1}w[n]\sin(2\pi f_0 n)}{\frac{NA}{2}\cos\phi + \sum_{n=0}^{N-1}w[n]\cos(2\pi f_0 n)},

where we have used \sum\cos(2\pi f_0 n + \phi)\sin(2\pi f_0 n) \approx -\frac{N}{2}\sin\phi and \sum\cos(2\pi f_0 n + \phi)\cos(2\pi f_0 n) \approx \frac{N}{2}\cos\phi, which follow because, as before,

\frac{1}{N}\sum_{n=0}^{N-1}\sin(4\pi f_0 n + 2\hat{\phi}) \approx 0

for f₀ not near 0 or 1/2.
◊ Dividing numerator and denominator by NA/2,

\hat{\phi} \approx \arctan\frac{\sin\phi - \frac{2}{NA}\sum_{n=0}^{N-1}w[n]\sin(2\pi f_0 n)}{\cos\phi + \frac{2}{NA}\sum_{n=0}^{N-1}w[n]\cos(2\pi f_0 n)}.

◊ If the data record is large and/or the sinusoidal power is large, the noise terms are small. It is this condition, under which the estimation error is small, that allows the MLE to attain its asymptotic distribution.
◊ In some cases the asymptotic distribution does not hold, no matter how large the data record and/or the SNR becomes.
◊ Example 7.7 - DC Level in Nonindependent Non-Gaussian Noise
◊ Consider the observations

x[n] = A + w[n],   n = 0, 1, \ldots, N-1.

◊ The noise PDF p_w is symmetric about w[n] = 0 and has a maximum at w[n] = 0. Furthermore, we assume all the noise samples are equal, or w[0] = w[1] = \cdots = w[N-1]. In estimating A we need to consider only a single observation (x[0] = A + w[0]), since all observations are identical.
◊ The MLE of A is the value that maximizes p(x[0];A) = p_{w[0]}(x[0] - A); because the noise PDF is maximized at a zero argument, we get \hat{A} = x[0].
◊ This estimator has the mean E[\hat{A}] = E[x[0]] = A.
◊ The variance of Â is the same as the variance of x[0], or of w[0]:

\mathrm{var}(\hat{A}) = \int u^2 p_w(u)\, du.

◊ The CRLB (Problem 3.2) is

\mathrm{var}(\hat{A}) \geq \left[\int \frac{\left(dp_w(u)/du\right)^2}{p_w(u)}\, du\right]^{-1},

and the two are not in general equal (see Problem 7.16).
◊ So in this example, the estimation error does not decrease as the data record length increases but remains the same.
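To see the gap numerically, here is a small Python sketch (our addition; the Laplacian noise PDF is an assumed example, since the slides do not fix a specific p_w) that evaluates both integrals: for p_w(u) = (1/(2b))exp(-|u|/b) the variance is 2b² while the CRLB integral gives b².

```python
import numpy as np

# Example 7.7 illustration with assumed Laplacian noise, b = 1:
# var(A_hat) = int u^2 p_w(u) du,  CRLB = 1 / int (p_w'(u))^2 / p_w(u) du.
b = 1.0
u = np.linspace(-40, 40, 2_000_001)
du = u[1] - u[0]
p = np.exp(-np.abs(u) / b) / (2 * b)

var = np.sum(u**2 * p) * du              # variance of w[0] (analytically 2 b^2)
dp = np.gradient(p, u)                   # numerical derivative of p_w
crlb = 1.0 / (np.sum(dp**2 / p) * du)    # Fisher integral (analytically b^2)
print(f"var(A_hat) = {var:.3f},  CRLB = {crlb:.3f}")
```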
7.6 MLE for Transformed Parameters
◊ Example 7.8 - Transformed DC Level in WGN
◊ Consider the data

x[n] = A + w[n],   n = 0, 1, \ldots, N-1,

where w[n] is WGN with variance σ².
◊ We wish to find the MLE of \alpha = \exp(A).
◊ The PDF is given as

p(\mathbf{x};A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right].
◊ However, since α is a one-to-one transformation of A, we can equivalently parameterize the PDF as

p_T(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-\ln\alpha)^2\right],   \alpha > 0.

◊ Clearly, p_T(\mathbf{x};\alpha) is the PDF of the data set

x[n] = \ln\alpha + w[n],   n = 0, 1, \ldots, N-1.

◊ Setting the derivative of p_T(\mathbf{x};\alpha) with respect to α equal to zero yields

\ln\hat{\alpha} = \frac{1}{N}\sum_{n=0}^{N-1}x[n] = \bar{x}, \quad \text{or} \quad \hat{\alpha} = \exp(\bar{x}).
◊ But x̄ is just the MLE of A, so that

\hat{\alpha} = \exp(\hat{A}).

◊ The MLE of the transformed parameter is found by substituting the MLE of the original parameter into the transformation.
◊ This property of the MLE is termed the invariance property.
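A quick numerical confirmation (our addition; the level, variance, and grid are assumed values) is to maximize p_T(x;α) on a grid and compare with exp(x̄):

```python
import numpy as np

# Invariance check for Example 7.8 (alpha = exp(A)).
rng = np.random.default_rng(4)
A_true, sigma2, N = 0.7, 1.0, 500
x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)

def log_pT(alpha):
    # log p_T(x; alpha) up to an additive constant
    return -np.sum((x - np.log(alpha)) ** 2) / (2 * sigma2)

alphas = np.linspace(0.5, 4.0, 20000)
alpha_grid = alphas[np.argmax([log_pT(a) for a in alphas])]
print(f"exp(xbar) = {np.exp(x.mean()):.4f}, grid max = {alpha_grid:.4f}")
```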
◊ Example 7.9 - Transformed DC Level in WGN
◊ Consider the transformation \alpha = A^2 for the data set in the previous example.
◊ Attempting to parameterize p(x;A) with respect to α, we find that A = \pm\sqrt{\alpha}, since the transformation is not one-to-one.
◊ If we choose A = \sqrt{\alpha}, then some of the possible PDFs will be missing.
◊ We actually require two sets of PDFs (7.23),

p_{T_1}(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-\sqrt{\alpha}\right)^2\right]   (A \geq 0)

p_{T_2}(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]+\sqrt{\alpha}\right)^2\right]   (A < 0),

to characterize all possible PDFs.
◊ It is possible to find the MLE of α as the value of α that yields the maximum of p_{T_1}(\mathbf{x};\alpha) and p_{T_2}(\mathbf{x};\alpha), or

\hat{\alpha} = \arg\max_{\alpha}\left\{\max\left[p_{T_1}(\mathbf{x};\alpha),\ p_{T_2}(\mathbf{x};\alpha)\right]\right\}.
◊ Alternatively, we can find the maximum in two steps:
◊ For a given value of α, say α₀, determine whether p_{T_1}(\mathbf{x};\alpha_0) or p_{T_2}(\mathbf{x};\alpha_0) is larger. For example, if p_{T_1}(\mathbf{x};\alpha_0) > p_{T_2}(\mathbf{x};\alpha_0), then denote the value p_{T_1}(\mathbf{x};\alpha_0) as \bar{p}_T(\mathbf{x};\alpha_0). Repeat for all α₀ to form \bar{p}_T(\mathbf{x};\alpha).
◊ The MLE is given as the α that maximizes \bar{p}_T(\mathbf{x};\alpha) over α.
Construction of modified likelihood function
◊ The function \bar{p}_T(\mathbf{x};\alpha) can be thought of as a modified likelihood function, having been derived from the original likelihood function by choosing, for each α, the value of A that yields the maximum of the likelihood for that given α.
◊ The MLE is:

\hat{\alpha} = \arg\max_{\alpha} \max\left[p_{T_1}(\mathbf{x};\alpha),\ p_{T_2}(\mathbf{x};\alpha)\right]
            = \arg\max_{\alpha} \max\left[p(\mathbf{x};\sqrt{\alpha}),\ p(\mathbf{x};-\sqrt{\alpha})\right]
            = \left(\arg\max_{A}\, p(\mathbf{x};A)\right)^2
            = \hat{A}^2.
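The two-step construction can be mimicked numerically; in this hedged Python sketch (our addition; all values are assumed), the maximizer of the modified likelihood coincides with x̄², as the chain of equalities above predicts:

```python
import numpy as np

# Modified likelihood for Example 7.9 (alpha = A^2, not one-to-one):
# pbar_T(x;alpha) = max[ p(x; +sqrt(alpha)), p(x; -sqrt(alpha)) ].
rng = np.random.default_rng(5)
A_true, sigma2, N = -0.8, 1.0, 300       # note the negative true level
x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)

def log_p(A):
    return -np.sum((x - A) ** 2) / (2 * sigma2)   # log p(x;A) + const

alphas = np.linspace(1e-6, 3.0, 20000)
log_pbar = [max(log_p(np.sqrt(a)), log_p(-np.sqrt(a))) for a in alphas]
alpha_hat = alphas[int(np.argmax(log_pbar))]
print(f"xbar^2 = {x.mean()**2:.4f}, modified-likelihood max = {alpha_hat:.4f}")
```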
Theorem 7.2
◊ Theorem 7.2 (Invariance Property of the MLE)
◊ The MLE of the parameter α = g(θ), where the PDF p(x;θ) is parameterized by θ, is given by

\hat{\alpha} = g(\hat{\theta}),

where \hat{\theta} is the MLE of θ. The MLE of θ is obtained by maximizing p(x;θ).
◊ If g is not a one-to-one function, then \hat{\alpha} maximizes the modified likelihood function \bar{p}_T(\mathbf{x};\alpha), defined as

\bar{p}_T(\mathbf{x};\alpha) = \max_{\{\theta:\ \alpha = g(\theta)\}} p(\mathbf{x};\theta).
7.6 MLE for Transformed Parameters
◊ Example 7.10 - Power of WGN in dB
◊ We observe N samples of WGN with variance σ² whose power in dB is to be estimated.
◊ To do so we first find the MLE of σ². Then, we use the invariance principle to find the power P in dB, which is defined as

P = 10\log_{10}\sigma^2.

◊ The PDF is given by

p(\mathbf{x};\sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]\right].
◊ Differentiating the log-likelihood function

\ln p(\mathbf{x};\sigma^2) = -\frac{N}{2}\ln 2\pi - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]

produces

\frac{\partial \ln p(\mathbf{x};\sigma^2)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=0}^{N-1}x^2[n].

◊ Setting it equal to zero yields the MLE

\hat{\sigma}^2 = \frac{1}{N}\sum_{n=0}^{N-1}x^2[n].

◊ The MLE of the power in dB readily follows as

\hat{P} = 10\log_{10}\hat{\sigma}^2 = 10\log_{10}\left(\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right).
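In code, the estimate is a one-liner via the invariance property; the following Python sketch (our addition; σ² and N are assumed values) illustrates it:

```python
import numpy as np

# MLE of WGN power in dB (Example 7.10): sigma2_hat = mean(x^2),
# P_hat = 10 log10(sigma2_hat) by the invariance property.
rng = np.random.default_rng(6)
sigma2, N = 2.0, 1000
x = rng.normal(0.0, np.sqrt(sigma2), N)

sigma2_hat = np.mean(x**2)               # MLE of sigma^2
P_hat = 10 * np.log10(sigma2_hat)        # MLE of the power in dB
print(f"sigma2_hat = {sigma2_hat:.4f}, P_hat = {P_hat:.3f} dB "
      f"(true {10 * np.log10(sigma2):.3f} dB)")
```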
7.7 Numerical Determination of the MLE
◊ A distinct advantage of the MLE is that we can always find it for a given data set numerically.
◊ The safest way to find the MLE is a grid search; as long as the spacing between grid points is small enough, we are guaranteed to find the MLE.
◊ If, however, the range is not confined to a finite interval, then a grid search may not be computationally feasible.
◊ We then use iterative maximization procedures: the Newton-Raphson method, the scoring approach, and the expectation-maximization (EM) algorithm.
◊ These methods will produce the MLE if the initial guess is close to the true maximum. If not, convergence may not be attained, or convergence may be only to a local maximum.
The Newton-Raphson method
◊ The MLE is found by setting the derivative of the log-likelihood function equal to zero:

g(\theta) = \frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = 0.

◊ This is a nonlinear equation and cannot be solved directly.
◊ Consider an initial guess θ₀ and linearize g about it:

g(\theta) \approx g(\theta_0) + \left.\frac{dg(\theta)}{d\theta}\right|_{\theta=\theta_0}(\theta - \theta_0).

Setting the right-hand side equal to zero and solving for θ gives

\theta_1 = \theta_0 - \frac{g(\theta_0)}{\left.dg(\theta)/d\theta\right|_{\theta=\theta_0}},

and, in general,

\theta_{k+1} = \theta_k - \frac{g(\theta_k)}{\left.dg(\theta)/d\theta\right|_{\theta=\theta_k}}.
◊ Note that at convergence, \theta_{k+1} = \theta_k, and we get g(\theta_k) = 0, as desired.
◊ In terms of the log-likelihood, the Newton-Raphson iteration is

\theta_{k+1} = \theta_k - \left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right]^{-1}\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_k}.
◊ The iteration may not converge when the second derivative of the log-likelihood function is small; the correction term may then fluctuate wildly.
◊ Even if the iteration converges, the point found may not be the global maximum, but only a local maximum or even a local minimum.
◊ Generally, if the initial point is close to the global maximum, the iteration will converge to it. A minimal implementation is sketched below.
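A generic version of the iteration can be sketched in a few lines of Python (our addition; the derivative functions and the toy data are assumptions supplied by the caller):

```python
import numpy as np

# Newton-Raphson ascent on a log-likelihood:
# theta_{k+1} = theta_k - g(theta_k) / g'(theta_k), with g = d ln p / d theta.
def newton_raphson(dloglike, d2loglike, theta0, tol=1e-8, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        step = dloglike(theta) / d2loglike(theta)
        theta -= step
        if abs(step) < tol:              # converged: g(theta) ~ 0
            break
    return theta

# Toy usage: for x[n] = A + w[n] with WGN of variance 1,
# g(A) = sum(x - A) and g'(A) = -N, so the iteration lands on the
# sample mean (the MLE) in one step.
x = np.array([0.9, 1.3, 1.1, 0.7])
A_hat = newton_raphson(lambda A: np.sum(x - A), lambda A: -len(x), theta0=0.0)
print(A_hat, x.mean())
```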
The Method of Scoring
◊ A second common iterative procedure is the method of scoring. It recognizes that the second derivative may be replaced by its expected value, the (negative) Fisher information:

-\left.\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right|_{\theta=\theta_k} \approx I(\theta_k).

◊ Proof: For IID observations,

-\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} = -\sum_{n=0}^{N-1}\frac{\partial^2 \ln p(x[n];\theta)}{\partial \theta^2} = N \cdot \frac{1}{N}\sum_{n=0}^{N-1}\left(-\frac{\partial^2 \ln p(x[n];\theta)}{\partial \theta^2}\right) \approx N E\left[-\frac{\partial^2 \ln p(x[n];\theta)}{\partial \theta^2}\right] = N i(\theta) = I(\theta),

by the law of large numbers.
◊ So we get the scoring iteration

\theta_{k+1} = \theta_k + I^{-1}(\theta_k)\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_k}.
◊ Example 7.11 - Exponential in WGN
◊ Consider the data

x[n] = r^n + w[n],   n = 0, 1, \ldots, N-1,

where w[n] is WGN with variance σ²; the parameter r, the exponential factor, is to be estimated.
◊ The PDF is

p(\mathbf{x};r) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-r^n\right)^2\right],

so we want to minimize:

J(r) = \sum_{n=0}^{N-1}\left(x[n]-r^n\right)^2.

◊ Differentiating and setting it equal to zero:

\sum_{n=0}^{N-1}\left(x[n]-r^n\right)n r^{n-1} = 0.
◊ Applying the Newton-Raphson iteration method:

\frac{\partial \ln p(\mathbf{x};r)}{\partial r} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-r^n\right)n r^{n-1}

\frac{\partial^2 \ln p(\mathbf{x};r)}{\partial r^2} = \frac{1}{\sigma^2}\left[\sum_{n=0}^{N-1}n(n-1)x[n]r^{n-2} - \sum_{n=0}^{N-1}n(2n-1)r^{2n-2}\right] = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}n r^{n-2}\left[(n-1)x[n] - (2n-1)r^n\right].
◊ So we get, from

\theta_{k+1} = \theta_k - \left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right]^{-1}\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_k},

the iteration

r_{k+1} = r_k - \frac{\sum_{n=0}^{N-1}\left(x[n]-r_k^n\right)n r_k^{n-1}}{\sum_{n=0}^{N-1}n r_k^{n-2}\left[(n-1)x[n] - (2n-1)r_k^n\right]}.

(Note that σ² cancels.)
◊ Applying the method of scoring: since E[x[n]] = r^n,

I(r) = E\left[-\frac{\partial^2 \ln p(\mathbf{x};r)}{\partial r^2}\right] = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1}n r^{n-2}\left[(n-1)r^n - (2n-1)r^n\right] = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}n^2 r^{2n-2},

so that

r_{k+1} = r_k + \frac{\sum_{n=0}^{N-1}\left(x[n]-r_k^n\right)n r_k^{n-1}}{\sum_{n=0}^{N-1}n^2 r_k^{2n-2}},

where again σ² cancels.
Computer Simulation
◊ Consider the least squares error criterion J(r).
◊ Using N = 50 and r = 0.5, we apply the Newton-Raphson iteration with several initial guesses (0.8, 0.2, and 1.2); a minimal simulation is sketched below.
◊ For 0.2 and 0.8 the iteration quickly converged to the true maximum. However, for r₀ = 1.2 the convergence was much slower, requiring 29 iterations.
◊ If the initial guess was less than 0.18 or greater than 1.2, the succeeding iterates exceeded 1 and kept increasing; the Newton-Raphson iteration fails to converge.
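The following Python sketch (our addition, not the book's code; σ² and the noise realization are assumptions, so the exact iteration counts will differ from those quoted above) implements both the Newton-Raphson and the scoring updates for this example:

```python
import numpy as np

# Example 7.11: x[n] = r^n + w[n], N = 50, r = 0.5 (sigma^2 assumed).
rng = np.random.default_rng(7)
N, r_true, sigma2 = 50, 0.5, 0.01
n = np.arange(N)
x = r_true**n + rng.normal(0.0, np.sqrt(sigma2), N)

def iterate(r0, use_scoring, max_iter=100):
    r = r0
    for _ in range(max_iter):
        g = np.sum((x - r**n) * n * r**(n - 1.0))        # score (sigma^2 cancels)
        if use_scoring:
            denom = -np.sum(n**2 * r**(2.0 * n - 2.0))   # -I(r)
        else:                                            # second derivative
            denom = np.sum(n * r**(n - 2.0)
                           * ((n - 1) * x - (2 * n - 1) * r**n))
        step = g / denom
        r -= step
        if not np.isfinite(r) or abs(step) < 1e-10:      # diverged or converged
            break
    return r

for r0 in (0.2, 0.8, 1.2):
    print(f"r0={r0}: Newton-Raphson -> {iterate(r0, False):.4f}, "
          f"scoring -> {iterate(r0, True):.4f}")
```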
Conclusion
◊ If the PDF is known, the MLE can be used.
◊ With MLE, the unknown parameter is estimated by

\hat{\theta} = \arg\max_{\theta}\, p(\mathbf{x};\theta),

where \mathbf{x} is the vector of observed data (N samples); that is, we find the \hat{\theta} that maximizes the likelihood.
◊ Asymptotically unbiased:

\lim_{N\to\infty} E[\hat{\theta}] = \theta.

◊ Asymptotically efficient:

\lim_{N\to\infty} \mathrm{var}(\hat{\theta}) = \mathrm{CRLB}.