Wireless Information Transmission System Lab.
Institute of Communications Engineering
National Sun Yat-sen University
2011 Summer Training Course
ESTIMATION THEORY
Chapter 7: Maximum Likelihood Estimation
Outline
◊ Why use MLE?
◊ How to find the MLE?
◊ Properties of MLE
◊ Numerical Determination of the MLE
Introduction
◊ We now investigate an alternative to the MVU estimator, which is desirable in situations where the MVU estimator does not exist or cannot be found even if it does exist.
◊ This estimator, which is based on the maximum likelihood principle, is overwhelmingly the most popular approach to obtaining practical estimators.
◊ It has the distinct advantage of being a turn-the-crank procedure, allowing it to be implemented for complicated estimation problems.
◊ In general, the MLE has the asymptotic properties of being unbiased, achieving the CRLB, and having a Gaussian PDF.
Why use MLE?
◊ Example 7.1 - DC Level in White Gaussian Noise
◊ Consider the observed data set

x[n] = A + w[n],   n = 0, 1, \ldots, N-1,

where A is an unknown level, which is assumed to be positive (A > 0), and w[n] is WGN with unknown variance A (w[n] \sim \mathcal{N}(0, A)).
◊ The PDF is:

p(\mathbf{x};A) = \frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n]-A)^2\right].   (7.1)
◊ Taking the derivative of the log-likelihood function, we have:

\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n]-A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n]-A)^2.

It is not obvious whether this can be put in the form I(A)(\hat{A} - A), so no efficient estimator is evident.
◊ We can still determine the CRLB for this problem, using

\mathrm{var}(\hat{A}) \geq \frac{1}{-E\left[\partial^2 \ln p(\mathbf{x};A)/\partial A^2\right]},

to find that:

\mathrm{var}(\hat{A}) \geq \frac{A^2}{N(A + 1/2)}.   (7.2)

◊ We next try to find the MVU estimator by resorting to the theory of sufficient statistics.
◊ Sufficient statistics: Theorem 5.1 (p.104), Theorem 5.2 (p.109).
◊ Theorem 5.1 (Neyman-Fisher factorization): T(\mathbf{x}) is a sufficient statistic for θ if and only if the PDF factors as

p(\mathbf{x};\theta) = g(T(\mathbf{x}),\theta)\, h(\mathbf{x}).

◊ Theorem 5.2: If \check{\theta} is an unbiased estimator of θ and T(\mathbf{x}) is a sufficient statistic for θ, then \hat{\theta} = E[\check{\theta} \mid T(\mathbf{x})] is
I. A valid estimator for θ (not dependent on θ)
II. Unbiased
III. Of lesser or equal variance than that of \check{\theta}, for all θ.
Additionally, if the sufficient statistic is complete, then \hat{\theta} is the MVU estimator. In essence, a statistic is complete if there is only one function of the statistic that is unbiased.
◊ First approach: Use Theorem 5.1.
◊ Two steps:
◊ First step: Find T(\mathbf{x}) from the factorization p(\mathbf{x};A) = g(T(\mathbf{x}),A)\, h(\mathbf{x}).
◊ Second step: Find a function g so that \hat{A} = g(T) is an unbiased estimator of A.
◊ First step: Attempting to factor (7.1) into the form of (5.3), we note that

\sum_{n=0}^{N-1}(x[n]-A)^2 = \sum_{n=0}^{N-1}x^2[n] - 2NA\bar{x} + NA^2,

where \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1}x[n], so that the PDF factors as

p(\mathbf{x};A) = \underbrace{\frac{1}{(2\pi A)^{N/2}}\exp\left[-\frac{1}{2A}\left(\sum_{n=0}^{N-1}x^2[n] + NA^2\right)\right]}_{g\left(\sum_{n=0}^{N-1}x^2[n],\, A\right)}\ \underbrace{\exp(N\bar{x})}_{h(\mathbf{x})}.

◊ Based on the Neyman-Fisher factorization theorem, a single sufficient statistic for A is T(\mathbf{x}) = \sum_{n=0}^{N-1}x^2[n].
◊ Second step: Assume that \sum_{n=0}^{N-1}x^2[n] is a complete sufficient statistic. We then need to find a function g such that

E\left[g\left(\sum_{n=0}^{N-1}x^2[n]\right)\right] = A \quad \text{for all } A > 0,

since

E\left[\sum_{n=0}^{N-1}x^2[n]\right] = N E[x^2[n]] = N\left(\mathrm{var}(x[n]) + E^2[x[n]]\right) = N(A + A^2).

◊ It is not obvious how to choose g.
◊ Second approach: Use Theorem 5.2. We would determine the conditional expectation

\hat{A} = E\left[\check{A} \,\middle|\, \sum_{n=0}^{N-1}x^2[n]\right],

where \check{A} is any unbiased estimator.
◊ Example: Let \check{A} = x[0]; then the MVU estimator would take the form

\hat{A} = E\left[x[0] \,\middle|\, \sum_{n=0}^{N-1}x^2[n]\right].

◊ Unfortunately, the evaluation of the conditional expectation appears to be a formidable task.
◊ Example 7.2 - DC Level in White Gaussian Noise (continued)
◊ We propose the following estimator:

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}.   (7.6)

◊ This estimator is biased since

E[\hat{A}] = E\left[-\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}\right] \neq -\frac{1}{2} + \sqrt{E\left[\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right] + \frac{1}{4}} = -\frac{1}{2} + \sqrt{A^2 + A + \frac{1}{4}} = A.
◊ As N → ∞, we have by the law of large numbers

\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] \to E[x^2[n]] = A + A^2,

and therefore from (7.6)

\hat{A} \to -\frac{1}{2} + \sqrt{A + A^2 + \frac{1}{4}} = -\frac{1}{2} + \left(A + \frac{1}{2}\right) = A.

◊ To find the mean and variance of Â as N → ∞, we use the statistical linearization argument described in Section 3.6.
◊ Section 3.6: x̄ is an efficient estimator of the DC level A (x[n] = A + w[n], with w[n] WGN of variance σ²).
◊ It might be supposed that g(x̄) is efficient for g(A).
◊ If we linearize g(x̄) about A, we have the approximation

g(\bar{x}) \approx g(A) + \frac{dg(A)}{dA}(\bar{x} - A).

Then, the estimator is unbiased (to first order):

E[g(\bar{x})] = g(A).

Also, the estimator achieves the CRLB (to first order):

\mathrm{var}(g(\bar{x})) = \left(\frac{dg(A)}{dA}\right)^2 \mathrm{var}(\bar{x}) = \left(\frac{dg(A)}{dA}\right)^2 \frac{\sigma^2}{N}

(for example, for g(A) = A^2 this equals 4A^2\sigma^2/N).
◊ The quantity u = \frac{1}{N}\sum_{n=0}^{N-1}x^2[n] is an estimator of E[x^2[n]] = A^2 + A.
◊ Let g be the function so that Â = g(u), where

g(u) = -\frac{1}{2} + \sqrt{u + \frac{1}{4}},

and therefore

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}.

◊ Linearizing about u_0 = E[u] = A^2 + A, we have

g(u) \approx g(u_0) + \left.\frac{dg(u)}{du}\right|_{u=u_0}(u - u_0),

or, since dg/du = 1/(2\sqrt{u + 1/4}) and \sqrt{u_0 + 1/4} = A + 1/2,

\hat{A} \approx A + \frac{1}{2(A + 1/2)}\left[\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] - (A + A^2)\right].   (7.7)
◊ It now follows that the asymptotic mean is

E[\hat{A}] = A,

so that Â is asymptotically unbiased. Additionally, the asymptotic variance becomes, from (7.7),

\mathrm{var}(\hat{A}) = \frac{1}{(2A+1)^2}\,\mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right) = \frac{1}{N(2A+1)^2}\,\mathrm{var}(x^2[n]).

◊ But var(x²[n]) can be shown to be 4A^3 + 2A^2 (the proof is on the next page), so that

\mathrm{var}(\hat{A}) = \frac{4A^3 + 2A^2}{N(2A+1)^2} = \frac{2A^2(2A+1)}{N(2A+1)^2} = \frac{A^2}{N(A + 1/2)},

which is exactly the CRLB of (7.2) (asymptotically efficient).
◊ Proof: From p.574, if x \sim \mathcal{N}(0, \sigma_x^2), then the moments of x are

E[x^k] = \begin{cases} \sigma_x^k \cdot 1 \cdot 3 \cdots (k-1) & k \text{ even} \\ 0 & k \text{ odd,} \end{cases}

and \mathrm{var}(x^2[n]) = E[x^4[n]] - E^2[x^2[n]].
◊ For Example 7.1, x[n] = A + w[n], n = 0, 1, \ldots, N-1, with w[n] \sim \mathcal{N}(0, A):

E[x^4[n]] = E[(A + w[n])^4] = E[A^4 + 4A^3 w[n] + 6A^2 w^2[n] + 4A w^3[n] + w^4[n]] = A^4 + 0 + 6A^2 \cdot A + 0 + 3A^2 = A^4 + 6A^3 + 3A^2,

and, since E[x^2[n]] = A^2 + A,

\mathrm{var}(x^2[n]) = E[x^4[n]] - E^2[x^2[n]] = A^4 + 6A^3 + 3A^2 - (A^4 + 2A^3 + A^2) = 4A^3 + 2A^2.  #
◊ Summarizing our results:
a. The proposed estimator given by (7.6) is asymptotically unbiased and asymptotically achieves the CRLB. Hence, it is asymptotically efficient.
b. Furthermore, by the central limit theorem the random variable \frac{1}{N}\sum_{n=0}^{N-1}x^2[n] is Gaussian as N → ∞. Because Â is approximately a linear function of this Gaussian random variable for large data records, it too will have a Gaussian PDF. (e.g., if y = ax + b and x is Gaussian, then y is Gaussian.)
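As a quick illustration (our addition, not part of the original slides; the values of A, N, and M are arbitrary choices), the following Python sketch checks points (a) and (b) by Monte Carlo: for large N the estimator (7.6) should have mean near A, variance near the CRLB of (7.2), and nearly Gaussian standardized errors.

```python
import numpy as np

# Monte Carlo check of estimator (7.6): x[n] = A + w[n], w[n] ~ N(0, A).
rng = np.random.default_rng(0)
A, N, M = 1.0, 1000, 20000                             # assumed test values

x = A + rng.normal(0.0, np.sqrt(A), size=(M, N))       # WGN with variance A
A_hat = -0.5 + np.sqrt((x**2).mean(axis=1) + 0.25)     # estimator (7.6)

crlb = A**2 / (N * (A + 0.5))                          # CRLB from (7.2)
z = (A_hat - A) / np.sqrt(crlb)                        # standardized errors
print(f"mean = {A_hat.mean():.4f}, var/CRLB = {A_hat.var() / crlb:.3f}")
print(f"P(|z| < 1) = {(np.abs(z) < 1).mean():.3f} (Gaussian: 0.683)")
print(f"P(|z| < 2) = {(np.abs(z) < 2).mean():.3f} (Gaussian: 0.954)")
```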
7.4 How to find the MLE?
◊ The MLE for a scalar parameter θ is defined to be the value of θ that maximizes p(\mathbf{x};\theta) for \mathbf{x} fixed, i.e., the value that maximizes the likelihood function.
◊ Since p(\mathbf{x};\theta) will also be a function of \mathbf{x}, the maximization produces a \hat{\theta} that is a function of \mathbf{x}.
◊ Example 7.3 - DC Level in White Gaussian Noise
◊ Consider x[n] = A + w[n], n = 0, 1, \ldots, N-1, where w[n] is WGN with unknown variance A.
◊ To actually find the MLE for this problem we first write the PDF from (7.1) as

p(\mathbf{x};A) = \frac{1}{(2\pi A)^{N/2}}\exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n]-A)^2\right].

◊ Differentiating the log-likelihood function, we have

\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n]-A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n]-A)^2.
◊ Setting it equal to zero produces

\hat{A}^2 + \hat{A} - \frac{1}{N}\sum_{n=0}^{N-1}x^2[n] = 0.

◊ We choose the solution

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}

to correspond to the permissible range of A, or A > 0.
◊ Not only does the maximum likelihood procedure yield an estimator that is asymptotically efficient, it also sometimes yields an efficient estimator for finite data records.
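To make the closed form concrete, here is a hedged Python sketch (our addition; the true level, noise realization, and grid range are assumed values) that compares the closed-form MLE with a brute-force grid maximization of the log-likelihood over A > 0; the two should agree.

```python
import numpy as np

# Example 7.3 check: closed-form MLE vs. grid search over the likelihood.
rng = np.random.default_rng(1)
A_true, N = 2.0, 200                                   # assumed test values
x = A_true + rng.normal(0.0, np.sqrt(A_true), size=N)  # WGN with variance A

def log_like(A):
    # ln p(x;A) for a DC level in WGN whose variance also equals A
    return -N / 2 * np.log(2 * np.pi * A) - np.sum((x - A) ** 2) / (2 * A)

A_mle = -0.5 + np.sqrt(np.mean(x**2) + 0.25)           # closed-form MLE
grid = np.linspace(0.1, 5.0, 20000)                    # search range, A > 0
A_grid = grid[np.argmax([log_like(a) for a in grid])]
print(f"closed form: {A_mle:.4f}   grid search: {A_grid:.4f}")
```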
◊ Example 7.4 - DC Level in White Gaussian Noise
◊ For the received data x[n] = A + w[n], n = 0, 1, \ldots, N-1, where A is the unknown level to be estimated and w[n] is WGN with known variance σ², the PDF is

p(\mathbf{x};A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right].

◊ Taking the derivative of the log-likelihood function produces

\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n]-A),

which, being set equal to zero, yields the MLE

\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1}x[n].

◊ This result is true in general: if an efficient estimator exists, the maximum likelihood procedure will produce it.
Proof: Because an efficient estimator exists, we have

\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = I(\theta)\left(g(\mathbf{x}) - \theta\right).

Following the maximum likelihood procedure, setting \partial \ln p(\mathbf{x};\theta)/\partial \theta = 0 gives \hat{\theta} = g(\mathbf{x}), the efficient estimator.
7.5 Properties of the MLE
◊ The example discussed in Section 7.3 led to an estimator that for large data records was unbiased, achieved the CRLB, and had a Gaussian PDF; the MLE was distributed as

\hat{\theta} \overset{a}{\sim} \mathcal{N}\left(\theta, I^{-1}(\theta)\right).   (7.8)

◊ Invariance property (MLE for transformed parameters).
◊ Of course, in practice it is seldom known in advance how large N must be in order for (7.8) to hold.
◊ An analytical expression for the PDF of the MLE is usually impossible to derive. As an alternative means of assessing performance, a computer simulation is usually required.
◊ Example 7.5 - DC Level in White Gaussian Noise
◊ A computer simulation was performed to determine how large the data record had to be for the asymptotic results to apply.
◊ In principle the exact PDF of

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}   (7.6)

could be found, but this would be extremely tedious.
◊ Using the Monte Carlo method, M = 1000 realizations of Â were generated for various data record lengths.
◊ The mean and variance were estimated by the sample mean and sample variance over the M realizations.
◊ Instead of the CRLB of (7.2),

\mathrm{var}(\hat{A}) \geq \frac{A^2}{N(A + 1/2)},   (7.2)

we tabulate the normalized variance N(A + 1/2)\,\mathrm{var}(\hat{A})/A^2, which should approach 1 as N grows.
◊ For a value of A equal to 1 the results are shown in Table 7.1.
◊ To check this, the number of realizations was increased to M = 5000 for a data record length of N = 20. This resulted in the mean and normalized variance shown in parentheses.
◊ Next, the PDF of Â was determined using a Monte Carlo computer simulation. This was done for data record lengths of N = 5 and N = 20 (M = 5000).
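A minimal version of this experiment can be sketched in Python (our addition; it mirrors the setup described above but does not reproduce the exact Table 7.1 entries):

```python
import numpy as np

# Monte Carlo study in the spirit of Example 7.5 (A = 1): the normalized
# variance N*(A + 1/2)*var(A_hat)/A^2 should approach 1 as N grows.
rng = np.random.default_rng(2)
A, M = 1.0, 1000

for N in (5, 20, 100, 1000):
    x = A + rng.normal(0.0, np.sqrt(A), size=(M, N))
    A_hat = -0.5 + np.sqrt((x**2).mean(axis=1) + 0.25)   # estimator (7.6)
    norm_var = N * (A + 0.5) * A_hat.var() / A**2
    print(f"N={N:4d}  mean={A_hat.mean():.4f}  normalized var={norm_var:.3f}")
```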
Theorem 7.1
◊ Theorem 7.1 (Asymptotic Properties of the MLE)
If the PDF p(x; θ) of the data x satisfies some "regularity" conditions, then the MLE of the unknown parameter θ is asymptotically distributed (for large data records) according to

\hat{\theta} \overset{a}{\sim} \mathcal{N}\left(\theta, I^{-1}(\theta)\right),

where I(θ) is the Fisher information,

I(\theta) = -E\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right],

evaluated at the true value of the unknown parameter.
◊ Regularity condition:

E\left[\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right] = 0 \quad \text{for all } \theta.

◊ From the asymptotic distribution, the MLE is seen to be asymptotically unbiased and asymptotically attains the CRLB.
◊ It is therefore asymptotically efficient, and hence asymptotically optimal.
7.5 Properties of the MLE
◊ Example 7.6 - MLE of the Sinusoidal Phase
◊ We wish to estimate the phase φ of a sinusoid embedded in noise, or

x[n] = A\cos(2\pi f_0 n + \phi) + w[n],   n = 0, 1, \ldots, N-1,

where w[n] is WGN with variance σ² and the amplitude A and frequency f₀ are assumed to be known.
◊ We saw in Chapter 5 that no single sufficient statistic exists for this problem.
◊ The sufficient statistics were

T_1(\mathbf{x}) = \sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)
T_2(\mathbf{x}) = \sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n),

and the PDF factors as

p(\mathbf{x};\phi) = \underbrace{\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{n=0}^{N-1}A^2\cos^2(2\pi f_0 n + \phi) - 2A\left(T_1(\mathbf{x})\cos\phi - T_2(\mathbf{x})\sin\phi\right)\right]\right\}}_{g(T_1(\mathbf{x}),\,T_2(\mathbf{x}),\,\phi)}\ \underbrace{\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]\right]}_{h(\mathbf{x})}.
◊ The MLE is found by maximizing p(x; φ),

p(\mathbf{x};\phi) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2\right],

or, equivalently, by minimizing

J(\phi) = \sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2.

◊ Differentiating with respect to φ produces

\frac{\partial J(\phi)}{\partial \phi} = 2\sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right) A\sin(2\pi f_0 n + \phi).
◊ Setting it equal to zero yields

\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n + \hat{\phi}) = A\sum_{n=0}^{N-1}\sin(2\pi f_0 n + \hat{\phi})\cos(2\pi f_0 n + \hat{\phi}).

◊ But the right-hand side may be approximated as zero since, using \sin 2\alpha = 2\sin\alpha\cos\alpha,

\frac{1}{N}\sum_{n=0}^{N-1}\sin(2\pi f_0 n + \hat{\phi})\cos(2\pi f_0 n + \hat{\phi}) = \frac{1}{2N}\sum_{n=0}^{N-1}\sin(4\pi f_0 n + 2\hat{\phi}) \approx 0

for f₀ not near 0 or 1/2. (p.33)
◊ Thus, the left-hand side when divided by N and set equal to zero will produce an approximate MLE, which satisfies

\frac{1}{N}\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n + \hat{\phi}) = 0.

Expanding via \sin(\alpha_1 + \alpha_2) = \sin\alpha_1\cos\alpha_2 + \cos\alpha_1\sin\alpha_2,

\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)\cos\hat{\phi} = -\sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)\sin\hat{\phi},

so that

\hat{\phi} = -\arctan\frac{\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)}{\sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)}.
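The estimator is easy to implement; the Python sketch below (our addition) uses the simulation values quoted later in this section (A = 1, f₀ = 0.08, φ = π/4, σ² = 0.05, N = 80), which should be treated as illustrative assumptions.

```python
import numpy as np

# Approximate MLE of sinusoidal phase (Example 7.6):
# phi_hat = -arctan( sum x[n] sin(2 pi f0 n) / sum x[n] cos(2 pi f0 n) ).
rng = np.random.default_rng(3)
A, f0, phi, sigma2, N = 1.0, 0.08, np.pi / 4, 0.05, 80

n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0.0, np.sqrt(sigma2), N)

num = np.sum(x * np.sin(2 * np.pi * f0 * n))
den = np.sum(x * np.cos(2 * np.pi * f0 * n))
phi_hat = -np.arctan2(num, den)          # arctan2 resolves the quadrant

eta = A**2 / (2 * sigma2)                # SNR
print(f"phi_hat = {phi_hat:.4f} (true {phi:.4f}), "
      f"asymptotic var = {1 / (N * eta):.2e}")
```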
◊ According to Theorem 7.1, the asymptotic PDF of the phase estimator is

\hat{\phi} \overset{a}{\sim} \mathcal{N}\left(\phi, I^{-1}(\phi)\right).

◊ From Example 3.4,

I(\phi) = \frac{NA^2}{2\sigma^2},

so that the asymptotic variance is

\mathrm{var}(\hat{\phi}) = \frac{2\sigma^2}{NA^2} = \frac{1}{N\eta},

where \eta = A^2/(2\sigma^2) is the SNR.
◊ To determine the data record length for the asymptotic mean and variance to apply, we performed a computer simulation using A = 1, f₀ = 0.08, φ = π/4, σ² = 0.05.
(Note: for short data records the simulated variance can fall below the CRLB, which suggests the estimator is biased there.)
◊ We fixed the data record length at N = 80 and varied the SNR.
◊ In Figure 7.4 we have plotted 10\log_{10}\mathrm{var}(\hat{\phi}); from the asymptotic variance,

10\log_{10}\mathrm{var}(\hat{\phi}) = 10\log_{10}\frac{1}{N\eta} = -10\log_{10}N - 10\log_{10}\eta.
◊ The large error estimates are said to be outliers and cause the threshold effect.
◊ Nonlinear estimators nearly always exhibit this effect.
◊ In summary, the asymptotic PDF of the MLE is valid for large enough data records.
◊ For signal-in-noise problems the CRLB may be attained even for short data records if the SNR is high enough.
◊ To see why this is so, the phase estimator of Example 7.6 can be written as follows.
\hat{\phi} = -\arctan\frac{\sum_{n=0}^{N-1}\left(A\cos(2\pi f_0 n + \phi) + w[n]\right)\sin(2\pi f_0 n)}{\sum_{n=0}^{N-1}\left(A\cos(2\pi f_0 n + \phi) + w[n]\right)\cos(2\pi f_0 n)} \approx -\arctan\frac{-\frac{NA}{2}\sin\phi + \sum_{n=0}^{N-1}w[n]\sin(2\pi f_0 n)}{\frac{NA}{2}\cos\phi + \sum_{n=0}^{N-1}w[n]\cos(2\pi f_0 n)},

where we have used \sum\cos(2\pi f_0 n + \phi)\sin(2\pi f_0 n) \approx -\frac{N}{2}\sin\phi and \sum\cos(2\pi f_0 n + \phi)\cos(2\pi f_0 n) \approx \frac{N}{2}\cos\phi, which follow because, as before,

\frac{1}{N}\sum_{n=0}^{N-1}\sin(4\pi f_0 n + 2\hat{\phi}) \approx 0

for f₀ not near 0 or 1/2.
◊ Dividing numerator and denominator by NA/2,

\hat{\phi} \approx \arctan\frac{\sin\phi - \frac{2}{NA}\sum_{n=0}^{N-1}w[n]\sin(2\pi f_0 n)}{\cos\phi + \frac{2}{NA}\sum_{n=0}^{N-1}w[n]\cos(2\pi f_0 n)}.

◊ If the data record is large and/or the sinusoidal power is large, the noise terms are small. It is this condition, under which the estimation error is small, that allows the MLE to attain its asymptotic distribution.
◊ In some cases the asymptotic distribution does not hold, no matter how large the data record and/or the SNR becomes.
◊ Example 7.7 - DC Level in Nonindependent Non-Gaussian Noise
◊ Consider the observations

x[n] = A + w[n],   n = 0, 1, \ldots, N-1.

◊ The noise PDF p_w is symmetric about w[n] = 0 and has a maximum at w[n] = 0. Furthermore, we assume all the noise samples are equal, or w[0] = w[1] = \cdots = w[N-1]. In estimating A we need to consider only a single observation (x[0] = A + w[0]), since all observations are identical.
◊ The MLE of A is the value that maximizes p(x[0];A) = p_{w[0]}(x[0] - A); because the noise PDF is maximized at a zero argument, we get \hat{A} = x[0].
◊ This estimator has the mean E[\hat{A}] = E[x[0]] = A.
◊ The variance of Â is the same as the variance of x[0], or of w[0]:

\mathrm{var}(\hat{A}) = \int u^2 p_w(u)\, du.

◊ The CRLB (Problem 3.2) is

\mathrm{var}(\hat{A}) \geq \left[\int \frac{\left(dp_w(u)/du\right)^2}{p_w(u)}\, du\right]^{-1},

and the two are not in general equal (see Problem 7.16).
◊ So in this example, the estimation error does not decrease as the data record length increases but remains the same.
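To see the gap numerically, here is a small Python sketch (our addition; the Laplacian noise PDF is an assumed example, since the slides do not fix a specific p_w) that evaluates both integrals: for p_w(u) = (1/(2b))exp(-|u|/b) the variance is 2b² while the CRLB integral gives b².

```python
import numpy as np

# Example 7.7 illustration with assumed Laplacian noise, b = 1:
# var(A_hat) = int u^2 p_w(u) du,  CRLB = 1 / int (p_w'(u))^2 / p_w(u) du.
b = 1.0
u = np.linspace(-40, 40, 2_000_001)
du = u[1] - u[0]
p = np.exp(-np.abs(u) / b) / (2 * b)

var = np.sum(u**2 * p) * du              # variance of w[0] (analytically 2 b^2)
dp = np.gradient(p, u)                   # numerical derivative of p_w
crlb = 1.0 / (np.sum(dp**2 / p) * du)    # Fisher integral (analytically b^2)
print(f"var(A_hat) = {var:.3f},  CRLB = {crlb:.3f}")
```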
7.6 MLE for Transformed Parameters
◊ Example 7.8 - Transformed DC Level in WGN
◊ Consider the data

x[n] = A + w[n],   n = 0, 1, \ldots, N-1,

where w[n] is WGN with variance σ².
◊ We wish to find the MLE of \alpha = \exp(A).
◊ The PDF is given as

p(\mathbf{x};A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right].
◊ However, since α is a one-to-one transformation of A, we can equivalently parameterize the PDF as

p_T(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-\ln\alpha)^2\right],   \alpha > 0.

◊ Clearly, p_T(\mathbf{x};\alpha) is the PDF of the data set

x[n] = \ln\alpha + w[n],   n = 0, 1, \ldots, N-1.

◊ Setting the derivative of p_T(\mathbf{x};\alpha) with respect to α equal to zero yields

\ln\hat{\alpha} = \frac{1}{N}\sum_{n=0}^{N-1}x[n] = \bar{x}, \quad \text{or} \quad \hat{\alpha} = \exp(\bar{x}).
◊ But x̄ is just the MLE of A, so that

\hat{\alpha} = \exp(\hat{A}).

◊ The MLE of the transformed parameter is found by substituting the MLE of the original parameter into the transformation.
◊ This property of the MLE is termed the invariance property.
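A quick numerical confirmation (our addition; the level, variance, and grid are assumed values) is to maximize p_T(x;α) on a grid and compare with exp(x̄):

```python
import numpy as np

# Invariance check for Example 7.8 (alpha = exp(A)).
rng = np.random.default_rng(4)
A_true, sigma2, N = 0.7, 1.0, 500
x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)

def log_pT(alpha):
    # log p_T(x; alpha) up to an additive constant
    return -np.sum((x - np.log(alpha)) ** 2) / (2 * sigma2)

alphas = np.linspace(0.5, 4.0, 20000)
alpha_grid = alphas[np.argmax([log_pT(a) for a in alphas])]
print(f"exp(xbar) = {np.exp(x.mean()):.4f}, grid max = {alpha_grid:.4f}")
```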
◊ Example 7.9 - Transformed DC Level in WGN
◊ Consider the transformation \alpha = A^2 for the data set in the previous example.
◊ Attempting to parameterize p(x;A) with respect to α, we find that A = \pm\sqrt{\alpha}, since the transformation is not one-to-one.
◊ If we choose A = \sqrt{\alpha}, then some of the possible PDFs will be missing.
◊ We actually require two sets of PDFs (7.23),

p_{T_1}(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-\sqrt{\alpha}\right)^2\right]   (A \geq 0)

p_{T_2}(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]+\sqrt{\alpha}\right)^2\right]   (A < 0),

to characterize all possible PDFs.
◊ It is possible to find the MLE of α as the value of α that yields the maximum of p_{T_1}(\mathbf{x};\alpha) and p_{T_2}(\mathbf{x};\alpha), or

\hat{\alpha} = \arg\max_{\alpha}\left\{\max\left[p_{T_1}(\mathbf{x};\alpha),\ p_{T_2}(\mathbf{x};\alpha)\right]\right\}.
◊ Alternatively, we can find the maximum in two steps:
◊ For a given value of α, say α₀, determine whether p_{T_1}(\mathbf{x};\alpha_0) or p_{T_2}(\mathbf{x};\alpha_0) is larger. For example, if p_{T_1}(\mathbf{x};\alpha_0) > p_{T_2}(\mathbf{x};\alpha_0), then denote the value p_{T_1}(\mathbf{x};\alpha_0) as \bar{p}_T(\mathbf{x};\alpha_0). Repeat for all α₀ to form \bar{p}_T(\mathbf{x};\alpha).
◊ The MLE is given as the α that maximizes \bar{p}_T(\mathbf{x};\alpha) over α.
Construction of modified likelihood function
◊ The function \bar{p}_T(\mathbf{x};\alpha) can be thought of as a modified likelihood function, having been derived from the original likelihood function by choosing, for each α, the value of A that yields the maximum of the likelihood for that given α.
◊ The MLE is:

\hat{\alpha} = \arg\max_{\alpha} \max\left[p_{T_1}(\mathbf{x};\alpha),\ p_{T_2}(\mathbf{x};\alpha)\right]
            = \arg\max_{\alpha} \max\left[p(\mathbf{x};\sqrt{\alpha}),\ p(\mathbf{x};-\sqrt{\alpha})\right]
            = \left(\arg\max_{A}\, p(\mathbf{x};A)\right)^2
            = \hat{A}^2.
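The two-step construction can be mimicked numerically; in this hedged Python sketch (our addition; all values are assumed), the maximizer of the modified likelihood coincides with x̄², as the chain of equalities above predicts:

```python
import numpy as np

# Modified likelihood for Example 7.9 (alpha = A^2, not one-to-one):
# pbar_T(x;alpha) = max[ p(x; +sqrt(alpha)), p(x; -sqrt(alpha)) ].
rng = np.random.default_rng(5)
A_true, sigma2, N = -0.8, 1.0, 300       # note the negative true level
x = A_true + rng.normal(0.0, np.sqrt(sigma2), N)

def log_p(A):
    return -np.sum((x - A) ** 2) / (2 * sigma2)   # log p(x;A) + const

alphas = np.linspace(1e-6, 3.0, 20000)
log_pbar = [max(log_p(np.sqrt(a)), log_p(-np.sqrt(a))) for a in alphas]
alpha_hat = alphas[int(np.argmax(log_pbar))]
print(f"xbar^2 = {x.mean()**2:.4f}, modified-likelihood max = {alpha_hat:.4f}")
```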
Theorem 7.2
◊ Theorem 7.2 (Invariance Property of the MLE)
◊ The MLE of the parameter α = g(θ), where the PDF p(x;θ) is parameterized by θ, is given by

\hat{\alpha} = g(\hat{\theta}),

where \hat{\theta} is the MLE of θ. The MLE of θ is obtained by maximizing p(x;θ).
◊ If g is not a one-to-one function, then \hat{\alpha} maximizes the modified likelihood function \bar{p}_T(\mathbf{x};\alpha), defined as

\bar{p}_T(\mathbf{x};\alpha) = \max_{\{\theta:\ \alpha = g(\theta)\}} p(\mathbf{x};\theta).
7.6 MLE for Transformed Parameters
◊ Example 7.10 - Power of WGN in dB
◊ We observe N samples of WGN with variance σ² whose power in dB is to be estimated.
◊ To do so we first find the MLE of σ². Then, we use the invariance principle to find the power P in dB, which is defined as

P = 10\log_{10}\sigma^2.

◊ The PDF is given by

p(\mathbf{x};\sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]\right].
◊ Differentiating the log-likelihood function

\ln p(\mathbf{x};\sigma^2) = -\frac{N}{2}\ln 2\pi - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]

produces

\frac{\partial \ln p(\mathbf{x};\sigma^2)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=0}^{N-1}x^2[n].

◊ Setting it equal to zero yields the MLE

\hat{\sigma}^2 = \frac{1}{N}\sum_{n=0}^{N-1}x^2[n].

◊ The MLE of the power in dB readily follows as

\hat{P} = 10\log_{10}\hat{\sigma}^2 = 10\log_{10}\left(\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right).
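In code, the estimate is a one-liner via the invariance property; the following Python sketch (our addition; σ² and N are assumed values) illustrates it:

```python
import numpy as np

# MLE of WGN power in dB (Example 7.10): sigma2_hat = mean(x^2),
# P_hat = 10 log10(sigma2_hat) by the invariance property.
rng = np.random.default_rng(6)
sigma2, N = 2.0, 1000
x = rng.normal(0.0, np.sqrt(sigma2), N)

sigma2_hat = np.mean(x**2)               # MLE of sigma^2
P_hat = 10 * np.log10(sigma2_hat)        # MLE of the power in dB
print(f"sigma2_hat = {sigma2_hat:.4f}, P_hat = {P_hat:.3f} dB "
      f"(true {10 * np.log10(sigma2):.3f} dB)")
```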
7.7 Numerical Determination of the MLE
◊ A distinct advantage of the MLE is that we can always find it for a given data set numerically.
◊ The safest way to find the MLE is a grid search; as long as the spacing between grid points is small enough, we are guaranteed to find the MLE.
◊ If, however, the range is not confined to a finite interval, then a grid search may not be computationally feasible.
◊ We then use iterative maximization procedures: the Newton-Raphson method, the scoring approach, and the expectation-maximization (EM) algorithm.
◊ These methods will produce the MLE if the initial guess is close to the true maximum. If not, convergence may not be attained, or convergence may be only to a local maximum.
The Newton-Raphson method
◊ The MLE is found by setting the derivative of the log-likelihood function equal to zero:

g(\theta) = \frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = 0.

◊ This is a nonlinear equation and cannot be solved directly.
◊ Consider an initial guess θ₀ and linearize g about it:

g(\theta) \approx g(\theta_0) + \left.\frac{dg(\theta)}{d\theta}\right|_{\theta=\theta_0}(\theta - \theta_0).

Setting the right-hand side equal to zero and solving for θ gives

\theta_1 = \theta_0 - \frac{g(\theta_0)}{\left.dg(\theta)/d\theta\right|_{\theta=\theta_0}},

and, in general,

\theta_{k+1} = \theta_k - \frac{g(\theta_k)}{\left.dg(\theta)/d\theta\right|_{\theta=\theta_k}}.
◊ Note that at convergence, \theta_{k+1} = \theta_k, and we get g(\theta_k) = 0, as desired.
◊ In terms of the log-likelihood, the Newton-Raphson iteration is

\theta_{k+1} = \theta_k - \left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right]^{-1}\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_k}.
◊ The iteration may not converge when the second derivative of the log-likelihood function is small; the correction term may then fluctuate wildly.
◊ Even if the iteration converges, the point found may not be the global maximum, but only a local maximum or even a local minimum.
◊ Generally, if the initial point is close to the global maximum, the iteration will converge to it. A minimal implementation is sketched below.
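A generic version of the iteration can be sketched in a few lines of Python (our addition; the derivative functions and the toy data are assumptions supplied by the caller):

```python
import numpy as np

# Newton-Raphson ascent on a log-likelihood:
# theta_{k+1} = theta_k - g(theta_k) / g'(theta_k), with g = d ln p / d theta.
def newton_raphson(dloglike, d2loglike, theta0, tol=1e-8, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        step = dloglike(theta) / d2loglike(theta)
        theta -= step
        if abs(step) < tol:              # converged: g(theta) ~ 0
            break
    return theta

# Toy usage: for x[n] = A + w[n] with WGN of variance 1,
# g(A) = sum(x - A) and g'(A) = -N, so the iteration lands on the
# sample mean (the MLE) in one step.
x = np.array([0.9, 1.3, 1.1, 0.7])
A_hat = newton_raphson(lambda A: np.sum(x - A), lambda A: -len(x), theta0=0.0)
print(A_hat, x.mean())
```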
The Method of Scoring
◊ A second common iterative procedure is the method of scoring. It recognizes that the second derivative may be replaced by its expected value, the (negative) Fisher information:

-\left.\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right|_{\theta=\theta_k} \approx I(\theta_k).

◊ Proof: For IID observations,

-\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} = -\sum_{n=0}^{N-1}\frac{\partial^2 \ln p(x[n];\theta)}{\partial \theta^2} = N \cdot \frac{1}{N}\sum_{n=0}^{N-1}\left(-\frac{\partial^2 \ln p(x[n];\theta)}{\partial \theta^2}\right) \approx N E\left[-\frac{\partial^2 \ln p(x[n];\theta)}{\partial \theta^2}\right] = N i(\theta) = I(\theta),

by the law of large numbers.
◊ So we get the scoring iteration

\theta_{k+1} = \theta_k + I^{-1}(\theta_k)\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_k}.
◊ Example 7.11 - Exponential in WGN
◊ Consider the data

x[n] = r^n + w[n],   n = 0, 1, \ldots, N-1,

where w[n] is WGN with variance σ²; the parameter r, the exponential factor, is to be estimated.
◊ The PDF is

p(\mathbf{x};r) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-r^n\right)^2\right],

so we want to minimize:

J(r) = \sum_{n=0}^{N-1}\left(x[n]-r^n\right)^2.

◊ Differentiating and setting it equal to zero:

\sum_{n=0}^{N-1}\left(x[n]-r^n\right)n r^{n-1} = 0.
◊ Applying the Newton-Raphson iteration method:

\frac{\partial \ln p(\mathbf{x};r)}{\partial r} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-r^n\right)n r^{n-1}

\frac{\partial^2 \ln p(\mathbf{x};r)}{\partial r^2} = \frac{1}{\sigma^2}\left[\sum_{n=0}^{N-1}n(n-1)x[n]r^{n-2} - \sum_{n=0}^{N-1}n(2n-1)r^{2n-2}\right] = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}n r^{n-2}\left[(n-1)x[n] - (2n-1)r^n\right].
◊ So we get, from

\theta_{k+1} = \theta_k - \left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right]^{-1}\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_k},

the iteration

r_{k+1} = r_k - \frac{\sum_{n=0}^{N-1}\left(x[n]-r_k^n\right)n r_k^{n-1}}{\sum_{n=0}^{N-1}n r_k^{n-2}\left[(n-1)x[n] - (2n-1)r_k^n\right]}.

(Note that σ² cancels.)
◊ Applying the method of scoring: since E[x[n]] = r^n,

I(r) = E\left[-\frac{\partial^2 \ln p(\mathbf{x};r)}{\partial r^2}\right] = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1}n r^{n-2}\left[(n-1)r^n - (2n-1)r^n\right] = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}n^2 r^{2n-2},

so that

r_{k+1} = r_k + \frac{\sum_{n=0}^{N-1}\left(x[n]-r_k^n\right)n r_k^{n-1}}{\sum_{n=0}^{N-1}n^2 r_k^{2n-2}},

where again σ² cancels.
Computer Simulation
◊ Consider the least squares error criterion J(r).
◊ Using N = 50 and r = 0.5, we apply the Newton-Raphson iteration with several initial guesses (0.8, 0.2, and 1.2); a minimal simulation is sketched below.
◊ For 0.2 and 0.8 the iteration quickly converged to the true maximum. However, for r₀ = 1.2 the convergence was much slower, requiring 29 iterations.
◊ If the initial guess was less than 0.18 or greater than 1.2, the succeeding iterates exceeded 1 and kept increasing; the Newton-Raphson iteration fails to converge.
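The following Python sketch (our addition, not the book's code; σ² and the noise realization are assumptions, so the exact iteration counts will differ from those quoted above) implements both the Newton-Raphson and the scoring updates for this example:

```python
import numpy as np

# Example 7.11: x[n] = r^n + w[n], N = 50, r = 0.5 (sigma^2 assumed).
rng = np.random.default_rng(7)
N, r_true, sigma2 = 50, 0.5, 0.01
n = np.arange(N)
x = r_true**n + rng.normal(0.0, np.sqrt(sigma2), N)

def iterate(r0, use_scoring, max_iter=100):
    r = r0
    for _ in range(max_iter):
        g = np.sum((x - r**n) * n * r**(n - 1.0))        # score (sigma^2 cancels)
        if use_scoring:
            denom = -np.sum(n**2 * r**(2.0 * n - 2.0))   # -I(r)
        else:                                            # second derivative
            denom = np.sum(n * r**(n - 2.0)
                           * ((n - 1) * x - (2 * n - 1) * r**n))
        step = g / denom
        r -= step
        if not np.isfinite(r) or abs(step) < 1e-10:      # diverged or converged
            break
    return r

for r0 in (0.2, 0.8, 1.2):
    print(f"r0={r0}: Newton-Raphson -> {iterate(r0, False):.4f}, "
          f"scoring -> {iterate(r0, True):.4f}")
```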
Conclusion
◊ If the PDF is known, the MLE can be used.
◊ With MLE, the unknown parameter is estimated by

\hat{\theta} = \arg\max_{\theta}\, p(\mathbf{x};\theta),

where \mathbf{x} is the vector of observed data (N samples); that is, we find the \hat{\theta} that maximizes the likelihood.
◊ Asymptotically unbiased:

\lim_{N\to\infty} E[\hat{\theta}] = \theta.

◊ Asymptotically efficient:

\lim_{N\to\infty} \mathrm{var}(\hat{\theta}) = \mathrm{CRLB}.