
Automatica, Vol. 27, No. 2, pp. 425-426, 1991
Printed in Great Britain.

0005-1098/91 $3.00 + 0.00
Pergamon Press plc
© 1991 International Federation of Automatic Control

Technical Communique

An Elementary Derivation of the Maximum Likelihood Estimator of the Covariance Matrix,

and an Illustrative Determinant Inequality*

SEPPO KARRILA†‡ and TAPIO WESTERLUND†

Key Words--Maximum likelihood estimation; estimation; determinants; optimization; least-squares estimation.

Abstract--The unique maximum likelihood estimate of the covariance matrix of normally distributed random vectors is derived by use of elementary linear algebra, leading to simple scalar equations. In addition, the application of a determinant inequality, also derived here, shows that a standard "derivation" of the maximum likelihood estimate is fallacious.

Introduction

IN SOME textbooks on estimation theory [for example Goodwin and Payne (1977), p. 48] the maximum likelihood estimate of the covariance matrix of normally distributed random vectors is obtained by matrix differentiation results for general matrices, without restricting the covariance matrix to being symmetric positive definite (SPD) or even just symmetric. A stationary point of the likelihood function is obtained; the stationary point is then observed to be SPD, and it is concluded that this must be the unique solution to the maximization problem in the smaller domain of SPD matrices. Although the solution is correct, its derivation is not, and some confusion may arise since the likelihood function attains arbitrarily large values when the covariance matrix is not restricted to being SPD. Naturally the stationary point obtained for general matrices was in fact a saddle point, and some further insight into the situation is provided by the monotonicity result for determinants presented here.

The maximum likelihood estimate

The likelihood function of N independent normally distributed random vectors e with n real components is given by

L = (2π)^{-(N/2)n} |R^{-1}|^{N/2} exp(-½ tr{E^T E R^{-1}})    (1)

where R is the unknown covariance matrix. The matrix E is formed from the observations according to

E^T = [e_1, e_2, ..., e_N].    (2)
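For concreteness, the likelihood (1) is easy to evaluate numerically. The short Python sketch below is our own illustration, not part of the original communique: numpy, the helper name log_likelihood, and the convention that the observations e_i form the rows of E (equation (2)) are all assumptions of this sketch.

    import numpy as np

    def log_likelihood(E, R):
        # Logarithm of equation (1): N independent zero-mean normal
        # n-vectors stacked as the rows of E; R is a candidate covariance.
        N, n = E.shape
        Rinv = np.linalg.inv(R)
        _, logdet_Rinv = np.linalg.slogdet(Rinv)   # log |R^{-1}|, R assumed SPD
        return (-0.5 * N * n * np.log(2.0 * np.pi)
                + 0.5 * N * logdet_Rinv
                - 0.5 * np.trace(E.T @ E @ Rinv))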

The standard way of obtaining the maximum likelihood estimates is by differentiating the logarithm of the likelihood function with respect to the estimated parameters using matrix differentiation rules for general matrices. This gives [see Goodwin and Payne (1977), p. 48, eqn 3.3.10]

∂ln L/∂R = -(N/2) R^{-1} + ½ R^{-1} E^T E R^{-1}.    (3)

* Received 19 December 1989; received in final form 12 June 1990. Recommended for publication in revised form by Editor W. S. Levine.

† Department of Chemical Engineering, Åbo Akademi, Biskopsgatan 8, SF-20500 Åbo, Finland.

‡ Author to whom all correspondence should be addressed.

The maximum likelihood estimate of R is given by the root

R̂ = (1/N) E^T E.    (4)

The saddle-point nature of this symmetric root is seen as follows. We perturb the inverse solution with a real nonzero skew-symmetric matrix H = -H^T:

R^{-1} = N (E^T E)^{-1} + λH    (5)

where λ is a real scalar. Observe that the trace within the exponential in equation (1) is unchanged by this perturbation, since the trace of a sum is the sum of traces, and the skew-symmetric real matrix E H E^T has zero diagonal elements, whereby tr(E^T E H) = tr(E H E^T) = 0. Therefore

L(R^{-1}) = L(R̂^{-1}) (|R̂^{-1} + λH| / |R̂^{-1}|)^{N/2}.    (6)

Also the determinant inequality

|R̂^{-1} + λH| > |R̂^{-1}|    (7)

holds for all λ ≠ 0, as R̂^{-1} is SPD (see the Appendix), so that

L(R^{-1}) > L(R̂^{-1}).    (8)

The likelihood function will thus be increased (monotonically with respect to |λ|) by perturbations about the stationary point with skew-symmetric matrices, and the stationary point given by equation (4) is only a saddle point (for general matrices as the domain).
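A small numerical experiment makes the saddle point tangible. The sketch below, reusing the log_likelihood helper above (the sample sizes, the seed and the construction of H are arbitrary choices of ours), perturbs the inverse estimate as in equation (5) and confirms that the likelihood only increases:

    rng = np.random.default_rng(0)
    N, n = 50, 3
    E = rng.standard_normal((N, n))
    R_hat = E.T @ E / N                      # equation (4)

    M = rng.standard_normal((n, n))
    H = M - M.T                              # skew-symmetric, H = -H^T

    base = log_likelihood(E, R_hat)
    for lam in (0.1, 1.0, 10.0):
        Rinv_pert = np.linalg.inv(R_hat) + lam * H        # equation (5)
        assert abs(np.trace(E.T @ E @ H)) < 1e-9          # trace term unchanged
        print(lam, log_likelihood(E, np.linalg.inv(Rinv_pert)) > base)  # True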

A simple solution

The maximum likelihood estimate can be obtained rigorously by using differentiation rules for symmetric matrices (Graybill, 1983). However, the following elementary and concise approach is more appealing, especially for classroom use.

Constrain R to be SPD and assume E^T E is invertible, so that it is also SPD. Then square roots of these matrices are defined (uniquely, by requiring them to be SPD). Define the matrix

A = (E^T E)^{1/2} R^{-1} (E^T E)^{1/2} > 0    (9)

and note that it has the same trace as E^T E R^{-1} (by the cyclic invariance of the trace). The determinant of A is related to that of R by

|A| = |E^T E| |R^{-1}|.    (10)
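Both relations are easy to verify numerically. Continuing the variables of the previous sketch (the SPD square root is computed here via an eigendecomposition; R_any is an arbitrary SPD candidate of ours):

    W = rng.standard_normal((n, n))
    R_any = W @ W.T + np.eye(n)              # an arbitrary SPD candidate for R
    Rinv = np.linalg.inv(R_any)

    w, V = np.linalg.eigh(E.T @ E)
    S_half = V @ np.diag(np.sqrt(w)) @ V.T   # unique SPD square root of E^T E
    A = S_half @ Rinv @ S_half               # equation (9)

    assert np.isclose(np.trace(A), np.trace(E.T @ E @ Rinv))         # traces agree
    assert np.isclose(np.linalg.det(A),
                      np.linalg.det(E.T @ E) * np.linalg.det(Rinv))  # equation (10)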

Now maximization of equation (1) is equivalent to maximizing

f = |A|^{N/2} exp(-½ tr{A})    (11)

with respect to A, where A is SPD.

Let λ_1, λ_2, ..., λ_n be the eigenvalues of A. Being SPD, A can be diagonalized and all its eigenvalues are positive. Then

f = (∏_{i=1}^n λ_i)^{N/2} exp(-½ ∑_{i=1}^n λ_i) = ∏_{i=1}^n λ_i^{N/2} e^{-λ_i/2}.    (12)

(This equation would be equally valid for general symmetric matrices A (or R), and considering negative λ_i clearly shows that no global maximum would ever be attained.) The stationary point is now obtained from the last expression by considering the factors separately:

d(λ_i^{N/2} e^{-λ_i/2})/dλ_i = λ_i^{(N/2)-1} e^{-λ_i/2} (N/2 - λ_i/2) = 0.    (13)

For the allowed eigenvalues λ_i ∈ ]0, ∞[ the unique solution of equation (13) is

λ_i = N, ∀i.    (14)

Since the derivative of each factor changes sign just once, from positive to negative, the stationary point obtained is the global maximum (within the domain considered).
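The scalar maximization can be checked at a glance; the following lines (ours, working with the logarithm of a factor to avoid overflow) locate the maximizer of λ^{N/2} e^{-λ/2} on a grid and find it at λ ≈ N:

    lam_grid = np.linspace(0.01, 5 * N, 200001)
    log_factor = (N / 2) * np.log(lam_grid) - lam_grid / 2   # log of one factor in (12)
    print(lam_grid[np.argmax(log_factor)])                   # approximately N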

Now all the eigenvalues of A are equal. Note that the only matrix similar to a multiple of the identity is that multiple itself, and

A = N I = (E^T E) R^{-1}.    (15)

The unique SPD maximum likelihood estimate of R is therefore

R̂ = (1/N) E^T E.    (16)
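As a final numerical sanity check (again our illustration, reusing log_likelihood and the data above), random symmetric perturbations of equation (16) that stay SPD never improve the likelihood:

    R_mle = E.T @ E / N                      # equation (16)
    best = log_likelihood(E, R_mle)
    for _ in range(200):
        W = rng.standard_normal((n, n))
        R_try = R_mle + 0.1 * (W + W.T)      # a nearby symmetric candidate
        if np.all(np.linalg.eigvalsh(R_try) > 0):   # keep only SPD candidates
            assert log_likelihood(E, R_try) < best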

Discussion

An elementary proof for the maximum likelihood estimate of the covariance matrix, for normally distributed random vectors, was presented. This proof, it is hoped, will supersede the less rigorous but more complicated proofs in some current textbooks. Aside from the main theme, a determinant inequality, which the authors have not managed to find in the literature, is derived in the Appendix and used to illustrate the importance of proper consideration of the domain in optimization problems.

References

Goodwin, G. C. and R. L. Payne (1977). Dynamic System Identification: Experimental Design and Data Analysis. Academic Press, New York.

Graybill, F. A. (1983). Matrices with Applications in Statistics. Wadsworth, Belmont, CA.

Appendix

Let S be a real symmetric positive definite matrix and H some nonzero real skew-symmetric matrix (H = -H^T). Here we show that

|S + λH| > |S|    (A.1)

holds for all values of the real scalar λ ≠ 0, and that |S + λH| in fact increases monotonically with the absolute magnitude of this perturbation parameter. (The reader may observe that the same proof is valid for the skew-Hermitian perturbation of a Hermitian matrix in the complex case, provided that absolute values of the determinants are taken.)

Observe that

S + λH = S^{1/2}(I + λ S^{-1/2} H S^{-1/2}) S^{1/2}    (A.2)

and by the product rule for determinants

|S + λH| = |S| · |I + λG|    (A.3)

with

G = S^{-1/2} H S^{-1/2}.    (A.4)

Since G is skew-symmetric its eigenvalues are purely imaginary, and these are shifted by unity when the identity matrix is added:

|S + λH| = |S| ∏_{j=1}^n (1 + iλμ_j),    (A.5)

where iμ_j, j = 1, ..., n, are the eigenvalues of G and i = √-1. The product on the RHS is purely real since the LHS is, so taking the absolute value will at most change the sign. Shifting the absolute value to the factors of the product gives

|S| ∏_{j=1}^n |1 + iλμ_j| = |S| ∏_{j=1}^n (1 + λ^2 μ_j^2)^{1/2},    (A.6)

which obviously is monotonically increasing with |λ|, and strictly so, since at least one of the μ_j is nonzero (H ≠ 0). Due to continuity with respect to λ, the RHS of (A.5) cannot jump to negative values as λ moves away from zero; thus it stays positive, and taking the absolute value does not even change the sign. This proves that the LHS of (A.5) is also monotonically increasing with respect to the absolute value of λ. The weaker result

|S + λH| > |S|    (A.7)

for all real λ ≠ 0, follows from this strict monotonicity.

Q.E.D.
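For readers who want to see the inequality in action, a self-contained numerical check (our sketch; the dimension, seed, and the random S and H are arbitrary) confirms both the strict monotonicity in |λ| and the evenness of |S + λH| in λ:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    M = rng.standard_normal((n, n))
    S = M @ M.T + n * np.eye(n)              # real SPD
    W = rng.standard_normal((n, n))
    H = W - W.T                              # real skew-symmetric, H != 0

    lams = np.linspace(0.0, 5.0, 11)
    dets = [np.linalg.det(S + lam * H) for lam in lams]
    assert all(d2 > d1 for d1, d2 in zip(dets, dets[1:]))   # increasing in |lambda|
    dets_neg = [np.linalg.det(S - lam * H) for lam in lams]
    assert np.allclose(dets, dets_neg)       # |S + lam*H| is even in lambda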