
On simulated EM algorithms

Søren Feodor Nielsen
Department of Theoretical Statistics

University of Copenhagen

Abstract

In a paper on the EM algorithm, Ruud (1991) suggests a simulated EM algorithm. The idea is to replace the E-step by a simulated E-step, i.e. to replace the expectation by an estimate obtained by simulation.

The simulations can -at least in principle- be done in two ways. Either new independent random variables are drawn in each iteration, or the same uniforms are reused in each iteration.

In this paper the properties of these two versions of the simulated EM algorithm are discussed and compared.

Keywords: Simulation, EM algorithm



1 Simulated EM algorithms

The EM algorithm (cf. Dempster, Laird, and Rubin (1977)) has two steps, an E-step and an M-step. The E-step is the calculation of the conditional expectation of the complete data log-likelihood given the observed data. The M-step is a maximization of this expression. These two steps are then iterated. It is well known that each iteration of the algorithm increases the observed data log-likelihood, and though the general convergence theory (cf. Wu (1983)) is rather vague, the algorithm often works well in practice. Ruud (1991) gives an overview of the general theory and some applications to econometrics.

However, in some cases the E-step of the algorithm is not practically feasible, because the conditional expectation cannot be calculated. This happens when the expectation is a large sum or when the expectation corresponds to a high-dimensional integral without a closed form expression. In this case, Ruud (1991) suggests replacing the expectation by an estimate obtained by simulation.

Let $X_1, X_2, \ldots, X_n$ be iid random variables with density $f_\theta$. Instead of observing $X_i$ suppose we observe $Y_i = Y(X_i)$. Then the E-step of the EM algorithm is the calculation of
\[
Q(\theta \mid \theta') = \frac{1}{n} \sum_{i=1}^n E_{\theta'}\bigl( \log f_\theta(X_i) \mid Y_i \bigr). \tag{1}
\]

This is then maximized over $\theta$ in the M-step, and the maximizer is used as the new value of $\theta'$ in the following iteration. One iteration of the EM algorithm thus corresponds to calculating the conditional expectation (1) and maximizing it as a function of $\theta$.

We denote the EM update, i.e. the $\theta$-value given by one iteration of the EM algorithm starting in $\theta'$, by $M(\theta')$. The maximum likelihood estimator, $\hat\theta_n$, is a fixed point of $M$. Hence, the EM algorithm finds solutions to $M(\theta) = \theta$ by the method of successive substitution; from a starting value, $\theta_1$, of $\theta$, $\theta_2 = M(\theta_1)$ is calculated and used to calculate $\theta_3 = M(\theta_2)$, etc.
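To make the successive-substitution view concrete, here is a minimal sketch in Python for a toy problem not taken from the paper: estimating a normal mean from right-censored observations, where the E-step has a closed form. All model choices, constants and helper names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch (toy model, not from the paper): EM for a normal mean with
# right-censored data.  X_i ~ N(theta, 1); we observe min(X_i, c) and whether
# censoring occurred.  The E-step replaces each censored X_i by its conditional
# expectation E_{theta'}[X_i | X_i > c]; the M-step averages the completed data.

def em_update(theta, y, censored, c):
    """One EM update M(theta'): E-step followed by M-step."""
    z = c - theta
    cond_mean = theta + norm.pdf(z) / norm.sf(z)   # mean of N(theta,1) truncated to (c, inf)
    completed = np.where(censored, cond_mean, y)
    return completed.mean()                        # M-step: MLE of a normal mean

rng = np.random.default_rng(0)
theta_true, c, n = 1.0, 1.5, 200
x = rng.normal(theta_true, 1.0, n)
censored = x > c
y = np.minimum(x, c)

theta = 0.0                                        # starting value theta_1
for _ in range(100):                               # successive substitution: theta_{k+1} = M(theta_k)
    theta = em_update(theta, y, censored, c)
print(theta)                                       # approximately the MLE, a fixed point of M
```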

In the simulated EM algorithm the expectation (1) is replaced by an estimate in the following way: Let $X_{ij}$ for $j = 1, \ldots, m$ be a random variable drawn from the conditional distribution of $X_i$ given $Y_i$ under the distribution with parameter $\theta'$. Then

\[
\hat Q(\theta \mid \theta') = \frac{1}{n} \sum_{i=1}^n \frac{1}{m} \sum_{j=1}^m \log f_\theta(X_{ij}) \tag{2}
\]

is an unbiased estimate of $Q$. We can pretend that $X_{ij} = F_{\theta'}^{-1}(U_{ij} \mid Y_i)$ where the $U_{ij}$s are independent uniform random variables, independent of the $Y_i$s, and $F_{\theta'}(\cdot \mid Y_i)$ is the appropriate conditional distribution function.

It is clear that we get two different simulated EM algorithms according to whether we draw new independent random variables in each iteration or we reuse the uniforms, $U_{ij}$, in each iteration.
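The two variants can be sketched for the same toy censored-normal model as above; the only difference is whether the uniforms entering the conditional quantile transform are redrawn or held fixed across iterations. The model and helper names are again illustrative, not the paper's.

```python
import numpy as np
from scipy.stats import norm

# The two simulated E-steps for the same toy censored-normal model as above
# (illustrative, not the paper's examples).  A conditional draw is obtained as
# X_ij = F_{theta'}^{-1}(U_ij | Y_i), i.e. by inverting the conditional
# distribution function at a uniform U_ij.

def conditional_quantile(u, theta, c):
    """Inverse conditional cdf of X ~ N(theta, 1) given X > c, evaluated at u."""
    p = norm.cdf(c - theta)
    return theta + norm.ppf(p + u * (1.0 - p))

def simulated_update(theta, y, censored, c, u):
    """One simulated EM update based on the uniforms u (shape: n_censored x m)."""
    draws = conditional_quantile(u, theta, c)            # simulated completions
    total = y[~censored].sum() + draws.mean(axis=1).sum()
    return total / len(y)                                # M-step: average

rng = np.random.default_rng(1)
theta_true, c, n, m = 1.0, 1.5, 200, 5
x = rng.normal(theta_true, 1.0, n)
censored = x > c
y = np.minimum(x, c)

# SEM: the uniforms are drawn once and reused in every iteration.
u_fixed = rng.uniform(size=(censored.sum(), m))
theta_sem = 0.0
for _ in range(100):
    theta_sem = simulated_update(theta_sem, y, censored, c, u_fixed)

# IM: new uniforms are drawn in every iteration; the iterates form a Markov chain.
theta_im = 0.0
for _ in range(100):
    u_new = rng.uniform(size=(censored.sum(), m))
    theta_im = simulated_update(theta_im, y, censored, c, u_new)

print(theta_sem, theta_im)
```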

Drawing new uniforms in each iteration, the sequence of $\theta$-values obtained from the algorithm, $(\hat\theta_n(k))_{k \in \mathbb{N}}$, is a Markov chain, and the estimator, $\hat\theta_n$, is a random variable drawn from the limiting distribution of the chain. This version has been discussed in detail by Nielsen (1997) under the name of the Imputation Maximization algorithm and is closely connected to the Stochastic EM algorithm suggested by Celeux and Diebolt (1985); see Diebolt and Ip (1996) for a review.

Reusing the uniforms, we estimate the function $\theta \mapsto M(\theta)$ once and for all and search for fixed points by the method of successive substitutions. Essentially this corresponds to


estimating
\[
D_\theta Q(\theta \mid \theta')\big|_{\theta = \theta'}
\]
by
\[
G_n(\theta') = \frac{1}{n} \sum_{i=1}^n \frac{1}{m} \sum_{j=1}^m D_\theta \log f_\theta\bigl(F_{\theta'}^{-1}(U_{ij} \mid Y_i)\bigr)\Big|_{\theta = \theta'}
\]
and finding the root by the method of successive substitutions. Notice that $G_n(\theta')$ is an unbiased estimate of $D_\theta Q(\theta \mid \theta')|_{\theta = \theta'}$ if differentiation and expectation are interchangeable.
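A sketch of this view for the toy model used above: with the uniforms frozen, $G_n$ is an ordinary function of $\theta$ whose root is found by successive substitution. For the normal model, $D_\theta \log f_\theta(x) = x - \theta$, so the iteration below coincides with the SEM update; everything here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

# With the uniforms held fixed, G_n is an ordinary (frozen) function of theta and the
# SEM algorithm searches for a root of it.  Sketch for the same toy model as above;
# here D_theta log f_theta(x) = x - theta, so G_n(theta) is the mean completed residual
# and theta <- theta + G_n(theta) coincides with the SEM update.

def conditional_quantile(u, theta, c):
    p = norm.cdf(c - theta)
    return theta + norm.ppf(p + u * (1.0 - p))

def G_n(theta, y, censored, c, u_fixed):
    """Simulated score estimate with reused uniforms u_fixed (n_censored x m)."""
    draws = conditional_quantile(u_fixed, theta, c)
    completed_mean = (y[~censored].sum() + draws.mean(axis=1).sum()) / len(y)
    return completed_mean - theta

rng = np.random.default_rng(2)
theta_true, c, n, m = 1.0, 1.5, 200, 5
x = rng.normal(theta_true, 1.0, n)
censored = x > c
y = np.minimum(x, c)
u_fixed = rng.uniform(size=(censored.sum(), m))

theta = 0.0
for _ in range(100):                       # successive substitution on the frozen G_n
    theta = theta + G_n(theta, y, censored, c, u_fixed)
print(theta, G_n(theta, y, censored, c, u_fixed))   # (approximate) root of G_n
```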

It should be clear that, at least for moderate values of $m$, the two versions differ significantly. The purpose of this paper is to compare these two simulated EM algorithms.

To keep the distinction between these two versions of the algorithm clear, we will use the name IM algorithm for the version where new independent random variables are drawn in each iteration, and call the second version, where the uniforms are reused, the SEM algorithm, as done by McFadden and Ruud (1994). One should note that the supplemented EM algorithm (cf. Meng and Rubin (1991)) is also referred to as the SEM algorithm, and so is the Stochastic EM algorithm (cf. Celeux and Diebolt (1985)). Thus SEM is not a perfectly chosen acronym, but it should not cause any confusion in the context of this paper.

2 Asymptotic results

We begin this section by introducing some notation. All score functions to be defined below are assumed to have expectation zero and finite variance.

Let $f_\theta$ be the density of $X$. Let $s_X(\theta)$ be the corresponding score function and put $V(\theta) = E_\theta\bigl(s_X(\theta)^{\otimes 2}\bigr)$. Let $\theta_0$ denote the true unknown value of $\theta$.

The score function corresponding to the conditional distribution of $X$ given $Y = y$ is denoted $s_{X|Y}(\theta)$. Let $I_y(\theta) = E_\theta\bigl(s_{X|Y}(\theta)^{\otimes 2} \mid Y = y\bigr)$.

Let $s_Y(\theta)$ be the score function corresponding to the distribution of $Y$ and put $I(\theta) = E_\theta\bigl(s_Y(\theta)^{\otimes 2}\bigr)$.

Notice that $s_Y(\theta) = s_X(\theta) - s_{X|Y}(\theta)$ and that $I(\theta) = V(\theta) - E_\theta I_Y(\theta)$. Finally, put $F(\theta) = E_\theta I_Y(\theta) V(\theta)^{-1}$, which can be interpreted as the expected fraction of missing information (cf. Dempster et al. (1977)).

Asymptotic results for the IM algorithm were discussed by Nielsen (1997, Theorem 1). He proved the following result under general regularity conditions:

Theorem 1 (Asymptotic results for the IM algorithm) If the Markov chain is ergodic and tight, then $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{\mathcal{D}} N(0, \Sigma_m(\theta_0))$ where
\[
\Sigma_m(\theta_0) = I(\theta_0)^{-1} + \frac{1}{m} \sum_{k=0}^{\infty} F(\theta_0)^k\, V(\theta_0)^{-1} E_{\theta_0} I_Y(\theta_0) V(\theta_0)^{-1}\, \bigl(F(\theta_0)^t\bigr)^k
= I(\theta_0)^{-1} + \frac{1}{m}\, V(\theta_0)^{-1} E_{\theta_0} I_Y(\theta_0) V(\theta_0)^{-1} \bigl(I - F(\theta_0)^2\bigr)^{-1} \tag{3}
\]

where I is the identity matrix.

See Nielsen (1997) for a precise statement of the regularity conditions assumed and for a proof of (3).
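In the scalar case all quantities in (3) are numbers, and the extra Monte Carlo variance can be evaluated directly; the values of $I(\theta_0)$ and $E_{\theta_0} I_Y(\theta_0)$ below are arbitrary illustrative choices.

```python
# Scalar-case evaluation of the variance in (3): how the extra Monte Carlo variance
# decays with m.  The values of I(theta_0) and E I_Y(theta_0) are arbitrary.
I, EIY = 1.0, 3.0                 # observed information and expected missing information
V = I + EIY                       # complete-data information
F = EIY / V                       # expected fraction of missing information
for m in (1, 5, 10, 100):
    sigma_m = 1.0 / I + (1.0 / m) * (EIY / V**2) / (1.0 - F**2)
    print(m, sigma_m)             # decreases towards 1/I as m grows
```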


In order to obtain large sample results for the SEM version, we apply Corollary 3.2 and Theorem 3.3 in Pakes and Pollard (1989). McFadden and Ruud (1994) give similar results under stronger assumptions.

We note that
\[
G(\theta) = E_{\theta_0}\bigl(G_n(\theta)\bigr) = E_{\theta_0}\bigl(D_\theta \log f_\theta(X_{ij})\bigr) = E_{\theta_0} E_\theta\bigl(D_\theta \log f_\theta(X_{ij}) \mid Y_i\bigr)
= E_{\theta_0} E_\theta\bigl(s_X(\theta) \mid Y\bigr) = E_{\theta_0}\bigl(s_Y(\theta) + E_\theta(s_{X|Y}(\theta) \mid Y)\bigr) = E_{\theta_0}\bigl(s_Y(\theta)\bigr),
\]
which is differentiable under usual regularity conditions with
\[
D_\theta G(\theta)\big|_{\theta = \theta_0} = -I(\theta_0). \tag{4}
\]
Furthermore, by the central limit theorem
\[
\sqrt{n}\, G_n(\theta_0) \xrightarrow{\mathcal{D}} N\Bigl(0,\; I(\theta_0) + \frac{1}{m} E_{\theta_0} I_Y(\theta_0)\Bigr). \tag{5}
\]

By assumption $G(\theta_0) = 0$ and we will assume that there is a neighbourhood of $\theta_0$ such that $\theta_0$ is the only root of $G$ inside this neighbourhood.

From the law of large numbers, we know that $G_n(\theta) \xrightarrow{P} G(\theta)$ for all $\theta$. This needs to be extended to uniform convergence to obtain consistency. Also, $\sqrt{n}\bigl(G_n(\theta) - G(\theta)\bigr)$ is asymptotically normal by the central limit theorem; we need this to be close to $\sqrt{n}\, G_n(\theta_0)$ when $\theta$ is close to $\theta_0$. To be precise we need these two assumptions:

(A1) $\displaystyle \sup_{\theta \in C} \frac{\|G_n(\theta) - G(\theta)\|}{1 + \|G_n(\theta)\| + \|G(\theta)\|} \xrightarrow{P} 0$, for a compact neighbourhood, $C$, of $\theta_0$.

(A2) $\displaystyle \sup_{\|\theta - \theta_0\| < \delta_n} \frac{\sqrt{n}\,\|G_n(\theta) - G(\theta) - G_n(\theta_0)\|}{1 + \sqrt{n}\,\|G_n(\theta)\| + \sqrt{n}\,\|G(\theta)\|} \xrightarrow{P} 0$, for any $\delta_n \to 0$.

Stronger assumptions are obtained if the denominators are ignored. A sufficient condition for both assumptions is given in the following lemma:

Lemma 1 Suppose that in a neighbourhood of $\theta_0$, for some $\alpha > 0$,
\[
\bigl| \log f_\theta\bigl(F_\theta^{-1}(U \mid Y)\bigr) - \log f_{\theta'}\bigl(F_{\theta'}^{-1}(U \mid Y)\bigr) \bigr| \le \psi(U, Y)\, \|\theta - \theta'\|^{\alpha}.
\]
(i) If $E_{\theta_0}|\psi(U, Y)| < \infty$ then (A1) holds.

(ii) If $E_{\theta_0}\psi(U, Y)^2 < \infty$ then (A2) holds.

Proof: See Lemma 2.13, Lemma 2.8 and Lemma 2.17 in Pakes and Pollard (1989). □

We say that $\hat\theta_n$ is an asymptotic local minimum of $\|G_n(\theta)\|$ if for some open set $W \subseteq \Theta$, $\|G_n(\hat\theta_n)\| \le \inf_{\theta \in W} \|G_n(\theta)\| + o_P(n^{-1/2})$. From Corollary 3.2 and Theorem 3.3 in Pakes and Pollard (1989) we get the following result:

Theorem 2 (Asymptotic results for the SEM algorithm) Under assumption (A1), there is an asymptotic local minimum, $\hat\theta_n$, of $\|G_n(\theta)\|$ such that $\hat\theta_n \xrightarrow{P} \theta_0$.

Under assumption (A2), if $\hat\theta_n$ is consistent, then
\[
\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{\mathcal{D}} N\Bigl(0,\; I(\theta_0)^{-1} + \frac{1}{m}\, I(\theta_0)^{-1} E_{\theta_0} I_Y(\theta_0) I(\theta_0)^{-1}\Bigr).
\]


3 A comparison

It is clear that both the SEM and the IM version of the simulated EM algorithm approximate the EM algorithm as $m$ tends to infinity. Both versions lead to estimators with asymptotic variances tending to $I(\theta_0)^{-1}$ as $m \to \infty$. For finite values of $m$, however, the asymptotic variances differ.

Recalling that $F(\theta_0) = E_{\theta_0} I_Y(\theta_0) V(\theta_0)^{-1}$ and $V(\theta_0) = I(\theta_0) + E_{\theta_0} I_Y(\theta_0)$, we find (with the usual ordering of positive definite matrices) that
\[
\begin{aligned}
& I(\theta_0)^{-1} E_{\theta_0} I_Y(\theta_0) I(\theta_0)^{-1} \ge V(\theta_0)^{-1} E_{\theta_0} I_Y(\theta_0) V(\theta_0)^{-1} \bigl(I - F(\theta_0)^2\bigr)^{-1} \\
\Longleftrightarrow\;& \bigl(I - F(\theta_0)^2\bigr) V(\theta_0) E_{\theta_0} I_Y(\theta_0)^{-1} V(\theta_0) \ge I(\theta_0) E_{\theta_0} I_Y(\theta_0)^{-1} I(\theta_0) \\
\Longleftrightarrow\;& E_{\theta_0} I_Y(\theta_0) + I(\theta_0) E_{\theta_0} I_Y(\theta_0)^{-1} I(\theta_0) + 2 I(\theta_0) - E_{\theta_0} I_Y(\theta_0) \ge I(\theta_0) E_{\theta_0} I_Y(\theta_0)^{-1} I(\theta_0) \\
\Longleftrightarrow\;& 2 I(\theta_0) \ge 0.
\end{aligned}
\]

Hence, the IM version leads to asymptotically better estimates -in the sense of smaller variance- than the SEM version.
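A quick numerical check of this ordering in the scalar case, where both asymptotic variances reduce to simple expressions (illustrative values of the missing information; $m = 1$):

```python
import numpy as np

# Scalar-case check of the comparison above: the IM variance (Theorem 1) never exceeds
# the SEM variance (Theorem 2), whatever the amount of missing information (m = 1).
I = 1.0
for EIY in np.linspace(0.1, 20.0, 8):                   # illustrative grid of E I_Y(theta_0)
    V = I + EIY
    F = EIY / V
    var_im = 1.0 / I + (EIY / V**2) / (1.0 - F**2)
    var_sem = 1.0 / I + EIY / I**2
    assert var_im <= var_sem + 1e-12
    print(round(EIY, 2), round(var_im, 3), round(var_sem, 3))
```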

This result may seem counterintuitive as new random noise is added in each step of the IM algorithm. The following example suggests an explanation.

Example 1 Let $X_1, X_2, \ldots, X_n$ be iid bivariate random variables, normally distributed with common expectation $\theta$ and known variance matrix $\Sigma$. Suppose only the first coordinate, $X_{1i}$, of each $X_i = (X_{1i}, X_{2i})^t$ is observed. The observed data MLE, $\tilde\theta_n$, is just the average of the $X_{1i}$s.

The simulated EM algorithm (with $m = 1$) corresponds to simulating $X_{2i}$ from the conditional distribution of $X_{2i}$ given $X_{1i}$ and then averaging all the $X$s, the observed as well as the simulated. This leads to

(6)

where $\rho$ is the correlation coefficient of $X_i$ and $\varepsilon_k \sim N(0, \sigma^2)$; $\sigma^2$ is given from the known matrix $\Sigma$ but will not be specified here.

In the SEM case, $\varepsilon_k = \varepsilon_1$ for all $k$ and as $k \to \infty$

(7)

whereas in the IM case (6) defines an AR(1) process, so that

(8)


when $k \to \infty$.

The (asymptotic) distribution of $\sqrt{n}(\hat\theta_n - \theta_0) = \sqrt{n}(\hat\theta_n - \tilde\theta_n) + \sqrt{n}(\tilde\theta_n - \theta_0)$ in the SEM and the IM case is thus normal with expectation 0 and variance given by the variance of the MLE plus the variance specified above (in (7) and (8), respectively). It is not difficult to show directly that the variance of the IM estimator is smaller than the variance of the SEM estimator. Instead, looking at (6), we observe that

For the IM version the covariance term is 0, whereas it is positive for the SEM version. Thus, we see that the variance of the SEM estimator is larger exactly because "the uniforms are reused".

Furthermore, the IM estimator can easily be improved -in the sense of reducing the variance- by averaging over the last 5-20, say, iterations. Since the iterations constitute a Markov chain, averaging will decrease the variance (see Nielsen (1997) for details). Obviously, the only way to improve the SEM estimator is to increase $m$, i.e. to increase the simulation burden.
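The example can be simulated directly. The sketch below assumes a specific $\Sigma$, namely unit marginal variances and correlation $\rho$ (the paper leaves $\Sigma$ unspecified), and uses $m = 1$; it compares the spread of the simulation error $\hat\theta_n - \tilde\theta_n$ under the two versions.

```python
import numpy as np

# Simulation of Example 1 with m = 1, assuming unit marginal variances and correlation
# rho (the paper leaves Sigma unspecified).  Both coordinates have mean theta, only the
# first is observed, and the MLE is the average of the observed coordinate.  We compare
# the spread of the simulation error, hat(theta) - tilde(theta), under SEM and IM.
rng = np.random.default_rng(3)
theta_true, rho, n, iters, reps = 0.0, 0.7, 50, 200, 1000

def run(x1, reuse):
    """One run of the simulated EM algorithm; reuse=True gives SEM, reuse=False gives IM."""
    mle = x1.mean()
    eps_fixed = rng.standard_normal(n)               # the noise SEM reuses in every iteration
    theta = 0.0
    for _ in range(iters):
        eps = eps_fixed if reuse else rng.standard_normal(n)
        # draw X_2i from its conditional distribution given X_1i under the current theta
        x2 = theta + rho * (x1 - theta) + np.sqrt(1 - rho**2) * eps
        theta = 0.5 * (x1.mean() + x2.mean())        # M-step: average of all 2n coordinates
    return theta - mle                               # simulation error

x1 = rng.standard_normal(n) + theta_true             # one observed data set
sem = np.array([run(x1, reuse=True) for _ in range(reps)])
im = np.array([run(x1, reuse=False) for _ in range(reps)])
print("SEM sd:", sem.std(), "IM sd:", im.std())      # the IM spread is typically smaller
```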

The SEM algorithm is a deterministic algorithm searching for a root of the (random) function $G_n(\theta)$. Hence, if it converges, it converges deterministically. Little is known about the convergence of the SEM algorithm, though. There is no apparent reason to believe that $G_n(\theta)$ has only one root, and the SEM algorithm stops when one is found. There may be no root at all, as in the example below.

Example 2 Let $X_1, X_2, \ldots, X_n$ be iid normally distributed with expectation 0 and variance $\theta$. Observe $X_i$ if $X_i \ge 0$. The SEM algorithm (with $m = 1$) corresponds to iterating
\[
\hat\theta_n(k+1) = \frac{1}{n}\Bigl( \sum_{i=1}^n X_i^2 \cdot 1_{\{X_i \ge 0\}} + \hat\theta_n(k) \sum_{i=1}^n \varepsilon_i^2 \cdot 1_{\{X_i < 0\}} \Bigr)
\]
where the $\varepsilon_i$ are iid random variables from the standard normal distribution given that they are negative. If $\sum_{i=1}^n \varepsilon_i^2 \cdot 1_{\{X_i < 0\}} \ge n$ (which happens with positive probability), then $\hat\theta_n(k) \to \infty$. If $\sum_{i=1}^n \varepsilon_i^2 \cdot 1_{\{X_i < 0\}} < n$ then $\hat\theta_n(k)$ converges to $\sum_{i=1}^n X_i^2 \cdot 1_{\{X_i \ge 0\}} / (n - \sum_{i=1}^n \varepsilon_i^2 \cdot 1_{\{X_i < 0\}})$, which may be arbitrarily far from the maximum likelihood estimator; the MLE is obtained when $\sum_{i=1}^n \varepsilon_i^2 \cdot 1_{\{X_i < 0\}} = \sum_{i=1}^n 1_{\{X_i < 0\}}$.

Using a contraction principle (cf. Letac (1986)), it is not difficult to show that the

Markov chain of the IM algorithm is ergodic so that the IM algorithm converges. Assumptions (A1) and (A2) (in the SEM case) and tightness (in the IM case) can be

shown to hold, but the details will not be given here. Consequently, the asymptotic results of Section 2 hold.
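The SEM recursion of Example 2 is easy to reproduce; the following is a minimal sketch with $m = 1$ (not the paper's code). It draws the negative normal completions by inverting the conditional distribution function, so the same uniforms are reused, and then iterates the resulting linear map in $\theta$.

```python
import numpy as np
from scipy.stats import norm

# A minimal sketch of the SEM iteration in Example 2 (m = 1): each censored X_i is
# completed by sqrt(theta) * eps_i with eps_i a standard normal draw conditioned to be
# negative.  With the eps_i held fixed the iteration is linear in theta, and it diverges
# when the sum of eps_i^2 over the censored observations reaches n.
rng = np.random.default_rng(4)
theta_true, n = 1.0, 50
x = rng.normal(0.0, np.sqrt(theta_true), n)
observed = x >= 0

u = rng.uniform(size=(~observed).sum())
eps = norm.ppf(0.5 * u)                      # N(0,1) restricted to (-inf, 0), via the inverse cdf

a = (x[observed] ** 2).sum() / n             # contribution of the observed X_i
b = (eps ** 2).sum() / n                     # slope of the linear recursion
theta = 1.0
for _ in range(200):
    theta = a + b * theta                    # theta_{k+1} = (1/n)(sum obs X_i^2 + theta_k sum eps_i^2)
print("slope:", round(b, 3), "final theta:", theta)   # diverges whenever b >= 1
```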

The Markov chain of the IM algorithm is irreducible under weak assumptions, and the IM algorithm therefore does not get stuck. If it converges, i.e. if the Markov chain is ergodic, the IM algorithm converges stochastically in the sense that the sequence of distributions of the $\theta$-values obtained from the iterations converges in total variation. Thus, whereas it is a relatively simple task to see if the SEM algorithm has converged, the convergence of


the IM algorithm is much more awkward to ascertain.

It is not clear which algorithm converges faster. As indicated above, convergence of the IM algorithm is difficult to ascertain, and this will typically lead to iterations after the algorithm has converged. The method of successive substitutions, which is applied in the SEM as well as in the EM algorithm, is typically slow, as is well known in the case of the EM algorithm.

Finally, a word on the simulations. Using acceptance-rejection sampling or Markov chain Monte Carlo techniques, we can generally simulate the $X_{ij}$s. However, neither technique will give us simulations suitable for the SEM version of the simulated EM algorithm, as neither method "re-uses uniforms" even if we start each iteration with the same random seed. Thus, unless we can simulate the $X_{ij}$s in a non-iterative manner, the SEM algorithm is not really implementable. This problem does not occur for the IM algorithm, though it should be noted that if an MCMC simulation of the $X_{ij}$s is necessary, then -for instance- a Monte Carlo EM algorithm can be performed with the same amount of simulation, and this may well be preferable.
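The distinction can be seen in a toy illustration (not from the paper): for a draw from $N(\theta, 1)$ conditioned to be negative, the inverse-cdf construction is a fixed, smooth function of a single reusable uniform, whereas acceptance-rejection consumes a random number of uniforms, so restarting from the same seed does not reproduce the reused-uniform structure.

```python
import numpy as np
from scipy.stats import norm

# Toy illustration (not from the paper): drawing from N(theta, 1) conditioned to be
# negative.  The inverse-cdf draw is a fixed, smooth function of one reusable uniform,
# whereas an acceptance-rejection draw consumes a random number of uniforms, so
# restarting from the same seed does not reproduce the reused-uniform structure.

def inverse_cdf_draw(u, theta):
    return theta + norm.ppf(u * norm.cdf(-theta))    # quantile of N(theta,1) given X < 0

def accept_reject_draw(rng, theta):
    used = 0
    while True:
        x = theta + rng.standard_normal()
        used += 1
        if x < 0:
            return x, used                           # the number of proposals is random

u = 0.3
print([round(inverse_cdf_draw(u, t), 3) for t in (-1.0, 0.0, 1.0)])   # varies smoothly with theta

for t in (-1.0, 0.0, 1.0):
    rng = np.random.default_rng(5)                   # same seed every time
    print(t, accept_reject_draw(rng, t)[1])          # yet the uniforms consumed differ
```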

In conclusion, the IM algorithm has some serious theoretical advantages compared to the SEM algorithm. The asymptotic variance is smaller and the algorithm does not get stuck. On the other hand, convergence is more difficult to detect in the IM algorithm than in the SEM algorithm, and we conjecture that this may make the latter algorithm faster. Notice however that the SEM algorithm is not always implementable.

In order to apply either algorithm, some assumptions must be checked. For the SEM algorithm we need (A1) and (A2). This can typically be shown from smoothness of the simulations, if the simulations are indeed smooth. If not, more general empirical process techniques can be applied (cf. Pakes and Pollard (1989)). In the IM case, it is ergodicity and tightness that must be shown, typically by verifying a drift criterion; see Nielsen (1997).

4 Monte Carlo experiment

We conclude this paper with a small simulation study. McFadden and Ruud (1994) consider the following trivariate tobit model.

Let $X_i = (X_{1i}, X_{2i}, X_{3i})^t$ be trivariate normal with mean vector $\beta t_i (1, 1, 1)^t$ and a covariance matrix determined by $\rho$ and $\sigma^2$, with $\mathrm{Var}(X_{1i}) = \sigma^2 + \rho^2$.

We observe $Y_i = X_i 1_{\{X_i > 0\}}$. The observed covariates, $t_i$, are simulated as iid normal variates with mean 1 and variance 2. We simulate 50 replications of $X$ with $\beta = 1$, $\rho = 0.7$ and $\sigma^2 = 1.51$. The parameter $\sigma^2$ is the conditional variance of $X_1$ given $(X_2, X_3)$. This parameterization is chosen because it makes the parameters variation independent and, consequently, the results easier to interpret.

To investigate the behaviour of the two versions of the simulated EM algorithm we have obtained 1000 estimates from each algorithm. The IM algorithm was run for 16,000 iterations and the last 1000 iterations were used. The SEM version was run 1000 times. We do this for $m = 1, 5, 10$.

The complete data MLE does not have a closed form and the M-step must be performed


numerically. We have used the method of scoring with a fixed tolerance. The simulated E-step involves drawing values of $(X_1, X_2)$ given that they are both

negative. This cannot be done in a non-iterative manner, and as mentioned above this means that the SEM algorithm is not really implementable. Like McFadden and Ruud (1994), we have run a Gibbs sampler for 10 cycles to do this simulation in the SEM version. In the IM version, this simulation is done by acceptance-rejection sampling.
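A sketch of the two samplers for a pair of jointly normal variables conditioned to lie in the negative quadrant; the means and the correlation used below are placeholders, not the values implied by the tobit model, and the Gibbs sampler is run for 10 cycles as in the experiment.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the two samplers for a pair of jointly normal variables conditioned to lie
# in the negative quadrant.  The means mu1, mu2 and correlation r are placeholders
# (unit variances assumed); in the experiment they would come from the current
# parameter value and the tobit covariance structure.
rng = np.random.default_rng(6)
mu1, mu2, r = 0.5, 0.2, 0.6

def truncated_normal_neg(rng, mean, sd):
    """Draw from N(mean, sd^2) conditioned to be negative, by inverting the cdf."""
    return mean + sd * norm.ppf(rng.uniform() * norm.cdf(-mean / sd))

def gibbs_draw(rng, cycles=10):
    x1, x2 = -1.0, -1.0                              # arbitrary negative starting point
    for _ in range(cycles):                          # 10 cycles, as in the experiment
        x1 = truncated_normal_neg(rng, mu1 + r * (x2 - mu2), np.sqrt(1 - r**2))
        x2 = truncated_normal_neg(rng, mu2 + r * (x1 - mu1), np.sqrt(1 - r**2))
    return x1, x2

def accept_reject_draw(rng):
    while True:                                      # propose from the unconditional law
        z1, z2 = rng.standard_normal(2)
        x1 = mu1 + z1
        x2 = mu2 + r * z1 + np.sqrt(1 - r**2) * z2
        if x1 < 0 and x2 < 0:
            return x1, x2

print(gibbs_draw(rng), accept_reject_draw(rng))
```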

The distribution of the simulation estimators, $\hat\theta - \theta_0$, obtained from the SEM and the IM algorithms can be decomposed into a contribution from the simulations and a contribution from the data, $\hat\theta - \tilde\theta$ and $\tilde\theta - \theta_0$, respectively. The latter -the distribution of the MLE- is a function of the data only and gives no information on how well the two simulation estimators behave. The simulation estimators, on the other hand, attempt to approximate the maximum likelihood estimator (cf. Example 1), since the simulated EM algorithm approximates the EM algorithm. Therefore, it is the distribution of $\hat\theta - \tilde\theta$ that is of interest when evaluating the performance of the simulation estimators. Hence, we look at the empirical distribution of the difference between the simulation estimators, $\hat\theta$, obtained from the SEM and the IM algorithms, and the maximum likelihood estimator, $\tilde\theta$.

We give summary statistics for the estimators of each parameter (Tables 1-3), and histograms and QQ-plots of the distribution of $\hat\theta - \tilde\theta$ for each parameter and each value of $m$ are shown. The two histograms in each figure -one for each version of the simulated EM algorithm- have the same axes and are thus directly comparable. The displayed part of the abscissae of the histograms all contain 0. The MLE based on the observed data is shown in the tables for completeness.

Looking first at the results for $\beta$ (Table 1 and Figures 1-3) we notice that the variance of the IM estimator is larger than the variance of the SEM estimator. This is probably due to the small sample size, as the results of Section 2 indicate that the asymptotic variance is smaller for the IM estimator.

The bias of the SEM estimator is, however, large compared to the bias of the IM estimator, and the IM estimator clearly performs better in this example in spite of the larger variance. The fact that the bias of the IM estimator is larger for $m = 5$ than for $m = 1$ or $m = 10$ is surprising, but it also occurs in additional simulations (not shown here).

The QQ-plots suggest that the estimators are reasonably close to normally distributed. The tails of the distributions may be a bit too heavy, and the IM estimator seems to be slightly closer to normal than the SEM estimator.

$\beta$ : ($\tilde\beta = 1.042$)

          Mean     StDev    25%      Median   75%
m=1:  IM  -0.0117  0.0196  -0.0249  -0.0115   0.0013
      SEM -0.0798  0.0122  -0.0878  -0.0808  -0.0724
m=5:  IM  -0.0635  0.0104  -0.0702  -0.0637  -0.0565
      SEM -0.0798  0.0055  -0.0838  -0.0799  -0.0762
m=10: IM  -0.0118  0.0065  -0.0162  -0.0117  -0.0074
      SEM -0.0797  0.0038  -0.0822  -0.0798  -0.0772

Table 1: Simulation results for $\hat\beta - \tilde\beta$. StDev is the standard deviation, 25% and 75% are the lower and upper quartiles.

Figure 1: Histograms and QQ-plots for $\hat\beta - \tilde\beta$, $m = 1$.

Figure 2: Histograms and QQ-plots for $\hat\beta - \tilde\beta$, $m = 5$.

Figure 3: Histograms and QQ-plots for $\hat\beta - \tilde\beta$, $m = 10$.

For $\rho$ we see that the bias is significantly larger for the SEM estimator than for the IM estimator, but the variances are now roughly equal. Again the IM estimator is superior. Figures 4-6 suggest that the estimators are not too far from normally distributed.

We may note that the SEM estimator is less biased than the IM estimator when looking at the empirical distribution of $\hat\rho - \rho_0$ rather than $\hat\rho - \tilde\rho$. This is due to the large positive bias in the MLE and is thus an effect of the data and not the simulations. Consequently, we would not expect the SEM estimator to perform better than the IM estimator on all data sets, though it may perform better when $\tilde\rho$ is positively biased. In particular, in large samples, where we would expect the bias of the MLE to be negligible, we would expect the IM estimator to be superior. Incidentally, this is the only parameter, $\theta$, for which $\hat\theta - \theta_0$ is more biased in the IM case.

$\rho$ : ($\tilde\rho = 1.101$)

          Mean     StDev    25%      Median   75%
m=1:  IM   0.0386  0.0977  -0.0261   0.0417   0.1061
      SEM -0.2122  0.0917  -0.2733  -0.2118  -0.1527
m=5:  IM   0.0276  0.0408  -0.0003   0.0269   0.0553
      SEM -0.2179  0.0393  -0.2444  -0.2168  -0.1908
m=10: IM   0.0372  0.0318   0.0148   0.0371   0.0595
      SEM -0.2185  0.0277  -0.2361  -0.2180  -0.1996

Table 2: Simulation results for $\hat\rho - \tilde\rho$. StDev is the standard deviation, 25% and 75% are the lower and upper quartiles.

Figure 4: Histograms and QQ-plots for $\hat\rho - \tilde\rho$, $m = 1$.

Figure 5: Histograms and QQ-plots for $\hat\rho - \tilde\rho$, $m = 5$.


Figure 6: Histograms and QQ-plots for $\hat\rho - \tilde\rho$, $m = 10$.

When estimating $\sigma^2$, the IM estimator again performs better; the bias is larger for the SEM estimator, and the variances are here smaller for the IM estimator.

Figures 7-9 show that the distribution of $\hat\sigma^2 - \tilde\sigma^2$ is far from normal, especially for small values of $m$, but this is what we would expect. Ignoring the multivariate nature of the data and the fact that $\sigma^2$ is not the only parameter, the problem of estimating $\sigma^2$ is very similar to Example 2. Thus, we would expect the SEM estimator to be distributed approximately as $a/(50 - \chi^2_{12m}/m)$ for some constant $a$, since only 12 of the $X_1$s are censored. In particular, $\hat\sigma^2 - \tilde\sigma^2$ is not approximately normal for small values of $m$ in the SEM case. The distribution of the IM estimator is more difficult to describe, but since the conditional distribution of the next step of the Markov chain in Example 2 given the past is an affine transformation of a $\chi^2_{12m}$-distribution, we would not expect $\hat\sigma^2 - \tilde\sigma^2$ to be approximately normal for small values of $m$ in the IM case either.
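A quick sampling check of the shape of this approximation (the constant $a$ only rescales and is set to 1 here):

```python
import numpy as np
from scipy.stats import chi2, skew

# A quick look at the shape of a / (50 - chi2_{12m} / m); the constant a only rescales
# and is set to 1 here.  The draws are noticeably skewed for m = 1 and much closer to
# symmetric for m = 10.
rng = np.random.default_rng(7)
for m in (1, 5, 10):
    denom = 50.0 - chi2.rvs(12 * m, size=100_000, random_state=rng) / m
    print(m, round(float(skew(1.0 / denom)), 2))
```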

$\sigma^2$ : ($\tilde\sigma^2 = 1.538$)

          Mean     StDev    25%      Median   75%
m=1:  IM  -0.1198  0.1594  -0.2287  -0.1347  -0.0291
      SEM  0.2712  0.1927   0.1390   0.2526   0.3787
m=5:  IM  -0.1703  0.0623  -0.2112  -0.1750  -0.1324
      SEM  0.2792  0.0836   0.2227   0.2772   0.3318
m=10: IM  -0.1052  0.0523  -0.1424  -0.1066  -0.0712
      SEM  0.2809  0.0584   0.2402   0.2799   0.3216

Table 3: Simulation results for $\hat\sigma^2 - \tilde\sigma^2$. StDev is the standard deviation, 25% and 75% are the lower and upper quartiles.

Figure 7: Histograms and QQ-plots for $\hat\sigma^2 - \tilde\sigma^2$, $m = 1$.

Figure 8: Histograms and QQ-plots for $\hat\sigma^2 - \tilde\sigma^2$, $m = 5$.


Figure 9: Histograms and QQ-plots for $\hat\sigma^2 - \tilde\sigma^2$, $m = 10$.

For all parameters, the standard deviations decrease as $m$ increases. The decrease is roughly $1/\sqrt{5} \approx 0.45$ when $m$ increases from 1 to 5 and $1/\sqrt{2} \approx 0.71$ when going from $m = 5$ to $m = 10$, as we would expect from the asymptotic results. This fact and the distribution of $\hat\sigma^2 - \tilde\sigma^2$ indicate that as $m$ increases the empirical distribution of the simulation estimators minus the MLEs gets closer to normality. This does not mean that the asymptotic results of Section 2 hold. For instance, the distribution of $\hat\sigma^2 - \sigma_0^2$ does not approach a normal distribution when $m$ increases, since the distribution of $\tilde\sigma^2 - \sigma_0^2$ is unaffected by $m$, and this distribution is probably not normal with the sample size in this example.
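Reading the standard deviations straight from Tables 1-3, the successive ratios can be compared with $1/\sqrt{5} \approx 0.45$ and $1/\sqrt{2} \approx 0.71$:

```python
# Ratios of the tabulated standard deviations (Tables 1-3) as m increases, to be
# compared with 1/sqrt(5) ~ 0.45 and 1/sqrt(2) ~ 0.71.
sd = {  # (parameter, algorithm): (m=1, m=5, m=10)
    ("beta", "IM"): (0.0196, 0.0104, 0.0065), ("beta", "SEM"): (0.0122, 0.0055, 0.0038),
    ("rho", "IM"): (0.0977, 0.0408, 0.0318), ("rho", "SEM"): (0.0917, 0.0393, 0.0277),
    ("sigma2", "IM"): (0.1594, 0.0623, 0.0523), ("sigma2", "SEM"): (0.1927, 0.0836, 0.0584),
}
for key, (s1, s5, s10) in sd.items():
    print(key, round(s5 / s1, 2), round(s10 / s5, 2))
```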

The bias of the SEM estimator is unaffected by the choice of $m$, whereas the mean of the IM estimator is always lowest for $m = 5$. This leads to a larger bias when $m = 5$ in the estimators of $\beta$ and $\sigma^2$, but a smaller one in the positively biased estimator of $\rho$. As mentioned previously, this shows up in additional simulations and is thus not "accidental". Whether it is an effect of the observations or a more general phenomenon remains to be seen.

The components of the IM estimator appear to be independent, whereas the SEM estimators of $\rho$ and $\sigma^2$ are negatively correlated; the correlation coefficient is approximately -0.5, independent of $m$. Hence, some care should be exercised when interpreting the results for $\rho$ and $\sigma^2$.

In conclusion, the IM algorithm performs better than the SEM algorithm in this example, even though the sample size is too small for the asymptotic results to hold. The SEM algorithm is faster than the IM algorithm in this example, but the number of iterations used in the IM version is probably a lot larger than what is necessary for convergence. We might hope to improve the SEM algorithm by increasing $m$, but as we


have seen, the main problem with the SEM estimator is bias rather than variance, and the bias does not seem to decrease for moderate values of $m$. One could also hope to improve the SEM algorithm in this example by allowing more iterations in the Gibbs sampler used for simulating $(X_1, X_2)$ given that they are both negative. However, since only 7 observations have both $X_1$ and $X_2$ censored, the effect of this will probably not be very large.

Acknowledgment

I am grateful to Richard Gill for calling my attention to the subject of simulated EM algorithms.

References

DEMPSTER, A.P., LAIRD, N.M. AND RUBIN, D.B. (1977). Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm (with discussion). J. Roy. Statist. Soc. B 39, 1-38.

CELEUX, G. AND DIEBOLT, J. (1985). The SEM Algorithm: A Probabilistic Teacher Algorithm Derived From the EM Algorithm for the Mixture Problem. Computational Statistics Quarterly 2, 73-82.

DIEBOLT, J. AND IP, E.H.S. (1996). Stochastic EM: method and application. In: Markov Chain Monte Carlo in Practice (W.R. Gilks, S. Richardson, D.J. Spiegelhalter, eds.). Chapman & Hall, London.

LETAC, G. (1986). A contraction principle for certain Markov chains and its applications. Contemporary Mathematics 50, 263-273.

MCFADDEN, D. AND RUUD, P.A. (1994). Estimation by Simulation. Review of Economics and Statistics 76, 591-608.

MENG, X.-L. AND RUBIN, D.B. (1991). Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm. JASA 86, 899-909.

NIELSEN, S.F. (1997). The Imputation Maximization Algorithm. Manuscript.

PAKES, A. AND POLLARD, D. (1989). Simulation and the Asymptotics of Optimization Estimators. Econometrica 57, 1027-1057.

RUUD, P.A. (1991). Extensions of Estimation Methods Using the EM Algorithm. J. Econometrics 49, 305-341.

WU, C.F.J. (1983). On the Convergence Properties of the EM Algorithm. Ann. Statist. 11, 95-103.



Preprints 1996

COPIES OF PREPRINTS ARE OBTAINABLE FROM THE AUTHOR OR FROM THE INSTITUTE OF MATHEMATICAL STATISTICS, UNIVERSITETSPARKEN 5, DK-2100 COPENHAGEN Ø, DENMARK. TELEPHONE 45 35 32 08 99, FAX 45 35 32 07 72.

No. 1 Lando, David: Modelling Bonds and Derivatives with Default Risk.
No. 2 Johansen, Søren: Statistical analysis of some non-stationary time series.
No. 3 Jørgensen, C., Kongsted, H.C. and Rahbek, A.: Trend-Stationarity in the I(2) Cointegration Model.
No. 4 Paruolo, Paolo and Rahbek, Anders C.: Weak Exogeneity in I(2) Systems.
No. 5 Christensen, Peter Ove, Lando, David and Miltersen, Kristian R.: State-Dependent Realignments in Target Zone Currency Regimes.


Preprints 1997

COPIES OF PREPRINTS ARE OBTAINABLE FROM THE AUTHOR OR FROM THE INSTITUTE OF MATHEMATICAL STATISTICS, UNIVERSITETSPARKEN 5, DK-2100 COPENHAGEN Ø, DENMARK. TELEPHONE 45 35 32 08 99, FAX 45 35 32 07 72.

No. 1 Nielsen, Søren Feodor: Coarsening at Random.
No. 2 Nielsen, Søren Feodor: The Imputation Maximization Algorithm.
No. 3 Nielsen, Søren Feodor: On Simulated EM Algorithms.