
Page 1: Expectation-Maximization  (EM) Algorithm

Expectation-Maximization (EM) Algorithm

Md. Rezaul Karim

Professor Department of Statistics

University of Rajshahi, Bangladesh

September 21, 2012

Page 2: Expectation-Maximization  (EM) Algorithm

Basic Concept (1)

EM algorithm stands for "Expectation-Maximization" algorithm

A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE)

The general form was given in Dempster, Laird, and Rubin (1977), although the essence of the algorithm had appeared previously in various forms.

Page 3: Expectation-Maximization  (EM) Algorithm

Basic Concept (2)

The EM algorithm is a broadly applicable iterative procedure for computing maximum likelihood estimates in problems with incomplete data.

The EM algorithm consists of two conceptually distinct steps at each iteration:
o the Expectation or E-step, and
o the Maximization or M-step

Details can be found in Hartley (1958), Dempster et al. (1977), Little and Rubin (1987), and McLachlan and Krishnan (1997)

Page 4: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (1)

Suppose we have a model for a set of complete data $Y$, with associated density $f(Y \mid \theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$ is a vector of unknown parameters with parameter space $\Omega$.

We write $Y = (Y_{obs}, Y_{mis})$

where $Y_{obs}$ represents the observed part of $Y$ and $Y_{mis}$ denotes the missing values

$Y = (Y_{obs}, Y_{mis})$
• Complete data $Y$ (e.g., what we'd like to have!)
• Observed data $Y_{obs}$ (e.g., what we have)
• Missing data $Y_{mis}$ (e.g., incomplete/unobserved)

Page 5: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (2)

The EM algorithm is designed to find the value of $\theta$, denoted $\hat{\theta}$, that maximizes the incomplete-data log-likelihood

$\log L(\theta) = \log f(Y_{obs} \mid \theta)$, that is, the MLE of $\theta$ based on the observed data $Y_{obs}$.
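Here the observed-data density is obtained from the complete-data density by integrating out the missing values (sums replacing the integral when $Y_{mis}$ is discrete):

$$f(Y_{obs} \mid \theta) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}$$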

The EM algorithm starts with an initial value $\theta^{(0)}$. Suppose that $\theta^{(k)}$ denotes the estimate of $\theta$ at the $k$th iteration; then the $(k+1)$st iteration can be described in two steps as follows:

Page 6: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (3)

E-step: Find the conditional expectation of the complete-data log likelihood given the observed data and $\theta^{(k)}$:

$$Q(\theta \mid \theta^{(k)}) = E\left[\log L(\theta \mid Y_{obs}, Y_{mis}) \,\middle|\, Y_{obs}, \theta^{(k)}\right] = \int \log L(\theta \mid Y)\, f(Y_{mis} \mid Y_{obs}, \theta^{(k)})\, dY_{mis}$$

which, in the case of linear exponential family, amounts to estimating the sufficient statistics for the complete data.

M-step: Determine $\theta^{(k+1)}$ to be a value of $\theta$ that maximizes $Q(\theta \mid \theta^{(k)})$

The MLE of $\theta$ is found by iterating between the E and M steps until a convergence criterion is met.
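As a schematic illustration (not specific to any one model), the whole iteration can be organized as a short R driver; the functions e_step() and m_step() below are hypothetical placeholders that the user supplies for a particular model:

em.driver = function(y.obs, theta0, e_step, m_step, tol = 1e-7, max.iter = 1000) {
  # Generic EM sketch: e_step() should return the quantities needed to form
  # Q(theta | theta_k) (e.g., expected sufficient statistics), and m_step()
  # should return the theta that maximizes Q. Both are user-supplied here.
  theta = theta0
  for (k in 1:max.iter) {
    estep.out = e_step(theta, y.obs)        # E-step
    theta.new = m_step(estep.out, y.obs)    # M-step
    if (max(abs(theta.new - theta)) < tol)  # convergence criterion
      return(list(theta = theta.new, iterations = k))
    theta = theta.new
  }
  warning("EM did not converge within max.iter iterations")
  list(theta = theta, iterations = max.iter)
}

For the multinomial example that follows, these two roles are played by the E.step() and M.step() functions given with the R code later in the slides.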

Page 7: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (4)

[Diagram: the EM cycle. An initial guess of the unknown parameters, together with the observed data structure, feeds the E-step, which produces a guess of the unknown/hidden data structure and the Q function; the M-step then updates the guess of the unknown parameters, and the cycle repeats.]

Page 8: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (5)

In some cases, it may not be numerically feasible to find the value of $\theta$ that globally maximizes the function $Q(\theta \mid \theta^{(k)})$ in the M-step.

Page 9: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (6)

In such situations, a Generalized EM (GEM) algorithm (Dempster et al. 1977) is used to choose $\theta^{(k+1)}$ in the M-step such that the condition

$$Q(\theta^{(k+1)} \mid \theta^{(k)}) \geq Q(\theta^{(k)} \mid \theta^{(k)})$$

holds. For any EM or GEM algorithm, the change from $\theta^{(k)}$ to $\theta^{(k+1)}$ increases the likelihood; that is,

$$\log L(\theta^{(k+1)}) \geq \log L(\theta^{(k)})$$

which follows from the definition of GEM and Jensen's inequality (see, e.g., Rao 1972, p. 47).
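In more detail, the standard argument behind this inequality (as in Dempster et al. 1977) writes the observed-data log-likelihood as

$$\log L(\theta) = Q(\theta \mid \theta^{(k)}) - H(\theta \mid \theta^{(k)}), \qquad H(\theta \mid \theta^{(k)}) = E\left[\log f(Y_{mis} \mid Y_{obs}, \theta) \,\middle|\, Y_{obs}, \theta^{(k)}\right]$$

Jensen's inequality gives $H(\theta \mid \theta^{(k)}) \le H(\theta^{(k)} \mid \theta^{(k)})$ for every $\theta$, so

$$\log L(\theta^{(k+1)}) - \log L(\theta^{(k)}) = \left[Q(\theta^{(k+1)} \mid \theta^{(k)}) - Q(\theta^{(k)} \mid \theta^{(k)})\right] - \left[H(\theta^{(k+1)} \mid \theta^{(k)}) - H(\theta^{(k)} \mid \theta^{(k)})\right] \ge 0.$$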

Page 10: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (7)

This fact implies that the log-likelihood, $\log L(\theta)$, increases monotonically on any iteration sequence generated by the EM algorithm, which is the fundamental property for the convergence of the algorithm.

Detailed properties of the algorithm, including the convergence properties, are given in
o Dempster et al. (1977),
o Wu (1983),
o Redner and Walker (1984), and
o McLachlan and Krishnan (1997)

Page 11: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (8)

Methods for obtaining the asymptotic variance-covariance matrix of the EM-computed estimator are derived by
o Meng and Rubin (1991),
o Louis (1982), and
o Oakes (1999)

Page 12: Expectation-Maximization  (EM) Algorithm

Multinomial Example (1)

The data relate to a problem of estimation of linkage in genetics discussed by Rao (1973, pp. 368-369).

One considers data in which 197 animals are distributed multinomially into four categories with cell probabilities {1/2 + θ/4, (1−θ)/4, (1−θ)/4, θ/4} for some unknown θ ∈ [0, 1]. The observed numbers in the cells were $Y_{obs} = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$.

Observed data:  y1 = 125,  y2 = 18,  y3 = 20,  y4 = 34
Probability:    1/2 + θ/4,  (1−θ)/4,  (1−θ)/4,  θ/4

Page 13: Expectation-Maximization  (EM) Algorithm

Multinomial Example (2)

The density of the observed data is

$$f(y_{obs} \mid \theta) = \frac{n!}{y_1!\, y_2!\, y_3!\, y_4!} \left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}$$

The log-likelihood function for θ is therefore, apart from an additive term not involving θ,

$$l(\theta \mid y_{obs}) = y_1 \log(2+\theta) + (y_2 + y_3)\log(1-\theta) + y_4 \log(\theta)$$

Differentiating w.r.t. θ, we have that

$$\frac{\partial\, l(\theta \mid y_{obs})}{\partial \theta} = \frac{y_1}{2+\theta} - \frac{y_2 + y_3}{1-\theta} + \frac{y_4}{\theta}$$
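For reference, setting this derivative to zero and clearing denominators (multiplying by $\theta(1-\theta)(2+\theta)$) with the observed counts $y_1 = 125$, $y_2 + y_3 = 38$, $y_4 = 34$ gives a quadratic equation, so the MLE is available in closed form:

$$197\,\theta^2 - 15\,\theta - 68 = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{15 + \sqrt{15^2 + 4 \cdot 197 \cdot 68}}{2 \cdot 197} = \frac{15 + \sqrt{53809}}{394} \approx 0.6268$$

which agrees with the value to which the EM iterations converge below.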

Page 14: Expectation-Maximization  (EM) Algorithm

Multinomial Example (3)

Although the log-likelihood can be maximized explicitly, we use the example to illustrate the EM algorithm.

To view the problem as an unobserved data problem we would think of it as a multinomial experiment with five categories with observations

$Y = (y_{11}, y_{12}, y_2, y_3, y_4)$,

with cell probabilities $\left(\tfrac{1}{2},\ \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right)$.

That is, we split the first category into two, and we can only observe the sum $y_1 = y_{11} + y_{12}$. Then $y_{11}$ and $y_{12}$ are considered as the unobservable variables.

Page 15: Expectation-Maximization  (EM) Algorithm

Multinomial Example (4)

Observed data:  y1 = 125 (= y11 + y12),  y2 = 18,  y3 = 20,  y4 = 34    (n = 197)
Missing data:   y11, y12
Probability:    y11: 1/2,  y12: θ/4,  y2: (1−θ)/4,  y3: (1−θ)/4,  y4: θ/4

Page 16: Expectation-Maximization  (EM) Algorithm

Multinomial Example (5)

The density of the complete data is then

$$f(y \mid \theta) = \frac{n!}{y_{11}!\, y_{12}!\, y_2!\, y_3!\, y_4!} \left(\frac{1}{2}\right)^{y_{11}} \left(\frac{\theta}{4}\right)^{y_{12}} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}$$

and the log-likelihood is (apart from a term not involving θ)

$$l(\theta \mid y) = (y_{12} + y_4)\log(\theta) + (y_2 + y_3)\log(1-\theta)$$

Since $y_{12}$ is unobservable, we cannot maximize this directly. This obstacle is overcome by the E-step, which handles the problem of filling in for the unobservable data by averaging the complete-data log likelihood over its conditional distribution given the observed data.

Page 17: Expectation-Maximization  (EM) Algorithm

Multinomial Example (6)

Let $\theta^{(0)}$ be an initial guess for θ. The E-step requires computation of

$$\begin{aligned} Q(\theta \mid \theta^{(0)}) &= E\left[l_c(\theta \mid y) \,\middle|\, y_{obs}, \theta^{(0)}\right] \\ &= E\left[l_c(\theta \mid y_{11}, y_{12}, y_2, y_3, y_4) \,\middle|\, y_1, y_2, y_3, y_4, \theta^{(0)}\right] \\ &= \left(E\left[y_{12} \mid y_1, \theta^{(0)}\right] + y_4\right)\log(\theta) + (y_2 + y_3)\log(1-\theta) \end{aligned}$$

Thus, we need to compute the conditional expectation of $y_{12}$ given $y_1$ and $\theta^{(0)}$. Given $y_1$, $y_{12}$ follows a Binomial distribution (each of the $y_1$ first-category observations independently belongs to the $\theta/4$ sub-cell rather than the $1/2$ sub-cell) with sample size $y_1$ and probability parameter

$$p = \frac{\theta^{(0)}/4}{1/2 + \theta^{(0)}/4}$$

(Recall from the table that $y_1 = 125$ is split into $y_{11}$, with cell probability $1/2$, and $y_{12}$, with cell probability $\theta/4$.)

Page 18: Expectation-Maximization  (EM) Algorithm

Multinomial Example (7)

Hence, the expected value is

$$E\left[y_{12} \mid y_1, \theta^{(0)}\right] = y_1 \cdot \frac{\theta^{(0)}/4}{1/2 + \theta^{(0)}/4} = y_{12}^{(0)} \quad \text{(say)}$$

and the expression for $Q(\theta \mid \theta^{(0)})$ is

$$Q(\theta \mid \theta^{(0)}) = (y_{12}^{(0)} + y_4)\log(\theta) + (y_2 + y_3)\log(1-\theta)$$

In the M-step we maximize this with respect to θ to get

$$\theta^{(1)} = \frac{y_{12}^{(0)} + y_4}{y_{12}^{(0)} + y_2 + y_3 + y_4}$$

Page 19: Expectation-Maximization  (EM) Algorithm

Multinomial Example (8)

Then, iterating this finally gives us the estimate for θ. Summarizing, we get the iterations

$$\theta^{(i+1)} = \frac{y_{12}^{(i)} + y_4}{y_{12}^{(i)} + y_2 + y_3 + y_4}, \qquad \text{where} \quad y_{12}^{(i)} = y_1 \cdot \frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}$$
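For instance, starting from the initial guess $\theta^{(0)} = 0.5$ used in the R run below, the first iteration gives

$$y_{12}^{(0)} = 125 \cdot \frac{0.5/4}{1/2 + 0.5/4} = 125 \cdot \frac{0.125}{0.625} = 25, \qquad \theta^{(1)} = \frac{25 + 34}{25 + 18 + 20 + 34} = \frac{59}{97} \approx 0.6082474,$$

matching the value at iteration k = 1 in the output table shown later.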

Page 20: Expectation-Maximization  (EM) Algorithm

Flowchart for EM Algorithm

[Flowchart: Set $k = 0$; initialize $\theta^{(k)}$. E-step: compute $Q(\theta \mid \theta^{(k)})$. M-step: maximize $Q(\theta \mid \theta^{(k)})$ to obtain $\theta^*$; set $\theta^{(k+1)} = \theta^*$. Convergence check: is $\theta^{(k+1)} \approx \theta^{(k)}$? Yes: stop with $\hat{\theta} = \theta^{(k+1)}$. No: set $k = k + 1$ and return to the E-step.]

Page 21: Expectation-Maximization  (EM) Algorithm

R function for the Example: (1) (y1, y2, y3, y4 are the observed frequencies)

EM.Algo = function(y1, y2, y3, y4, tol, start0) {
  n = y1 + y2 + y3 + y4
  theta.current = start0
  theta.last = 0
  theta = theta.current
  while (abs(theta.last - theta) > tol) {
    y12 = E.step(theta.current, y1)        # E-step: expected y12 given y1 and current theta
    theta = M.step(y12, y2, y3, y4, n)     # M-step: updated theta
    theta.last = theta.current
    theta.current = theta
    log.lik = y1*log(2 + theta.current) + (y2 + y3)*log(1 - theta.current) +
              y4*log(theta.current)
    cat(c(theta.current, log.lik), '\n')
  }
}

Page 22: Expectation-Maximization  (EM) Algorithm

R function for the Example (2)

M.step = function(y12, y2, y3, y4, n) {
  return((y12 + y4) / (y12 + y2 + y3 + y4))   # theta maximizing Q
}

E.step = function(theta.current, y1) {
  y12 = y1 * (theta.current/4) / (0.5 + theta.current/4)   # E[y12 | y1, theta]
  return(c(y12))
}

# Results:
EM.Algo(125, 18, 20, 34, 10^(-7), 0.50)

Page 23: Expectation-Maximization  (EM) Algorithm

R function for the Example (3)

Iteration (k)    θ^(k)        log L(θ^(k))
0                0.5000000    64.62974
1                0.6082474    67.32017
2                0.6243210    67.38292
3                0.6264889    67.38408
4                0.6267773    67.38410
5                0.6268156    67.38410
6                0.6268207    67.38410
7                0.6268214    67.38410
8                0.6268215    67.38410

$\hat{\theta} = 0.6268215$

Page 24: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (1)

• In an EM algorithm, the E-step may be difficult to implement because of the difficulty in computing the conditional expectation of the complete-data log likelihood.

• Wei and Tanner (1990a, 1990b) suggest a Monte Carlo approach: on the E-step of the (k+1)th iteration, simulate the missing data Z from the conditional distribution $k(z \mid y, \theta^{(k)})$.

Page 25: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (2)

• Then one maximizes the approximate conditional expectation of the complete-data log likelihood

$$Q(\theta \mid \theta^{(k)}) \approx \frac{1}{m} \sum_{j=1}^{m} \log L_c\left(\theta \mid y_{obs},\, y_{mis}^{(j)}\right)$$

where $y_{mis}^{(1)}, \ldots, y_{mis}^{(m)}$ are the simulated draws of the missing data.

• The limiting form of this as m tends to ∞ is the actual $Q(\theta \mid \theta^{(k)})$

Page 26: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (3)

Application of MCEM in the previous example:

• A Monte Carlo EM solution would replace the expectation

$$y_{12}^{(i)} = y_1 \cdot \frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}$$

with the empirical average

$$y_{12}^{(i)} = \frac{1}{m} \sum_{j=1}^{m} z_j$$

where the $z_j$ are simulated from a binomial distribution with size $y_1$ and probability

$$\frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}$$

Page 27: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (4)

Application of MCEM in the previous example:

• The R code for the E-step becomes

E.step = function(theta.current, y1) {
  bprob = (theta.current/4) / (0.5 + theta.current/4)   # binomial probability
  zm = rbinom(10000, y1, bprob)                          # m = 10000 simulated draws
  y12 = sum(zm)/10000                                    # empirical average
  return(c(y12))
}
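With this stochastic E-step, the earlier EM.Algo driver can be reused as is. A minimal usage sketch, with a seed and a looser tolerance chosen here only for illustration (Monte Carlo noise in y12 makes the very tight 10^(-7) tolerance used earlier hard to reach):

# Illustrative call: reuse EM.Algo with the Monte Carlo E-step above.
# The seed and the 10^(-4) tolerance are assumptions for this sketch.
set.seed(2012)
EM.Algo(125, 18, 20, 34, 10^(-4), 0.50)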

Page 28: Expectation-Maximization  (EM) Algorithm


Applications of EM algorithm (1)

The EM algorithm is frequently used for
– Data clustering (the assignment of a set of observations into subsets, called clusters, so that observations in the same cluster are similar in some sense), used in many fields, including machine learning, computer vision, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics
– Natural language processing (NLP is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages)

Page 29: Expectation-Maximization  (EM) Algorithm


Applications of EM algorithm (2)

– Psychometrics (the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits)
– Medical image reconstruction, especially in positron emission tomography (PET) and single photon emission computed tomography (SPECT)

Page 30: Expectation-Maximization  (EM) Algorithm


Applications of EM algorithm (3)

More applications regarding data analysis examples are
– Multivariate Data with Missing Values
  o Example: Bivariate Normal Data with Missing Values
– Least Squares with Missing Data
  o Example: Linear Regression with Missing Dependent Values
  o Example: Missing Values in a Latin Square Design
– Example: Multinomial with Complex Cell Structure
– Example: Analysis of PET and SPECT Data
– Example: Mixture distributions
– Example: Grouped, Censored and Truncated Data
  o Example: Grouped Log Normal Data
  o Example: Lifetime distributions for censored data

Page 31: Expectation-Maximization  (EM) Algorithm


Advantages of EM algorithm (1)

– The EM algorithm is numerically stable, with each EM iteration increasing the likelihood
– Under fairly general conditions, the EM algorithm has reliable global convergence (depends on initial value and likelihood!). Convergence is nearly always to a local maximizer.
– The EM algorithm is typically easily implemented, because it relies on complete-data computations
– The EM algorithm is generally easy to program, since no evaluation of the likelihood nor its derivatives is involved

Page 32: Expectation-Maximization  (EM) Algorithm


Advantages of EM algorithm (2)

– The EM algorithm requires small storage space and can generally be carried out on a small computer (it does not have to store the information matrix nor its inverse at any iteration).
– The M-step can often be carried out using standard statistical packages in situations where the complete-data MLE's do not exist in closed form.
– By watching the monotone increase in likelihood over iterations, it is easy to monitor convergence and programming errors.
– The EM algorithm can be used to provide estimated values of the "missing" data.

Page 33: Expectation-Maximization  (EM) Algorithm


Criticisms of EM algorithm

– Unlike Fisher's scoring method, it does not have an inbuilt procedure for producing an estimate of the covariance matrix of the parameter estimates.
– The EM algorithm may converge slowly, even in some seemingly innocuous problems and in problems where there is too much 'incomplete information'.
– The EM algorithm, like Newton-type methods, does not guarantee convergence to the global maximum when there are multiple maxima (in this case, the estimate obtained depends upon the initial value).
– In some problems, the E-step may be analytically intractable, although in such situations there is the possibility of effecting it via a Monte Carlo approach.

Page 34: Expectation-Maximization  (EM) Algorithm


References (1)

1. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Royal Statist Soc - B 39:1-38

2. Hartley HO (1958) Maximum likelihood estimation from incomplete data. Biometrics 14:174-194

3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York

4. Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J Royal Statist Soc - B 44:226–233

5. McLachlan GJ, Krishnan T (1997) The EM Algorithm and Extensions. John Wiley & Sons, Inc., New York

Page 35: Expectation-Maximization  (EM) Algorithm


References (2)

6. Meng XL, Rubin DB (1991) Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J Am Statist Assoc 86:899-909

7. Oakes D (1999) Direct calculation of the information matrix via the EM algorithm. J Royal Statist Soc - B 61:479-482

8. Rao CR (1972) Linear Statistical Inference and its Applications. John Wiley & Sons, Inc., New York

9. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195-239

Page 36: Expectation-Maximization  (EM) Algorithm


References (3)

10. Wei GCG, Tanner MA (1990a) A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J Am Statist Assoc 85:699-704

11. Wei GCG, Tanner MA (1990b) Posterior computations for censored regression data. J Am Statist Assoc 85:829-839

12. Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Statist 11:95-103

Page 37: Expectation-Maximization  (EM) Algorithm


Thank You