
Page 1: Expectation-Maximization  (EM) Algorithm

Expectation-Maximization (EM) Algorithm

Md. Rezaul Karim

Professor Department of Statistics

University of Rajshahi, Bangladesh

September 21, 2012

Page 2: Expectation-Maximization  (EM) Algorithm

Basic Concept (1)

EM algorithm stands for "Expectation-Maximization" algorithm

A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE)

The general form was given in Dempster, Laird, and Rubin (1977), although the essence of the algorithm had appeared previously in various forms.

Page 3: Expectation-Maximization  (EM) Algorithm

Basic Concept (2)

The EM algorithm is a broadly applicable iterative procedure for computing maximum likelihood estimates in problems with incomplete data.

The EM algorithm consists of two conceptually distinct steps at each iteration:
o the Expectation or E-step, and
o the Maximization or M-step

Details can be found in Hartley (1958), Dempster et al. (1977), Little and Rubin (1987), and McLachlan and Krishnan (1997)

Page 4: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (1)

Suppose we have a model for a set of complete data $Y$, with associated density $f(Y \mid \theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$ is a vector of unknown parameters with parameter space $\Omega$.

We write $Y = (Y_{obs}, Y_{mis})$

where $Y_{obs}$ represents the observed part of $Y$ and $Y_{mis}$ denotes the missing values

$Y = (Y_{obs}, Y_{mis})$
• Complete data $Y$ (e.g., what we'd like to have!)
• Observed data $Y_{obs}$ (e.g., what we have)
• Missing data $Y_{mis}$ (e.g., incomplete/unobserved)

Page 5: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (2)

The EM algorithm is designed to find the value of $\theta$, denoted $\hat{\theta}$, that maximizes the incomplete-data log-likelihood

$\log L(\theta) = \log f(Y_{obs} \mid \theta)$, that is, the MLE of $\theta$ based on the observed data $Y_{obs}$.
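Here the observed-data density is obtained from the complete-data density by integrating out the missing values (sums replacing the integral when $Y_{mis}$ is discrete):

$$f(Y_{obs} \mid \theta) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}$$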

The EM algorithm starts with an initial value $\theta^{(0)}$. Suppose that $\theta^{(k)}$ denotes the estimate of $\theta$ at the $k$th iteration; then the $(k+1)$st iteration can be described in two steps as follows:

Page 6: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (3)

E-step: Find the conditional expectation of the complete-data log likelihood given the observed data and $\theta^{(k)}$:

$$Q(\theta \mid \theta^{(k)}) = E\left[\log L(\theta \mid Y_{obs}, Y_{mis}) \,\middle|\, Y_{obs}, \theta^{(k)}\right] = \int \log L(\theta \mid Y)\, f(Y_{mis} \mid Y_{obs}, \theta^{(k)})\, dY_{mis}$$

which, in the case of linear exponential family, amounts to estimating the sufficient statistics for the complete data.

M-step: Determine $\theta^{(k+1)}$ to be a value of $\theta$ that maximizes $Q(\theta \mid \theta^{(k)})$

The MLE of $\theta$ is found by iterating between the E and M steps until a convergence criterion is met.
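As a schematic illustration (not specific to any one model), the whole iteration can be organized as a short R driver; the functions e_step() and m_step() below are hypothetical placeholders that the user supplies for a particular model:

em.driver = function(y.obs, theta0, e_step, m_step, tol = 1e-7, max.iter = 1000) {
  # Generic EM sketch: e_step() should return the quantities needed to form
  # Q(theta | theta_k) (e.g., expected sufficient statistics), and m_step()
  # should return the theta that maximizes Q. Both are user-supplied here.
  theta = theta0
  for (k in 1:max.iter) {
    estep.out = e_step(theta, y.obs)        # E-step
    theta.new = m_step(estep.out, y.obs)    # M-step
    if (max(abs(theta.new - theta)) < tol)  # convergence criterion
      return(list(theta = theta.new, iterations = k))
    theta = theta.new
  }
  warning("EM did not converge within max.iter iterations")
  list(theta = theta, iterations = max.iter)
}

For the multinomial example that follows, these two roles are played by the E.step() and M.step() functions given with the R code later in the slides.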

Page 7: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (4)

[Diagram: the EM cycle. An initial guess of the unknown parameters, together with the observed data structure, feeds the E-step, which produces a guess of the unknown/hidden data structure and the Q function; the M-step then updates the guess of the unknown parameters, and the cycle repeats.]

Page 8: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (5)

In some cases, it may not be numerically feasible to find the value of $\theta$ that globally maximizes the function $Q(\theta \mid \theta^{(k)})$ in the M-step.

Page 9: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (6)

In such situations, a Generalized EM (GEM) algorithm (Dempster et al. 1977) is used to choose $\theta^{(k+1)}$ in the M-step such that the condition

$$Q(\theta^{(k+1)} \mid \theta^{(k)}) \geq Q(\theta^{(k)} \mid \theta^{(k)})$$

holds. For any EM or GEM algorithm, the change from $\theta^{(k)}$ to $\theta^{(k+1)}$ increases the likelihood; that is,

$$\log L(\theta^{(k+1)}) \geq \log L(\theta^{(k)})$$

which follows from the definition of GEM and Jensen's inequality (see, e.g., Rao 1972, p. 47).
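In more detail, the standard argument behind this inequality (as in Dempster et al. 1977) writes the observed-data log-likelihood as

$$\log L(\theta) = Q(\theta \mid \theta^{(k)}) - H(\theta \mid \theta^{(k)}), \qquad H(\theta \mid \theta^{(k)}) = E\left[\log f(Y_{mis} \mid Y_{obs}, \theta) \,\middle|\, Y_{obs}, \theta^{(k)}\right]$$

Jensen's inequality gives $H(\theta \mid \theta^{(k)}) \le H(\theta^{(k)} \mid \theta^{(k)})$ for every $\theta$, so

$$\log L(\theta^{(k+1)}) - \log L(\theta^{(k)}) = \left[Q(\theta^{(k+1)} \mid \theta^{(k)}) - Q(\theta^{(k)} \mid \theta^{(k)})\right] - \left[H(\theta^{(k+1)} \mid \theta^{(k)}) - H(\theta^{(k)} \mid \theta^{(k)})\right] \ge 0.$$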

Page 10: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (7)

This fact implies that the log-likelihood, $\log L(\theta)$, increases monotonically on any iteration sequence generated by the EM algorithm, which is the fundamental property for the convergence of the algorithm.

Detailed properties of the algorithm, including the convergence properties, are given in
o Dempster et al. (1977),
o Wu (1983),
o Redner and Walker (1984), and
o McLachlan and Krishnan (1997)

Page 11: Expectation-Maximization  (EM) Algorithm

Formulation of the EM Algorithm (8)

Methods for obtaining the asymptotic variance-covariance matrix of the EM-computed estimator are derived by
o Meng and Rubin (1991),
o Louis (1982), and
o Oakes (1999)

Page 12: Expectation-Maximization  (EM) Algorithm

Multinomial Example (1)

The data relate to a problem of estimation of linkage in genetics discussed by Rao (1973, pp. 368-369).

One considers data in which 197 animals are distributed multinomially into four categories with cell probabilities {1/2 + θ/4, (1−θ)/4, (1−θ)/4, θ/4} for some unknown θ ∈ [0, 1]. The observed numbers in the cells were $Y_{obs} = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$.

Observed data:  y1 = 125,  y2 = 18,  y3 = 20,  y4 = 34
Probability:    1/2 + θ/4,  (1−θ)/4,  (1−θ)/4,  θ/4

Page 13: Expectation-Maximization  (EM) Algorithm

Multinomial Example (2)

The density of the observed data is

$$f(y_{obs} \mid \theta) = \frac{n!}{y_1!\, y_2!\, y_3!\, y_4!} \left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}$$

The log-likelihood function for θ is therefore, apart from an additive term not involving θ,

$$l(\theta \mid y_{obs}) = y_1 \log(2+\theta) + (y_2 + y_3)\log(1-\theta) + y_4 \log(\theta)$$

Differentiating w.r.t. θ, we have that

$$\frac{\partial\, l(\theta \mid y_{obs})}{\partial \theta} = \frac{y_1}{2+\theta} - \frac{y_2 + y_3}{1-\theta} + \frac{y_4}{\theta}$$
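For reference, setting this derivative to zero and clearing denominators (multiplying by $\theta(1-\theta)(2+\theta)$) with the observed counts $y_1 = 125$, $y_2 + y_3 = 38$, $y_4 = 34$ gives a quadratic equation, so the MLE is available in closed form:

$$197\,\theta^2 - 15\,\theta - 68 = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{15 + \sqrt{15^2 + 4 \cdot 197 \cdot 68}}{2 \cdot 197} = \frac{15 + \sqrt{53809}}{394} \approx 0.6268$$

which agrees with the value to which the EM iterations converge below.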

Page 14: Expectation-Maximization  (EM) Algorithm

Multinomial Example (3)

Although the log-likelihood can be maximized explicitly, we use the example to illustrate the EM algorithm.

To view the problem as an unobserved data problem we would think of it as a multinomial experiment with five categories with observations

$Y = (y_{11}, y_{12}, y_2, y_3, y_4)$,

with cell probabilities $\left(\tfrac{1}{2},\ \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right)$.

That is, we split the first category into two, and we can only observe the sum $y_1 = y_{11} + y_{12}$. Then $y_{11}$ and $y_{12}$ are considered as the unobservable variables.

Page 15: Expectation-Maximization  (EM) Algorithm

Multinomial Example (4)

Observed data:  y1 = 125 (= y11 + y12),  y2 = 18,  y3 = 20,  y4 = 34    (n = 197)
Missing data:   y11, y12
Probability:    y11: 1/2,  y12: θ/4,  y2: (1−θ)/4,  y3: (1−θ)/4,  y4: θ/4

Page 16: Expectation-Maximization  (EM) Algorithm

Multinomial Example (5)

The density of the complete data is then

$$f(y \mid \theta) = \frac{n!}{y_{11}!\, y_{12}!\, y_2!\, y_3!\, y_4!} \left(\frac{1}{2}\right)^{y_{11}} \left(\frac{\theta}{4}\right)^{y_{12}} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}$$

and the log-likelihood is (apart from a term not involving θ)

$$l(\theta \mid y) = (y_{12} + y_4)\log(\theta) + (y_2 + y_3)\log(1-\theta)$$

Since $y_{12}$ is unobservable, we cannot maximize this directly. This obstacle is overcome by the E-step, which handles the problem of filling in for the unobservable data by averaging the complete-data log likelihood over its conditional distribution given the observed data.

Page 17: Expectation-Maximization  (EM) Algorithm

Multinomial Example (6)

Let $\theta^{(0)}$ be an initial guess for θ. The E-step requires computation of

$$\begin{aligned} Q(\theta \mid \theta^{(0)}) &= E\left[l_c(\theta \mid y) \,\middle|\, y_{obs}, \theta^{(0)}\right] \\ &= E\left[l_c(\theta \mid y_{11}, y_{12}, y_2, y_3, y_4) \,\middle|\, y_1, y_2, y_3, y_4, \theta^{(0)}\right] \\ &= \left(E\left[y_{12} \mid y_1, \theta^{(0)}\right] + y_4\right)\log(\theta) + (y_2 + y_3)\log(1-\theta) \end{aligned}$$

Thus, we need to compute the conditional expectation of $y_{12}$ given $y_1$ and $\theta^{(0)}$. Given $y_1$, $y_{12}$ follows a Binomial distribution (each of the $y_1$ first-category observations independently belongs to the $\theta/4$ sub-cell rather than the $1/2$ sub-cell) with sample size $y_1$ and probability parameter

$$p = \frac{\theta^{(0)}/4}{1/2 + \theta^{(0)}/4}$$

(Recall from the table that $y_1 = 125$ is split into $y_{11}$, with cell probability $1/2$, and $y_{12}$, with cell probability $\theta/4$.)

Page 18: Expectation-Maximization  (EM) Algorithm

Multinomial Example (7)

Hence, the expected value is

$$E\left[y_{12} \mid y_1, \theta^{(0)}\right] = y_1 \cdot \frac{\theta^{(0)}/4}{1/2 + \theta^{(0)}/4} = y_{12}^{(0)} \quad \text{(say)}$$

and the expression for $Q(\theta \mid \theta^{(0)})$ is

$$Q(\theta \mid \theta^{(0)}) = (y_{12}^{(0)} + y_4)\log(\theta) + (y_2 + y_3)\log(1-\theta)$$

In the M-step we maximize this with respect to θ to get

$$\theta^{(1)} = \frac{y_{12}^{(0)} + y_4}{y_{12}^{(0)} + y_2 + y_3 + y_4}$$

Page 19: Expectation-Maximization  (EM) Algorithm

Multinomial Example (8)

Then, iterating this finally gives us the estimate for θ. Summarizing, we get the iterations

$$\theta^{(i+1)} = \frac{y_{12}^{(i)} + y_4}{y_{12}^{(i)} + y_2 + y_3 + y_4}, \qquad \text{where} \quad y_{12}^{(i)} = y_1 \cdot \frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}$$
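For instance, starting from the initial guess $\theta^{(0)} = 0.5$ used in the R run below, the first iteration gives

$$y_{12}^{(0)} = 125 \cdot \frac{0.5/4}{1/2 + 0.5/4} = 125 \cdot \frac{0.125}{0.625} = 25, \qquad \theta^{(1)} = \frac{25 + 34}{25 + 18 + 20 + 34} = \frac{59}{97} \approx 0.6082474,$$

matching the value at iteration k = 1 in the output table shown later.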

Page 20: Expectation-Maximization  (EM) Algorithm

Flowchart for EM Algorithm

[Flowchart: Set $k = 0$; initialize $\theta^{(k)}$. E-step: compute $Q(\theta \mid \theta^{(k)})$. M-step: maximize $Q(\theta \mid \theta^{(k)})$ to obtain $\theta^*$; set $\theta^{(k+1)} = \theta^*$. Convergence check: is $\theta^{(k+1)} \approx \theta^{(k)}$? Yes: stop with $\hat{\theta} = \theta^{(k+1)}$. No: set $k = k + 1$ and return to the E-step.]

Page 21: Expectation-Maximization  (EM) Algorithm

R function for the Example: (1) (y1, y2, y3, y4 are the observed frequencies)

EM.Algo = function(y1, y2, y3, y4, tol, start0) {
  n = y1 + y2 + y3 + y4
  theta.current = start0
  theta.last = 0
  theta = theta.current
  while (abs(theta.last - theta) > tol) {
    y12 = E.step(theta.current, y1)        # E-step: expected y12 given y1 and current theta
    theta = M.step(y12, y2, y3, y4, n)     # M-step: updated theta
    theta.last = theta.current
    theta.current = theta
    log.lik = y1*log(2 + theta.current) + (y2 + y3)*log(1 - theta.current) +
              y4*log(theta.current)
    cat(c(theta.current, log.lik), '\n')
  }
}

Page 22: Expectation-Maximization  (EM) Algorithm

R function for the Example (2)

M.step = function(y12, y2, y3, y4, n) {
  return((y12 + y4) / (y12 + y2 + y3 + y4))   # theta maximizing Q
}

E.step = function(theta.current, y1) {
  y12 = y1 * (theta.current/4) / (0.5 + theta.current/4)   # E[y12 | y1, theta]
  return(c(y12))
}

# Results:
EM.Algo(125, 18, 20, 34, 10^(-7), 0.50)

Page 23: Expectation-Maximization  (EM) Algorithm

R function for the Example (3)

Iteration (k)    θ^(k)        log L(θ^(k))
0                0.5000000    64.62974
1                0.6082474    67.32017
2                0.6243210    67.38292
3                0.6264889    67.38408
4                0.6267773    67.38410
5                0.6268156    67.38410
6                0.6268207    67.38410
7                0.6268214    67.38410
8                0.6268215    67.38410

$\hat{\theta} = 0.6268215$

Page 24: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (1)

• In an EM algorithm, the E-step may be difficult to implement because of the difficulty in computing the conditional expectation of the complete-data log likelihood.

• Wei and Tanner (1990a, 1990b) suggest a Monte Carlo approach: on the E-step of the (k+1)th iteration, simulate the missing data Z from the conditional distribution $k(z \mid y, \theta^{(k)})$.

Page 25: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (2)

• Then one maximizes the approximate conditional expectation of the complete-data log likelihood

$$Q(\theta \mid \theta^{(k)}) \approx \frac{1}{m} \sum_{j=1}^{m} \log L_c\left(\theta \mid y_{obs},\, y_{mis}^{(j)}\right)$$

where $y_{mis}^{(1)}, \ldots, y_{mis}^{(m)}$ are the simulated draws of the missing data.

• The limiting form of this as m tends to ∞ is the actual $Q(\theta \mid \theta^{(k)})$

Page 26: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (3)

Application of MCEM in the previous example:

• A Monte Carlo EM solution would replace the expectation

$$y_{12}^{(i)} = y_1 \cdot \frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}$$

with the empirical average

$$y_{12}^{(i)} = \frac{1}{m} \sum_{j=1}^{m} z_j$$

where the $z_j$ are simulated from a binomial distribution with size $y_1$ and probability

$$\frac{\theta^{(i)}/4}{1/2 + \theta^{(i)}/4}$$

Page 27: Expectation-Maximization  (EM) Algorithm


Monte Carlo EM (4)

Application of MCEM in the previous example:

• The R code for the E-step becomes

E.step = function(theta.current, y1) {
  bprob = (theta.current/4) / (0.5 + theta.current/4)   # binomial probability
  zm = rbinom(10000, y1, bprob)                          # m = 10000 simulated draws
  y12 = sum(zm)/10000                                    # empirical average
  return(c(y12))
}
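With this stochastic E-step, the earlier EM.Algo driver can be reused as is. A minimal usage sketch, with a seed and a looser tolerance chosen here only for illustration (Monte Carlo noise in y12 makes the very tight 10^(-7) tolerance used earlier hard to reach):

# Illustrative call: reuse EM.Algo with the Monte Carlo E-step above.
# The seed and the 10^(-4) tolerance are assumptions for this sketch.
set.seed(2012)
EM.Algo(125, 18, 20, 34, 10^(-4), 0.50)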

Page 28: Expectation-Maximization  (EM) Algorithm


Applications of EM algorithm (1)

The EM algorithm is frequently used for
– Data clustering (the assignment of a set of observations into subsets, called clusters, so that observations in the same cluster are similar in some sense), used in many fields, including machine learning, computer vision, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics
– Natural language processing (NLP is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages)

Page 29: Expectation-Maximization  (EM) Algorithm


Applications of EM algorithm (2)

– Psychometrics (the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits)
– Medical image reconstruction, especially in positron emission tomography (PET) and single photon emission computed tomography (SPECT)

Page 30: Expectation-Maximization  (EM) Algorithm


Applications of EM algorithm (3)

More applications regarding data analysis examples are
– Multivariate Data with Missing Values
  o Example: Bivariate Normal Data with Missing Values
– Least Squares with Missing Data
  o Example: Linear Regression with Missing Dependent Values
  o Example: Missing Values in a Latin Square Design
– Example: Multinomial with Complex Cell Structure
– Example: Analysis of PET and SPECT Data
– Example: Mixture distributions
– Example: Grouped, Censored and Truncated Data
  o Example: Grouped Log Normal Data
  o Example: Lifetime distributions for censored data

Page 31: Expectation-Maximization  (EM) Algorithm


Advantages of EM algorithm (1)

– The EM algorithm is numerically stable, with each EM iteration increasing the likelihood
– Under fairly general conditions, the EM algorithm has reliable global convergence (depends on initial value and likelihood!). Convergence is nearly always to a local maximizer.
– The EM algorithm is typically easily implemented, because it relies on complete-data computations
– The EM algorithm is generally easy to program, since no evaluation of the likelihood nor its derivatives is involved

Page 32: Expectation-Maximization  (EM) Algorithm


Advantages of EM algorithm (2)

– The EM algorithm requires small storage space and can generally be carried out on a small computer (it does not have to store the information matrix nor its inverse at any iteration).
– The M-step can often be carried out using standard statistical packages in situations where the complete-data MLE's do not exist in closed form.
– By watching the monotone increase in likelihood over iterations, it is easy to monitor convergence and programming errors.
– The EM algorithm can be used to provide estimated values of the "missing" data.

Page 33: Expectation-Maximization  (EM) Algorithm


Criticisms of EM algorithm

– Unlike Fisher's scoring method, it does not have an inbuilt procedure for producing an estimate of the covariance matrix of the parameter estimates.
– The EM algorithm may converge slowly, even in some seemingly innocuous problems and in problems where there is too much 'incomplete information'.
– The EM algorithm, like Newton-type methods, does not guarantee convergence to the global maximum when there are multiple maxima (in this case, the estimate obtained depends upon the initial value).
– In some problems, the E-step may be analytically intractable, although in such situations there is the possibility of effecting it via a Monte Carlo approach.

Page 34: Expectation-Maximization  (EM) Algorithm


References (1)

1. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Royal Statist Soc - B 39:1-38

2. Hartley HO (1958) Maximum likelihood estimation from incomplete data. Biometrics 14:174-194

3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York

4. Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J Royal Statist Soc - B 44:226–233

5. McLachlan GJ, Krishnan T (1997) The EM Algorithm and Extensions. John Wiley & Sons, Inc., New York

Page 35: Expectation-Maximization  (EM) Algorithm


References (2)

6. Meng XL, Rubin DB (1991) Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J Am Statist Assoc 86:899-909

7. Oakes D (1999) Direct calculation of the information matrix via the EM algorithm. J Royal Statist Soc - B 61:479-482

8. Rao CR (1972) Linear Statistical Inference and its Applications. John Wiley & Sons, Inc., New York

9. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195-239

Page 36: Expectation-Maximization  (EM) Algorithm


References (3)

10. Wei GCG, Tanner MA (1990a) A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J Am Statist Assoc 85:699-704

11. Wei GCG, Tanner MA (1990b) Posterior computations for censored regression data. J Am Statist Assoc 85:829-839

12. Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Statist 11:95-103

Page 37: Expectation-Maximization  (EM) Algorithm


Thank You