The Expectation-Maximization Algorithm



A common task in signal processing is the estimation of the parameters of a probability distribution function. Perhaps the most frequently encountered estimation problem is the estimation of the mean of a signal in noise. In many parameter estimation problems the situation is more complicated because direct access to the data necessary to estimate the parameters is impossible, or some of the data are missing. Such difficulties arise when an outcome is a result of an accumulation of simpler outcomes, or when outcomes are clumped together, for example, in a binning or histogram operation. There may also be data dropouts or clustering in such a way that the number of underlying data points is unknown (censoring and/or truncation). The EM (expectation-maximization) algorithm is ideally suited to problems of this sort, in that it produces maximum-likelihood (ML) estimates of parameters when there is a many-to-one mapping from an underlying distribution to the distribution governing the observation. In this article, the EM algorithm is presented at a level suitable for signal processing practitioners who have had some exposure to estimation theory. (A brief summary of ML estimation is provided in Box 1 for review.)

The EM algorithm consists of two major steps: an expectation step, followed by a maximization step. The expectation is with respect to the unknown underlying variables, using the current estimate of the parameters and conditioned upon the observations. The maximization step then provides a new estimate of the parameters.


may be either a probability density function (pdf) or a probability mass function (pmf).

A feature extractor is employed that can distinguish which objects are light and which are dark, but cannot distinguish shape. Let y = [y1, y2]^T, where y1 and y2 are the number of dark objects and the number of light objects detected, respectively, so that y1 = x1 + x2 and y2 = x3, and let the corresponding random variables be Y1 and Y2. There is a many-to-one mapping between {x1, x2} and y1. For example, if y1 = 3, there is no way to tell from the measurements whether x1 = 1 and x2 = 2 or x1 = 2 and x2 = 1. The EM algorithm is specifically designed for problems with such many-to-one mappings. Then (see Box 2),

(The symbol g is used to indicate the probability function for the observed data.) From the observation of y1 and y2, compute the ML estimate of p,

where argmax means the value that maximizes the function. In this example, it would be a simple matter to determine an ML estimate of p. In more interesting problems, however, such straightforward estimation is not possible. In the interest of introducing the EM algorithm, we do not take the direct approach to the ML estimate here. Taking the logarithm of the likelihood often simplifies the maximization and yields equivalent results, since log is an increasing function, so Eq.

1. An overview of the EM algorithm. After initialization, the E-step and the M-step are alternated until the parameter estimate has converged (no more change in the estimate).

    (3) may be written as

The idea behind the EM algorithm is that, even though we do not know x1 and x2, knowledge of the underlying distribution f(x1, x2, x3 | p) can be used to determine an estimate for p. This is done by first estimating the underlying data, then using these data to update our estimate of the parameter. This is repeated until convergence. Let p^[k] indicate the estimate of p after the kth iteration, k = 1, 2, .... An initial parameter value p^[0] is assumed. The algorithm consists of two major steps:

Expectation Step (E-step). Compute the expected value of the x data using the current estimate of the parameter and the observed data.

The expected value of x1, given the measurement y1 and based upon the current estimate of the parameter, may be computed as

2. Illustration of the many-to-one mapping from X to Y. The point y is the image of x, and the set X(y) is the inverse map of y.

(5)

    Similarly,

In the current example, x3 is known and does not need to be computed. Using the results of Box 2,


Maximization Step (M-step). Use the data from the expectation step as if it were actually measured data to determine an ML estimate of the parameter. This estimated data is sometimes called imputed data.

In this example, with x1^[k+1] and x2^[k+1] imputed and x3 available, the ML estimate of the parameter is obtained by taking the derivative of log f(x1^[k+1], x2^[k+1], x3 | p) with respect to p, equating it to zero, and solving for p,

(7)

The estimate x1^[k+1] is not used in Eq. (7) and so, for this example, need not be computed. The EM algorithm consists of iterating Eqs. (6) and (7) until convergence. Intermediate computation and storage may be eliminated by substituting Eq. (6) into Eq. (7) to obtain a one-step update:

3. Representation of ET. There are B boxes in the body and D detectors surrounding the body.
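To make the E-step and M-step of the introductory example concrete, here is a minimal Python sketch of the iteration of Eqs. (6) and (7). The trinomial class probabilities used, p1 = 1/4, p2 = (1 + p)/4, and p3 = (2 - p)/4, are an inference from the numbers in the example rather than a quotation of the article's equations, and the observed values y1 = 63 and x3 = 37 are those implied by the numerical example that follows.

```python
# Sketch of the introductory example: observe only y1 = x1 + x2 and x3,
# estimate p under assumed class probabilities
#   p1 = 1/4,  p2 = (1 + p)/4,  p3 = (2 - p)/4   (these sum to 1 for any p).
def em_trinomial(y1, x3, p_init=0.0, tol=1e-8, max_iter=100):
    p = p_init
    for _ in range(max_iter):
        # E-step: split y1 between classes 1 and 2 in proportion to their
        # current class probabilities (the result of Box 2).
        p1, p2 = 0.25, (1.0 + p) / 4.0
        x2_hat = y1 * p2 / (p1 + p2)
        # M-step: set d/dp [x2_hat*log(1 + p) + x3*log(2 - p)] = 0 and solve.
        p_new = (2.0 * x2_hat - x3) / (x2_hat + x3)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p

print(em_trinomial(y1=63, x3=37))   # converges to 0.52
```

Starting from p^[0] = 0, this iteration converges to 0.52, the final estimate reported in the text.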

As a numerical example, suppose that the true parameter is p = 0.5 and n = 100 samples are drawn, with y1 = 63. (The true values of x1 and x2 are 25 and 38, respectively, but the algorithm does not know this.) Table 1 illustrates the result of the algorithm starting from p^[0] = 0. The final estimate, p = 0.52, is in fact the ML estimate of p that would have been obtained by maximizing Eq. (1) with respect to p, had the x data been available.

General Statement of the EM Algorithm

Let Y denote the sample space of the observations, and let y ∈ R^m denote an observation from Y. Let X denote the underlying space and let x ∈ R^n be an outcome from X, with m < n. The data x is referred to as the complete data. The complete data x is not observed directly, but only by means of y, where y = y(x) and y(x) is a many-to-one mapping. An observation y determines a subset of X, which is denoted as X(y). Figure 2 illustrates the mapping.

The probability density function (pdf) of the complete data is f_X(x|θ) = f(x|θ), where θ ∈ Θ ⊆ R^r is the set of parameters of the density. (We will refer to the density of the random variables for convenience, even for discrete random variables for which a probability mass function (pmf) would be appropriate. Subscripts indicating the random variable are suppressed, with the argument to the density indicating the random variable.) The pdf f is assumed to be a continuous function of θ and appropriately differentiable. The ML estimate of θ is assumed to lie within the region Θ. The pdf of the incomplete data is

    Let

    denote the likelihood function and let

denote the log-likelihood function. The basic idea behind the EM algorithm is that we would

like to find θ to maximize log f(x|θ), but we do not have the data x to compute the log-likelihood. So instead, we maximize the expectation of log f(x|θ) given the data y and our current estimate of θ. This can be expressed in two steps. Let θ^[k] be our estimate of the parameters at the kth iteration.

For the E-step, compute

$Q(\theta\,|\,\theta^{[k]}) = E\big[\log f(x\,|\,\theta)\;\big|\;y, \theta^{[k]}\big].$  (9)

It is important to distinguish between the first and second arguments of the Q function. The second argument is a


conditioning argument to the expectation and is regarded as fixed and known at every E-step. The first argument conditions the likelihood of the complete data.

For the M-step, let θ^[k+1] be that value of θ which maximizes Q(θ|θ^[k]):

It is important to note that the maximization is with respect to the first argument of the Q function, the conditioner of the complete-data likelihood.

The EM algorithm consists of choosing an initial θ^[0], then performing the E-step and the M-step successively until convergence. Convergence may be determined by examining when the parameters quit changing, i.e., stop when ||θ^[k] − θ^[k−1]|| < ε for some ε and some appropriate distance measure || · ||.
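In code, the alternation of Eqs. (9) and (10) with this stopping rule is just a loop; the sketch below is a generic skeleton (the callables e_step and m_step stand in for the problem-specific expectation and maximization and are not from the article):

```python
import numpy as np

def em(theta0, e_step, m_step, y, eps=1e-6, max_iter=200):
    """Generic EM iteration: alternate the E-step (Eq. (9)) and the
    M-step (Eq. (10)) until the parameter estimate stops changing."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        q = e_step(theta, y)        # summarize Q(. | theta^[k]), e.g., sufficient statistics
        theta_new = m_step(q, y)    # theta^[k+1] = argmax_theta Q(theta | theta^[k])
        if np.linalg.norm(theta_new - theta) < eps:   # ||theta^[k] - theta^[k-1]|| < eps
            return theta_new
        theta = theta_new
    return theta
```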

The general form of the EM algorithm as stated in Eqs. (9) and (10) may be specialized and simplified somewhat by

restriction to distributions in the exponential family. These are pdfs (or pmfs) of the form

$f(x\,|\,\theta) = b(x)\exp\big[c(\theta)^T t(x)\big]/a(\theta)$  (11)

where θ is a vector of parameters for the family [35, 36]. The function t(x) is called the sufficient statistic of the family (a statistic is sufficient if it provides all of the information necessary to estimate the parameters of the distribution from the data [35, 36]). Members of the exponential family include most distributions of engineering interest, including Gaussian, Poisson, binomial, uniform, Rayleigh, and others. For exponential families, the E-step can be written as

Let t^[k+1] = E[t(x) | y, θ^[k]]. As a conditional expectation is an estimator, t^[k+1] is an estimate of the sufficient statistic. (The


4. Single-microphone ANC system.

5. Processor block diagram of the ANC system.

EM algorithm is sometimes called the estimation/maximization algorithm because, for exponential families, the first step is an estimator. It has also been called the expectation/modification algorithm [9].) In light of the fact that the M-step will be maximizing

$E[\log b(x)\,|\,y,\theta^{[k]}] + c(\theta)^T t^{[k+1]} - \log a(\theta)$

with respect to θ, and that $E[\log b(x)\,|\,y,\theta^{[k]}]$ does not depend upon θ, it is sufficient to write:

E-step: Compute

$t^{[k+1]} = E[t(x)\,|\,y,\theta^{[k]}].$  (12)

M-step: Compute

The EM algorithm may be diagrammed, starting from an initial guess of the parameter θ^[0], as follows:

$\theta^{[0]} \rightarrow \text{E-step} \rightarrow t^{[1]} \rightarrow \text{M-step} \rightarrow \theta^{[1]} \rightarrow \text{E-step} \rightarrow \cdots$

The EM algorithm has the advantage of being simple, at least

in principle; actually computing the expectations and performing the maximizations may be computationally taxing. In addition, as discussed in the next section, every iteration of the EM algorithm increases the likelihood function until a point of (local) maximum is reached. Unlike other optimization techniques, it is not necessary to compute gradients or Hessians, nor is it necessary to worry about setting step-size parameters, as algorithms such as gradient descent require.
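A small worked instance of the exponential-family form of Eqs. (12) and (13) (my own illustration, not an example from the article) is estimating the mean of i.i.d. unit-variance Gaussian data when some samples are missing: the sufficient statistic is the sum of the data, the E-step replaces each missing sample by the current mean, and the M-step divides the estimated sufficient statistic by the total sample count.

```python
import numpy as np

def em_gaussian_mean(y_obs, n_missing, theta0=0.0, eps=1e-9, max_iter=100):
    """EM for the mean of N(theta, 1) data when n_missing samples are unobserved.
    The sufficient statistic is t(x) = sum of all samples (exponential family)."""
    y_obs = np.asarray(y_obs, dtype=float)
    n = len(y_obs) + n_missing
    theta = theta0
    for _ in range(max_iter):
        t_hat = y_obs.sum() + n_missing * theta   # E-step, Eq. (12): E[t(x) | y, theta^[k]]
        theta_new = t_hat / n                     # M-step, Eq. (13): ML estimate given t_hat
        if abs(theta_new - theta) < eps:
            break
        theta = theta_new
    return theta

print(em_gaussian_mean([1.0, 2.0, 3.0], n_missing=2))   # converges to 2.0, the observed mean
```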

Convergence of the EM Algorithm

For every iterative algorithm, the question of convergence needs to be addressed: does the algorithm come finally to a solution, or does it iterate ad nauseam, ever learning but never coming to a knowledge of the truth? For the EM algorithm, the convergence may be stated simply: at every iteration of the EM algorithm, a value of the parameter is computed so that the likelihood function does not decrease. That is, at every iteration the estimated parameter provides an increase in the likelihood function, until a local maximum is reached, at which point the likelihood function cannot increase (but will not decrease). Box 3 contains a more precise statement of this convergence for the general EM algorithm.

Despite the convergence theorem in Box 3, there is no guarantee that the convergence will be to a global maximum. For likelihood functions with multiple maxima, convergence will be to a local maximum which depends on the initial starting point θ^[0].

The convergence rate of the EM algorithm is also of interest. Based on mathematical and empirical examinations, it has been determined that the convergence rate is usually slower than the quadratic convergence typically available with a Newton-type method [4]. However, as observed by Dempster [1], the convergence near the maximum (at least for exponential families) depends upon the eigenvalues of the Hessian of the update function M, so that rapid convergence may be possible. In any event, even with potentially slow convergence there are advantages to EM algorithms over Newton algorithms. In the first place, no Hessian needs to be computed. Also, there is no chance of overshooting the target or diverging away from the maximum. The EM algorithm is guaranteed to be stable and to converge to an ML estimate. Further discussion of convergence appears in [37, 38].

Applications of the EM Algorithm

In this section several applications of the EM algorithm to problems of signal processing interest are presented, to illustrate the computations required in the steps of the algorithm and also to demonstrate the breadth of applications to which it may be applied. The example in the ET image reconstruction section and the previous introductory example illustrate the case in which the densities are members of the exponential family. The other examples in this section treat densities that are not in the exponential family, so the more general statement of the EM algorithm must be applied. The focus of the examples is on the EM algorithm; assumptions and details of the systems involved are therefore not presented. The interested reader is encouraged to examine the references for details.

Introductory Example, Revisited

The multinomial distribution of the introductory example is a member of the exponential family with t(x) = x:


Box 2: Combination and conditional expectations of multinomials

Let X1, X2, X3 have a multinomial distribution with class probabilities (p1, p2, p3), so

$P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = \binom{x_1 + x_2 + x_3}{x_1,\, x_2,\, x_3}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}.$

Multinomial outcomes can be combined to form a new set of multinomial outcomes. Let Y = X1 + X2. The probability P(Y, X3) can be determined as follows:

$P(Y = y, X_3 = x_3) = \sum_{i} P(X_1 = i, X_2 = y - i, X_3 = x_3)$

theorem. So (X1 + X2, X3) is multinomial with class probabilities (p1 + p2, p3). To compute E[X1 | Y = y], it is first necessary to determine the conditional probability P(X1 = x1 | Y = y). This probability can

$f(x_1, x_2, x_3\,|\,p) = \binom{x_1 + x_2 + x_3}{x_1,\, x_2,\, x_3} \exp\!\Big[x_1 \log\tfrac{1}{4} + x_2 \log\big(\tfrac{1}{4} + \tfrac{p}{4}\big) + x_3 \log\big(\tfrac{1}{2} - \tfrac{p}{4}\big)\Big]$

The E-step consists simply of estimating the underlying data, given the current estimate and the data. This is followed by a straightforward maximization.

ET Image Reconstruction

In ET [7], tissues within a body are stimulated to emit photons. These photons are detected by detectors surrounding the tissue. For purposes of computation the body is divided into B boxes. The number of photons generated in each box is denoted by n(b), b = 1, 2, ..., B. The number of photons detected in each detector is denoted by y(d), d = 1, 2, ..., D, as shown in Fig. 3. Let y = [y(1), y(2), ..., y(D)] denote the vector of observations.

The generation of the photons from box b can be described as a Poisson process with mean λ(b), i.e.,

$f(n\,|\,\lambda(b)) = P(n(b) = n\,|\,\lambda(b)) = e^{-\lambda(b)}\,\frac{\lambda(b)^n}{n!}.$

The parameter λ(b) is a function of the tissue density, so that by estimating the parameters λ(b) in each box it is possible to construct an image of the body. The boxes are assumed to be independent of each other. Let the set of unknown parameters be denoted by λ = {λ(1), λ(2), ..., λ(B)}.

A photon emission from box b is detected in tube d with probability p(b,d), and it may be assumed that all emitted photons are detected by some detector, so that

$\sum_{d=1}^{D} p(b,d) = 1.$  (14)

Based upon the geometry of the sensors and the body it is possible to determine p(b,d). The detector variables y(d) are Poisson distributed,

$f(y\,|\,\lambda(d)) = P(y(d) = y) = e^{-\lambda(d)}\,\frac{\lambda(d)^y}{y!},$

and it can be shown that

$\lambda(d) = E[y(d)] = \sum_{b=1}^{B} \lambda(b)\, p(b,d).$  (15)

Let x(b,d) be the number of emissions from box b detected in detector d, and let x = {x(b,d), b = 1, ..., B, d = 1, ..., D}. For any given set of detector data {y(d)}, there are many different ways that the photons could have been generated. There is thus a many-to-one mapping from x(b,d) to y(d), and x constitutes the complete data set. Each variable of the complete data x(b,d) is Poisson with mean


    Assuming that each box generates independently of everyother box and that the detectors operate independently, thelikelihood function of the complete data is

$\ell(\lambda) = f(x\,|\,\lambda) = \prod_{b=1}^{B}\prod_{d=1}^{D} \frac{e^{-\lambda(b,d)}\,\lambda(b,d)^{x(b,d)}}{x(b,d)!}$  (16)

    and, using Eq. (15), the log-likelihood function is

Application of the EM algorithm is straightforward. Poisson distributions are in the exponential family. The sufficient statistics for the distribution are the data, t(x) = x. Let λ^[k] be the estimate of the parameters at the kth iteration and let x^[k](b,d) be the estimate of the complete data. For the E-step, compute

where the latter equality follows since each box is independent. Since x(b,d) is Poisson with mean λ^[k](b,d) and y(d) = Σ_{b=1}^{B} x(b,d) is Poisson with mean λ^[k](d) = Σ_{b=1}^{B} λ^[k](b,d), the conditional expectation may be computed (using techniques similar to those in Box 2).

For the M-step, x^[k+1](b,d) is used in the likelihood function (17), which is maximized with respect to λ(b):

$\sum_{b=1}^{B}\sum_{d=1}^{D}\Big[-\lambda(b)\,p(b,d) + x^{[k+1]}(b,d)\log\big(\lambda(b)\,p(b,d)\big) - \log x^{[k+1]}(b,d)!\Big],$

giving

$\lambda^{[k+1]}(b) = \sum_{d=1}^{D} x^{[k+1]}(b,d),$

where Eq. (14) has been used.

Equations (18) and (19) may be iterated until convergence. The overhead of storing x^[k+1](b,d) at each iteration may be eliminated by substituting Eq. (18) into Eq. (19) using Eq. (15), much as was done in the introductory example. This gives
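Assuming the combined update reduces to the standard Shepp-Vardi ML-EM form (a multiplicative correction of the current image by a back-projected ratio of measured to predicted counts), a minimal sketch is below; the array names and shapes are illustrative, not the article's notation.

```python
import numpy as np

def et_mlem(y, P, lam0, n_iter=50):
    """EM (ML-EM) iteration for emission tomography.

    y    : (D,) photon counts y(d) in each detector
    P    : (B, D) detection probabilities p(b, d); each row sums to 1 (Eq. (14))
    lam0 : (B,) initial emission-rate image lambda^[0](b)
    """
    lam = np.asarray(lam0, dtype=float).copy()
    y = np.asarray(y, dtype=float)
    for _ in range(n_iter):
        lam_d = lam @ P                 # forward projection, Eq. (15): lambda^[k](d)
        # E-step folded into M-step: x^[k+1](b,d) = lam(b) p(b,d) y(d) / lam(d),
        # then lambda^[k+1](b) = sum over d of x^[k+1](b,d).
        lam = lam * (P @ (y / lam_d))
    return lam
```

Because each new image is the old one multiplied by a nonnegative factor, nonnegativity of the reconstruction is preserved automatically at every iteration.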

Active Noise Cancellation (ANC)

Active noise cancellation is accomplished by measuring a noise signal and using a speaker driven out of phase with the noise to cancel it. In many traditional ANC techniques, two microphones are used in conjunction with an adaptive filter to provide cancellation (see, e.g., [39, 40]). Using the EM algorithm, ANC may be achieved with only one microphone [41]. The physical system is depicted in Fig. 4, with a block diagram for the ANC in Fig. 5.

The signal to be canceled is modeled as the output of an all-pole filter,

where

6. Illustration of a four-state HMM showing the states, the distributions in each state, and some probabilistic transitions between the states.

and u(t) is a white, unit-variance, zero-mean Gaussian process. The signal r(t) is generated by the processor and corresponds to the input of the speaker; the delay z^{-M} is the delay from the speaker to the microphone. The signal σ_v v(t) models the measurement error at the microphone. According to Fig. 5, the input to the processor can be written as

$y(t) = s(t) + \sigma_v v(t);$

we assume that v(t) is a unit-variance, white Gaussian process. The set of unknown parameters is θ = [a^T, σ_u^2, σ_v^2].

    A block of N measurements is used for processing. Theobserved data vector is

these observations span a set of autoregressive samples given by


$s = [s(1-p),\, s(2-p),\, \ldots,\, s(N)]^T.$

The complete data set is x = [y, s]. If we knew s,

estimation of the AR parameters would be straightforward using familiar spectrum estimation techniques.

The likelihood function for the complete data is

$f(x\,|\,\theta) = f(y, s\,|\,\theta) = f(y\,|\,s, \theta)\, f(s\,|\,\theta).$

The conditioning step provides important leverage

because it is straightforward to determine f(y|s,θ). The conditioning can be further broken down as

    Then

and (see [42, page 187])

The E-step may be computed as

$E[\log f(x\,|\,\theta)\,|\,y, \theta^{[k]}] = \log f(s_p(0)\,|\,\theta) - N\log\sigma_u - N\log\sigma_v - \cdots$

Taking the gradient with respect to a, and derivatives with respect to σ_u and σ_v, to maximize yields

(22)

The expectations in Eqs. (20), (21), and (22) are first and second moments of Gaussians, conditioned upon the observations,

which may be computed using a Kalman smoother. The variable s_p may be put into state-space form as

$s_p(t) = \Phi\, s_p(t-1) + g\, u(t)$
$y(t) = h^T s_p(t) + \sigma_v v(t)$

    where

    and


7. Representation of signals in a spread-spectrum multiple-access system.

$h^T = [0, \ldots, 0, 1].$

With an estimate of the parameters, the canceling signal c(t + M) is obtained by estimating s(t + M) using E[s_p(t)|y, θ] and θ^[k].

HMMs

The hidden Markov model is a stochastic model of a process that exhibits features that change over time. It has been applied in a broad variety of sequential pattern-recognition problems such as speech recognition and handwriting recognition [9, 43]. An overview appeared in Signal Processing Magazine in [44]. Detailed descriptions of HMMs and their application are given in [9, 10, 11].

A Markov chain is a stochastic model of a system that is capable of being in a finite number of states {1, 2, ..., S}. The current state of the system is denoted by s_t. The probability of transition from a state at the current (discrete) time t to any other state at time t + 1 depends only on the current state, and not on any prior states:

$P(s_{t+1} = j\,|\,s_t = i, s_{t-1} = i_1, \ldots) = P(s_{t+1} = j\,|\,s_t = i).$

It is common to express the transition probabilities as a matrix A with elements P(s_{t+1} = j | s_t = i) = a_{ij}. The initial state s_0 is chosen according to the probability

$\pi = [P(s_0 = 1), \ldots, P(s_0 = S)] = [\pi_1, \ldots, \pi_S]^T.$

In each state at time t, s_t, a (possibly vector) random

variable Y_t is selected according to the density f(Y_t = y_t | s_t = i) = f_i(y_t), as shown in Fig. 6. The variable y_t is observed, but the underlying state is not, hence the name hidden Markov model. The set of densities f_1, f_2, ..., f_S is denoted by f. The triple (A, π, f) defines the HMM.

The HMM operates as follows: an initial state s_0 is chosen according to the probability law π. A succeeding state s_1 is chosen according to the Markov probability transition A. An output y_1 is chosen according to f_{s_1}. Then a new state is chosen, and the process continues.
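A short sketch of this generative process, assuming (for illustration only) scalar Gaussian outputs with state-dependent means:

```python
import numpy as np

def sample_hmm(pi, A, means, sigma, T, rng=None):
    """Generate (states, outputs) from an HMM (A, pi, f) whose state-conditional
    densities are f_i = N(means[i], sigma**2)."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)
    S = len(pi)
    states, outputs = [], []
    s = rng.choice(S, p=pi)                           # initial state s_0 from pi
    for _ in range(T):
        s = rng.choice(S, p=A[s])                     # transition according to row s of A
        outputs.append(rng.normal(means[s], sigma))   # emit y_t according to f_s
        states.append(s)
    return np.array(states), np.array(outputs)

states, y = sample_hmm(pi=[0.5, 0.5],
                       A=[[0.9, 0.1], [0.2, 0.8]],
                       means=[0.0, 3.0], sigma=1.0, T=200)
```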

Let the elements of the HMM be parameterized by θ, i.e., there is a mapping θ → (A(θ), π(θ), f(·|θ)). The mapping is assumed to be appropriately smooth. In practice, the initial probability and transition probabilities are some of the elements of θ. The parameter estimation problem for an HMM is this: given a sequence of observations, y = (y_1, y_2, ..., y_T), determine the parameter θ which maximizes the likelihood function

That is, determine the initial state probabilities and the transition probabilities, as well as any parameters of the density functions, which maximize the likelihood function (23). From the complicated structure of (23), it is clear that this is a complicated maximization problem. The EM algorithm, however, provides the means necessary to compute it without difficulty.

Let s = [s_0, s_1, ..., s_T]^T be a vector of the (unobserved) states. The complete data vector can be expressed as x = (y, s). The pdf of the complete data can be written as

This factorization, with the pdf of the observation conditioned upon the unknown state sequence and the distribution of the unknown state sequence, turns out to be the key step in the application of the EM algorithm.

Because of the Markov structure of the state, the state probabilities in Eq. (24) may be written

$f(s\,|\,\theta) = \pi_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t}.$

The pdf of the observations, conditioned upon the unobserved states, factors as

$f(y\,|\,s, \theta) = \prod_{t=1}^{T} f_{s_t}(y_t\,|\,\theta).$

We will assume that the density in each state is Gaussian with known diagonal covariance and unknown mean, μ_j. (Many other distributions are possible, e.g., discrete selection, Poisson, exponential, or Gaussian with unknown mean and variance [45].) Then

(27)

Let $\mathcal{S} = \{1, 2, \ldots, S\}^{T+1}$ denote the set of all possible state sequences, including the initial state s_0. In the E-step,

$Q(\theta\,|\,\theta^{[k]}) = E[\log f(y, s\,|\,\theta)\,|\,y, \theta^{[k]}].$


Since the expectation is conditioned upon the observations, the only random component comes from the state variable. The E-step can thus be written as

The conditional probability is

Substituting from Eqs. (27) and (28),

$\theta^{[k+1]} = \arg\max_{\theta} Q(\theta\,|\,\theta^{[k]}).$

The maximizations may be accomplished by differentiating, equating the result to zero, and solving for the appropriate argument. For the mean, the result is

$\mu_j^{[k+1]} = \frac{\sum_{s \in \mathcal{S}} f(y\,|\,s, \theta^{[k]})\, f(s\,|\,\theta^{[k]}) \sum_{t:\, s_t = j} y_t}{\sum_{s \in \mathcal{S}} f(y\,|\,s, \theta^{[k]})\, f(s\,|\,\theta^{[k]}) \sum_{t=1}^{T} \mathbf{1}\{s_t = j\}}.$
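The state posteriors needed for this ratio can be computed without enumerating all of 𝒮; the sketch below uses the standard scaled forward-backward recursions (with Gaussian state densities and the generative convention above — my own illustration, not code from the article) to form γ_t(j) = P(s_t = j | y, θ^[k]) and the corresponding mean update.

```python
import numpy as np

def forward_backward(y, pi, A, means, sigma):
    """Scaled forward-backward pass; returns gamma[t, j] = P(s_t = j | y, theta)."""
    y, pi, A = (np.asarray(v, dtype=float) for v in (y, pi, A))
    T, S = len(y), len(pi)
    # State-conditional likelihoods f_j(y_t) for Gaussian densities.
    B = np.exp(-0.5 * ((y[:, None] - np.asarray(means)[None, :]) / sigma) ** 2)
    B /= np.sqrt(2.0 * np.pi) * sigma
    alpha, beta, scale = np.zeros((T, S)), np.zeros((T, S)), np.zeros(T)
    alpha[0] = (pi @ A) * B[0]          # y_1 is emitted after one transition from s_0
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def update_means(y, gamma):
    """M-step for the state means: posterior-weighted averages of the data."""
    y = np.asarray(y, dtype=float)
    return (gamma * y[:, None]).sum(axis=0) / gamma.sum(axis=0)
```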

Efficient algorithms for computing this expression have been developed based upon forward and backward inductive computation (dynamic programming or the Viterbi algorithm); see, e.g., [10, 11].

The Markov chain parameters π and a_{ij} may also be obtained by maximizing Eq. (29), with constraints to preserve the probabilistic nature of the parameters:

$\pi^{[k+1]} = \arg\max Q(\theta\,|\,\theta^{[k]}) \quad\text{subject to}\quad \sum_{i=1}^{S} \pi_i = 1,\ \pi_i \ge 0,$

$a_{ij}^{[k+1]} = \arg\max Q(\theta\,|\,\theta^{[k]}) \quad\text{subject to}\quad \sum_{j=1}^{S} a_{ij} = 1,\ a_{ij} \ge 0.$

This may be accomplished using Lagrange multipliers. Then the condition

(with λ a Lagrange multiplier) leads to

Spread-Spectrum Multi-User Communication

In direct-sequence spread-spectrum multiple-access (SSMA) communications, all users in a channel transmit simultaneously, using quasi-orthogonal spreading codes to reduce the inter-user interference [46]. The system block diagram is shown in Fig. 7. A signal received in a K-user system through a Gaussian channel may be written as

$r(t) = S(t, b) + \sigma N(t),$

where N(t) is unit-variance, zero-mean, white Gaussian noise and

$S(t, b) = \sum_{k=1}^{K} \sum_{i=-M}^{M} a_k\, b_k(i)\, s_k(t - iT - \tau_k)$

is the composite signal from all K transmitters. Here a_k is the amplitude of the kth transmitted signal (as seen at the receiver), b represents the symbols of all the users, b_k(i) is the ith bit of the kth user, τ_k is the channel propagation delay for the kth user, and s_k(t) is the signaling waveform of the kth user, including the spreading code. For this example, coherent reception of each user is assumed so that the amplitudes are real.

At the receiver the signal is passed through a bank of matched filters, with a filter matched to the spreading signal of each of the users, as shown in Fig. 8. (This assumes that synchronization for each user has been obtained.) The set of matched filter outputs for the ith bit interval is

$y(i) = [y_1(i), y_2(i), \ldots, y_K(i)]^T.$

8. Multiple-access receiver matched-filter bank.


Because the interference among the users is similar to intersymbol interference, optimal detection requires dealing with the entire sequence of matched filter vectors

For a Gaussian channel, it may be shown that

$y = H(b)\,a + z,$  (30)

where H(b) depends upon the correlations between the spreading signals and the bits transmitted, and z is non-white, zero-mean Gaussian noise. The likelihood function for the received sequence may be written as (see [47])

where R(b) and S(b) depend upon the bits and correlations, and c is a constant that makes the density integrate to 1. Note that even though the noise is Gaussian, which is in the exponential family, the overall likelihood function is not Gaussian because of the presence of the random bits; it is actually a mixture of Gaussians. For the special case of only a single user, the likelihood function becomes


What is ultimately desired from the detector is the set of bits for each user. It has been shown [46] that the inter-user interference degrades the probability of error very little, provided that sophisticated detection algorithms are employed after the matched filters. However, most of the algorithms that have been developed require knowledge of the amplitudes of each user [48]. Therefore, in order to determine the bits reliably, the amplitude of each user must also be known. Seen from the point of view of amplitude estimation, the bits are unknown nuisance parameters. (Other estimation schemes relying on decision feedback may take a different point of view.)

If the bits were known, an ML estimate of the amplitudes could be easily obtained: â_ML = S(b)^{-1} R(b) y. Lacking the bits, however, more sophisticated tools for obtaining the amplitudes must be applied as a precursor to detecting the bits. One approach to estimating the signal amplitudes is the EM algorithm [47]. For purposes of applying the EM algorithm, the complete data set is x = {y, b} and the parameter set is θ = a. To compute the expectations in the E-step, it is assumed that the bits are independent and equally likely ±1.

The likelihood function of the complete data is

$f(x\,|\,a) = f(y, b\,|\,a) = f(y\,|\,b, a)\, f(b\,|\,a).$  (32)

This conditioning is similar to that of Eqs. (19) and (24):

the complete-data likelihood is broken into a likelihood of the observation, conditioned upon the unobserved data, times a

likelihood of the unobserved data. From Eq. (31), f(y|b, a) is Gaussian. To compute the E-step,

$E[\log f(x\,|\,a)\,|\,y, a^{[k]}] = \sum_{b \in \{+1,-1\}^{(2M+1)K}} f(b\,|\,y, a^{[k]})\,\log f(x\,|\,a),$

it is necessary to determine the conditional probability f(b|y, a^[k]). It is revealing to consider a single-user system. In this case the log-likelihood function is

    and the E-step becomes

(33)

The conditional probability required for the expectation is

(34)

Substituting Eq. (34) into Eq. (33) yields

$E[\log f(x\,|\,a)\,|\,y, a^{[k]}] = \frac{a}{\sigma^2}\sum_{i=-M}^{M} y_1(i)\,\tanh\!\Big(\frac{a^{[k]} y_1(i)}{\sigma^2}\Big) - \frac{a^2}{2\sigma^2}(2M+1) + \text{constant}.$  (35)

Conveniently, Eq. (35) is quadratic in a, and the M-step is easily computed by differentiating Eq. (35) with respect to a, giving
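Assuming Eq. (35) has the form reconstructed above, setting its derivative with respect to a to zero gives $a^{[k+1]} = \frac{1}{2M+1}\sum_{i=-M}^{M} y_1(i)\tanh(a^{[k]} y_1(i)/\sigma^2)$. A minimal single-user sketch of this recursion (variable names and the synthetic test data are illustrative only):

```python
import numpy as np

def em_amplitude(y1, sigma2, a0=0.1, n_iter=50):
    """Single-user EM amplitude estimate from matched-filter outputs y1(i)."""
    y1 = np.asarray(y1, dtype=float)
    a = a0
    for _ in range(n_iter):
        soft_bits = np.tanh(a * y1 / sigma2)   # E-step: E[b_1(i) | y_1(i), a^[k]]
        a = np.mean(y1 * soft_bits)            # M-step: quadratic maximization in a
    return a

# Example: 2M + 1 = 201 bits with true amplitude 2.0 in unit-variance noise.
rng = np.random.default_rng(0)
bits = rng.choice([-1.0, 1.0], size=201)
print(em_amplitude(2.0 * bits + rng.normal(size=201), sigma2=1.0))
```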

Equation (36) gives the update equation for the amplitude estimate, which may be iterated until convergence. For multiple users, the E-step and M-step are structurally similar, but more involved computationally [47].

Summary

The EM algorithm may be employed when there is an underlying set with a known distribution function that is observed by means of a many-to-one mapping. If the distribution of the underlying complete data is exponential, the EM algorithm


may be specialized as in Eqs. (12) and (13). Otherwise, it will be necessary to use the general statement of the EM algorithm (Eqs. (9) and (10)). In many cases, the type of conditioning exhibited in Eqs. (19), (24), or (32) may be used: the observed data is conditioned upon data not observed so that the likelihood function may be computed. In general, if the complete data set is x = (y, z) for some unobserved z, then

$E[\log f(x\,|\,\theta)\,|\,y, \theta^{[k]}] = \int f(z\,|\,y, \theta^{[k]})\,\log f(x\,|\,\theta)\, dz,$

since, conditioned upon y, the only random component of x is z.

Analytically, the most difficult portion of the EM algorithm is the E-step. This is also often the most difficult computational step; for the general EM algorithm, the expectation must be computed over all values of the unobserved variables. There may be, as in the case of the HMM, efficient algorithms to ease the computation, but even these cannot completely eliminate the computational burden.

In most instances where the EM algorithm applies, there are other algorithms that also apply, such as gradient descent (see, e.g., [49]). As already observed, however, these algorithms may have problems of their own, such as requiring derivatives or the setting of convergence-rate parameters. Because of its generality and its guaranteed convergence, the EM algorithm is a good choice to consider for many estimation problems. Future work will include application in new and different areas, as well as developments to improve convergence speed and computational structure.

Todd K. Moon is Associate Professor in the Electrical and Computer Engineering Department and Center for Self-Organizing Intelligent Systems at Utah State University.

References

1. A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statistical Soc., Ser. B, vol. 39, no. 1, pp. 1-38, 1977.

2. For an extensive list of references to papers describing applications of the EM algorithm, see http://www.engineering.usu.edu/Departments/ece/Publications/Moon on the World-Wide Web.

3. C. Jiang, "The use of mixture models to detect effects of major genes on quantitative characteristics in a plant-breeding experiment," Genetics, vol. 136, no. 1, pp. 383-394, 1994.

4. R. Redner and H.F. Walker, "Mixture densities, maximum-likelihood estimation and the EM algorithm (review)," SIAM Review, vol. 26, no. 2, pp. 195-237, 1984.

5. J. Schmee and G.J. Hahn, "Simple method for regression analysis with censored data," Technometrics, vol. 21, no. 4, pp. 417-432, 1979.

6. R. Little and D. Rubin, "On jointly estimating parameters and missing data by maximizing the complete-data likelihood," American Statistician, vol. 37, no. 3, pp. 218-220, 1983.

7. L.A. Shepp and Y. Vardi, "Maximum likelihood reconstruction for emission tomography," IEEE Trans. Medical Imaging, vol. 1, pp. 113-122, October 1982.

8. D.L. Snyder and D.G. Politte, "Image reconstruction from list-mode data in an emission tomography system having time-of-flight measurements," IEEE Trans. Nuclear Science, vol. 30, no. 3, pp. 1843-1849, 1983.

9. L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.

10. L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.

11. J.R. Deller, J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech Signals. Macmillan, 1993.

12. M. Segal and E. Weinstein, "Parameter estimation of continuous dynamical linear systems given discrete time observations," Proc. IEEE, vol. 75, no. 5, pp. 727-729, 1987.

13. S. Zabin and H. Poor, "Efficient estimation of Class A noise parameters via the EM algorithm," IEEE Trans. Information Theory, vol. 37, no. 1, pp. 60-72, 1991.

14. A. Isaksson, "Identification of ARX models subject to missing data," IEEE Trans. Automatic Control, vol. 38, no. 5, pp. 813-819, 1993.

15. I. Ziskind and D. Hertz, "Maximum likelihood localization of narrow-band autoregressive sources via the EM algorithm," IEEE Trans. Signal Processing, vol. 41, no. 8, pp. 2719-2724, 1993.

16. R. Lagendijk, J. Biemond, and D. Boekee, "Identification and restoration of noisy blurred images using the expectation-maximization algorithm," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 7, pp. 1180-1191, 1990.

17. A. Katsaggelos and K. Lay, "Maximum likelihood blur identification and image restoration using the EM algorithm," IEEE Trans. Signal Processing, vol. 39, no. 3, pp. 729-733, 1991.

18. A. Ansari and R. Viswanathan, "Application of EM algorithm to the detection of direct sequence signal in pulsed noise jamming," IEEE Trans. Communications, vol. 41, no. 8, pp. 1151-1154, 1993.

19. M. Feder, "Parameter estimation and extraction of helicopter signals observed with a wide-band interference," IEEE Trans. Signal Processing, vol. 41, no. 1, pp. 232-244, 1993.

20. G. Kaleh, "Joint parameter estimation and symbol detection for linear and nonlinear unknown channels," IEEE Trans. Communications, vol. 42, no. 7, pp. 2406-2413, 1994.

21. W. Byrne, "Alternating minimization and Boltzmann machine learning," IEEE Trans. Neural Networks, vol. 3, no. 4, pp. 612-620, 1992.

22. M. Jordan and R. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural Computation, vol. 6, no. 2, pp. 181-214, 1994.

23. R. Streit and T. Luginbuhl, "ML training of probabilistic neural networks," IEEE Trans. Neural Networks, vol. 5, no. 5, pp. 764-783, 1994.

24. M. Miller and D. Fuhrmann, "Maximum likelihood narrow-band direction finding and the EM algorithm," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 9, pp. 1560-1577, 1990.

25. S. Vaseghi and P. Rayner, "Detection and suppression of impulsive noise in speech communication systems," IEE Proceedings-I, vol. 137, no. 1, pp. 38-46, 1990.

26. E. Weinstein, A. Oppenheim, M. Feder, and J. Buck, "Iterative and sequential algorithms for multisensor signal enhancement," IEEE Trans. Signal Processing, vol. 42, no. 4, pp. 846-859, 1994.

27. S.E. Bialkowski, "Expectation-maximization (EM) algorithm for regression, deconvolution, and smoothing of shot-noise limited data," Journal of Chemometrics, 1991.

28. C. Georghiades and D. Snyder, "The EM algorithm for symbol unsynchronized sequence detection," IEEE Trans. Communications, vol. 39, no. 1, pp. 54-61, 1991.

29. N. Antoniadis and A. Hero, "Time-delay estimation for filtered Poisson processes using an EM-type algorithm," IEEE Trans. Signal Processing, vol. 42, no. 8, pp. 2112-2123, 1994.

30. M. Segal and E. Weinstein, "The cascade EM algorithm," Proc. IEEE, vol. 76, no. 10, pp. 1388-1390, 1988.

31. C. Gyulai, S. Bialkowski, G.S. Stiles, and L. Powers, "A comparison of three multi-platform message-passing interfaces on an expectation-maximization algorithm," in Proceedings of the 1993 World Conference on Transputers, 1993.

32. R.E. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. Information Theory, vol. 18, pp. 460-473, July 1972.

33. I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue 1, 1984.
