Online EM Algorithm and Some Extensions
Olivier Cappé
Télécom ParisTech & CNRS
March 2011
Online Estimation for Missing Data Models
Based on (C & Moulines, 2009) and (C, 2010)
Goals
1 Maximum likelihood estimation, or
1' Competitive with maximum likelihood estimation when #obs. is large
2 Good scaling (performance vs. computational cost) as #obs. increases
3 Process data on-the-fly (no storage)
4 Simple to implement (no line-search, projection, preconditioning, etc.)
Outline
1 The EM Algorithm in Exponential Families
2 The Limiting EM Recursion
3 Online EM Algorithm
  The Algorithm
  Properties and Discussion
4 Use for Batch ML Estimation
5 Extensions
6 References
The EM Algorithm in Exponential Families
Missing Data Model
A missing data model is a statistical model $\{p_\theta(x, y)\}_{\theta\in\Theta}$ in which only $Y$ may be observed (the couple $(X, Y)$ is referred to as the complete data)
Hence, parameter estimates $\hat\theta_n$ must be functions of the observations $Y_1, \ldots, Y_n$ only (here assumed to be independent and identically distributed)
Of course, the statistical model could also be defined as $\{f_\theta(y)\}_{\theta\in\Theta}$, where $f_\theta(y) = \int p_\theta(x, y)\,dx$, but the specific structure of $f_\theta$ needs to be exploited
To analyze the methods, the data $\{Y_t\}_{t\ge 1}$ is assumed to be generated by an i.i.d. process with marginal $\pi$, not necessarily equal to $f_{\theta_\star}$
The EM Algorithm in Exponential Families
Finite Mixture Model
Mixture PDF
$f(y) = \sum_{i=1}^m \alpha_i f_i(y)$
Missing Data Interpretation
P(Xt = i) = αi
Yt|Xt = i ∼ fi(y)
The EM Algorithm in Exponential Families
To determine the maximum likelihood estimate
$\hat\theta_n = \arg\max_\theta \sum_{t=1}^n \log f_\theta(Y_t)$
numerically, the standard approach is the following.
Expectation-Maximization (Dempster, Laird & Rubin, 1977)
Given a current parameter guess $\theta_n^k$:
E-Step Compute
$q_{n,\theta_n^k}(\theta) = \frac{1}{n}\sum_{t=1}^n E_{\theta_n^k}\big[\log p_\theta(X_t, Y_t) \mid Y_t\big]$
M-Step Update the parameter estimate to
$\theta_n^{k+1} = \arg\max_{\theta\in\Theta}\, q_{n,\theta_n^k}(\theta)$
The EM Algorithm in Exponential Families
Rationale
1 It is an ascent algorithm (shown using Jensen's inequality)
Figure: The EM intermediate quantity is a minorizing surrogate
2 Because of the Fisher identity, the algorithm can only stop at a stationary point of the log-likelihood*
*See (Wu, 1983) for the necessary topological and regularity assumptions
The EM Algorithm in Exponential Families
An Example: Poisson Mixture
Likelihood
$f_\theta(Y) = \sum_{j=1}^m \alpha_j\, \frac{\lambda_j^Y}{Y!}\, e^{-\lambda_j}$
"Complete-Data" Log-Likelihood
$\log p_\theta(X, Y) = -\log(Y!) + \sum_{j=1}^m \big[\log(\alpha_j) - \lambda_j\big]\, \mathbb{1}\{X = j\} + \sum_{j=1}^m \log(\lambda_j)\, Y\, \mathbb{1}\{X = j\}$
The EM Algorithm in Exponential Families
EM Algorithm for the Poisson Mixture
EM E-Step
$q_{n,\theta_n^k}(\theta) = \sum_{j=1}^m \big[\log(\alpha_j) - \lambda_j\big]\, \frac{1}{n}\sum_{t=1}^n P_{\theta_n^k}(X_t = j \mid Y_t) + \sum_{j=1}^m \log(\lambda_j)\, \frac{1}{n}\sum_{t=1}^n Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)$
EM M-Step
$\alpha_{n,j}^{k+1} = \frac{1}{n}\sum_{t=1}^n P_{\theta_n^k}(X_t = j \mid Y_t)$
$\lambda_{n,j}^{k+1} = \frac{\sum_{t=1}^n Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)}{\sum_{t=1}^n P_{\theta_n^k}(X_t = j \mid Y_t)}$
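As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of these batch EM updates for the Poisson mixture; the function name and initialization are illustrative:

```python
import numpy as np
from scipy.stats import poisson

def batch_em_poisson_mixture(y, alpha, lam, n_iter=20):
    """Batch EM for an m-component Poisson mixture (sketch of the updates above)."""
    y = np.asarray(y)
    alpha, lam = np.asarray(alpha, float), np.asarray(lam, float)
    for _ in range(n_iter):
        # E-step: posterior probabilities P(X_t = j | Y_t) under the current parameters
        w = alpha * poisson.pmf(y[:, None], lam)       # shape (n, m)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of the mixture weights and Poisson means
        alpha = w.mean(axis=0)
        lam = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
    return alpha, lam
```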
The EM Algorithm in Exponential Families
Exponential Family Model
In the following, we assume that the complete-data model belongs to an exponential family
(Curved) Exponential Family Model
$p_\theta(x, y) = \exp\big(\langle s(x, y), \psi(\theta)\rangle - A(\theta)\big)$
where $s(x, y)$ is the vector of complete-data sufficient statistics
Explicit Complete-Data Maximum Likelihood
$S \mapsto \bar\theta(S) = \arg\max_\theta\, \langle S, \psi(\theta)\rangle - A(\theta)$
is available in closed form
The EM Algorithm in Exponential Families
The EM Algorithm Revisited
The k-th EM Iteration (From n Observations)
E-Step
$S_n^{k+1} = \frac{1}{n}\sum_{t=1}^n E_{\theta_n^k}\big[s(X_t, Y_t) \mid Y_t\big]$
M-Step
$\theta_n^{k+1} = \bar\theta\big(S_n^{k+1}\big)$
The Limiting EM Recursion
A Key Remark
The k-th EM Iteration (From n Observations)
E-Step
$S_n^{k+1} = \frac{1}{n}\sum_{t=1}^n E_{\theta_n^k}\big[s(X_t, Y_t) \mid Y_t\big]$
M-Step
$\theta_n^{k+1} = \bar\theta\big(S_n^{k+1}\big)$
Can be fully reparameterized in the domain of sufficient statistics:
$S_n^{k+1} = \frac{1}{n}\sum_{t=1}^n E_{\bar\theta(S_n^k)}\big[s(X_t, Y_t) \mid Y_t\big]$
The Limiting EM Recursion
The Limiting EM Recursion
By letting n tend to infinity, one obtains two equivalent updates:
Sufficient Statistics Update
$S^k = E_\pi\Big(E_{\bar\theta(S^{k-1})}\big[s(X_1, Y_1) \mid Y_1\big]\Big)$
Parameter Update
$\theta^k = \bar\theta\Big\{E_\pi\Big(E_{\theta^{k-1}}\big[s(X_1, Y_1) \mid Y_1\big]\Big)\Big\}$
Using the usual EM arguments, these updates are such that
1 The Kullback-Leibler divergence $D(\pi \,|\, f_{\theta^k})$ is monotonically decreasing with $k$
2 They converge to $\{\theta : \nabla_\theta D(\pi \,|\, f_\theta) = 0\}$
The Limiting EM Recursion
Batch EM Is Not Efficient for Large Data Records (see also Neal & Hinton, 1999)
Figure: Convergence of batch EM estimates of $\|u\|^2$ as a function of the number of EM iterations for 2,000 (top) and 20,000 (bottom) observations. The box-and-whisker plots are computed from 1,000 independent replications of the simulated data. The grey region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE (from [C, 2010]).
Online EM Algorithm The Algorithm
The Online EM Algorithm
The online EM algorithm outputs one updated parameter estimate $\hat\theta_n$ after processing each individual observation $Y_n$
The parameter update is very similar to applying the EM algorithm to the single observation $Y_n$ (with smoothing)
The memory footprint of the algorithm is constant while its computational cost is proportional to the number of processed observations
Online EM Algorithm The Algorithm
Online EM: Rationale
We try to locate the solutions of
$E_\pi\Big(E_{\bar\theta(S)}\big[s(X_1, Y_1) \mid Y_1\big]\Big) - S = 0$
Viewing $E_{\bar\theta(S)}\big[s(X_n, Y_n) \mid Y_n\big]$ as a noisy observation of $E_\pi\Big(E_{\bar\theta(S)}\big[s(X_1, Y_1) \mid Y_1\big]\Big)$, this is exactly the usual Stochastic Approximation (or Robbins-Monro) setup:
$S_n = S_{n-1} + \gamma_n\Big(E_{\bar\theta(S_{n-1})}\big[s(X_n, Y_n) \mid Y_n\big] - S_{n-1}\Big)$
where $(\gamma_n)$ is a sequence of decreasing positive stepsizes
Online EM Algorithm The Algorithm
The Algorithm
Online EM Algorithm
Stochastic E-Step
$S_n = (1 - \gamma_n)\, S_{n-1} + \gamma_n\, E_{\hat\theta_{n-1}}\big[s(X_n, Y_n) \mid Y_n\big]$
M-Step
$\hat\theta_n = \bar\theta(S_n)$
Practical Recommendations
$\gamma_n = 1/n^\alpha$ with $\alpha \in [0.6, 0.7]$
Don't perform the M-step for the first 10-20 observations
(optional) Use Polyak-Ruppert averaging (requires choosing $n_0$)
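The recursion is short enough to sketch generically. Below is a minimal Python sketch (illustrative, not from the talk), assuming the user supplies the two model-specific maps: `s_bar(theta, y)` for $E_\theta[s(X, Y) \mid Y = y]$ and `theta_bar(S)` for $\bar\theta$:

```python
import numpy as np

def online_em(y_stream, s_bar, theta_bar, s0, alpha=0.6, n_min=20):
    """Generic online EM sketch.

    y_stream  -- iterable of observations Y_1, Y_2, ...
    s_bar     -- s_bar(theta, y): E_theta[s(X, Y) | Y = y]  (model-specific E-step)
    theta_bar -- theta_bar(S): complete-data ML map         (model-specific M-step)
    s0        -- initial value of the sufficient-statistic vector
    """
    S = np.asarray(s0, dtype=float)
    theta = theta_bar(S)
    for n, y in enumerate(y_stream, start=1):
        gamma = n ** (-alpha)                           # step size gamma_n = 1 / n^alpha
        S = (1 - gamma) * S + gamma * s_bar(theta, y)   # stochastic E-step
        if n >= n_min:                                  # skip early M-steps, as recommended
            theta = theta_bar(S)
        yield theta
```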
Online EM Algorithm The Algorithm
Online EM in the Poisson Mixture Example
SA E-Step
Computing Conditional Expectations
$\hat p_{n,j} = \frac{\hat\alpha_{n-1,j}\, \hat\lambda_{n-1,j}^{Y_n}\, e^{-\hat\lambda_{n-1,j}}}{\sum_{i=1}^m \hat\alpha_{n-1,i}\, \hat\lambda_{n-1,i}^{Y_n}\, e^{-\hat\lambda_{n-1,i}}}$
Statistics Update (Stochastic Approximation)
$S_{n,j}^\alpha = (1 - \gamma_n)\, S_{n-1,j}^\alpha + \gamma_n\, \hat p_{n,j}$
$S_{n,j}^\lambda = (1 - \gamma_n)\, S_{n-1,j}^\lambda + \gamma_n\, \hat p_{n,j}\, Y_n$
M-Step: Parameter Update
$\hat\alpha_{n,j} = S_{n,j}^\alpha$, $\quad \hat\lambda_{n,j} = S_{n,j}^\lambda / S_{n,j}^\alpha$
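A minimal NumPy sketch of this Poisson-mixture instance of online EM (illustrative; it follows the practical recommendations above and assumes the initial weights sum to one):

```python
import numpy as np
from scipy.stats import poisson

def online_em_poisson_mixture(y_stream, alpha, lam, step_alpha=0.6, n_min=20):
    """Online EM for an m-component Poisson mixture (sketch of the updates above)."""
    alpha, lam = np.asarray(alpha, float), np.asarray(lam, float)
    s_alpha, s_lam = alpha.copy(), alpha * lam           # sufficient statistics S^alpha, S^lambda
    for n, y in enumerate(y_stream, start=1):
        gamma = n ** (-step_alpha)
        # SA E-step: posterior probabilities p_{n,j} under the current parameters
        p = alpha * poisson.pmf(y, lam)
        p /= p.sum()
        # stochastic-approximation update of the sufficient statistics
        s_alpha = (1 - gamma) * s_alpha + gamma * p
        s_lam = (1 - gamma) * s_lam + gamma * p * y
        if n >= n_min:                                   # M-step: parameter update
            alpha, lam = s_alpha.copy(), s_lam / s_alpha
    return alpha, lam
```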
Online EM Algorithm Properties and Discussion
Analysis (C & Moulines, 2009)
Under $\sum_n \gamma_n = \infty$, $\sum_n \gamma_n^2 < \infty$, compactness of $\Theta$ and other regularity assumptions,
1 The estimate $\hat\theta_n$ converges to one of the roots of $\nabla_\theta D(\pi \,|\, f_\theta) = 0$
2 The algorithm is asymptotically equivalent to
$\hat\theta_n = \hat\theta_{n-1} + \gamma_n\, J^{-1}(\hat\theta_{n-1})\, \nabla_\theta \log f_{\hat\theta_{n-1}}(Y_n)$
where $J(\theta) = -E_\pi\Big(E_\theta\big[\nabla_\theta^2 \log p_\theta(X_1, Y_1) \mid Y_1\big]\Big)$
3 For a well-specified model ($\pi = f_{\theta_\star}$) and under Polyak-Ruppert averaging†, $\tilde\theta_n$ is Fisher efficient:
$\sqrt{n}\,(\tilde\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, I_f^{-1}(\theta_\star)\big)$
where $I_f(\theta_\star) = -E_{\theta_\star}\big[\nabla_\theta^2 \log f_\theta(Y_1)\big]$
† $\tilde\theta_n = \frac{1}{n - n_0}\sum_{t = n_0 + 1}^{n} \hat\theta_t$, with $\gamma_n = n^{-\alpha}$ and $\alpha \in (1/2, 1)$
Online EM Algorithm Properties and Discussion
Some More Details
1 (Andrieu et al., 2005), but also (Delyon, 1994), (Benaïm, 1999), using the fact that $D(\pi \,|\, f_{\bar\theta(S)})$ is a Lyapunov function:
$\Big\langle \nabla_S D(\pi \,|\, f_{\bar\theta(S)})\,,\ \underbrace{E_\pi\Big(E_{\bar\theta(S)}\big[s(X_1, Y_1) \mid Y_1\big]\Big) - S}_{\text{mean field}} \Big\rangle \le 0$
2 Taylor series expansion of $\bar\theta$ to establish the equivalence (with remainder a.s. $o(\gamma_n)$)
3 (Pelletier, 1998) to show that
$\gamma_n^{-1/2}\,(\hat\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, I_p^{-1}(\theta_\star)/2\big)$
in well-specified models (where $I_p$ is the complete-data Fisher information matrix)
General results of (Polyak and Juditsky, 1992), (Mokkadem and Pelletier, 2006) on averaging
Online EM Algorithm Properties and Discussion
Illustration of Polyak-Ruppert Averaging
Figure: Four superimposed trajectories of the estimate of $u_1$ (first component of $u$) for various algorithm settings ($\alpha = 0.9$, $\alpha = 0.6$ and $\alpha = 0.6$ with Polyak-Ruppert averaging, from top to bottom). The actual value of $u_1$ is equal to zero.
Online EM Algorithm Properties and Discussion
Performance of Online EM
Figure: Online EM estimates of $\|u\|^2$ for various data sizes (200, 2,000 and 20,000 observations, from left to right) and algorithm settings ($\alpha = 0.9$, $\alpha = 0.6$ and $\alpha = 0.6$ with Polyak-Ruppert averaging, from top to bottom). The box-and-whisker plots (outlier plotting suppressed) are computed from 1,000 independent replications of the simulated data. The grey regions correspond to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE.
Online EM Algorithm Properties and Discussion
Related Works
(Titterington, 1984) Proposes a gradient algorithm
$\hat\theta_n = \hat\theta_{n-1} + \gamma_n\, I_p^{-1}(\hat\theta_{n-1})\, \nabla_\theta \log f_{\hat\theta_{n-1}}(Y_n)$
It is asymptotically equivalent to the algorithm described previously for well-specified models ($\pi = f_{\theta_\star}$)
(Neal & Hinton, 1999) Describe an algorithm called incremental EM that is equivalent (up to the first batch scan only) to online EM used with $\gamma_n = 1/n$
(Sato, 2000; Sato & Ishii, 2000) Describe the algorithm and provide some analysis in the flat model case and for mixtures of Gaussians
Online EM Algorithm Properties and Discussion
How Does This Work in Practice?
Fine But don’t use ‡ γn = 1/n
Simulations in (C & Moulines, 2009) on mixtures of Gaussian regressions
Large Scale Experiments on Real Data in (Liang & Klein, 2009), wherethe use of mini-batch blocking was found useful:
Apply the proposed algorithm consideringYmk+1, Ymk+2 . . . Ym(k+1) as one observation
Mini-batch blocking is useful in dealing with mixture-likemodels with infrequent components
‡γn = γ0/(n0 + n) can be an optionbut requires carefully setting γ0 and n0
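A possible sketch of mini-batch blocking on top of the generic recursion (hypothetical helper names `s_bar` and `theta_bar` as in the earlier sketch; averaging the sufficient statistics over each block is one natural reading of "considering a block as one observation"):

```python
import numpy as np

def online_em_minibatch(y, s_bar, theta_bar, s0, m=10, alpha=0.6):
    """Online EM with mini-batch blocking: each block of m observations
    is treated as a single 'observation' in the stochastic E-step."""
    S = np.asarray(s0, dtype=float)
    theta = theta_bar(S)
    for k, block in enumerate(np.array_split(y, max(1, len(y) // m)), start=1):
        gamma = k ** (-alpha)                       # one step per block
        s_block = np.mean([s_bar(theta, yi) for yi in block], axis=0)
        S = (1 - gamma) * S + gamma * s_block       # stochastic E-step on the block
        theta = theta_bar(S)                        # M-step
    return theta
```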
Online EM Algorithm Properties and Discussion
Some Intuition About the Weights
If $r_k = (1 - \gamma_k)\, r_{k-1} + \gamma_k E_k$ for $k \ge 1$, then
1 $r_n = \sum_{k=1}^n \omega_k^n E_k + \omega_0^n r_0$ with $\sum_{k=0}^n \omega_k^n = 1$
2 $\omega_k^n = \frac{1}{n+a}$ (for $k \ge 1$) when $\gamma_k = 1/(k+a)$, and is strictly increasing in $k$ otherwise
3 $\sum_{k=1}^n (\omega_k^n)^2 \asymp \frac{1}{2}\, n^{-\alpha}$ when $\gamma_k = k^{-\alpha}$, with $1/2 < \alpha < 1$
Figure: Weights $\omega_k^n$ for $n = 10{,}000$ and $\alpha = 0.6$, $\alpha = 0.9$, $\alpha = 1$.
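To make item 1 concrete, here is a small NumPy check (illustrative, not from the talk) that unrolls the recursion into the weights $\omega_k^n$ and verifies that they form a convex combination:

```python
import numpy as np

def weights(n, gamma):
    """Unroll r_k = (1 - gamma_k) r_{k-1} + gamma_k E_k into r_n = sum_k w[k] E_k + w[0] r_0."""
    g = np.array([0.0] + [gamma(k) for k in range(1, n + 1)])    # g[k] = gamma_k, g[0] unused
    w = np.empty(n + 1)
    w[0] = np.prod(1 - g[1:])                                     # weight of r_0
    for k in range(1, n + 1):
        w[k] = g[k] * np.prod(1 - g[k + 1:])                      # omega_k^n
    return w

w = weights(1000, lambda k: k ** -0.6)
print(w.sum())             # ~1.0: the weights form a convex combination
print((w[1:] ** 2).sum())  # decays roughly like n^(-alpha), cf. item 3
```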
Use for Batch ML Estimation
How to Use Online EM for Batch ML Estimation?
The most popular use of the method is to perform batch ML estimation from very large datasets
Because we did not assume that $\pi = f_{\theta_\star}$, the previous analysis can be applied to $\pi \equiv$ the empirical measure associated with $Y_1, \ldots, Y_n$
Online EM can be used for batch ML estimation by (randomly) scanning the data $Y_1, \ldots, Y_n$ (a sketch follows below)
Convergence "speed" (with averaging) is $(n_{\text{obs.}} \times n_{\text{scans}})^{-1/2}$, versus $\rho^{\,n_{\text{scans}}}$ for batch EM
Not a fair comparison in terms of computing time, as the M-step is not free and possible parallelization is ignored
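A sketch of batch ML estimation with online EM via random scans of a fixed data set (again with the hypothetical `s_bar`/`theta_bar` maps from the earlier sketch; the step-size schedule simply continues across scans):

```python
import numpy as np

def online_em_batch(y, s_bar, theta_bar, s0, n_scans=5, alpha=0.6, seed=0):
    """Batch ML estimation with online EM: randomly scan the fixed data several times."""
    rng = np.random.default_rng(seed)
    S = np.asarray(s0, dtype=float)
    theta = theta_bar(S)
    n = 0
    for _ in range(n_scans):
        for t in rng.permutation(len(y)):            # random scan of Y_1, ..., Y_N
            n += 1
            gamma = n ** (-alpha)
            S = (1 - gamma) * S + gamma * s_bar(theta, y[t])
            theta = theta_bar(S)
    return theta
```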
Use for Batch ML Estimation
Comparison With Batch and Incremental EM
Figure: Normalized log-likelihood of the estimates obtained with, from top to bottom, batch EM, incremental EM and online EM, as a function of the number of batch tours (or iterations, for batch EM). The data is of length N = 100 and the box-and-whisker plots summarize the results of 500 independent runs of the algorithms started from randomized starting points $\theta_0$.
Use for Batch ML Estimation
Comparison With Batch and Incremental EM (Contd.)
Figure: Same display for a data record of length N = 1,000.
Extensions
Summary
The Good
Easy (esp. when an EM implementation is available)
Can be used for ML estimation from a batch of observations
Robust w.r.t. stepsize selection (note that the scale is fixed due to the use of convex combinations)
Handles parameter constraints nicely (only requires that $\mathcal{S}$ be closed under convex combinations with expected sufficient statistics)
Extensions
Summary (Contd.)
The Bad
Needs that the E-step be explicit
Needs that $\bar\theta$ be explicit
Not appropriate for short (say, less than 1,000 observations) data records without cycling
What about non-independent observations?
Extensions
Online EM in Latent Factor Models (Ongoing Work)
Many models of the form
$C_n \mid H_n \sim g_{\sum_{k=1}^K \theta_k H_{n,k}}$
where $\{g_\lambda\}_{\lambda\in\Lambda}$ is an exponential family of distributions and $H_n$ is a latent random vector of positive weights (probabilistic matrix factorization, discrete component analysis, partial membership models, simplicial mixtures)
Figure: Bayesian network representations of Latent Dirichlet Allocation (LDA)
Extensions
Simulated Online EM Algorithm for LDA
For $n = 1, \ldots$
Simulated E-step
Simulate $H_n$ given $C_n$ and $\hat\theta_{n-1}$ (in practice, using a short run of Metropolis-Hastings or collapsed Gibbs sampling)
Use the Rao-Blackwellized update
$S_n = (1 - \gamma_n)\, S_{n-1} + \gamma_n\, E_{\hat\theta_{n-1}}\big[s(Z_n, W_n) \mid W_n, H_n\big]$
M-step: $\hat\theta_n = \bar\theta(S_n)$
Extensions
Ignoring the sampling bias, this recursion can be analyzed and has the same asymptotic properties as the online EM algorithm
In particular, for well-specified models,
$\gamma_n^{-1/2}\,(\hat\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, I_f^{-1}(\theta_\star)\big)$
instead of
$\gamma_n^{-1/2}\,(\hat\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, I_p^{-1}(\theta_\star)\big)$
for the "exact" online EM algorithm ($I_p(\theta_\star) = -E_{\theta_\star}\big[\nabla_\theta^2 \log p_\theta(X_1, Y_1)\big]$).
References
Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. J. Roy. Statist. Soc. B, 71(3):593-613.
Cappé, O. (2011). Online Expectation-Maximisation. To appear in Mengersen, K., Titterington, M., & Robert, C. P., eds., Mixtures, Wiley.
Liang, P. & Klein, D. (2009). Online EM for unsupervised models. In Proc. NAACL Conference.
Neal, R. M. & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., ed., Learning in Graphical Models, pages 355-368. MIT Press, Cambridge, MA, USA.
Rohde, D. & Cappé, O. (2011). Online maximum-likelihood estimation for latent factor models. Submitted.
Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. International Conference on Neural Information Processing, 1:476-481.
Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12:407-432.
Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc. B, 46(2):257-267.