Expectation-Maximization Algorithm and Applications
Eugene Weinstein
Courant Institute of Mathematical Sciences
Nov 14th, 2006
2/31
List of Concepts
Maximum-Likelihood Estimation (MLE)
Expectation-Maximization (EM)
Conditional Probability
Mixture Modeling
Gaussian Mixture Models (GMMs)
String edit-distance
Forward-backward algorithms
3/31
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
4/31
One-Slide MLE Review
Say I give you a coin with P(heads) = θ and P(tails) = 1 − θ
But I don’t tell you the value of θ
Now say I let you flip the coin n times
You get h heads and n − h tails
What is the natural estimate of θ?
This is θ̂ = h/n
More formally, the likelihood of θ is governed by a binomial distribution:
L(θ) = C(n, h) θ^h (1 − θ)^(n−h)
Can prove θ̂ = h/n is the maximum-likelihood estimate of θ
Differentiate the log-likelihood with respect to θ, set equal to 0
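A quick sketch of that differentiation step (added here; the slide itself only states the result):

```latex
\log L(\theta) = \log\binom{n}{h} + h\log\theta + (n-h)\log(1-\theta)
\qquad
\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{n-h}{1-\theta} = 0
\;\;\Longrightarrow\;\; \hat{\theta} = \frac{h}{n}
```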
5/31
EM Motivation
So, to solve any ML-type problem, we analytically maximize the likelihood function?
Seems to work for 1D Bernoulli (coin toss)
Also works for 1D Gaussian (find µ, σ²)
Not quite
Distribution may not be well-behaved, or have too many parameters
Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (1M parameters)
Direct maximization is not feasible
Solution: introduce hidden variables to
Simplify the likelihood function (more common)
Account for actual missing data
6/31
Hidden and Observed Variables
Observed variables: directly measurable from the data, e.g.
The waveform values of a speech recording
Is it raining today?
Did the smoke alarm go off?
Hidden variables: influence the data, but not trivial to measure
The phonemes that produce a given speech recording
P(rain today | rain yesterday)
Is the smoke alarm malfunctioning?
7/31
Expectation-Maximization
Model dependent random variables:
Observed variable x
Unobserved (hidden) variable y that generates x
Assume probability distributions over x and y
θ represents the set of all parameters of the distribution
Repeat until convergence
E-step: Compute the expectation Q(θ, θ′) = E[log P(x, y | θ) | x, θ′]
(θ′, θ: old, new distribution parameters)
M-step: Find the θ that maximizes Q
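The loop structure can be sketched in a few lines of Python. This is a generic skeleton of my own, not from the talk; the e_step and m_step callables are hypothetical placeholders for the model-specific math:

```python
def em(x, theta0, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate E and M steps until Q stops improving.

    e_step(x, theta) should return Q(., theta), the expected complete-data
    log-likelihood as a function of the candidate new parameters;
    m_step(q) should return a theta that maximizes (or just improves) Q.
    """
    theta = theta0
    prev_q_value = float("-inf")
    for _ in range(max_iter):
        q = e_step(x, theta)               # E-step: build Q(theta_new, theta)
        theta = m_step(q)                  # M-step: maximize Q over theta_new
        q_value = q(theta)
        if q_value - prev_q_value < tol:   # converged: Q no longer improves
            break
        prev_q_value = q_value
    return theta
```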
8/31
Conditional Expectation Review
Let X, Y be r.v.’s drawn from the distributions P(x) and P(y)
Conditional distribution given by: P(y | x) = P(x, y) / P(x)
Then E[Y | X] = Σ_y y P(y | X)
For a function h(Y): E[h(Y) | X] = Σ_y h(y) P(y | X)
Given a particular value of X (X = x): E[h(Y) | X = x] = Σ_y h(y) P(y | x)
9/31
Maximum Likelihood Problem
Want to pick θ that maximizes the log-likelihood log P(x, y | θ) of the observed (x) and unobserved (y) variables given
Observed variable x
Previous parameters θ′
Conditional expectation of log P(x, y | θ) given x and θ′ is Q(θ, θ′)
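Writing that conditional expectation out as an explicit sum over the hidden variable (this display is my addition, following the conditional-expectation review above):

```latex
Q(\theta, \theta') \;=\; E\bigl[\log P(x, y \mid \theta) \,\bigm|\, x, \theta'\bigr]
  \;=\; \sum_{y} P(y \mid x, \theta')\,\log P(x, y \mid \theta)
```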
10/31
EM Derivation
Lemma (special case of Jensen’s Inequality): Let p(x), q(x) be probability distributions. Then
Σ_x p(x) log p(x) ≥ Σ_x p(x) log q(x)
Proof: rewrite as Σ_x p(x) log (p(x)/q(x)) ≥ 0
Interpretation: relative entropy (KL divergence) is non-negative
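One standard way to complete the proof, using the concavity of log (added for completeness; the transcript does not preserve the steps):

```latex
\sum_x p(x)\log\frac{p(x)}{q(x)}
  \;=\; -\sum_x p(x)\log\frac{q(x)}{p(x)}
  \;\ge\; -\log\sum_x p(x)\,\frac{q(x)}{p(x)}
  \;=\; -\log\sum_x q(x)
  \;=\; 0
```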
11/31
EM Derivation
EM Theorem:
If Q(θ, θ′) ≥ Q(θ′, θ′), then P(x | θ) ≥ P(x | θ′)
Proof:
By some algebra and the lemma, log P(x | θ) − log P(x | θ′) ≥ Q(θ, θ′) − Q(θ′, θ′)
So, if this quantity is non-negative, so is the change in log P(x | θ)
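The algebra referred to above is the standard EM decomposition; a sketch of my own reconstruction (the first sum is exactly Q(θ, θ′)):

```latex
\log P(x \mid \theta)
  = \sum_y P(y \mid x, \theta')\,\log P(x, y \mid \theta)
  \;-\; \sum_y P(y \mid x, \theta')\,\log P(y \mid x, \theta)
```

```latex
\log P(x \mid \theta) - \log P(x \mid \theta')
  = \bigl[Q(\theta, \theta') - Q(\theta', \theta')\bigr]
  + \underbrace{\sum_y P(y \mid x, \theta')\,
      \log\frac{P(y \mid x, \theta')}{P(y \mid x, \theta)}}_{\ge\,0 \text{ by the lemma}}
```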
12/31
EM Summary
Repeat until convergence
E-step: Compute the expectation Q(θ, θ′) = E[log P(x, y | θ) | x, θ′]
(θ′, θ: old, new distribution parameters)
M-step: Find the θ that maximizes Q
EM Theorem: If Q(θ, θ′) ≥ Q(θ′, θ′), then P(x | θ) ≥ P(x | θ′)
Interpretation
As long as we can improve the expectation of the log-likelihood, EM improves our model of the observed variable x
Actually, it’s not necessary to maximize the expectation, just need to make sure that it increases – this is called “Generalized EM”
13/31
EM Comments
In practice, x is a series of data points x₁, …, xₙ
To calculate the expectation, can assume the points are i.i.d. and sum over them: Q(θ, θ′) = Σᵢ E[log P(xᵢ, y | θ) | xᵢ, θ′]
Problems with EM?
Local maxima
Need to bootstrap the training process (pick an initial θ)
When is EM most useful?
When the model distributions are easy to maximize (e.g., Gaussian mixture models)
EM is a meta-algorithm, and needs to be adapted to each particular application
14/31
Overview
Expectation-Maximization
Mixture Model Training
Learning String Distance
15/31
EM Applications: Mixture Models
Gaussian/normal distribution: N(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
Parameters: mean µ and variance σ²
In the multi-dimensional case, assume an isotropic Gaussian: same variance in all dimensions
We can model arbitrary distributions with density mixtures
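For reference, the d-dimensional isotropic Gaussian density mentioned above takes the following form (my transcription; the slide shows it as an image):

```latex
N(x;\, \mu, \sigma^2)
  \;=\; \frac{1}{(2\pi\sigma^2)^{d/2}}
        \exp\!\Bigl(-\frac{\lVert x - \mu \rVert^2}{2\sigma^2}\Bigr),
  \qquad x, \mu \in \mathbb{R}^d
```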
16/31
Density Mixtures
Combine m elementary densities to model a complex data distribution:
P(x | θ) = Σₖ₌₁..m αₖ N(x; µₖ, σₖ²), with mixture weights αₖ ≥ 0, Σₖ αₖ = 1
kth Gaussian parametrized by αₖ, µₖ, σₖ
17/31
Density Mixtures
Combine m elementary densities to model a complex data distribution
18/31
Density Mixtures
Combine m elementary densities to model a complex data distribution
Log-likelihood function of the data x given θ = {αₖ, µₖ, σₖ}:
log P(x | θ) = Σᵢ log Σₖ αₖ N(xᵢ; µₖ, σₖ²)
Log of sum – hard to optimize analytically!
Instead, introduce hidden variable y
y = k: x generated by Gaussian k
EM formulation: maximize Q(θ, θ′) = E[log P(x, y | θ) | x, θ′]
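To make the “log of a sum” concrete, here is a small NumPy sketch (my own code, not from the talk) that evaluates this log-likelihood for a 1D Gaussian mixture using a log-sum-exp:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(x, alphas, mus, sigmas):
    """Log-likelihood of 1D data x under a Gaussian mixture.

    x: (n,) data points; alphas, mus, sigmas: (m,) mixture weights,
    means, and standard deviations. Returns sum_i log sum_k alpha_k N(x_i).
    """
    x = np.asarray(x)[:, None]                       # shape (n, 1)
    log_norm = -0.5 * np.log(2 * np.pi * sigmas**2)  # log Gaussian normalizer
    log_comp = log_norm - (x - mus)**2 / (2 * sigmas**2)  # log N(x_i; mu_k, sigma_k^2)
    return logsumexp(np.log(alphas) + log_comp, axis=1).sum()
```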
19/31
Gaussian Mixture Model EM
Goal: maximize the likelihood of the data
n (observed) data points: x₁, …, xₙ
n (hidden) labels: y₁, …, yₙ
yᵢ = k: xᵢ generated by Gaussian k
Several pages of math later, we get:
E-step: compute the likelihood (posterior) of each label assignment, P(yᵢ = k | xᵢ, θ′)
M-step: update αₖ, µₖ, σₖ for each Gaussian k = 1..m
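A minimal 1D sketch of these two steps (my own code; the talk leaves the update formulas on the slide as images):

```python
import numpy as np
from scipy.special import logsumexp

def gmm_em_step(x, alphas, mus, sigmas):
    """One EM iteration for a 1D Gaussian mixture; returns updated parameters."""
    x = np.asarray(x)[:, None]                               # shape (n, 1)
    # E-step: responsibilities r[i, k] = P(y_i = k | x_i, theta')
    log_comp = (np.log(alphas)
                - 0.5 * np.log(2 * np.pi * sigmas**2)
                - (x - mus)**2 / (2 * sigmas**2))            # shape (n, m)
    r = np.exp(log_comp - logsumexp(log_comp, axis=1, keepdims=True))
    # M-step: re-estimate alpha_k, mu_k, sigma_k from the soft counts
    n_k = r.sum(axis=0)                                      # expected count per Gaussian
    alphas = n_k / len(x)
    mus = (r * x).sum(axis=0) / n_k
    sigmas = np.sqrt((r * (x - mus)**2).sum(axis=0) / n_k)
    return alphas, mus, sigmas
```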
20/31
GMM-EM Discussion
Summary: EM is naturally applicable to training probabilistic models
EM is a generic formulation; need to do some hairy math to get to an implementation
Problems with GMM-EM?
Local maxima
Need to bootstrap the training process (pick an initial θ)
GMM-EM applicable to an enormous number of pattern recognition tasks: speech, vision, etc.
Hours of fun with GMM-EM
21/31
Overview
Expectation-Maximization
Mixture Model Training
Learning String Distance
22/31
String Edit-Distance
Notation: operate on two strings, x and y
Edit-distance: transform one string into another using edit operations, each with a cost
Substitution: kitten → bitten, cost of substitution
Insertion: cop → crop, cost of insertion
Deletion: learn → earn, cost of deletion
Can compute the minimum-cost transformation efficiently with a recursive (dynamic programming) procedure, as sketched below
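A compact dynamic-programming sketch of that recursion (my own code; unit costs are assumed here, while the talk keeps the costs abstract):

```python
def edit_distance(x, y, sub_cost=1, ins_cost=1, del_cost=1):
    """Minimum total cost of edit operations transforming x into y."""
    n, m = len(x), len(y)
    # d[i][j] = edit distance between x[:i] and y[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = 0 if x[i - 1] == y[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j - 1] + same,   # substitute (or match)
                          d[i - 1][j] + del_cost,   # delete x[i-1]
                          d[i][j - 1] + ins_cost)   # insert y[j-1]
    return d[n][m]

# e.g. edit_distance("kitten", "bitten") == 1
```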
23/31
Stochastic String Edit-Distance
Instead of setting costs, model the edit operation sequence as a random process
Edit operations selected according to a probability distribution δ(⋅)
For an edit operation sequence z₁, …, zₙ, view string edit-distance as
memoryless (Markov): each edit operation is chosen independently of the previous ones
stochastic: the random process according to δ(⋅) is governed by a true probability distribution
transducer: the process can be represented as a weighted finite-state transducer (next slide)
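Under the memoryless assumption, the probability of a complete edit sequence factors into per-operation terms. In Ristad & Yianilos's notation, with # the end-of-sequence symbol, this is roughly (my transcription):

```latex
P(z_1 \cdots z_n \#) \;=\; \Bigl(\prod_{t=1}^{n} \delta(z_t)\Bigr)\,\delta(\#)
```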
24/31
Edit-Distance Transducer
Arc label a:b/0 means input a, output b, and weight 0
Assume
25/31
Two Distances
Define the yield of an edit sequence, ν(z₁…zₙ#), as the set of all string pairs ⟨x, y⟩ such that z₁…zₙ# turns x into y
Viterbi edit-distance: negative log-likelihood of the most likely edit sequence turning x into y
Stochastic edit-distance: negative log-likelihood of all edit sequences from x to y (summed)
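Spelled out, the two distances are (my transcription of the Ristad & Yianilos definitions, with P(z | θ) the probability of edit sequence z):

```latex
d_V(x, y) \;=\; -\log \max_{z:\,\langle x, y\rangle \in \nu(z)} P(z \mid \theta)
\qquad
d_S(x, y) \;=\; -\log \sum_{z:\,\langle x, y\rangle \in \nu(z)} P(z \mid \theta)
```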
26/31
Evaluating Likelihood
Viterbi: maximize over edit sequences; Stochastic: sum over edit sequences
Both require a computation over all possible edit sequences
Exponentially many possibilities (three edit operations to choose from at each step)
However, the memoryless assumption allows us to compute the likelihood efficiently
Use the forward-backward method!
27/31
Forward
Evaluation of forward probabilities α(i, j): likelihood of picking an edit sequence that generates the prefix pair ⟨x₁…xᵢ, y₁…yⱼ⟩
Memoryless assumption allows efficient recursive computation (see the sketch below)
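The recursion itself did not survive the transcript; in the Ristad & Yianilos formulation it reads roughly as follows (my transcription; ε is the empty string, and each term applies only when the corresponding index is positive):

```latex
\alpha(0,0) = 1,\qquad
\alpha(i,j) \;=\; \delta(x_i, \epsilon)\,\alpha(i-1, j)
            \;+\; \delta(\epsilon, y_j)\,\alpha(i, j-1)
            \;+\; \delta(x_i, y_j)\,\alpha(i-1, j-1)
```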
28/31
Backward
Evaluation of backward probabilities β(i, j): likelihood of picking an edit sequence that generates the suffix pair ⟨xᵢ₊₁…x_|x|, yⱼ₊₁…y_|y|⟩
Memoryless assumption allows efficient recursive computation (see the sketch below)
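The corresponding backward recursion (again my transcription; # is the end-of-sequence symbol, and each term applies only when the corresponding index is in range):

```latex
\beta(|x|,|y|) = \delta(\#),\qquad
\beta(i,j) \;=\; \delta(x_{i+1}, \epsilon)\,\beta(i+1, j)
           \;+\; \delta(\epsilon, y_{j+1})\,\beta(i, j+1)
           \;+\; \delta(x_{i+1}, y_{j+1})\,\beta(i+1, j+1)
```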
29/31
EM Formulation
Edit operations selected according to a probability distribution δ(⋅)
So, EM has to update δ based on occurrence counts of each operation (similar to the coin-tossing example)
Idea: accumulate expected counts from the forward and backward variables
γ(z): expected count of edit operation z
30/31
EM Details
γ(z): expected count of edit operation z
e.g., accumulated from the forward and backward probabilities, as sketched below
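For example, the expected count of the substitution (a, b) can be accumulated roughly as follows (my transcription of the standard forward-backward count; insertions δ(ε, b) and deletions δ(a, ε) are handled analogously, and P(x, y | θ′) is the total likelihood of the string pair):

```latex
\gamma(a, b) \;\mathrel{+}=\;
  \frac{\alpha(i-1, j-1)\,\delta(a, b)\,\beta(i, j)}{P(x, y \mid \theta')}
  \qquad \text{for every } i, j \text{ with } x_i = a,\; y_j = b
```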
31/31
References
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.
F. Jelinek. Statistical Methods for Speech Recognition, 1997.
M. Collins. The EM Algorithm, 1997.
J. A. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, U.C. Berkeley, 1998.
E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998, pp. 522-532.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 1989, pp. 257-286.
A. D'Souza. Using EM to estimate a probability density with a mixture of Gaussians.
M. Mohri. Edit-distance of weighted automata. In Proc. Implementation and Application of Automata (CIAA), 2002, pp. 1-23.
J. Glass. Lecture notes, MIT class 6.345: Automatic Speech Recognition, 2003.
C. Tomasi. Estimating Gaussian mixture densities with EM – a tutorial, 2004.
Wikipedia