TRANSCRIPT
UVA CS 6316/4501 – Fall 2016
Machine Learning
Lecture 21: EM (Extra)
Dr. Yanjun Qi
University of Virginia
Department of Computer Science
11/29/16
Dr. Yanjun Qi / UVA CS 6316 / f15
Where are we? → major sections of this course
– Regression (supervised)
– Classification (supervised)
– Feature selection
– Unsupervised models
– Dimension Reduction (PCA)
– Clustering (K-means, GMM/EM, Hierarchical)
– Learning theory
– Graphical models (BN and HMM slides shared)
Today Outline
• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm • Application • Theory
  – MCMC – samples rather than maximizing
Model Inference through Maximum Likelihood Estimation (MLE)
Assumption: the data comes from a known probability distribution whose parameters are unknown to you.
Example: the data is distributed as a Gaussian, $y_i \sim N(\mu, \sigma^2)$, so the unknown parameters here are $\theta = (\mu, \sigma^2)$.
MLE is a tool that estimates the unknown parameters of the probability distribution from data.
MLE: e.g. Single Gaussian Model (when p = 1)
• Need to adjust the parameters (→ model inference)
• so that the resulting distribution fits the observed data well
Maximum Likelihood revisited

$y_i \sim N(\mu, \sigma^2), \quad Y = \{y_1, y_2, \ldots, y_N\}$

$\ell(\theta) = \log(L(\theta; Y)) = \log \prod_{i=1}^{N} p(y_i)$
MLE: e.g. Single Gaussian Model
• Assume the observations $y_i$ are independent, $Y = \{y_1, y_2, \ldots, y_N\}$
• Form the likelihood:
$L(\theta; Y) = \prod_{i=1}^{N} p(y_i) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)$
• Form the log-likelihood:
$\ell(\theta) = \log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) = -\sum_{i=1}^{N} \frac{(y_i - \mu)^2}{2\sigma^2} - N \log(\sqrt{2\pi}\,\sigma)$
MLE: e.g. Single Gaussian Model
• To find the unknown parameter values, maximize the log-likelihood with respect to the unknown parameters:
$\frac{\partial \ell}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i; \qquad \frac{\partial \ell}{\partial \sigma^2} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{\mu})^2$
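The closed-form estimates above can be checked numerically. A minimal sketch in numpy (the function name and simulated data are mine, not from the lecture):

```python
import numpy as np

def gaussian_mle(y):
    """Closed-form MLE for a single Gaussian, from the derivatives above:
    mu_hat = sample mean, sigma2_hat = mean squared deviation (divides by N, not N-1)."""
    mu_hat = y.mean()
    sigma2_hat = ((y - mu_hat) ** 2).mean()
    return mu_hat, sigma2_hat

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100_000)  # true mu = 2.0, sigma^2 = 2.25
mu_hat, sigma2_hat = gaussian_mle(y)
```

With a large sample the estimates land close to the true parameters, illustrating why this MLE is "easy": the maximizer is available in closed form.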
MLE: A Challenging Mixture Example
[figure: histogram of the observed data]
Indicator variable $\Delta \in \{0,1\}$: $\pi$ is the probability with which the observation is chosen from density model 2, and $(1 - \pi)$ is the probability with which the observation is chosen from density model 1.
Mixture model:
$Y_1 \sim N(\mu_1, \sigma_1^2); \quad Y_2 \sim N(\mu_2, \sigma_2^2); \quad Y = (1 - \Delta)\,Y_1 + \Delta\,Y_2; \quad \Delta \in \{0,1\}$
$\theta_1 = (\mu_1, \sigma_1^2); \quad \theta_2 = (\mu_2, \sigma_2^2)$
MLE: Gaussian Mixture Example
Maximum likelihood fitting for the parameters $\theta = (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2)$:
$\ell(\theta; Y) = \sum_{i=1}^{N} \log\!\left[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right]$
Numerically (and of course analytically, too) challenging to solve!!
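The difficulty is visible in code: the observed-data log-likelihood is the log of a *sum* of densities, so it does not factor into per-component terms. A minimal sketch (helper names and simulated data are mine):

```python
import numpy as np

def norm_pdf(y, mu, sigma2):
    """Gaussian density phi_theta(y)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def mixture_loglik(y, pi, mu1, s1, mu2, s2):
    """Observed-data log-likelihood of the two-component mixture:
    the log sits outside the sum over components, so there is no closed-form maximizer."""
    return np.sum(np.log((1 - pi) * norm_pdf(y, mu1, s1) + pi * norm_pdf(y, mu2, s2)))

rng = np.random.default_rng(1)
N = 5_000
delta = rng.random(N) < 0.3
y = np.where(delta, rng.normal(4.0, 1.0, N), rng.normal(0.0, 1.0, N))

ll_true = mixture_loglik(y, 0.3, 0.0, 1.0, 4.0, 1.0)   # at the generating parameters
ll_bad = mixture_loglik(y, 0.3, 0.0, 1.0, 10.0, 1.0)   # with a badly misplaced mu_2
```

We can still *evaluate* this objective anywhere (the true parameters score higher than bad ones); what we cannot do is maximize it analytically, which is what motivates EM.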
Bayesian Methods & Maximum Likelihood
• Bayesian: Pr(model | data), i.e. the posterior, is proportional to Pr(data | model) Pr(model), i.e. likelihood × prior
• If we assume the prior is uniform, this is equal to MLE:
argmax_model Pr(data | model) Pr(model) = argmax_model Pr(data | model)
Today Outline
• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm • Application • Theory
  – MCMC – samples rather than maximizing
Here is the problem
All we have is the observed data,
from which we need to infer the likelihood function which generated the observations.
[figure: histogram of the observed data]
Expectation Maximization: add latent variable → latent data
EM augments the data space: it assumes latent data exists alongside the observations.
Computing log-likelihood based on complete data
Maximizing this form of log-likelihood is now tractable.
Note that we cannot analytically maximize the previous log-likelihood with only the observed $Y = \{y_1, y_2, \ldots, y_N\}$.
Complete data: $T = \{t_i = (y_i, \Delta_i),\; i = 1 \ldots N\}$
EM: The Complete Data Likelihood
By simple differentiations we have:
$\frac{\partial \ell}{\partial \mu_1} = 0 \;\Rightarrow\; \hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1 - \Delta_i)\, y_i}{\sum_{i=1}^{N} (1 - \Delta_i)}; \qquad \frac{\partial \ell}{\partial \sigma_1^2} = 0 \;\Rightarrow\; \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1 - \Delta_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1 - \Delta_i)}$
How do we get the latent variables?
So, maximization of the complete data likelihood is much easier!
EM: The Complete Data Likelihood
By simple differentiations we have:
$\frac{\partial \ell}{\partial \mu_2} = 0 \;\Rightarrow\; \hat{\mu}_2 = \frac{\sum_{i=1}^{N} \Delta_i\, y_i}{\sum_{i=1}^{N} \Delta_i}; \qquad \frac{\partial \ell}{\partial \sigma_2^2} = 0 \;\Rightarrow\; \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \Delta_i (y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \Delta_i}; \qquad \frac{\partial \ell}{\partial \pi} = 0 \;\Rightarrow\; \hat{\pi} = \frac{\sum_{i=1}^{N} \Delta_i}{N}$
How do we get the latent variables?
So, maximization of the complete data likelihood is much easier!
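If the indicators $\Delta_i$ really were observed, the updates above reduce to per-group sample means and variances. A minimal sketch in numpy, pretending the latent $\Delta_i$ are visible (the simulated data is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20_000
# Pretend the latent indicator Delta_i is observed.
delta = (rng.random(N) < 0.3).astype(float)
# Component 1: N(0, 4); component 2: N(4, 1).
y = np.where(delta == 1, rng.normal(4.0, 1.0, N), rng.normal(0.0, 2.0, N))

# Complete-data MLE, exactly the derivative solutions from the slides:
mu1 = np.sum((1 - delta) * y) / np.sum(1 - delta)
s1 = np.sum((1 - delta) * (y - mu1) ** 2) / np.sum(1 - delta)
mu2 = np.sum(delta * y) / np.sum(delta)
s2 = np.sum(delta * (y - mu2) ** 2) / np.sum(delta)
pi = delta.mean()  # pi_hat = sum(Delta_i) / N
```

Each estimate is a one-line closed form, which is exactly why the complete-data likelihood is "much easier" than the observed-data one.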
Obtaining Latent Variables
The latent variables are computed as expected values given the data and parameters:
$\gamma_i(\theta) = E(\Delta_i \mid \theta, y_i) = \Pr(\Delta_i = 1 \mid \theta, y_i)$
Apply Bayes' rule:
$\gamma_i(\theta) = \Pr(\Delta_i = 1 \mid \theta, y_i) = \frac{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta)}{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta) + \Pr(y_i \mid \Delta_i = 0, \theta)\,\Pr(\Delta_i = 0 \mid \theta)} = \frac{\phi_{\theta_2}(y_i)\,\pi}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}$
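This Bayes'-rule computation of the responsibilities $\gamma_i$ is a few lines of numpy. A minimal sketch (function names are mine):

```python
import numpy as np

def norm_pdf(y, mu, sigma2):
    """Gaussian density phi_theta(y)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def responsibilities(y, pi, mu1, s1, mu2, s2):
    """gamma_i = Pr(Delta_i = 1 | theta, y_i) via Bayes' rule."""
    num = pi * norm_pdf(y, mu2, s2)
    den = (1 - pi) * norm_pdf(y, mu1, s1) + num
    return num / den

# A point near component 1 (mu1 = 0) gets gamma ~ 0; near component 2 (mu2 = 4), gamma ~ 1.
gamma = responsibilities(np.array([0.0, 4.0]), 0.5, 0.0, 1.0, 4.0, 1.0)
```

The soft assignment degrades gracefully between the components: a point halfway between the two means would get $\gamma \approx 0.5$.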
Dilemma Situation
• We need to know the latent variables / data to maximize the complete log-likelihood and get the parameters
• We need to know the parameters to calculate the expected values of the latent variables / data
• → Solve through iterations
So we iterate → EM for Gaussian Mixtures…
EM for Gaussian Mixtures…
EM for Two-component Gaussian Mixture
• Initialize $\hat{\mu}_1, \hat{\sigma}_1, \hat{\mu}_2, \hat{\sigma}_2, \hat{\pi}$
• Iterate until convergence:
– Expectation of latent variables (E-step):
$\gamma_i(\hat{\theta}) = \frac{\hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)}{(1 - \hat{\pi})\,\phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\,\phi_{\hat{\theta}_2}(y_i)} = \frac{1}{1 + \frac{(1 - \hat{\pi})\,\hat{\sigma}_2}{\hat{\pi}\,\hat{\sigma}_1} \exp\!\left(-\frac{(y_i - \hat{\mu}_1)^2}{2\hat{\sigma}_1^2} + \frac{(y_i - \hat{\mu}_2)^2}{2\hat{\sigma}_2^2}\right)}$
– Maximization for finding parameters (M-step):
$\hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1 - \gamma_i)\, y_i}{\sum_{i=1}^{N} (1 - \gamma_i)}; \quad \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1 - \gamma_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1 - \gamma_i)}; \quad \hat{\mu}_2 = \frac{\sum_{i=1}^{N} \gamma_i\, y_i}{\sum_{i=1}^{N} \gamma_i}; \quad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \gamma_i (y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \gamma_i}; \quad \hat{\pi} = \frac{\sum_{i=1}^{N} \gamma_i}{N}$
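The whole two-component EM loop above fits in a short numpy function. A minimal sketch: the E- and M-steps follow the slide's updates exactly, but the initialization heuristic (spread the means to the data extremes) and all names are my own choices, not from the lecture:

```python
import numpy as np

def norm_pdf(y, mu, sigma2):
    """Gaussian density phi_theta(y)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def em_two_gaussians(y, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture."""
    mu1, mu2 = y.min(), y.max()      # crude spread-apart initialization (my choice)
    s1 = s2 = y.var()
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = Pr(Delta_i = 1 | theta, y_i)
        num = pi * norm_pdf(y, mu2, s2)
        gamma = num / ((1 - pi) * norm_pdf(y, mu1, s1) + num)
        # M-step: responsibility-weighted means, variances, and mixing weight
        w1, w2 = (1 - gamma).sum(), gamma.sum()
        mu1 = ((1 - gamma) * y).sum() / w1
        mu2 = (gamma * y).sum() / w2
        s1 = ((1 - gamma) * (y - mu1) ** 2).sum() / w1
        s2 = (gamma * (y - mu2) ** 2).sum() / w2
        pi = gamma.mean()
    return pi, mu1, s1, mu2, s2

rng = np.random.default_rng(3)
N = 10_000
delta = rng.random(N) < 0.3
y = np.where(delta, rng.normal(5.0, 1.0, N), rng.normal(0.0, 1.0, N))
pi_hat, mu1_hat, s1_hat, mu2_hat, s2_hat = em_two_gaussians(y)
```

On this well-separated simulated mixture the iterates recover the generating parameters; with overlapping components or unlucky initialization, EM can instead settle at a poorer local optimum, which is why the summary slide recommends trying several starting points.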
EM in… simple words
• Given observed data, you need to come up with a generative model
• You choose a model that comprises some hidden variables $\Delta_i$ (this is your belief!)
• Problem: to estimate the parameters of the model
– Assume some initial parameter values
– Replace the values of the hidden variables $\Delta_i$ with their expectations (given the old parameters)
– Recompute new values of the parameters (given the expected $\Delta_i$)
– Check for convergence using the log-likelihood
EM – Example (cont'd)
Selected iterations of the EM algorithm for the mixture example:

Iteration | $\hat{\pi}$
1  | 0.485
5  | 0.493
10 | 0.523
15 | 0.544
20 | 0.546

[figure: histogram of the data with the fitted mixture]
EM Summary
• An iterative approach for MLE
• Good idea when you have missing or latent data
• Has a nice property of convergence
• Can get stuck in local optima (try different starting points)
• Generally hard to calculate the expectation over all possible values of the hidden variables
• Still not much known about the rate of convergence
Today Outline
• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm • Application • Theory
  – MCMC – samples rather than maximizing
Applications of EM
– Mixture models
– HMMs
– Latent variable models
– Missing data problems
– …
Applications of EM (1)
• Fitting mixture models
Applications of EM (2)
• Probabilistic Latent Semantic Analysis (pLSA)
– Technique from text for topic modeling
[figure: pLSA graphical model, $P(w,d)$ decomposed via $P(w \mid z)$ and $P(z \mid d)$ over topics Z, words W, documents D]
Applications of EM (3)
• Learning parts and structure models
Applications of EM (4)
• Automatic segmentation of layers in video
http://www.psi.toronto.edu/images/figures/cutouts_vid.gif
Expectation Maximization (EM)
• Old idea (late 50's) but formalized by Dempster, Laird and Rubin in 1977
• Subject of much investigation. See the McLachlan & Krishnan book, 1997.
single-variable + two-cluster case
multi-variable + multi-cluster case
Today Outline
• Principles for Model Inference
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Strategies for Model Inference
  – EM Algorithm – simplify difficult MLE
    • Algorithm • Application • Theory
  – MCMC – samples rather than maximizing
Why is Learning Harder?
• In fully observed iid settings, the complete log-likelihood decomposes into a sum of local terms:
$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$
• With latent variables, all the parameters become coupled together via marginalization:
$\ell(\theta; D) = \log p(x \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$
Gradient Learning for mixture models
• We can learn mixture densities using gradient descent on the observed log-likelihood. The gradients are quite interesting:
$\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k\, p_k(x \mid \theta_k)$
$\frac{\partial \ell}{\partial \theta_k} = \frac{\pi_k}{p(x \mid \theta)} \frac{\partial p_k(x \mid \theta_k)}{\partial \theta_k} = \frac{\pi_k\, p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k} = r_k \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k}$
with responsibility $r_k = \frac{\pi_k\, p_k(x \mid \theta_k)}{p(x \mid \theta)}$.
• In other words, the gradient is the responsibility-weighted sum of the individual log-likelihood gradients.
• Can pass this to a conjugate gradient routine.
Parameter Constraints
• Often we have constraints on the parameters, e.g. $\sum_{j=1}^{K} \pi_j = 1$, $\Sigma_k$ being symmetric positive definite.
• We can use constrained optimization, or we can re-parameterize in terms of unconstrained values.
– For normalized weights, use softmax, e.g. $\pi_j = \frac{e^{\lambda_j}}{\sum_k e^{\lambda_k}}$
– For covariance matrices, use the Cholesky decomposition: $\Sigma^{-1} = A^T A$, where $A$ is upper diagonal with positive diagonal:
$A_{ii} = e^{\eta_i} > 0; \quad A_{ij} = \lambda_{ij}\ (j > i); \quad A_{ij} = 0\ (j < i)$
– Use the chain rule to compute $\frac{\partial \ell}{\partial \lambda}, \frac{\partial \ell}{\partial A}$.
Identifiability
• A mixture model induces a multi-modal likelihood.
• Hence gradient ascent can only find a local maximum.
• Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood.
• Hence we should be careful in trying to interpret the "meaning" of latent variables.
Expectation-Maximization (EM) Algorithm
• EM is an iterative algorithm with two linked steps:
– E-step: fill in hidden values using inference: $p(z \mid x, \theta^t)$.
– M-step: update the parameters to $\theta^{(t+1)}$ using standard MLE/MAP methods applied to the completed data.
• We will prove that this procedure monotonically improves $\ell$ (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood.
Theory underlying EM
• What are we doing?
• Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.
• But we do not observe $z$, so computing
$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$
is difficult!
• What shall we do?
(1) Incomplete Log Likelihood
• Incomplete log-likelihood: with $z$ unobserved, our objective becomes the log of a marginal probability:
$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$
– This objective won't decouple.
(2) Complete Log Likelihoods
• Complete log-likelihood: let $X$ denote the observable variable(s), and $Z$ denote the latent variable(s). If $Z$ could be observed, then
$\ell_c(\theta; x, z) \overset{\text{def}}{=} \log p(x, z \mid \theta) = \log p(z \mid \theta_z)\, p(x \mid z, \theta_x)$
– Usually, optimizing $\ell_c(\cdot)$ given both $z$ and $x$ is straightforward (c.f. MLE for fully observed models).
– Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors; the parameter for each factor can be estimated separately.
– But given that $Z$ is not observed, $\ell_c(\cdot)$ is a random quantity and cannot be maximized directly.

Three types of log-likelihood over multiple observed samples $(x_1, x_2, \ldots, x_N)$:
– Complete log-likelihood (CLL)
– Log-likelihood [Incomplete log-likelihood (ILL)]
– Expected complete log-likelihood (ECLL)
Notation: observed data $x$, latent variables $z$, iteration index $t$.
(3) Expected Complete Log Likelihood
• For any distribution $q(z)$, define the expected complete log-likelihood (ECLL):
$\text{ECLL} = \langle \ell_c(\theta; x, z) \rangle_q \overset{\text{def}}{=} \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$
• The CLL is a random variable → the ECLL is a deterministic function of $q$
• Linear in the CLL → inherits its factorizability
• Does maximizing this surrogate yield a maximizer of the likelihood? → Jensen's inequality
Jensen's inequality
$\text{ECLL} = \langle \ell_c(\theta; x, z) \rangle_q \overset{\text{def}}{=} \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$

$\text{ILL} = \ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$
$= \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)}$
$\ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}$ (Jensen's inequality)
$= \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)$
$= \text{ECLL} + H_q$ (entropy term)

$\Rightarrow \ell(\theta; x) \ge \langle \ell_c(\theta; x, z) \rangle_q + H_q$
Lower Bounds and Free Energy
• For fixed data $x$, define a functional called the free energy:
$F(q, \theta) \overset{\text{def}}{=} \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \le \ell(\theta; x)$
• The EM algorithm is coordinate-ascent on $F$:
– E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
– M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
How does EM optimize the ILL?
E-step: maximization of $F$ w.r.t. $q$
• Claim: $q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$
– This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform clustering).
• Proof (easy): this setting attains the bound $\ell(\theta^t; x) \ge F(q, \theta^t)$:
$F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$
• Can also show this result using variational calculus or the fact that
$\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$
E-step: Alternative derivation
$\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)$
M-step: maximization of $F$ w.r.t. $\theta$
• Note that the free energy breaks into two terms:
$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$
– The first term is the expected complete log-likelihood (energy), and the second term, which does not depend on $\theta$, is the entropy.
M-step: maximization of $F$ w.r.t. $\theta$
• Thus, in the M-step, maximizing with respect to $\theta$ for fixed $q$, we only need to consider the first term:
$\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta)$
– Under the optimal $q^{t+1}$, this is equivalent to solving a standard MLE of the fully observed model $p(x, z \mid \theta)$, with the sufficient statistics involving $z$ replaced by their expectations w.r.t. $p(z \mid x, \theta)$.
Summary: EM Algorithm
• A way of maximizing the likelihood function for latent variable models. Finds the MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some "missing" or "unobserved" data from the observed data and current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.
• Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
– E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
– M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
• In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
How does EM optimize the ILL?
A Report Card for EM
• Some good things about EM:
– no learning rate (step-size) parameter
– automatically enforces parameter constraints
– very fast for low dimensions
– each iteration guaranteed to improve the likelihood
– calls inference and fully observed learning as subroutines
• Some bad things about EM:
– can get stuck in local optima
– can be slower than conjugate gradient (especially near convergence)
– requires an expensive inference step
– is a maximum likelihood/MAP method
References
• Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
• The EM Algorithm and Extensions by Geoffrey J. McLachlan and Thriyambakam Krishnan