Transcript of cse802/EM_802.pdf
Finite Mixture Models and Expectation
Maximization
Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain
and Dr. Rong Jin
Recall: The Supervised Learning Problem
- Given a set of n samples X = {(x_i, y_i)}, i = 1,…,n (Chapter 3 of DHS)
- Assume examples in each class come from a parameterized Gaussian density
- Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification
- Estimation uses the Maximum Likelihood approach.
Review of Maximum Likelihood

- Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ.
- Goal: estimate θ, denoted by \hat{\theta}, such that the observed data is most likely to have come from the distribution with that θ.
- Steps involved:
  - Write the likelihood of the observed data.
  - Maximize the likelihood with respect to the parameter.
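For a 1D Gaussian with known variance, these two steps can be carried out in closed form; a standard derivation (not spelled out on the slide) for the mean:

```latex
L(\mu) = \sum_{i=1}^{n} \log p(x_i;\mu)
       = -\frac{n}{2}\log(2\pi\sigma^2)
         - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
\qquad
\frac{dL}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
\;\Longrightarrow\;
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

Setting the derivative of the log-likelihood to zero recovers the sample mean, which is the MLE plotted on the next slide.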
Example: 1D Gaussian Distribution
[Figure: Maximum likelihood estimation of the mean of a Gaussian distribution. Curves show the true density, the MLE from 10 samples, and the MLE from 100 samples.]
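The experiment behind this figure is easy to reproduce (a minimal sketch assuming NumPy; the true parameters below are illustrative, not read off the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 1.0, 1.0   # illustrative ground truth

for n in (10, 100, 10000):
    x = rng.normal(mu_true, sigma_true, size=n)
    mu_hat = x.mean()    # ML estimate of the mean
    var_hat = x.var()    # ML estimate of the variance (the biased, 1/n version)
    print(f"n={n}: mu_hat={mu_hat:.3f}, var_hat={var_hat:.3f}")
```

As n grows, the estimated density converges to the true one, matching the 10- vs. 100-sample curves in the figure.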
Example: 2D Gaussian Distribution

[Figure: 2D parameter estimation for a Gaussian distribution. Blue: true density; red: density estimated from 50 examples.]
Multimodal Class Distributions

- A single Gaussian may not accurately model the classes.
- Find subclasses in handwritten "online" characters (122,000 characters written by 100 writers)
- Performance improves by modeling subclasses

Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, Mar 2002
Multimodal Classes

A single Gaussian distribution may not model the classes accurately.

[Figure: handwritten 'f' vs. 'y' classification task.]
An extreme example of multimodal classes
Red vs. Blue classification. The classes are well separated. However, incorrect model assumptions result in high classification error.

[Figure: scatter of the two classes in the (x, y) plane.]

Limitations of unimodal class modelling

The 'red' class is a "mixture of two Gaussian distributions". There is no class label information when modeling the density of just the red class.
Finite mixtures

k random sources, with probability density functions f_i(x), i = 1,…,k. Choose one source at random; X is the resulting random variable.

[Figure: diagram of sources f_1(x), f_2(x), …, f_i(x), …, f_k(x) feeding a random selector that outputs X.]
Finite mixtures

Example: 3 species (Iris)
Finite mixtures

Choose a source at random, with Prob(source i) = \alpha_i; X is the resulting random variable, with source densities f_1(x), …, f_k(x).

Conditional:     f(x \mid \text{source } i) = f_i(x)
Joint:           f(x, \text{source } i) = \alpha_i f_i(x)
Unconditional:   f(x) = \sum_{\text{all sources}} f(x, \text{source } i) = \sum_{i=1}^{k} \alpha_i f_i(x)
Finite mixtures

f(x) = \sum_{i=1}^{k} \alpha_i f_i(x)

Component densities: f_i(x).
Mixing probabilities: \sum_{i=1}^{k} \alpha_i = 1 and \alpha_i \ge 0.

Parameterized components (e.g., Gaussian): f_i(x) = f(x \mid \theta_i), so

f(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i f(x \mid \theta_i),   with   \Theta = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_k\}
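The generative process just described (pick source i with probability α_i, then draw from f_i) can be sketched directly; a minimal version assuming NumPy, with 1D Gaussian components and the parameters of the d = 1 example that appears a few slides later:

```python
import numpy as np

alphas = np.array([0.6, 0.3, 0.1])   # mixing probabilities (sum to 1)
mus    = np.array([-2.0, 4.0, 7.0])  # component means
sigmas = np.array([3.0, 2.0, 0.1])   # component standard deviations

rng = np.random.default_rng(0)

def sample(n):
    """Two-stage sampling: choose a source at random, then draw from it."""
    z = rng.choice(len(alphas), size=n, p=alphas)  # hidden source labels
    return rng.normal(mus[z], sigmas[z])

def density(x):
    """Unconditional density f(x) = sum_i alpha_i f_i(x)."""
    x = np.atleast_1d(x)[:, None]
    f_i = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    return f_i @ alphas
```

The labels z are exactly the "missing data" that EM will later have to reason about.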
Gaussian mixtures

f(x \mid \theta_i) Gaussian:

Arbitrary covariances: f(x \mid \theta_i) = N(x \mid \mu_i, C_i),
  \Theta = \{\mu_1, \ldots, \mu_k, C_1, \ldots, C_k, \alpha_1, \ldots, \alpha_{k-1}\}

Common covariance: f(x \mid \theta_i) = N(x \mid \mu_i, C),
  \Theta = \{\mu_1, \ldots, \mu_k, C, \alpha_1, \ldots, \alpha_{k-1}\}
Mixture fitting / estimation

Data: n independent observations, x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}

Goals: estimate the parameter set \Theta, and maybe "classify the observations".

Example:
- How many species? Mean of each species?
- Which points belong to each species?

[Figure: observed data vs. classified data (classes unknown).]
Gaussian mixtures (d = 1), an example

\mu_1 = -2,   \sigma_1 = 3,     \alpha_1 = 0.6
\mu_2 = 4,    \sigma_2 = 2,     \alpha_2 = 0.3
\mu_3 = 7,    \sigma_3 = 0.1,   \alpha_3 = 0.1
Gaussian mixtures, an R^2 example (1500 points)

k = 3

\mu_1 = \begin{bmatrix} -4 \\ 4 \end{bmatrix},   C_1 = \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}
\mu_2 = \begin{bmatrix} 3 \\ 3 \end{bmatrix},    C_2 = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}
\mu_3 = \begin{bmatrix} -4 \\ 0 \end{bmatrix},   C_3 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
Uses of mixtures in pattern recognition

Unsupervised learning (model-based clustering):
- each component models one cluster
- clustering = mixture fitting

Observations: unclassified points.
Goals: find the classes, classify the points.
Uses of mixtures in pattern recognition

Mixtures can approximate arbitrary densities, so they are well suited to representing class-conditional densities in supervised learning.

Example:
- two strongly non-Gaussian classes
- use mixtures to model each class-conditional density
Uses of mixtures in pattern recognition

- Find subclasses (lexemes), e.g., in "online" characters (122,000 characters written by 100 writers)
- Performance improves by modeling subclasses

Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, 2002
Fitting mixtures

n independent observations: x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}

Maximum (log-)likelihood (ML) estimate of \Theta:

\hat{\Theta} = \arg\max_{\Theta} L(x, \Theta)

L(x, \Theta) = \log \prod_{j=1}^{n} f(x^{(j)} \mid \Theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i f(x^{(j)} \mid \theta_i)   (the inner sum is the mixture density)

The ML estimate has no closed-form solution.
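Even without a closed form, the log-likelihood itself is easy to evaluate; numerically the inner sum is best handled in log space with a log-sum-exp, since component densities easily underflow. A 1D sketch assuming NumPy (the function name is mine):

```python
import numpy as np

def gmm_loglik(x, alphas, mus, sigmas):
    """L(x, Theta) = sum_j log sum_i alpha_i N(x^(j) | mu_i, sigma_i^2), in 1D."""
    x = np.asarray(x, dtype=float)[:, None]
    log_terms = (np.log(alphas) - np.log(sigmas) - 0.5 * np.log(2 * np.pi)
                 - 0.5 * ((x - mus) / sigmas) ** 2)
    # stable log-sum-exp over the k components
    m = log_terms.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))).sum())
```

This is the quantity EM will monotonically increase.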
Gaussian mixtures: a peculiar type of ML problem

\Theta = \{\mu_1, \ldots, \mu_k, C_1, \ldots, C_k, \alpha_1, \ldots, \alpha_{k-1}\}

Maximum (log-)likelihood (ML) estimate of \Theta:

\hat{\Theta} = \arg\max_{\Theta} L(x, \Theta)

Subject to: C_i positive definite, \sum_{i=1}^{k} \alpha_i = 1, and \alpha_i \ge 0.

Problem: the likelihood function is unbounded as \det(C_i) \to 0. There is no global maximum; the unusual goal is a "good" local maximum.
A peculiar type of ML problem

Example: a 2-component Gaussian mixture,

f(x \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \alpha)
  = \frac{\alpha}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}
  + \frac{1-\alpha}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}

Some data points: \{x_1, x_2, \ldots, x_n\}. Set \mu_1 = x_1. Then

L(x, \Theta) = \log\!\left[\frac{\alpha}{\sqrt{2\pi}\,\sigma_1}
  + \frac{1-\alpha}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x_1-\mu_2)^2}{2\sigma_2^2}}\right]
  + \sum_{j=2}^{n} \log(\ldots) \;\to\; \infty,   as \sigma_1^2 \to 0.
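The degeneracy can be checked numerically: anchor µ1 on one data point and shrink σ1, and the log-likelihood grows without bound (a sketch assuming NumPy; the data and the fixed second component are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20)      # some data points
alpha, mu2, sigma2 = 0.5, 0.0, 1.0     # second component held fixed
mu1 = x[0]                             # put the first mean exactly on x_1

def loglik(sigma1):
    c1 = alpha * np.exp(-0.5 * ((x - mu1) / sigma1) ** 2) / (np.sqrt(2 * np.pi) * sigma1)
    c2 = (1 - alpha) * np.exp(-0.5 * ((x - mu2) / sigma2) ** 2) / (np.sqrt(2 * np.pi) * sigma2)
    return np.log(c1 + c2).sum()

for s in (1e-3, 1e-6, 1e-9, 1e-12):
    print(f"sigma1={s:g}: L={loglik(s):.2f}")   # keeps growing as sigma1 -> 0
```

Every data point except x_1 is still covered by the second component, so nothing else in the sum blows down; only the spike on x_1 blows up.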
Fitting mixtures: a missing data problem

The ML estimate has no closed-form solution; the standard alternative is the expectation-maximization (EM) algorithm.

Missing data problem:

Observed data: x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}
Missing data (labels, "colors"): z = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}

z^{(j)} = [z_1^{(j)}, z_2^{(j)}, \ldots, z_k^{(j)}]^T = [0, \ldots, 0, 1, 0, \ldots, 0]^T

with a "1" at position i \iff x^{(j)} was generated by component i.
Fitting mixtures: a missing data problem

Observed data: x = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}
Missing data: z = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\},  z^{(j)} = [z_1^{(j)}, \ldots, z_k^{(j)}]^T  (k−1 zeros, one "1")

Complete log-likelihood function:

L_c(x, z, \Theta) = \sum_{j=1}^{n} \log f(x^{(j)}, z^{(j)} \mid \Theta)
                  = \sum_{j=1}^{n} \sum_{i=1}^{k} z_i^{(j)} \log\!\left(\alpha_i f(x^{(j)} \mid \theta_i)\right)

In the presence of both x and z, \Theta would be easy to estimate, …but z is missing.
The EM algorithm

Iterative procedure: \hat{\Theta}^{(0)}, \hat{\Theta}^{(1)}, \ldots, \hat{\Theta}^{(t)}, \hat{\Theta}^{(t+1)}, \ldots

The E-step: compute the expected value of L_c(x, z, \Theta):

E[L_c(x, z, \Theta) \mid x, \hat{\Theta}^{(t)}] \equiv Q(\Theta, \hat{\Theta}^{(t)})

The M-step: update the parameter estimates:

\hat{\Theta}^{(t+1)} = \arg\max_{\Theta} Q(\Theta, \hat{\Theta}^{(t)})

Under mild conditions: \hat{\Theta}^{(t)} \to a local maximum of L(x, \Theta) as t \to \infty.
The EM algorithm: the Gaussian case

The E-step:

Q(\Theta, \hat{\Theta}^{(t)}) \equiv E_z[L_c(x, z, \Theta) \mid x, \hat{\Theta}^{(t)}]
                              = L_c(x, E[z \mid x, \hat{\Theta}^{(t)}], \Theta)

because L_c(x, z, \Theta) is linear in z. Since z_i^{(j)} is a binary variable,

E[z_i^{(j)} \mid x, \hat{\Theta}^{(t)}] = \Pr\{z_i^{(j)} = 1 \mid x, \hat{\Theta}^{(t)}\}
  \equiv w_i^{(j,t)}
  = \frac{\hat{\alpha}_i^{(t)} f(x^{(j)} \mid \hat{\theta}_i^{(t)})}
         {\sum_{m=1}^{k} \hat{\alpha}_m^{(t)} f(x^{(j)} \mid \hat{\theta}_m^{(t)})}   (Bayes law)

w_i^{(j,t)} is the estimate, at iteration t, of the probability that x^{(j)} was produced by component i: a "soft" probabilistic assignment.
The EM algorithm: the Gaussian case

Result of the E-step: w_i^{(j,t)}, the estimate at iteration t of the probability that x^{(j)} was produced by component i.

The M-step:

\hat{\alpha}_i^{(t+1)} = \frac{1}{n} \sum_{j=1}^{n} w_i^{(j,t)}

\hat{\mu}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)} x^{(j)}}{\sum_{j=1}^{n} w_i^{(j,t)}}

\hat{C}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)} (x^{(j)} - \hat{\mu}_i^{(t+1)})(x^{(j)} - \hat{\mu}_i^{(t+1)})^T}{\sum_{j=1}^{n} w_i^{(j,t)}}
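These updates drop straight into code; a minimal 1D sketch assuming NumPy (the function name, quantile initialization, and fixed iteration count are my choices, not the slides'):

```python
import numpy as np

def em_gmm_1d(x, k, iters=200):
    """EM for a k-component 1D Gaussian mixture, using the slide updates."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    alphas = np.full(k, 1.0 / k)
    mus = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
    sigmas = np.full(k, x.std())
    for _ in range(iters):
        # E-step: w[j, i] = posterior probability that x^(j) came from component i
        f = (np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2)
             / (np.sqrt(2 * np.pi) * sigmas))
        w = alphas * f
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates (sigma uses the *updated* means, as on the slide)
        nk = w.sum(axis=0)
        alphas = nk / n
        mus = (w * x[:, None]).sum(axis=0) / nk
        sigmas = np.sqrt((w * (x[:, None] - mus) ** 2).sum(axis=0) / nk)
    return alphas, mus, sigmas
```

On well-separated data this typically recovers the generating parameters up to a permutation of the component labels; poor initializations can land in worse local maxima, which is exactly the initialization dependence discussed next.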
Difficulties with EM

- It is a local (greedy) algorithm (the likelihood never decreases)
- Initialization dependent

[Figure: two runs from different initializations, converging after 74 and 270 iterations.]
Automatically deciding the number of components

- Add a penalty term to the objective function which increases with the number of clusters.
- Start with a large number of clusters.
- Modify the M-step to include a "killer criterion" that removes components satisfying a certain criterion.
- Finally, choose the number of components with the largest objective function value (likelihood − penalty).
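A familiar penalty with this shape is BIC: score each candidate k by its maximized log-likelihood minus ½ · (number of free parameters) · log n (this is shown only as a concrete instance; the slides' killer-criterion scheme is related but not identical). A minimal selection sketch assuming NumPy, with hypothetical fitted log-likelihoods as input:

```python
import numpy as np

def select_k(logliks, n, d=1):
    """Choose the number of components maximizing (log-likelihood - penalty).

    logliks[i] is the maximized log-likelihood of a model with k = i + 1
    Gaussian components fitted to n d-dimensional points.
    """
    best_k, best_score = 0, -np.inf
    for i, ll in enumerate(logliks):
        k = i + 1
        # k means (d each) + k covariances (d(d+1)/2 each) + (k-1) free weights
        n_params = k * d + k * d * (d + 1) // 2 + (k - 1)
        score = ll - 0.5 * n_params * np.log(n)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

When adding a component stops buying much likelihood, the penalty takes over and a smaller k wins.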
Example
Same as in [Ueda and Nakano, 1998].
Example

k = 4,  n = 1200,  \alpha_m = \tfrac{1}{4},  C = I,  k_{max} = 10

\mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\mu_2 = \begin{bmatrix} 4 \\ 0 \end{bmatrix},
\mu_3 = \begin{bmatrix} 0 \\ 4 \end{bmatrix},
\mu_4 = \begin{bmatrix} 4 \\ 4 \end{bmatrix}
Example
Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].
Example

An example with overlapping components.
The iris (4-dim.) data-set: 3 components correctly identified
Another supervised learning example

Problem: learn to classify textures from 19 Gabor features.
- Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture (four classes).
- Test on the remaining data.

Method                    Error rate
Quadratic discriminant    0.0155
Linear discriminant       0.0185
Mixture-based             0.0074
[Figures: 2-D projection of the texture data and the obtained mixtures; resulting decision regions.]
Properties of EM

EM is extremely popular because of the following properties:
• Easy to implement
• Guarantees that the likelihood increases monotonically (why?)
• Guarantees convergence of the solution to a stationary point, i.e., a local maximum (why?)

Limitations of EM
• The resulting solution depends highly on the initialization
• Can be slow in several cases compared to direct optimization methods (e.g., iterative scaling)
EM as lower bound optimization

• Start with an initial guess \{\theta_1^0, \theta_2^0\} for maximizing l(\theta_1, \theta_2).
EM as lower bound optimization

• Start with an initial guess \{\theta_1^0, \theta_2^0\}.
• Come up with a lower bound on l(\theta_1, \theta_2):

  l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)

  Q(\theta_1, \theta_2) is a concave function, with touch point Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0.
EM as lower bound optimization

• Start with an initial guess \{\theta_1^0, \theta_2^0\}.
• Come up with a lower bound l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2), where Q is concave with touch point Q(\theta_1^0, \theta_2^0) = 0.
• Search for the solution \{\theta_1^1, \theta_2^1\} that maximizes Q(\theta_1, \theta_2).
EM as lower bound optimization

• Start with an initial guess \{\theta_1^0, \theta_2^0\}.
• Come up with a lower bound l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2), where Q is concave with touch point Q(\theta_1^0, \theta_2^0) = 0.
• Search for the solution \{\theta_1^1, \theta_2^1\} that maximizes Q(\theta_1, \theta_2).
• Repeat the procedure with a new bound l(\theta_1, \theta_2) \ge l(\theta_1^1, \theta_2^1) + Q(\theta_1, \theta_2), giving \{\theta_1^2, \theta_2^2\}.
EM as lower bound optimization

• Start with an initial guess \{\theta_1^0, \theta_2^0\}.
• Come up with a lower bound l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2), where Q is concave with touch point Q(\theta_1^0, \theta_2^0) = 0.
• Search for the solution that maximizes Q(\theta_1, \theta_2).
• Repeat the procedure: \{\theta_1^0, \theta_2^0\}, \{\theta_1^1, \theta_2^1\}, \{\theta_1^2, \theta_2^2\}, \ldots
• Converge to the locally optimal point.
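Where does such a bound come from? A standard derivation (not spelled out on these slides) applies Jensen's inequality with any distribution q(z) over the hidden labels:

```latex
\log f(x \mid \Theta)
  = \log \sum_{z} f(x, z \mid \Theta)
  = \log \sum_{z} q(z)\, \frac{f(x, z \mid \Theta)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{f(x, z \mid \Theta)}{q(z)}
```

Equality (the "touch point") holds when q(z) = f(z \mid x, \hat{\Theta}^{(t)}), which is exactly what the E-step computes; the M-step then maximizes the resulting bound over \Theta.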
Summary
• Expectation-Maximization algorithm
  • E-step: compute the expected complete-data likelihood
  • M-step: maximize that expectation to find the parameters
• Can be used with any model with hidden (latent) variables
  • Hidden variables can be natural to the model or can be artificially introduced
  • Makes parameter estimation simpler and more efficient
• The EM algorithm can be explained from many perspectives
  • Bound optimization
  • Proximal point optimization, etc.
• Several generalizations/specializations exist
• Easy to implement, and widely used!