18-661 Introduction to Machine Learning
The EM Algorithm
Spring 2020
ECE – Carnegie Mellon University
Announcements
• HW 6 is due on Sunday, April 12.
• HW 7 will be released later this week.
• HW 7 will be entirely focused on programming, with the goal of
letting you gain experience in implementing ML models.
• Lecture on April 13 will be a PyTorch tutorial by Ritwick. Please
attend if you are not familiar with PyTorch. You will need to know
your way around PyTorch for HW 7.
• You cannot drop HW 7. Its score will be weighted equally with each
of your best 5 scores from the first six homeworks to determine your
overall homework grade.
• The final exam will be on Wednesday, April 29. It will have online
and take-home parts.
Outline
1. Recap: Gaussian Mixture Models
2. EM Algorithm
Recap: Gaussian Mixture Models
Probabilistic interpretation of clustering?
How can we model p(x) to reflect our intuition that points stay close to
their cluster centers?
[Figure (b): scatter plot of the data points]
• Points seem to form 3 clusters
• We cannot model p(x) with simple and known distributions
• E.g., the data is not a Gaussian b/c we have 3 distinct concentrated regions...
Gaussian mixture models: Intuition
[Figure (a): scatter plot of the data points]
• Key idea: Model each region
with a distinct distribution
• Can use Gaussians — Gaussian
mixture models (GMMs)
• However, we don’t know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components!
• Must learn these values from unlabeled data D = {xn}_{n=1}^{N}
Recall: Gaussian (normal) distributions
x ∼ N(µ, Σ):   f(x) = exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)) / √((2π)^d |Σ|),   where d is the dimension of x
Gaussian mixture models: Formal definition
GMM has the following density function for x
p(x) = ∑_{k=1}^{K} ωk N(x | µk, Σk)
• K : number of Gaussians — they are called mixture components
• µk and Σk : mean and covariance matrix of k-th component
• ωk : mixture weights (or priors) represent how much each component
contributes to final distribution. They satisfy 2 properties:
∀ k, ωk > 0, and ∑k ωk = 1
These properties ensure p(x) is in fact a probability density function.
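To make this concrete, here is a minimal sketch (my own illustration, with made-up example parameters) that evaluates the GMM density p(x) = ∑_{k=1}^{K} ωk N(x|µk, Σk) at a point, assuming scipy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D GMM with K = 3 components (example parameters, not from the slides)
weights = np.array([0.3, 0.5, 0.2])                                        # mixture weights, sum to 1
means = [np.array([0.2, 0.3]), np.array([0.7, 0.7]), np.array([0.5, 0.1])]
covs = [0.01 * np.eye(2), 0.02 * np.eye(2), 0.01 * np.eye(2)]

def gmm_density(x, weights, means, covs):
    """p(x) = sum_k omega_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([0.65, 0.72]), weights, means, covs))
```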
GMM as the marginal distribution of a joint distribution
Consider the following joint distribution
p(x , z) = p(z)p(x |z)
where z is a discrete random variable taking values between 1 and K .
Denote
ωk = p(z = k)
Now, assume the conditional distributions are Gaussian distributions
p(x |z = k) = N(x |µk ,Σk)
Then, the marginal distribution of x is
p(x) = ∑_{k=1}^{K} ωk N(x | µk, Σk)
Namely, the Gaussian mixture model
Gaussian mixture model for clustering
GMMs: Example
[Figure (a): data points colored red, blue, and green by component]
The conditional distributions of x given z (representing the color) are
p(x |z = red) = N(x |µ1,Σ1)
p(x |z = blue) = N(x |µ2,Σ2)
p(x |z = green) = N(x |µ3,Σ3)
[Figure (b): the same data points without color labels]
The marginal distribution is thus
p(x) = p(z = red)N(x |µ1,Σ1)
+ p(z = blue)N(x |µ2,Σ2)
+ p(z = green)N(x |µ3,Σ3)
Parameter estimation for Gaussian mixture models
The parameters in GMMs are:
θ = {ωk, µk, Σk}_{k=1}^{K}
Let’s first consider the (unrealistic) case where we know the labels z .
Define D′ = {xn, zn}_{n=1}^{N} and D = {xn}_{n=1}^{N}
• D′ is the complete data
• D the incomplete data
How can we learn our parameters?
Given D′, the maximum likelihood estimate of θ is given by

θ = arg max ∑n log p(xn, zn)
Parameter estimation for GMMs: Complete data
The complete log-likelihood decomposes across the labels:

∑n log p(xn, zn) = ∑n log p(zn)p(xn|zn) = ∑k ∑_{n: zn=k} log p(zn)p(xn|zn)
where we have grouped data by cluster labels zn.
Let rnk ∈ {0, 1} be a binary variable that indicates whether zn = k:

∑n log p(xn, zn) = ∑k ∑n rnk log p(z = k)p(xn|z = k)
                 = ∑k ∑n rnk [log ωk + log N(xn|µk, Σk)]
In the complete setting, the rnk just add to the notation, but later we will
‘relax’ these variables and allow them to take on fractional values.
Parameter estimation for GMMs: Complete data
The MLE is:

ωk = ∑n rnk / ∑k′ ∑n rnk′ ,   µk = (1 / ∑n rnk) ∑n rnk xn

Σk = (1 / ∑n rnk) ∑n rnk (xn − µk)(xn − µk)ᵀ

Since rnk is binary, the previous solution is nothing but:
• ωk : fraction of total data points whose cluster label zn is k
  • note that ∑k′ ∑n rnk′ = N
• µk : mean of all data points whose label zn is k
• Σk : covariance of all data points whose label zn is k
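As a sanity check on these formulas, here is a minimal numpy sketch (my own illustration, with made-up labeled data) of the complete-data MLE when the labels zn are known:

```python
import numpy as np

def gmm_mle_complete(X, z, K):
    """Complete-data MLE for a GMM. X: (N, d) data, z: (N,) labels in {0, ..., K-1}."""
    N, d = X.shape
    R = np.eye(K)[z]                          # r_nk: one-hot (hard) assignments, shape (N, K)
    Nk = R.sum(axis=0)                        # sum_n r_nk: number of points in each component
    omega = Nk / N                            # fraction of points whose label is k
    mu = (R.T @ X) / Nk[:, None]              # per-component means
    Sigma = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k]   # per-component covariances
    return omega, mu, Sigma

# Toy example with made-up data and labels
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.5, 0.5]])
z = np.array([0, 0, 1, 1, 1])
print(gmm_mle_complete(X, z, K=2))
```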
Parameter estimation for GMMs: Incomplete data
GMM Parameters
θ = {ωk, µk, Σk}_{k=1}^{K}
Incomplete Data
Our data contains observed and unobserved data, and hence is
incomplete:
• Observed: D = {xn}
• Unobserved (hidden): {zn}
Goal: Obtain the maximum likelihood estimate of θ:
θ = arg max log p(D) = arg max ∑n log p(xn) = arg max ∑n log ∑zn p(xn, zn)

The objective function is called the incomplete log-likelihood.
Parameter estimation for GMMs: Incomplete data
No simple way to optimize the incomplete log-likelihood...
EM algorithm provides a strategy for iteratively optimizing this function!
E-Step: ‘Guess’ values of the zn using existing values of θ = {ωk, µk, Σk}_{k=1}^{K}. How do we do this?
When zn is not given, we can guess it via the posterior probability (recall:
Bayes’ rule!)
p(zn = k|xn) = p(xn|zn = k) p(zn = k) / p(xn)
             = p(xn|zn = k) p(zn = k) / ∑_{k′=1}^{K} p(xn|zn = k′) p(zn = k′)
             = ωk N(xn|µk, Σk) / ∑_{k′=1}^{K} ωk′ N(xn|µk′, Σk′)
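A minimal sketch of this E-step computation (my own illustration, assuming scipy is available): for each point, compute ωk N(xn|µk, Σk) for every component and normalize over k.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, omega, mu, Sigma):
    """Soft assignments r_nk = p(z_n = k | x_n) via Bayes' rule. Returns an (N, K) array."""
    N, K = X.shape[0], len(omega)
    R = np.zeros((N, K))
    for k in range(K):
        # numerator of Bayes' rule: prior weight times Gaussian likelihood
        R[:, k] = omega[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    return R / R.sum(axis=1, keepdims=True)   # divide by p(x_n), the sum over components
```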
Estimation with soft rnk
We define rnk = p(zn = k |xn)
• Recall that rnk was previously binary
• Now it’s a “soft” assignment of xn to k-th component
• Each xn is assigned to a component fractionally according to
p(zn = k |xn)
M-Step: If we solve for the MLE of θ = {ωk, µk, Σk}_{k=1}^{K} given soft rnks,
we get the same expressions as before!
ωk = ∑n rnk / ∑k ∑n rnk ,   µk = (1 / ∑n rnk) ∑n rnk xn

Σk = (1 / ∑n rnk) ∑n rnk (xn − µk)(xn − µk)ᵀ
But remember, we’re ‘cheating’ by using θ to compute rnk !
EM procedure for GMM
Alternate between estimating rnk and estimating parameters θ
• Step 0: Initialize θ with some values (random or otherwise)
• Step 1: E-Step: Set rnk = p(zn = k |xn) with the current values of θ
using Bayes Rule
• Step 2: M-Step: Update θ via MLE using the rnk values from Step 1
• Step 3: Go back to Step 1.
At the end, convert rnk back to binary by setting the largest rnk for each point xn to 1 and the others to 0.
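Putting the steps together, here is a minimal end-to-end sketch of this loop (my own illustration, not the course's reference implementation). It initializes the parameters, alternates E- and M-steps for a fixed number of iterations, and converts the soft assignments to hard ones at the end:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Minimal EM loop for a GMM. X: (N, d) data, K: number of components."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: initialize (uniform weights, random means drawn from the data, shared covariance)
    omega = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        # Step 1 (E-step): r_nk = p(z_n = k | x_n) under the current parameters
        R = np.stack([omega[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # Step 2 (M-step): MLE of the parameters given the soft counts r_nk
        Nk = R.sum(axis=0)
        omega = Nk / N
        mu = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    # Final hard assignment: largest r_nk for each point
    return omega, mu, Sigma, R.argmax(axis=1)
```

In practice you would also monitor the incomplete log-likelihood to decide when to stop and try several initializations, since (as noted later in the lecture) EM only converges to a local optimum.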
GMMs and K-means
GMMs provide probabilistic interpretation for K-means
GMMs reduce to K-means under the following assumptions (in which
case EM for GMM parameter estimation simplifies to K-means):
• Assume all Gaussians have covariance matrices of the form σ²I
• Further assume σ → 0, so we only need to estimate the means µk
K-means is often called “hard” GMM, and GMM is often called “soft” K-means
The posterior rnk provides a probabilistic assignment for xn to cluster k
GMMs vs. k-means
Pros/Cons
• k-means is a simpler, more straightforward method, but might not
be as accurate because of deterministic clustering
• GMMs can be more accurate, as they model more information (soft
clustering, variance), but can be more expensive to compute
• Both methods have a similar set of practical issues (having to select
k, the distance, and the initialization)
EM Algorithm
EM algorithm: Motivation
EM is a general procedure to estimate parameters for probabilistic models
with hidden/latent variables.
EM algorithm: Setup
• Suppose the model is given by a joint distribution p(x, z|θ), so that

  p(x|θ) = ∑z p(x, z|θ)

• Given incomplete data D = {xn}, our goal is to compute the MLE of θ:

  θ = arg max ℓ(θ) = arg max log p(D) = arg max ∑n log p(xn|θ) = arg max ∑n log ∑zn p(xn, zn|θ)
• The objective function ℓ(θ) is called the incomplete log-likelihood
• The log-sum form of the incomplete log-likelihood is difficult to work with
EM: Principle
• EM: construct a lower bound on ℓ(θ) (E-step) and optimize it (M-step)
• This is an instance of minorize-maximization (“MM”) algorithms
• Optimizing the lower bound will hopefully improve ℓ(θ) too.
Constructing a lower bound
• If we define q(z) as a distribution over z, then

  ℓ(θ) = ∑n log ∑zn p(xn, zn|θ)
       = ∑n log ∑zn q(zn) [p(xn, zn|θ) / q(zn)]
       ≥ ∑n ∑zn q(zn) log [p(xn, zn|θ) / q(zn)]
• Last step follows from Jensen’s inequality, i.e., f (EX ) ≥ Ef (X ) for
concave function f (x) = log(x).
Detour: Jensen’s inequality
Jensen’s inequality: if f (·) is a convex function, then for any random
variable X
f(EX) ≤ E f(X)
where equality holds when f(·) is linear (affine) or X is constant.
• Example: for f(x) = x², which is convex:
  (EX)² ≤ E[X²]  ⟹  Var(X) = E[X²] − (EX)² ≥ 0
• Example: for f(x) = log x, which is concave:
  log(EX) ≥ E[log X]
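As a quick numerical sanity check of the concave case (my own illustration, not from the slides), one can compare log(EX) with E[log X] on samples of any positive random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # any positive random variable works here

print(np.log(x.mean()), np.log(x).mean())      # log(EX) should come out larger than E[log X]
```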
Applying Jensen’s inequality
If we define q(z) as a distribution over z, then

ℓ(θ) = ∑n log ∑zn p(xn, zn|θ) = ∑n log ∑zn q(zn) [p(xn, zn|θ) / q(zn)]

Each term in the outer sum has the form f(E_q(zn)[p(xn, zn|θ) / q(zn)]).

Apply Jensen's inequality to each term, i.e., f(EX) ≥ E f(X) for the concave function f(·) = log(·). We take the expectation of X = p(xn, zn|θ) / q(zn), a random variable depending on zn, over the probability distribution q(zn):

ℓ(θ) ≥ ∑n ∑zn q(zn) log [p(xn, zn|θ) / q(zn)]

Each term on the right-hand side is E_q(zn)[f(p(xn, zn|θ) / q(zn))].
Which q(z) should we choose?
[Figure (a): scatter plot of the data points from the three regions]
• Consider the previous model where x could be from 3 regions
• We can choose q(z) as any valid distribution
• e.g., q(z = k) = 1/3 for any of 3 colors
• e.g., q(z = k) = 1/2 for red and blue, 0 for green
Which q(z) should we choose?
Which q(z) to choose?
ℓ(θ) ≥ ∑n ∑zn q(zn) log [p(xn, zn|θ) / q(zn)]

• The lower bound we derived for ℓ(θ) holds for all choices of q(·)
• We want a tight lower bound, so given some current estimate θt, we will pick qt(·) such that our lower bound holds with equality at θt.
• We will choose a different qt(·) for each iteration t.
Pick qt(zn)
Pick qt(zn) so that

ℓ(θt) = ∑n log ∑zn p(xn, zn|θt) = ∑n ∑zn qt(zn) log [p(xn, zn|θt) / qt(zn)]
• Pick the distribution where the equality in Jensen’s inequality holds.
• Choose qt(zn) ∝ p(xn, zn|θt)!
• Since qt(·) is a distribution, we have
qt(zn) = p(xn, zn|θt) / ∑k p(xn, zn = k|θt) = p(xn, zn|θt) / p(xn|θt) = p(zn|xn; θt)
• This is the posterior distribution of zn given xn and θt
E-Step in GMMs
GMM Parameters
θ = {ωk, µk, Σk}_{k=1}^{K}
Incomplete Data
Our data contains observed and unobserved data, and hence is
incomplete
• Observed: D = {xn}
• Unobserved (hidden) labels: {zn}
Guess the distribution of zn with the posterior probabilities, given
estimates of θt :
qt(zn = k) = p(zn = k|xn; θt) = p(xn|zn = k) p(zn = k) / ∑_{k′=1}^{K} p(xn|zn = k′) p(zn = k′)
           = ωk^t N(xn|µk^t, Σk^t) / ∑_{k′=1}^{K} ωk′^t N(xn|µk′^t, Σk′^t)
E- and M-Steps
Our lower bound for the log-likelihood:
ℓ(θ) ≥ ∑n ∑zn p(zn|xn; θt) log [p(xn, zn|θ) / p(zn|xn; θt)]
Why is this called the E-Step? Because we can view it as computing the
expected (complete) log-likelihood:
Q(θ|θt) = ∑n ∑zn p(zn|xn; θt) log p(xn, zn|θ)    (an expectation of the complete log-likelihood)
        = E_qt [∑n log p(xn, zn|θ)]
Where did the p(zn|xn;θt) in the denominator go? It is not a function of
θ, so can be ignored.
M-Step: Maximize Q(θ|θt), i.e., θt+1 = arg max_θ Q(θ|θt)
EM in Pictures
[Figures omitted; adapted from Jonathan Hui's Medium tutorial.]
Iterative and monotonic improvement
ℓ(θ) ≥ ∑n ∑zn p(zn|xn; θt) log [p(xn, zn|θ) / p(zn|xn; θt)]  =: Q(θ|θt)
• We can show that ℓ(θt+1) ≥ ℓ(θt).
• Recall that we chose qt(·) in the E-Step such that:
  ℓ(θt) = Q(θt|θt)
• However, in the M-step, θt+1 is chosen to maximize Q(θ|θt), thus
  ℓ(θt+1) ≥ Q(θt+1|θt) = max_θ Q(θ|θt) ≥ Q(θt|θt) = ℓ(θt)
• Note: the EM procedure converges but only to a local optimum
Example: Applying EM to GMMs
What is the E-Step in GMM?
rnk = p(z = k |xn;θt)
What is the M-Step in GMM? The Q-function is
Q(θ|θt) = ∑n ∑k p(z = k|xn; θt) log p(xn, z = k|θ)
        = ∑n ∑k rnk log p(xn, z = k|θ)
        = ∑k ∑n rnk log p(z = k) p(xn|z = k)
        = ∑k ∑n rnk [log ωk + log N(xn|µk, Σk)]
We have recovered the parameter estimation algorithm for GMMs that
we previously discussed!
Example: Estimating height distributions
Suppose the heights of men and women follow two normal distributions
with different parameters:
Men: N(µ1, σ1²)    Women: N(µ2, σ2²)
Our data, xn, n = 1, 2, . . . , 5 is the heights of five people: 179, 165, 175,
185, 158 (in cm).
We are missing the labels zn = each person’s gender. Let π equal the
fraction of males in the population.
How do we estimate µ1, σ1, µ2, σ2, π?
Example taken from: http://web1.sph.emory.edu/users/hwu30/
teaching/statcomp/Notes/Lecture3_EM.pdf
Example: E-Step
Initialize µ1^0 = 175, µ2^0 = 165, σ1^0 = σ2^0 = 10, π0 = 0.6.
E-Step: Estimate the probability of each person being male or female.
p(zn = male|µ1, σ1, µ2, σ2, π) = 0.6 exp[−(xn − 175)²/200] / (0.6 exp[−(xn − 175)²/200] + 0.4 exp[−(xn − 165)²/200])

p(zn = female|µ1, σ1, µ2, σ2, π) = 1 − p(zn = male|µ1, σ1, µ2, σ2, π)

Person        1     2     3     4     5
xn: height   179   165   175   185   158
Prob. male   0.79  0.48  0.71  0.87  0.31
We can use these probabilities to find Q(θ1|θ0).
Example: M-Step
M-Step: Find µ1^1, µ2^1, σ1^1, σ2^1, π1 (i.e., θ1) that maximize Q(θ|θ0):

µ1^1 = ∑_{n=1}^{5} p(zn = male|θ0) xn / ∑_{n=1}^{5} p(zn = male|θ0),   µ2^1 = ∑_{n=1}^{5} p(zn = female|θ0) xn / ∑_{n=1}^{5} p(zn = female|θ0),

(σ1^1)² = ∑_{n=1}^{5} p(zn = male|θ0) (xn − µ1^1)² / ∑_{n=1}^{5} p(zn = male|θ0),

(σ2^1)² = ∑_{n=1}^{5} p(zn = female|θ0) (xn − µ2^1)² / ∑_{n=1}^{5} p(zn = female|θ0),

π1 = (1/5) ∑_{n=1}^{5} p(zn = male|θ0)

Here we are using the MLE solution that we derived earlier for GMMs.

Numerically, µ1^1 = 176, µ2^1 = 167, σ1^1 = 8.7, σ2^1 = 9.2, π1 = 0.63.
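Here is a minimal sketch (my own code, not part of the original notes) that runs these E- and M-step updates on the five heights with the initialization above; it should approximately reproduce the numbers on this slide and the next one:

```python
import numpy as np

x = np.array([179.0, 165.0, 175.0, 185.0, 158.0])      # observed heights (cm)
mu1, mu2, s1, s2, pi = 175.0, 165.0, 10.0, 10.0, 0.6   # initialization from the slide

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

for t in range(15):
    # E-step: posterior probability that each person is male
    num = pi * normal_pdf(x, mu1, s1)
    r = num / (num + (1 - pi) * normal_pdf(x, mu2, s2))
    # M-step: weighted MLE updates (sqrt turns the weighted variances back into std deviations)
    mu1, mu2 = (r * x).sum() / r.sum(), ((1 - r) * x).sum() / (1 - r).sum()
    s1 = np.sqrt((r * (x - mu1) ** 2).sum() / r.sum())
    s2 = np.sqrt(((1 - r) * (x - mu2) ** 2).sum() / (1 - r).sum())
    pi = r.mean()

print(np.round(r, 4), round(mu1, 1), round(mu2, 1), round(s1, 1), round(s2, 1), round(pi, 2))
```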
Example: After 15 iterations...
After iteration 1:

Person        1     2     3     4     5
xn: height   179   165   175   185   158
Prob. male   0.79  0.48  0.71  0.87  0.31

Parameter estimates: µ1^1 = 176, µ2^1 = 167, σ1^1 = 8.7, σ2^1 = 9.2, π1 = 0.63.

After iteration 15:

Person        1         2          3       4    5
xn: height   179       165        175     185  158
Prob. male   0.999997  0.0004009  0.9991  1    2.44e-06

Final estimates: µ1 = 179.6, µ2 = 161.5, σ1 = 4.1, σ2 = 3.5, π = 0.6.
Another example: Multinomial distributions
Suppose we are trying to model the number of people who will vote for
one of four candidates. We know that the probability of a single person
voting for each candidate is:
(p1, p2, p3, p4) = (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4).
Let Y = (y1, y2, y3, y4) denote our observation of the number of votes for
each candidate. How do we estimate θ?
• Option 1: MLE. But it is hard to optimize the log-likelihood...
log p(Y|θ) = y1 log(1/2 + θ/4) + (y2 + y3) log(1 − θ) + y4 log θ
• Option 2: EM. We will introduce two unobserved/latent variables x0 and x1, where x0 + x1 = y1. In other words, y1 combines the counts of two categories, whose individual counts are given by x0 and x1. We have a probability 1/2 of picking the x0 category, and a probability θ/4 of picking the x1 category.
Applying EM to multinomial distributions
Complete log-likelihood function (dropping terms that do not depend on θ):

ℓ(θ) = log p(X, Y|θ) = (x1 + y4) log θ + (y2 + y3) log(1 − θ)
E-Step: Find Q(θ|θt) = E_qt[ℓ(θ)], where qt is the posterior distribution of x1 given y1, y2, y3, y4, and θt. To do this, we need to find E_qt[x1]:

x1^{t+1} = E_qt[x1] = y1 · (θt/4) / (1/2 + θt/4)
M-Step: Find θ that maximizes

Q(θ|θt) = (y1 · (θt/4)/(1/2 + θt/4) + y4) log θ + (y2 + y3) log(1 − θ),

which gives

θt+1 = (x1^{t+1} + y4) / (x1^{t+1} + y4 + y2 + y3)
What does this look like in practice?
Observe Y = (125, 18, 20, 34) and initialize θ0 = 0.5.
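Here is a minimal sketch (my own code) of iterating the two updates from the previous slide on this data; the estimate of θ should settle after a handful of iterations at a value a bit above 0.6:

```python
y1, y2, y3, y4 = 125, 18, 20, 34   # observed counts Y
theta = 0.5                        # initialization theta^0

for t in range(10):
    # E-step: expected count of the latent x1 category hidden inside y1
    x1 = y1 * (theta / 4) / (0.5 + theta / 4)
    # M-step: closed-form maximizer of Q(theta | theta^t)
    theta = (x1 + y4) / (x1 + y4 + y2 + y3)
    print(t + 1, round(x1, 3), round(theta, 5))
```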
Applications of EM
EM is a general method to deal with hidden data; we have studied it in
the context of hidden labels (unsupervised learning). Common
applications include:
• Filling in missing data in a sample
• Discovering the value of latent model variables
• Estimating parameters of finite mixture models
• As an alternative to direct maximum likelihood estimation
You should know . . .
• EM is a general procedure for maximizing a likelihood with latent
(unobserved) variables
• The two steps of EM:
• (1) Estimating unobserved data from observed data and current
parameters
• (2) Using this “complete” data to find the maximum likelihood
parameter estimates
• Pros: Guaranteed to converge, no parameters to tune (e.g.,
compared to gradient methods)
• Cons: Can get stuck in local optima, can be expensive