Gaussian Mixture Models
David Rosenberg, Brett Bernstein
New York University
April 26, 2017
David Rosenberg, Brett Bernstein (New York University)DS-GA 1003 April 26, 2017 1 / 42
Intro Question
Suppose we begin with a dataset D = {x_1, . . . , x_n} ⊆ R² and we run k-means (or k-means++) to obtain k cluster centers. Below we have drawn the cluster centers. If we are given a new x ∈ R², we can assign it a label based on which cluster center is closest. What regions of the plane below correspond to each possible labeling?
[Figure: dataset with the k cluster centers marked in the plane]
Intro Solution
Note that the cells are disjoint (except for their borders) and convex. This can be thought of as a limitation of k-means: neither property will hold for GMMs.
[Figure: Voronoi partition of the plane induced by the cluster centers]
Gaussian Mixture Models
Yesterday's Intro Question
Consider the following probability model for generating data.
1. Roll a weighted k-sided die to choose a label z ∈ {1, . . . , k}. Let π denote the PMF of the die.
2. Draw x ∈ R^d randomly from the multivariate normal distribution N(µ_z, Σ_z).
Answer the following questions.
1. What is the joint distribution of x, z given π and the µ_z, Σ_z values?
2. Suppose you were given the dataset D = {(x_1, z_1), . . . , (x_n, z_n)}. How would you estimate the die weightings and the µ_z, Σ_z values?
3. How would you determine the label for a new data point x?
Yesterday's Intro Solution
1. The joint PDF/PMF is given by
$$p(x, z) = \pi(z)\, f(x; \mu_z, \Sigma_z)$$
where
$$f(x; \mu_z, \Sigma_z) = \frac{1}{\sqrt{|2\pi\Sigma_z|}} \exp\left(-\frac{1}{2}(x - \mu_z)^T \Sigma_z^{-1} (x - \mu_z)\right).$$
2. We could use maximum likelihood estimation. Our estimates are
$$n_z = \sum_{i=1}^{n} 1(z_i = z), \qquad \hat\pi(z) = \frac{n_z}{n},$$
$$\hat\mu_z = \frac{1}{n_z} \sum_{i : z_i = z} x_i, \qquad \hat\Sigma_z = \frac{1}{n_z} \sum_{i : z_i = z} (x_i - \hat\mu_z)(x_i - \hat\mu_z)^T.$$
3. Choose the label z* = argmax_z p(x, z).
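The labeled-data estimates above can be sketched in a few lines of numpy. This is an illustrative sketch (function and variable names are my own, not from the slides): for each cluster, take the fraction of points, the sample mean, and the MLE covariance (which divides by n_z, not n_z − 1).

```python
import numpy as np

def fit_labeled_gmm(X, z, k):
    """MLE of pi, mu, Sigma given known cluster labels z in {0, ..., k-1}."""
    n, d = X.shape
    pi = np.zeros(k)
    mu = np.zeros((k, d))
    Sigma = np.zeros((k, d, d))
    for c in range(k):
        members = X[z == c]
        n_c = len(members)
        pi[c] = n_c / n                     # pi_hat(c) = n_c / n
        mu[c] = members.mean(axis=0)        # sample mean of cluster c
        diff = members - mu[c]
        Sigma[c] = diff.T @ diff / n_c      # MLE covariance (divides by n_c)
    return pi, mu, Sigma

# Toy data: 50 points near 0 with label 0, 70 points near 5 with label 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (70, 2))])
z = np.array([0] * 50 + [1] * 70)
pi, mu, Sigma = fit_labeled_gmm(X, z, 2)
```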
Probabilistic Model for Clustering
Let's consider a generative model for the data.
Suppose
1. There are k clusters.
2. We have a probability density for each cluster.
Generate a point as follows:
1. Choose a random cluster z ∈ {1, 2, . . . , k}.
2. Choose a point from the distribution for cluster z.
The clustering algorithm is then:
1. Use training data to fit the parameters of the generative model.
2. For each point, choose the cluster with the highest likelihood based on the model.
Gaussian Mixture Model (k = 3)
1. Choose z ∈ {1, 2, 3}.
2. Choose x | z ∼ N(µ_z, Σ_z).
Gaussian Mixture Model Parameters (k Components)
Cluster probabilities: π = (π_1, . . . , π_k)
Cluster means: µ = (µ_1, . . . , µ_k)
Cluster covariance matrices: Σ = (Σ_1, . . . , Σ_k)
What if one cluster had many more points than another cluster?
Gaussian Mixture Model: Joint Distribution
Factorize the joint distribution:
$$p(x, z) = p(z)\, p(x \mid z) = \pi_z \mathcal{N}(x \mid \mu_z, \Sigma_z)$$
π_z is the probability of choosing cluster z.
x | z has distribution N(µ_z, Σ_z).
The z corresponding to x is the true cluster assignment.
Suppose we know all the parameters of the model.
Then we can easily compute the joint p(x, z) and the conditional p(z | x).
Latent Variable Model
We observe x .
In the intro problem we had labeled data, but here we don't observe z, the cluster assignment.
Cluster assignment z is called a hidden variable or latent variable.
Definition
A latent variable model is a probability model for which certain variables are never observed.
e.g., the Gaussian mixture model is a latent variable model.
The GMM "Inference" Problem
We observe x . We want to know z .
The conditional distribution of the cluster z given x is
$$p(z \mid x) = p(x, z) / p(x).$$
The conditional distribution is a soft assignment to clusters.
A hard assignment is
$$z^* = \arg\max_{z \in \{1, \ldots, k\}} p(z \mid x).$$
So if we have the model, clustering is trivial.
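To make "clustering is trivial" concrete, here is a minimal sketch (names are my own) of the soft assignment p(z | x) = p(x, z)/p(x) and the hard assignment argmax, assuming the model parameters are known.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) at a single point x."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def soft_assignment(x, pi, mu, Sigma):
    """p(z | x) = p(x, z) / p(x) for each cluster z."""
    joint = np.array([pi[z] * gaussian_pdf(x, mu[z], Sigma[z])
                      for z in range(len(pi))])      # p(x, z) for each z
    return joint / joint.sum()                       # normalize by p(x)

# A fitted two-component model (illustrative parameters).
pi = np.array([0.5, 0.5])
mu = [np.zeros(2), np.full(2, 4.0)]
Sigma = [np.eye(2), np.eye(2)]

probs = soft_assignment(np.array([3.9, 4.1]), pi, mu, Sigma)  # soft assignment
z_star = int(np.argmax(probs))                                # hard assignment
```

A point near the second mean gets nearly all of the posterior mass, so the soft and hard assignments agree.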
Mixture Models
Gaussian Mixture Model: Marginal Distribution
The marginal distribution for a single observation x is
$$p(x) = \sum_{z=1}^{k} p(x, z) = \sum_{z=1}^{k} \pi_z \mathcal{N}(x \mid \mu_z, \Sigma_z).$$
Note that p(x) is a convex combination of probability densities.
This is a common form for a probability model...
Mixture Distributions (or Mixture Models)
Definition
A probability density p(x) represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities. That is,
$$p(x) = \sum_{i=1}^{k} w_i p_i(x),$$
where w_i ≥ 0, \sum_{i=1}^{k} w_i = 1, and each p_i is a probability density.
In our Gaussian mixture model, x has a mixture distribution.
More constructively, let S be a set of probability distributions:
1. Choose a distribution randomly from S.
2. Sample x from the chosen distribution.
Then x has a mixture distribution.
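The two-step construction above can be sketched directly: pick a component with probability w_i, then sample from that component. This is an illustrative sketch (the helper name and the particular one-dimensional mixture are my own choices).

```python
import numpy as np

def sample_mixture(n, weights, samplers, rng):
    """Draw n points from the mixture sum_i w_i p_i(x)."""
    weights = np.asarray(weights)
    comps = rng.choice(len(weights), size=n, p=weights)  # step 1: pick p_i
    x = np.array([samplers[c](rng) for c in comps])      # step 2: sample from it
    return x, comps

rng = np.random.default_rng(0)
samplers = [lambda r: r.normal(-3.0, 1.0),   # p_1 = N(-3, 1)
            lambda r: r.normal(3.0, 1.0)]    # p_2 = N(3, 1)
x, comps = sample_mixture(10_000, [0.3, 0.7], samplers, rng)
```

About 70% of the draws come from the second component, and those draws average near its mean of 3.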
Learning in Gaussian Mixture Models
The GMM "Learning" Problem
Given data x_1, . . . , x_n drawn from a GMM, estimate the parameters:
Cluster probabilities: π = (π_1, . . . , π_k)
Cluster means: µ = (µ_1, . . . , µ_k)
Cluster covariance matrices: Σ = (Σ_1, . . . , Σ_k)
Once we have the parameters, we're done.
Just do "inference" to get cluster assignments.
Estimating/Learning the Gaussian Mixture Model
One approach to learning is maximum likelihood: find parameter values that give the observed data the highest likelihood.
The model likelihood for D = {x_1, . . . , x_n} is
$$L(\pi, \mu, \Sigma) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \sum_{z=1}^{k} \pi_z \mathcal{N}(x_i \mid \mu_z, \Sigma_z).$$
As usual, we'll take our objective function to be the log of this:
$$J(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log \left\{ \sum_{z=1}^{k} \pi_z \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}$$
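Evaluating J numerically is straightforward; a sketch follows (names are my own). One practical detail not on the slides: the inner sum is best computed with the log-sum-exp trick, since the individual densities can underflow.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """log N(x_i | mu, Sigma) for each row x_i of X."""
    d = len(mu)
    diff = X - mu
    quad = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)) + quad)

def gmm_log_likelihood(X, pi, mu, Sigma):
    """J(pi, mu, Sigma) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    # log_probs[i, z] = log pi_z + log N(x_i | mu_z, Sigma_z)
    log_probs = np.stack([np.log(pi[z]) + log_gaussian(X, mu[z], Sigma[z])
                          for z in range(len(pi))], axis=1)
    m = log_probs.max(axis=1)                        # log-sum-exp trick
    return float((m + np.log(np.exp(log_probs - m[:, None]).sum(axis=1))).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
J = gmm_log_likelihood(X, np.array([0.4, 0.6]),
                       [np.zeros(2), np.ones(2)],
                       [np.eye(2), np.eye(2)])
```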
Properties of the GMM Log-Likelihood
GMM log-likelihood:
$$J(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log \left\{ \sum_{z=1}^{k} \pi_z \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}$$
Let's compare to the log-likelihood for a single Gaussian:
$$\sum_{i=1}^{n} \log \mathcal{N}(x_i \mid \mu, \Sigma) = -\frac{nd}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$$
For a single Gaussian, the log cancels the exp in the Gaussian density.
=⇒ Things simplify a lot.
For the GMM, the sum inside the log prevents this cancellation.
=⇒ The expression is more complicated, and there is no closed-form expression for the MLE.
Issues with MLE for GMM
Identifiability Issues for GMM
Suppose we have found parameters
Cluster probabilities: π = (π_1, . . . , π_k)
Cluster means: µ = (µ_1, . . . , µ_k)
Cluster covariance matrices: Σ = (Σ_1, . . . , Σ_k)
that are a local optimum of the likelihood.
What happens if we shuffle the clusters? e.g., switch the labels for clusters 1 and 2.
We'll get the same likelihood. How many such equivalent settings are there?
Assuming all clusters are distinct, there are k! equivalent solutions.
Not a problem per se, but something to be aware of.
Singularities for GMM
Consider the following GMM for 7 data points:
Let σ² be the variance of the skinny component.
What happens to the likelihood as σ² → 0?
In practice, we end up in local optima that do not have this problem.
Or we keep restarting the optimization until we do.
A Bayesian approach or regularization will also solve the problem.
From Bishop's Pattern Recognition and Machine Learning, Figure 9.7.
Gradient Descent / SGD for GMM
What about running gradient descent or SGD on
$$J(\pi, \mu, \Sigma) = -\sum_{i=1}^{n} \log \left\{ \sum_{z=1}^{k} \pi_z \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}?$$
Can be done, but we need to be clever about it.
Each matrix Σ_1, . . . , Σ_k has to be positive semidefinite.
How do we maintain that constraint?
Rewrite Σ_i = M_i M_i^T, where M_i is an unconstrained matrix.
Then Σ_i is positive semidefinite.
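A tiny sketch of the reparameterization: in an optimizer we would take gradient steps on the unconstrained M and only form Σ = MMᵀ when evaluating the objective, so the PSD constraint holds automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))            # unconstrained parameter
Sigma = M @ M.T                        # positive semidefinite by construction
eigvals = np.linalg.eigvalsh(Sigma)    # all eigenvalues are >= 0
```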
The EM Algorithm for GMM
MLE for GMM
From yesterday's intro questions, we know that we can solve the MLE problem if the cluster assignments z_i are known:
$$n_z = \sum_{i=1}^{n} 1(z_i = z), \qquad \hat\pi(z) = \frac{n_z}{n},$$
$$\hat\mu_z = \frac{1}{n_z} \sum_{i : z_i = z} x_i, \qquad \hat\Sigma_z = \frac{1}{n_z} \sum_{i : z_i = z} (x_i - \hat\mu_z)(x_i - \hat\mu_z)^T.$$
In the EM algorithm we will modify the equations to handle our evolving soft assignments, which we will call responsibilities.
Cluster Responsibilities: Some New Notation
Denote the probability that the observed value x_i comes from cluster j by
$$\gamma_i^j = P(Z = j \mid X = x_i).$$
This is the responsibility that cluster j takes for observation x_i.
Computationally,
$$\gamma_i^j = P(Z = j \mid X = x_i) = \frac{p(Z = j, X = x_i)}{p(x_i)} = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^{k} \pi_c \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}.$$
The vector (γ_i^1, . . . , γ_i^k) is exactly the soft assignment for x_i.
Let n_c = \sum_{i=1}^{n} \gamma_i^c be the "number" of points soft-assigned to cluster c.
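The responsibilities and the soft counts n_c can be computed in a couple of numpy lines. This is a sketch with my own helper names, assuming the parameters are given.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """N(x_i | mu, Sigma) for every row x_i of X."""
    diff = X - mu
    quad = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def responsibilities(X, pi, mu, Sigma):
    """gamma[i, j] = pi_j N(x_i|mu_j,Sigma_j) / sum_c pi_c N(x_i|mu_c,Sigma_c)."""
    weighted = np.stack([pi[j] * gaussian_pdf(X, mu[j], Sigma[j])
                         for j in range(len(pi))], axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(6.0, 1.0, (60, 2))])
gamma = responsibilities(X, np.array([0.5, 0.5]),
                         [np.zeros(2), np.full(2, 6.0)],
                         [np.eye(2), np.eye(2)])
n_c = gamma.sum(axis=0)    # "number" of points soft-assigned to each cluster
```

Each row of gamma sums to 1, and the soft counts sum to n.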
EM Algorithm for GMM: Overview
If we know π and µ_j, Σ_j for all j, then we can easily find
$$\gamma_i^j = P(Z = j \mid X = x_i).$$
If we know the (soft) assignments, we can easily find estimates for π, µ_j, Σ_j for all j.
Repeatedly alternate the previous two steps.
EM Algorithm for GMM: Overview
1. Initialize parameters µ, Σ, π.
2. "E step": evaluate the responsibilities using the current parameters:
$$\gamma_i^j = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^{k} \pi_c \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}, \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k.$$
3. "M step": re-estimate the parameters using the responsibilities. [Compare with the intro question.]
$$\mu_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^{n} \gamma_i^c x_i$$
$$\Sigma_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^{n} \gamma_i^c (x_i - \mu_c^{\text{new}})(x_i - \mu_c^{\text{new}})^T$$
$$\pi_c^{\text{new}} = \frac{n_c}{n}$$
4. Repeat from step 2 until the log-likelihood converges.
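The four steps above can be sketched end-to-end in numpy. This is a from-scratch illustration with my own names and simplifications (a fixed number of iterations instead of a log-likelihood convergence check, and no safeguard against the singularities discussed earlier), not a production implementation.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """N(x_i | mu, Sigma) for every row x_i of X."""
    diff = X - mu
    quad = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def em_gmm(X, k, n_iter=50):
    n, d = X.shape
    # Step 1: initialize (uniform weights, means spread across the data).
    pi = np.full(k, 1.0 / k)
    mu = X[np.linspace(0, n - 1, k).astype(int)].astype(float).copy()
    Sigma = np.array([np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # Step 2, "E step": responsibilities gamma[i, j].
        weighted = np.stack([pi[j] * gaussian_pdf(X, mu[j], Sigma[j])
                             for j in range(k)], axis=1)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # Step 3, "M step": re-estimate parameters from soft counts.
        n_c = gamma.sum(axis=0)
        pi = n_c / n
        mu = (gamma.T @ X) / n_c[:, None]
        for c in range(k):
            diff = X - mu[c]
            Sigma[c] = (gamma[:, c, None] * diff).T @ diff / n_c[c]
    return pi, mu, Sigma

# Two well-separated clusters; EM should recover their means and weights.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4.0, 1.0, (100, 2)),
               rng.normal(4.0, 1.0, (100, 2))])
pi, mu, Sigma = em_gmm(X, k=2)
```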
EM for GMM
Initialization
From Bishop's Pattern recognition and machine learning, Figure 9.8.
EM for GMM
First soft assignment:
EM for GMM
After 5 rounds of EM:
EM for GMM
After 20 rounds of EM:
Relation to k-Means
EM for GMM seems a little like k-means.
In fact, there is a precise correspondence.
First, fix each cluster covariance matrix to be σ²I.
Then the density for each Gaussian only depends on the distance to the mean.
As we take σ² → 0, the update equations converge to doing k-means.
If you do a quick experiment yourself, you'll find that the soft assignments converge to hard assignments. This has to do with the tail behavior (exponential decay) of the Gaussian.
Can use k-means++ to initialize the parameters of the EM algorithm.
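A quick version of the suggested experiment (names and data are my own): with equal weights and covariance σ²I, the responsibilities depend only on squared distances to the means, and shrinking σ² drives them to hard 0/1 assignments.

```python
import numpy as np

def soft_assign(X, mu, sigma2):
    """Responsibilities with equal weights and shared covariance sigma^2 I."""
    # With Sigma_c = sigma^2 I, only squared distance to each mean matters.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)   # stabilize before exp
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 0.2], [5.0, 5.0]])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
soft = soft_assign(X, mu, sigma2=10.0)    # large sigma^2: genuinely soft
hard = soft_assign(X, mu, sigma2=0.01)    # sigma^2 -> 0: ~hard (k-means-like)
```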
Math Prerequisites for General EM Algorithm
Jensen's Inequality
Which is larger: E[X²] or E[X]²?
It must be E[X²], since Var[X] = E[X²] − E[X]² ≥ 0.
A more general result is true:
Theorem (Jensen's Inequality)
If f : R → R is convex and X is a random variable, then E[f(X)] ≥ f(E[X]). If f is strictly convex, then we have equality iff X = E[X] with probability 1 (i.e., X is constant).
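A tiny numerical illustration of the theorem with the convex function f(x) = x², where the gap E[f(X)] − f(E[X]) is exactly Var[X] (here X ∼ N(2, 3²), so the gap should be near 9):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 3.0, size=200_000)
lhs = np.mean(X ** 2)          # Monte Carlo estimate of E[X^2]
rhs = np.mean(X) ** 2          # (E[X])^2
gap = lhs - rhs                # equals Var[X], about 9 here
```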
Proof of Jensen
Exercise
Suppose X can take exactly two values: x_1 with probability π_1 and x_2 with probability π_2. Prove Jensen's inequality in this case.
Let's compute E[f(X)]:
$$E[f(X)] = \pi_1 f(x_1) + \pi_2 f(x_2) \geq f(\pi_1 x_1 + \pi_2 x_2) = f(E[X]).$$
For the general proof, what do we know is true about all convex functions f : R → R?
Proof of Jensen
1. Let e = E[X]. (Remember e is just a number.)
2. Since f has a subgradient at e, there is an underestimating line g(x) = ax + b that passes through the point (e, f(e)).
3. Then we have
$$E[f(X)] \geq E[g(X)] = E[aX + b] = aE[X] + b = ae + b = f(e) = f(E[X]).$$
4. If f is strictly convex, then f = g at exactly one point, so we have equality iff X is constant.
KL-Divergence
Let p(x) and q(x) be probability mass functions (PMFs) on X.
We want to measure how different they are.
The Kullback-Leibler or "KL" divergence is defined by
$$KL(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.$$
(This assumes absolute continuity: q(x) = 0 implies p(x) = 0.)
We can also write
$$KL(p \| q) = E_{x \sim p} \log \frac{p(x)}{q(x)}.$$
Note that the KL divergence is not symmetric and doesn't satisfy the triangle inequality.
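A sketch of the definition for PMFs (the helper name and example distributions are my own), including a check of the asymmetry just mentioned:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                    # treat 0 * log 0 as 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]
forward = kl(p, q)      # KL(p || q)
reverse = kl(q, p)      # KL(q || p): generally a different number
```

Both directions are nonnegative (Gibbs' inequality, below on this slide deck's ordering of topics), but they disagree, so KL is not a metric.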
Gibbs' Inequality
Theorem (Gibbs' Inequality)
Let p(x) and q(x) be PMFs on X. Then
$$KL(p \| q) \geq 0,$$
with equality iff p(x) = q(x) for all x ∈ X.
Since
$$KL(p \| q) = E_p \left[ -\log \left( \frac{q(x)}{p(x)} \right) \right],$$
this is screaming for Jensen's inequality.
Gibbs' Inequality: Proof
$$KL(p \| q) = E_p \left[ -\log \left( \frac{q(x)}{p(x)} \right) \right] \geq -\log \left( E_p \left[ \frac{q(x)}{p(x)} \right] \right) = -\log \left( \sum_{x : p(x) > 0} p(x) \frac{q(x)}{p(x)} \right) = -\log \left( \sum_{x} q(x) \right) = -\log 1 = 0.$$
Since −log is strictly convex, we have equality iff q/p is constant, i.e., q = p.