Gaussian Mixture Models

David Rosenberg, Brett Bernstein

New York University

April 26, 2017

Intro Question

Suppose we begin with a dataset D = {x1, . . . , xn} ⊆ R² and we run k-means (or k-means++) to obtain k cluster centers. Below we have drawn the cluster centers. If we are given a new x ∈ R², we can assign it a label based on which cluster center is closest. What regions of the plane below correspond to each possible labeling?

[Figure: the cluster centers plotted in the plane; axis ticks from -0.2 to 1.2.]







Intro Solution

Note that the cells are disjoint (except for their borders) and convex. This can be thought of as a limitation of k-means: neither will be true for GMMs.

[Figure: the plane partitioned into one cell per cluster center; axis ticks from -0.2 to 1.2.]
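The nearest-center labeling rule can be sketched in a few lines of NumPy. This is only an illustration: the function name `nearest_center_labels` and the three centers are made up here and are not the centers drawn on the slide.

```python
import numpy as np

def nearest_center_labels(points, centers):
    """Label each point with the index of its closest cluster center."""
    points = np.atleast_2d(points)    # shape (n, d)
    centers = np.atleast_2d(centers)  # shape (k, d)
    # Squared Euclidean distance from every point to every center: shape (n, k).
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

# Hypothetical centers and query points, chosen only for illustration.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
new_points = np.array([[0.1, 0.1], [0.9, -0.2], [0.5, 0.8]])
labels = nearest_center_labels(new_points, centers)
```

Because each cell is an intersection of half-planes, the midpoint of two points with the same label receives that label too, which is one way to see the convexity claimed above.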







Gaussian Mixture Models


Yesterday's Intro Question

Consider the following probability model for generating data.

1 Roll a weighted k-sided die to choose a label z ∈ {1, . . . , k}. Let π denote the PMF for the die.

2 Draw x ∈ Rᵈ randomly from the multivariate normal distribution N(µz, Σz).

Solve the following questions.

1 What is the joint distribution of x ,z given π and the µz ,Σz values?

2 Suppose you were given the dataset D = {(x1, z1), . . . , (xn, zn)}. How would you estimate the die weightings and the µz, Σz values?

3 How would you determine the label for a new datapoint x?


Yesterday's Intro Solution

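1 The joint PDF/PMF is given by

$$p(x, z) = \pi(z)\, f(x; \mu_z, \Sigma_z),$$

where

$$f(x; \mu_z, \Sigma_z) = \frac{1}{\sqrt{|2\pi\Sigma_z|}} \exp\left(-\frac{1}{2}(x - \mu_z)^T \Sigma_z^{-1} (x - \mu_z)\right).$$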

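2 We could use maximum likelihood estimation. Our estimates are

$$n_z = \sum_{i=1}^n 1(z_i = z)$$

$$\hat{\pi}(z) = \frac{n_z}{n}$$

$$\hat{\mu}_z = \frac{1}{n_z} \sum_{i : z_i = z} x_i$$

$$\hat{\Sigma}_z = \frac{1}{n_z} \sum_{i : z_i = z} (x_i - \hat{\mu}_z)(x_i - \hat{\mu}_z)^T.$$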

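3 Choose the label that maximizes the joint: argmaxz p(x, z). Since p(x) does not depend on z, this is the same as maximizing p(z | x).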
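The three answers can be sketched together in NumPy. The function names are hypothetical, labels are assumed to be 0-indexed (0, ..., k-1) rather than 1, ..., k, every label is assumed to occur at least once, and each estimated covariance is assumed to be nonsingular.

```python
import numpy as np

def fit_labeled_gmm(X, z, k):
    """MLE of (pi, mu, Sigma) when labels z_i in {0, ..., k-1} are observed."""
    n, d = X.shape
    pi = np.zeros(k)
    mu = np.zeros((k, d))
    Sigma = np.zeros((k, d, d))
    for c in range(k):
        Xc = X[z == c]                   # points with label c
        n_c = len(Xc)                    # n_z in the slide's notation
        pi[c] = n_c / n
        mu[c] = Xc.mean(axis=0)
        diff = Xc - mu[c]
        Sigma[c] = diff.T @ diff / n_c   # divides by n_z, not n_z - 1
    return pi, mu, Sigma

def log_joint(x, pi, mu, Sigma):
    """log p(x, z) = log pi_z + log N(x | mu_z, Sigma_z) for every z."""
    out = np.empty(len(pi))
    for c in range(len(pi)):
        diff = x - mu[c]
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma[c])
        out[c] = np.log(pi[c]) - 0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma[c], diff)
    return out

def predict_label(x, pi, mu, Sigma):
    """argmax_z p(x, z), computed in log space."""
    return int(np.argmax(log_joint(x, pi, mu, Sigma)))

# Tiny made-up labeled dataset with two clusters.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0],
              [10.0, 10.0], [12.0, 10.0], [10.0, 12.0]])
z = np.array([0, 0, 0, 1, 1, 1])
pi_hat, mu_hat, Sigma_hat = fit_labeled_gmm(X, z, k=2)
label = predict_label(np.array([1.0, 1.0]), pi_hat, mu_hat, Sigma_hat)
```

Working with log p(x, z) instead of p(x, z) leaves the argmax unchanged and avoids underflow when x is far from a cluster.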


Probabilistic Model for Clustering

Let's consider a generative model for the data.


1 There are k clusters.

2 We have a probability density for each cluster.

Generate a point as follows:

1 Choose a random cluster z ∈ {1, 2, . . . , k}.

2 Choose a point from the distribution for cluster z.

The clustering algorithm is then:

1 Use training data to fit the parameters of the generative model.

2 For each point, choose the cluster with the highest likelihood based on the fitted model.


Gaussian Mixture Model (k = 3)

1 Choose z ∈ {1,2,3}

2 Choose x | z ∼ N(µz, Σz).
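The two-step generative process can be simulated directly. The sketch below assumes NumPy; the function name `sample_gmm` and all parameter values are illustrative and do not come from the slides, and components are indexed 0, 1, 2.

```python
import numpy as np

def sample_gmm(n, pi, mu, Sigma, rng):
    """Draw n points: first z ~ pi, then x | z ~ N(mu_z, Sigma_z)."""
    k = len(pi)
    z = rng.choice(k, size=n, p=pi)   # roll the weighted k-sided die
    x = np.stack([rng.multivariate_normal(mu[c], Sigma[c]) for c in z])
    return x, z

# Illustrative parameters for k = 3.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigma = np.array([np.eye(2),
                  0.5 * np.eye(2),
                  np.array([[1.0, 0.8], [0.8, 1.0]])])
rng = np.random.default_rng(0)
x, z = sample_gmm(2000, pi, mu, Sigma, rng)
```

In a clustering problem only `x` would be observed; `z` is returned here so the sample can be checked against the parameters that generated it.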


Gaussian Mixture Model Parameters (k Components)

Cluster probabilities: π = (π1, . . . , πk)

Cluster means: µ = (µ1, . . . , µk)

Cluster covariance matrices: Σ = (Σ1, . . . , Σk)

What if one cluster had many more points than another cluster?


Gaussian Mixture Model: Joint Distribution

Factorize the joint distribution:

p(x ,z) = p(z)p(x | z)

= πzN (x | µz ,Σz)

πz is the probability of choosing cluster z.

x | z has distribution N(µz, Σz).

The z corresponding to x is the true cluster assignment.

Suppose we know all the parameters of the model.

Then we can easily compute the joint p(x, z) and the conditional p(z | x).


Latent Variable Model

We observe x .

In the intro problem we had labeled data, but here we don't observe z, the cluster assignment.

Cluster assignment z is called a hidden variable or latent variable.


A latent variable model is a probability model for which certain variables are never observed.

e.g. The Gaussian mixture model is a latent variable model.


The GMM "Inference" Problem

We observe x . We want to know z .

The conditional distribution of the cluster z given x is

p(z | x) = p(x ,z)/p(x)

The conditional distribution is a soft assignment to clusters.

A hard assignment is

$$z^* = \arg\max_{z \in \{1, \dots, k\}} p(z \mid x).$$

So if we have the model, clustering is trivial.
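With known parameters, the soft and hard assignments for a new point take only a few lines. This is a sketch assuming NumPy; the helper names and the two-component parameters are hypothetical, and components are indexed from 0.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """log N(x | mean, cov) for a single point x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff)

def posterior(x, pi, mu, Sigma):
    """Soft assignment p(z | x) = p(x, z) / p(x) for z = 0, ..., k-1."""
    log_joint = np.array([np.log(pi[c]) + gaussian_logpdf(x, mu[c], Sigma[c])
                          for c in range(len(pi))])
    log_joint -= log_joint.max()   # stabilize before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()     # normalizing plays the role of dividing by p(x)

# Illustrative model: two unit-covariance components with equal weight.
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [2.0, 0.0]])
Sigma = np.array([np.eye(2), np.eye(2)])

soft = posterior(np.array([1.0, 0.0]), pi, mu, Sigma)                  # equidistant point
hard = int(np.argmax(posterior(np.array([1.8, 0.3]), pi, mu, Sigma)))  # hard assignment
```

The point halfway between the two means gets a soft assignment of one half to each cluster, while the hard assignment for a point near the second mean picks that component.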

Mixture Models


Gaussian Mixture Model: Marginal Distribution

The marginal distribution for a single observation x is

$$p(x) = \sum_{z=1}^k p(x, z) = \sum_{z=1}^k \pi_z \, \mathcal{N}(x \mid \mu_z, \Sigma_z)$$

Note that p(x) is a convex combination of probability densities.

This is a common form for a probability model...
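A quick numerical check that a convex combination of densities is itself a density, using only the standard library. The one-dimensional weights, means, and variances are made up for illustration, and the integral is approximated by a Riemann sum on a grid wide enough that the tails are negligible.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_z pi_z N(x | mu_z, sigma_z^2)."""
    return sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Illustrative two-component mixture.
weights, means, variances = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.5]

# Riemann sum over [-15, 15] with step 0.001.
step = 0.001
grid = [-15 + step * i for i in range(30001)]
total_mass = step * sum(mixture_pdf(x, weights, means, variances) for x in grid)
```

The total mass comes out numerically equal to 1 because each component integrates to 1 and the weights sum to 1.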


Mixture Distributions (or Mixture Models)


A probability density p(x) represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities. That is,

$$p(x) = \sum_{i=1}^k w_i \, p_i(x),$$

where wi ≥ 0, w1 + · · · + wk = 1, and each pi is a probability density.

In our Gaussian mixture model, x has a mixture distribution.

More constructively, let S be a set of probability distributions:

1 Choose a distribution randomly from S.

2 Sample x from the chosen distribution.

Then x has a mixture distribution.

Learning in Gaussian Mixture Models


The GMM "Learning" Problem

Given data x1, . . . ,xn drawn from a GMM,

Estimate the parameters:

Cluster probabilities: π = (π1, . . . , πk)

Cluster means: µ = (µ1, . . . , µk)

Cluster covariance matrices: Σ = (Σ1, . . . , Σk)

Once we have the parameters, we're done.

Just do "inference" to get cluster assignments.


Estimating/Learning the Gaussian Mixture Model

One approach to learning is maximum likelihood:

Find parameter values that give the observed data the highest likelihood.

The model likelihood for D= {x1, . . . ,xn} is

$$L(\pi, \mu, \Sigma) = \prod_{i=1}^n p(x_i) = \prod_{i=1}^n \sum_{z=1}^k \pi_z \, \mathcal{N}(x_i \mid \mu_z, \Sigma_z).$$

As usual, we'll take our objective function to be the log of this:

$$J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log\left\{ \sum_{z=1}^k \pi_z \, \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}$$
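The log-likelihood objective can be evaluated stably with the log-sum-exp trick. This is a minimal NumPy sketch; the function name `gmm_log_likelihood` is hypothetical, and the covariances are assumed to be nonsingular.

```python
import numpy as np

def gmm_log_likelihood(X, pi, mu, Sigma):
    """J(pi, mu, Sigma) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    n, k = X.shape[0], len(pi)
    log_terms = np.empty((n, k))
    for c in range(k):
        diff = X - mu[c]
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma[c])
        # Mahalanobis term (x_i - mu_c)^T Sigma_c^{-1} (x_i - mu_c) for every i.
        maha = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma[c], diff.T).T)
        log_terms[:, c] = np.log(pi[c]) - 0.5 * logdet - 0.5 * maha
    # Stable log-sum-exp over components, then sum over data points.
    m = log_terms.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))).sum())

# Sanity check on made-up data: one standard normal component, and the same
# Gaussian split into two identical components, must give the same value.
X = np.array([[0.0], [1.0]])
one = gmm_log_likelihood(X, np.array([1.0]), np.array([[0.0]]), np.array([[[1.0]]]))
two = gmm_log_likelihood(X, np.array([0.4, 0.6]), np.zeros((2, 1)), np.ones((2, 1, 1)))
```

Subtracting the row maximum before exponentiating changes nothing mathematically, but it keeps the sum inside the log from underflowing to zero when a point is far from every component.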



Properties of the GMM Log-Likelihood

GMM log-likelihood:

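$$J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log\left\{ \sum_{z=1}^k \pi_z \, \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}$$

Let's compare to the log-likelihood for a single Gaussian:

$$\sum_{i=1}^n \log \mathcal{N}(x_i \mid \mu, \Sigma) = -\frac{nd}{2} \log(2\pi) - \frac{n}{2} \log|\Sigma| - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$$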

For a single Gaussian, the log cancels the exp in the Gaussian density.

=⇒ Things simplify a lot.

For the GMM, the sum inside the log prevents this cancellation.

=⇒ Expression more complicated. No closed form expression for MLE.

Issues with MLE for GMM


Identifiability Issues for GMM

Suppose we have found parameters

Cluster probabilities: π = (π1, . . . , πk)

Cluster means: µ = (µ1, . . . , µk)

Cluster covariance matrices: Σ = (Σ1, . . . , Σk)

that are at a local minimum of the negative log-likelihood.

What happens if we shuffle the clusters? E.g., switch the labels for clusters 1 and 2.

We'll get the same likelihood. How many such equivalent settings are there?

Assuming all clusters are distinct, there are k! equivalent solutions.

Not a problem per se, but something to be aware of.
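The label-switching symmetry is easy to confirm numerically. The sketch below uses a one-dimensional mixture with made-up data and parameters (k = 3) and only the standard library.

```python
import math
from itertools import permutations

def log_likelihood_1d(xs, weights, means, variances):
    """Log-likelihood of a one-dimensional Gaussian mixture."""
    total = 0.0
    for x in xs:
        px = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances))
        total += math.log(px)
    return total

# Made-up data and three distinct components, each given as (pi_z, mu_z, sigma_z^2).
xs = [-1.2, 0.1, 0.4, 2.9, 3.3, 7.0]
params = [(0.2, 0.0, 1.0), (0.5, 3.0, 0.5), (0.3, 7.0, 2.0)]

# Evaluate the log-likelihood under every relabeling of the components.
values = [log_likelihood_1d(xs, *zip(*perm)) for perm in permutations(params)]
```

All 3! = 6 relabelings give the same value up to floating-point rounding, since the sum over components does not depend on their order.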


Singularities for GMM

Consider the following GMM for 7 data points:

Let σ2 be the variance of the skinny component.

What happens to the likelihood as σ2→ 0?

In practice, we end up in local minima that do not have this problem.

Or keep restarting optimization until we do.

A Bayesian approach or regularization will also solve the problem.

From Bishop's Pattern recognition and machine learning, Figure 9.7.
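The singularity can be reproduced with a small experiment. This sketch is not Bishop's example: it assumes a one-dimensional mixture of two equally weighted components, a broad one fixed at N(0, 4) and a skinny one centered exactly on the first data point, with 7 made-up data points.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, var_skinny):
    """Two equally weighted components; the skinny one sits exactly on xs[0]."""
    total = 0.0
    for x in xs:
        broad = normal_pdf(x, 0.0, 4.0)
        skinny = normal_pdf(x, xs[0], var_skinny)
        total += math.log(0.5 * broad + 0.5 * skinny)
    return total

xs = [1.0, -2.1, 0.3, 2.5, -0.7, 1.9, -1.4]   # 7 made-up points
variances = (1e-2, 1e-4, 1e-6, 1e-8)
lls = [log_likelihood(xs, v) for v in variances]
```

As the variance shrinks, the density of the skinny component at its own center grows like 1/σ, while the broad component keeps the other six terms bounded below, so the log-likelihood increases without bound.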


Gradient Descent / SGD for GMM

What about running gradient descent or SGD on

$$J(\pi, \mu, \Sigma) = -\sum_{i=1}^n \log\left\{ \sum_{z=1}^k \pi_z \, \mathcal{N}(x_i \mid \mu_z, \Sigma_z) \right\}$$


Can be done, but we need to be clever about it.

Each matrix Σ1, . . . , Σk has to be positive semidefinite.

How to maintain that constraint?

Rewrite Σi = Mi Miᵀ, where Mi is an unconstrained matrix.

Then Σi is positive semidefinite.
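The reparameterization works because vᵀ M Mᵀ v = ‖Mᵀ v‖² ≥ 0 for every vector v. The NumPy sketch below checks this numerically; the 3×3 size is arbitrary and the "gradient step" is a random stand-in, not a gradient of the GMM objective.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))   # unconstrained parameter: any real entries are allowed
Sigma = M @ M.T               # covariance built from M

# Sigma is symmetric, so eigvalsh applies; all eigenvalues should be >= 0.
eigenvalues = np.linalg.eigvalsh(Sigma)

# An arbitrary update of M (random stand-in for a real gradient step)
# still yields a positive semidefinite covariance.
M_next = M - 0.1 * rng.normal(size=(3, 3))
Sigma_next = M_next @ M_next.T
```

Optimizing over M instead of Σ turns the constrained problem into an unconstrained one, at the cost of a non-unique parameterization (many M give the same Σ).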

The EM Algorithm for GMM



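From yesterday's intro questions, we know that we can solve the MLE problem if the cluster assignments zi are known:

$$n_z = \sum_{i=1}^n 1(z_i = z)$$

$$\hat{\pi}(z) = \frac{n_z}{n}$$

$$\hat{\mu}_z = \frac{1}{n_z} \sum_{i : z_i = z} x_i$$

$$\hat{\Sigma}_z = \frac{1}{n_z} \sum_{i : z_i = z} (x_i - \hat{\mu}_z)(x_i - \hat{\mu}_z)^T.$$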

In the EM algorithm we will modify the equations to handle our evolving soft assignments, which we will call responsibilities.


Cluster Responsibilities: Some New Notation

Denote the probability that observed value xi comes from cluster j by

$$\gamma_i^j = P(Z = j \mid X = x_i).$$

This is the responsibility that cluster j takes for observation xi.

Computationally,

$$\gamma_i^j = \frac{p(Z = j, X = x_i)}{p(x_i)} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^k \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}.$$

The vector $(\gamma_i^1, \dots, \gamma_i^k)$ is exactly the soft assignment for xi.

Let $n_c = \sum_{i=1}^n \gamma_i^c$ be the "number" of points soft assigned to cluster c.


EM Algorithm for GMM: Overview

If we know π and µj, Σj for all j, then we can easily find

γ_i^j = P(Z = j | X = xi).

If we know the (soft) assignments, we can easily find estimates for π and for µj, Σj for all j.

Repeatedly alternate the previous 2 steps.


EM Algorithm for GMM: Overview

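1 Initialize parameters µ, Σ, π.

2 "E step". Evaluate the responsibilities using current parameters:

$$\gamma_i^j = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^k \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}, \quad \text{for } i = 1, \dots, n \text{ and } j = 1, \dots, k.$$

3 "M step". Re-estimate the parameters using responsibilities. [Compare with intro question.]

$$\mu_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^n \gamma_i^c \, x_i$$

$$\Sigma_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^n \gamma_i^c \, (x_i - \mu_c^{\text{new}})(x_i - \mu_c^{\text{new}})^T$$

$$\pi_c^{\text{new}} = \frac{n_c}{n}$$

4 Repeat from Step 2, until log-likelihood converges.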
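The four steps can be put together in a compact NumPy sketch. Several choices here are assumptions rather than part of the slides: the caller supplies the initial means `mu_init`, the covariances start as a scaled identity, a small `ridge` is added to every covariance to keep it invertible, and the loop stops once the log-likelihood improves by less than `tol`.

```python
import numpy as np

def log_gaussians(X, mu, Sigma):
    """Matrix of log N(x_i | mu_c, Sigma_c), shape (n, k)."""
    n, k = X.shape[0], mu.shape[0]
    out = np.empty((n, k))
    for c in range(k):
        diff = X - mu[c]
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma[c])
        maha = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma[c], diff.T).T)
        out[:, c] = -0.5 * logdet - 0.5 * maha
    return out

def em_gmm(X, mu_init, n_iter=100, tol=1e-8, ridge=1e-6):
    n, d = X.shape
    # Step 1: initialize (means from the caller, spherical covariances, uniform pi).
    mu = np.array(mu_init, dtype=float)
    k = mu.shape[0]
    Sigma = np.stack([np.eye(d) * X.var(axis=0).mean()] * k)
    pi = np.full(k, 1.0 / k)
    history = []
    for _ in range(n_iter):
        # Step 2 (E step): responsibilities gamma[i, c], computed in log space.
        log_joint = np.log(pi) + log_gaussians(X, mu, Sigma)
        m = log_joint.max(axis=1, keepdims=True)
        log_px = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
        gamma = np.exp(log_joint - log_px[:, None])
        history.append(float(log_px.sum()))   # log-likelihood at current parameters
        # Step 4: stop when the log-likelihood has converged.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # Step 3 (M step): responsibility-weighted versions of the labeled estimates.
        n_c = gamma.sum(axis=0)
        mu = (gamma.T @ X) / n_c[:, None]
        for c in range(k):
            diff = X - mu[c]
            Sigma[c] = (gamma[:, c, None] * diff).T @ diff / n_c[c] + ridge * np.eye(d)
        pi = n_c / n
    return pi, mu, Sigma, gamma, history

# Made-up data: two well-separated spherical clusters, with rough initial means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
               rng.normal([5.0, 5.0], 0.5, size=(150, 2))])
pi, mu, Sigma, gamma, history = em_gmm(X, mu_init=[[1.0, 1.0], [4.0, 4.0]])
```

The returned `gamma` comes from the last E step that was run, and `history` records the log-likelihood before each M step, so it can be inspected to confirm that the objective does not decrease.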

EM for GMM

[Figure sequence showing EM on an example dataset: the first soft assignment, after 5 rounds of EM, and after 20 rounds of EM.]

From Bishop's Pattern recognition and machine learning, Figure 9.8.


Relation to K-Means

EM for GMM seems a little like k-means.

In fact, there is a precise correspondence.

First, fix each cluster covariance matrix to be σ²I.

Then the density for each Gaussian only depends on distance to the mean.

As we take σ² → 0, the update equations converge to doing k-means.

If you do a quick experiment yourself, you'll find

Soft assignments converge to hard assignments.

This has to do with the tail behavior (exponential decay) of the Gaussian.

Can use k-means++ to initialize parameters of EM algorithm.
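The quick experiment can be sketched as follows, assuming NumPy, covariance σ²I for every component, and equal cluster probabilities by default. The centers and the query point are made up. With a shared σ², the normalizing constants of the Gaussians cancel, so the responsibilities are a softmax of the negative squared distances scaled by 1/(2σ²).

```python
import numpy as np

def responsibilities_spherical(x, centers, sigma2, pi=None):
    """gamma_j proportional to pi_j * exp(-||x - mu_j||^2 / (2 sigma^2))."""
    k = len(centers)
    pi = np.full(k, 1.0 / k) if pi is None else np.asarray(pi, dtype=float)
    sq_dists = ((centers - x) ** 2).sum(axis=1)
    logits = np.log(pi) - sq_dists / (2.0 * sigma2)
    logits -= logits.max()     # keep the largest term at exp(0) = 1
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical centers and a point that is closest to the first center.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
x = np.array([0.4, 0.1])
by_sigma2 = {s2: responsibilities_spherical(x, centers, s2)
             for s2 in (10.0, 1.0, 0.1, 0.001)}
```

For large σ² the responsibilities are close to uniform, and as σ² shrinks nearly all of the weight moves to the nearest center, which is the k-means assignment.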

Math Prerequisites for General EM Algorithm


Jensen's Inequality

Which is larger: E[X²] or (E[X])²?

Must be E[X²], since Var[X] = E[X²] − (E[X])² ≥ 0.

A more general result is true:

Jensen's Inequality: If f : R → R is convex and X is a random variable, then E[f(X)] ≥ f(E[X]). If f is strictly convex, then we have equality iff X = E[X] with probability 1 (i.e., X is constant).
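Jensen's inequality is easy to check numerically for a small discrete random variable. The values and probabilities below are made up, and the three convex functions (square, exponential, absolute value) are just convenient examples.

```python
import math

# A made-up discrete random variable X: values and their probabilities.
values = [-1.0, 0.5, 2.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expectation(g):
    """E[g(X)] for the discrete distribution above."""
    return sum(p * g(x) for x, p in zip(values, probs))

mean = expectation(lambda x: x)

# Jensen gap E[f(X)] - f(E[X]) for three convex functions f.
gaps = {
    "square": expectation(lambda x: x ** 2) - mean ** 2,   # equals Var[X]
    "exp": expectation(math.exp) - math.exp(mean),
    "abs": expectation(abs) - abs(mean),
}
```

Every gap is nonnegative, and the gap for the square function is exactly the variance of X.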



Proof of Jensen


Suppose X can take exactly two values: x1 with probability π1 and x2 with probability π2. Prove Jensen's inequality in this case.

Let's compute E[f(X)]:

E[f(X)] = π1 f(x1) + π2 f(x2) ≥ f(π1 x1 + π2 x2) = f(E[X]).

The inequality is exactly the definition of convexity, since π1 + π2 = 1.

For the general proof, what do we know is true about all convex functions f : R → R?

David Rosenberg, Brett Bernstein (New York University), DS-GA 1003, April 26, 2017



Proof of Jensen

1 Let e = E[X ]. (Remember e is just a number.)

2 Since f has a subgradient at e, there is an underestimating line g(x) = ax + b that passes through the point (e, f(e)).

3 Then we have

$$\begin{aligned}
E[f(X)] &\ge E[g(X)] \\
&= E[aX + b] \\
&= aE[X] + b \\
&= ae + b \\
&= f(e) \\
&= f(E[X]).
\end{aligned}$$

4 If f is strictly convex, then f = g at exactly one point, so we have equality iff X is constant.


KL Divergence


Let p(x) and q(x) be probability mass functions (PMFs) on X.

We want to measure how different they are.

The Kullback-Leibler or "KL" divergence is defined by

$$KL(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$

(Assumes absolute continuity: q(x) = 0 implies p(x) = 0.)

Can also write

$$KL(p \| q) = E_{x \sim p} \log \frac{p(x)}{q(x)}.$$

Note that the KL divergence is not symmetric and doesn't satisfy the triangle inequality.
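A small standard-library sketch of the definition makes the asymmetry concrete. The function name and the two PMFs are made up, the natural logarithm is assumed (the slides do not fix a base), and a violation of absolute continuity is reported as infinity.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for PMFs given as equal-length lists over the same outcomes."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue              # terms with p(x) = 0 contribute 0
        if qx == 0.0:
            return math.inf       # absolute continuity is violated
        total += px * math.log(px / qx)
    return total

# Two made-up PMFs on three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
```

Here `kl_pq` and `kl_qp` are both positive but differ from each other, and `kl_divergence(p, p)` is exactly zero.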


Gibbs' Inequality


Gibbs' Inequality: Let p(x) and q(x) be PMFs on X. Then

KL(p‖q) ≥ 0,

with equality iff p(x) = q(x) for all x ∈ X.

Since

$$KL(p \| q) = E_p\left[ -\log \frac{q(x)}{p(x)} \right],$$

this is screaming for Jensen's inequality.


Gibbs' Inequality: Proof

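$$\begin{aligned}
KL(p \| q) &= E_p\left[ -\log \frac{q(x)}{p(x)} \right] \\
&\ge -\log E_p\left[ \frac{q(x)}{p(x)} \right] \\
&= -\log \sum_{x : p(x) > 0} p(x) \frac{q(x)}{p(x)} \\
&= -\log \sum_{x : p(x) > 0} q(x) \\
&\ge -\log \sum_{x \in \mathcal{X}} q(x) = -\log 1 = 0.
\end{aligned}$$

The first inequality is Jensen's, applied to the convex function − log. The last inequality holds because dropping the restriction p(x) > 0 can only increase the sum.

Since − log is strictly convex, we have equality iff q/p is constant, i.e., q = p.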

