TRANSCRIPT
-
MLPR: Clustering, Mixture Models, and the EM Algorithm
Machine Learning and Pattern Recognition
Charles Sutton
School of Informatics, University of Edinburgh
1 / 36
-
Overview
I Up until now: supervised learning. Training data contains examples of inputs and outputs
I Now we will see a few examples of unsupervised learning.
I Unsupervised methods: The algorithm is given a set of training inputs but no outputs. Just “find some patterns”
I This lecture: clustering.
I Clustering: Given a set of feature vectors {x1, x2, . . . , xN}, assign each data point xi to a cluster label zi ∈ {1, 2, . . . , K}
I Sort of like classification, except that “the class labels” for the training data aren’t given!
I Idea is that items that are similar in feature space should be assigned to the same group
2 / 36
-
Example: Image Clustering
Input: Each xi represents one image
Output: zi ∈ {1, 2, . . . , K}, a cluster label
e.g., Input:
Source: http://137.189.35.203/WebUI/CatDatabase/catData.html
3 / 36
-
Example: Image Clustering
e.g., Outputs:

[Figure: the cat images from the previous slide, each circled and labelled with a cluster assignment zi ∈ {1, 2, 3, 4}]

I The circles give a potential assignment to the zi
I Notice that we did not get a group label in the training inputs! This is the key difference from classification.
4 / 36
-
Examples: Time Series Clustering
Input: Each of x1 . . . xN is one of the lines in the figure below. xi = (xi1, xi2, . . . , xi6). Each xit is one of the points (circles) in the figure.
Example input:
[Figure: yeast microarray data; gene expression (y-axis: genes, −5 to 5) plotted against time (x-axis: 0 to 20.5), with braces marking Cluster 1 and Cluster 2]
5 / 36
-
Examples: Time Series Clustering
I Details: Each line represents a gene in yeast
I We measure how much each gene is expressed (i.e., converted into RNA) over time, as some experimental condition is changed
[Figure: the same yeast microarray plot as on the previous slide]
6 / 36
-
Examples: Time Series Clustering
Output: Assignment of each vector to a cluster
Example output:
[Figure: the yeast microarray curves, with braces marking the assignment of each curve to Cluster 1 or Cluster 2]
7 / 36
-
Ill-Posed Problem
I There is no “one right answer” to a clustering problem.
I For example, could cluster cats by colour:
[Figure: the cat images, clustered by colour into groups 1–4]
8 / 36
-
Ill-Posed Problem
I Or by location:

[Figure: the same cat images, clustered by location instead, giving a different assignment to groups 1–4]

I Remember no free lunch from first lecture? Different clustering methods will make different prior assumptions, which determine which of these clusterings they find.
9 / 36
-
Clustering Methods
I There are lots and lots of clustering methods:
I K-means, K-medoids
I Hierarchical clustering (agglomerative or top down)
I Spectral clustering
I Biclustering
I Many of these aren’t probabilistic.
I We won’t go through the above methods in this course.
I Today we will do
I Mixture models
I which are commonly used for clustering, but are used for other problems as well (density estimation, and even supervised learning).
10 / 36
-
Mixture Modelling
I A mixture model is a way of making a simple density more flexible.
I Previously, we have had models that looked like p(xi|θ)
I To make a mixture model, you need to start from a base, like Gaussians, betas, Bernoullis, etc.
I From this create a “mixture of Gaussians”, “mixture of betas”, etc.
I The main idea will be to add a new random variable to the model.
I We will have a joint distribution over zi and xi.
I zi is a variable that we add just to make the model more convenient. Called a latent variable.
11 / 36
-
Mixture Modelling
I Previously, we have had models that looked like p(xi|θ)
I Now we will have one set of parameters for each cluster: θ1, θ2, . . . , θK
I In general, the model will be

p(zi = k) = πk
p(xi|zi = k) = p(xi|θk)

I π = (π1, π2, . . . , πK) represents how common the clusters are. These are called mixing weights
I The p(xi|θk) are the base distributions that we are mixing together
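A minimal sketch of the generative process this defines, assuming NumPy is available and using a Gaussian base distribution; the parameter values are made up for illustration, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters: K = 3 clusters in 1D.
pi = np.array([0.5, 0.3, 0.2])      # mixing weights p(z = k)
mu = np.array([-2.0, 0.0, 3.0])     # per-cluster means
sigma = np.array([0.5, 1.0, 0.8])   # per-cluster standard deviations

def sample_mixture(n):
    """Draw n points: first z ~ Categorical(pi), then x ~ p(x | theta_z)."""
    z = rng.choice(len(pi), size=n, p=pi)   # latent cluster labels
    x = rng.normal(mu[z], sigma[z])         # observed data
    return x, z

x, z = sample_mixture(5)
print(x, z)
```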
12 / 36
-
Example 1: Mixture of Gaussians
[Figure: contour plots of the density of a mixture of Gaussians]
p(zi = k) = πk
p(xi|zi = k) = N (xi;µk,Σk)
Notice a different mean and covariance for each cluster
(Above: plots of the density of a MOG for K = 3)
13 / 36
-
Example 2: Mixture of Multinoullis
I Suppose each data point is a bit vector, i.e., xi ∈ {0, 1}^D
I Can choose p(x|z) to be a product of Bernoullis. Let one training instance be x = (x1, x2, . . . , xD) where xj ∈ {0, 1}. We define

p(x|z = k, θk) = ∏_{j=1}^{D} θ_{kj}^{xj} (1 − θ_{kj})^{1−xj},

i.e., if we are in cluster k, then the distribution over the bit vector x is a product of independent Bernoullis.
θk = (θk1, θk2, . . . , θkD) gives the Bernoulli parameter for each dimension, i.e.,

p(xj = 1|z = k, θk) = θkj
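A quick sketch of evaluating this class-conditional likelihood for one bit vector, assuming NumPy; the θk values below are made up for the example:

```python
import numpy as np

def bernoulli_product_lik(x, theta_k):
    """p(x | z = k, theta_k) = prod_j theta_kj^{x_j} (1 - theta_kj)^{1 - x_j}."""
    x = np.asarray(x, dtype=float)
    theta_k = np.asarray(theta_k, dtype=float)
    return np.prod(theta_k**x * (1.0 - theta_k)**(1.0 - x))

# Example with D = 3 and made-up parameters.
print(bernoulli_product_lik([1, 0, 1], [0.9, 0.2, 0.7]))  # 0.9 * 0.8 * 0.7 = 0.504
```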
14 / 36
-
A Point about Expressivity
I Squinting at this equation

p(x|z = k, θk) = ∏_{j=1}^{D} θ_{kj}^{xj} (1 − θ_{kj})^{1−xj},

we see that given z = k, all pairs of variables xi and xj are independent.
I But marginally xi and xj are dependent under this model!
I This is because xi provides information about z, which provides info about xj
I For example, consider D = 2 and K = 2, and

θ1 = (0.99, 0.01)
θ2 = (0.01, 0.99)
π = (0.5, 0.5)

Once I see x1, I am almost certain about the value of x2 (see the numerical check below).
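To make that concrete, here is a small check, assuming NumPy (my own illustration): conditional on z the two bits are independent, but marginally p(x1 = 1, x2 = 1) is far from p(x1 = 1) p(x2 = 1).

```python
import numpy as np

pi = np.array([0.5, 0.5])
theta = np.array([[0.99, 0.01],    # theta_1
                  [0.01, 0.99]])   # theta_2

# Joint marginal p(x1 = 1, x2 = 1) = sum_k pi_k * theta_k1 * theta_k2
p_joint = np.sum(pi * theta[:, 0] * theta[:, 1])

# Single-bit marginals p(x1 = 1) and p(x2 = 1)
p_x1 = np.sum(pi * theta[:, 0])
p_x2 = np.sum(pi * theta[:, 1])

print(p_joint)       # 0.0099
print(p_x1 * p_x2)   # 0.25 -- very different, so x1 and x2 are marginally dependent
```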
15 / 36
-
Summary: A Point about Expressivity
I A product of independent Bernoullis is not a very expressive model.
I We have made the model more expressive by embedding it within a mixture.
I Now we can represent correlation, when we couldn’t before
I This is a general point about mixtures: The family of distributions in a mixture model will be larger than that of the original model
16 / 36
-
Maximizing Likelihood
I Parameters to estimate are θ = (θ1, . . . , θK , π)
I The definition of the likelihood doesn’t change just because we added latent variables. It is still the probability of data given parameters, i.e.,

L(θ) = ∑_{i=1}^{N} log p(xi|θ)

I But now we have to compute a marginal distribution to get p(xi|θ), i.e.,

L(θ) = ∑_{i=1}^{N} log ∑_{k=1}^{K} p(xi|zi = k, θ) p(zi = k|θ)
17 / 36
-
Likelihood Example
I For a mixture of Gaussians we would have
L(θ) = ∑_{i=1}^{N} log ∑_{k=1}^{K} N(xi; µk, Σk) πk
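A minimal sketch of evaluating this log-likelihood for a 1D mixture of Gaussians, assuming NumPy and SciPy are available; here sigma is a standard deviation (the 1D analogue of Σk), and log-sum-exp is used for numerical stability. The parameter values are made up.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mog_log_likelihood(x, pi, mu, sigma):
    """L(theta) = sum_i log sum_k pi_k N(x_i; mu_k, sigma_k^2) for 1D data x."""
    x = np.asarray(x)[:, None]                       # shape (N, 1)
    log_pk = np.log(pi) + norm.logpdf(x, mu, sigma)  # (N, K): log pi_k + log N(x_i; mu_k, sigma_k)
    return logsumexp(log_pk, axis=1).sum()

# Made-up parameters and data, K = 2.
print(mog_log_likelihood([1.25, 1.073, 1.75, 0.234, 0.484],
                         pi=[0.5, 0.5], mu=[1.4, 0.4], sigma=[0.3, 0.3]))
```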
18 / 36
-
Multimodality
I The likelihood for a mixture model can be multimodal, for the same reason as the neural network likelihood can be.
I For example, we can do label switching, i.e., permute θ1 . . . θK, and then permute π1 . . . πK to match.
I When we run an optimization procedure, we will only find a local optimum
I More technically, the log likelihood is not convex (can verify by taking second derivatives)
19 / 36
-
How to apply to Clustering
I Given data {x1,x2, . . . ,xN} and a model
p(x1, . . . ,xN , z1 . . . zN |θ1 . . . θK , π)
I Maximize likelihood to get ML parameters θ̂1 . . . θ̂K , π̂
I For each training point xi, compute

rik := p(zi = k|xi, θ̂1 . . . θ̂K, π̂)

This is called the responsibility of cluster k for point i.
I Interpret the responsibilities as a soft assignment of points to clusters:
I If you need xi to be assigned to exactly one cluster, take arg max_k rik.
I Soft clustering: each point gets a distribution over clusters; hard clustering: each point gets exactly one cluster
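A minimal sketch of computing these responsibilities for a 1D Gaussian mixture by Bayes’ rule, assuming NumPy and SciPy; the parameters are made up, and sigma is a standard deviation.

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, pi, mu, sigma):
    """r_ik = p(z_i = k | x_i, theta) for a 1D Gaussian mixture."""
    x = np.asarray(x)[:, None]                         # shape (N, 1)
    joint = np.asarray(pi) * norm.pdf(x, mu, sigma)    # (N, K): pi_k * N(x_i; mu_k, sigma_k^2)
    return joint / joint.sum(axis=1, keepdims=True)    # normalise over k

r = responsibilities([1.25, 0.234], pi=[0.5, 0.5], mu=[1.4, 0.4], sigma=[0.3, 0.3])
print(r)                  # soft assignments; each row sums to 1
print(r.argmax(axis=1))   # hard assignment: arg max_k r_ik
```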
20 / 36
-
Optimizing the Likelihood
I Could use any of the methods from the last lecture, e.g., gradient descent, etc.
I But in practice we usually don’t use those for mixture models
I It turns out to be easier to apply a different optimization algorithm called expectation maximization (EM)
I EM is a special optimization algorithm for probabilistic models.
I Think of it as a competitor to gradient descent, conjugate gradient, etc.
21 / 36
-
Motivation for EM
Consider a 1D mixture of Gaussians. The model is:

p(x) = ∑_{k=1}^{K} πk (2πσk²)^{−1/2} exp{ −(x − µk)² / (2σk²) }

Your data look like:

xi      zi
1.25    ??
1.073   ??
1.75    ??
0.234   ??
0.484   ??

Difficult to figure out how to estimate πk, µk, σk².
22 / 36
-
Motivation for EM
Consider a 1D mixture of Gaussians. The model is:

p(x) = ∑_{k=1}^{K} πk (2πσk²)^{−1/2} exp{ −(x − µk)² / (2σk²) }

Suppose someone tells you the cluster labels. Now your data becomes:

xi      zi
1.25    1
1.073   2
1.75    1
0.234   2
0.484   2

Now do you know how to estimate πk, µk, σk²?
23 / 36
-
Hard Cluster Assignment Given
Yes! This is just a class-conditional Gaussian. ML is easy.

xi      zi
1.25    1
1.073   2
1.75    1
0.234   2
0.484   2

e.g.,

πk = (∑_{i=1}^{N} I{zi = k}) / N

µk = (1 / ∑_{i=1}^{N} I{zi = k}) ∑_{i=1}^{N} I{zi = k} xi

σk² = (1 / ∑_{i=1}^{N} I{zi = k}) ∑_{i=1}^{N} I{zi = k} (xi − µk)²
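A minimal sketch of these hard-assignment estimates, assuming NumPy and using the toy data and labels from the table above:

```python
import numpy as np

x = np.array([1.25, 1.073, 1.75, 0.234, 0.484])
z = np.array([1, 2, 1, 2, 2])   # hard cluster labels from the slide

for k in (1, 2):
    mask = (z == k)                            # indicator I{z_i = k}
    pi_k = mask.mean()                         # sum_i I{z_i = k} / N
    mu_k = x[mask].mean()                      # class-conditional mean
    var_k = ((x[mask] - mu_k) ** 2).mean()     # class-conditional variance (ML estimate)
    print(k, pi_k, mu_k, var_k)
```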
25 / 36
-
Soft Cluster Assignment Given

xi      zi
1.25    [0.9, 0.1]
1.073   2
1.75    1
0.234   2
0.484   2

I Suppose instead you get a soft assignment to clusters [ri1, ri2]
I ri1 is the degree to which the point belongs to cluster 1, ri2 the degree to cluster 2, with ri1 + ri2 = 1

Intuitively we’d like to treat each point as partially belonging to both clusters, i.e.,

πk = (∑_{i=1}^{N} rik) / N

µk = (1 / ∑_{i=1}^{N} rik) ∑_{i=1}^{N} rik xi

σk² = (1 / ∑_{i=1}^{N} rik) ∑_{i=1}^{N} rik (xi − µk)²

We’re taking weighted averages rather than sample averages.
26 / 36
-
Outline of EM
I EM is an algorithm for performing maximum likelihood where there is missing data.
I Can be used where:
I Data x = (x1 . . . xN)
I A model p(x, z|θ)
I We want the ML solution

θ̂ = arg max_θ log ∑_z p(x, z|θ)

I A model of the form p(x, z|θ) is called a latent variable model because we do not get to observe the random variable z in the observed data
27 / 36
-
EM for Mixture Models
EM is an iterative algorithm, like gradient descent. At every iteration t we have a current guess θ^(t) of the parameters
I Repeat
1. E-step. Compute a soft assignment to the latent variables: compute q(z) := p(z|x, θ^(t)).
2. M-step. Re-estimate the parameters based on the soft assignment. Define the complete-data log likelihood

ℓc(θ) = log p(x, z|θ)

Then

θ^(t+1) ← arg max_θ ∑_z q(z) ℓc(θ)

where the objective ∑_z q(z) ℓc(θ) is written Q(θ, θ^(t)).
3. t ← t + 1
I until converged, i.e., d(θ^(t), θ^(t−1)) < ε
28 / 36
-
Specializing it to a Model
I Given a new model p(z, x|θ), may need to derive the EM algorithm for that model
I To do this, need
I Method for computing E-step: q(z) := p(z|x, θ^(t))
I Method for computing M-step: max_θ Q(θ, θ^(t))
I Example: Mixture of Gaussians
I Already described p(z|x, θ^(t)); this is the responsibility
I It turns out that max_θ Q(θ, θ^(t)) is attained at the weighted-average estimates on the previous slide!
29 / 36
-
Example: EM on a Gaussian Mixture
I Repeat
1. E-step. Assign the points to clusters. Compute

rik = p(zi = k | xi, θ1^(t) . . . θK^(t), π^(t))

for all i ∈ 1 . . . N and k ∈ 1 . . . K
2. M-step. Re-estimate the model parameters

πk = (∑_{i=1}^{N} rik) / N

µk = (1 / ∑_{i=1}^{N} rik) ∑_{i=1}^{N} rik xi

σk² = (1 / ∑_{i=1}^{N} rik) ∑_{i=1}^{N} rik (xi − µk)²

3. t ← t + 1
I until converged, i.e., d(θ^(t), θ^(t−1)) < ε
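Putting the two steps together, here is a minimal EM loop for a 1D Gaussian mixture, assuming NumPy and SciPy; the initialisation and stopping rule are deliberately crude, and it is a sketch rather than a robust implementation.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Crude initialisation: uniform weights, means picked from the data, shared spread.
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std() + 1e-3)
    for _ in range(n_iter):
        # E-step: responsibilities r_ik = p(z_i = k | x_i, theta^(t)).
        joint = pi * norm.pdf(x[:, None], mu, sigma)        # shape (N, K)
        r = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted-average re-estimates.
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
    return pi, mu, sigma

# Toy data drawn from two well-separated Gaussians.
x = np.concatenate([np.random.default_rng(1).normal(-2, 0.5, 200),
                    np.random.default_rng(2).normal(3, 1.0, 300)])
print(em_gmm_1d(x, K=2))
```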
30 / 36
-
[Figure: 2D scatter plot, both axes ranging from −2 to 2]
31 / 36
-
Why does EM work?
I The M-step maximizes a function Q(θ, θ^(t)) = ∑_z q(z) log p(x, z|θ)
I Compare this to the true log likelihood

L(θ) = log [ ∑_z p(x, z|θ) ]

I Key point 1: Q(θ, θ^(t)) + const ≤ L(θ) for all θ
I Key point 2: Because q arises from the E-step,

Q(θ^(t), θ^(t)) + const = log p(x|θ^(t))
32 / 36
-
Why does EM work? (cont)
[Figure: the log likelihood ℓ(θ) with successive lower bounds Q(θ, θ^(t)) and Q(θ, θ^(t+1)); maximizing each bound gives θ^(t+1) and then θ^(t+2)]
I E-step: Create a lower bound L(θ, q) = Q(θ, θ^(t)) + const
I M-step: Maximize the lower bound wrt θ

Means that: EM monotonically increases the likelihood at every iteration

L(θ^(t+1)) ≥ L(θ^(t+1), q) ≥ L(θ^(t), q) = L(θ^(t))
33 / 36
-
Why does Key Point 1 hold? log is concave
α log x0 + (1 − α) log x1 ≤ log(α x0 + (1 − α) x1)
34 / 36
-
Lower bound on L
L(θ) = log [ ∑_z p(x, z|θ) ]

     = log [ ∑_z q(z) p(x, z|θ) / q(z) ]

     ≥ ∑_z q(z) log [ p(x, z|θ) / q(z) ]          (Jensen’s inequality)

     = H(q) + Eq[log p(x, z|θ)] := L(θ, q)

This is the lower bound. So the const on the previous slides is H(q), which is constant with respect to θ.
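A small numerical check of the bound, assuming NumPy (my own toy example with a two-component Bernoulli mixture, not from the slides): L(θ, q) ≤ log p(x|θ) for an arbitrary q, with equality when q(z) = p(z|x, θ), the E-step choice.

```python
import numpy as np

# Tiny model: K = 2, one observed binary x with p(x = 1 | z = k) = theta_k.
pi = np.array([0.6, 0.4])
theta = np.array([0.9, 0.2])
x = 1

# Joint p(x, z = k) and marginal log-likelihood log p(x).
joint = pi * np.where(x == 1, theta, 1 - theta)
log_px = np.log(joint.sum())

def lower_bound(q):
    """L(theta, q) = H(q) + E_q[log p(x, z | theta)]."""
    return -(q * np.log(q)).sum() + (q * np.log(joint)).sum()

q_arbitrary = np.array([0.5, 0.5])
q_posterior = joint / joint.sum()          # the E-step choice q(z) = p(z | x, theta)

print(log_px, lower_bound(q_arbitrary))    # bound holds: second value is smaller
print(log_px, lower_bound(q_posterior))    # bound is tight at the posterior
```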
35 / 36
-
Summary
I Mixture models provide a method for making a family of distributions richer
I Also widely used for performing clustering
I Maximum likelihood in mixture models: use the EM algorithm
36 / 36