Clustering and the EM algorithm

Rich Turner and José Miguel Hernández-Lobato

learning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdf


What is clustering?

- Roughly speaking, two points belonging to the same cluster are generally more similar to each other, or closer to each other, than two points belonging to different clusters.
- Clustering maps the data D = {x_1, ..., x_N} to assignments s = {s_1, ..., s_N}, where s_n ∈ {1, ..., K}.
- It is an unsupervised learning problem (no labels or rewards).

Example applications: network community detection (Campbell et al., Social Network Analysis), image segmentation, vector quantisation, genetic clustering, anomaly detection, crime analysis.

A first clustering algorithm: k-means

input: D = {x_1, ..., x_N}, x_n ∈ R^D
initialise: m_k ∈ R^D for k = 1 ... K
repeat
    for n = 1 ... N
        s_n = argmin_k ||x_n − m_k||
    endfor
    for k = 1 ... K
        m_k = mean(x_n : s_n = k)
    endfor
until convergence (s_n fixed)
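The pseudocode above translates almost line for line into NumPy; a minimal sketch (initialising the centres at K randomly chosen data points is one common choice, not prescribed by the slide):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal k-means for X of shape (N, D): returns assignments s and centres m."""
    rng = np.random.default_rng(seed)
    # initialise: use K distinct data points as the initial centres
    m = X[rng.choice(len(X), size=K, replace=False)].copy()
    s = None
    for _ in range(n_iters):
        # assignment step: s_n = argmin_k ||x_n - m_k||
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        s_new = dists.argmin(axis=1)
        if s is not None and np.array_equal(s_new, s):
            break                      # convergence: assignments fixed
        s = s_new
        # update step: m_k = mean(x_n : s_n = k)
        for k in range(K):
            if np.any(s == k):
                m[k] = X[s == k].mean(axis=0)
    return s, m

# two well-separated blobs as a sanity check
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.3, size=(50, 2)),
               rng.normal(2.0, 0.3, size=(50, 2))])
s, m = kmeans(X, K=2)
```

On data like this, each blob ends up in its own cluster and the centres land near the blob means.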

Question: is k-means guaranteed to converge for any dataset D? Could one or more of the cluster centres diverge or oscillate?

K-means as optimisation

Let s_{n,k} = 1 if data point n is assigned to cluster k, and zero otherwise. Note that Σ_{k=1}^K s_{n,k} = 1.

Cost:

C({s_{n,k}}, {m_k}) = Σ_{n=1}^N Σ_{k=1}^K s_{n,k} ||x_n − m_k||²

K-means tries to minimise the cost function C with respect to {s_{n,k}} and {m_k}, subject to Σ_k s_{n,k} = 1 and s_{n,k} ∈ {0, 1}.

K-means sequentially:
- minimises C with respect to {s_{n,k}}, holding {m_k} fixed;
- minimises C with respect to {m_k}, holding {s_{n,k}} fixed.

C is a Lyapunov function ⟹ k-means converges, but finding the global optimum of C is a hard problem.
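The two alternating minimisation steps can be checked numerically; a small sketch on synthetic data, recording C after each half-step:

```python
import numpy as np

def cost(X, s, m):
    """C({s_nk}, {m_k}) = sum_n sum_k s_nk ||x_n - m_k||^2, with hard assignments s_n."""
    return float(sum(np.sum((X[s == k] - m[k]) ** 2) for k in range(len(m))))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
K = 3
m = X[rng.choice(len(X), K, replace=False)].copy()

costs = []
for _ in range(20):
    # step 1: minimise C w.r.t. the assignments {s_nk}, holding the means fixed
    s = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2).argmin(axis=1)
    costs.append(cost(X, s, m))
    # step 2: minimise C w.r.t. the means {m_k}, holding the assignments fixed
    for k in range(K):
        if np.any(s == k):
            m[k] = X[s == k].mean(axis=0)
    costs.append(cost(X, s, m))
```

Because each half-step minimises C over one block of variables, the recorded costs can only decrease — the Lyapunov argument in miniature.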

Where will k-means converge to when run on these data?


Pathologies of K-means

- Strange results when clusters have very different shapes.
- Returns hard assignments, where soft assignments would be more sensible.
- Initialisation is delicate; the k-means++ algorithm spreads out the initial centres.
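The k-means++ seeding rule mentioned above draws each new centre with probability proportional to the squared distance from the nearest centre already chosen; a minimal sketch (function name and test data are illustrative):

```python
import numpy as np

def kmeanspp_init(X, K, rng):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to the squared distance to the nearest centre so far."""
    centres = [X[rng.integers(len(X))]]          # first centre: uniform at random
    for _ in range(K - 1):
        # squared distance from each point to its nearest chosen centre
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in (-3.0, 0.0, 3.0)])
m0 = kmeanspp_init(X, K=3, rng=rng)
```

On three tight, well-separated blobs, the seeded centres are almost surely spread across different blobs, which is exactly the failure mode of uniform initialisation this addresses.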


Mixture of Gaussians: Generative Model

For each data point n = 1 ... N:
- sample cluster membership: p(s_n = k|θ) = π_k, where Σ_{k=1}^K π_k = 1
- sample the data value given cluster membership: p(x_n|s_n = k, θ) = N(x_n; μ_k, Σ_k)

Each cluster k is described by parameters {π_k, μ_k, Σ_k}.
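The generative process can be simulated directly; a sketch with made-up parameters (the π, μ, Σ values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 1000
pi = np.array([0.5, 0.3, 0.2])                         # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])   # cluster means
Sigma = np.stack([0.3 * np.eye(2)] * K)                # cluster covariances

# for each data point: sample s_n with p(s_n = k) = pi_k,
# then x_n ~ N(mu[s_n], Sigma[s_n])
s = rng.choice(K, size=N, p=pi)
x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in s])
```

Scattering x coloured by s reproduces the cluster picture on the slide: each component's points concentrate around its mean, with cluster sizes proportional to π.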


How can we learn the parameters of this model from data using maximum likelihood?

θ_ML = argmax_θ log p({x_n}_{n=1}^N | θ)

Which plot shows the density p(x|θ) of this 1D Mixture of Gaussians?

mixing proportions: π_1 = 1/4, π_2 = 1/2 and π_3 = 1/4
means: μ_1 = 0, μ_2 = 3 and μ_3 = −3
standard deviations: σ_1 = 1/2, σ_2 = 1/2 and σ_3 = 1

[figure: three candidate density plots A, B and C over x ∈ [−5, 5]]

Mixture of Gaussians: p(s = k|θ) = π_k, p(x|s = k, θ) = N(x; μ_k, σ_k²)

Answer:

p(x|θ) = (1/4) N(x; 0, (1/2)²) + (1/2) N(x; 3, (1/2)²) + (1/4) N(x; −3, 1²)
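The quiz density can be evaluated numerically to see which plot it matches; a sketch using the parameters as stated in the question:

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    """Gaussian density N(x; mu, sigma^2), written out explicitly."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# parameters from the question
pis = [0.25, 0.5, 0.25]
mus = [0.0, 3.0, -3.0]
sds = [0.5, 0.5, 1.0]

def density(x):
    # p(x|theta) = sum_k pi_k N(x; mu_k, sigma_k^2)
    return sum(p * norm_pdf(x, m, s) for p, m, s in zip(pis, mus, sds))

xs = np.linspace(-5.0, 5.0, 2001)
px = density(xs)
```

The curve integrates to (essentially) 1 over [−5, 5], and its tallest peak sits at the component with the largest weight and a small standard deviation.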

Different ways of maximising the likelihood

log p({x_n}_{n=1}^N | θ) = log Π_{n=1}^N p(x_n|θ)
  = Σ_{n=1}^N log p(x_n|θ)
  = Σ_{n=1}^N log Σ_{k=1}^K p(x_n, s_n = k|θ)
  = Σ_{n=1}^N log Σ_{k=1}^K p(s_n = k|θ) p(x_n|s_n = k, θ)
  = Σ_{n=1}^N log Σ_{k=1}^K π_k N(x_n; μ_k, σ_k²)

- Method 1: direct gradient-based optimisation of the log-likelihood
- Method 2: the expectation-maximisation (EM) algorithm

Direct optimisation is generally faster, but the EM algorithm (i) is sometimes simpler to implement, (ii) is more widely used, and (iii) leads to important generalisations.

For generality, we simplify notation: let x = {x_n}_{n=1}^N and s = {s_n}_{n=1}^N, so

log p({x_n}_{n=1}^N | θ) = log p(x|θ) = log Σ_s p(x, s|θ)
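The final expression, a sum of log-sum-exp terms, is the quantity both methods optimise. A numerically stable way to evaluate it in the 1D case (hand-rolled log-sum-exp; parameter values illustrative):

```python
import numpy as np

def log_gmm_likelihood(x, pi, mu, sigma):
    """log p({x_n}|theta) = sum_n log sum_k pi_k N(x_n; mu_k, sigma_k^2),
    computed with the log-sum-exp trick for numerical stability."""
    x = np.asarray(x, dtype=float)[:, None]              # shape (N, 1)
    log_terms = (np.log(pi)
                 - 0.5 * np.log(2.0 * np.pi * sigma ** 2)
                 - 0.5 * ((x - mu) / sigma) ** 2)        # (N, K): log pi_k + log N(...)
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))))

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
ll = log_gmm_likelihood([0.0], pi, mu, sigma)
```

At x = 0 both components contribute equally, so the result reduces to log N(0; 1, 1), a handy check.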


A lower bound on the log-likelihood log p(x|θ)

The log-likelihood splits into a free-energy plus a KL-divergence between an arbitrary distribution q(s) over class memberships and the posterior distribution p(s|x, θ) over class memberships:

log p(x|θ) = F(q(s), θ) + KL(q(s) || p(s|x, θ))

A brief introduction to the Kullback-Leibler divergence

KL(p_1(z) || p_2(z)) = Σ_z p_1(z) log [ p_1(z) / p_2(z) ]

Important properties:
- Gibbs' inequality: KL(p_1(z) || p_2(z)) ≥ 0, with equality at p_1(z) = p_2(z)
  (proof via Jensen's inequality or differentiation; see MacKay, p. 35)
- Non-symmetric: KL(p_1(z) || p_2(z)) ≠ KL(p_2(z) || p_1(z))
  (hence named a divergence and not a distance)

Example: a binary variable z ∈ {0, 1}, with p(z = 1) = 0.8 and q(z = 1) = ρ.

[figure: KL(q || p) and KL(p || q) plotted against ρ ∈ [0, 1]; each is zero only at ρ = 0.8]
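The binary example can be computed directly; a small sketch (function name illustrative):

```python
import numpy as np

def kl_bernoulli(q1, p1):
    """KL(q || p) = sum_z q(z) log[q(z)/p(z)] for binary z in {0, 1}."""
    q = np.array([1.0 - q1, q1])
    p = np.array([1.0 - p1, p1])
    mask = q > 0                      # 0 log 0 = 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# p(z = 1) = 0.8 as in the example; vary q(z = 1) = rho
kl_qp = kl_bernoulli(0.5, 0.8)   # KL(q || p) at rho = 0.5
kl_pq = kl_bernoulli(0.8, 0.5)   # KL(p || q) at rho = 0.5
kl_eq = kl_bernoulli(0.8, 0.8)   # Gibbs' inequality equality case
```

The equality case gives exactly zero, both directions are positive away from ρ = 0.8, and the two directions differ, illustrating the non-symmetry.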

Page 55: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

A lower bound on the log-likelihood log p(x|θ)

KL-Divergencelog-likelihood

posterior distributionover class memberships

arbitrary distributionover class memberships

free-energy

Page 56: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

A lower bound on the log-likelihood log p(x|θ)

Non-negativity of KL-Divergence

Free-energy is lower bound onlog-likelihood

KL-Divergencelog-likelihood

posterior distributionover class memberships

arbitrary distributionover class memberships

free-energy

Page 57: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

A lower bound on the log-likelihood log p(x|θ)

Non-negativity of KL-Divergence

KL-Divergence equal to 0when

Free-energy is lower bound onlog-likelihood

Free-energy equal to log-likelihood when

KL-Divergencelog-likelihood

posterior distributionover class memberships

arbitrary distributionover class memberships

free-energy

Page 58: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

A lower bound on the log-likelihood log p(x|θ)

log p(x|θ) = F(q(s), θ) + KL(q(s) || p(s|x, θ))

- q(s) is an arbitrary distribution over class memberships; p(s|x, θ) is the posterior distribution over class memberships.
- By non-negativity of the KL divergence, the free-energy F is a lower bound on the log-likelihood.
- The KL divergence equals 0 (and the free-energy equals the log-likelihood) when q(s) = p(s|x, θ).
- The free-energy is simple to compute.
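To make the decomposition concrete, here is a small numerical check (toy parameters of our own) that F(q(s), θ) + KL(q(s) || p(s|x, θ)) = log p(x|θ) for a single observation under a two-component 1-D mixture, with equality F = log p exactly at the posterior:

```python
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# toy 2-component 1-D mixture and a single observation x
pis, mus, vars_ = [0.3, 0.7], [-1.0, 2.0], [1.0, 1.0]
x = 0.5
joint = [pi * gauss(x, mu, v) for pi, mu, v in zip(pis, mus, vars_)]  # p(x, s=k)
log_px = math.log(sum(joint))                                        # log p(x)
post = [j / sum(joint) for j in joint]                               # p(s=k | x)

def free_energy(q):
    # F(q) = sum_k q_k log( p(x, s=k) / q_k )
    return sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint))

def kl(q, p):
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p))

for q in ([0.5, 0.5], [0.9, 0.1], post):
    # F(q) + KL(q || posterior) recovers log p(x), and F(q) <= log p(x)
    assert abs(free_energy(q) + kl(q, post) - log_px) < 1e-12
    assert free_energy(q) <= log_px + 1e-12
```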

Visualising the free-energy lower bound

F(q(s), θ) = log p(x|θ) − KL(q(s) || p(s|x, θ))

[Figure: visualisation of the free-energy lower bound.]


What is the maximal value of the free-energy along this vertical slice?

F(q(s), θ) = log p(x|θ) − KL(q(s) || p(s|x, θ))

Along a slice at fixed θ, the first term is constant, so the maximum over q(s) is log p(x|θ), attained when KL(q(s) || p(s|x, θ)) = 0, i.e. when q(s) = p(s|x, θ).


The Expectation Maximisation (EM) algorithm

F(q(s), θ) = log p(x|θ)−KL(q(s)||p(s|x, θ))


The Expectation Maximisation (EM) algorithm

From initial (random) parameters θ_0, iterate the two steps for t = 1, . . . , T:

E step: for fixed θ_{t−1}, maximise the lower bound F(q(s), θ_{t−1}) wrt q(s). As the log-likelihood log p(x|θ) is independent of q(s), this is equivalent to minimising KL(q(s) || p(s|x, θ_{t−1})), so q_t(s) = p(s|x, θ_{t−1}).

M step: for fixed q_t(s), maximise the lower bound F(q_t(s), θ) wrt θ. Writing

F(q(s), θ) = ∑_s q(s) log( p(x|s, θ) p(s|θ) ) − ∑_s q(s) log q(s),

the second term is the entropy of q(s), independent of θ, so the M step is

θ_t = argmax_θ ∑_s q_t(s) log( p(x|s, θ) p(s|θ) ).

Although the steps work with the lower bound (a Lyapunov function), each iteration cannot decrease the log-likelihood, since

log p(x|θ_{t−1})  =  F(q_t(s), θ_{t−1})  ≤  F(q_t(s), θ_t)  ≤  log p(x|θ_t)
                (E step)              (M step)         (lower bound)
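The guarantee can be checked numerically. The sketch below (toy data and helper-function names are our own) runs one E step and one M step of EM for a 1-D mixture of Gaussians and verifies each link of the inequality chain:

```python
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def loglik(data, pis, mus, vars_):
    # log p(x | theta) = sum_n log sum_k pi_k N(x_n | mu_k, var_k)
    return sum(math.log(sum(p * gauss(x, m, v) for p, m, v in zip(pis, mus, vars_)))
               for x in data)

def e_step(data, pis, mus, vars_):
    """q_t(s_n) = p(s_n | x_n, theta_{t-1}): the responsibilities."""
    R = []
    for x in data:
        u = [p * gauss(x, m, v) for p, m, v in zip(pis, mus, vars_)]
        tot = sum(u)
        R.append([unk / tot for unk in u])
    return R

def m_step(data, R):
    """Closed-form maximisation of F wrt theta for a 1-D mixture of Gaussians."""
    K, N = len(R[0]), len(data)
    Nk = [sum(r[k] for r in R) for k in range(K)]
    mus = [sum(r[k] * x for r, x in zip(R, data)) / Nk[k] for k in range(K)]
    vars_ = [sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(R, data)) / Nk[k]
             for k in range(K)]
    pis = [nk / N for nk in Nk]
    return pis, mus, vars_

def free_energy(data, R, pis, mus, vars_):
    # F = sum_n sum_k r_nk ( log p(x_n, s_n=k | theta) - log r_nk )
    F = 0.0
    for x, r in zip(data, R):
        for k in range(len(r)):
            joint = pis[k] * gauss(x, mus[k], vars_[k])
            F += r[k] * (math.log(joint) - math.log(r[k]))
    return F

data = [-2.1, -1.8, -1.9, 1.7, 2.2, 2.0]
theta0 = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])

R = e_step(data, *theta0)      # E step: F(q_1, theta_0) = log p(x | theta_0)
theta1 = m_step(data, R)       # M step: F(q_1, theta_1) >= F(q_1, theta_0)
assert abs(free_energy(data, R, *theta0) - loglik(data, *theta0)) < 1e-10
assert free_energy(data, R, *theta1) >= free_energy(data, R, *theta0) - 1e-10
assert loglik(data, *theta1) >= free_energy(data, R, *theta1) - 1e-10
```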


Application of EM to Mixture of Gaussians (E Step)

- Assume D = 1 dimensional data x for simplicity.
- Gaussian mixture model parameters: θ = {μ_k, σ²_k, π_k}, k = 1 . . . K.
- One latent variable per datapoint, s_n, n = 1 . . . N, taking values 1 . . . K.

The probability of the observations given the latent variables and the parameters, and the prior on the latent variables, are:

p(x_n | s_n = k, θ) = (1 / √(2πσ²_k)) exp( −(x_n − μ_k)² / (2σ²_k) ),     p(s_n = k | θ) = π_k

so the E step becomes:

q(s_n = k) = p(s_n = k | x_n, θ) ∝ p(x_n, s_n = k | θ) = (π_k / √(2πσ²_k)) exp( −(x_n − μ_k)² / (2σ²_k) ) = u_{nk}

That is: q(s_n = k) = r_{nk} = u_{nk} / u_n, where u_n = ∑_{k=1}^{K} u_{nk}.

The posterior for each latent variable s_n follows a categorical distribution with probabilities given by the product of the prior and the likelihood, renormalised. r_{nk} is called the responsibility that component k takes for data point n.
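A minimal sketch of the E step (function name and toy parameters are our own):

```python
import math

def responsibilities(x, pis, mus, vars_):
    """r_nk = u_nk / u_n with u_nk = pi_k * N(x_n | mu_k, sigma^2_k)."""
    u = [pi / math.sqrt(2 * math.pi * v) * math.exp(-(x - mu) ** 2 / (2 * v))
         for pi, mu, v in zip(pis, mus, vars_)]
    un = sum(u)
    return [unk / un for unk in u]

# two equally weighted unit-variance components at -1 and +1 (toy parameters)
r = responsibilities(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(r)  # x = 0 is equidistant from both means, so r = [0.5, 0.5]
```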


Application of EM to Mixture of Gaussians (M Step)

The lower bound is

F(q(s), θ) = ∑_{n=1}^{N} ∑_{k=1}^{K} q(s_n = k) [ log(π_k) − (x_n − μ_k)² / (2σ²_k) − ½ log(σ²_k) ] + const.

The M step optimises F(q(s), θ) wrt the parameters θ:

∂F/∂μ_j = ∑_{n=1}^{N} q(s_n = j) (x_n − μ_j) / σ²_j = 0
    ⇒ μ_j = ∑_{n=1}^{N} q(s_n = j) x_n / ∑_{n=1}^{N} q(s_n = j),

∂F/∂σ²_j = ∑_{n=1}^{N} q(s_n = j) [ (x_n − μ_j)² / (2σ⁴_j) − 1 / (2σ²_j) ] = 0
    ⇒ σ²_j = ∑_{n=1}^{N} q(s_n = j) (x_n − μ_j)² / ∑_{n=1}^{N} q(s_n = j),

∂[F + λ(1 − ∑_k π_k)]/∂π_j = ∑_{n=1}^{N} q(s_n = j) / π_j − λ = 0
    ⇒ π_j = (1/N) ∑_{n=1}^{N} q(s_n = j).

The E step fills in the values of the hidden variables; the M step is just like performing supervised learning with known (soft) cluster assignments.
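The three closed-form updates can be sketched as follows (function name and the toy soft responsibilities are our own):

```python
def m_step(data, R):
    """Closed-form M-step updates for a 1-D mixture of Gaussians.

    R[n][k] holds the responsibility q(s_n = k) from the E step.
    """
    K, N = len(R[0]), len(data)
    Nk = [sum(r[k] for r in R) for k in range(K)]                 # effective counts
    mus = [sum(r[k] * x for r, x in zip(R, data)) / Nk[k]         # weighted means
           for k in range(K)]
    vars_ = [sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(R, data)) / Nk[k]
             for k in range(K)]                                   # weighted variances
    pis = [nk / N for nk in Nk]                                   # mixing proportions
    return pis, mus, vars_

# hand-picked soft responsibilities for four datapoints and K = 2
data = [-2.0, -1.0, 1.0, 2.0]
R = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
pis, mus, vars_ = m_step(data, R)
print(pis, mus)  # mixing proportions sum to 1; mus ~ [-1.1, 1.1]
```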


EM for MoGs: soft, non-axis aligned K-means

initialise: θ = {π_k, m_k, Σ_k}, k = 1 . . . K

repeat
    E-Step: r_nk = p(s_n = k | x_n, θ) for n = 1 . . . N
    M-Step: θ = argmax_θ ∑_{n,k} r_nk log p(s_n = k, x_n | θ)
until convergence

[Figure: the data in (x₁, x₂) with the fitted mixture components, and the free-energy trace over the fit (axis range −1500 to −500).]


What will happen with this initialisation?

initialise: θ = {π_k, m_k, Σ_k}, k = 1 . . . K

repeat
    E-Step: r_nk = p(s_n = k | x_n, θ) for n = 1 . . . N
    M-Step: θ = argmax_θ ∑_{n,k} r_nk log p(s_n = k, x_n | θ)
until convergence

[Figure: the data in (x₁, x₂) with the initial components, and the free-energy plotted against Σ₁₁ over the range 0 to 0.2.]
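One failure mode such an initialisation can expose is the well-known singularity of the mixture-of-Gaussians likelihood: when a component's mean sits exactly on a datapoint, the likelihood grows without bound as that component's variance shrinks towards zero. A quick numerical check (toy data of our own):

```python
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def loglik(data, pis, mus, vars_):
    return sum(math.log(sum(p * gauss(x, m, v) for p, m, v in zip(pis, mus, vars_)))
               for x in data)

data = [-1.5, -0.5, 0.5, 1.5]
# component 1 is centred exactly on the datapoint at 0.5; shrink its variance
lls = [loglik(data, [0.5, 0.5], [0.5, 0.0], [v, 4.0]) for v in [1e-2, 1e-4, 1e-6, 1e-8]]
assert lls[0] < lls[1] < lls[2] < lls[3]  # log-likelihood diverges as the variance collapses
```

The broad second component keeps the density positive at every datapoint, so the collapsing component is free to spike on its own point and drive the total log-likelihood arbitrarily high.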

Page 111: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 112: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 113: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 114: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 115: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 116: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 117: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 118: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150

Page 119: Clustering and the EM algorithmlearning.eng.cam.ac.uk/.../Turner/Teaching/clustering.pdfWhat is clustering? x 1 x 2 network community detection Campbell et al Social Network Analysis

What will happen with this initialisation?

x1

-2 -1 0 1 2

x2

-2

-1

0

1

2

initialise: θ = {πk ,mk ,Σk}Kk=1

repeatE-Step:rnk = p(sn = k|xn, θ) for n = 1 . . .N

M-Step:arg max

θ

∑n,k rnk log p(sn = k,x|θ)

until convergence

θ (Σ11

)0 0.05 0.1 0.15 0.2

free-e

nerg

y

50

100

150


KABOOM! Maximum likelihood can overfit: Σk must be restricted for the free-energy F(q(s), θ) to remain bounded.
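One common way to restrict Σk is to put a floor on its eigenvalues in the M-step, so no component can collapse onto a single data point and drive the likelihood (and hence F) to infinity. A minimal sketch; the helper name `restricted_cov` and the floor value are illustrative, not from the slides:

```python
import numpy as np

def restricted_cov(diff, r_k, Nk, min_var=1e-3):
    """M-step covariance update with a variance floor.

    diff: (N, D) array of x_n - m_k; r_k: (N,) responsibilities for
    component k; Nk: effective count sum_n r_nk. Eigenvalues of the
    ML covariance are clipped from below so Sigma_k stays well away
    from singular."""
    Sigma = (r_k[:, None] * diff).T @ diff / Nk  # unrestricted ML estimate
    w, V = np.linalg.eigh(Sigma)                  # eigendecomposition
    w = np.maximum(w, min_var)                    # clip eigenvalues from below
    return V @ np.diag(w) @ V.T
```

In the collapsed case (all responsibility on one point, so `diff` is all zeros) the unrestricted estimate would be the zero matrix; the floor returns `min_var * I` instead, keeping the free-energy bounded.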

Page 129: Clustering and the EM algorithm

Summary

I MoG + EM algorithm = soft k-means clustering with non-axis-aligned, non-equally weighted clusters

I EM can be used to fit latent variable models e.g. PCA, Factor Analysis, MoGs, HMMs, ...

I requires tractable posterior p(s|x, θ) entropy and average log-joint Eq(s|θ)[log p(s, x|θ)]

Limitations

I MoG clusters still have simple shapes (ellipses)
I a single real cluster might be described by many components
I more complex cluster models have been developed

I maximum-likelihood can overfit
I Bayesian approaches avoid overfitting p(θ, s|x)

I co-ordinate ascent is often slow to converge (lots of iterations required)
I joint optimisation of F(q(s), θ) faster
I direct optimisation of log-likelihood log p(x|θ)

Page 130: Clustering and the EM algorithm

Appendix: proof of KL divergence properties

Minimise the Kullback-Leibler divergence (relative entropy) KL(q(x)||p(x)): add a Lagrange multiplier (to enforce that q(x) normalises) and take variational derivatives:

δ/δq(x) [ ∫ q(x) log (q(x)/p(x)) dx + λ (1 − ∫ q(x) dx) ] = log (q(x)/p(x)) + 1 − λ.

Find the stationary point by setting the derivative to zero:

q(x) = exp(λ − 1) p(x); the normalisation condition gives λ = 1, so q(x) = p(x),

which corresponds to a minimum, since the second derivative is positive:

δ²/(δq(x) δq(x)) KL(q(x)||p(x)) = 1/q(x) > 0.

The minimum value attained at q(x) = p(x) is KL(p(x)||p(x)) = 0, showing that KL(q(x)||p(x))

I is non-negative
I attains its minimum 0 when p(x) and q(x) are equal
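The two properties can be checked numerically for discrete distributions; the helper `kl` below is a straightforward sketch, not from the slides:

```python
import numpy as np

def kl(q, p):
    """KL(q||p) = sum_x q(x) log(q(x)/p(x)) for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0  # 0 log 0 = 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
# kl(q, p) is strictly positive; kl(p, p) is exactly 0
```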