
Page 1

Finite Mixture Models and Expectation Maximization

Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Page 2

Recall: The Supervised Learning Problem

• Given a set of n samples X = {(x_i, y_i)}, i = 1,…,n
• Chapter 3 of DHS
• Assume examples in each class come from a parameterized Gaussian density.
• Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification.
• Estimation uses the Maximum Likelihood approach.

Page 3

Review of Maximum Likelihood

• Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ.
• Goal: estimate θ, denoted by θ̂, such that the observed data is most likely to be from the distribution with that θ.
• Steps involved:
  - Write the likelihood of the observed data.
  - Maximize the likelihood with respect to the parameter.
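For a single Gaussian these two steps have a closed-form solution. Below is a minimal Python sketch (illustrative, not from the slides; the data and names are made up) of ML estimation of the mean and variance of a 1D Gaussian.

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates (mu_hat, var_hat) for 1D Gaussian data x.

    Maximizing sum_j log N(x_j | mu, sigma^2) gives the sample mean and
    the (biased, 1/n) sample variance in closed form.
    """
    mu_hat = x.mean()
    var_hat = ((x - mu_hat) ** 2).mean()   # note: divides by n, not n - 1
    return mu_hat, var_hat

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=100)
print(gaussian_mle(samples))   # estimates approach (1.0, 0.25) as n grows
```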

Page 4

Example: 1D Gaussian Distribution

[Figure: maximum likelihood estimation of the mean of a Gaussian distribution, showing the true density, the MLE from 10 samples, and the MLE from 100 samples.]

Page 5

Example: 2D Gaussian Distribution

[Figure: 2D parameter estimation for a Gaussian distribution. Blue: true density; red: density estimated from 50 examples.]

Page 6

Multimodal Class Distributions

• A single Gaussian may not accurately model the classes.
• Find subclasses in handwritten "online" characters (122,000 characters written by 100 writers).
• Performance improves by modeling subclasses.

Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, Mar 2002

Page 7

Multimodal Classes

A single Gaussian distribution may not model the classes accurately.

[Figure: handwritten 'f' vs. 'y' classification task.]

Page 8

An extreme example of multimodal classes

Red vs. blue classification. The classes are well separated; however, incorrect model assumptions result in high classification error.

The 'red' class is a mixture of two Gaussian distributions. There is no class label information when modeling the density of just the red class.

[Figure: limitations of unimodal class modelling.]

Page 9

Finite mixtures

k random sources, with probability density functions f_i(x), i = 1,…,k.

[Diagram: one of the sources f_1(x), f_2(x), …, f_k(x) is chosen at random and generates the random variable X.]

Page 10

Finite mixtures

Example: 3 species (Iris).

Page 11

Finite mixtures

Choose one of the sources $f_1(x), f_2(x), \ldots, f_k(x)$ at random, with Prob(source i) = $\alpha_i$; the chosen source generates the random variable X.

Conditional: $f(x \mid \text{source } i) = f_i(x)$

Joint: $f(x \text{ and source } i) = f_i(x)\,\alpha_i$

Unconditional: $f(x) = \sum_{\text{all sources}} f(x \text{ and source } i) = \sum_{i=1}^{k} \alpha_i f_i(x)$

Page 12

Finite mixtures

$f(x) = \sum_{i=1}^{k} \alpha_i f_i(x)$

- $f_i(x)$: component densities
- $\alpha_i$: mixing probabilities, with $\sum_{i=1}^{k} \alpha_i = 1$ and $\alpha_i \ge 0$

Parameterized components (e.g., Gaussian): $f_i(x) = f(x \mid \theta_i)$, giving

$f(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i f(x \mid \theta_i)$, with $\Theta = \{\theta_1, \theta_2, \ldots, \theta_k, \alpha_1, \alpha_2, \ldots, \alpha_k\}$
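As a concrete illustration (not from the slides), here is a small Python sketch that evaluates this mixture density for 1D Gaussian components; the parameter values are made up for the example.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, alphas, mus, sigmas):
    """f(x | Theta) = sum_i alpha_i N(x | mu_i, sigma_i^2), for 1D Gaussian components."""
    x = np.asarray(x)
    return sum(a * norm.pdf(x, loc=m, scale=s)
               for a, m, s in zip(alphas, mus, sigmas))

# illustrative parameters; the mixing probabilities sum to 1
print(mixture_pdf([0.0, 1.0, 2.0],
                  alphas=[0.6, 0.4], mus=[-1.0, 2.0], sigmas=[1.0, 0.5]))
```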

Page 13

Gaussian mixtures

Each component density $f(x \mid \theta_i)$ is Gaussian.

Arbitrary covariances: $f(x \mid \theta_i) = N(x \mid \mu_i, C_i)$, with
$\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C_1, C_2, \ldots, C_k, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}$

Common covariance: $f(x \mid \theta_i) = N(x \mid \mu_i, C)$, with
$\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}$

Page 14

Mixture fitting / estimation

Data: n independent observations, $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Goals: estimate the parameter set $\Theta$, and maybe "classify the observations".

Example:
- How many species? Mean of each species?
- Which points belong to each species?

[Figure: observed data vs. classified data (classes unknown).]

Page 15

Gaussian mixtures (d = 1), an example

$\mu_1 = -2,\; \sigma_1 = 3,\; \alpha_1 = 0.6$
$\mu_2 = 4,\; \sigma_2 = 2,\; \alpha_2 = 0.3$
$\mu_3 = 7,\; \sigma_3 = 0.1,\; \alpha_3 = 0.1$
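A minimal sketch (using the parameters above; the code itself is illustrative, not from the slides) of how such data is generated: draw a source index with probabilities α_i, then sample from the chosen Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.6, 0.3, 0.1])
mus    = np.array([-2.0, 4.0, 7.0])
sigmas = np.array([3.0, 2.0, 0.1])

def sample_mixture(n):
    """Ancestral sampling: choose source i with prob alpha_i, then x ~ N(mu_i, sigma_i^2)."""
    z = rng.choice(len(alphas), size=n, p=alphas)    # hidden component labels
    x = rng.normal(loc=mus[z], scale=sigmas[z])
    return x, z

x, z = sample_mixture(1000)
print(x[:5], z[:5])
```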

Page 16

Gaussian mixtures, an R² example

k = 3 Gaussian components in R², 1500 points.

[Figure: 1500 points drawn from the 3-component mixture, with the component means and covariance matrices shown.]

Page 17

Uses of mixtures in pattern recognition

Unsupervised learning (model-based clustering):
- each component models one cluster
- clustering = mixture fitting

Observations: unclassified points.

Goals: find the classes, classify the points.

Page 18

Uses of mixtures in pattern recognition

Mixtures can approximate arbitrary densities, so they are good for representing class-conditional densities in supervised learning.

Example:
- two strongly non-Gaussian classes
- use mixtures to model each class-conditional density

Page 19

Uses of mixtures in pattern recognition

• Find subclasses (lexemes), e.g., in "online" characters (122,000 characters written by 100 writers).
• Performance improves by modeling subclasses.

Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, 2002

Page 20

Fitting mixtures

n independent observations: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Maximum (log-)likelihood (ML) estimate of Θ:

$\hat{\Theta} = \arg\max_{\Theta} L(\mathbf{x}, \Theta)$

$L(\mathbf{x}, \Theta) = \log \prod_{j=1}^{n} f(x^{(j)} \mid \Theta) = \sum_{j=1}^{n} \log \underbrace{\sum_{i=1}^{k} \alpha_i f(x^{(j)} \mid \theta_i)}_{\text{mixture}}$

The ML estimate has no closed-form solution.
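A short sketch (not from the slides) of evaluating this log-likelihood for a 1D Gaussian mixture; this is the objective that EM climbs in the following slides.

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, alphas, mus, sigmas):
    """L(x, Theta) = sum_j log sum_i alpha_i N(x_j | mu_i, sigma_i^2)."""
    x = np.asarray(x)[:, None]                            # shape (n, 1)
    comp = alphas * norm.pdf(x, loc=mus, scale=sigmas)    # shape (n, k)
    return np.log(comp.sum(axis=1)).sum()

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(mixture_log_likelihood(data,
                             alphas=np.array([0.5, 0.5]),
                             mus=np.array([-2.0, 3.0]),
                             sigmas=np.array([1.0, 1.0])))
```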

Page 21

Gaussian mixtures: a peculiar type of ML

$\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C_1, C_2, \ldots, C_k, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}$

Maximum (log-)likelihood (ML) estimate of Θ:

$\hat{\Theta} = \arg\max_{\Theta} L(\mathbf{x}, \Theta)$

Subject to: $C_i$ positive definite, $\sum_{i=1}^{k} \alpha_i = 1$, and $\alpha_i \ge 0$

Problem: the likelihood function is unbounded as $\det(C_i) \to 0$, so there is no global maximum.

Unusual goal: a "good" local maximum.

Page 22

A peculiar type of ML problem

Example: a 2-component Gaussian mixture

$f(x \mid \mu_1, \mu_2, \sigma_2, \alpha) = \frac{\alpha}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma_2^2}\right) + \frac{1-\alpha}{\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu_2)^2}{2}\right)$

Some data points: $\{x_1, x_2, \ldots, x_n\}$

$L(\mathbf{x}, \Theta) = \log\!\left[\frac{\alpha}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x_1-\mu_1)^2}{2\sigma_2^2}\right) + \frac{1-\alpha}{\sqrt{2\pi}} \exp\!\left(-\frac{(x_1-\mu_2)^2}{2}\right)\right] + \sum_{j=2}^{n} \log(\ldots)$

Set $\mu_1 = x_1$: then $L \to \infty$ as $\sigma_2 \to 0$.
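A tiny numerical check of this singularity (illustrative, not from the slides): pin the narrow component's mean to the first data point and shrink σ₂; the log-likelihood grows without bound, while the remaining points stay covered by the unit-variance component.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=50)

def loglik(mu1, mu2, sigma2, alpha=0.5):
    # two components: N(mu1, sigma2^2) (the shrinking one) and N(mu2, 1)
    f = alpha * norm.pdf(x, mu1, sigma2) + (1 - alpha) * norm.pdf(x, mu2, 1.0)
    return np.log(f).sum()

mu1 = x[0]                                   # mu_1 = x_1
for sigma2 in [1.0, 0.1, 1e-2, 1e-4, 1e-8]:
    print(f"sigma2 = {sigma2:g}, log-likelihood = {loglik(mu1, 0.0, sigma2):.1f}")
```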

Page 23

Fitting mixtures: a missing data problem

The ML estimate has no closed-form solution. The standard alternative is the expectation-maximization (EM) algorithm, which treats mixture fitting as a missing data problem.

Observed data: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Missing data (the labels, or "colors"): $\mathbf{z} = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}$, where

$z^{(j)} = [z_1^{(j)}, z_2^{(j)}, \ldots, z_k^{(j)}]^T = [0, \ldots, 0, 1, 0, \ldots, 0]^T$

with the "1" at position i ⇔ $x^{(j)}$ was generated by component i.

Page 24

Fitting mixtures: a missing data problem

Observed data: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Missing data: $\mathbf{z} = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}$, with $z^{(j)} = [z_1^{(j)}, \ldots, z_k^{(j)}]$ (k−1 zeros and one "1")

Complete log-likelihood function:

$L_c(\mathbf{x}, \mathbf{z}, \Theta) = \sum_{j=1}^{n} \log f(x^{(j)}, z^{(j)} \mid \Theta) = \sum_{j=1}^{n} \sum_{i=1}^{k} z_i^{(j)} \log\left[\alpha_i f(x^{(j)} \mid \theta_i)\right]$

Given both x and z, Θ would be easy to estimate, …but z is missing.

Page 25

The EM algorithm

Iterative procedure: $\hat{\Theta}^{(0)}, \hat{\Theta}^{(1)}, \ldots, \hat{\Theta}^{(t)}, \hat{\Theta}^{(t+1)}, \ldots$

The E-step: compute the expected value of $L_c(\mathbf{x}, \mathbf{z}, \Theta)$:

$Q(\Theta, \hat{\Theta}^{(t)}) \equiv E\left[L_c(\mathbf{x}, \mathbf{z}, \Theta) \mid \mathbf{x}, \hat{\Theta}^{(t)}\right]$

The M-step: update the parameter estimates:

$\hat{\Theta}^{(t+1)} = \arg\max_{\Theta} Q(\Theta, \hat{\Theta}^{(t)})$

Under mild conditions, $\hat{\Theta}^{(t)}$ converges, as $t \to \infty$, to a local maximum of $L(\mathbf{x}, \Theta)$.

Page 26

The EM algorithm: the Gaussian case

The E-step:

$Q(\Theta, \hat{\Theta}^{(t)}) \equiv E_{\mathbf{z}}\left[L_c(\mathbf{x}, \mathbf{z}, \Theta) \mid \mathbf{x}, \hat{\Theta}^{(t)}\right] = L_c(\mathbf{x}, E[\mathbf{z} \mid \mathbf{x}, \hat{\Theta}^{(t)}], \Theta)$

because $L_c(\mathbf{x}, \mathbf{z}, \Theta)$ is linear in $\mathbf{z}$.

Since $z_i^{(j)}$ is a binary variable,

$E[z_i^{(j)} \mid \mathbf{x}, \hat{\Theta}^{(t)}] = \Pr\{z_i^{(j)} = 1 \mid \mathbf{x}, \hat{\Theta}^{(t)}\} \equiv w_i^{(j,t)} = \frac{\hat{\alpha}_i^{(t)} f(x^{(j)} \mid \hat{\theta}_i^{(t)})}{\sum_{m=1}^{k} \hat{\alpha}_m^{(t)} f(x^{(j)} \mid \hat{\theta}_m^{(t)})}$  (Bayes law)

$w_i^{(j,t)}$ is the estimate, at iteration t, of the probability that $x^{(j)}$ was produced by component i: a "soft" probabilistic assignment.

Page 27

The EM algorithm: the Gaussian case

Result of the E-step: $w_i^{(j,t)}$, the estimate, at iteration t, of the probability that $x^{(j)}$ was produced by component i.

The M-step:

$\hat{\alpha}_i^{(t+1)} = \frac{1}{n} \sum_{j=1}^{n} w_i^{(j,t)}$

$\hat{\mu}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)}\, x^{(j)}}{\sum_{j=1}^{n} w_i^{(j,t)}}$

$\hat{C}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)}\, (x^{(j)} - \hat{\mu}_i^{(t+1)})(x^{(j)} - \hat{\mu}_i^{(t+1)})^T}{\sum_{j=1}^{n} w_i^{(j,t)}}$
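Putting the E- and M-steps together, here is a compact Python sketch of EM for a Gaussian mixture with arbitrary covariances (an illustration consistent with the updates above, not the authors' code; the initialization and stopping rule are kept deliberately simple, and a small jitter on the covariances guards against the det(C_i) → 0 singularity discussed earlier).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a k-component Gaussian mixture with arbitrary covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)]          # init means at random data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: w[j, i] = alpha_i f(x_j | theta_i) / sum_m alpha_m f(x_j | theta_m)
        dens = np.column_stack([alphas[i] * multivariate_normal.pdf(X, mus[i], covs[i])
                                for i in range(k)])
        w = dens / dens.sum(axis=1, keepdims=True)          # soft assignments, shape (n, k)

        # M-step: weighted re-estimation of alpha_i, mu_i, C_i
        nk = w.sum(axis=0)                                  # effective counts per component
        alphas = nk / n
        mus = (w.T @ X) / nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            covs[i] = (w[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)

    return alphas, mus, covs
```

In practice one would also monitor the log-likelihood for convergence and rerun from several initializations, given the initialization sensitivity discussed on the next slides.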

Page 28

Difficulties with EM

• It is a local (greedy) algorithm (the likelihood never decreases).
• It is initialization dependent.

[Figure: two runs from different initializations, converging after 74 and 270 iterations.]

Page 29

Automatically deciding the number of components

• Add a penalty term to the objective function that increases with the number of components.
• Start with a large number of components.
• Modify the M-step to include a "killer criterion" that removes components satisfying a certain criterion.
• Finally, choose the number of components that results in the largest objective function value (likelihood − penalty).
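The slides' approach modifies EM itself (penalty plus component killing). As a simpler, hedged illustration of the same penalized-likelihood idea, one can fit mixtures for a range of k and keep the value with the best score; the sketch below uses BIC as the penalty, which is this example's assumption rather than the criterion used in the slides, and reuses the hypothetical em_gmm sketched earlier.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bic(X, alphas, mus, covs):
    """BIC = -2 * log-likelihood + (number of free parameters) * log(n); lower is better."""
    n, d = X.shape
    k = len(alphas)
    dens = sum(alphas[i] * multivariate_normal.pdf(X, mus[i], covs[i]) for i in range(k))
    loglik = np.log(dens).sum()
    n_params = (k - 1) + k * d + k * d * (d + 1) // 2   # mixing probs + means + covariances
    return -2.0 * loglik + n_params * np.log(n)

def select_k(X, k_max=10):
    """Fit a mixture for each k and return the k with the smallest BIC."""
    # em_gmm is the EM sketch from the earlier code block
    scores = {k: bic(X, *em_gmm(X, k)) for k in range(1, k_max + 1)}
    return min(scores, key=scores.get)
```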

Page 30

Example

Same as in [Ueda and Nakano, 1998].

Page 31

Example

k = 4 components, n = 1200 points, equal mixing probabilities $\alpha_m = 1/4$, common covariance $C = I$, and $k_{max} = 10$.

[Figure: the four component means and the fitted mixture.]

Page 32

Example

Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].

Page 33

Example

An example with overlapping components.

Page 34

The iris (4-dim.) data-set: 3 components correctly identified.

Page 35

Another supervised learning example

Problem: learn to classify textures from 19 Gabor features.
- Four classes.
- Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture.
- Test on the remaining data.

Error rates:
  Mixture-based:           0.0074
  Linear discriminant:     0.0185
  Quadratic discriminant:  0.0155

Page 36

[Figures: 2-d projection of the texture data and the obtained mixtures; resulting decision regions.]

Page 37

Properties of EM

EM is extremely popular because of the following properties:
• Easy to implement.
• Guarantees that the likelihood increases monotonically (why?).
• Guarantees convergence of the solution to a stationary point, i.e., a local maximum (why?).

Limitations of EM
• The resulting solution depends highly on the initialization.
• It can be slow in several cases compared to direct optimization methods (e.g., iterative scaling).

Page 38

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$ for the maximizer of the log-likelihood $l(\theta_1, \theta_2)$.

[Figure: the log-likelihood surface $l(\theta_1, \theta_2)$ with the initial guess marked.]

Page 39

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound:
$l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$
where $Q(\theta_1, \theta_2)$ is a concave function, with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$.

[Figure: the lower bound touching $l(\theta_1, \theta_2)$ at the current guess.]

Page 40

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound $l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$, where $Q$ is concave with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$.
• Search for the solution $\{\theta_1^1, \theta_2^1\}$ that maximizes $Q(\theta_1, \theta_2)$.

Page 41

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound $l(\theta_1, \theta_2) \ge l(\theta_1^1, \theta_2^1) + Q(\theta_1, \theta_2)$ around the current point, where $Q$ is concave and touches $l$ there.
• Search for the solution that maximizes $Q(\theta_1, \theta_2)$.
• Repeat the procedure: $\{\theta_1^1, \theta_2^1\}, \{\theta_1^2, \theta_2^2\}, \ldots$

Page 42

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound $l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$, where $Q$ is concave with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$.
• Search for the solution that maximizes $Q(\theta_1, \theta_2)$.
• Repeat the procedure: $\{\theta_1^0, \theta_2^0\}, \{\theta_1^1, \theta_2^1\}, \{\theta_1^2, \theta_2^2\}, \ldots$
• Converge to the local optimum of $l(\theta_1, \theta_2)$ (the "optimal point").

[Figure: the sequence of estimates climbing the log-likelihood surface toward the optimal point.]

Page 43

Summary

• Expectation-Maximization algorithm
  • E-step: compute the expected complete-data likelihood.
  • M-step: maximize that likelihood to find the parameters.
• Can be used with any model with hidden (latent) variables.
  • Hidden variables can be natural to the model or can be artificially introduced.
  • They make parameter estimation simpler and more efficient.
• The EM algorithm can be explained from many perspectives:
  • bound optimization
  • proximal point optimization, etc.
• Several generalizations/specializations exist.
• Easy to implement, and widely used!