
Page 1

Finite Mixture Models and Expectation Maximization

Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Page 2

Recall: The Supervised Learning Problem

• Given a set of n samples X = {(x_i, y_i)}, i = 1,…,n
• Chapter 3 of DHS
• Assume examples in each class come from a parameterized Gaussian density.
• Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification.
• Estimation uses the Maximum Likelihood approach.

Page 3

Review of Maximum Likelihood

• Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ.
• Goal: estimate θ, denoted by θ̂, such that the observed data is most likely to be from the distribution with that θ.
• Steps involved:
  - Write the likelihood of the observed data.
  - Maximize the likelihood with respect to the parameter.
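For a single Gaussian these two steps have a closed-form solution. Below is a minimal Python sketch (illustrative, not from the slides; the data and names are made up) of ML estimation of the mean and variance of a 1D Gaussian.

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates (mu_hat, var_hat) for 1D Gaussian data x.

    Maximizing sum_j log N(x_j | mu, sigma^2) gives the sample mean and
    the (biased, 1/n) sample variance in closed form.
    """
    mu_hat = x.mean()
    var_hat = ((x - mu_hat) ** 2).mean()   # note: divides by n, not n - 1
    return mu_hat, var_hat

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=100)
print(gaussian_mle(samples))   # estimates approach (1.0, 0.25) as n grows
```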

Page 4

Example: 1D Gaussian Distribution

[Figure: maximum likelihood estimation of the mean of a Gaussian distribution, showing the true density, the MLE from 10 samples, and the MLE from 100 samples.]

Page 5

Example: 2D Gaussian Distribution

[Figure: 2D parameter estimation for a Gaussian distribution. Blue: true density; red: density estimated from 50 examples.]

Page 6

Multimodal Class Distributions

• A single Gaussian may not accurately model the classes.
• Find subclasses in handwritten "online" characters (122,000 characters written by 100 writers).
• Performance improves by modeling subclasses.

Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, Mar 2002

Page 7

Multimodal Classes

A single Gaussian distribution may not model the classes accurately.

[Figure: handwritten 'f' vs. 'y' classification task.]

Page 8

An extreme example of multimodal classes

Red vs. blue classification. The classes are well separated; however, incorrect model assumptions result in high classification error.

The 'red' class is a mixture of two Gaussian distributions. There is no class label information when modeling the density of just the red class.

[Figure: limitations of unimodal class modelling.]

Page 9

Finite mixtures

k random sources, with probability density functions f_i(x), i = 1,…,k.

[Diagram: one of the sources f_1(x), f_2(x), …, f_k(x) is chosen at random and generates the random variable X.]

Page 10

Finite mixtures

Example: 3 species (Iris).

Page 11

Finite mixtures

Choose one of the sources $f_1(x), f_2(x), \ldots, f_k(x)$ at random, with Prob(source i) = $\alpha_i$; the chosen source generates the random variable X.

Conditional: $f(x \mid \text{source } i) = f_i(x)$

Joint: $f(x \text{ and source } i) = f_i(x)\,\alpha_i$

Unconditional: $f(x) = \sum_{\text{all sources}} f(x \text{ and source } i) = \sum_{i=1}^{k} \alpha_i f_i(x)$

Page 12

Finite mixtures

$f(x) = \sum_{i=1}^{k} \alpha_i f_i(x)$

- $f_i(x)$: component densities
- $\alpha_i$: mixing probabilities, with $\sum_{i=1}^{k} \alpha_i = 1$ and $\alpha_i \ge 0$

Parameterized components (e.g., Gaussian): $f_i(x) = f(x \mid \theta_i)$, giving

$f(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i f(x \mid \theta_i)$, with $\Theta = \{\theta_1, \theta_2, \ldots, \theta_k, \alpha_1, \alpha_2, \ldots, \alpha_k\}$
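As a concrete illustration (not from the slides), here is a small Python sketch that evaluates this mixture density for 1D Gaussian components; the parameter values are made up for the example.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, alphas, mus, sigmas):
    """f(x | Theta) = sum_i alpha_i N(x | mu_i, sigma_i^2), for 1D Gaussian components."""
    x = np.asarray(x)
    return sum(a * norm.pdf(x, loc=m, scale=s)
               for a, m, s in zip(alphas, mus, sigmas))

# illustrative parameters; the mixing probabilities sum to 1
print(mixture_pdf([0.0, 1.0, 2.0],
                  alphas=[0.6, 0.4], mus=[-1.0, 2.0], sigmas=[1.0, 0.5]))
```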

Page 13

Gaussian mixtures

Each component density $f(x \mid \theta_i)$ is Gaussian.

Arbitrary covariances: $f(x \mid \theta_i) = N(x \mid \mu_i, C_i)$, with
$\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C_1, C_2, \ldots, C_k, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}$

Common covariance: $f(x \mid \theta_i) = N(x \mid \mu_i, C)$, with
$\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}$

Page 14

Mixture fitting / estimation

Data: n independent observations, $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Goals: estimate the parameter set $\Theta$, and maybe "classify the observations".

Example:
- How many species? Mean of each species?
- Which points belong to each species?

[Figure: observed data vs. classified data (classes unknown).]

Page 15

Gaussian mixtures (d = 1), an example

$\mu_1 = -2,\; \sigma_1 = 3,\; \alpha_1 = 0.6$
$\mu_2 = 4,\; \sigma_2 = 2,\; \alpha_2 = 0.3$
$\mu_3 = 7,\; \sigma_3 = 0.1,\; \alpha_3 = 0.1$
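A minimal sketch (using the parameters above; the code itself is illustrative, not from the slides) of how such data is generated: draw a source index with probabilities α_i, then sample from the chosen Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.6, 0.3, 0.1])
mus    = np.array([-2.0, 4.0, 7.0])
sigmas = np.array([3.0, 2.0, 0.1])

def sample_mixture(n):
    """Ancestral sampling: choose source i with prob alpha_i, then x ~ N(mu_i, sigma_i^2)."""
    z = rng.choice(len(alphas), size=n, p=alphas)    # hidden component labels
    x = rng.normal(loc=mus[z], scale=sigmas[z])
    return x, z

x, z = sample_mixture(1000)
print(x[:5], z[:5])
```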

Page 16

Gaussian mixtures, an R² example

k = 3 Gaussian components in R², 1500 points.

[Figure: 1500 points drawn from the 3-component mixture, with the component means and covariance matrices shown.]

Page 17

Uses of mixtures in pattern recognition

Unsupervised learning (model-based clustering):
- each component models one cluster
- clustering = mixture fitting

Observations: unclassified points.

Goals: find the classes, classify the points.

Page 18

Uses of mixtures in pattern recognition

Mixtures can approximate arbitrary densities, so they are good for representing class-conditional densities in supervised learning.

Example:
- two strongly non-Gaussian classes
- use mixtures to model each class-conditional density

Page 19

Uses of mixtures in pattern recognition

• Find subclasses (lexemes), e.g., in "online" characters (122,000 characters written by 100 writers).
• Performance improves by modeling subclasses.

Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, 2002

Page 20

Fitting mixtures

n independent observations: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Maximum (log-)likelihood (ML) estimate of Θ:

$\hat{\Theta} = \arg\max_{\Theta} L(\mathbf{x}, \Theta)$

$L(\mathbf{x}, \Theta) = \log \prod_{j=1}^{n} f(x^{(j)} \mid \Theta) = \sum_{j=1}^{n} \log \underbrace{\sum_{i=1}^{k} \alpha_i f(x^{(j)} \mid \theta_i)}_{\text{mixture}}$

The ML estimate has no closed-form solution.
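A short sketch (not from the slides) of evaluating this log-likelihood for a 1D Gaussian mixture; this is the objective that EM climbs in the following slides.

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, alphas, mus, sigmas):
    """L(x, Theta) = sum_j log sum_i alpha_i N(x_j | mu_i, sigma_i^2)."""
    x = np.asarray(x)[:, None]                            # shape (n, 1)
    comp = alphas * norm.pdf(x, loc=mus, scale=sigmas)    # shape (n, k)
    return np.log(comp.sum(axis=1)).sum()

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(mixture_log_likelihood(data,
                             alphas=np.array([0.5, 0.5]),
                             mus=np.array([-2.0, 3.0]),
                             sigmas=np.array([1.0, 1.0])))
```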

Page 21

Gaussian mixtures: a peculiar type of ML

$\Theta = \{\mu_1, \mu_2, \ldots, \mu_k, C_1, C_2, \ldots, C_k, \alpha_1, \alpha_2, \ldots, \alpha_{k-1}\}$

Maximum (log-)likelihood (ML) estimate of Θ:

$\hat{\Theta} = \arg\max_{\Theta} L(\mathbf{x}, \Theta)$

Subject to: $C_i$ positive definite, $\sum_{i=1}^{k} \alpha_i = 1$, and $\alpha_i \ge 0$

Problem: the likelihood function is unbounded as $\det(C_i) \to 0$, so there is no global maximum.

Unusual goal: a "good" local maximum.

Page 22

A peculiar type of ML problem

Example: a 2-component Gaussian mixture

$f(x \mid \mu_1, \mu_2, \sigma_2, \alpha) = \frac{\alpha}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma_2^2}\right) + \frac{1-\alpha}{\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu_2)^2}{2}\right)$

Some data points: $\{x_1, x_2, \ldots, x_n\}$

$L(\mathbf{x}, \Theta) = \log\!\left[\frac{\alpha}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x_1-\mu_1)^2}{2\sigma_2^2}\right) + \frac{1-\alpha}{\sqrt{2\pi}} \exp\!\left(-\frac{(x_1-\mu_2)^2}{2}\right)\right] + \sum_{j=2}^{n} \log(\ldots)$

Set $\mu_1 = x_1$: then $L \to \infty$ as $\sigma_2 \to 0$.
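A tiny numerical check of this singularity (illustrative, not from the slides): pin the narrow component's mean to the first data point and shrink σ₂; the log-likelihood grows without bound, while the remaining points stay covered by the unit-variance component.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=50)

def loglik(mu1, mu2, sigma2, alpha=0.5):
    # two components: N(mu1, sigma2^2) (the shrinking one) and N(mu2, 1)
    f = alpha * norm.pdf(x, mu1, sigma2) + (1 - alpha) * norm.pdf(x, mu2, 1.0)
    return np.log(f).sum()

mu1 = x[0]                                   # mu_1 = x_1
for sigma2 in [1.0, 0.1, 1e-2, 1e-4, 1e-8]:
    print(f"sigma2 = {sigma2:g}, log-likelihood = {loglik(mu1, 0.0, sigma2):.1f}")
```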

Page 23

Fitting mixtures: a missing data problem

The ML estimate has no closed-form solution. The standard alternative is the expectation-maximization (EM) algorithm, which treats mixture fitting as a missing data problem.

Observed data: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Missing data (the labels, or "colors"): $\mathbf{z} = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}$, where

$z^{(j)} = [z_1^{(j)}, z_2^{(j)}, \ldots, z_k^{(j)}]^T = [0, \ldots, 0, 1, 0, \ldots, 0]^T$

with the "1" at position i ⇔ $x^{(j)}$ was generated by component i.

Page 24

Fitting mixtures: a missing data problem

Observed data: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$

Missing data: $\mathbf{z} = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}$, with $z^{(j)} = [z_1^{(j)}, \ldots, z_k^{(j)}]$ (k−1 zeros and one "1")

Complete log-likelihood function:

$L_c(\mathbf{x}, \mathbf{z}, \Theta) = \sum_{j=1}^{n} \log f(x^{(j)}, z^{(j)} \mid \Theta) = \sum_{j=1}^{n} \sum_{i=1}^{k} z_i^{(j)} \log\left[\alpha_i f(x^{(j)} \mid \theta_i)\right]$

Given both x and z, Θ would be easy to estimate, …but z is missing.

Page 25

The EM algorithm

Iterative procedure: $\hat{\Theta}^{(0)}, \hat{\Theta}^{(1)}, \ldots, \hat{\Theta}^{(t)}, \hat{\Theta}^{(t+1)}, \ldots$

The E-step: compute the expected value of $L_c(\mathbf{x}, \mathbf{z}, \Theta)$:

$Q(\Theta, \hat{\Theta}^{(t)}) \equiv E\left[L_c(\mathbf{x}, \mathbf{z}, \Theta) \mid \mathbf{x}, \hat{\Theta}^{(t)}\right]$

The M-step: update the parameter estimates:

$\hat{\Theta}^{(t+1)} = \arg\max_{\Theta} Q(\Theta, \hat{\Theta}^{(t)})$

Under mild conditions, $\hat{\Theta}^{(t)}$ converges, as $t \to \infty$, to a local maximum of $L(\mathbf{x}, \Theta)$.

Page 26

The EM algorithm: the Gaussian case

The E-step:

$Q(\Theta, \hat{\Theta}^{(t)}) \equiv E_{\mathbf{z}}\left[L_c(\mathbf{x}, \mathbf{z}, \Theta) \mid \mathbf{x}, \hat{\Theta}^{(t)}\right] = L_c(\mathbf{x}, E[\mathbf{z} \mid \mathbf{x}, \hat{\Theta}^{(t)}], \Theta)$

because $L_c(\mathbf{x}, \mathbf{z}, \Theta)$ is linear in $\mathbf{z}$.

Since $z_i^{(j)}$ is a binary variable,

$E[z_i^{(j)} \mid \mathbf{x}, \hat{\Theta}^{(t)}] = \Pr\{z_i^{(j)} = 1 \mid \mathbf{x}, \hat{\Theta}^{(t)}\} \equiv w_i^{(j,t)} = \frac{\hat{\alpha}_i^{(t)} f(x^{(j)} \mid \hat{\theta}_i^{(t)})}{\sum_{m=1}^{k} \hat{\alpha}_m^{(t)} f(x^{(j)} \mid \hat{\theta}_m^{(t)})}$  (Bayes law)

$w_i^{(j,t)}$ is the estimate, at iteration t, of the probability that $x^{(j)}$ was produced by component i: a "soft" probabilistic assignment.

Page 27

The EM algorithm: the Gaussian case

Result of the E-step: $w_i^{(j,t)}$, the estimate, at iteration t, of the probability that $x^{(j)}$ was produced by component i.

The M-step:

$\hat{\alpha}_i^{(t+1)} = \frac{1}{n} \sum_{j=1}^{n} w_i^{(j,t)}$

$\hat{\mu}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)}\, x^{(j)}}{\sum_{j=1}^{n} w_i^{(j,t)}}$

$\hat{C}_i^{(t+1)} = \frac{\sum_{j=1}^{n} w_i^{(j,t)}\, (x^{(j)} - \hat{\mu}_i^{(t+1)})(x^{(j)} - \hat{\mu}_i^{(t+1)})^T}{\sum_{j=1}^{n} w_i^{(j,t)}}$
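Putting the E- and M-steps together, here is a compact Python sketch of EM for a Gaussian mixture with arbitrary covariances (an illustration consistent with the updates above, not the authors' code; the initialization and stopping rule are kept deliberately simple, and a small jitter on the covariances guards against the det(C_i) → 0 singularity discussed earlier).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a k-component Gaussian mixture with arbitrary covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)]          # init means at random data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: w[j, i] = alpha_i f(x_j | theta_i) / sum_m alpha_m f(x_j | theta_m)
        dens = np.column_stack([alphas[i] * multivariate_normal.pdf(X, mus[i], covs[i])
                                for i in range(k)])
        w = dens / dens.sum(axis=1, keepdims=True)          # soft assignments, shape (n, k)

        # M-step: weighted re-estimation of alpha_i, mu_i, C_i
        nk = w.sum(axis=0)                                  # effective counts per component
        alphas = nk / n
        mus = (w.T @ X) / nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            covs[i] = (w[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)

    return alphas, mus, covs
```

In practice one would also monitor the log-likelihood for convergence and rerun from several initializations, given the initialization sensitivity discussed on the next slides.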

Page 28

Difficulties with EM

• It is a local (greedy) algorithm (the likelihood never decreases).
• It is initialization dependent.

[Figure: two runs from different initializations, converging after 74 and 270 iterations.]

Page 29

Automatically deciding the number of components

• Add a penalty term to the objective function that increases with the number of components.
• Start with a large number of components.
• Modify the M-step to include a "killer criterion" that removes components satisfying a certain criterion.
• Finally, choose the number of components that results in the largest objective function value (likelihood − penalty).
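The slides' approach modifies EM itself (penalty plus component killing). As a simpler, hedged illustration of the same penalized-likelihood idea, one can fit mixtures for a range of k and keep the value with the best score; the sketch below uses BIC as the penalty, which is this example's assumption rather than the criterion used in the slides, and reuses the hypothetical em_gmm sketched earlier.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bic(X, alphas, mus, covs):
    """BIC = -2 * log-likelihood + (number of free parameters) * log(n); lower is better."""
    n, d = X.shape
    k = len(alphas)
    dens = sum(alphas[i] * multivariate_normal.pdf(X, mus[i], covs[i]) for i in range(k))
    loglik = np.log(dens).sum()
    n_params = (k - 1) + k * d + k * d * (d + 1) // 2   # mixing probs + means + covariances
    return -2.0 * loglik + n_params * np.log(n)

def select_k(X, k_max=10):
    """Fit a mixture for each k and return the k with the smallest BIC."""
    # em_gmm is the EM sketch from the earlier code block
    scores = {k: bic(X, *em_gmm(X, k)) for k in range(1, k_max + 1)}
    return min(scores, key=scores.get)
```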

Page 30

Example

Same as in [Ueda and Nakano, 1998].

Page 31

Example

k = 4 components, n = 1200 points, equal mixing probabilities $\alpha_m = 1/4$, common covariance $C = I$, and $k_{max} = 10$.

[Figure: the four component means and the fitted mixture.]

Page 32

Example

Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].

Page 33

Example

An example with overlapping components.

Page 34

The iris (4-dim.) data-set: 3 components correctly identified.

Page 35

Another supervised learning example

Problem: learn to classify textures from 19 Gabor features.
- Four classes.
- Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture.
- Test on the remaining data.

Error rates:
  Mixture-based:           0.0074
  Linear discriminant:     0.0185
  Quadratic discriminant:  0.0155

Page 36

[Figures: 2-d projection of the texture data and the obtained mixtures; resulting decision regions.]

Page 37

Properties of EM

EM is extremely popular because of the following properties:
• Easy to implement.
• Guarantees that the likelihood increases monotonically (why?).
• Guarantees convergence of the solution to a stationary point, i.e., a local maximum (why?).

Limitations of EM
• The resulting solution depends highly on the initialization.
• It can be slow in several cases compared to direct optimization methods (e.g., iterative scaling).

Page 38

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$ for the maximizer of the log-likelihood $l(\theta_1, \theta_2)$.

[Figure: the log-likelihood surface $l(\theta_1, \theta_2)$ with the initial guess marked.]

Page 39

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound:
$l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$
where $Q(\theta_1, \theta_2)$ is a concave function, with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$.

[Figure: the lower bound touching $l(\theta_1, \theta_2)$ at the current guess.]

Page 40

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound $l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$, where $Q$ is concave with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$.
• Search for the solution $\{\theta_1^1, \theta_2^1\}$ that maximizes $Q(\theta_1, \theta_2)$.

Page 41

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound $l(\theta_1, \theta_2) \ge l(\theta_1^1, \theta_2^1) + Q(\theta_1, \theta_2)$ around the current point, where $Q$ is concave and touches $l$ there.
• Search for the solution that maximizes $Q(\theta_1, \theta_2)$.
• Repeat the procedure: $\{\theta_1^1, \theta_2^1\}, \{\theta_1^2, \theta_2^2\}, \ldots$

Page 42

EM as lower bound optimization

• Start with an initial guess $\{\theta_1^0, \theta_2^0\}$.
• Come up with a lower bound $l(\theta_1, \theta_2) \ge l(\theta_1^0, \theta_2^0) + Q(\theta_1, \theta_2)$, where $Q$ is concave with touch point $Q(\theta_1 = \theta_1^0, \theta_2 = \theta_2^0) = 0$.
• Search for the solution that maximizes $Q(\theta_1, \theta_2)$.
• Repeat the procedure: $\{\theta_1^0, \theta_2^0\}, \{\theta_1^1, \theta_2^1\}, \{\theta_1^2, \theta_2^2\}, \ldots$
• Converge to the local optimum of $l(\theta_1, \theta_2)$ (the "optimal point").

[Figure: the sequence of estimates climbing the log-likelihood surface toward the optimal point.]

Page 43

Summary

• Expectation-Maximization algorithm
  • E-step: compute the expected complete-data likelihood.
  • M-step: maximize that likelihood to find the parameters.
• Can be used with any model with hidden (latent) variables.
  • Hidden variables can be natural to the model or can be artificially introduced.
  • They make parameter estimation simpler and more efficient.
• The EM algorithm can be explained from many perspectives:
  • bound optimization
  • proximal point optimization, etc.
• Several generalizations/specializations exist.
• Easy to implement, and widely used!