
Page 1: Expectation Maximization (EM) Mixture of Gaussians

CS 2750 Machine Learning, Lecture 18

Milos Hauskrecht
[email protected]
5329 Sennott Square

Expectation Maximization (EM)
Mixture of Gaussians

Page 2: Expectation Maximization (EM) Mixture of Gaussians

Learning probability distributions

Basic learning settings:
• A set of random variables $X = \{X_1, X_2, \ldots, X_n\}$
• A model of the distribution over the variables in X, with parameters $\Theta$
• Data $D = \{D_1, D_2, \ldots, D_N\}$ s.t. $D_i = (x_1^i, x_2^i, \ldots, x_n^i)$
Objective: find parameters $\hat{\Theta}$ that describe p(X) based on D
Assumptions considered so far:
– Known parameterizations
– No hidden variables
– No missing values
This lecture: learning with hidden variables and missing values

Page 3: Expectation Maximization (EM) Mixture of Gaussians

Learning with hidden variables and missing values

Problem: we want to learn $P(X)$ from data D
• Data consists of instances with values assigned to variables
• But sometimes the values are hidden or missing

Data (with '-' marking missing values):
x1 x2 x3 x4 x5
T  T  F  T  T
T  -  T  F  F
F  T  -  F  T

Extended data (hidden variable h: all of its values are missing):
x1 x2 x3 x4 x5 h
T  T  F  T  T  -
T  -  T  F  F  -
F  T  -  F  T  -
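For concreteness, this data setting could be represented as follows (a sketch only; the np.nan encoding and variable names are illustrative, not from the lecture):

import numpy as np

T, F, MISSING = 1.0, 0.0, np.nan
data = np.array([                # columns: x1, x2, x3, x4, x5
    [T, T, F, T, T],
    [T, MISSING, T, F, F],       # one missing attribute value
    [F, T, MISSING, F, T],
])
h = np.full((3, 1), MISSING)            # hidden variable: every value is missing
extended_data = np.hstack([data, h])    # the 'extended data' with the extra column h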

Page 4: Expectation Maximization (EM) Mixture of Gaussians

Density estimation

Goal: find the best set of parameters based on the data.
Estimation criteria:
– ML: $\max_{\Theta} p(D \mid \Theta, \xi)$
– Bayesian: $p(\Theta \mid D, \xi)$

Why are hidden variables and missing values a problem? The parameter estimation for models defined by a set of parameters no longer decomposes into a set of independent estimation tasks.

Approaches for ML estimates: general optimization methods – gradient ascent, conjugate gradient, Newton-Raphson, etc.

An alternative optimization method:
• The Expectation-maximization (EM) method
– Suitable when there are missing or hidden values
– Takes advantage of the structure of the belief network

Page 5: Expectation Maximization (EM) Mixture of Gaussians

General EM

The key idea of the method: compute the parameter estimates iteratively by performing the following two steps:
1. Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations for the current set of parameters $\Theta'$.
2. Maximization step. Compute the new estimates of $\Theta$ by considering the expectations of the different value completions.
Stop when no improvement is possible.

Page 6: Expectation Maximization (EM) Mixture of Gaussians

General EM

(1) Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations for the current set of parameters $\Theta'$ estimating P(x).

x1 x2 x3 x4 x5 h
T  T  F  T  T  -
T  -  T  F  F  -
F  T  -  F  T  -

Calculate the expectations for all unknown value assignments using the current parameters $\Theta'$, e.g.:
$P(h^{(1)} = T \mid D^{(1)}, \Theta')$,  $P(h^{(1)} = F \mid D^{(1)}, \Theta')$
$P(x_2^{(2)} = T, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = T, h^{(2)} = F \mid D^{(2)}, \Theta')$
$P(x_2^{(2)} = F, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = F, h^{(2)} = F \mid D^{(2)}, \Theta')$


Page 8: Expectation Maximization (EM) Mixture of Gaussians

General EM

(2) Maximization step. Compute the new, improved estimates of the parameters $\Theta$ by considering the expectations of the different value completions computed under the current parameters $\Theta'$:
$P(h^{(1)} = T \mid D^{(1)}, \Theta')$,  $P(h^{(1)} = F \mid D^{(1)}, \Theta')$
$P(x_2^{(2)} = T, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = T, h^{(2)} = F \mid D^{(2)}, \Theta')$
$P(x_2^{(2)} = F, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = F, h^{(2)} = F \mid D^{(2)}, \Theta')$

x1 x2 x3 x4 x5 h
T  T  F  T  T  -
T  -  T  F  F  -
F  T  -  F  T  -

Page 9: Expectation Maximization (EM) Mixture of Gaussians

EM details

Let H be the set of hidden or missing variable values in the data.
Derivation:
$P(H, D \mid \Theta, \xi) = P(H \mid D, \Theta, \xi)\, P(D \mid \Theta, \xi)$
$\log P(H, D \mid \Theta, \xi) = \log P(H \mid D, \Theta, \xi) + \log P(D \mid \Theta, \xi)$
$\log P(D \mid \Theta, \xi) = \log P(H, D \mid \Theta, \xi) - \log P(H \mid D, \Theta, \xi)$
Average both sides with respect to $P(H \mid D, \Theta', \xi)$ for some $\Theta'$; the left-hand side (the log-likelihood of the data) does not depend on H, so averaging leaves it unchanged:
$E_{H \mid D,\Theta'} \log P(D \mid \Theta, \xi) = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi) - E_{H \mid D,\Theta'} \log P(H \mid D, \Theta, \xi)$
$\log P(D \mid \Theta, \xi) = Q(\Theta \mid \Theta') + H(\Theta \mid \Theta')$

Page 10: Expectation Maximization (EM) Mixture of Gaussians

EM algorithm

Algorithm (general formulation):
Initialize the parameters $\Theta$
Repeat:
  Set $\Theta' = \Theta$
  1. Expectation step: compute $Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
  2. Maximization step: $\Theta = \arg\max_{\Theta} Q(\Theta \mid \Theta')$
until no or only a small improvement in $Q$

Questions:
• Why does this lead to the ML estimate?
• What is the advantage of the algorithm?
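A minimal sketch of this generic loop in Python; e_step, m_step, and log_likelihood are hypothetical problem-specific callables (not defined in the lecture), so this is only the skeleton of the iteration:

# Generic EM loop skeleton (illustrative; callables are problem-specific placeholders).
def em(theta, data, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    prev_ll = float("-inf")
    for _ in range(max_iter):
        expectations = e_step(theta, data)     # E-step: expected completions under current theta
        theta = m_step(expectations, data)     # M-step: theta = argmax Q(theta | theta')
        ll = log_likelihood(theta, data)
        if ll - prev_ll < tol:                 # stop when no or only a small improvement
            break
        prev_ll = ll
    return theta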

Page 11: Expectation Maximization (EM) Mixture of Gaussians

EM algorithm

Question: Why is the EM algorithm correct?
• Claim: maximizing Q improves the log-likelihood.

$l(\Theta) = Q(\Theta \mid \Theta') + H(\Theta \mid \Theta')$

Difference in log-likelihoods (next and current step):
$l(\Theta) - l(\Theta') = Q(\Theta \mid \Theta') - Q(\Theta' \mid \Theta') + H(\Theta \mid \Theta') - H(\Theta' \mid \Theta')$

where
$H(\Theta \mid \Theta') = -E_{H \mid D,\Theta'} \log P(H \mid D, \Theta, \xi) = -\sum_{i} P(H_i \mid D, \Theta', \xi) \log P(H_i \mid D, \Theta, \xi)$
(the sum runs over the possible completions $H_i$ of the hidden values). Then
$H(\Theta \mid \Theta') - H(\Theta' \mid \Theta') = \sum_{i} P(H_i \mid D, \Theta', \xi) \log \frac{P(H_i \mid D, \Theta', \xi)}{P(H_i \mid D, \Theta, \xi)} \ge 0$
This is the Kullback-Leibler (KL) divergence, a "distance" between two distributions:
$KL(P \,\|\, R) = \sum_i P_i \log \frac{P_i}{R_i} \ge 0$ – it is always nonnegative.
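A quick numeric illustration of the nonnegativity of the KL divergence for two discrete distributions (the values below are arbitrary, chosen only for the example):

import numpy as np

P = np.array([0.5, 0.3, 0.2])
R = np.array([0.2, 0.5, 0.3])
kl = np.sum(P * np.log(P / R))   # KL(P || R) = sum_i P_i log(P_i / R_i)
print(kl)                        # positive; it is 0 only when P == R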

Page 12: Expectation Maximization (EM) Mixture of Gaussians

EM algorithm

Difference in log-likelihoods:
$l(\Theta) - l(\Theta') = Q(\Theta \mid \Theta') - Q(\Theta' \mid \Theta') + H(\Theta \mid \Theta') - H(\Theta' \mid \Theta')$
and, since the difference of the H-terms is nonnegative,
$l(\Theta) - l(\Theta') \ge Q(\Theta \mid \Theta') - Q(\Theta' \mid \Theta')$

Thus, by maximizing Q we improve (or at least never decrease) the log-likelihood.

EM is a first-order optimization procedure:
• Climbs the gradient
• Automatic learning rate – no need to adjust a learning rate!

Page 13: Expectation Maximization (EM) Mixture of Gaussians

EM advantages

Key advantages:
• In many problems (e.g. Bayesian belief networks) $Q(\Theta \mid \Theta')$ has a nice form and its maximization can be carried out in closed form.
• No need to compute Q explicitly before maximizing it – we directly optimize
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
using quantities corresponding to expected counts.

Page 14: Expectation Maximization (EM) Mixture of Gaussians

Naïve Bayes with a hidden class and missing values

Assume:
• $P(X)$ is modeled using a Naïve Bayes model with a hidden class variable C
• Missing entries (values) for attributes in the dataset D

Structure: the hidden class variable C is the parent of the attributes $X_1, X_2, \ldots, X_n$; the attributes are independent given the class.

Page 15: Expectation Maximization (EM) Mixture of Gaussians

EM for the Naïve Bayes model

• Use EM to iteratively optimize
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
• Parameters:
– $\pi_j$ – prior probability of class j
– $\theta_{ijk}$ – probability of attribute i having value k given class j
• Indicator variables:
– $\delta_{lj}$ – for example l, the class is j; 1 if true, 0 otherwise
– $\delta_{lijk}$ – for example l, the class is j and the value of attribute i is k
• Because the class is hidden and some attributes are missing, the 0/1 values of the indicator variables are not known; they are hidden.
• H – the collection of all indicator variables (they 'complete' the data)

Page 16: Expectation Maximization (EM) Mixture of Gaussians

EM for the Naïve Bayes model

• We can use EM to iteratively learn the parameters of the model by optimizing
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$

Complete log-likelihood:
$\log P(H, D \mid \Theta, \xi) = \log \prod_{l=1}^{N} \prod_{j} \pi_j^{\delta_{lj}} \prod_{i} \prod_{k} \theta_{ijk}^{\delta_{lijk}} = \sum_{l=1}^{N} \sum_{j} \Big( \delta_{lj} \log \pi_j + \sum_{i} \sum_{k} \delta_{lijk} \log \theta_{ijk} \Big)$

Expectation over H:
$E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi) = \sum_{l=1}^{N} \sum_{j} \Big( E_{H \mid D,\Theta'}(\delta_{lj}) \log \pi_j + \sum_{i} \sum_{k} E_{H \mid D,\Theta'}(\delta_{lijk}) \log \theta_{ijk} \Big)$

The expectations substitute the unknown 0/1 indicator values with their expected values:
$E_{H \mid D,\Theta'}(\delta_{lj}) = p(C_l = j \mid D_l, \Theta')$
$E_{H \mid D,\Theta'}(\delta_{lijk}) = p(C_l = j, X_{il} = k \mid D_l, \Theta')$

Page 17: Expectation Maximization (EM) Mixture of Gaussians

EM for the Naïve Bayes model

• Computing the derivatives of Q with respect to the parameters and setting them to 0 gives:
$\tilde{\pi}_j = \frac{\tilde{N}_j}{N}$
$\tilde{\theta}_{ijk} = \frac{\tilde{N}_{ijk}}{\sum_{k=1}^{r_i} \tilde{N}_{ijk}}$
with the expected counts
$\tilde{N}_j = \sum_{l=1}^{N} E_{H \mid D,\Theta'}(\delta_{lj}) = \sum_{l=1}^{N} p(C_l = j \mid D_l, \Theta')$
$\tilde{N}_{ijk} = \sum_{l=1}^{N} E_{H \mid D,\Theta'}(\delta_{lijk}) = \sum_{l=1}^{N} p(C_l = j, X_{il} = k \mid D_l, \Theta')$

• Important:
– Use expected counts instead of counts!
– Re-estimate the parameters using the expected counts.
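A minimal sketch of these E- and M-steps for a Naïve Bayes model with a hidden class, assuming binary attributes and no missing attribute values (simplifications made here for brevity, not stated in the lecture); all names and the initialization are illustrative:

import numpy as np

def em_naive_bayes(X, m, n_iter=50, seed=0):
    """X: (N, n) array of 0/1 attribute values; m: number of hidden classes."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    pi = np.full(m, 1.0 / m)                      # class priors pi_j
    theta = rng.uniform(0.25, 0.75, size=(m, n))  # theta[j, i] = P(X_i = 1 | C = j)
    for _ in range(n_iter):
        # E-step: h[l, j] = p(C_l = j | D_l, theta')  (expected value of delta_lj)
        log_p = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        h = np.exp(log_p)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from expected counts
        Nj = h.sum(axis=0)                        # expected class counts ~N_j
        pi = Nj / N
        theta = (h.T @ X) / Nj[:, None]           # expected count of (C=j, X_i=1) / ~N_j
        theta = np.clip(theta, 1e-6, 1 - 1e-6)    # avoid log(0)
    return pi, theta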

Page 18: Expectation Maximization (EM) Mixture of Gaussians

EM for BBNs

• The same result applies to learning the parameters of any Bayesian belief network with discrete-valued variables, where
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
• Again: use expected counts instead of counts,
$\tilde{N}_{ijk} = \sum_{l=1}^{N} p(x_i = k, pa_i = j \mid D_l, \Theta')$   (computing these may require inference)
and the parameter value maximizing Q is
$\tilde{\theta}_{ijk} = \frac{\tilde{N}_{ijk}}{\sum_{k=1}^{r_i} \tilde{N}_{ijk}}$

Page 19: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

[Figure: scatter plot of 2-D data points, both axes roughly from -2 to 2.5]

Goal: we want a model of p(x).
Problem: a single multidimensional Gaussian is not a good fit to this data.

Page 20: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

[Figure: the same 2-D data set]

Goal: we want a density model of p(x).
Problem: a single multidimensional Gaussian is not a good fit to this data, but three different Gaussians may do well.

Page 21: Expectation Maximization (EM) Mixture of Gaussians

Recall QDA

Generative classifier model:
• Models p(x, C) as p(x | C) p(C)
• Class labels are known

Model:
$p(C = i)$ – probability of a data instance coming from class C = i
$p(\mathbf{x} \mid C = i) \approx N(\mu_i, \Sigma_i)$ – class-conditional density (modeled as a Gaussian) for class i

Page 22: Expectation Maximization (EM) Mixture of Gaussians

Recall QDA

Generative classifier model:
• Models p(x, C) as p(x | C) p(C)
• Class labels are known

The ML estimates of the parameters:
$\tilde{\pi}_i = \frac{N_i}{N}$
$\tilde{\mu}_i = \frac{1}{N_i} \sum_{l:\, C_l = i} \mathbf{x}_l$
$\tilde{\Sigma}_i = \frac{1}{N_i} \sum_{l:\, C_l = i} (\mathbf{x}_l - \tilde{\mu}_i)(\mathbf{x}_l - \tilde{\mu}_i)^T$
where $N_i$ is the number of instances with class label i.
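A minimal numpy sketch of these ML estimates when the class labels are known (function and variable names are illustrative; it assumes every class index appears in the labels):

import numpy as np

def qda_ml_estimates(X, y, m):
    """X: (N, d) data, y: (N,) integer class labels in {0, ..., m-1}."""
    N, d = X.shape
    pi = np.zeros(m); mu = np.zeros((m, d)); Sigma = np.zeros((m, d, d))
    for i in range(m):
        Xi = X[y == i]                     # instances of class i
        Ni = len(Xi)
        pi[i] = Ni / N                     # prior ~pi_i = N_i / N
        mu[i] = Xi.mean(axis=0)            # class mean ~mu_i
        diff = Xi - mu[i]
        Sigma[i] = diff.T @ diff / Ni      # ML covariance (divides by N_i)
    return pi, mu, Sigma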

Page 23: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

The model of p(x) is based on the model of p(x, C):
• Class labels are not known

The same model as QDA:
$p(\mathbf{x}) = \sum_{i=1}^{k} p(C = i)\, p(\mathbf{x} \mid C = i)$
$p(C = i)$ – probability of a data point coming from class C = i
$p(\mathbf{x} \mid C = i) \approx N(\mu_i, \Sigma_i)$ – class-conditional density (modeled as a Gaussian) for class i

Important: the values of C are hidden!
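A small sketch of evaluating this mixture density p(x), assuming scipy is available (the function name and parameter layout are illustrative):

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, pi, mu, Sigma):
    """x: (d,) point; pi: (m,) mixing weights; mu: (m, d) means; Sigma: (m, d, d) covariances."""
    return sum(pi[i] * multivariate_normal.pdf(x, mean=mu[i], cov=Sigma[i])
               for i in range(len(pi)))   # p(x) = sum_i p(C=i) N(x | mu_i, Sigma_i)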

Page 24: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

Gaussian mixture: we do not know which Gaussian generated x.
• We can apply the EM algorithm: re-estimation based on the class posterior.

Class posterior (responsibility):
$h_{il} = p(C_l = i \mid \mathbf{x}_l, \Theta') = \frac{p(C_l = i \mid \Theta')\, p(\mathbf{x}_l \mid C_l = i, \Theta')}{\sum_{u=1}^{m} p(C_l = u \mid \Theta')\, p(\mathbf{x}_l \mid C_l = u, \Theta')}$

Re-estimation (the counts are replaced with expected counts):
$\tilde{N}_i = \sum_{l} h_{il}$
$\tilde{\pi}_i = \frac{\tilde{N}_i}{N}$
$\tilde{\mu}_i = \frac{1}{\tilde{N}_i} \sum_{l} h_{il}\, \mathbf{x}_l$
$\tilde{\Sigma}_i = \frac{1}{\tilde{N}_i} \sum_{l} h_{il}\, (\mathbf{x}_l - \tilde{\mu}_i)(\mathbf{x}_l - \tilde{\mu}_i)^T$
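A minimal sketch of the full EM loop for the Gaussian mixture, following the re-estimation equations above; the initialization and the small covariance regularization are illustrative choices, not prescribed by the lecture:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, m, n_iter=100, seed=0):
    """X: (N, d) data; m: number of Gaussians."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(m, 1.0 / m)
    mu = X[rng.choice(N, size=m, replace=False)]           # initialize means at random data points
    Sigma = np.array([np.eye(d)] * m)
    for _ in range(n_iter):
        # E-step: responsibilities h[l, i] = p(C_l = i | x_l, theta')
        dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
                                for i in range(m)])        # (N, m)
        h = pi * dens
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters with expected counts
        Ni = h.sum(axis=0)                                 # ~N_i
        pi = Ni / N
        mu = (h.T @ X) / Ni[:, None]
        for i in range(m):
            diff = X - mu[i]
            Sigma[i] = (h[:, i, None] * diff).T @ diff / Ni[i] + 1e-6 * np.eye(d)
    return pi, mu, Sigma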

Page 25: Expectation Maximization (EM) Mixture of Gaussians


Gaussian mixture

• Density function p(x) for the Mixture of Gaussians model

Page 26: Expectation Maximization (EM) Mixture of Gaussians

GM algorithm

Assume a special case of the GM: a fixed covariance matrix shared by all hidden groups and a uniform prior on the groups.
• Algorithm:
Initialize the means for all classes i
Repeat two steps until there is no change in the means:
1. Compute the class posterior for each Gaussian and each point (a kind of responsibility of a Gaussian for a point):
$h_{il} = \frac{p(C_l = i \mid \Theta')\, p(\mathbf{x}_l \mid C_l = i, \Theta')}{\sum_{u=1}^{m} p(C_l = u \mid \Theta')\, p(\mathbf{x}_l \mid C_l = u, \Theta')}$
2. Move the means of the Gaussians to the center of the data, weighted by the responsibilities:
$\mu_i = \frac{\sum_{l=1}^{N} h_{il}\, \mathbf{x}_l}{\sum_{l=1}^{N} h_{il}}$

Page 27: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model: Gradient ascent

• A set of parameters $\Theta = \{\pi_1, \pi_2, \ldots, \pi_m, \mu_1, \mu_2, \ldots, \mu_m\}$ for the model C → x with p(C) and p(x | C)
• Assume unit variance terms and fixed priors

$P(x \mid C = i) = (2\pi)^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_i)^2 \right\}$

$P(D \mid \Theta) = \prod_{l=1}^{N} \sum_{i=1}^{m} \pi_i\, (2\pi)^{-1/2} \exp\left\{ -\tfrac{1}{2}(x_l - \mu_i)^2 \right\}$

$l(\Theta) = \sum_{l=1}^{N} \log \sum_{i=1}^{m} \pi_i\, (2\pi)^{-1/2} \exp\left\{ -\tfrac{1}{2}(x_l - \mu_i)^2 \right\}$

$\frac{\partial l(\Theta)}{\partial \mu_i} = \sum_{l=1}^{N} h_{il}\, (x_l - \mu_i)$   – easy on-line update
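A sketch of one gradient-ascent step for the means under the assumptions on this slide (1-D data, unit variances, fixed priors); the learning rate alpha is an illustrative choice:

import numpy as np

def grad_ascent_step(x, pi, mu, alpha=0.01):
    """x: (N,) data; pi: (m,) fixed priors; mu: (m,) current means."""
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)  # N(x_l | mu_i, 1)
    h = pi * dens
    h /= h.sum(axis=1, keepdims=True)                       # responsibilities h_il
    grad = (h * (x[:, None] - mu[None, :])).sum(axis=0)     # dl/dmu_i = sum_l h_il (x_l - mu_i)
    return mu + alpha * grad                                # one gradient-ascent update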

Page 28: Expectation Maximization (EM) Mixture of Gaussians

EM versus gradient ascent

Gradient ascent:
$\mu_i \leftarrow \mu_i + \alpha \sum_{l=1}^{N} h_{il}\, (x_l - \mu_i)$
• Needs a learning rate $\alpha$
• Small pull towards distant uncovered data

EM:
$\mu_i \leftarrow \frac{\sum_{l=1}^{N} h_{il}\, x_l}{\sum_{l=1}^{N} h_{il}}$
• No learning rate
• Renormalized – big jump in the first step

Page 29: Expectation Maximization (EM) Mixture of Gaussians

K-means approximation to EM

Mixture of Gaussians with a fixed covariance matrix:
• The posterior measures the responsibility of a Gaussian for every point:
$h_{il} = \frac{p(C_l = i \mid \Theta')\, p(\mathbf{x}_l \mid C_l = i, \Theta')}{\sum_{u=1}^{m} p(C_l = u \mid \Theta')\, p(\mathbf{x}_l \mid C_l = u, \Theta')}$
• Re-estimation of the means:
$\mu_i = \frac{\sum_{l=1}^{N} h_{il}\, \mathbf{x}_l}{\sum_{l=1}^{N} h_{il}}$

K-means approximation (next lecture):
• Only the closest Gaussian is made responsible for a data point:
$h_{il} = 1$ if i is the closest Gaussian, $h_{il} = 0$ otherwise
• This results in moving the means of the Gaussians to the center of the data points they covered in the previous step.

Page 30: Expectation Maximization (EM) Mixture of Gaussians

K-means algorithm

K-means algorithm:
Initialize k values of the means (centers)
Repeat two steps until there is no change in the means:
– Partition the data according to the current means (using the similarity measure)
– Move the means to the center of the data in the current partition

Used frequently for clustering data.
Basic clustering problem:
– distribute the data into k different groups such that data points similar to each other are in the same group
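A minimal sketch of this k-means loop, using Euclidean distance as the similarity measure (the initialization strategy and names are illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """X: (N, d) data; k: number of clusters."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]    # initialize k centers at data points
    for _ in range(n_iter):
        # Partition the data according to the current means (closest center wins)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)   # (N, k)
        assign = dists.argmin(axis=1)
        # Move each mean to the center of the data in its current partition
        new_means = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):                   # stop when the means no longer change
            break
        means = new_means
    return means, assign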