
Page 1: Expectation Maximization (EM) Mixture of Gaussians

CS 2750 Machine Learning, Lecture 18

Milos Hauskrecht
[email protected]
5329 Sennott Square

Expectation Maximization (EM)
Mixture of Gaussians

Page 2: Expectation Maximization (EM) Mixture of Gaussians

Learning probability distributions

Basic learning settings:
• A set of random variables $X = \{X_1, X_2, \ldots, X_n\}$
• A model of the distribution over the variables in X, with parameters $\Theta$
• Data $D = \{D_1, D_2, \ldots, D_N\}$ s.t. $D_i = (x_1^i, x_2^i, \ldots, x_n^i)$
Objective: find parameters $\hat{\Theta}$ that describe p(X) based on D
Assumptions considered so far:
– Known parameterizations
– No hidden variables
– No missing values
This lecture: learning with hidden variables and missing values

Page 3: Expectation Maximization (EM) Mixture of Gaussians

Learning with hidden variables and missing values

Problem: we want to learn $P(X)$ from data D
• Data consists of instances with values assigned to variables
• But sometimes the values are hidden or missing

Data (with '-' marking missing values):
x1 x2 x3 x4 x5
T  T  F  T  T
T  -  T  F  F
F  T  -  F  T

Extended data (hidden variable h: all of its values are missing):
x1 x2 x3 x4 x5 h
T  T  F  T  T  -
T  -  T  F  F  -
F  T  -  F  T  -
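For concreteness, this data setting could be represented as follows (a sketch only; the np.nan encoding and variable names are illustrative, not from the lecture):

import numpy as np

T, F, MISSING = 1.0, 0.0, np.nan
data = np.array([                # columns: x1, x2, x3, x4, x5
    [T, T, F, T, T],
    [T, MISSING, T, F, F],       # one missing attribute value
    [F, T, MISSING, F, T],
])
h = np.full((3, 1), MISSING)            # hidden variable: every value is missing
extended_data = np.hstack([data, h])    # the 'extended data' with the extra column h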

Page 4: Expectation Maximization (EM) Mixture of Gaussians

Density estimation

Goal: find the best set of parameters based on the data.
Estimation criteria:
– ML: $\max_{\Theta} p(D \mid \Theta, \xi)$
– Bayesian: $p(\Theta \mid D, \xi)$

Why are hidden variables and missing values a problem? The parameter estimation for models defined by a set of parameters no longer decomposes into a set of independent estimation tasks.

Approaches for ML estimates: general optimization methods – gradient ascent, conjugate gradient, Newton-Raphson, etc.

An alternative optimization method:
• The Expectation-maximization (EM) method
– Suitable when there are missing or hidden values
– Takes advantage of the structure of the belief network

Page 5: Expectation Maximization (EM) Mixture of Gaussians

General EM

The key idea of the method: compute the parameter estimates iteratively by performing the following two steps:
1. Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations for the current set of parameters $\Theta'$.
2. Maximization step. Compute the new estimates of $\Theta$ by considering the expectations of the different value completions.
Stop when no improvement is possible.

Page 6: Expectation Maximization (EM) Mixture of Gaussians

General EM

(1) Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations for the current set of parameters $\Theta'$ estimating P(x).

x1 x2 x3 x4 x5 h
T  T  F  T  T  -
T  -  T  F  F  -
F  T  -  F  T  -

Calculate the expectations for all unknown value assignments using the current parameters $\Theta'$, e.g.:
$P(h^{(1)} = T \mid D^{(1)}, \Theta')$,  $P(h^{(1)} = F \mid D^{(1)}, \Theta')$
$P(x_2^{(2)} = T, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = T, h^{(2)} = F \mid D^{(2)}, \Theta')$
$P(x_2^{(2)} = F, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = F, h^{(2)} = F \mid D^{(2)}, \Theta')$


Page 8: Expectation Maximization (EM) Mixture of Gaussians

General EM

(2) Maximization step. Compute the new, improved estimates of the parameters $\Theta$ by considering the expectations of the different value completions computed under the current parameters $\Theta'$:
$P(h^{(1)} = T \mid D^{(1)}, \Theta')$,  $P(h^{(1)} = F \mid D^{(1)}, \Theta')$
$P(x_2^{(2)} = T, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = T, h^{(2)} = F \mid D^{(2)}, \Theta')$
$P(x_2^{(2)} = F, h^{(2)} = T \mid D^{(2)}, \Theta')$,  $P(x_2^{(2)} = F, h^{(2)} = F \mid D^{(2)}, \Theta')$

x1 x2 x3 x4 x5 h
T  T  F  T  T  -
T  -  T  F  F  -
F  T  -  F  T  -

Page 9: Expectation Maximization (EM) Mixture of Gaussians

EM details

Let H be the set of hidden or missing variable values in the data.
Derivation:
$P(H, D \mid \Theta, \xi) = P(H \mid D, \Theta, \xi)\, P(D \mid \Theta, \xi)$
$\log P(H, D \mid \Theta, \xi) = \log P(H \mid D, \Theta, \xi) + \log P(D \mid \Theta, \xi)$
$\log P(D \mid \Theta, \xi) = \log P(H, D \mid \Theta, \xi) - \log P(H \mid D, \Theta, \xi)$
Average both sides with respect to $P(H \mid D, \Theta', \xi)$ for some $\Theta'$; the left-hand side (the log-likelihood of the data) does not depend on H, so averaging leaves it unchanged:
$E_{H \mid D,\Theta'} \log P(D \mid \Theta, \xi) = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi) - E_{H \mid D,\Theta'} \log P(H \mid D, \Theta, \xi)$
$\log P(D \mid \Theta, \xi) = Q(\Theta \mid \Theta') + H(\Theta \mid \Theta')$

Page 10: Expectation Maximization (EM) Mixture of Gaussians

EM algorithm

Algorithm (general formulation):
Initialize the parameters $\Theta$
Repeat:
  Set $\Theta' = \Theta$
  1. Expectation step: compute $Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
  2. Maximization step: $\Theta = \arg\max_{\Theta} Q(\Theta \mid \Theta')$
until no or only a small improvement in $Q$

Questions:
• Why does this lead to the ML estimate?
• What is the advantage of the algorithm?
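A minimal sketch of this generic loop in Python; e_step, m_step, and log_likelihood are hypothetical problem-specific callables (not defined in the lecture), so this is only the skeleton of the iteration:

# Generic EM loop skeleton (illustrative; callables are problem-specific placeholders).
def em(theta, data, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    prev_ll = float("-inf")
    for _ in range(max_iter):
        expectations = e_step(theta, data)     # E-step: expected completions under current theta
        theta = m_step(expectations, data)     # M-step: theta = argmax Q(theta | theta')
        ll = log_likelihood(theta, data)
        if ll - prev_ll < tol:                 # stop when no or only a small improvement
            break
        prev_ll = ll
    return theta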

Page 11: Expectation Maximization (EM) Mixture of Gaussians

EM algorithm

Question: Why is the EM algorithm correct?
• Claim: maximizing Q improves the log-likelihood.

$l(\Theta) = Q(\Theta \mid \Theta') + H(\Theta \mid \Theta')$

Difference in log-likelihoods (next and current step):
$l(\Theta) - l(\Theta') = Q(\Theta \mid \Theta') - Q(\Theta' \mid \Theta') + H(\Theta \mid \Theta') - H(\Theta' \mid \Theta')$

where
$H(\Theta \mid \Theta') = -E_{H \mid D,\Theta'} \log P(H \mid D, \Theta, \xi) = -\sum_{i} P(H_i \mid D, \Theta', \xi) \log P(H_i \mid D, \Theta, \xi)$
(the sum runs over the possible completions $H_i$ of the hidden values). Then
$H(\Theta \mid \Theta') - H(\Theta' \mid \Theta') = \sum_{i} P(H_i \mid D, \Theta', \xi) \log \frac{P(H_i \mid D, \Theta', \xi)}{P(H_i \mid D, \Theta, \xi)} \ge 0$
This is the Kullback-Leibler (KL) divergence, a "distance" between two distributions:
$KL(P \,\|\, R) = \sum_i P_i \log \frac{P_i}{R_i} \ge 0$ – it is always nonnegative.
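A quick numeric illustration of the nonnegativity of the KL divergence for two discrete distributions (the values below are arbitrary, chosen only for the example):

import numpy as np

P = np.array([0.5, 0.3, 0.2])
R = np.array([0.2, 0.5, 0.3])
kl = np.sum(P * np.log(P / R))   # KL(P || R) = sum_i P_i log(P_i / R_i)
print(kl)                        # positive; it is 0 only when P == R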

Page 12: Expectation Maximization (EM) Mixture of Gaussians

EM algorithm

Difference in log-likelihoods:
$l(\Theta) - l(\Theta') = Q(\Theta \mid \Theta') - Q(\Theta' \mid \Theta') + H(\Theta \mid \Theta') - H(\Theta' \mid \Theta')$
and, since the difference of the H-terms is nonnegative,
$l(\Theta) - l(\Theta') \ge Q(\Theta \mid \Theta') - Q(\Theta' \mid \Theta')$

Thus, by maximizing Q we improve (or at least never decrease) the log-likelihood.

EM is a first-order optimization procedure:
• Climbs the gradient
• Automatic learning rate – no need to adjust a learning rate!

Page 13: Expectation Maximization (EM) Mixture of Gaussians

EM advantages

Key advantages:
• In many problems (e.g. Bayesian belief networks) $Q(\Theta \mid \Theta')$ has a nice form and its maximization can be carried out in closed form.
• No need to compute Q explicitly before maximizing it – we directly optimize
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
using quantities corresponding to expected counts.

Page 14: Expectation Maximization (EM) Mixture of Gaussians

Naïve Bayes with a hidden class and missing values

Assume:
• $P(X)$ is modeled using a Naïve Bayes model with a hidden class variable C
• Missing entries (values) for attributes in the dataset D

Structure: the hidden class variable C is the parent of the attributes $X_1, X_2, \ldots, X_n$; the attributes are independent given the class.

Page 15: Expectation Maximization (EM) Mixture of Gaussians

EM for the Naïve Bayes model

• Use EM to iteratively optimize
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
• Parameters:
– $\pi_j$ – prior probability of class j
– $\theta_{ijk}$ – probability of attribute i having value k given class j
• Indicator variables:
– $\delta_{lj}$ – for example l, the class is j; 1 if true, 0 otherwise
– $\delta_{lijk}$ – for example l, the class is j and the value of attribute i is k
• Because the class is hidden and some attributes are missing, the 0/1 values of the indicator variables are not known; they are hidden.
• H – the collection of all indicator variables (they 'complete' the data)

Page 16: Expectation Maximization (EM) Mixture of Gaussians

EM for the Naïve Bayes model

• We can use EM to iteratively learn the parameters of the model by optimizing
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$

Complete log-likelihood:
$\log P(H, D \mid \Theta, \xi) = \log \prod_{l=1}^{N} \prod_{j} \pi_j^{\delta_{lj}} \prod_{i} \prod_{k} \theta_{ijk}^{\delta_{lijk}} = \sum_{l=1}^{N} \sum_{j} \Big( \delta_{lj} \log \pi_j + \sum_{i} \sum_{k} \delta_{lijk} \log \theta_{ijk} \Big)$

Expectation over H:
$E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi) = \sum_{l=1}^{N} \sum_{j} \Big( E_{H \mid D,\Theta'}(\delta_{lj}) \log \pi_j + \sum_{i} \sum_{k} E_{H \mid D,\Theta'}(\delta_{lijk}) \log \theta_{ijk} \Big)$

The expectations substitute the unknown 0/1 indicator values with their expected values:
$E_{H \mid D,\Theta'}(\delta_{lj}) = p(C_l = j \mid D_l, \Theta')$
$E_{H \mid D,\Theta'}(\delta_{lijk}) = p(C_l = j, X_{il} = k \mid D_l, \Theta')$

Page 17: Expectation Maximization (EM) Mixture of Gaussians

EM for the Naïve Bayes model

• Computing the derivatives of Q with respect to the parameters and setting them to 0 gives:
$\tilde{\pi}_j = \frac{\tilde{N}_j}{N}$
$\tilde{\theta}_{ijk} = \frac{\tilde{N}_{ijk}}{\sum_{k=1}^{r_i} \tilde{N}_{ijk}}$
with the expected counts
$\tilde{N}_j = \sum_{l=1}^{N} E_{H \mid D,\Theta'}(\delta_{lj}) = \sum_{l=1}^{N} p(C_l = j \mid D_l, \Theta')$
$\tilde{N}_{ijk} = \sum_{l=1}^{N} E_{H \mid D,\Theta'}(\delta_{lijk}) = \sum_{l=1}^{N} p(C_l = j, X_{il} = k \mid D_l, \Theta')$

• Important:
– Use expected counts instead of counts!
– Re-estimate the parameters using the expected counts.
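A minimal sketch of these E- and M-steps for a Naïve Bayes model with a hidden class, assuming binary attributes and no missing attribute values (simplifications made here for brevity, not stated in the lecture); all names and the initialization are illustrative:

import numpy as np

def em_naive_bayes(X, m, n_iter=50, seed=0):
    """X: (N, n) array of 0/1 attribute values; m: number of hidden classes."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    pi = np.full(m, 1.0 / m)                      # class priors pi_j
    theta = rng.uniform(0.25, 0.75, size=(m, n))  # theta[j, i] = P(X_i = 1 | C = j)
    for _ in range(n_iter):
        # E-step: h[l, j] = p(C_l = j | D_l, theta')  (expected value of delta_lj)
        log_p = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        h = np.exp(log_p)
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from expected counts
        Nj = h.sum(axis=0)                        # expected class counts ~N_j
        pi = Nj / N
        theta = (h.T @ X) / Nj[:, None]           # expected count of (C=j, X_i=1) / ~N_j
        theta = np.clip(theta, 1e-6, 1 - 1e-6)    # avoid log(0)
    return pi, theta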

Page 18: Expectation Maximization (EM) Mixture of Gaussians

EM for BBNs

• The same result applies to learning the parameters of any Bayesian belief network with discrete-valued variables, where
$Q(\Theta \mid \Theta') = E_{H \mid D,\Theta'} \log P(H, D \mid \Theta, \xi)$
• Again: use expected counts instead of counts,
$\tilde{N}_{ijk} = \sum_{l=1}^{N} p(x_i = k, pa_i = j \mid D_l, \Theta')$   (computing these may require inference)
and the parameter value maximizing Q is
$\tilde{\theta}_{ijk} = \frac{\tilde{N}_{ijk}}{\sum_{k=1}^{r_i} \tilde{N}_{ijk}}$

Page 19: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

[Figure: scatter plot of 2-D data points, both axes roughly from -2 to 2.5]

Goal: we want a model of p(x).
Problem: a single multidimensional Gaussian is not a good fit to this data.

Page 20: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

[Figure: the same 2-D data set]

Goal: we want a density model of p(x).
Problem: a single multidimensional Gaussian is not a good fit to this data, but three different Gaussians may do well.

Page 21: Expectation Maximization (EM) Mixture of Gaussians

Recall QDA

Generative classifier model:
• Models p(x, C) as p(x | C) p(C)
• Class labels are known

Model:
$p(C = i)$ – probability of a data instance coming from class C = i
$p(\mathbf{x} \mid C = i) \approx N(\mu_i, \Sigma_i)$ – class-conditional density (modeled as a Gaussian) for class i

Page 22: Expectation Maximization (EM) Mixture of Gaussians

Recall QDA

Generative classifier model:
• Models p(x, C) as p(x | C) p(C)
• Class labels are known

The ML estimates of the parameters:
$\tilde{\pi}_i = \frac{N_i}{N}$
$\tilde{\mu}_i = \frac{1}{N_i} \sum_{l:\, C_l = i} \mathbf{x}_l$
$\tilde{\Sigma}_i = \frac{1}{N_i} \sum_{l:\, C_l = i} (\mathbf{x}_l - \tilde{\mu}_i)(\mathbf{x}_l - \tilde{\mu}_i)^T$
where $N_i$ is the number of instances with class label i.
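A minimal numpy sketch of these ML estimates when the class labels are known (function and variable names are illustrative; it assumes every class index appears in the labels):

import numpy as np

def qda_ml_estimates(X, y, m):
    """X: (N, d) data, y: (N,) integer class labels in {0, ..., m-1}."""
    N, d = X.shape
    pi = np.zeros(m); mu = np.zeros((m, d)); Sigma = np.zeros((m, d, d))
    for i in range(m):
        Xi = X[y == i]                     # instances of class i
        Ni = len(Xi)
        pi[i] = Ni / N                     # prior ~pi_i = N_i / N
        mu[i] = Xi.mean(axis=0)            # class mean ~mu_i
        diff = Xi - mu[i]
        Sigma[i] = diff.T @ diff / Ni      # ML covariance (divides by N_i)
    return pi, mu, Sigma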

Page 23: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

The model of p(x) is based on the model of p(x, C):
• Class labels are not known

The same model as QDA:
$p(\mathbf{x}) = \sum_{i=1}^{k} p(C = i)\, p(\mathbf{x} \mid C = i)$
$p(C = i)$ – probability of a data point coming from class C = i
$p(\mathbf{x} \mid C = i) \approx N(\mu_i, \Sigma_i)$ – class-conditional density (modeled as a Gaussian) for class i

Important: the values of C are hidden!
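A small sketch of evaluating this mixture density p(x), assuming scipy is available (the function name and parameter layout are illustrative):

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, pi, mu, Sigma):
    """x: (d,) point; pi: (m,) mixing weights; mu: (m, d) means; Sigma: (m, d, d) covariances."""
    return sum(pi[i] * multivariate_normal.pdf(x, mean=mu[i], cov=Sigma[i])
               for i in range(len(pi)))   # p(x) = sum_i p(C=i) N(x | mu_i, Sigma_i)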

Page 24: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model

Gaussian mixture: we do not know which Gaussian generated x.
• We can apply the EM algorithm: re-estimation based on the class posterior.

Class posterior (responsibility):
$h_{il} = p(C_l = i \mid \mathbf{x}_l, \Theta') = \frac{p(C_l = i \mid \Theta')\, p(\mathbf{x}_l \mid C_l = i, \Theta')}{\sum_{u=1}^{m} p(C_l = u \mid \Theta')\, p(\mathbf{x}_l \mid C_l = u, \Theta')}$

Re-estimation (the counts are replaced with expected counts):
$\tilde{N}_i = \sum_{l} h_{il}$
$\tilde{\pi}_i = \frac{\tilde{N}_i}{N}$
$\tilde{\mu}_i = \frac{1}{\tilde{N}_i} \sum_{l} h_{il}\, \mathbf{x}_l$
$\tilde{\Sigma}_i = \frac{1}{\tilde{N}_i} \sum_{l} h_{il}\, (\mathbf{x}_l - \tilde{\mu}_i)(\mathbf{x}_l - \tilde{\mu}_i)^T$
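A minimal sketch of the full EM loop for the Gaussian mixture, following the re-estimation equations above; the initialization and the small covariance regularization are illustrative choices, not prescribed by the lecture:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, m, n_iter=100, seed=0):
    """X: (N, d) data; m: number of Gaussians."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(m, 1.0 / m)
    mu = X[rng.choice(N, size=m, replace=False)]           # initialize means at random data points
    Sigma = np.array([np.eye(d)] * m)
    for _ in range(n_iter):
        # E-step: responsibilities h[l, i] = p(C_l = i | x_l, theta')
        dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
                                for i in range(m)])        # (N, m)
        h = pi * dens
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters with expected counts
        Ni = h.sum(axis=0)                                 # ~N_i
        pi = Ni / N
        mu = (h.T @ X) / Ni[:, None]
        for i in range(m):
            diff = X - mu[i]
            Sigma[i] = (h[:, i, None] * diff).T @ diff / Ni[i] + 1e-6 * np.eye(d)
    return pi, mu, Sigma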

Page 25: Expectation Maximization (EM) Mixture of Gaussians


Gaussian mixture

• Density function p(x) for the Mixture of Gaussians model

Page 26: Expectation Maximization (EM) Mixture of Gaussians

GM algorithm

Assume a special case of the GM: a fixed covariance matrix shared by all hidden groups and a uniform prior on the groups.
• Algorithm:
Initialize the means for all classes i
Repeat two steps until there is no change in the means:
1. Compute the class posterior for each Gaussian and each point (a kind of responsibility of a Gaussian for a point):
$h_{il} = \frac{p(C_l = i \mid \Theta')\, p(\mathbf{x}_l \mid C_l = i, \Theta')}{\sum_{u=1}^{m} p(C_l = u \mid \Theta')\, p(\mathbf{x}_l \mid C_l = u, \Theta')}$
2. Move the means of the Gaussians to the center of the data, weighted by the responsibilities:
$\mu_i = \frac{\sum_{l=1}^{N} h_{il}\, \mathbf{x}_l}{\sum_{l=1}^{N} h_{il}}$

Page 27: Expectation Maximization (EM) Mixture of Gaussians

Gaussian mixture model: Gradient ascent

• A set of parameters $\Theta = \{\pi_1, \pi_2, \ldots, \pi_m, \mu_1, \mu_2, \ldots, \mu_m\}$ for the model C → x with p(C) and p(x | C)
• Assume unit variance terms and fixed priors

$P(x \mid C = i) = (2\pi)^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_i)^2 \right\}$

$P(D \mid \Theta) = \prod_{l=1}^{N} \sum_{i=1}^{m} \pi_i\, (2\pi)^{-1/2} \exp\left\{ -\tfrac{1}{2}(x_l - \mu_i)^2 \right\}$

$l(\Theta) = \sum_{l=1}^{N} \log \sum_{i=1}^{m} \pi_i\, (2\pi)^{-1/2} \exp\left\{ -\tfrac{1}{2}(x_l - \mu_i)^2 \right\}$

$\frac{\partial l(\Theta)}{\partial \mu_i} = \sum_{l=1}^{N} h_{il}\, (x_l - \mu_i)$   – easy on-line update
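A sketch of one gradient-ascent step for the means under the assumptions on this slide (1-D data, unit variances, fixed priors); the learning rate alpha is an illustrative choice:

import numpy as np

def grad_ascent_step(x, pi, mu, alpha=0.01):
    """x: (N,) data; pi: (m,) fixed priors; mu: (m,) current means."""
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)  # N(x_l | mu_i, 1)
    h = pi * dens
    h /= h.sum(axis=1, keepdims=True)                       # responsibilities h_il
    grad = (h * (x[:, None] - mu[None, :])).sum(axis=0)     # dl/dmu_i = sum_l h_il (x_l - mu_i)
    return mu + alpha * grad                                # one gradient-ascent update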

Page 28: Expectation Maximization (EM) Mixture of Gaussians

EM versus gradient ascent

Gradient ascent:
$\mu_i \leftarrow \mu_i + \alpha \sum_{l=1}^{N} h_{il}\, (x_l - \mu_i)$
• Needs a learning rate $\alpha$
• Small pull towards distant uncovered data

EM:
$\mu_i \leftarrow \frac{\sum_{l=1}^{N} h_{il}\, x_l}{\sum_{l=1}^{N} h_{il}}$
• No learning rate
• Renormalized – big jump in the first step

Page 29: Expectation Maximization (EM) Mixture of Gaussians

K-means approximation to EM

Mixture of Gaussians with a fixed covariance matrix:
• The posterior measures the responsibility of a Gaussian for every point:
$h_{il} = \frac{p(C_l = i \mid \Theta')\, p(\mathbf{x}_l \mid C_l = i, \Theta')}{\sum_{u=1}^{m} p(C_l = u \mid \Theta')\, p(\mathbf{x}_l \mid C_l = u, \Theta')}$
• Re-estimation of the means:
$\mu_i = \frac{\sum_{l=1}^{N} h_{il}\, \mathbf{x}_l}{\sum_{l=1}^{N} h_{il}}$

K-means approximation (next lecture):
• Only the closest Gaussian is made responsible for a data point:
$h_{il} = 1$ if i is the closest Gaussian, $h_{il} = 0$ otherwise
• This results in moving the means of the Gaussians to the center of the data points they covered in the previous step.

Page 30: Expectation Maximization (EM) Mixture of Gaussians

K-means algorithm

K-means algorithm:
Initialize k values of the means (centers)
Repeat two steps until there is no change in the means:
– Partition the data according to the current means (using the similarity measure)
– Move the means to the center of the data in the current partition

Used frequently for clustering data.
Basic clustering problem:
– distribute the data into k different groups such that data points similar to each other are in the same group
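A minimal sketch of this k-means loop, using Euclidean distance as the similarity measure (the initialization strategy and names are illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """X: (N, d) data; k: number of clusters."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]    # initialize k centers at data points
    for _ in range(n_iter):
        # Partition the data according to the current means (closest center wins)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)   # (N, k)
        assign = dists.argmin(axis=1)
        # Move each mean to the center of the data in its current partition
        new_means = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):                   # stop when the means no longer change
            break
        means = new_means
    return means, assign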