Machine Learning for Data Mining: Expectation Maximization
Andres Mendez-Vazquez
July 12, 2015
Outline
1 Introduction
   Maximum-Likelihood
2 Incomplete Data
   Analogy
3 Derivation of the EM-Algorithm
   Hidden Features
   Using Convex Functions
   The Final Algorithm
   Notes and Convergence of EM
4 Finding Maximum Likelihood Mixture Densities
   Bayes' Rule for the Components
   Maximizing Q
   Gaussian Distributions
   The EM Algorithm
Maximum-Likelihood

We have a density function p(x|Θ). Assume that we have a data set of size N, X = {x_1, x_2, ..., x_N}, drawn from this density. This data is known as the evidence.

We assume in addition that the vectors are independent and identically distributed (i.i.d.) with distribution p under the parameter set Θ.
What Can We Do With The Evidence?

We may use Bayes' Rule to estimate the parameters Θ:

p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (1)

Or, given a new observation x̃, we may compute

p(x̃|X)    (2)

i.e., the probability of the new observation being supported by the evidence X.

Thus, the former represents parameter estimation and the latter data prediction.
Focusing First on the Estimation of the Parameters Θ

We can interpret Bayes' Rule

p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (3)

as

posterior = (likelihood × prior) / evidence    (4)

Thus, the quantity we want is the likelihood, P(X|Θ).
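The posterior = (likelihood × prior) / evidence decomposition can be sketched numerically. The coin-bias model, the two candidate parameter values, and the observation of three heads below are illustrative assumptions, not taken from the slides.

```python
# Hedged sketch of Bayes' Rule over a discrete set of parameter values.
def posterior(likelihoods, priors):
    """Return normalized posteriors P(theta|X) from P(X|theta) and P(theta)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # P(X)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Two hypothetical coin biases; observing 3 heads gives P(X|theta) = theta**3.
thetas = [0.5, 0.8]
likelihoods = [t ** 3 for t in thetas]
priors = [0.5, 0.5]  # uniform prior P(theta)
post = posterior(likelihoods, priors)
```

The posteriors sum to one by construction, and the parameter value under which the data is more likely receives the larger posterior mass.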
What we want...

We want to maximize the likelihood as a function of Θ.
Maximum-Likelihood

We have

p(x_1, x_2, ..., x_N |Θ) = ∏_{i=1}^{N} p(x_i|Θ)    (5)

also known as the likelihood function.

Because multiplying many numbers less than one can be numerically problematic, we work with the log-likelihood instead:

L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log p(x_i|Θ)    (6)
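The switch from the product in Eq. (5) to the sum of logs in Eq. (6) can be sketched as follows; the unit-variance Gaussian model and the data values are illustrative assumptions.

```python
import math

# Hedged sketch: log-likelihood of i.i.d. data under a unit-variance
# Gaussian with mean theta (an assumed model, not fixed by the slides).
def log_likelihood(data, theta):
    # L(theta|X) = sum_i log p(x_i|theta), avoiding the underflow-prone product
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2
               for x in data)

data = [0.9, 1.1, 1.0]
ll = log_likelihood(data, 1.0)
```

For large N the raw product of densities underflows to zero in floating point, while the sum of logs stays well-scaled; the log transform does not change the maximizer because log is strictly increasing.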
Maximum-Likelihood

We want to find a Θ*:

Θ* = argmax_Θ L(Θ|X)    (7)

The classic method is to solve

∂L(Θ|X)/∂θ_i = 0  ∀θ_i ∈ Θ    (8)
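As a minimal instance of Eq. (8): for a unit-variance Gaussian with unknown mean, setting the derivative of the log-likelihood to zero yields the sample mean in closed form. The model choice is an assumption for illustration.

```python
# Hedged sketch: for a unit-variance Gaussian with unknown mean theta,
# dL/dtheta = sum_i (x_i - theta) = 0  =>  theta* = sample mean.
def mle_mean(data):
    return sum(data) / len(data)

data = [1.0, 2.0, 3.0]
theta_star = mle_mean(data)
```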
What happens if we have incomplete data?

The data could have been split into:
1 X = observed data, or incomplete data
2 Y = unobserved data

For this type of problem we have the famous Expectation Maximization (EM) algorithm.
The Expectation Maximization

The EM algorithm was first developed by Dempster et al. (1977).

Its popularity comes from the fact that it can estimate an underlying distribution when data is incomplete or has missing values.

Two main applications:
1 When missing values exist.
2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden.
Incomplete Data

We assume the data has two parts:
1 X = observed data, or incomplete data
2 Y = unobserved data

Thus

Z = (X, Y) = complete data    (9)

and we have the following probability:

p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ)    (10)
New Likelihood Function

The new likelihood function is

L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ)    (11)

Note: this is the complete-data likelihood. Thus, we have

L(Θ|X, Y) = p(X, Y|Θ) = p(Y|X, Θ) p(X|Θ)    (12)

Did you notice?
- p(X|Θ) is the likelihood of the observed data.
- p(Y|X, Θ) is the likelihood of the unobserved data given the observed data.
Rewriting

This can be rewritten as

L(Θ|X, Y) = h_{X,Θ}(Y)    (13)

which signifies that X and Θ are constant and the only random part is Y.

In addition,

L(Θ|X)    (14)

is known as the incomplete-data likelihood function.
Thus

We can connect the incomplete- and complete-data likelihoods as follows:

L(Θ|X) = p(X|Θ)
       = ∑_Y p(X, Y|Θ)
       = ∑_Y p(Y|X, Θ) p(X|Θ)
       = (∏_{i=1}^{N} p(x_i|Θ)) ∑_Y p(Y|X, Θ)
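The marginalization L(Θ|X) = ∑_Y p(X, Y|Θ) can be sketched for a single point whose hidden component label y is summed out. The two-component Gaussian mixture and its parameters below are illustrative assumptions.

```python
import math

# Hedged sketch: the incomplete-data likelihood of one observation x is
# obtained by summing the complete-data likelihood over the hidden label y:
#   p(x|Theta) = sum_y p(y|Theta) p(x|y, Theta)
def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def incomplete_likelihood(x, weights, mus, sigmas):
    # Each term is p(x, y|Theta) = p(y|Theta) p(x|y, Theta)
    return sum(w * gauss(x, m, s) for w, m, s in zip(weights, mus, sigmas))
```

For equal weights and components placed symmetrically around x, both hidden labels contribute the same mass, so the marginal equals a single component's density.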
Remarks

Problem: normally it is almost impossible to obtain a closed analytical solution for the previous equation.

However, we can use the expected value of log p(X, Y|Θ), which allows us to find an iterative procedure to approximate the solution.
The function we would like to have...

The Q function: we want an estimate of the complete-data log-likelihood

log p(X, Y|Θ)    (15)

based on the information provided by X and Θ_{n-1}, where Θ_{n-1} is the set of parameters estimated at the previous step.

Think about the following: if we want to remove Y,

∫ [log p(X, Y|Θ)] p(Y|X, Θ_{n-1}) dY    (16)

Remark: we integrate out Y; this is exactly the expected value of log p(X, Y|Θ).
Use the Expected value

Then, we want

Q(Θ, Θ_{n-1}) = E[log p(X, Y|Θ) | X, Θ_{n-1}]    (17)

Take into account that:
1 X and Θ_{n-1} are taken as constants.
2 Θ is a normal variable that we wish to adjust.
3 Y is a random variable governed by the distribution p(Y|X, Θ_{n-1}), the marginal distribution of the missing data.
Another Interpretation

Given the previous information,

E[log p(X, Y|Θ) | X, Θ_{n-1}] = ∫_Y log p(X, Y|Θ) p(Y|X, Θ_{n-1}) dY

Something notable:
1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameters Θ_{n-1}.
2 In the worst of cases, this density might be very hard to obtain.

Actually, we use

p(Y, X|Θ_{n-1}) = p(Y|X, Θ_{n-1}) p(X|Θ_{n-1})    (18)

which does not depend on Θ.
Back to the Q function

The intuition: we have the following analogy. Consider a function h(θ, Y), where
- θ is a constant,
- Y ~ p_Y(y) is a random variable with distribution p_Y(y).

Thus, if Y is a discrete random variable,

q(θ) = E_Y[h(θ, Y)] = ∑_y h(θ, y) p_Y(y)    (19)
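Equation (19) can be sketched directly for a discrete Y; the function h and the distribution p_Y below are illustrative assumptions.

```python
# Hedged sketch of q(theta) = E_Y[h(theta, Y)] = sum_y h(theta, y) p_Y(y)
# for a discrete random variable Y with finite support.
def expected_h(h, theta, support, p_y):
    return sum(h(theta, y) * p for y, p in zip(support, p_y))

# Illustrative h and p_Y: Y in {0, 1} with equal probability.
h = lambda theta, y: (theta - y) ** 2
q = expected_h(h, 0.0, [0, 1], [0.5, 0.5])  # 0.5*0 + 0.5*1 = 0.5
```

This is exactly the shape of the E-step: θ is held as a free variable while the randomness in Y is averaged out under its distribution.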
Why E-step!!!

From here comes the name: this is basically the E-step.

The second step tries to maximize the Q function:

Θ_n = argmax_Θ Q(Θ, Θ_{n-1})    (20)
Derivation of the EM-Algorithm

The likelihood function we are going to use: let X be a random vector which results from a parametrized family, with

L(Θ) = ln P(X|Θ)    (21)

Note: ln(x) is a strictly increasing function.

We wish to compute an estimate Θ, based on the current estimate Θ_n (after the nth iteration), such that L(Θ) > L(Θ_n). Equivalently, we maximize the difference

L(Θ) − L(Θ_n) = ln P(X|Θ) − ln P(X|Θ_n)    (22)
Introducing the Hidden Features

Given that the hidden random vector Y exists with values y,

P(X|Θ) = ∑_y P(X|y, Θ) P(y|Θ)    (23)

Thus

L(Θ) − L(Θ_n) = ln(∑_y P(X|y, Θ) P(y|Θ)) − ln P(X|Θ_n)    (24)
Here, we introduce some concepts of convexity

Theorem (Jensen's inequality). Let f be a convex function defined on an interval I. If x_1, x_2, ..., x_n ∈ I and λ_1, λ_2, ..., λ_n ≥ 0 with ∑_{i=1}^{n} λ_i = 1, then

f(∑_{i=1}^{n} λ_i x_i) ≤ ∑_{i=1}^{n} λ_i f(x_i)    (25)
Proof:

For n = 1 we have the trivial case, and for n = 2 the statement is exactly the definition of convexity. For the inductive hypothesis, we assume that the theorem is true for some n.
Now, we have

The following linear combination for the λ_i:

f(∑_{i=1}^{n+1} λ_i x_i) = f(λ_{n+1} x_{n+1} + ∑_{i=1}^{n} λ_i x_i)
 = f(λ_{n+1} x_{n+1} + (1 − λ_{n+1}) · (1/(1 − λ_{n+1})) ∑_{i=1}^{n} λ_i x_i)
 ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f((1/(1 − λ_{n+1})) ∑_{i=1}^{n} λ_i x_i)
Did you notice?

Something notable:

∑_{i=1}^{n+1} λ_i = 1

Thus

∑_{i=1}^{n} λ_i = 1 − λ_{n+1}

Finally

(1/(1 − λ_{n+1})) ∑_{i=1}^{n} λ_i = 1
Now

We have that

f(∑_{i=1}^{n+1} λ_i x_i) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f((1/(1 − λ_{n+1})) ∑_{i=1}^{n} λ_i x_i)
 ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) · (1/(1 − λ_{n+1})) ∑_{i=1}^{n} λ_i f(x_i)
 = λ_{n+1} f(x_{n+1}) + ∑_{i=1}^{n} λ_i f(x_i)    Q.E.D.

where the second step applies the inductive hypothesis to the normalized weights λ_i/(1 − λ_{n+1}), and the last step is the cancellation of (1 − λ_{n+1}) with its reciprocal.
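A quick numeric check of Jensen's inequality in the concave form used next, ln(∑ λ_i x_i) ≥ ∑ λ_i ln(x_i); the weights and points below are arbitrary illustrative values.

```python
import math

# Jensen's inequality for the concave ln: the log of a convex combination
# dominates the convex combination of logs.
lam = [0.2, 0.3, 0.5]          # lambda_i >= 0, summing to 1
xs = [1.0, 2.0, 4.0]           # points in the domain of ln
lhs = math.log(sum(l * x for l, x in zip(lam, xs)))   # ln(sum lambda_i x_i)
rhs = sum(l * math.log(x) for l, x in zip(lam, xs))   # sum lambda_i ln(x_i)
```

With strictly different x_i and non-degenerate weights the inequality is strict, which is what makes the EM lower bound informative.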
Thus...

It is possible to show that, since ln(x) is a concave function, ln(∑_{i=1}^{n} λ_i x_i) ≥ ∑_{i=1}^{n} λ_i ln(x_i).

Now take λ_i = P(y|X, Θ_n). We know that:
1 P(y|X, Θ_n) ≥ 0
2 ∑_y P(y|X, Θ_n) = 1
We have then...

First,

L(Θ) − L(Θ_n) = ln(∑_y P(X|y, Θ) P(y|Θ)) − ln P(X|Θ_n)
 = ln(∑_y P(X|y, Θ) P(y|Θ) · P(y|X, Θ_n)/P(y|X, Θ_n)) − ln P(X|Θ_n)
 = ln(∑_y P(y|X, Θ_n) · P(X|y, Θ) P(y|Θ)/P(y|X, Θ_n)) − ln P(X|Θ_n)
 ≥ ∑_y P(y|X, Θ_n) ln(P(X|y, Θ) P(y|Θ)/P(y|X, Θ_n)) − ∑_y P(y|X, Θ_n) ln P(X|Θ_n)

Why can we multiply the last term by ∑_y P(y|X, Θ_n)? Because that sum equals 1.
Next

Because ∑_y P(y|X, Θ_n) = 1, we obtain

L(Θ) − L(Θ_n) ≥ ∑_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)) ) ≜ Δ(Θ|Θ_n)
Then, we have

L(Θ) ≥ L(Θ_n) + Δ(Θ|Θ_n)    (26)

We then define

l(Θ|Θ_n) ≜ L(Θ_n) + Δ(Θ|Θ_n)    (27)

Thus l(Θ|Θ_n) is bounded from above by L(Θ), i.e., l(Θ|Θ_n) ≤ L(Θ).
Now, we can do the following

We have

l(Θ_n|Θ_n) = L(Θ_n) + Δ(Θ_n|Θ_n)
 = L(Θ_n) + ∑_y P(y|X, Θ_n) ln( P(X|y, Θ_n) P(y|Θ_n) / (P(y|X, Θ_n) P(X|Θ_n)) )
 = L(Θ_n) + ∑_y P(y|X, Θ_n) ln( P(X, y|Θ_n) / P(X, y|Θ_n) )
 = L(Θ_n)

This means that for Θ = Θ_n, the functions L(Θ) and l(Θ|Θ_n) are equal.
What are we looking for...

Thus, we want to select Θ such that l(Θ|Θ_n) is maximized, i.e., given Θ_n, we want to generate Θ_{n+1}. The process can be seen in a graph of L(Θ) together with the lower bound l(Θ|Θ_n) (figure not reproduced here).
Formally, we want

The following:

Θ_{n+1} = argmax_Θ { l(Θ|Θ_n) }
 = argmax_Θ { L(Θ_n) + ∑_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)) ) }
   (the terms involving only Θ_n are constants)
 ≈ argmax_Θ { ∑_y P(y|X, Θ_n) ln( P(X|y, Θ) P(y|Θ) ) }
 = argmax_Θ { ∑_y P(y|X, Θ_n) ln( (P(X, y, Θ)/P(y, Θ)) · (P(y, Θ)/P(Θ)) ) }
Thus

Then

Θ_{n+1} = argmax_Θ { ∑_y P(y|X, Θ_n) ln( (P(X, y, Θ)/P(y, Θ)) · (P(y, Θ)/P(Θ)) ) }
 = argmax_Θ { ∑_y P(y|X, Θ_n) ln( P(X, y, Θ)/P(Θ) ) }
 = argmax_Θ { ∑_y P(y|X, Θ_n) ln P(X, y|Θ) }
 = argmax_Θ { E_{y|X,Θ_n}[ln P(X, y|Θ)] }

Then argmax_Θ { l(Θ|Θ_n) } ≈ argmax_Θ { E_{y|X,Θ_n}[ln P(X, y|Θ)] }.
The EM-Algorithm

Steps of EM:
1 Expectation over the hidden variables.
2 Maximization of the resulting formula.

E-Step: determine the conditional expectation E_{y|X,Θ_n}[ln P(X, y|Θ)].
M-Step: maximize this expression with respect to Θ.
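The E- and M-steps above can be sketched for a two-component 1-D Gaussian mixture with fixed unit variances. All data values and initializations are illustrative assumptions, and the M-step updates are the standard closed forms for this particular model.

```python
import math

def gauss(x, mu):
    # Unit-variance Gaussian density (variance fixed by assumption)
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def em(data, mus, weights, iters=50):
    for _ in range(iters):
        # E-step: responsibilities p(y=k | x_i, Theta_n) via Bayes' Rule
        resp = []
        for x in data:
            joint = [w * gauss(x, m) for w, m in zip(weights, mus)]
            z = sum(joint)
            resp.append([j / z for j in joint])
        # M-step: maximizing Q gives responsibility-weighted means
        # and average responsibilities as the new mixing weights
        nk = [sum(r[k] for r in resp) for k in range(2)]
        mus = [sum(r[k] * x for r, x in zip(resp, data)) / nk[k]
               for k in range(2)]
        weights = [n / len(data) for n in nk]
    return mus, weights

# Two well-separated illustrative clusters around -2 and +2
data = [-2.1, -1.9, -2.0, 1.9, 2.1, 2.0]
mus, weights = em(data, mus=[-1.0, 1.0], weights=[0.5, 0.5])
```

Each iteration provably does not decrease the incomplete-data likelihood, which is the convergence property discussed in the next slides.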
Notes and Convergence of EM

Gains between L(Θ) and l(Θ|Θ_n): using the hidden variables, it is possible to simplify the optimization of L(Θ) through l(Θ|Θ_n).

Convergence:
1 Remember that Θ_{n+1} is the estimate for Θ which maximizes the difference Δ(Θ|Θ_n).
2 Given the initial estimate Θ_n, we have Δ(Θ_n|Θ_n) = 0.
3 Since Θ_{n+1} is chosen to maximize Δ(Θ|Θ_n), it follows that Δ(Θ_{n+1}|Θ_n) ≥ Δ(Θ_n|Θ_n) = 0, so the likelihood L(Θ) is non-decreasing across iterations.
Notes and Convergence of EM

Properties:
- When the algorithm reaches a fixed point for some Θ_n, that value maximizes l(Θ|Θ_n).
- Because L and l are equal at Θ_n, if L and l are differentiable at Θ_n, then Θ_n is a stationary point of L, i.e., the derivative of L vanishes at that point (possibly a saddle point).

For more on this: Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.
Finding Maximum Likelihood Mixture Densities Parameters via EM

Something Notable
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.

We have

p (x|Θ) = ∑_{i=1}^{M} α_i p_i (x|θ_i)   (28)

where
1. Θ = (α_1, ..., α_M, θ_1, ..., θ_M)
2. ∑_{i=1}^{M} α_i = 1
3. Each p_i is a density function parametrized by θ_i.
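Equation 28 can be evaluated directly. This is a minimal sketch for univariate Gaussian components; the helper `norm_pdf`, the weights, and the component parameters are made up for illustration, not taken from the slides:

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    # Univariate Gaussian density used as the component density p_i(x | theta_i)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, alphas, thetas):
    # p(x | Theta) = sum_i alpha_i p_i(x | theta_i), with sum_i alpha_i = 1
    return sum(a * norm_pdf(x, mu, s) for a, (mu, s) in zip(alphas, thetas))

alphas = [0.3, 0.7]                    # mixing weights, must sum to 1
thetas = [(0.0, 1.0), (4.0, 2.0)]      # (mean, std) per component
px = mixture_pdf(1.0, alphas, thetas)
```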
A log-likelihood for this function

We have

log L (Θ|X) = log ∏_{i=1}^{N} p (x_i|Θ) = ∑_{i=1}^{N} log ( ∑_{j=1}^{M} α_j p_j (x_i|θ_j) )   (29)

Note: This is difficult to optimize because the log sits outside the sum over components.

However
We can simplify this by assuming the following:
1. Each sample has an unobserved label: Y = {y_i}_{i=1}^{N} with y_i ∈ {1, ..., M}.
2. y_i = k if the i-th sample was generated by the k-th mixture component.
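The log-likelihood of Equation 29 can be evaluated stably with the log-sum-exp trick. A minimal sketch for univariate Gaussian components; the helper `log_norm_pdf`, the data `X`, and the parameters are made up for illustration:

```python
import numpy as np

def log_norm_pdf(x, mu, sigma):
    # log of a univariate Gaussian density (the component log p_j(x | theta_j))
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def log_likelihood(X, alphas, thetas):
    # log L(Theta | X) = sum_i log sum_j alpha_j p_j(x_i | theta_j),
    # evaluated with the log-sum-exp trick for numerical stability
    total = 0.0
    for x in X:
        terms = np.array([np.log(a) + log_norm_pdf(x, mu, s)
                          for a, (mu, s) in zip(alphas, thetas)])
        m = terms.max()
        total += m + np.log(np.exp(terms - m).sum())
    return total

X = np.array([0.5, 1.2, 3.8, 4.1])
ll = log_likelihood(X, [0.5, 0.5], [(0.0, 1.0), (4.0, 1.0)])
```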
Now
We have

log L (Θ|X, Y) = log [P (X, Y|Θ)]   (30)

Thus

log [P (X, Y|Θ)] = ∑_{i=1}^{N} log [P (x_i|y_i, θ_{y_i}) P (y_i|θ_{y_i})]   (31)

Question: Do you need y_i if you know θ_{y_i}, or the other way around?

Finally

log L (Θ|X, Y) = ∑_{i=1}^{N} log [P (x_i|y_i, θ_{y_i}) P (y_i)] = ∑_{i=1}^{N} log [α_{y_i} p_{y_i} (x_i|θ_{y_i})]   (32)

NOPE: You do not need y_i if you know θ_{y_i}, or the other way around.
Problem

Which?
We do not know the values of Y.

We can get away by using the following idea
Assume that Y is a random variable.
Thus

You make a first guess for the parameters

Θ^g = (α_1^g, ..., α_M^g, θ_1^g, ..., θ_M^g)   (33)

Then, it is possible to calculate p_j (x_i|θ_j^g).

In addition
The mixing parameters α_j can be thought of as the prior probabilities of each mixture component:

α_j = p (component j)   (34)
Using Bayes’ Rule

Compute

p (y_i|x_i, Θ^g) = p (y_i, x_i|Θ^g) / p (x_i|Θ^g)
                = p (x_i|y_i, θ_{y_i}^g) p (y_i|Θ^g) / p (x_i|Θ^g)
                = α_{y_i}^g p_{y_i} (x_i|θ_{y_i}^g) / p (x_i|Θ^g)
                = α_{y_i}^g p_{y_i} (x_i|θ_{y_i}^g) / ∑_{k=1}^{M} α_k^g p_k (x_i|θ_k^g)

In addition

p (y|X, Θ^g) = ∏_{i=1}^{N} p (y_i|x_i, Θ^g)   (35)

where y = (y_1, y_2, ..., y_N).
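The posterior p (y_i|x_i, Θ^g) above — the “responsibility” of each component for each sample — can be computed in a vectorized way. A minimal sketch assuming univariate Gaussian components; the helper `norm_pdf`, the data, and the parameters are made up for illustration:

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    # Univariate Gaussian component density p_l(x | theta_l)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def responsibilities(X, alphas, thetas):
    # p(l | x_i, Theta^g) = alpha_l p_l(x_i | theta_l) / sum_k alpha_k p_k(x_i | theta_k)
    W = np.array([[a * norm_pdf(x, mu, s) for a, (mu, s) in zip(alphas, thetas)]
                  for x in X])
    return W / W.sum(axis=1, keepdims=True)

X = np.array([0.1, 3.9])
R = responsibilities(X, [0.5, 0.5], [(0.0, 1.0), (4.0, 1.0)])
```

Each row of `R` sums to one, since the posterior over component labels is a distribution.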
Now, using equation 17

Then

Q (Θ|Θ^g) = ∑_{y∈Y} log (L (Θ|X, y)) p (y|X, Θ^g)
          = ∑_{y∈Y} ∑_{i=1}^{N} log [α_{y_i} p_{y_i} (x_i|θ_{y_i})] ∏_{j=1}^{N} p (y_j|x_j, Θ^g)
          = ∑_{y_1=1}^{M} ∑_{y_2=1}^{M} ··· ∑_{y_N=1}^{M} ∑_{i=1}^{N} log [α_{y_i} p_{y_i} (x_i|θ_{y_i})] ∏_{j=1}^{N} p (y_j|x_j, Θ^g)
          = ∑_{y_1=1}^{M} ∑_{y_2=1}^{M} ··· ∑_{y_N=1}^{M} ∑_{i=1}^{N} ∑_{l=1}^{M} δ_{l,y_i} log [α_l p_l (x_i|θ_l)] ∏_{j=1}^{N} p (y_j|x_j, Θ^g)
          = ∑_{i=1}^{N} ∑_{l=1}^{M} log [α_l p_l (x_i|θ_l)] ∑_{y_1=1}^{M} ∑_{y_2=1}^{M} ··· ∑_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p (y_j|x_j, Θ^g)
It looks quite daunting, but...

First notice the following

∑_{y_1=1}^{M} ∑_{y_2=1}^{M} ··· ∑_{y_N=1}^{M} δ_{l,y_i} ∏_{j=1}^{N} p (y_j|x_j, Θ^g)
  = ( ∑_{y_1=1}^{M} ··· ∑_{y_{i−1}=1}^{M} ∑_{y_{i+1}=1}^{M} ··· ∑_{y_N=1}^{M} ∏_{j=1, j≠i}^{N} p (y_j|x_j, Θ^g) ) p (l|x_i, Θ^g)
  = ( ∏_{j=1, j≠i}^{N} ∑_{y_j=1}^{M} p (y_j|x_j, Θ^g) ) p (l|x_i, Θ^g)
  = p (l|x_i, Θ^g) = α_l^g p_l (x_i|θ_l^g) / ∑_{k=1}^{M} α_k^g p_k (x_i|θ_k^g)

Because

∑_{k=1}^{M} p (k|x_j, Θ^g) = 1   (36)
Thus

We can write Q in the following way

Q (Θ, Θ^g) = ∑_{i=1}^{N} ∑_{l=1}^{M} log [α_l p_l (x_i|θ_l)] p (l|x_i, Θ^g)
           = ∑_{i=1}^{N} ∑_{l=1}^{M} log (α_l) p (l|x_i, Θ^g) + ∑_{i=1}^{N} ∑_{l=1}^{M} log (p_l (x_i|θ_l)) p (l|x_i, Θ^g)   (37)
Lagrange Multipliers

To maximize this expression
We can maximize the term containing α_l and the term containing θ_l.

This can be done
Independently, since the two terms are not related.
Lagrange Multipliers

We can use the following constraint for that

∑_l α_l = 1   (38)

We have the following cost function

Q (Θ, Θ^g) + λ ( ∑_l α_l − 1 )   (39)

Differentiating with respect to α_l

∂/∂α_l [ Q (Θ, Θ^g) + λ ( ∑_l α_l − 1 ) ] = 0   (40)
Lagrange Multipliers

We have that

∑_{i=1}^{N} (1/α_l) p (l|x_i, Θ^g) + λ = 0   (41)

Thus

∑_{i=1}^{N} p (l|x_i, Θ^g) = −λ α_l   (42)

Summing over l, we get

λ = −N   (43)
Lagrange Multipliers

Thus

α_l = (1/N) ∑_{i=1}^{N} p (l|x_i, Θ^g)   (44)

About θ_l
It is possible to get analytical expressions for θ_l as functions of everything else.
This is for you in a homework!!!

For more, look at
“Geometric Idea of Lagrange Multipliers” by John Wyatt.
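Equation 44 says the new mixing weight α_l is just the average responsibility of component l over the samples. A minimal sketch; the responsibility matrix `R` below is made up:

```python
import numpy as np

# R is an N x M matrix of responsibilities p(l | x_i, Theta^g); rows sum to one.
R = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.4, 0.6]])

# Eq. (44): alpha_l = (1/N) sum_i p(l | x_i, Theta^g), i.e. the column average.
alphas_new = R.mean(axis=0)
```

The constraint ∑_l α_l = 1 is preserved automatically, since each row of `R` sums to one.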
Remember?

Gaussian Distribution

p_l (x|μ_l, Σ_l) = 1 / ( (2π)^{d/2} |Σ_l|^{1/2} ) exp{ −(1/2) (x − μ_l)^T Σ_l^{−1} (x − μ_l) }   (45)
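Equation 45 can be evaluated directly. A minimal sketch (the helper name `gaussian_pdf` is ours, not from the slides), using a linear solve rather than an explicit matrix inverse:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Multivariate normal density of Equation 45
    d = mu.shape[0]
    diff = x - mu
    norm_const = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

# At the mean of a standard 2-D normal the density is 1 / (2*pi)
p = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```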
How to use this for Gaussian Distributions

For this, we need to refresh some linear algebra
1. tr (A + B) = tr (A) + tr (B)
2. tr (AB) = tr (BA)
3. ∑_i x_i^T A x_i = tr (AB), where B = ∑_i x_i x_i^T
4. |A^{−1}| = 1 / |A|

Now, we need the derivative of a matrix function f (A)
∂f (A) / ∂A is the matrix whose (i, j)-th entry is ∂f (A) / ∂a_{i,j}, where a_{i,j} is the (i, j)-th entry of A.
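The four identities above are easy to check numerically. A small sketch with random matrices (the names `A`, `C`, `xs` are ours, chosen only for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
xs = rng.standard_normal((5, 3))   # five vectors x_i in R^3

lhs = sum(x @ A @ x for x in xs)       # sum_i x_i^T A x_i
B = sum(np.outer(x, x) for x in xs)    # B = sum_i x_i x_i^T
rhs = np.trace(A @ B)                  # tr(AB), identity 3 above
```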
In addition

If A is symmetric

∂|A| / ∂A = { A_{i,j} if i = j ; 2A_{i,j} if i ≠ j }   (46)

where A_{i,j} is the (i, j)-th cofactor of A.
Note: the cofactor of an element is the determinant obtained by deleting that element’s row and column of the matrix, preceded by a + or − sign depending on whether the element is in a + or − position.

Thus

∂ log |A| / ∂A = { A_{i,j}/|A| if i = j ; 2A_{i,j}/|A| if i ≠ j } = 2A^{−1} − diag (A^{−1})   (47)
Finally

The last equation we need

∂ tr (AB) / ∂A = B + B^T − diag (B)   (48)

In addition, the derivative of the quadratic form

∂ (x^T A x) / ∂x = (A + A^T) x   (49)
Thus, using the last part of equation 37

We get, after ignoring constant terms
Remember, they disappear after taking derivatives

∑_{i=1}^{N} ∑_{l=1}^{M} log (p_l (x_i|μ_l, Σ_l)) p (l|x_i, Θ^g)
  = ∑_{i=1}^{N} ∑_{l=1}^{M} [ −(1/2) log (|Σ_l|) − (1/2) (x_i − μ_l)^T Σ_l^{−1} (x_i − μ_l) ] p (l|x_i, Θ^g)   (50)

Thus, when taking the derivative with respect to μ_l

∑_{i=1}^{N} Σ_l^{−1} (x_i − μ_l) p (l|x_i, Θ^g) = 0   (51)

Then

μ_l = ∑_{i=1}^{N} x_i p (l|x_i, Θ^g) / ∑_{i=1}^{N} p (l|x_i, Θ^g)   (52)
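Equation 52 is a responsibility-weighted mean. A minimal sketch; the data `X` and responsibilities `R` below are made up for illustration:

```python
import numpy as np

# X: N scalar samples; R: N x M responsibilities p(l | x_i, Theta^g)
X = np.array([0.0, 1.0, 4.0, 5.0])
R = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.1, 0.9]])

# Eq. (52): mu_l = sum_i x_i p(l|x_i) / sum_i p(l|x_i), for every l at once
mus_new = (R * X[:, None]).sum(axis=0) / R.sum(axis=0)
```

Component 0 gets most of its weight from the small samples and component 1 from the large ones, so the two means separate accordingly.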
Now, if we derive with respect to Σ_l

First, we rewrite equation 50

∑_{i=1}^{N} ∑_{l=1}^{M} [ −(1/2) log (|Σ_l|) − (1/2) (x_i − μ_l)^T Σ_l^{−1} (x_i − μ_l) ] p (l|x_i, Θ^g)
  = ∑_{l=1}^{M} [ −(1/2) log (|Σ_l|) ∑_{i=1}^{N} p (l|x_i, Θ^g) − (1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) tr{ Σ_l^{−1} (x_i − μ_l) (x_i − μ_l)^T } ]
  = ∑_{l=1}^{M} [ −(1/2) log (|Σ_l|) ∑_{i=1}^{N} p (l|x_i, Θ^g) − (1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) tr{ Σ_l^{−1} N_{l,i} } ]

where N_{l,i} = (x_i − μ_l) (x_i − μ_l)^T.
Deriving with respect to Σ_l^{−1}

We have that

∂/∂Σ_l^{−1} ∑_{l=1}^{M} [ −(1/2) log (|Σ_l|) ∑_{i=1}^{N} p (l|x_i, Θ^g) − (1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) tr{ Σ_l^{−1} N_{l,i} } ]
  = (1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) (2Σ_l − diag (Σ_l)) − (1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) (2N_{l,i} − diag (N_{l,i}))
  = (1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) (2M_{l,i} − diag (M_{l,i}))
  = 2S − diag (S)

where M_{l,i} = Σ_l − N_{l,i} and S = (1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) M_{l,i}.
Thus, we have

Setting the derivative to zero
If 2S − diag (S) = 0 =⇒ S = 0.

Implying

(1/2) ∑_{i=1}^{N} p (l|x_i, Θ^g) [Σ_l − N_{l,i}] = 0   (53)

Or

Σ_l = ∑_{i=1}^{N} p (l|x_i, Θ^g) N_{l,i} / ∑_{i=1}^{N} p (l|x_i, Θ^g) = ∑_{i=1}^{N} p (l|x_i, Θ^g) (x_i − μ_l) (x_i − μ_l)^T / ∑_{i=1}^{N} p (l|x_i, Θ^g)   (54)
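Equation 54 is a responsibility-weighted covariance. A minimal sketch for one component with 2-D data; the samples `X`, the responsibilities `r`, and the (already-updated) mean are made up for illustration:

```python
import numpy as np

# X: N samples in R^2; r: responsibilities p(l | x_i, Theta^g) for one component l
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
r = np.array([0.5, 1.0, 0.5])

# Eq. (52): the weighted mean mu_l
mu = (r[:, None] * X).sum(axis=0) / r.sum()

# Eq. (54): Sigma_l = sum_i r_i (x_i - mu)(x_i - mu)^T / sum_i r_i
diffs = X - mu
Sigma = (r[:, None, None] * np.einsum('ni,nj->nij', diffs, diffs)).sum(axis=0) / r.sum()
```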
Thus, we have the iterative updates

They are

α_l^{New} = (1/N) ∑_{i=1}^{N} p (l|x_i, Θ^g)

μ_l^{New} = ∑_{i=1}^{N} x_i p (l|x_i, Θ^g) / ∑_{i=1}^{N} p (l|x_i, Θ^g)

Σ_l^{New} = ∑_{i=1}^{N} p (l|x_i, Θ^g) (x_i − μ_l) (x_i − μ_l)^T / ∑_{i=1}^{N} p (l|x_i, Θ^g)
EM for Gaussian Mixtures
1. Initialize the means μ_l, covariances Σ_l, and mixing coefficients α_l, and evaluate the initial value of the log likelihood.
2. E-Step: Evaluate the probabilities of component l given x_i using the current parameter values:

   p (l|x_i, Θ^g) = α_l^g p_l (x_i|θ_l^g) / ∑_{k=1}^{M} α_k^g p_k (x_i|θ_k^g)

3. M-Step: Re-estimate the parameters using the current iteration values:

   α_l^{New} = (1/N) ∑_{i=1}^{N} p (l|x_i, Θ^g)
   μ_l^{New} = ∑_{i=1}^{N} x_i p (l|x_i, Θ^g) / ∑_{i=1}^{N} p (l|x_i, Θ^g)
   Σ_l^{New} = ∑_{i=1}^{N} p (l|x_i, Θ^g) (x_i − μ_l) (x_i − μ_l)^T / ∑_{i=1}^{N} p (l|x_i, Θ^g)

4. Evaluate the log likelihood:

   log p (X|μ, Σ, α) = ∑_{i=1}^{N} log { ∑_{l=1}^{M} α_l p_l (x_i|μ_l, Σ_l) }

   Check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
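The loop above can be sketched end-to-end for univariate Gaussian components. This is an illustrative implementation under our own naming (`em_gmm_1d`) and synthetic data; the updates are Equations 44, 52, and 54 specialized to scalar Gaussians, with the convergence check on the log likelihood:

```python
import numpy as np

def em_gmm_1d(X, alphas, mus, sigmas, n_iter=100, tol=1e-6):
    # EM for a 1-D Gaussian mixture, following the E/M steps of the algorithm above.
    alphas, mus, sigmas = map(np.asarray, (alphas, mus, sigmas))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # alpha_l * p_l(x_i | mu_l, sigma_l), shape (N, M)
        dens = (alphas * np.exp(-0.5 * ((X[:, None] - mus) / sigmas) ** 2)
                / (sigmas * np.sqrt(2.0 * np.pi)))
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:                       # convergence check (step 4)
            break
        prev_ll = ll
        R = dens / dens.sum(axis=1, keepdims=True)   # E-step: responsibilities
        Nl = R.sum(axis=0)                           # effective counts per component
        alphas = Nl / len(X)                         # Eq. (44)
        mus = (R * X[:, None]).sum(axis=0) / Nl      # Eq. (52)
        sigmas = np.sqrt((R * (X[:, None] - mus) ** 2).sum(axis=0) / Nl)  # Eq. (54), 1-D
    return alphas, mus, sigmas, ll

# Two well-separated synthetic clusters; EM should recover means near 0 and 5
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
alphas, mus, sigmas, ll = em_gmm_1d(X, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
```

Each pass through the loop is one EM iteration; by the convergence argument earlier, the log likelihood `ll` never decreases between iterations.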
Exercises

Duda and Hart, Chapter 3
Problems 3.39, 3.40, 3.41