CS 6347 Lecture 18 - University of Texas at Dallas
TRANSCRIPT
CS 6347
Lecture 18
Expectation Maximization
Unobserved Variables
• Latent or hidden variables in the model are never observed
• We may or may not be interested in their values, but their existence is crucial to the model
• Some observations in a particular sample may be missing
• Missing information on surveys or medical records (quite common)
• We may need to model how the variables are missing
Hidden Markov Models
p(x_1, …, x_T, y_1, …, y_T) = p(y_1) p(x_1 | y_1) ∏_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
• The x's are observed variables; the y's are latent
• Example: the x variables correspond to the sizes of tree growth rings (one per year), and the y variables correspond to the average temperature in each year
[Figure: HMM graphical model with hidden chain y_1 → y_2 → ⋯ → y_{T-1} → y_T and observed nodes x_1, x_2, …, x_{T-1}, x_T, each x_t emitted from y_t]
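As a rough illustration (not part of the original slides), the sketch below evaluates the HMM factorization above for a short sequence. The transition and emission tables are made-up numbers, assuming binary states and binary observations.

```python
import numpy as np

# Made-up tables for a 2-state, 2-observation HMM (illustration only).
p_y1 = np.array([0.6, 0.4])            # p(y_1)
p_trans = np.array([[0.7, 0.3],        # p(y_t | y_{t-1}), rows indexed by y_{t-1}
                    [0.2, 0.8]])
p_emit = np.array([[0.9, 0.1],         # p(x_t | y_t), rows indexed by y_t
                   [0.3, 0.7]])

def joint(xs, ys):
    """p(x_1..x_T, y_1..y_T) = p(y_1) p(x_1|y_1) prod_{t>=2} p(y_t|y_{t-1}) p(x_t|y_t)."""
    prob = p_y1[ys[0]] * p_emit[ys[0], xs[0]]
    for t in range(1, len(xs)):
        prob *= p_trans[ys[t - 1], ys[t]] * p_emit[ys[t], xs[t]]
    return prob

print(joint(xs=[0, 1, 1], ys=[0, 0, 1]))
```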
Missing Data
• Data can be missing from the model in many different ways
– Missing completely at random: the probability that a data item is missing is independent of the observed data and the other missing data
– Missing at random: the probability that a data item is missing can depend on the observed data
– Missing not at random: the probability that a data item is missing can depend on the observed data and the other missing data
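A small sketch (hypothetical probabilities, not from the lecture) contrasting the three mechanisms by generating missingness masks for a variable x, using a second always-observed variable z:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                     # the value that may go missing
z = rng.normal(size=1000)                     # a second, always-observed value

# MCAR: missingness is independent of x and z.
miss_mcar = rng.random(1000) < 0.2

# MAR: missingness may depend on the observed z (but not on the missing x).
miss_mar = rng.random(1000) < 1 / (1 + np.exp(-z))

# MNAR: missingness depends on the value of x itself.
miss_mnar = rng.random(1000) < 1 / (1 + np.exp(-x))

print(miss_mcar.mean(), miss_mar.mean(), miss_mnar.mean())
```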
Handling Missing Data
• Discard all incomplete observations
– Can introduce bias
• Imputation: actual values are substituted for missing values so that all of the data is fully observed
– E.g., find the most probable assignments for the missing data and substitute them in (not possible if the model is unknown)
– Use the sample mean/mode
• Explicitly model the missing data
– For example, could expand the state space
– The most sensible solution, but may be non-trivial if we don't know how/why the data is missing
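A minimal sketch of the discard and mean-imputation strategies, assuming missing entries are marked with NaN in a small, made-up data matrix:

```python
import numpy as np

# Hypothetical data matrix; NaN marks a missing entry.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Option 1: discard incomplete observations (can introduce bias).
complete_rows = X[~np.isnan(X).any(axis=1)]

# Option 2: mean imputation -- substitute each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print(complete_rows)
print(X_imputed)
```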
Modelling Missing Data
• Add an additional binary variable b_i to the model for each possible observed variable x_i that indicates whether or not that variable is observed

p(x_obs, x_mis, b) = p(b | x_obs, x_mis) p(x_obs, x_mis)

Explicit model of the missing data (missing not at random)
Modelling Missing Data
• Add an additional binary variable b_i to the model for each possible observed variable x_i that indicates whether or not that variable is observed

p(x_obs, x_mis, b) = p(b | x_obs) p(x_obs, x_mis)

Missing at random
Modelling Missing Data
• Add an additional binary variable b_i to the model for each possible observed variable x_i that indicates whether or not that variable is observed

p(x_obs, x_mis, b) = p(b) p(x_obs, x_mis)

Missing completely at random

How can you model latent variables in this framework?
Learning with Missing Data
• In order to design learning algorithms for models with missing data, we will make two assumptions
– The data is missing at random
– The model parameters corresponding to the missing data (δ) are separate from the model parameters of the observed data (θ)
• That is,

p(x_obs, b | θ, δ) = p(b | x_obs, δ) p(x_obs | θ)
Learning with Missing Data
p(x_obs, b | θ, δ) = p(b | x_obs, δ) p(x_obs | θ)

• Under the previous assumptions, the log-likelihood of samples (x^1, b^1), …, (x^K, b^K) is equal to

ℓ(θ, δ) = ∑_{k=1}^{K} log p(b^k | x_obs^k, δ) + ∑_{k=1}^{K} log ∑_{x_mis^k} p(x_obs^k, x_mis^k | θ)
Separable in θ and δ, so if we don't care about δ, then we only have to maximize the second term over θ
Learning with Missing Data
ℓ(θ) = ∑_{k=1}^{K} log ∑_{x_mis^k} p(x_obs^k, x_mis^k | θ)

• This is NOT a concave function of θ
– In the worst case, could have a different local maximum for each possible value of the missing data
– No longer have a closed-form solution, even in the case of Bayesian networks
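A toy sketch of evaluating this objective, assuming two binary variables where only the second can ever be missing and a hypothetical joint table p(a, b | θ):

```python
import numpy as np

# Hypothetical joint table p(a, b | theta) for two binary variables.
p_joint = np.array([[0.4, 0.1],
                    [0.2, 0.3]])

def log_likelihood(samples, p_joint):
    """Observed-data log-likelihood: sum_k log sum_{missing} p(obs, missing | theta)."""
    total = 0.0
    for a, b in samples:
        if b is None:                       # b is missing: sum it out
            prob = p_joint[a, :].sum()
        else:                               # fully observed sample
            prob = p_joint[a, b]
        total += np.log(prob)
    return total

samples = [(0, 1), (1, None), (0, None), (1, 0)]   # None marks a missing value
print(log_likelihood(samples, p_joint))
```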
Expectation Maximization
• The expectation-maximization algorithm (EM) is a method to find a local maximum or a saddle point of the log-likelihood with missing data
• Basic idea:
ℓ(θ) = ∑_{k=1}^{K} log ∑_{x_mis^k} p(x_obs^k, x_mis^k | θ)

= ∑_{k=1}^{K} log ∑_{x_mis^k} q_k(x_mis^k) · p(x_obs^k, x_mis^k | θ) / q_k(x_mis^k)

≥ ∑_{k=1}^{K} ∑_{x_mis^k} q_k(x_mis^k) log [ p(x_obs^k, x_mis^k | θ) / q_k(x_mis^k) ]

where the last step is Jensen's inequality (the log of an average is at least the average of the logs) and holds for any distributions q_k over x_mis^k.
Expectation Maximization
F(θ, q) ≡ ∑_{k=1}^{K} ∑_{x_mis^k} q_k(x_mis^k) log [ p(x_obs^k, x_mis^k | θ) / q_k(x_mis^k) ]

• Maximizing F is equivalent to maximizing the log-likelihood
• Could maximize it using coordinate ascent:

q^{t+1} = argmax_{q_1, …, q_K} F(θ^t, q)

θ^{t+1} = argmax_θ F(θ, q^{t+1})
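As a quick numerical check (made-up numbers, single sample with one missing binary variable), the sketch below confirms that F lower-bounds the log-likelihood for an arbitrary q and matches it when q is the posterior over the missing variable:

```python
import numpy as np

# Hypothetical joint table p(a, b | theta); a is observed, b is missing.
p_joint = np.array([[0.4, 0.1],
                    [0.2, 0.3]])
a_obs = 1

log_lik = np.log(p_joint[a_obs, :].sum())          # log sum_b p(a_obs, b | theta)

def F(q):
    """Single-sample version of F: sum_b q(b) log(p(a_obs, b | theta) / q(b))."""
    return np.sum(q * np.log(p_joint[a_obs, :] / q))

posterior = p_joint[a_obs, :] / p_joint[a_obs, :].sum()
print(F(np.array([0.5, 0.5])) <= log_lik)          # True: F lower-bounds the log-likelihood
print(np.isclose(F(posterior), log_lik))           # True: the bound is tight at the posterior
```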
Expectation Maximization
∑_{x_mis^k} q_k(x_mis^k) log [ p(x_obs^k, x_mis^k | θ) / q_k(x_mis^k) ]

• This is just −KL(q_k ‖ p(x_mis^k | x_obs^k, θ)) + log p(x_obs^k | θ)
• Maximized when q_k(x_mis^k) = p(x_mis^k | x_obs^k, θ)
• Can reformulate the EM algorithm as

θ^{t+1} = argmax_θ ∑_{k=1}^{K} ∑_{x_mis^k} p(x_mis^k | x_obs^k, θ^t) log p(x_obs^k, x_mis^k | θ)
An Example: Bayesian Networks
• Recall that MLE for Bayesian networks without latent variables yielded

θ_{x_i | x_parents(i)} = N_{x_i, x_parents(i)} / ∑_{x_i'} N_{x_i', x_parents(i)}

• Let's suppose that we are given observations from a Bayesian network in which one of the variables is hidden
– What do the iterations of the EM algorithm look like?
(on board)
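Not the derivation done on the board, but a rough sketch of what those iterations can look like for a hypothetical three-node network z → (x1, x2) with a hidden binary z and made-up data: the E-step computes the posterior over z for each sample, and the M-step plugs the resulting expected counts into the counting formula above.

```python
import numpy as np

# Hypothetical network z -> (x1, x2); the binary variable z is never observed.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 2))        # made-up observations of (x1, x2)

# Initial parameters: p(z), p(x1 | z), p(x2 | z) (rows indexed by z).
p_z = np.array([0.5, 0.5])
p_x1 = np.array([[0.6, 0.4], [0.3, 0.7]])
p_x2 = np.array([[0.8, 0.2], [0.4, 0.6]])

for _ in range(50):
    # E-step: posterior p(z | x1, x2, theta_t) for every sample.
    lik = p_z * p_x1[:, data[:, 0]].T * p_x2[:, data[:, 1]].T   # shape (N, 2)
    post = lik / lik.sum(axis=1, keepdims=True)

    # M-step: the usual counting formula, with expected counts in place of observed counts.
    p_z = post.mean(axis=0)
    for x, p_x in ((data[:, 0], p_x1), (data[:, 1], p_x2)):
        for v in (0, 1):
            p_x[:, v] = post[x == v].sum(axis=0)     # expected count N_{x=v, z}
        p_x /= p_x.sum(axis=1, keepdims=True)        # normalize each row back to a CPT

print("p(z) after EM:", p_z)
```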