CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning with Hidden Variables & Expectation Maximization
AGENDA
Learning probability distributions from data in the setting of known structure but missing data
Expectation-maximization (EM) algorithm
BASIC PROBLEM
Given a dataset D={x[1],…,x[M]} and a Bayesian model over observed variables X and hidden (latent) variables Z
Fit the distribution P(X, Z) to the data
Interpretation: each example x[m] is an incomplete view of the "underlying" sample (x[m], z[m])
[Diagram: Z → X]
APPLICATIONS
Clustering in data mining
Dimensionality reduction
Latent psychological traits (e.g., intelligence, personality)
Document classification
Human activity recognition
HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS
Hidden variables => conditional independences
[Diagram: Z → X1, X2, X3, X4 (left) vs. fully connected X1, X2, X3, X4 (right)]
Without Z, the observables become fully dependent
HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS
Hidden variables => conditional independences
[Diagram: Z → X1, X2, X3, X4 (left) vs. fully connected X1, X2, X3, X4 (right)]
Without Z, the observables become fully dependent
With Z (all variables binary): 1 + 4*2 = 9 parameters
Without Z: 1 + 2 + 4 + 8 = 15 parameters
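The two parameter counts above follow directly from how CPTs grow with the number of parents. A minimal sketch (the helper name is mine; all variables binary):

```python
def cpt_params(num_binary_parents: int) -> int:
    # One free probability per joint assignment of the binary parents.
    return 2 ** num_binary_parents

# With hidden Z: P(Z) plus P(Xi | Z) for i = 1..4.
with_z = cpt_params(0) + 4 * cpt_params(1)        # 1 + 4*2 = 9

# Without Z the observables become fully dependent; e.g. a network
# where each Xi has all earlier observables as parents.
without_z = sum(cpt_params(k) for k in range(4))  # 1 + 2 + 4 + 8 = 15

print(with_z, without_z)  # 9 15
```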
GENERATING MODEL
[Diagram: unrolled network z[1] → x[1], …, z[M] → x[M], with shared parameters qz and qx|z]
These CPTs are identical and given: every z[m] uses the same qz, and every x[m] uses the same qx|z
EXAMPLE: DISCRETE VARIABLES
[Diagram: unrolled network z[1] → x[1], …, z[M] → x[M], with shared parameters qz and qx|z]
Categorical distribution given by parameters qz: P(Z[i] | qz) = Categorical(qz)
Categorical distribution P(X[i] | z[i], qx|z[i]) = Categorical(qx|z[i])
(in other words, z[i] multiplexes between Categorical distributions)
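The generating model above can be sketched as a sampler (the parameter values here are illustrative, not from the slides):

```python
import random

random.seed(0)

# Hypothetical parameters: binary hidden Z, 3-valued observable X.
q_z = [0.75, 0.25]                   # Categorical over Z in {0, 1}
q_x_given_z = [[0.6, 0.3, 0.1],      # Categorical over X when Z = 0
               [0.1, 0.2, 0.7]]      # Categorical over X when Z = 1

def sample_example():
    # z multiplexes between the Categorical distributions for X.
    z = random.choices([0, 1], weights=q_z)[0]
    x = random.choices([0, 1, 2], weights=q_x_given_z[z])[0]
    return x, z

# The dataset keeps only the observed halves x[m]; each z[m] stays hidden.
full = [sample_example() for _ in range(1000)]
D = [x for x, _ in full]
```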
MAXIMUM LIKELIHOOD ESTIMATION
Approach: find values of q = (qz, qx|z) and DZ = (z[1], …, z[M]) that maximize the likelihood of the data
L(q, DZ; D) = P(D | q, DZ)
Find arg max L(q, DZ; D) over q, DZ
MARGINAL LIKELIHOOD ESTIMATION
Approach: find values of q = (qz, qx|z) that maximize the likelihood of the data without assuming values of DZ = (z[1], …, z[M])
L(q; D) = Σ_DZ P(D, DZ | q)
Find arg max L(q; D) over q
(A partially Bayesian approach)
COMPUTATIONAL CHALLENGES
P(D | q, DZ) and P(D, DZ | q) are easy to evaluate, but…
Maximum likelihood arg max L(q, DZ; D): optimizing over the assignments to Z for all M examples (|Val(Z)|^M possible joint assignments) as well as over the continuous parameters
Maximum marginal likelihood arg max L(q; D): optimizing locally over continuous parameters only, but the objective requires summing over the assignments to Z for all M examples
EXPECTATION MAXIMIZATION FOR ML
Idea: use a coordinate ascent approach
arg max_{q,DZ} L(q, DZ; D) = arg max_q max_DZ L(q, DZ; D)
Step 1: Finding DZ* = arg max_DZ L(q, DZ; D) is easy given a fixed q — each z[m]* = arg max_z P(x[m], z | q) can be chosen independently
Step 2: Set Q(q) = L(q, DZ*; D). Finding q* = arg max_q Q(q) is easy given that DZ is fixed (fully observed, ML parameter estimation)
Repeat steps 1 and 2 until convergence
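The two coordinate-ascent steps can be sketched on the two-variable dataset that appears later in these slides ((1,1): 222, (1,0): 382, (0,1): 364, (0,0): 32). The helper names and the initial guess are mine; from this particular start the loop happens to reach the fixed point the slides report as the true ML estimate.

```python
# Observed pairs (x1, x2); each z[m] is hidden.  z = 0 plays "type 1".
data = [(1, 1)] * 222 + [(1, 0)] * 382 + [(0, 1)] * 364 + [(0, 0)] * 32

def joint(x, z, theta):
    # P(x, z | theta) for one example.
    qz, qx1, qx2 = theta                  # qx1[k] = P(X1=1 | Z=k)
    pz = qz if z == 0 else 1.0 - qz
    p1 = qx1[z] if x[0] == 1 else 1.0 - qx1[z]
    p2 = qx2[z] if x[1] == 1 else 1.0 - qx2[z]
    return pz * p1 * p2

def step1_assignments(theta):
    # Step 1: best z[m] for each example, holding theta fixed.
    return [max((0, 1), key=lambda z: joint(x, z, theta)) for x in data]

def step2_parameters(zs):
    # Step 2: fully observed ML estimation given the assignments.
    n = [zs.count(0), zs.count(1)]
    qz = n[0] / len(data)
    qx1 = [sum(x[0] for x, z in zip(data, zs) if z == k) / max(n[k], 1)
           for k in (0, 1)]
    qx2 = [sum(x[1] for x, z in zip(data, zs) if z == k) / max(n[k], 1)
           for k in (0, 1)]
    return qz, qx1, qx2

theta = (0.5, [0.9, 0.1], [0.2, 0.8])     # illustrative initial guess (mine)
for _ in range(20):
    theta = step2_parameters(step1_assignments(theta))
```

With this start the loop converges to qz = 0.604, qx1|z = (1, 0), qx2|z ≈ (0.368, 0.919), matching the converged estimates on the later slides; other starts can land in different local optima, which is exactly the problem discussed below.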
EXAMPLE: CORRELATED VARIABLES
[Diagram: unrolled network z[m] → x1[m], x2[m] for m = 1, …, M (left) and the equivalent plate notation z → x1, x2 inside a plate of size M (right), with shared parameters qz, qx1|z, qx2|z]
EXAMPLE: CORRELATED VARIABLES
[Plate notation: z → x1, x2; parameters qz, qx1|z, qx2|z; plate of size M]
Suppose 2 types:
1. X1 != X2, at random
2. X1, X2 = 1, 1 with 90% chance, 0, 0 otherwise
Type 1 is drawn 75% of the time
X Dataset: (1,1): 222; (1,0): 382; (0,1): 364; (0,0): 32
Initial parameter estimates: qz = 0.5; qx1|z=1 = 0.4, qx1|z=2 = 0.3; qx2|z=1 = 0.7, qx2|z=2 = 0.6
Estimated Z's (step 1, holding the parameters fixed): (1,1) → type 1; (1,0) → type 1; (0,1) → type 2; (0,0) → type 2
Updated parameter estimates (step 2): qz = 0.604; qx1|z=1 = 1, qx1|z=2 = 0; qx2|z=1 = 0.368, qx2|z=2 = 0.919
Estimated Z's: (1,1) → type 1; (1,0) → type 1; (0,1) → type 2; (0,0) → type 2
The parameter estimates and Z assignments no longer change: converged (true ML estimate)
EXAMPLE: CORRELATED VARIABLES
[Plate notation: z → x1, x2, x3, x4; parameters qz, qx1|z, qx2|z, qx3|z, qx4|z; plate of size M]
Random initial guess:
qZ = 0.44
qX1|Z=1 = 0.97, qX2|Z=1 = 0.21, qX3|Z=1 = 0.87, qX4|Z=1 = 0.57
qX1|Z=2 = 0.07, qX2|Z=2 = 0.97, qX3|Z=2 = 0.71, qX4|Z=2 = 0.03
Log likelihood -5176
X Dataset (counts; rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0             115   142    20    47
0,1              32    16    37    75
1,0              12   117    39    58
1,1             133    92    45    20
EXAMPLE: E STEP
(Same X dataset and random initial guess as above)
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     2     2     2
1,0              1     1     1     1
1,1              2     1     1     1
Log likelihood -4401
EXAMPLE: M STEP
Current estimates:
qZ = 0.43
qX1|Z=1 = 0.67, qX2|Z=1 = 0.27, qX3|Z=1 = 0.37, qX4|Z=1 = 0.83
qX1|Z=2 = 0.31, qX2|Z=2 = 0.68, qX3|Z=2 = 0.31, qX4|Z=2 = 0.21
Log likelihood -3033
(Z assignments unchanged from the previous E step)
EXAMPLE: E STEP
(Parameter estimates unchanged from the previous M step)
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     2     2     1
1,0              1     1     1     1
1,1              2     1     2     1
Log likelihood -2965
EXAMPLE: E STEP
Current estimates:
qZ = 0.40
qX1|Z=1 = 0.56, qX2|Z=1 = 0.31, qX3|Z=1 = 0.40, qX4|Z=1 = 0.92
qX1|Z=2 = 0.45, qX2|Z=2 = 0.66, qX3|Z=2 = 0.26, qX4|Z=2 = 0.04
Log likelihood -2859
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     2     2     2
1,0              1     1     1     1
1,1              2     1     1     1
EXAMPLE: LAST E-M STEP
Current estimates:
qZ = 0.43
qX1|Z=1 = 0.51, qX2|Z=1 = 0.36, qX3|Z=1 = 0.35, qX4|Z=1 = 1
qX1|Z=2 = 0.53, qX2|Z=2 = 0.57, qX3|Z=2 = 0.33, qX4|Z=2 = 0
Log likelihood -2683
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     1     2     1
1,0              2     1     2     1
1,1              2     1     2     1
PROBLEM: MANY LOCAL MINIMA
Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape!
Solution: EM using the marginal likelihood formulation, i.e., "soft" EM (this is the typical form of the EM algorithm)
EXPECTATION MAXIMIZATION FOR MML
arg max_q L(q; D) = arg max_q E_{DZ|D,q}[ L(q; DZ, D) ]
Do arg max_q E_{DZ|D,q}[ log L(q; DZ, D) ] instead (justified later)
Step 1 (expectation): Given the current fixed qt, find P(DZ | qt, D), i.e., compute a distribution over each Z[m]
Step 2 (maximization): Use these probabilities in the expectation E_{DZ|D,qt}[ log L(q, DZ; D) ] = Q(q). Now find max_q Q(q) (fully observed, weighted ML parameter estimation)
Repeat steps 1 (expectation) and 2 (maximization) until convergence
E STEP IN DETAIL
Ultimately, want to maximize Q(q | qt) = E_{DZ|D,qt}[ log L(q; DZ, D) ] over q
Q(q | qt) = Σm Σ_{z[m]} P(z[m] | x[m], qt) log P(x[m], z[m] | q)
The E step computes the terms
w_{m,z}(qt) = P(Z[m] = z | D, qt)
over all examples m and all z ∈ Val(Z)
M STEP IN DETAIL
arg max_q Q(q | qt) = arg max_q Σm Σz w_{m,z}(qt) log P(x[m], z[m]=z | q) = arg max_q Πm Πz P(x[m], z[m]=z | q)^(w_{m,z}(qt))
This is weighted ML: each assignment z[m] = z is interpreted to be observed w_{m,z}(qt) times
Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case
EXAMPLE: BERNOULLI PARAMETER FOR Z
qZ* = arg max_{qZ} Σm Σz w_{m,z} log P(x[m], z[m]=z | qZ)
= arg max_{qZ} Σm Σz w_{m,z} log(I[z=1] qZ + I[z=0] (1 − qZ))
= arg max_{qZ} [ log(qZ) Σm w_{m,z=1} + log(1 − qZ) Σm w_{m,z=0} ]
=> qZ* = (Σm w_{m,z=1}) / Σm (w_{m,z=1} + w_{m,z=0})
"Expected counts" M_{qt}[z] = Σm w_{m,z}(qt)
Express qZ* = M_{qt}[z=1] / M_{qt}[·]
EXAMPLE: BERNOULLI PARAMETERS FOR XI | Z
qXi|z=k* = arg max_{qXi|z=k} Σm w_{m,z=k} log P(x[m], z[m]=k | qXi|z=k)
= arg max_{qXi|z=k} Σm Σz w_{m,z} log(I[xi[m]=1, z=k] qXi|z=k + I[xi[m]=0, z=k] (1 − qXi|z=k))
= … (similar derivation)
=> qXi|z=k* = M_{qt}[xi=1, z=k] / M_{qt}[z=k]
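Putting the weights w_{m,z} and the expected-count formulas together, one full soft-EM iteration for the four-variable example can be sketched as follows. The function names and the convention qz = P(Z=1) (index 0 in the code) are mine; the counts and the random initial guess come from the slides.

```python
import math

# Counts from the four-variable example: key = (x1, x2, x3, x4).
counts = {
    (0, 0, 0, 0): 115, (0, 0, 0, 1): 142, (0, 0, 1, 0): 20,  (0, 0, 1, 1): 47,
    (0, 1, 0, 0): 32,  (0, 1, 0, 1): 16,  (0, 1, 1, 0): 37,  (0, 1, 1, 1): 75,
    (1, 0, 0, 0): 12,  (1, 0, 0, 1): 117, (1, 0, 1, 0): 39,  (1, 0, 1, 1): 58,
    (1, 1, 0, 0): 133, (1, 1, 0, 1): 92,  (1, 1, 1, 0): 45,  (1, 1, 1, 1): 20,
}

def loglik(qz, qx):
    # qz = P(Z=1); qx[k][i] = P(Xi=1 | Z=k+1)
    ll = 0.0
    for x, n in counts.items():
        tot = 0.0
        for k in (0, 1):
            p = qz if k == 0 else 1.0 - qz
            for i in range(4):
                p *= qx[k][i] if x[i] else 1.0 - qx[k][i]
            tot += p
        ll += n * math.log(tot)
    return ll

def soft_em_step(qz, qx):
    # E step: expected counts M[z=k] and M[xi=1, z=k] from the
    # weights w_{m,z} = P(Z[m]=z | x[m], theta_t).
    Mz = [0.0, 0.0]
    Mx = [[0.0] * 4 for _ in (0, 1)]
    for x, n in counts.items():
        joint = []
        for k in (0, 1):
            p = qz if k == 0 else 1.0 - qz
            for i in range(4):
                p *= qx[k][i] if x[i] else 1.0 - qx[k][i]
            joint.append(p)
        for k in (0, 1):
            w = n * joint[k] / (joint[0] + joint[1])
            Mz[k] += w
            for i in range(4):
                if x[i]:
                    Mx[k][i] += w
    # M step: weighted ML = ratios of expected counts.
    qz = Mz[0] / (Mz[0] + Mz[1])
    qx = [[Mx[k][i] / Mz[k] for i in range(4)] for k in (0, 1)]
    return qz, qx

# Random initial guess from the slides.
qz, qx = 0.44, [[0.97, 0.21, 0.87, 0.57], [0.07, 0.97, 0.71, 0.03]]
ll0 = loglik(qz, qx)
for _ in range(100):
    qz, qx = soft_em_step(qz, qx)
```

Each iteration can only increase the log likelihood, which is the monotonicity property claimed on the "why does it work" slide.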
EM ON PRIOR EXAMPLE (100 ITERATIONS)
[Plate notation: z → x1, x2, x3, x4; plate of size M; same X dataset as above]
Final estimates:
qZ = 0.49
qX1|Z=1 = 0.64, qX2|Z=1 = 0.88, qX3|Z=1 = 0.41, qX4|Z=1 = 0.46
qX1|Z=2 = 0.38, qX2|Z=2 = 0.00, qX3|Z=2 = 0.27, qX4|Z=2 = 0.68
Log likelihood -2833
P(Z=2 | x) (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0    0,1    1,0    1,1
0,0             0.90   0.95   0.84   0.93
0,1             0.00   0.00   0.00   0.00
1,0             0.76   0.89   0.64   0.82
1,1             0.00   0.00   0.00   0.00
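These posteriors follow from Bayes' rule on the naive-Bayes factorization. A quick check with the reported (two-decimal) final estimates — because the parameters are rounded, the recomputed values differ slightly from the table (about 0.92 vs. 0.90 for the (0,0),(0,0) cell):

```python
# Final estimates reported above, rounded to two decimals.
qz1 = 0.49                              # P(Z = 1)
qx_z1 = [0.64, 0.88, 0.41, 0.46]        # P(Xi = 1 | Z = 1)
qx_z2 = [0.38, 0.00, 0.27, 0.68]        # P(Xi = 1 | Z = 2)

def posterior_z2(x):
    # P(Z=2 | x) by Bayes' rule on the naive-Bayes factorization.
    p1, p2 = qz1, 1.0 - qz1
    for i in range(4):
        p1 *= qx_z1[i] if x[i] else 1.0 - qx_z1[i]
        p2 *= qx_z2[i] if x[i] else 1.0 - qx_z2[i]
    return p2 / (p1 + p2)

p00 = posterior_z2((0, 0, 0, 0))   # close to the table's 0.90
p01 = posterior_z2((0, 1, 0, 0))   # exactly 0: qX2|Z=2 = 0 rules out Z=2
```

Note how the rows with x2 = 1 get posterior 0.00: the learned qX2|Z=2 = 0 makes Z = 2 impossible whenever x2 = 1.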
CONVERGENCE
In general, no way to tell a priori how fast EM will converge
Soft EM is usually slower than hard EM
Still runs into local minima, but has more opportunities to coordinate parameter adjustments
[Plot: log likelihood (y axis, about -4000 to 0) vs. iteration count (x axis, 1 to 190)]
WHY DOES IT WORK?
Why are we optimizing over Q(q | qt) = Σm Σ_{z[m]} P(z[m] | x[m], qt) log P(x[m], z[m] | q) rather than the true marginalized likelihood L(q; D) = Πm Σ_{z[m]} P(x[m], z[m] | q)?
Can prove that:
The log likelihood is increased at every step
A stationary point of arg max_q E_{DZ|D,q}[ L(q; DZ, D) ] is a stationary point of log L(q; D)
See Koller & Friedman, pp. 882-884
GAUSSIAN CLUSTERING USING EM
One of the first uses of EM; widely used approach
Finding good starting points: k-means algorithm (hard assignment)
Handling degeneracies: regularization
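The three bullets above can be sketched in one minimal 1-D Gaussian-mixture example (synthetic data; all numbers illustrative, not from the slides): a k-means-style hard assignment supplies starting points, soft EM refines them, and a variance floor is a simple regularizer against degeneracies.

```python
import math
import random

random.seed(2)

# Synthetic 1-D data from two well-separated Gaussians.
data = [random.gauss(0.0, 1.0) for _ in range(300)] + \
       [random.gauss(8.0, 1.0) for _ in range(300)]

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# k-means-style hard assignment to get starting points.
mu = [min(data), max(data)]
for _ in range(10):
    groups = ([], [])
    for x in data:
        groups[0 if abs(x - mu[0]) <= abs(x - mu[1]) else 1].append(x)
    mu = [sum(g) / len(g) for g in groups]

pi, var = [0.5, 0.5], [1.0, 1.0]
VAR_FLOOR = 1e-3                  # regularization against degeneracies
for _ in range(50):
    # E step: responsibilities (soft assignments).
    r = [[pi[k] * npdf(x, mu[k], var[k]) for k in (0, 1)] for x in data]
    r = [[a / (a + b), b / (a + b)] for a, b in r]
    # M step: weighted ML updates.
    Nk = [sum(w[k] for w in r) for k in (0, 1)]
    pi = [Nk[k] / len(data) for k in (0, 1)]
    mu = [sum(w[k] * x for w, x in zip(r, data)) / Nk[k] for k in (0, 1)]
    var = [max(sum(w[k] * (x - mu[k]) ** 2 for w, x in zip(r, data)) / Nk[k],
               VAR_FLOOR) for k in (0, 1)]
```

Without the variance floor, a component can collapse onto a single data point (variance → 0, likelihood → ∞), which is the degeneracy the slide refers to.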
RECAP
Learning with hidden variables (typically categorical)