CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning with Hidden Variables & Expectation Maximization
AGENDA
Learning probability distributions from data in the setting of known structure but missing data
Expectation-maximization (EM) algorithm
BASIC PROBLEM
Given a dataset D={x[1],…,x[M]} and a Bayesian model over observed variables X and hidden (latent) variables Z
Fit the distribution P(X, Z) to the data
Interpretation: each example x[m] is an incomplete view of the "underlying" sample (x[m], z[m])
[Diagram: Z → X]
APPLICATIONS
Clustering in data mining
Dimensionality reduction
Latent psychological traits (e.g., intelligence, personality)
Document classification
Human activity recognition
HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS
Hidden variables => conditional independences
[Diagram: Z → X1, X2, X3, X4 (left) vs. fully connected X1, X2, X3, X4 (right)]
Without Z, the observables become fully dependent
HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS
Hidden variables => conditional independences
[Diagram: Z → X1, X2, X3, X4 (left) vs. fully connected X1, X2, X3, X4 (right)]
Without Z, the observables become fully dependent
With Z (all variables binary): 1 + 4*2 = 9 parameters
Without Z: 1 + 2 + 4 + 8 = 15 parameters
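The two parameter counts above follow directly from how CPTs grow with the number of parents. A minimal sketch (the helper name is mine; all variables binary):

```python
def cpt_params(num_binary_parents: int) -> int:
    # One free probability per joint assignment of the binary parents.
    return 2 ** num_binary_parents

# With hidden Z: P(Z) plus P(Xi | Z) for i = 1..4.
with_z = cpt_params(0) + 4 * cpt_params(1)        # 1 + 4*2 = 9

# Without Z the observables become fully dependent; e.g. a network
# where each Xi has all earlier observables as parents.
without_z = sum(cpt_params(k) for k in range(4))  # 1 + 2 + 4 + 8 = 15

print(with_z, without_z)  # 9 15
```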
GENERATING MODEL
[Diagram: unrolled network z[1] → x[1], …, z[M] → x[M], with shared parameters qz and qx|z]
These CPTs are identical and given: every z[m] uses the same qz, and every x[m] uses the same qx|z
EXAMPLE: DISCRETE VARIABLES
[Diagram: unrolled network z[1] → x[1], …, z[M] → x[M], with shared parameters qz and qx|z]
Categorical distribution given by parameters qz: P(Z[i] | qz) = Categorical(qz)
Categorical distribution P(X[i] | z[i], qx|z[i]) = Categorical(qx|z[i])
(in other words, z[i] multiplexes between Categorical distributions)
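The generating model above can be sketched as a sampler (the parameter values here are illustrative, not from the slides):

```python
import random

random.seed(0)

# Hypothetical parameters: binary hidden Z, 3-valued observable X.
q_z = [0.75, 0.25]                   # Categorical over Z in {0, 1}
q_x_given_z = [[0.6, 0.3, 0.1],      # Categorical over X when Z = 0
               [0.1, 0.2, 0.7]]      # Categorical over X when Z = 1

def sample_example():
    # z multiplexes between the Categorical distributions for X.
    z = random.choices([0, 1], weights=q_z)[0]
    x = random.choices([0, 1, 2], weights=q_x_given_z[z])[0]
    return x, z

# The dataset keeps only the observed halves x[m]; each z[m] stays hidden.
full = [sample_example() for _ in range(1000)]
D = [x for x, _ in full]
```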
MAXIMUM LIKELIHOOD ESTIMATION
Approach: find values of q = (qz, qx|z) and DZ = (z[1], …, z[M]) that maximize the likelihood of the data
L(q, DZ; D) = P(D | q, DZ)
Find arg max L(q, DZ; D) over q, DZ
MARGINAL LIKELIHOOD ESTIMATION
Approach: find values of q = (qz, qx|z) that maximize the likelihood of the data without assuming values of DZ = (z[1], …, z[M])
L(q; D) = Σ_DZ P(D, DZ | q)
Find arg max L(q; D) over q
(A partially Bayesian approach)
COMPUTATIONAL CHALLENGES
P(D | q, DZ) and P(D, DZ | q) are easy to evaluate, but…
Maximum likelihood arg max L(q, DZ; D): optimizing over the assignments to Z for all M examples (|Val(Z)|^M possible joint assignments) as well as over the continuous parameters
Maximum marginal likelihood arg max L(q; D): optimizing locally over continuous parameters only, but the objective requires summing over the assignments to Z for all M examples
EXPECTATION MAXIMIZATION FOR ML
Idea: use a coordinate ascent approach
arg max_{q,DZ} L(q, DZ; D) = arg max_q max_DZ L(q, DZ; D)
Step 1: Finding DZ* = arg max_DZ L(q, DZ; D) is easy given a fixed q — each z[m]* = arg max_z P(x[m], z | q) can be chosen independently
Step 2: Set Q(q) = L(q, DZ*; D). Finding q* = arg max_q Q(q) is easy given that DZ is fixed (fully observed, ML parameter estimation)
Repeat steps 1 and 2 until convergence
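The two coordinate-ascent steps can be sketched on the two-variable dataset that appears later in these slides ((1,1): 222, (1,0): 382, (0,1): 364, (0,0): 32). The helper names and the initial guess are mine; from this particular start the loop happens to reach the fixed point the slides report as the true ML estimate.

```python
# Observed pairs (x1, x2); each z[m] is hidden.  z = 0 plays "type 1".
data = [(1, 1)] * 222 + [(1, 0)] * 382 + [(0, 1)] * 364 + [(0, 0)] * 32

def joint(x, z, theta):
    # P(x, z | theta) for one example.
    qz, qx1, qx2 = theta                  # qx1[k] = P(X1=1 | Z=k)
    pz = qz if z == 0 else 1.0 - qz
    p1 = qx1[z] if x[0] == 1 else 1.0 - qx1[z]
    p2 = qx2[z] if x[1] == 1 else 1.0 - qx2[z]
    return pz * p1 * p2

def step1_assignments(theta):
    # Step 1: best z[m] for each example, holding theta fixed.
    return [max((0, 1), key=lambda z: joint(x, z, theta)) for x in data]

def step2_parameters(zs):
    # Step 2: fully observed ML estimation given the assignments.
    n = [zs.count(0), zs.count(1)]
    qz = n[0] / len(data)
    qx1 = [sum(x[0] for x, z in zip(data, zs) if z == k) / max(n[k], 1)
           for k in (0, 1)]
    qx2 = [sum(x[1] for x, z in zip(data, zs) if z == k) / max(n[k], 1)
           for k in (0, 1)]
    return qz, qx1, qx2

theta = (0.5, [0.9, 0.1], [0.2, 0.8])     # illustrative initial guess (mine)
for _ in range(20):
    theta = step2_parameters(step1_assignments(theta))
```

With this start the loop converges to qz = 0.604, qx1|z = (1, 0), qx2|z ≈ (0.368, 0.919), matching the converged estimates on the later slides; other starts can land in different local optima, which is exactly the problem discussed below.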
EXAMPLE: CORRELATED VARIABLES
[Diagram: unrolled network z[m] → x1[m], x2[m] for m = 1, …, M (left) and the equivalent plate notation z → x1, x2 inside a plate of size M (right), with shared parameters qz, qx1|z, qx2|z]
EXAMPLE: CORRELATED VARIABLES
[Plate notation: z → x1, x2; parameters qz, qx1|z, qx2|z; plate of size M]
Suppose 2 types:
1. X1 != X2, at random
2. X1, X2 = 1, 1 with 90% chance, 0, 0 otherwise
Type 1 is drawn 75% of the time
X Dataset: (1,1): 222; (1,0): 382; (0,1): 364; (0,0): 32
Initial parameter estimates: qz = 0.5; qx1|z=1 = 0.4, qx1|z=2 = 0.3; qx2|z=1 = 0.7, qx2|z=2 = 0.6
Estimated Z's (step 1, holding the parameters fixed): (1,1) → type 1; (1,0) → type 1; (0,1) → type 2; (0,0) → type 2
Updated parameter estimates (step 2): qz = 0.604; qx1|z=1 = 1, qx1|z=2 = 0; qx2|z=1 = 0.368, qx2|z=2 = 0.919
Estimated Z's: (1,1) → type 1; (1,0) → type 1; (0,1) → type 2; (0,0) → type 2
The parameter estimates and Z assignments no longer change: converged (true ML estimate)
EXAMPLE: CORRELATED VARIABLES
[Plate notation: z → x1, x2, x3, x4; parameters qz, qx1|z, qx2|z, qx3|z, qx4|z; plate of size M]
Random initial guess:
qZ = 0.44
qX1|Z=1 = 0.97, qX2|Z=1 = 0.21, qX3|Z=1 = 0.87, qX4|Z=1 = 0.57
qX1|Z=2 = 0.07, qX2|Z=2 = 0.97, qX3|Z=2 = 0.71, qX4|Z=2 = 0.03
Log likelihood -5176
X Dataset (counts; rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0             115   142    20    47
0,1              32    16    37    75
1,0              12   117    39    58
1,1             133    92    45    20
EXAMPLE: E STEP
(Same X dataset and random initial guess as above)
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     2     2     2
1,0              1     1     1     1
1,1              2     1     1     1
Log likelihood -4401
EXAMPLE: M STEP
Current estimates:
qZ = 0.43
qX1|Z=1 = 0.67, qX2|Z=1 = 0.27, qX3|Z=1 = 0.37, qX4|Z=1 = 0.83
qX1|Z=2 = 0.31, qX2|Z=2 = 0.68, qX3|Z=2 = 0.31, qX4|Z=2 = 0.21
Log likelihood -3033
(Z assignments unchanged from the previous E step)
EXAMPLE: E STEP
(Parameter estimates unchanged from the previous M step)
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     2     2     1
1,0              1     1     1     1
1,1              2     1     2     1
Log likelihood -2965
EXAMPLE: E STEP
Current estimates:
qZ = 0.40
qX1|Z=1 = 0.56, qX2|Z=1 = 0.31, qX3|Z=1 = 0.40, qX4|Z=1 = 0.92
qX1|Z=2 = 0.45, qX2|Z=2 = 0.66, qX3|Z=2 = 0.26, qX4|Z=2 = 0.04
Log likelihood -2859
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     2     2     2
1,0              1     1     1     1
1,1              2     1     1     1
EXAMPLE: LAST E-M STEP
Current estimates:
qZ = 0.43
qX1|Z=1 = 0.51, qX2|Z=1 = 0.36, qX3|Z=1 = 0.35, qX4|Z=1 = 1
qX1|Z=2 = 0.53, qX2|Z=2 = 0.57, qX3|Z=2 = 0.33, qX4|Z=2 = 0
Log likelihood -2683
Z Assignments (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0              2     1     2     1
0,1              2     1     2     1
1,0              2     1     2     1
1,1              2     1     2     1
PROBLEM: MANY LOCAL MINIMA
Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape!
Solution: EM using the marginal likelihood formulation, i.e., "soft" EM (this is the typical form of the EM algorithm)
EXPECTATION MAXIMIZATION FOR MML
arg max_q L(q; D) = arg max_q E_{DZ|D,q}[ L(q; DZ, D) ]
Do arg max_q E_{DZ|D,q}[ log L(q; DZ, D) ] instead (justified later)
Step 1 (expectation): Given the current fixed qt, find P(DZ | qt, D), i.e., compute a distribution over each Z[m]
Step 2 (maximization): Use these probabilities in the expectation E_{DZ|D,qt}[ log L(q, DZ; D) ] = Q(q). Now find max_q Q(q) (fully observed, weighted ML parameter estimation)
Repeat steps 1 (expectation) and 2 (maximization) until convergence
E STEP IN DETAIL
Ultimately, want to maximize Q(q | qt) = E_{DZ|D,qt}[ log L(q; DZ, D) ] over q
Q(q | qt) = Σm Σ_{z[m]} P(z[m] | x[m], qt) log P(x[m], z[m] | q)
The E step computes the terms
w_{m,z}(qt) = P(Z[m] = z | D, qt)
over all examples m and all z ∈ Val(Z)
M STEP IN DETAIL
arg max_q Q(q | qt) = arg max_q Σm Σz w_{m,z}(qt) log P(x[m], z[m]=z | q) = arg max_q Πm Πz P(x[m], z[m]=z | q)^(w_{m,z}(qt))
This is weighted ML: each assignment z[m] = z is interpreted to be observed w_{m,z}(qt) times
Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case
EXAMPLE: BERNOULLI PARAMETER FOR Z
qZ* = arg max_{qZ} Σm Σz w_{m,z} log P(x[m], z[m]=z | qZ)
= arg max_{qZ} Σm Σz w_{m,z} log(I[z=1] qZ + I[z=0] (1 − qZ))
= arg max_{qZ} [ log(qZ) Σm w_{m,z=1} + log(1 − qZ) Σm w_{m,z=0} ]
=> qZ* = (Σm w_{m,z=1}) / Σm (w_{m,z=1} + w_{m,z=0})
"Expected counts" M_{qt}[z] = Σm w_{m,z}(qt)
Express qZ* = M_{qt}[z=1] / M_{qt}[·]
EXAMPLE: BERNOULLI PARAMETERS FOR XI | Z
qXi|z=k* = arg max_{qXi|z=k} Σm w_{m,z=k} log P(x[m], z[m]=k | qXi|z=k)
= arg max_{qXi|z=k} Σm Σz w_{m,z} log(I[xi[m]=1, z=k] qXi|z=k + I[xi[m]=0, z=k] (1 − qXi|z=k))
= … (similar derivation)
=> qXi|z=k* = M_{qt}[xi=1, z=k] / M_{qt}[z=k]
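Putting the weights w_{m,z} and the expected-count formulas together, one full soft-EM iteration for the four-variable example can be sketched as follows. The function names and the convention qz = P(Z=1) (index 0 in the code) are mine; the counts and the random initial guess come from the slides.

```python
import math

# Counts from the four-variable example: key = (x1, x2, x3, x4).
counts = {
    (0, 0, 0, 0): 115, (0, 0, 0, 1): 142, (0, 0, 1, 0): 20,  (0, 0, 1, 1): 47,
    (0, 1, 0, 0): 32,  (0, 1, 0, 1): 16,  (0, 1, 1, 0): 37,  (0, 1, 1, 1): 75,
    (1, 0, 0, 0): 12,  (1, 0, 0, 1): 117, (1, 0, 1, 0): 39,  (1, 0, 1, 1): 58,
    (1, 1, 0, 0): 133, (1, 1, 0, 1): 92,  (1, 1, 1, 0): 45,  (1, 1, 1, 1): 20,
}

def loglik(qz, qx):
    # qz = P(Z=1); qx[k][i] = P(Xi=1 | Z=k+1)
    ll = 0.0
    for x, n in counts.items():
        tot = 0.0
        for k in (0, 1):
            p = qz if k == 0 else 1.0 - qz
            for i in range(4):
                p *= qx[k][i] if x[i] else 1.0 - qx[k][i]
            tot += p
        ll += n * math.log(tot)
    return ll

def soft_em_step(qz, qx):
    # E step: expected counts M[z=k] and M[xi=1, z=k] from the
    # weights w_{m,z} = P(Z[m]=z | x[m], theta_t).
    Mz = [0.0, 0.0]
    Mx = [[0.0] * 4 for _ in (0, 1)]
    for x, n in counts.items():
        joint = []
        for k in (0, 1):
            p = qz if k == 0 else 1.0 - qz
            for i in range(4):
                p *= qx[k][i] if x[i] else 1.0 - qx[k][i]
            joint.append(p)
        for k in (0, 1):
            w = n * joint[k] / (joint[0] + joint[1])
            Mz[k] += w
            for i in range(4):
                if x[i]:
                    Mx[k][i] += w
    # M step: weighted ML = ratios of expected counts.
    qz = Mz[0] / (Mz[0] + Mz[1])
    qx = [[Mx[k][i] / Mz[k] for i in range(4)] for k in (0, 1)]
    return qz, qx

# Random initial guess from the slides.
qz, qx = 0.44, [[0.97, 0.21, 0.87, 0.57], [0.07, 0.97, 0.71, 0.03]]
ll0 = loglik(qz, qx)
for _ in range(100):
    qz, qx = soft_em_step(qz, qx)
```

Each iteration can only increase the log likelihood, which is the monotonicity property claimed on the "why does it work" slide.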
EM ON PRIOR EXAMPLE (100 ITERATIONS)
[Plate notation: z → x1, x2, x3, x4; plate of size M; same X dataset as above]
Final estimates:
qZ = 0.49
qX1|Z=1 = 0.64, qX2|Z=1 = 0.88, qX3|Z=1 = 0.41, qX4|Z=1 = 0.46
qX1|Z=2 = 0.38, qX2|Z=2 = 0.00, qX3|Z=2 = 0.27, qX4|Z=2 = 0.68
Log likelihood -2833
P(Z=2 | x) (rows = x1,x2; columns = x3,x4):
x1,x2 \ x3,x4   0,0    0,1    1,0    1,1
0,0             0.90   0.95   0.84   0.93
0,1             0.00   0.00   0.00   0.00
1,0             0.76   0.89   0.64   0.82
1,1             0.00   0.00   0.00   0.00
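These posteriors follow from Bayes' rule on the naive-Bayes factorization. A quick check with the reported (two-decimal) final estimates — because the parameters are rounded, the recomputed values differ slightly from the table (about 0.92 vs. 0.90 for the (0,0),(0,0) cell):

```python
# Final estimates reported above, rounded to two decimals.
qz1 = 0.49                              # P(Z = 1)
qx_z1 = [0.64, 0.88, 0.41, 0.46]        # P(Xi = 1 | Z = 1)
qx_z2 = [0.38, 0.00, 0.27, 0.68]        # P(Xi = 1 | Z = 2)

def posterior_z2(x):
    # P(Z=2 | x) by Bayes' rule on the naive-Bayes factorization.
    p1, p2 = qz1, 1.0 - qz1
    for i in range(4):
        p1 *= qx_z1[i] if x[i] else 1.0 - qx_z1[i]
        p2 *= qx_z2[i] if x[i] else 1.0 - qx_z2[i]
    return p2 / (p1 + p2)

p00 = posterior_z2((0, 0, 0, 0))   # close to the table's 0.90
p01 = posterior_z2((0, 1, 0, 0))   # exactly 0: qX2|Z=2 = 0 rules out Z=2
```

Note how the rows with x2 = 1 get posterior 0.00: the learned qX2|Z=2 = 0 makes Z = 2 impossible whenever x2 = 1.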
CONVERGENCE
In general, no way to tell a priori how fast EM will converge
Soft EM is usually slower than hard EM
Still runs into local minima, but has more opportunities to coordinate parameter adjustments
[Plot: log likelihood (y axis, about -4000 to 0) vs. iteration count (x axis, 1 to 190)]
WHY DOES IT WORK?
Why are we optimizing over Q(q | qt) = Σm Σ_{z[m]} P(z[m] | x[m], qt) log P(x[m], z[m] | q) rather than the true marginalized likelihood L(q; D) = Πm Σ_{z[m]} P(x[m], z[m] | q)?
Can prove that:
The log likelihood is increased at every step
A stationary point of arg max_q E_{DZ|D,q}[ L(q; DZ, D) ] is a stationary point of log L(q; D)
See Koller & Friedman, pp. 882-884
GAUSSIAN CLUSTERING USING EM
One of the first uses of EM; widely used approach
Finding good starting points: k-means algorithm (hard assignment)
Handling degeneracies: regularization
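The three bullets above can be sketched in one minimal 1-D Gaussian-mixture example (synthetic data; all numbers illustrative, not from the slides): a k-means-style hard assignment supplies starting points, soft EM refines them, and a variance floor is a simple regularizer against degeneracies.

```python
import math
import random

random.seed(2)

# Synthetic 1-D data from two well-separated Gaussians.
data = [random.gauss(0.0, 1.0) for _ in range(300)] + \
       [random.gauss(8.0, 1.0) for _ in range(300)]

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# k-means-style hard assignment to get starting points.
mu = [min(data), max(data)]
for _ in range(10):
    groups = ([], [])
    for x in data:
        groups[0 if abs(x - mu[0]) <= abs(x - mu[1]) else 1].append(x)
    mu = [sum(g) / len(g) for g in groups]

pi, var = [0.5, 0.5], [1.0, 1.0]
VAR_FLOOR = 1e-3                  # regularization against degeneracies
for _ in range(50):
    # E step: responsibilities (soft assignments).
    r = [[pi[k] * npdf(x, mu[k], var[k]) for k in (0, 1)] for x in data]
    r = [[a / (a + b), b / (a + b)] for a, b in r]
    # M step: weighted ML updates.
    Nk = [sum(w[k] for w in r) for k in (0, 1)]
    pi = [Nk[k] / len(data) for k in (0, 1)]
    mu = [sum(w[k] * x for w, x in zip(r, data)) / Nk[k] for k in (0, 1)]
    var = [max(sum(w[k] * (x - mu[k]) ** 2 for w, x in zip(r, data)) / Nk[k],
               VAR_FLOOR) for k in (0, 1)]
```

Without the variance floor, a component can collapse onto a single data point (variance → 0, likelihood → ∞), which is the degeneracy the slide refers to.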
RECAP
Learning with hidden variables (typically categorical)