CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING Parameter Learning with Hidden Variables & Expectation Maximization


Page 1

CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning with Hidden Variables & Expectation Maximization

Page 2

AGENDA

Learning probability distributions from data, in the setting of known structure but missing data

Expectation-maximization (EM) algorithm

Page 3

BASIC PROBLEM

Given a dataset D={x[1],…,x[M]} and a Bayesian model over observed variables X and hidden (latent) variables Z

Fit the distribution P(X,Z) to the data.

Interpretation: each example x[m] is an incomplete view of the "underlying" sample (x[m], z[m])

[Diagram: Z → X]

Page 4

APPLICATIONS

Clustering in data mining
Dimensionality reduction
Latent psychological traits (e.g., intelligence, personality)
Document classification
Human activity recognition

Page 5

HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS

Hidden variables => conditional independences

[Diagram: left, Z with arrows to X1, X2, X3, X4; right, the same X1–X4 with Z removed and edges among all of them]

Without Z, the observables become fully dependent

Page 6

HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS

Hidden variables => conditional independences

[Diagram: left, Z with arrows to X1, X2, X3, X4; right, the same X1–X4 with Z removed and edges among all of them]

Without Z, the observables become fully dependent

1 + 4·2 = 9 parameters (with Z)

1 + 2 + 4 + 8 = 15 parameters (without Z)
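The two counts can be checked with simple arithmetic; this sketch assumes binary Z and binary X1..X4, with the fully dependent case modeled as a complete chain:

```python
# Parameter counts for binary variables (illustrative check of the slide).
# With hidden Z: one parameter for P(Z=1), plus P(Xi=1 | Z=z) for each of
# the 4 observables and each of the 2 values of Z.
with_z = 1 + 4 * 2

# Without Z: X1..X4 become fully dependent, e.g. a chain in which Xi has
# parents X1..X(i-1), needing 2^(i-1) parameters per variable.
without_z = sum(2 ** (i - 1) for i in range(1, 5))  # 1 + 2 + 4 + 8

print(with_z, without_z)  # 9 15
```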

Page 7

GENERATING MODEL

[Diagram: for each example m = 1…M, node z[m] with an arrow to x[m]; all z[m] share the CPT qz and all x[m] share the CPT qx|z]

These CPTs are shared (identical) across all examples.

Page 8

EXAMPLE: DISCRETE VARIABLES

[Diagram: same model as page 7: z[m] → x[m] for m = 1…M, with shared CPTs qz and qx|z]

The hidden variable has a categorical distribution given by parameters qz:

P(z[m] | qz) = Categorical(qz)

Each observation has a categorical distribution selected by z[m]:

P(x[m] | z[m], qx|z) = Categorical(qx|z[m])

(In other words, z[m] multiplexes between categorical distributions.)
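As a concrete illustration of this multiplexing, here is a minimal sampler for the generating model; the parameter values are made up for the example:

```python
import random

# z is drawn from Categorical(qz); z then selects which categorical
# distribution generates x (the multiplexing described above).
def sample(qz, qx_given_z, rng):
    z = rng.choices(range(len(qz)), weights=qz)[0]
    x = rng.choices(range(len(qx_given_z[z])), weights=qx_given_z[z])[0]
    return z, x

rng = random.Random(0)
qz = [0.75, 0.25]                      # P(Z=0) = 0.75, P(Z=1) = 0.25
qx_given_z = [[0.5, 0.5], [0.1, 0.9]]  # P(X | Z=0) and P(X | Z=1)
samples = [sample(qz, qx_given_z, rng) for _ in range(10000)]
print(sum(z for z, _ in samples) / len(samples))  # roughly 0.25
```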

Page 9

MAXIMUM LIKELIHOOD ESTIMATION

Approach: find values of q = (qz, qx|z) and DZ = (z[1],…,z[M]) that maximize the likelihood of the data:

L(q, DZ; D) = P(D | q, DZ)

Find arg max L(q, DZ; D) over q, DZ

Page 10

MARGINAL LIKELIHOOD ESTIMATION

Approach: find values of q = (qz, qx|z) that maximize the likelihood of the data, without assuming values of DZ = (z[1],…,z[M]):

L(q; D) = Σ_DZ P(D, DZ | q)

Find arg max L(q; D) over q

(A partially Bayesian approach)

Page 11

COMPUTATIONAL CHALLENGES

P(D | q, DZ) and P(D, DZ | q) are easy to evaluate, but…

Maximum likelihood, arg max L(q, DZ; D): optimizing over M assignments to Z (|Val(Z)|^M possible joint assignments) as well as continuous parameters.

Maximum marginal likelihood, arg max L(q; D): optimizing locally over continuous parameters, but the objective requires summing over M assignments to Z.

Page 12

EXPECTATION MAXIMIZATION FOR ML

Idea: use a coordinate ascent approach:

arg max_{q,DZ} L(q, DZ; D) = arg max_q max_DZ L(q, DZ; D)

Step 1: Finding DZ* = arg max_DZ L(q, DZ; D) is easy given a fixed q: assign each z[m] its most likely value.

Step 2: Set Q(q) = L(q, DZ*; D). Finding q* = arg max_q Q(q) is easy given that DZ is fixed: fully observed ML parameter estimation.

Repeat steps 1 and 2 until convergence.
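A minimal sketch of this hard-EM loop for the binary naive Bayes model used in the following slides; the function names, random initialization range, and clamping constant are my own choices, not from the slides:

```python
import math
import random

def hard_em(data, iters=50, seed=0):
    """Coordinate ascent: alternate best assignments z[m] (step 1)
    and fully observed ML parameter estimation (step 2)."""
    rng = random.Random(seed)
    n = len(data[0])
    qz = rng.uniform(0.25, 0.75)   # P(Z=1)
    qx = [[rng.uniform(0.25, 0.75) for _ in range(n)] for _ in range(2)]

    def logp(x, z):                # log P(x, z | q)
        lp = math.log(qz if z == 1 else 1 - qz)
        for j, xj in enumerate(x):
            p = qx[z][j]
            lp += math.log(p if xj == 1 else 1 - p)
        return lp

    def clamp(p):                  # keep parameters away from 0/1 (log(0))
        return min(max(p, 1e-6), 1 - 1e-6)

    zs = []
    for _ in range(iters):
        # Step 1: best z[m] for each example, given fixed parameters
        zs = [max((0, 1), key=lambda z: logp(x, z)) for x in data]
        # Step 2: fully observed ML re-estimation, given fixed assignments
        qz = clamp(sum(zs) / len(data))
        for k in (0, 1):
            grp = [x for x, z in zip(data, zs) if z == k]
            for j in range(n):
                qx[k][j] = clamp(sum(x[j] for x in grp) / max(len(grp), 1))
    return qz, qx, zs
```

On data such as the count table in the next slides (222 copies of (1,1), 382 of (1,0), and so on), this alternation typically settles within a few iterations.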

Page 13

EXAMPLE: CORRELATED VARIABLES

[Diagram: plate notation, with z → x1 and z → x2 inside a plate of size M and CPTs qz, qx1|z, qx2|z; beside it, the unrolled network z[1] → (x1[1], x2[1]) through z[M] → (x1[M], x2[M]) sharing the same CPTs]

Plate notation / unrolled network

Page 14

EXAMPLE: CORRELATED VARIABLES

[Diagram: plate notation, z → x1 and z → x2 with CPTs qz, qx1|z, qx2|z, plate size M]

Suppose 2 types:
1. X1 != X2, at random
2. X1,X2 = 1,1 with 90% chance, 0,0 otherwise
Type 1 drawn 75% of the time

X Dataset
• (1,1): 222
• (1,0): 382
• (0,1): 364
• (0,0): 32

Page 15

EXAMPLE: CORRELATED VARIABLES

[Diagram, setup, and X dataset as on page 14]

Parameter Estimates
qz = 0.5
qx1|z=1 = 0.4, qx1|z=2 = 0.3
qx2|z=1 = 0.7, qx2|z=2 = 0.6

Page 16

EXAMPLE: CORRELATED VARIABLES

[Diagram, setup, and X dataset as on page 14]

Parameter Estimates
qz = 0.5
qx1|z=1 = 0.4, qx1|z=2 = 0.3
qx2|z=1 = 0.7, qx2|z=2 = 0.6

Estimated Z's
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2

Page 17

EXAMPLE: CORRELATED VARIABLES

[Diagram, setup, and X dataset as on page 14]

Parameter Estimates
qz = 0.604
qx1|z=1 = 1, qx1|z=2 = 0
qx2|z=1 = 0.368, qx2|z=2 = 0.919

Estimated Z's
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2

Page 18

EXAMPLE: CORRELATED VARIABLES

[Diagram, setup, and X dataset as on page 14]

Parameter Estimates
qz = 0.604
qx1|z=1 = 1, qx1|z=2 = 0
qx2|z=1 = 0.368, qx2|z=2 = 0.919

Estimated Z's
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2

Converged (true ML estimate)

Page 19

EXAMPLE: CORRELATED VARIABLES

[Diagram: plate notation, z → x1, x2, x3, x4 with CPTs qz, qx1|z, qx2|z, qx3|z, qx4|z, plate size M]

Random initial guess
qZ = 0.44
qX1|Z=1 = 0.97, qX1|Z=2 = 0.07
qX2|Z=1 = 0.21, qX2|Z=2 = 0.97
qX3|Z=1 = 0.87, qX3|Z=2 = 0.71
qX4|Z=1 = 0.57, qX4|Z=2 = 0.03

Log likelihood -5176

X Dataset (counts; rows x1,x2; columns x3,x4)
x1,x2 \ x3,x4:  0,0   0,1   1,0   1,1
0,0:            115   142    20    47
0,1:             32    16    37    75
1,0:             12   117    39    58
1,1:            133    92    45    20

Page 20

EXAMPLE: E STEP

[Diagram and X dataset as on page 19]

Z Assignments (rows x1,x2; columns x3,x4)
x1,x2 \ x3,x4:  0,0  0,1  1,0  1,1
0,0:              2    1    2    1
0,1:              2    2    2    2
1,0:              1    1    1    1
1,1:              2    1    1    1

Random initial guess
qZ = 0.44
qX1|Z=1 = 0.97, qX1|Z=2 = 0.07
qX2|Z=1 = 0.21, qX2|Z=2 = 0.97
qX3|Z=1 = 0.87, qX3|Z=2 = 0.71
qX4|Z=1 = 0.57, qX4|Z=2 = 0.03

Log likelihood -4401

Page 21

EXAMPLE: M STEP

[Diagram and X dataset as on page 19]

Current estimates
qZ = 0.43
qX1|Z=1 = 0.67, qX1|Z=2 = 0.31
qX2|Z=1 = 0.27, qX2|Z=2 = 0.68
qX3|Z=1 = 0.37, qX3|Z=2 = 0.31
qX4|Z=1 = 0.83, qX4|Z=2 = 0.21

Log likelihood -3033

Z Assignments (rows x1,x2; columns x3,x4)
x1,x2 \ x3,x4:  0,0  0,1  1,0  1,1
0,0:              2    1    2    1
0,1:              2    2    2    2
1,0:              1    1    1    1
1,1:              2    1    1    1

Page 22

EXAMPLE: E STEP

[Diagram and X dataset as on page 19]

Current estimates
qZ = 0.43
qX1|Z=1 = 0.67, qX1|Z=2 = 0.31
qX2|Z=1 = 0.27, qX2|Z=2 = 0.68
qX3|Z=1 = 0.37, qX3|Z=2 = 0.31
qX4|Z=1 = 0.83, qX4|Z=2 = 0.21

Log likelihood -2965

Z Assignments (rows x1,x2; columns x3,x4)
x1,x2 \ x3,x4:  0,0  0,1  1,0  1,1
0,0:              2    1    2    1
0,1:              2    2    2    1
1,0:              1    1    1    1
1,1:              2    1    2    1

Page 23

EXAMPLE: E STEP

[Diagram and X dataset as on page 19]

Current estimates
qZ = 0.40
qX1|Z=1 = 0.56, qX1|Z=2 = 0.45
qX2|Z=1 = 0.31, qX2|Z=2 = 0.66
qX3|Z=1 = 0.40, qX3|Z=2 = 0.26
qX4|Z=1 = 0.92, qX4|Z=2 = 0.04

Log likelihood -2859

Z Assignments (rows x1,x2; columns x3,x4)
x1,x2 \ x3,x4:  0,0  0,1  1,0  1,1
0,0:              2    1    2    1
0,1:              2    2    2    2
1,0:              1    1    1    1
1,1:              2    1    1    1

Page 24

EXAMPLE: LAST E-M STEP

[Diagram and X dataset as on page 19]

Current estimates
qZ = 0.43
qX1|Z=1 = 0.51, qX1|Z=2 = 0.53
qX2|Z=1 = 0.36, qX2|Z=2 = 0.57
qX3|Z=1 = 0.35, qX3|Z=2 = 0.33
qX4|Z=1 = 1, qX4|Z=2 = 0

Log likelihood -2683

Z Assignments (rows x1,x2; columns x3,x4)
x1,x2 \ x3,x4:  0,0  0,1  1,0  1,1
0,0:              2    1    2    1
0,1:              2    1    2    1
1,0:              2    1    2    1
1,1:              2    1    2    1

Page 25

PROBLEM: MANY LOCAL MINIMA

Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape!

Solution: EM using the marginal likelihood formulation: "soft" EM. (This is the typical form of the EM algorithm.)

Page 26

EXPECTATION MAXIMIZATION FOR MML

arg max_q L(q; D) = arg max_q E_{DZ|D,q}[L(q; DZ, D)]

Do arg max_q E_{DZ|D,q}[log L(q; DZ, D)] instead (justified later).

Step 1 (expectation): Given the current fixed qt, find P(DZ | qt, D), i.e., compute a distribution over each z[m].

Step 2 (maximization): Use these probabilities in the expectation Q(q) = E_{DZ|D,qt}[log L(q, DZ; D)], and find max_q Q(q). This is fully observed, weighted ML parameter estimation.

Repeat steps 1 (expectation) and 2 (maximization) until convergence.
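The two steps can be sketched for the binary naive Bayes model of the running example; this is my own minimal implementation, not code from the slides, and it also tracks the log likelihood to illustrate the guarantee claimed later:

```python
import math

def soft_em(data, qz, qx, iters=100):
    """Soft EM for one binary hidden Z and binary observables.
    qz = P(Z=1); qx[k][j] = P(Xj=1 | Z=k).  Returns updated parameters
    and the per-iteration log likelihood."""
    lls = []
    for _ in range(iters):
        # E step: responsibilities w[m][k] = P(Z=k | x[m], q_t)
        W, ll = [], 0.0
        for x in data:
            joint = []
            for k in (0, 1):
                p = qz if k == 1 else 1 - qz
                for j, xj in enumerate(x):
                    p *= qx[k][j] if xj == 1 else 1 - qx[k][j]
                joint.append(p)
            s = joint[0] + joint[1]          # P(x[m] | q_t)
            ll += math.log(s)
            W.append([joint[0] / s, joint[1] / s])
        lls.append(ll)
        # M step: weighted ML using expected counts
        Mz = [sum(w[k] for w in W) for k in (0, 1)]
        qz = Mz[1] / len(data)
        qx = [[sum(w[k] for x, w in zip(data, W) if x[j] == 1) / Mz[k]
               for j in range(len(data[0]))] for k in (0, 1)]
    return qz, qx, lls
```

Unlike hard EM, no example is ever forced into a single cluster; each contributes fractionally to both components' expected counts.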

Page 27

E STEP IN DETAIL

Ultimately, want to maximize Q(q | qt) = E_{DZ|D,qt}[log L(q; DZ, D)] over q:

Q(q | qt) = Σm Σ_z[m] P(z[m] | x[m], qt) log P(x[m], z[m] | q)

The E step computes the terms

w_{m,z}(qt) = P(Z[m] = z | D, qt)

over all examples m and all z ∈ Val(Z).

Page 28

M STEP IN DETAIL

arg max_q Q(q | qt) = arg max_q Σm Σz w_{m,z}(qt) log P(x[m], z[m]=z | q)
= arg max_q Πm Πz P(x[m], z[m]=z | q)^(w_{m,z}(qt))

This is weighted ML: each z[m] is interpreted to be observed w_{m,z}(qt) times.

Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case.

Page 29

EXAMPLE: BERNOULLI PARAMETER FOR Z

qZ* = arg max_qZ Σm Σz w_{m,z} log P(x[m], z[m]=z | qZ)
= arg max_qZ Σm Σz w_{m,z} log(I[z=1] qZ + I[z=0](1 - qZ))
= arg max_qZ [log(qZ) Σm w_{m,z=1} + log(1 - qZ) Σm w_{m,z=0}]

⟹ qZ* = (Σm w_{m,z=1}) / Σm (w_{m,z=1} + w_{m,z=0})

"Expected counts": M_qt[z] = Σm w_{m,z}(qt). Then qZ* = M_qt[z=1] / M, where M = M_qt[z=1] + M_qt[z=0] is the total number of examples.
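In code the closed-form update is just a ratio of expected counts; `W` here is a hypothetical list of E-step weights [w_{m,z=0}, w_{m,z=1}], one pair per example:

```python
# Weighted ML for the Bernoulli parameter of Z: qZ* = M[z=1] / M,
# where M[z] sums the E-step weights ("expected counts").
def update_qz(W):
    M_z1 = sum(w[1] for w in W)      # M_qt[z=1]
    M = sum(w[0] + w[1] for w in W)  # total, equals the number of examples
    return M_z1 / M

W = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5], [0.4, 0.6]]
print(update_qz(W))  # (0.8 + 0.1 + 0.5 + 0.6) / 4 = 0.5
```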

Page 30

EXAMPLE: BERNOULLI PARAMETERS FOR XI | Z

qXi|z=k* = arg max_{qXi|z=k} Σm w_{m,z=k} log P(x[m], z[m]=k | qXi|z=k)
= arg max_{qXi|z=k} Σm Σz w_{m,z} log(I[xi[m]=1, z=k] qXi|z=k + I[xi[m]=0, z=k](1 - qXi|z=k))
= … (similar derivation)

⟹ qXi|z=k* = M_qt[xi=1, z=k] / M_qt[z=k]

Page 31

EM ON PRIOR EXAMPLE (100 ITERATIONS)

[Diagram and X dataset as on page 19]

Final estimates
qZ = 0.49
qX1|Z=1 = 0.64, qX1|Z=2 = 0.38
qX2|Z=1 = 0.88, qX2|Z=2 = 0.00
qX3|Z=1 = 0.41, qX3|Z=2 = 0.27
qX4|Z=1 = 0.46, qX4|Z=2 = 0.68

Log likelihood -2833

P(Z=2 | x) (rows x1,x2; columns x3,x4)
x1,x2 \ x3,x4:  0,0   0,1   1,0   1,1
0,0:           0.90  0.95  0.84  0.93
0,1:           0.00  0.00  0.00  0.00
1,0:           0.76  0.89  0.64  0.82
1,1:           0.00  0.00  0.00  0.00

Page 32

CONVERGENCE

In general, no way to tell a priori how fast EM will converge

Soft EM is usually slower than hard EM

Still runs into local minima, but has more opportunities to coordinate parameter adjustments

[Plot: log likelihood vs. iteration count over roughly 200 iterations, rising from about -4000 and leveling off]

Page 33

WHY DOES IT WORK?

Why are we optimizing Q(q | qt) = Σm Σ_z[m] P(z[m] | x[m], qt) log P(x[m], z[m] | q) rather than the true marginalized likelihood L(q; D) = Πm Σ_z[m] P(x[m], z[m] | q)?

Page 34

WHY DOES IT WORK?

Why are we optimizing Q(q | qt) = Σm Σ_z[m] P(z[m] | x[m], qt) log P(x[m], z[m] | q) rather than the true marginalized likelihood L(q; D) = Πm Σ_z[m] P(x[m], z[m] | q)?

Can prove that:
The log likelihood is increased (or left unchanged) at every step.
A stationary point of arg max_q E_{DZ|D,q}[L(q; DZ, D)] is a stationary point of log L(q; D).
See Koller & Friedman, pp. 882–884.

Page 35

GAUSSIAN CLUSTERING USING EM

One of the first uses of EM, and a widely used approach.
Finding good starting points: the k-means algorithm (hard assignment).
Handling degeneracies: regularization.
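A minimal 1-D, two-component version of Gaussian clustering with EM. An extreme-points initialization stands in for k-means, and a variance floor stands in for the regularization that guards against degeneracies; all names and constants here are illustrative, not prescribed by the slides:

```python
import math

def gmm_em_1d(xs, iters=50, var_floor=1e-3):
    # Crude initialization standing in for k-means: the two extremes.
    mu = [min(xs), max(xs)]
    pi, var = [0.5, 0.5], [1.0, 1.0]
    for _ in range(iters):
        # E step: responsibilities w[m][k] ∝ pi_k * N(x[m] | mu_k, var_k)
        W = []
        for x in xs:
            d = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            s = d[0] + d[1]
            W.append([d[0] / s, d[1] / s])
        # M step: weighted ML; the variance floor keeps a component from
        # collapsing onto a single point (the degenerate ML solution)
        for k in (0, 1):
            Nk = sum(w[k] for w in W)
            pi[k] = Nk / len(xs)
            mu[k] = sum(w[k] * x for x, w in zip(xs, W)) / Nk
            var[k] = max(sum(w[k] * (x - mu[k]) ** 2
                             for x, w in zip(xs, W)) / Nk, var_floor)
    return pi, mu, var
```

On data drawn from two well-separated clusters, the responsibilities quickly become near-hard and the means settle on the cluster centers.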

Page 36

RECAP

Learning with hidden variables (typically categorical)