
Page 1: 18-661 Introduction to Machine Learning

18-661 Introduction to Machine Learning

The EM Algorithm

Spring 2020

ECE – Carnegie Mellon University

Page 2: 18-661 Introduction to Machine Learning

Announcements

• HW 6 is due on Sunday, April 12.
• HW 7 will be released later this week.
• HW 7 will be entirely focused on programming, with the goal of letting you gain experience in implementing ML models.
• Lecture on April 13 will be a PyTorch tutorial by Ritwick. Please attend if you are not familiar with PyTorch. You will need to know your way around PyTorch for HW 7.
• You cannot drop HW 7. Its score will be weighted equally with each of your best 5 scores from the first six homeworks to determine your overall homework grade.
• The final exam will be on Wednesday, April 29. It will have online and take-home parts.

Page 3: 18-661 Introduction to Machine Learning

Outline

1. Recap: Gaussian Mixture Models

2. EM Algorithm


Page 4: 18-661 Introduction to Machine Learning

Recap: Gaussian Mixture Models

Page 5: 18-661 Introduction to Machine Learning

Probabilistic interpretation of clustering?

How can we model p(x) to reflect our intuition that points stay close to their cluster centers?

[Figure (b): scatter plot of the data points in [0, 1]²]

• Points seem to form 3 clusters
• We cannot model p(x) with simple and known distributions
• E.g., the data is not a Gaussian b/c we have 3 distinct concentrated regions...

Page 6: 18-661 Introduction to Machine Learning

Gaussian mixture models: Intuition

[Figure (a): the same data, with each region modeled by its own Gaussian]

• Key idea: Model each region with a distinct distribution
• Can use Gaussians — Gaussian mixture models (GMMs)
• However, we don’t know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components!
• Must learn these values from unlabeled data D = {x_n}_{n=1}^N

Page 7: 18-661 Introduction to Machine Learning

Recall: Gaussian (normal) distributions

x ∼ N(µ, Σ):

f(x) = exp( −(1/2) (x − µ)^T Σ^{-1} (x − µ) ) / √( (2π)^d |Σ| )

where d is the dimension of x.
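
As a quick sanity check, the density formula above can be evaluated directly and compared against scipy; the mean, covariance, and query point below are arbitrary illustrative values, not from the lecture.

```python
# Quick check of the Gaussian density formula above against scipy.
# The mean, covariance, and query point are arbitrary illustrative values.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])
d = len(x)

diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
         / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

print(manual, multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # the two values agree
```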

Page 8: 18-661 Introduction to Machine Learning

Gaussian mixture models: Formal definition

GMM has the following density function for x:

p(x) = ∑_{k=1}^K ω_k N(x | µ_k, Σ_k)

• K: number of Gaussians — they are called mixture components
• µ_k and Σ_k: mean and covariance matrix of the k-th component
• ω_k: mixture weights (or priors) represent how much each component contributes to the final distribution. They satisfy 2 properties:

∀ k, ω_k > 0,  and  ∑_k ω_k = 1

These properties ensure p(x) is in fact a probability density function.
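
A minimal sketch of evaluating this mixture density in Python; the weights, means, and covariances below are made-up illustrative values, not estimates from data.

```python
# Minimal sketch: evaluating the GMM density p(x) = sum_k omega_k N(x | mu_k, Sigma_k).
# The weights, means, and covariances are made-up illustrative values.
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.5, 0.3, 0.2])                       # omega_k: positive and summing to 1
means = [np.array([0.2, 0.3]), np.array([0.7, 0.7]), np.array([0.5, 0.1])]
covs = [0.01 * np.eye(2) for _ in range(3)]               # Sigma_k

def gmm_density(x):
    """p(x) = sum_k omega_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([0.25, 0.3])))                 # large: near the first component
print(gmm_density(np.array([0.95, 0.05])))                # small: far from all components
```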

Page 9: 18-661 Introduction to Machine Learning

GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

p(x, z) = p(z) p(x|z)

where z is a discrete random variable taking values between 1 and K. Denote

ω_k = p(z = k)

Now, assume the conditional distributions are Gaussian distributions:

p(x|z = k) = N(x | µ_k, Σ_k)

Then, the marginal distribution of x is

p(x) = ∑_{k=1}^K ω_k N(x | µ_k, Σ_k)

Namely, the Gaussian mixture model.

Page 10: 18-661 Introduction to Machine Learning

Gaussian mixture model for clustering


Page 11: 18-661 Introduction to Machine Learning

GMMs: Example

[Figure (a): the data points colored red, blue, and green by cluster]

The conditional distributions between x and z (representing color) are

p(x|z = red) = N(x | µ_1, Σ_1)
p(x|z = blue) = N(x | µ_2, Σ_2)
p(x|z = green) = N(x | µ_3, Σ_3)

[Figure (b): the same data without cluster labels]

The marginal distribution is thus

p(x) = p(z = red) N(x | µ_1, Σ_1) + p(z = blue) N(x | µ_2, Σ_2) + p(z = green) N(x | µ_3, Σ_3)

Page 12: 18-661 Introduction to Machine Learning

Parameter estimation for Gaussian mixture models

The parameters in GMMs are:

θ = {ω_k, µ_k, Σ_k}_{k=1}^K

Let’s first consider the (unrealistic) case where we know the labels z. Define D′ = {x_n, z_n}_{n=1}^N and D = {x_n}_{n=1}^N.

• D′ is the complete data
• D is the incomplete data

How can we learn our parameters?

Given D′, the maximum likelihood estimate of θ is given by

θ = arg max_θ ∑_n log p(x_n, z_n)

Page 13: 18-661 Introduction to Machine Learning

Parameter estimation for GMMs: Complete data

The complete likelihood is decomposable across the labels:

∑_n log p(x_n, z_n) = ∑_n log p(z_n) p(x_n|z_n) = ∑_k ∑_{n: z_n = k} log p(z_n) p(x_n|z_n)

where we have grouped data by cluster labels z_n.

Let r_nk ∈ {0, 1} be a binary variable that indicates whether z_n = k:

∑_n log p(x_n, z_n) = ∑_k ∑_n r_nk log p(z = k) p(x_n|z = k)
                    = ∑_k ∑_n r_nk [ log ω_k + log N(x_n | µ_k, Σ_k) ]

In the complete setting, the r_nk just add to the notation, but later we will ‘relax’ these variables and allow them to take on fractional values.

Page 14: 18-661 Introduction to Machine Learning

Parameter estimation for GMMs: Complete data

The MLE is:

ω_k = ∑_n r_nk / ∑_{k′} ∑_n r_nk′ ,    µ_k = (1 / ∑_n r_nk) ∑_n r_nk x_n

Σ_k = (1 / ∑_n r_nk) ∑_n r_nk (x_n − µ_k)(x_n − µ_k)^T

Since r_nk is binary, the previous solution is nothing but:

• ω_k: fraction of total data points whose cluster label z_n is k (note that ∑_{k′} ∑_n r_nk′ = N)
• µ_k: mean of all data points whose label z_n is k
• Σ_k: covariance of all data points whose label z_n is k
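
These complete-data estimates can be written out directly in numpy. The sketch below assumes hypothetical arrays X (N × d data) and z (length-N integer labels); it is just the formulas above, not a reference implementation.

```python
# Sketch of the complete-data MLE, assuming hypothetical arrays X (N x d data)
# and z (length-N integer labels in {0, ..., K-1}).
import numpy as np

def gmm_mle_complete(X, z, K):
    N, d = X.shape
    R = np.zeros((N, K))
    R[np.arange(N), z] = 1.0                 # hard (binary) r_nk
    Nk = R.sum(axis=0)                       # number of points in each cluster
    weights = Nk / N                         # omega_k: fraction of points with label k
    means = (R.T @ X) / Nk[:, None]          # mu_k: per-cluster mean
    covs = np.zeros((K, d, d))
    for k in range(K):
        diff = X - means[k]
        covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k]   # Sigma_k: per-cluster covariance
    return weights, means, covs
```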

Page 15: 18-661 Introduction to Machine Learning

Parameter estimation for GMMs: Incomplete data

GMM Parameters

θ = {ω_k, µ_k, Σ_k}_{k=1}^K

Incomplete Data

Our data contains observed and unobserved data, and hence is incomplete:

• Observed: D = {x_n}
• Unobserved (hidden): {z_n}

Goal: Obtain the maximum likelihood estimate of θ:

θ = arg max log p(D) = arg max ∑_n log p(x_n) = arg max ∑_n log ∑_{z_n} p(x_n, z_n)

The objective function is called the incomplete log-likelihood.

Page 16: 18-661 Introduction to Machine Learning

Parameter estimation for GMMs: Incomplete data

No simple way to optimize the incomplete log-likelihood...

The EM algorithm provides a strategy for iteratively optimizing this function!

E-Step: ‘Guess’ values of the z_n using existing values of θ = {ω_k, µ_k, Σ_k}_{k=1}^K. How do we do this?

When z_n is not given, we can guess it via the posterior probability (recall: Bayes’ rule!)

p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n)
                 = p(x_n | z_n = k) p(z_n = k) / ∑_{k′=1}^K p(x_n | z_n = k′) p(z_n = k′)
                 = ω_k N(x_n | µ_k, Σ_k) / ∑_{k′=1}^K ω_{k′} N(x_n | µ_{k′}, Σ_{k′})

Page 17: 18-661 Introduction to Machine Learning

Estimation with soft rnk

We define r_nk = p(z_n = k | x_n).

• Recall that r_nk was previously binary
• Now it’s a “soft” assignment of x_n to the k-th component
• Each x_n is assigned to a component fractionally according to p(z_n = k | x_n)

M-Step: If we solve for the MLE of θ = {ω_k, µ_k, Σ_k}_{k=1}^K given soft r_nk s, we get the same expressions as before!

ω_k = ∑_n r_nk / ∑_{k′} ∑_n r_nk′ ,    µ_k = (1 / ∑_n r_nk) ∑_n r_nk x_n

Σ_k = (1 / ∑_n r_nk) ∑_n r_nk (x_n − µ_k)(x_n − µ_k)^T

But remember, we’re ‘cheating’ by using θ to compute r_nk!

Page 18: 18-661 Introduction to Machine Learning

EM procedure for GMM

Alternate between estimating r_nk and estimating the parameters θ:

• Step 0: Initialize θ with some values (random or otherwise)
• Step 1: E-Step: Set r_nk = p(z_n = k | x_n) with the current values of θ, using Bayes’ rule
• Step 2: M-Step: Update θ via MLE using the r_nk s from Step 1
• Step 3: Go back to Step 1.

At the end, convert r_nk back to binary by setting the largest r_nk for point x_n to 1 and the others to 0 (a code sketch of the whole loop is given below).
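
Below is a minimal numpy/scipy sketch of this loop, assuming X is an (N, d) array. The random initialization of the means and the small covariance regularizer are my own choices, not prescribed by the lecture.

```python
# Minimal sketch of EM for a GMM; X is an (N, d) numpy array.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: initialize theta = {omega_k, mu_k, Sigma_k}
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: r_nk = omega_k N(x_n | mu_k, Sigma_k) / sum_k' omega_k' N(x_n | mu_k', Sigma_k')
        R = np.column_stack([
            weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        R /= R.sum(axis=1, keepdims=True)

        # M-step: the same MLE expressions as before, now with soft r_nk
        Nk = R.sum(axis=0)
        weights = Nk / N
        means = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)

    return weights, means, covs, R.argmax(axis=1)   # hard assignments at the end
```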

Page 19: 18-661 Introduction to Machine Learning

GMMs and K-means

GMMs provide a probabilistic interpretation for K-means.

GMMs reduce to K-means under the following assumptions (in which case EM for GMM parameter estimation simplifies to K-means):

• Assume all Gaussians have σ²I covariance matrices
• Further assume σ → 0, so we only need to estimate µ_k, i.e., the means

K-means is often called “hard” GMM, or GMMs are called “soft” K-means.

The posterior r_nk provides a probabilistic assignment of x_n to cluster k.

Page 20: 18-661 Introduction to Machine Learning

GMMs vs. k-means

Pros/Cons

• k-means is a simpler, more straightforward method, but might not be as accurate because of deterministic clustering
• GMMs can be more accurate, as they model more information (soft clustering, variance), but can be more expensive to compute
• Both methods have a similar set of practical issues (having to select k, the distance, and the initialization)

Page 21: 18-661 Introduction to Machine Learning

EM Algorithm

Page 22: 18-661 Introduction to Machine Learning

EM algorithm: Motivation

EM is a general procedure to estimate parameters for probabilistic models with hidden/latent variables.

Page 23: 18-661 Introduction to Machine Learning

EM algorithm: Setup

• Suppose the model is given by a joint distribution p(x, z | θ), so that the marginal is

p(x | θ) = ∑_z p(x, z | θ)

• Given incomplete data D = {x_n}, our goal is to compute the MLE of θ:

θ = arg max ℓ(θ) = arg max log p(D) = arg max ∑_n log p(x_n | θ) = arg max ∑_n log ∑_{z_n} p(x_n, z_n | θ)

• The objective function ℓ(θ) is called the incomplete log-likelihood
• The log-sum form of the incomplete log-likelihood is difficult to work with

Page 24: 18-661 Introduction to Machine Learning

EM: Principle

• EM: construct a lower bound on ℓ(θ) (E-step) and optimize it (M-step)
• This is an instance of “minorization-maximization (MM)”
• Optimizing the lower bound will hopefully optimize ℓ(θ) too.

Page 25: 18-661 Introduction to Machine Learning

Constructing a lower bound

• If we define q(z) as a distribution over z, then

ℓ(θ) = ∑_n log ∑_{z_n} p(x_n, z_n | θ)
     = ∑_n log ∑_{z_n} q(z_n) p(x_n, z_n | θ) / q(z_n)
     ≥ ∑_n ∑_{z_n} q(z_n) log [ p(x_n, z_n | θ) / q(z_n) ]

• The last step follows from Jensen’s inequality, i.e., f(E X) ≥ E f(X) for the concave function f(x) = log(x).

Page 26: 18-661 Introduction to Machine Learning

Detour: Jensen’s inequality

Jensen’s inequality: if f(·) is a convex function, then for any random variable X,

f(E X) ≤ E f(X)

where equality holds when X is constant or f(·) is affine.

• Example: for f(x) = x², which is convex:

(E X)² ≤ E X²  ⟹  Var(X) = E X² − (E X)² ≥ 0

• Example: for f(x) = log x, which is concave:

log(E X) ≥ E log(X)
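
A quick numeric illustration of the concave case, using an arbitrary three-valued random variable (the values and probabilities are made up):

```python
# Numeric illustration of Jensen's inequality for the concave f(x) = log x.
import numpy as np

x = np.array([1.0, 2.0, 5.0])   # possible values of X
p = np.array([0.2, 0.5, 0.3])   # their probabilities

print(np.log(np.sum(p * x)))    # log(E X)  ~ 0.99
print(np.sum(p * np.log(x)))    # E log(X)  ~ 0.83, smaller, as Jensen predicts
```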

Page 27: 18-661 Introduction to Machine Learning

Applying Jensen’s inequality

If we define q(z) as a distribution over z, then

ℓ(θ) = ∑_n log ∑_{z_n} p(x_n, z_n | θ) = ∑_n log ∑_{z_n} q(z_n) p(x_n, z_n | θ) / q(z_n)

Each term has the form f( E_{q(z_n)}[ p(x_n, z_n|θ) / q(z_n) ] ). Apply Jensen’s inequality to each term, i.e., f(E X) ≥ E f(X), for the concave function f(·) = log(·). We take the expectation of X = p(x_n, z_n|θ) / q(z_n), a random variable depending on z_n, over the probability distribution q(z_n):

ℓ(θ) ≥ ∑_n ∑_{z_n} q(z_n) log [ p(x_n, z_n | θ) / q(z_n) ]

where each term on the right is E_{q(z_n)}[ f( p(x_n, z_n|θ) / q(z_n) ) ].

Page 28: 18-661 Introduction to Machine Learning

Which q(z) should we choose?

[Figure (a): the example data with three clusters (red, blue, green)]

• Consider the previous model where x could be from 3 regions
• We can choose q(z) as any valid distribution
  • e.g., q(z = k) = 1/3 for any of the 3 colors
  • e.g., q(z = k) = 1/2 for red and blue, 0 for green

Which q(z) should we choose?

Page 29: 18-661 Introduction to Machine Learning

Which q(z) to choose?

ℓ(θ) ≥ ∑_n ∑_{z_n} q(z_n) log [ p(x_n, z_n | θ) / q(z_n) ]

• The lower bound we derived for ℓ(θ) holds for all choices of q(·)
• We want a tight lower bound, so given some current estimate θ^t, we will pick q_t(·) such that our lower bound holds with equality at θ^t.
• We will choose a different q_t(·) for each iteration t.

Page 30: 18-661 Introduction to Machine Learning

Pick qt(zn)

Pick q_t(z_n) so that

ℓ(θ^t) = ∑_n log p(x_n | θ^t) = ∑_n ∑_{z_n} q_t(z_n) log [ p(x_n, z_n | θ^t) / q_t(z_n) ]

• Pick the distribution where the equality in Jensen’s inequality holds.
• Choose q_t(z_n) ∝ p(x_n, z_n | θ^t)!
• Since q_t(·) is a distribution, we have

q_t(z_n) = p(x_n, z_n | θ^t) / ∑_k p(x_n, z_n = k | θ^t) = p(x_n, z_n | θ^t) / p(x_n | θ^t) = p(z_n | x_n; θ^t)

• This is the posterior distribution of z_n given x_n and θ^t

Page 31: 18-661 Introduction to Machine Learning

E-Step in GMMs

GMM Parameters

θ = {ω_k, µ_k, Σ_k}_{k=1}^K

Incomplete Data

Our data contains observed and unobserved data, and hence is incomplete:

• Observed: D = {x_n}
• Unobserved (hidden) labels: {z_n}

Guess the distribution of z_n with the posterior probabilities, given estimates θ^t:

q_t(z_n = k) = p(z_n = k | x_n; θ^t) = p(x_n | z_n = k) p(z_n = k) / ∑_{k′=1}^K p(x_n | z_n = k′) p(z_n = k′)
             = ω_k^t N(x_n | µ_k^t, Σ_k^t) / ∑_{k′=1}^K ω_{k′}^t N(x_n | µ_{k′}^t, Σ_{k′}^t)

Page 32: 18-661 Introduction to Machine Learning

E- and M-Steps

Our lower bound for the log-likelihood:

ℓ(θ) ≥ ∑_n ∑_{z_n} p(z_n | x_n; θ^t) log [ p(x_n, z_n | θ) / p(z_n | x_n; θ^t) ]

Why is this called the E-Step? Because we can view it as computing the expected (complete) log-likelihood:

Q(θ | θ^t) = ∑_n ∑_{z_n} p(z_n | x_n; θ^t) log p(x_n, z_n | θ) = E_{q_t} [ ∑_n log p(x_n, z_n | θ) ]

where the term inside the expectation is the complete log-likelihood.

Where did the p(z_n | x_n; θ^t) in the denominator go? It is not a function of θ, so it can be ignored.

M-Step: Maximize Q(θ | θ^t), i.e., θ^{t+1} = arg max_θ Q(θ | θ^t)

Page 33: 18-661 Introduction to Machine Learning

EM in Pictures


Page 34: 18-661 Introduction to Machine Learning

EM in Pictures

Adapted from Jonathan Hui’s Medium tutorial.


Page 35: 18-661 Introduction to Machine Learning

Iterative and monotonic improvement

ℓ(θ) ≥ ∑_n ∑_{z_n} p(z_n | x_n; θ^t) log [ p(x_n, z_n | θ) / p(z_n | x_n; θ^t) ]   (call the right-hand side Q(θ | θ^t))

• We can show that ℓ(θ^{t+1}) ≥ ℓ(θ^t).
• Recall that we chose q_t(·) in the E-Step such that ℓ(θ^t) = Q(θ^t | θ^t)
• However, in the M-step, θ^{t+1} is chosen to maximize Q(θ | θ^t), thus

ℓ(θ^{t+1}) ≥ Q(θ^{t+1} | θ^t) = max_θ Q(θ | θ^t) ≥ Q(θ^t | θ^t) = ℓ(θ^t)

• Note: the EM procedure converges, but only to a local optimum

Page 36: 18-661 Introduction to Machine Learning

Example: Applying EM to GMMs

What is the E-Step in GMMs?

r_nk = p(z = k | x_n; θ^t)

What is the M-Step in GMMs? The Q-function is

Q(θ | θ^t) = ∑_n ∑_k p(z = k | x_n; θ^t) log p(x_n, z = k | θ)
           = ∑_n ∑_k r_nk log p(x_n, z = k | θ)
           = ∑_k ∑_n r_nk log p(z = k) p(x_n | z = k)
           = ∑_k ∑_n r_nk [ log ω_k + log N(x_n | µ_k, Σ_k) ]

We have recovered the parameter estimation algorithm for GMMs that we previously discussed!

Page 37: 18-661 Introduction to Machine Learning

Example: Estimating height distributions

Suppose the heights of men and women follow two normal distributions with different parameters:

Men: N(µ_1, σ_1²)     Women: N(µ_2, σ_2²)

Our data x_n, n = 1, 2, ..., 5 is the heights of five people: 179, 165, 175, 185, 158 (in cm).

We are missing the labels z_n = each person’s gender. Let π equal the fraction of males in the population.

How do we estimate µ_1, σ_1, µ_2, σ_2, π?

Example taken from: http://web1.sph.emory.edu/users/hwu30/teaching/statcomp/Notes/Lecture3_EM.pdf

Page 38: 18-661 Introduction to Machine Learning

Example: E-Step

Initialize µ_1^0 = 175, µ_2^0 = 165, σ_1^0 = σ_2^0 = 10, π^0 = 0.6.

E-Step: Estimate the probability of each person being male or female.

p(z_n = male | µ_1, σ_1, µ_2, σ_2, π) = 0.6 exp[ −(x_n − 175)²/200 ] / ( 0.6 exp[ −(x_n − 175)²/200 ] + 0.4 exp[ −(x_n − 165)²/200 ] )

p(z_n = female | µ_1, σ_1, µ_2, σ_2, π) = 1 − p(z_n = male | µ_1, σ_1, µ_2, σ_2, π)

Person        1     2     3     4     5
x_n: height   179   165   175   185   158
Prob. male    0.79  0.48  0.71  0.87  0.31

We can use these probabilities to find Q(θ^1 | θ^0).
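
A short sketch reproducing this E-step table (same heights and initial parameters as above; the 1/σ factors of the two Gaussian densities cancel here only because σ_1^0 = σ_2^0):

```python
# Sketch of this E-step with the slide's heights and initial parameters.
import numpy as np

heights = np.array([179.0, 165.0, 175.0, 185.0, 158.0])
mu1, mu2, sigma, pi = 175.0, 165.0, 10.0, 0.6

male = pi * np.exp(-(heights - mu1) ** 2 / (2 * sigma ** 2))
female = (1 - pi) * np.exp(-(heights - mu2) ** 2 / (2 * sigma ** 2))
p_male = male / (male + female)

print(np.round(p_male, 2))   # approximately [0.79, 0.48, 0.71, 0.87, 0.31]
```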

Page 39: 18-661 Introduction to Machine Learning

Example: M-Step

M-Step: Find µ_1^1, µ_2^1, σ_1^1, σ_2^1, π^1 that maximize Q(θ^1 | θ^0):

µ_1^1 = ∑_{n=1}^5 p(z_n = male | θ^0) x_n / ∑_{n=1}^5 p(z_n = male | θ^0),
µ_2^1 = ∑_{n=1}^5 p(z_n = female | θ^0) x_n / ∑_{n=1}^5 p(z_n = female | θ^0),

(σ_1^1)² = ∑_{n=1}^5 p(z_n = male | θ^0) (x_n − µ_1^1)² / ∑_{n=1}^5 p(z_n = male | θ^0),

(σ_2^1)² = ∑_{n=1}^5 p(z_n = female | θ^0) (x_n − µ_2^1)² / ∑_{n=1}^5 p(z_n = female | θ^0),

π^1 = (1/5) ∑_{n=1}^5 p(z_n = male | θ^0)

Here we are using the MLE solution that we derived earlier for GMMs.

Numerically, µ_1^1 = 176, µ_2^1 = 167, σ_1^1 = 8.7, σ_2^1 = 9.2, π^1 = 0.63.
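
Repeating the two steps can be sketched as below; this loop is not from the lecture. Note that once σ_1 ≠ σ_2, the E-step must keep the 1/σ factor of each Gaussian density (it cancelled above only because the initial standard deviations were equal).

```python
# Sketch of repeated E- and M-steps for the height example.
import numpy as np

heights = np.array([179.0, 165.0, 175.0, 185.0, 158.0])
mu1, mu2, s1, s2, pi = 175.0, 165.0, 10.0, 10.0, 0.6

for _ in range(15):
    # E-step: posterior probability that each person is male
    male = pi * np.exp(-(heights - mu1) ** 2 / (2 * s1 ** 2)) / s1
    female = (1 - pi) * np.exp(-(heights - mu2) ** 2 / (2 * s2 ** 2)) / s2
    r = male / (male + female)
    # M-step: weighted means, standard deviations, and mixing fraction
    mu1 = np.sum(r * heights) / np.sum(r)
    mu2 = np.sum((1 - r) * heights) / np.sum(1 - r)
    s1 = np.sqrt(np.sum(r * (heights - mu1) ** 2) / np.sum(r))
    s2 = np.sqrt(np.sum((1 - r) * (heights - mu2) ** 2) / np.sum(1 - r))
    pi = np.mean(r)

print(mu1, mu2, s1, s2, pi)   # should be roughly 179.6, 161.5, 4.1, 3.5, 0.6
```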

Page 40: 18-661 Introduction to Machine Learning

Example: After 15 iterations...

After iteration 1:

Person        1     2     3     4     5
x_n: height   179   165   175   185   158
Prob. male    0.79  0.48  0.71  0.87  0.31

Parameter estimates: µ_1^1 = 176, µ_2^1 = 167, σ_1^1 = 8.7, σ_2^1 = 9.2, π^1 = 0.63.

After iteration 15:

Person        1         2          3       4   5
x_n: height   179       165        175     185 158
Prob. male    0.999997  0.0004009  0.9991  1   2.44e-06

Final estimates: µ_1 = 179.6, µ_2 = 161.5, σ_1 = 4.1, σ_2 = 3.5, π = 0.6.

Page 41: 18-661 Introduction to Machine Learning

Another example: Multinomial distributions

Suppose we are trying to model the number of people who will vote for one of four candidates. We know that the probability of a single person voting for each candidate is:

(p_1, p_2, p_3, p_4) = ( 1/2 + θ/4,  (1 − θ)/4,  (1 − θ)/4,  θ/4 ).

Let Y = (y_1, y_2, y_3, y_4) denote our observation of the number of votes for each candidate. How do we estimate θ?

• Option 1: MLE. But it is hard to optimize the log-likelihood (written here up to an additive constant)...

log p(Y | θ) = y_1 log( 1/2 + θ/4 ) + (y_2 + y_3) log(1 − θ) + y_4 log θ

• Option 2: EM. We will introduce two unobserved/latent variables x_0 and x_1, where x_0 + x_1 = y_1. In other words, y_1 combines the counts of two categories, whose individual counts are given by x_0 and x_1. We have a probability 1/2 of picking the x_0 category, and a probability θ/4 of picking the x_1 category.

Page 42: 18-661 Introduction to Machine Learning

Applying EM to multinomial distributions

Complete log-likelihood function (up to additive constants):

ℓ_c(θ) = log p(X, Y | θ) = (x_1 + y_4) log θ + (y_2 + y_3) log(1 − θ)

E-Step: Find Q(θ | θ^t) = E_{q_t}[ℓ_c(θ)], where q_t is the posterior distribution of x_1 given y_1, y_2, y_3, y_4, θ^t. To do this, we need to find E_{q_t}[x_1]:

x_1^{t+1} = E_{q_t}[x_1] = y_1 (θ^t/4) / ( 1/2 + θ^t/4 ).

M-Step: Find θ that maximizes

Q(θ | θ^t) = ( y_1 (θ^t/4) / (1/2 + θ^t/4) + y_4 ) log θ + (y_2 + y_3) log(1 − θ).

θ^{t+1} = (x_1^{t+1} + y_4) / (x_1^{t+1} + y_4 + y_2 + y_3)

Page 43: 18-661 Introduction to Machine Learning

What does this look like in practice?

Observe Y = (125, 18, 20, 34) and initialize θ^0 = 0.5.
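
A minimal sketch of running this iteration on the observed counts (the choice of 10 iterations is arbitrary):

```python
# Sketch of the EM iteration above for the observed counts Y = (125, 18, 20, 34).
y1, y2, y3, y4 = 125, 18, 20, 34
theta = 0.5                                      # theta^0

for t in range(10):
    x1 = y1 * (theta / 4) / (0.5 + theta / 4)    # E-step: expected count for the x_1 category
    theta = (x1 + y4) / (x1 + y4 + y2 + y3)      # M-step: closed-form maximizer of Q
    print(t + 1, round(theta, 4))                # theta settles around 0.627 within a few iterations
```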

Page 44: 18-661 Introduction to Machine Learning

Applications of EM

EM is a general method to deal with hidden data; we have studied it in the context of hidden labels (unsupervised learning). Common applications include:

• Filling in missing data in a sample
• Discovering the values of latent model variables
• Estimating the parameters of finite mixture models
• Serving as an alternative to direct maximum likelihood estimation

Page 45: 18-661 Introduction to Machine Learning

You should know . . .

• EM is a general procedure for maximizing a likelihood with latent (unobserved) variables
• The two steps of EM:
  • (1) Estimating the unobserved data from the observed data and current parameters
  • (2) Using this “complete” data to find the maximum likelihood parameter estimates
• Pros: Guaranteed to converge, no parameters to tune (e.g., compared to gradient methods)
• Cons: Can get stuck in local optima, can be expensive